Post-Placement Power Optimization with Multi-Bit Flip-Flops
Yao-Tsung Chang, Chih-Cheng Hsu, Mark Po-Hung Lin, Yu-Wen Tsai, and Sheng-Fong Chen
Department of Electrical Engineering, National Chung Cheng University, Chiayi 621, Taiwan
Faraday Technology Corporation, Hsinchu 300, Taiwan
I. INTRODUCTION
With limited power/thermal budgets for modern systems on chips (SOCs), which integrate an increasing number of transistors, power minimization has become one of the most important objectives in designing SOCs for various applications. High power dissipation of an SOC will not only increase its system costs but also affect the product lifetime and reliability. To optimize power consumption in electrical and physical design, many design methodologies have been introduced, such as creating multi-supply-voltage (MSV) designs [7], replacing non-timing-critical cells with their high-$V_t$ counterparts [7], [8], minimizing clock networks [3], [4], [9], and applying multi-bit registers [4], [6]. Among these methodologies, applying multi-bit flip-flops, or multi-bit registers [6], or register banks [4], is one of the most effective methodologies in saving both chip area and power consumption.

Figure 1 shows an example of merging two 1-bit flip-flops into one 2-bit flip-flop. Each flip-flop contains two inverters to generate opposite-phase clock signals. As the process technology advances to 65 nm and beyond, even a minimum-sized inverter can still drive multiple flip-flops. Replacing several 1-bit flip-flops with one multi-bit flip-flop (MBFF) will significantly reduce the number of inverters. Consequently, the total power and area of all flip-flops in a design are reduced. Table I further shows the comparisons of the normalized power consumption and areas of flip-flops with different bit numbers. In addition to the benefit from the reduced number of inverters, applying MBFFs would also have the benefits in power saving from both reductions of clock networks [4] and clock-gating

Fig. 1. An example of merging two 1-bit flip-flops into one 2-bit flip-flop.

TABLE I
COMPARISONS OF THE NORMALIZED POWER CONSUMPTION AND AREAS OF FLIP-FLOPS WITH DIFFERENT BIT NUMBERS.

Bit Number   Normalized Power Consumption per bit   Normalized Area per bit
1            1.00                                   1.00
2            0.86                                   0.96
4            0.78                                   0.71

Only a few previous works [4], [6] in the literature have considered power optimization using MBFFs. Kretchmer [6] introduced a design methodology to create the models of multi-bit registers in a cell library which can be inferred by existing logic synthesis tools. Based on the multi-bit register inference, it is possible to map an RTL design directly to a gate-level design with multi-bit register cells. Hou et al. [4] presented a power-aware placement flow which integrates register banking during incremental placement and placement-based logic optimization, resulting in low-power clock trees. Although it is desirable to apply MBFFs in both logic synthesis and early physical synthesis, it is difficult to carry out the trade-offs among power, timing, area, and other design objectives at such earlier design stages based on the weighting ratios [3] among different objectives.

Different from the previous works that applied MBFFs at earlier design stages, in this paper, we address the problem
the Post-Placement Power Optimization Problem is to minimize total power consumption of all flip-flops by replacing existing flip-flop cells in the design with MBFF cells from the cell library while satisfying the placement density and timing slack constraints. In addition, the newly generated MBFFs should not overlap any other cell in the design.

The total power consumption of all flip-flops, $P_F$, can be calculated by summing up the power consumption of each flip-flop, $P_{f_i}$, in the design, as seen in Equation (1).

$P_F = \sum_{i} P_{f_i}$. (1)

A. Placement Density Constraint

In order to avoid routing congestion, when merging two or more flip-flops into one MBFF, the placement density constraint should be considered to place the MBFF because an MBFF occupies a larger area compared with any of the merged flip-flops. To consider the placement density constraint, a chip

To avoid the timing violation, it is essential to consider the timing slack constraint, which is defined in Equation (3), during the power optimization with MBFFs. In Equation (3), $T_s(p_j, f_{p_j})$ denotes the timing slack between a pin, $p_j$, and its connected flip-flop, $f_{p_j}$, which should always be larger than or equal to zero. The value of the timing slack can be calculated by Equation (4), where $T_{d,max}(p_j, f_{p_j})$ and $T_w(p_j, f_{p_j})$ denote the maximum allowable delay and interconnect delay between $p_j$ and $f_{p_j}$, respectively.

$T_s(p_j, f_{p_j}) \geq 0, \forall p_j$. (3)

$T_s(p_j, f_{p_j}) = T_{d,max}(p_j, f_{p_j}) - T_w(p_j, f_{p_j})$. (4)

It should be noted that the design inputs should also meet all the aforementioned constraints before performing the post-placement power optimization with MBFFs.
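As a small illustration of Equations (1), (3), and (4), the following Python sketch sums per-cell power and checks the slack condition. The `PinConnection` container and the function names are hypothetical and introduced only for this example; the power figures are taken from the "Power" column of Table III.

```python
from dataclasses import dataclass


@dataclass
class PinConnection:
    """One pin-to-flip-flop connection (hypothetical container, not from the paper)."""
    max_delay: float   # T_d,max(p_j, f_pj): maximum allowable delay
    wire_delay: float  # T_w(p_j, f_pj): interconnect delay


def total_flip_flop_power(powers):
    """Equation (1): P_F is the sum of the per-flip-flop powers P_fi."""
    return sum(powers)


def timing_slack(conn):
    """Equation (4): T_s = T_d,max - T_w."""
    return conn.max_delay - conn.wire_delay


def meets_timing(connections):
    """Equation (3): every connection must have non-negative slack."""
    return all(timing_slack(c) >= 0 for c in connections)


# Four 1-bit cells versus one 4-bit cell, using the Table III power figures.
before = total_flip_flop_power([100, 100, 100, 100])
after = total_flip_flop_power([312])
```

With these figures, merging four 1-bit flip-flops into one 4-bit cell lowers the summed power from 400 to 312, which agrees with the 0.78 per-bit ratio in Table I.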
III. THE PROPOSED ALGORITHMS

Based on the problem formulation described in Section II, we propose our algorithms to further reduce total power consumption by replacing the placed flip-flops with as many MBFFs as possible at the post-placement stage. The flow of our algorithms is illustrated in Algorithm 1. First of all, the set of MBFF cells, $F_L$, in the cell library is sorted in ascending order with respect to the power consumption per bit of the MBFF cells, which can be calculated by the power consumption of an MBFF cell divided by its bit number. Once the MBFF cells in the cell library are sorted, the algorithms start to merge the flip-flops in the design with the most power-efficient MBFF cell.

Algorithm 1 Post-Placement Power Optimization with MBFFs
1: Sort $F_L$ in ascending order with respect to $P_{f_L^m} / m$;
2: $F' \leftarrow F$;
3: for each $f_L^m \in F_L$ do
4:   Find a set of $m$-bit flip-flop groups, $G^m$, in $F'$;
5:   Determine the position of each $g_j^m \in G^m$;
6:   for all $g_j^m \in G^m$ do
7:     if the position of $g_j^m$ is legal then
8:       Create an MBFF with $f_L^m$ to merge all $f_i \in g_j^m$;
9:       Place the MBFF cell at the position of $g_j^m$;
10:      $F' \leftarrow F' - g_j^m$;
11:    end if
12:  end for
13: end for

There are three major steps in the flow of merging the flip-flops in the design with $m$-bit flip-flop cells, and all these steps are performed together with the progressive window-based optimization which is introduced in Section III-A. The first step is to find a set of $m$-bit flip-flop groups in the design.

A. Progressive Window-based Optimization

As modern SOCs usually contain hundreds of thousands of flip-flops, it is inefficient to handle such a large flattened design during post-placement power optimization. The progressive window-based optimization is proposed to remedy this deficiency. Figure 3 shows the relationship among windows, bins, and the chip. The size of a window is a multiple of bins in two dimensions. Figure 3(a) shows a window size of 2 x 2 bins, while Figure 3(b) shows another window size of 4 x 4 bins. During the window-based optimization, only the flip-flops in the same window are considered to be optimized with MBFFs. To prevent the algorithms from searching only among suboptimal solutions, two major techniques, window sliding and progressive window-size expansion, are applied when performing the window-based optimization. For the window sliding technique, a window is always moved by half of its size along the X or Y direction every iteration, as shown in Figure 3(a) and (b), such that the algorithms can find more possible solutions at the window boundaries. For the technique of progressive window-size expansion, the optimization process starts with the smallest window size of 2 x 2 bins as seen in Figure 3(a). After a window of a specific size has slid through the whole chip area, the window size is enlarged such that the algorithms can find more global solutions.
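The window sliding and progressive window-size expansion described in Section III-A can be sketched as a generator of window positions. This is only a sketch under stated assumptions: the function names are hypothetical, the window size is assumed to double after each full sweep (the text only gives 2 x 2 and 4 x 4 as examples), and a final window is clamped to the chip edge so that no bin is skipped.

```python
def sweep_starts(length_bins, window_bins):
    """Start indices of a window sliding by half its size along one axis."""
    step = max(1, window_bins // 2)           # half-window stride
    last = max(length_bins - window_bins, 0)  # last start that still fits
    starts = list(range(0, last + 1, step))
    if starts[-1] != last:
        starts.append(last)                   # clamp a final window to the chip edge
    return starts


def progressive_windows(chip_bins_x, chip_bins_y, start_bins=2):
    """Yield (window_bins, bx, by) for every window position.

    A full sweep is made with one window size before the size is
    enlarged (doubling is an assumption of this sketch).
    """
    size = start_bins
    limit = max(chip_bins_x, chip_bins_y)
    while True:
        for by in sweep_starts(chip_bins_y, size):
            for bx in sweep_starts(chip_bins_x, size):
                yield size, bx, by
        if size >= limit:
            break
        size *= 2


positions = list(progressive_windows(4, 4))
```

On a chip of 4 x 4 bins this produces nine overlapping 2 x 2 windows followed by a single 4 x 4 window. The overlap between neighbouring windows is what lets flip-flops near a window boundary be grouped together in a later iteration.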
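The overall flow of Algorithm 1 can be illustrated with a short Python sketch. It is not the authors' implementation: group finding, position selection, and the legality check are passed in as callables (`find_groups`, `choose_position`, `is_legal`), all of which are hypothetical placeholders, and the toy `chunk_groups` helper merely groups consecutive identifiers so that the example runs.

```python
def post_placement_optimize(flip_flops, library, find_groups, choose_position, is_legal):
    """Greedy flow of Algorithm 1; library holds (bits, power) MBFF cells."""
    # Line 1: sort library cells by power consumption per bit.
    cells = sorted(library, key=lambda cell: cell[1] / cell[0])
    remaining = set(flip_flops)                       # Line 2: F' <- F
    merged = []
    for bits, _power in cells:                        # Line 3
        # Lines 4 and 6: snapshot the groups before 'remaining' is modified.
        for group in list(find_groups(remaining, bits)):
            group = frozenset(group)
            if not group <= remaining:
                continue                              # a member was merged already
            position = choose_position(group)         # Line 5
            if position is not None and is_legal(position, bits):  # Line 7
                merged.append((bits, position, group))  # Lines 8-9
                remaining -= group                    # Line 10
    return merged, remaining


def chunk_groups(remaining, bits):
    """Toy group finder (assumption): consecutive ids form one group."""
    ordered = sorted(remaining)
    return [ordered[i:i + bits] for i in range(0, len(ordered) - bits + 1, bits)]


# Seven 1-bit flip-flops, a library with the 2-bit and 4-bit cells of Table III.
merged, leftover = post_placement_optimize(
    flip_flops=range(7),
    library=[(2, 172), (4, 312)],
    find_groups=chunk_groups,
    choose_position=lambda group: min(group),  # placeholder position
    is_legal=lambda position, bits: True,      # placeholder legality check
)
```

Because the 4-bit cell has the lower power per bit (78 versus 86), it is tried first; the toy run merges four flip-flops into a 4-bit cell, two into a 2-bit cell, and leaves one flip-flop unmerged.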
Equation (4).

$T_{d,max}(p_j, f_{p_j}) = T_s(p_j, f_{p_j}) + T_w(p_j, f_{p_j})$. (5)

Based on a wire delay model, such as the Elmore delay model, the input timing slack, $T_s(p_j, f_{p_j})$, can be transformed into a slack distance, $d_{slack}(p_j, f_{p_j})$, between $p_j$ and $f_{p_j}$. The maximum allowable distance, $d_{max}(p_j, f_{p_j})$, between $p_j$ and $f_{p_j}$ is then derived by Equation (6), where $d_{HPWL}(p_j, f_{p_j})$ is the half-perimeter wirelength between $p_j$ and $f_{p_j}$. Consequently, every flip-flop should be placed in the timing-slack-free region which is defined in Definition 1.

$d_{max}(p_j, f_{p_j}) = d_{slack}(p_j, f_{p_j}) + d_{HPWL}(p_j, f_{p_j})$. (6)

Definition 1: A timing-slack-free region (TSFR) of a flip-flop is a region where the flip-flop is placed within the maximum allowable distances from its connected pins such that the timing slack constraints are satisfied.

Figure 4(a) illustrates the TSFR of $f_2$, which is a tilted rectangular region [2] formed by the intersection of the Manhattan rings [2], [9] of $p_1$ and $p_2$. Every point on the Manhattan ring of $p_1$ ($p_2$) has the same Manhattan distance from $p_1$ ($p_2$), which is equal to $d_{max}(p_1, f_2)$ ($d_{max}(p_2, f_2)$). Figure 4(b) further shows all

Definition 2: A timing-slack-free group (TSFG) is a flip-flop group containing a set of flip-flops satisfying both Theorem 1 and Corollary 1.

2) Exploration of $m$-Bit Flip-Flop Groups: Before exploring $m$-bit TSFGs of a design, the TSFR intersection graph should be constructed, which is defined in Definition 3. Figure 5(a) shows the TSFR intersection graph representing the relationship of the TSFRs in Figure 4(b). There is no edge between two nodes in the TSFR intersection graph if and only if there is no intersection between the TSFRs of the corresponding flip-flops in the design.

Definition 3: A TSFR intersection graph is a graph, $G(V, E)$, where each vertex, $n_i$, corresponds to a flip-flop, $f_i$, in the design, and an edge, $e_{ij}$, between $n_i$ and $n_j$ exists if there is an intersection between the TSFRs of $f_i$ and $f_j$.

Fig. 5. (a) The TSFR intersection graph representing the relationship among the TSFRs in Figure 4(b). (b) The branch-and-bound and backtracking algorithms [1] which find all 4-vertex cliques in (a).
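The TSFR of Definition 1 and the intersection graph of Definition 3 can be sketched in Python as follows. The sketch relies on a standard coordinate rotation that is an assumption of this example rather than something stated in the text: with $u = x + y$ and $v = x - y$, the Manhattan region of radius $d$ around a pin becomes an axis-aligned square, so a TSFR is an axis-aligned rectangle in $(u, v)$. All function names are hypothetical, and regions that merely touch are treated as intersecting.

```python
from itertools import combinations


def max_distance(slack_distance, hpwl):
    """Equation (6): d_max = d_slack + d_HPWL."""
    return slack_distance + hpwl


def tsfr(pins):
    """TSFR of one flip-flop from its pins, given as (x, y, d_max) tuples.

    Returns (u_lo, u_hi, v_lo, v_hi) in rotated coordinates, or None
    when the Manhattan regions of the pins have no common point.
    """
    u_lo = max(x + y - d for x, y, d in pins)
    u_hi = min(x + y + d for x, y, d in pins)
    v_lo = max(x - y - d for x, y, d in pins)
    v_hi = min(x - y + d for x, y, d in pins)
    if u_lo > u_hi or v_lo > v_hi:
        return None
    return (u_lo, u_hi, v_lo, v_hi)


def regions_intersect(a, b):
    """True if two rotated rectangles share at least one point."""
    return max(a[0], b[0]) <= min(a[1], b[1]) and max(a[2], b[2]) <= min(a[3], b[3])


def intersection_graph(regions):
    """Definition 3: adjacency sets keyed by flip-flop id."""
    adjacency = {f: set() for f in regions}
    for f, g in combinations(regions, 2):
        if regions_intersect(regions[f], regions[g]):
            adjacency[f].add(g)
            adjacency[g].add(f)
    return adjacency


regions = {
    "f1": tsfr([(0, 0, 2)]),
    "f2": tsfr([(3, 0, 2)]),
    "f3": tsfr([(10, 10, 1)]),
}
graph = intersection_graph(regions)
```

In this toy input, the TSFRs of `f1` and `f2` overlap (for instance at the point (1.5, 0)), while `f3` is isolated, so the graph has a single edge.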
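Finding all $m$-vertex cliques in a TSFR intersection graph, as depicted in Fig. 5(b), can be illustrated with a simple backtracking search that prunes branches which can no longer reach $m$ vertices. This is a simplified sketch, not the Bron-Kerbosch algorithm of [1] and not necessarily the procedure used in the paper; the function name and the example graph are hypothetical.

```python
def m_cliques(adjacency, m):
    """Return every m-vertex clique of the graph as a tuple of vertices.

    adjacency maps each vertex to the set of its neighbours.
    """
    found = []

    def extend(clique, candidates):
        if len(clique) == m:
            found.append(tuple(clique))
            return
        # Bound: too few candidates remain to complete an m-vertex clique.
        if len(clique) + len(candidates) < m:
            return
        for i, v in enumerate(candidates):
            # Keep only later candidates adjacent to v, so each clique
            # is generated once, in sorted vertex order.
            later = [w for w in candidates[i + 1:] if w in adjacency[v]]
            extend(clique + [v], later)

    extend([], sorted(adjacency))
    return found


# Example graph: a, b, c, d are mutually adjacent; e touches only a and b.
example = {
    "a": {"b", "c", "d", "e"},
    "b": {"a", "c", "d", "e"},
    "c": {"a", "b", "d"},
    "d": {"a", "b", "c"},
    "e": {"a", "b"},
}
four_cliques = m_cliques(example, 4)
three_cliques = m_cliques(example, 3)
```

Each clique returned corresponds to a set of flip-flops whose TSFRs intersect pairwise, which makes it a candidate $m$-bit group.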
as seen in Algorithm 2 to generate the IS of TSFGs from $G^m$ with the consideration of the placement area, $A_{g_i^m}$, of the MBFF corresponding to a TSFG, $g_i^m$, and the interconnecting wirelength, $W_{g_i^m}$, of $g_i^m$. $A_{g_i^m}$ can be calculated by the intersection area of the corresponding TSFRs. $W_{g_i^m}$ is estimated by the HPWL which is bounded by the locations of the pins connected to the flip-flops in $g_i^m$. In Algorithm 2, a $g_i^m$ with
TABLE IV
COMPARISONS OF # OF FLIP-FLOPS WITH 1, 2, AND 4 BITS, POWER RATIO, HPWL RATIO, AND RUNTIME FOR THREE DIFFERENT APPROACHES: (1) THE PROPOSED APPROACH WITHOUT APPLYING THE PROGRESSIVE WINDOW-BASED OPTIMIZATION, (2) THE PROPOSED APPROACH BASED ON THE PROGRESSIVE WINDOW-BASED OPTIMIZATION WITH THE CONSIDERATION OF PLACEMENT DENSITY ONLY, AND (3) THE PROPOSED APPROACH BASED ON THE PROGRESSIVE WINDOW-BASED OPTIMIZATION WITH THE CONSIDERATIONS OF BOTH PLACEMENT DENSITY AND INTERCONNECTING WIRELENGTH.

TABLE III
AREAS AND POWER CONSUMPTION OF THE FLIP-FLOP CELLS IN THE CELL LIBRARY.

Bit # of Flip-Flop   Power   Area
1                    100     172
2                    172     192
4                    312     285

We implemented our algorithms in the C++ programming language with STL on a 2.66 GHz Intel i7 PC under the Linux operating system. We empirically tested our approach on six industrial circuits with the numbers of 1-bit flip-flops ranging from 76 to 146400 and the numbers of 2-bit flip-flops ranging from 22 to 22800. There is no 4-bit flip-flop in the benchmark circuits. The placements of all flip-flops in each circuit have also been optimized. Table II lists the names of the benchmark circuits (“Circuit”), the numbers of 1-bit flip-flops (“# of 1-bit FFs”), the numbers of 2-bit flip-flops (“# of 2-bit FFs”), and the numbers of 4-bit flip-flops (“# of 4-bit FFs”). A cell library containing 1-bit, 2-bit, and 4-bit flip-flops is also provided with the specifications of their power consumption and areas. Table III lists the bit number of each flip-flop (“Bit # of Flip-Flop”) and the corresponding power consumption (“Power”) and area (“Area”).

We compared the numbers of flip-flops with 1, 2, and 4 bits, the power reduction, HPWL ratio, and runtime for three different approaches: (1) the proposed approach without applying the progressive window-based optimization, (2) the proposed approach based on the progressive window-based optimization with the consideration of placement density only, and (3) the proposed approach based on the progressive window-based optimization with the considerations of both placement density and interconnecting wirelength. Table IV lists the names of the benchmark circuits (“Circuit”), the numbers of flip-flops with 1, 2, and 4 bits (“# of FFs (1, 2, 4 bits)”), the power reduction (“Power Red.”), the HPWL ratio between the resulting and input circuits (“HPWL Ratio”), and the runtimes (“Time”) for the three approaches. The results show that Approaches (2) and (3) outperform Approach (1) by at least 37222X, which is a significant improvement based on the progressive window-based optimization. Even for the largest circuit containing hundreds of thousands of flip-flops, the runtime based on Approach (3) is only 79 seconds. Although Approach (2) is 7% faster in runtime, its HPWL ratio is 21% worse than that of Approach (3). Therefore, the proposed approach based on the progressive window-based optimization with the considerations of both placement density and interconnecting wirelength is very effective and efficient, and is capable of incrementally merging existing MBFFs in the design to gain more power saving.

V. CONCLUSIONS

In this paper, we have introduced a new problem formulation of post-placement power optimization with multi-bit flip-flops. We have also proposed our algorithms to solve the addressed problem based on the progressive window-based optimization with the considerations of both placement density and interconnecting wirelength. Experimental results based on the industry benchmark circuits have shown that our approach is very effective and efficient, and is capable of incrementally merging existing MBFFs in the design to gain more power saving.

REFERENCES

[1] C. Bron and J. Kerbosch, “Algorithm 457 - Finding all cliques of an undirected graph,” ACM Comm., vol. 16, no. 9, pp. 575–577, September 1973.
[2] T.-H. Chao, Y.-C. Hsu, J.-M. Ho, K. D. Boese, and A. B. Kahng, “Zero skew clock routing with minimum wirelength,” IEEE TCAS-II, vol. 39, no. 11, pp. 799–814, November 1992.
[3] Y. Cheon, P.-H. Ho, A. B. Kahng, S. Reda, and Q. Wang, “Power-aware placement,” Proc. DAC, pp. 795–800, 2005.
[4] W. Hou, D. Liu, and P.-H. Ho, “Automatic register banking for low-power clock trees,” Proc. ISQED, pp. 647–652, 2009.
[5] R. Karp, “Reducibility among combinatorial problems,” Complexity of Computer Computations, Plenum Press, 1972.
[6] Y. Kretchmer, “Using multi-bit register inference to save area and power: the good, the bad, and the ugly,” EE Times Asia, May 2001.
[7] A. Khan, P. Watson, G. Kuo, D. Le, T. Nguyen, S. Yang, P. Bennett, P. Huang, J. Gill, C. Hawkins, J. Goodenough, D. Wang, I. Ahmed, P. Tran, H. Mak, O. Kim, F. Martin, Y. Fan, D. Ge, J. Kung, and V. Shek, “A 90-nm power optimization methodology with application to the ARM 1136JF-S microprocessor,” IEEE JSSC, vol. 41, no. 8, pp. 1707–1717, August 2006.
[8] T. Luo, D. Newmark, and D. Z. Pan, “Total power optimization combining placement, sizing and multi-Vt through slack distribution management,” Proc. ASPDAC, pp. 352–357, 2008.
[9] Y. Lu, C. N. Sze, X. Hong, Q. Zhou, Y. Cai, L. Huang, and J. Hu, “Navigating registers in placement for clock network minimization,” Proc. DAC, pp. 176–181, 2005.