A High-Speed, Energy-Efficient Two-Cycle Multiply-Accumulate (MAC) Architecture and Its Application To A Double-Throughput MAC Unit
I. INTRODUCTION
The multiply-accumulate (MAC) unit is a common digital block used extensively in microprocessors and digital
signal processors for data-intensive applications. For example,
many filters, orthogonal frequency-division multiplexing algorithms, and channel estimators require FIR or FFT/IFFT computations that MAC units can accelerate efficiently.
A basic MAC architecture consists of a multiplier and an accumulate adder organized as in Fig. 1. Inputs are fed to the multiplier, and successive products are summed by the accumulate
adder. Multipliers typically comprise a partial-product
unit (the PP unit) and a carry-propagate adder (the final adder).
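As a minimal behavioral sketch of the basic MAC operation described above, the following Python fragment models one FIR output sample as a chain of multiply-accumulate steps. The function names `mac` and `fir_output` are hypothetical and chosen for illustration only; the sketch ignores word lengths, pipelining, and the internal split into PP unit and final adder.

```python
def mac(acc, a, b):
    """One multiply-accumulate step: the multiplier forms a*b and the
    accumulate adder adds the product to the running sum."""
    return acc + a * b


def fir_output(coeffs, samples):
    """Compute one FIR output sample as a sequence of MAC operations.

    Illustrative only: unbounded Python integers stand in for the
    fixed-width datapath of a hardware MAC unit.
    """
    acc = 0
    for c, x in zip(coeffs, samples):
        acc = mac(acc, c, x)
    return acc


result = fir_output([1, 2, 3], [4, 5, 6])
```

Each loop iteration corresponds to one multiply-accumulate operation, which is why a dedicated MAC unit maps well onto FIR-type computations.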
Manuscript received April 02, 2010; revised August 04, 2010; accepted October 12, 2010. Date of current version December 15, 2010. This work was supported in part by VR, the Swedish Research Council, under Contract 2006-2927
and by the European Commission Framework Programme 7, Embedded Reconfigurable Architectures under Grant 249059.
The authors are with the Department of Computer Science and Engineering,
Chalmers University of Technology, SE-412 96 Gothenburg, Sweden (e-mail:
([email protected]; [email protected]; [email protected]).
Color versions of one or more of the figures in this paper are available online
at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCSI.2010.2091191
Fig. 1. Block diagram of a general MAC architecture. Here, the register between the PP unit and the final adder is removed/included to obtain a two/three-cycle MAC architecture.
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS-I: REGULAR PAPERS, VOL. 57, NO. 12, DECEMBER 2010
be implemented using high-speed compressors [3] or speed-optimized structures [4]. Mathew et al. propose a sparse-tree carry
look-ahead adder for fast addition of the PP unit outputs [5], and
Liu et al. introduce a hybrid adder [6] to reduce delay compared
to a design that assumes equal arrival time on all adder inputs.
Performing two different carry propagations in the same
MAC circuit is wasteful, since carry propagation is time consuming. Feeding the multiplier output back to the input of the
PP unit reduction tree obviates the need for a conventional
accumulate adder [7]–[9]. Accumulation is thus handled by the
final adder of the multiplier, and only one carry-propagating
stage is required. The problem is that this optimization only
applies to one-cycle MACs, where the long critical delay is a
limiting factor in most applications. If a pipeline register were
to be inserted, the MAC output would no longer produce the
correct result each cycle. In fact, to get the final result, we
would have to add an extra, empty cycle after the final multiply-accumulate cycle of a loop. Furthermore, it is not obvious
how guard bits can be accommodated in these designs. Guard
bits are important for avoiding overflow when computing long
sequences of multiply-accumulate operations. Ercegovac and
Lang present a MAC architecture in which the multiplier's
final adder is replaced by a stage of 4:2 compressors and guard
bits are handled by an incrementer circuit [10]. However, this
architecture only supports sign-magnitude numbers.
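To make the role of guard bits concrete, the sketch below estimates how many worst-case products can be accumulated before overflow. It assumes a two's-complement accumulator that is 2N + Ng bits wide and N-bit two's-complement operands; that accumulator width and the function name `max_safe_accumulations` are assumptions made for illustration, not details taken from the architecture.

```python
def max_safe_accumulations(n_bits, guard_bits):
    """Number of worst-case products that fit in the accumulator.

    Assumptions (illustrative): operands are n_bits-wide two's
    complement, and the accumulator is a (2*n_bits + guard_bits)-bit
    two's-complement register.
    """
    # Largest product magnitude: (-2^(N-1)) * (-2^(N-1)) = 2^(2N-2).
    largest_product = 2 ** (2 * n_bits - 2)
    # Largest positive value the accumulator can hold.
    acc_max = 2 ** (2 * n_bits + guard_bits - 1) - 1
    return acc_max // largest_product


without_guard = max_safe_accumulations(16, 0)
with_guard = max_safe_accumulations(16, 8)
```

Under these assumptions, every additional guard bit roughly doubles the number of worst-case accumulations that can be performed safely.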
In general, two-cycle MAC architectures have a first (multiplication) stage that is significantly slower than the second (accumulation) stage. We propose a new two-cycle MAC architecture in which the second stage is somewhat slower, but the first
stage is significantly faster, leading to a better delay balance between the two stages. The key to the new architecture is the implementation of product sign extension: the sign-extension circuitry is located in the second stage, together with the accumulate adder and the saturation unit. As a result, the feedback of
the product is contained within the second pipeline stage.
The remainder of this paper is organized as follows: Section II
describes our MAC architecture and contrasts it with a basic two-cycle architecture. Next, Section III provides an evaluation with
respect to performance, power, energy, and area. The new MAC
architecture has some features that enable us to design a unit
that efficiently performs multiply-accumulate operations on different data operand sizes. Thus, as an extension to Section II, we
introduce and evaluate the double-throughput MAC (DTMAC)
architecture in Section IV. Finally, we conclude the paper in
Section V.
Fig. 3. A multiply-accumulate operation, assuming the three-cycle MAC architecture of Fig. 1. The multiply-accumulate operation starts
with the generation (assuming the Baugh-Wooley algorithm) and reduction of partial products. The final adder performs carry propagation of the sums and carries
produced by the PP unit. Finally, the accumulate adder adds the pipelined product to the accumulated result, producing the new result.
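IF G[2N+Ng-1:2N] = {Ng'b G[2N-1]} THEN
    # not saturated
    MAC_output[2N-1:0] <= G[2N-1:0]
ELSE IF G[2N+Ng-1] = '1' THEN
    # negative overflow: set to the negated maximum positive 2N-bit value
    MAC_output[2N-1:0] <= {1'b1, (2N-2)'b0, 1'b1}
ELSE
    # set to the maximum positive 2N-bit value
    MAC_output[2N-1:0] <= {1'b0, (2N-1)'b1}
END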
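The saturation rule above can be sketched behaviorally in Python. This is a sketch under stated assumptions rather than the authors' implementation: it assumes that G is a (2N + Ng)-bit two's-complement bit pattern, and that "not saturated" means all Ng guard bits equal the sign bit G[2N-1]. The function name `saturate` and its parameters are hypothetical.

```python
def saturate(g, n_bits, guard_bits):
    """Reduce a (2N+Ng)-bit pattern g to a saturated 2N-bit pattern.

    Assumption: the value is not saturated when every guard bit
    equals the sign bit of the lower 2N bits.
    """
    width = 2 * n_bits
    low_mask = (1 << width) - 1
    guard_mask = (1 << guard_bits) - 1

    guard = (g >> width) & guard_mask        # G[2N+Ng-1:2N]
    sign = (g >> (width - 1)) & 1            # G[2N-1]
    expected_guard = guard_mask if sign else 0

    if guard == expected_guard:
        return g & low_mask                  # not saturated
    if (g >> (width + guard_bits - 1)) & 1:  # G[2N+Ng-1] = '1'
        return (1 << (width - 1)) | 1        # {1'b1, (2N-2)'b0, 1'b1}
    return (1 << (width - 1)) - 1            # {1'b0, (2N-1)'b1}


# Small example with N = 4 (8-bit output) and Ng = 2 (10-bit G).
in_range = saturate(5, 4, 2)
too_large = saturate(200, 4, 2)
too_small = saturate(-200 & 0x3FF, 4, 2)
```

Note that the negative saturation pattern is the negation of the maximum positive value, so the saturated range is symmetric around zero.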
III. EVALUATION
A. Evaluated Architectures
We consider three architectures that share the same structure
for the PP unit, the final adder, and the accumulate adder.
MAC-2C represents the two-cycle MAC whose critical
path goes through the PP unit and the final adder (Fig. 1).
TABLE I
EVALUATION RESULTS OF THREE ARCHITECTURES FOR 16, 32, 48, AND 64 BITS IN OPERAND SIZE
TABLE II
ENERGY PER OPERATION AT IDENTICAL CLOCK RATE AND TIMING CONSTRAINT
toward the centre in order to balance the output delay for the PP
unit while still meeting the timing constraint.
Table II shows the power dissipation of individual units for
the two architectures (see footnote 3) when they are operated at MAC-2C's
maximal clock rate. In MAC-NEW, the removal of the final
adder reduces power. However, since the PP unit only produces
a partial product, there are extra registers between stages
one and two in Fig. 2, which increases the clock power compared to MAC-2C. More importantly, replacing the final adder
with a carry-save adder leads to a 30% reduction of total power,
on average. Furthermore, the removal of the final adder relaxes
the timing constraint on the PP unit to the extent that its power
drops 57% across the four operand sizes. This is significant, as
the PP unit represents a big fraction of total power.
In summary, by downsizing gates in the PP unit, MAC-NEW
dissipates an average of 52% less energy than MAC-2C for the
same operating frequency and supply voltage. Downsizing thus
offers a 29% additional reduction in energy per cycle over the
MAC-NEW implementation in Section III-C that uses the same
gate sizes as MAC-2C. Again, the numbers given here depend
on the enforced timing constraint for the individual stages. For
a more relaxed constraint, the utilization of slack in the faster
architecture would be less effective.
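As a quick consistency check of the figures quoted in this paper, the sketch below combines the 32% average energy reduction reported for MAC-NEW with unchanged gate sizes (the figure given in the conclusion) with the additional 29% obtained by downsizing. It assumes that the two reductions compound multiplicatively, which is an assumption of this sketch; the function name `combined_reduction` is hypothetical, and since the inputs are averages the result is only approximate.

```python
def combined_reduction(*reductions):
    """Total fractional reduction when individual reductions compound.

    Each argument is a fractional saving (0.32 means 32% less energy).
    """
    remaining = 1.0
    for r in reductions:
        remaining *= 1.0 - r
    return 1.0 - remaining


# 32% from the architecture itself, then 29% more from gate downsizing.
total = combined_reduction(0.32, 0.29)
total_percent = round(total * 100)
```

Under this assumption the combined saving rounds to 52%, in line with the average reduction stated above.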
IV. EXTENSION TO A VERSATILE MAC UNIT
Adapting circuits to operate on the actual data precision of
an application can save energy, as demonstrated in microprocessors [18] and dedicated circuits [19]. Many embedded applications are based on a 16-bit dynamic range, while embedded
processors generally have a 32-bit datapath. Thus, potentially a
32-bit datapath could accommodate the execution of two simultaneous 16-bit operations. When the dynamic range of the data
varies significantly across applications, run-time adaptation of
the computational precision of a single circuit would be useful,
rather than using several circuits that each has its own fixed
operand size. Our previous work [13] shows that adding this
kind of run-time adaptation to the multiplier of a general-purpose processor reduces execution time of an FFT application by
15%.
We refer to a MAC unit that can optionally switch between N-bit operations and N/2-bit operations as a
double-throughput MAC unit (DTMAC). A 32-bit instance of
such a MAC unit could be implemented by tying together two
separate, 16-bit MAC units [20]. To support 32-bit operations
the two 16-bit multipliers must be combined into one 32-bit
multiplier, which requires complex routing and is difficult to
implement efficiently. FPGA technology offers reconfigurability that can support double-throughput multiply-accumulate
operations [21], but FPGAs are still inefficient in terms of speed
and power compared to the ASIC solutions we consider here.
A. An Efficient Double-Throughput MAC Unit
A critical feature of any double-throughput MAC unit is that
it should support several operating modes without incurring any
significant timing or power overhead for the default
N-bit mode. Thanks to the architecture introduced in Section II,
and the fact that product sign extension, accumulation, and carry
propagation take place in the second stage of a two-cycle MAC
unit, we can create an efficient DTMAC unit [22]; see Fig. 5.
Footnote 3: N = 32, Ng = 8.
While other schemes, such as Kuang and Wang's scheme
[23], may be used, our two's-complement DTMAC unit employs
the Twin-Precision (TP) technique [24]. A twin-precision partial-product reduction tree generates the TP-PP unit's outputs,
as shown in Fig. 6(a), which in conventional schemes are fed to
a final adder in order to obtain the final product. Instead, here we
insert the proposed carry-save adder that sums the TP-PP unit
outputs and the accumulate adder output according to Fig. 6(b).
The output of the carry-save adder is fed to an accumulate adder
that performs the carry propagation to produce the final result,
as shown in Fig. 6(c).
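The data flow described above, in which a carry-save adder merges the two reduction-tree outputs with the fed-back accumulator value before a single carry propagation, can be sketched with a generic 3:2 compression step. This is a textbook carry-save adder on unsigned bit vectors, not the proposed circuit with its sign-extension logic; the names `carry_save_add`, `accumulate`, `pp_sum`, and `pp_carry` are hypothetical.

```python
def carry_save_add(x, y, z, width):
    """3:2 compression: reduce three vectors to a sum and a carry vector
    without propagating carries between bit positions."""
    mask = (1 << width) - 1
    sum_vec = (x ^ y ^ z) & mask
    carry_vec = (((x & y) | (x & z) | (y & z)) << 1) & mask
    return sum_vec, carry_vec


def accumulate(pp_sum, pp_carry, acc, width):
    """Merge the PP-unit outputs with the accumulator, then perform the
    only carry propagation (modelled here by integer addition)."""
    s, c = carry_save_add(pp_sum, pp_carry, acc, width)
    return (s + c) & ((1 << width) - 1)


new_acc = accumulate(0b1011, 0b0110, 0b0101, 8)
```

The point of the structure is that the carry-save stage has a delay of one full-adder cell regardless of word length, so only one carry-propagating adder remains on the accumulation path.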
As for conventional MACs, the TP-PP unit dominates the critical path delay. The DTMAC unit actually has the same critical delay as that of a basic three-cycle 32-bit MAC architecture, in which a pipeline register is inserted between the PP unit
and the final adder to shorten the critical path of the multiplication. The result is that, despite the operating-mode flexibility,
the DTMAC unit has small area requirements, low power dissipation, and short critical path delay.
B. Components of the DTMAC Unit
1) TP-PP Unit: To support double-throughput operations,
the partial-product generation and reduction are based on the
twin-precision (TP) technique [24]. Here, the partial products
Fig. 6. Structure of the DTMAC unit's components: (a) The TP-PP unit, which is based on the Baugh-Wooley multiplication algorithm. (b) The carry-save adder.
(c) The accumulate adder, which is based on the conditional-sum adder architecture. (Dark shaded and white colors denote computation in the most significant and
least significant N/2 circuit sections, respectively.)
Fig. 7. Normalized values of clock period, energy per cycle and area for MAC16-2C, MAC32-2C, and DTMAC.
C. Operating Modes
The DTMAC unit supports six operating modes, three for
multiply-accumulate operations and three for multiply operations, as determined by the value of the three-bit control signal
(CTRL):
000: Full-Precision 32-bit multiply-accumulate (FP_MAC).
001: Half-Precision 1 × 16-bit multiply-accumulate (HP_MAC).
011: Double-Throughput 2 × 16-bit multiply-accumulate (DT_MAC).
100: Full-Precision 32-bit multiplication (FP_MULT).
101: Half-Precision 1 × 16-bit multiplication (HP_MULT).
111: Double-Throughput 2 × 16-bit multiplication (DT_MULT).
In Figs. 5 and 6, CTRL, CTRL0, and CTRL1 denote the
three-bit control signal, its one-cycle delayed version, and
its two-cycle delayed version, respectively. Moreover, in
CTRL[2:0], CTRL[2] is the leftmost of the three bits and is
used to force the output of the accumulate register to zero
during multiply operations.
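The mode encoding can be summarized in a small lookup sketch. The table of codes and the role of CTRL[2] follow the description above; the function name `decode_ctrl` and the returned tuple layout are hypothetical, and the sketch does not model the delayed CTRL0 and CTRL1 versions of the signal.

```python
# Mode names keyed by the three-bit CTRL value, as listed in the text.
MODES = {
    0b000: "FP_MAC",
    0b001: "HP_MAC",
    0b011: "DT_MAC",
    0b100: "FP_MULT",
    0b101: "HP_MULT",
    0b111: "DT_MULT",
}


def decode_ctrl(ctrl):
    """Return (mode name, clear_accumulator) for a three-bit CTRL value.

    CTRL[2] forces the accumulate register output to zero, which turns
    a multiply-accumulate into a plain multiplication.
    """
    if ctrl not in MODES:
        raise ValueError(f"unsupported CTRL value: {ctrl:03b}")
    clear_accumulator = bool((ctrl >> 2) & 1)
    return MODES[ctrl], clear_accumulator


mode, clear = decode_ctrl(0b111)
```

The two unused codes (010 and 110) are rejected here; how the hardware treats them is not stated in the text.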
TABLE III
EVALUATION OF 32-BIT DTMAC UNIT
D. Evaluation Methodology
We evaluate our design with a VHDL model of a 32-bit
DTMAC unit. The DTMAC implementation is fully verified
in logic simulation. We use MAC16-2C and MAC32-2C
for comparison and the DTMAC implementation is synthesized using the same tool (Synopsys Design Compiler) and
65-nm cell library as in Section III. The implementation is
placed-and-routed using SoC Encounter, and PrimeTime is
used to find the critical path delay. Power dissipation is estimated through a VCD analysis on RC-extracted data from SoC
Encounter using the same test vectors as for MAC32-2C.
E. Evaluation Results
Table III and Fig. 7 present the results of the evaluation.
Thanks to the short critical path delay of the proposed MAC
architecture, the 32-bit DTMAC unit can be used at 10% and
26% higher clock rates than conventional two-cycle 16-bit and
32-bit MAC units, respectively.
In terms of energy per cycle, when the DTMAC unit operates
in 1 × 16-bit MAC mode it dissipates a negligible 0.3% more
than the basic, fixed-function, 16-bit MAC unit. The DTMAC
unit has a 2.8% larger footprint than MAC32-2C due to extra
circuitry to support the multiple operation modes. These comparisons reveal that the implementation of operating-mode flexibility in the DTMAC unit comes at a limited overhead.
The important point is that we can save energy by adjusting
the operating mode to the precision of the data:
When the DTMAC unit operates in the default 32-bit MAC
mode (FP_MAC), its energy dissipation is 8% lower than
MAC32-2C when performing 32-bit computations.
When the DTMAC unit operates in 1 × 16-bit MAC mode
(HP_MAC), the 32-bit DTMAC unit performs 16-bit
multiply-accumulate operations 67% more energy efficiently than MAC32-2C performs computations on 16-bit
operands. This reduction largely stems from avoiding
unnecessary switching caused by the 16-bit sign extension
of two's-complement 32-bit data that carry only 16 bits of
information.
When the DTMAC unit operates in the 2 × 16-bit MAC
mode (DT_MAC), its energy dissipation per 16-bit
multiply-accumulate operation is similar to that of
MAC16-2C. However, the DTMAC unit uses only half the
cycles of MAC16-2C to compute all operations, so the surrounding datapath circuits are engaged for a significantly
shorter time. This leads to significant energy savings for
a system in which the DTMAC unit is integrated. In the
next section, we give a brief account of an evaluation of
this design scenario.
F. Double-Throughput Processor Datapath
In the context of a processor, the execution time reduction offered by the double-throughput modes may help save substantial amounts of energy. In order to evaluate the extent to which
the DTMAC unit can improve the execution of a C application,
we integrated one such 32-bit unit into a 32-bit embedded FlexCore processor [22]. The FlexCore processor has a flexible datapath interconnect [25] that allows for accelerator extensions in
a fairly straightforward manner, and it has a flexible instruction
set that supports run-time reconfiguration.
We use two benchmarks from the EEMBC Telecom suite that
make use of the multiply-accumulate operation in quite different
ways: The auto-correlation benchmark (Autcor) contains many
long sequences of 16-bit operations; the fast Fourier transform
benchmark (FFT) has many short sequences of the same. A
FlexCore datapath with a 32-bit DTMAC unit executes Autcor
4.37× faster than a conventional five-stage datapath that lacks
the DTMAC unit, but in exchange has a 32-bit integer multiplier. The accompanying reduction in energy dissipation was
4.00×, since the DTMAC unit incurs a small power dissipation
overhead. The sequences of multiply-accumulate operations are
quite short for the FFT benchmark, so the computational efficiency of the dedicated MAC accelerator drops. Still, the datapath equipped with the DTMAC unit executes 1.82× faster than
the reference, leading to a 1.64× reduction in energy.
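The gap between speedup and energy reduction can be turned into an estimate of the average power overhead. The sketch below assumes that energy equals average power times execution time, so that the power ratio is the speedup divided by the energy reduction factor; the function name `average_power_ratio` is hypothetical, and the resulting ratios are derived here rather than reported in the evaluation.

```python
def average_power_ratio(speedup, energy_reduction):
    """Average power of the DTMAC datapath relative to the reference.

    With E = P * t:  E_ref / E_new = (P_ref / P_new) * (t_ref / t_new),
    so P_new / P_ref = speedup / energy_reduction.
    """
    return speedup / energy_reduction


autcor_ratio = average_power_ratio(4.37, 4.00)
fft_ratio = average_power_ratio(1.82, 1.64)
```

Under this assumption, the quoted figures correspond to roughly 9% higher average power for Autcor and roughly 11% for FFT.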
V. CONCLUSION
We describe a new high-speed, energy-efficient two's-complement, two-cycle multiply-accumulate (MAC) architecture.
Replacing the final adder of the multiplier by a carry-save adder
with a new sign extension technique makes our two-cycle MAC
architecture faster and more area- and energy-efficient than a
basic two-cycle MAC architecture. Our evaluation for a commercial 65-nm 1.1-V cell library shows that the new architecture
computes 31% faster and reduces energy per operation by 32%,
on average. The timing slack difference allows us to downsize
gates so that our new MAC architecture dissipates half the energy of the reference architecture.
We use the new architecture to develop a versatile MAC
unit that supports several different operating modes: three for
multiply-accumulate operations and three for multiply operations. We show that a 32-bit DTMAC unit can perform 16-bit
multiply-accumulate operations at one third of the energy
of a fixed-function, 32-bit architecture with the same cycle
count. In double-throughput mode, executing two concurrent
16-bit multiply-accumulate operations delivers high energy
efficiency. Deploying our design in a processor datapath can
yield significant speed and energy impact for applications that
compute many 16-bit multiply-accumulate operations.
REFERENCES
[1] O. L. MacSorley, "High-speed arithmetic in binary computers," Proc. Inst. Radio Eng. (IRE), vol. 49, pp. 67–91, Jan. 1961.
[2] W.-C. Yeh and C.-W. Jen, "High-speed Booth encoded parallel multiplier design," IEEE Trans. Comput., vol. 49, no. 7, pp. 692–701, Jul. 2000.
[3] M. R. Santoro and M. A. Horowitz, "SPIM: A pipeline 64 × 64 bit iterative multiplier," IEEE J. Solid-State Circuits, vol. 2, no. 1, pp. 487–493, Apr. 1989.
[4] V. G. Oklobdzija, D. Villeger, and S. S. Liu, "A method for speed optimized partial product reduction and generation of fast parallel multipliers using an algorithmic approach," IEEE Trans. Comput., vol. 45, no. 3, pp. 294–306, Mar. 1996.
[5] S. K. Mathew, M. A. Anders, B. Bloechel, T. Nguyen, R. K. Krishnamurthy, and S. Borkar, "A 4-GHz 300-mW 64-bit integer execution ALU with dual supply voltages in 90-nm CMOS," IEEE J. Solid-State Circuits, vol. 40, no. 1, pp. 44–51, Jan. 2005.
[6] J. Liu, S. Zhou, H. Zhu, and C.-K. Cheng, "An algorithmic approach for generic parallel adders," in Proc. IEEE Int. Conf. Comput. Aided Des. (ICCAD), Dec. 2003, pp. 734–740.
[7] P. F. Stelling and V. G. Oklobdzija, "Implementing multiply-accumulate operation in multiplication time," in Proc. Int. Symp. Comput. Arithmetic (ARITH), Jul. 1997, pp. 99–106.
[8] J. Großschädl and G.-A. Kamendje, "A single-cycle (32 × 32 + 32 + 64)-bit multiply/accumulate unit for digital signal processing and public-key cryptography," in Proc. IEEE Int. Conf. Electron., Circuits, Syst. (ICECS), Dec. 2008, pp. 739–742.
[9] A. Abdelgawad and M. Bayoumi, "High speed and area-efficient multiply accumulate (MAC) unit for digital signal processing applications," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2007, pp. 3199–3202.
[10] M. D. Ercegovac and T. Lang, Digital Arithmetic. San Mateo, CA: Morgan Kaufmann, 2003.
[11] T. T. Hoang, M. Själander, and P. Larsson-Edefors, "High-speed, energy-efficient 2-cycle multiply-accumulate architecture," in Proc. IEEE Int. SOC Conf. (SOC), Sep. 2009, pp. 119–122.
[12] C. R. Baugh and B. A. Wooley, "A two's complement parallel array multiplication algorithm," IEEE Trans. Comput., vol. C-22, pp. 1045–1047, Dec. 1973.
[13] M. Själander and P. Larsson-Edefors, "Multiplication acceleration through twin precision," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 17, pp. 1233–1246, Sep. 2009.
[14] M. Hatamian and G. L. Cash, "A 70-MHz 8-bit × 8-bit parallel pipelined multiplier in 2.5-μm CMOS," IEEE J. Solid-State Circuits, vol. JSSC-21, no. 4, pp. 505–513, 1986.
[15] H. Eriksson, P. Larsson-Edefors, M. Sheeran, M. Själander, D. Johansson, and M. Schölin, "Multiplier reduction tree with logarithmic logic depth and regular connectivity," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2006, pp. 4–8.
[16] J. Sklansky, "Conditional-sum addition logic," IRE Trans. Electronic Comput., vol. EC-9, pp. 226–231, 1960.
[17] P. M. Kogge and H. S. Stone, "A parallel algorithm for the efficient solution of a general class of recurrence equations," IEEE Trans. Comput., vol. C-22, no. 8, pp. 786–793, Aug. 1973.
[18] D. Brooks and M. Martonosi, "Dynamically exploiting narrow width operands to improve processor power and performance," in Proc. Int. Symp. High-Perform. Comput. Archit., 1999, pp. 13–22.
[19] S. Yoshizawa and Y. Miyanaga, "Use of a variable wordlength technique in an OFDM receiver to reduce energy dissipation," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 55, no. 9, pp. 2848–2859, Oct. 2008.
[20] R. K. Kolagotla, J. Fridman, B. C. Aldrich, M. M. Hoffman, W. C. Anderson, M. S. Allen, D. B. Witt, R. R. Dunton, and L. A. Booth, "High performance dual-MAC DSP architecture," IEEE Signal Process. Mag., vol. 19, no. 4, pp. 42–53, Jul. 2002.
[21] S. Hong and S.-S. Chin, "Reconfigurable embedded MAC core design for low-power coarse-grain FPGA," Electron. Lett., vol. 39, no. 7, pp. 606–608, Apr. 2003.
[22] T. T. Hoang, M. Själander, and P. Larsson-Edefors, "Double throughput multiply-accumulate unit for FlexCore processor enhancements," presented at the IEEE Int. Symp. Parallel Distrib. Process. (IPDPS), Reconfigurable Archit. Workshop (RAW), Rome, Italy, May 2009.
[23] S.-R. Kuang and J.-P. Wang, "Design of power-efficient configurable Booth multiplier," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 57, no. 3, pp. 568–580, Mar. 2010.
[24] M. Själander, H. Eriksson, and P. Larsson-Edefors, "An efficient twin-precision multiplier," in Proc. IEEE Int. Conf. Comput. Des. (ICCD), Oct. 2004, pp. 30–33.
[25] M. Thuresson, M. Själander, M. Björk, L. Svensson, P. Larsson-Edefors, and P. Stenström, "FlexCore: Utilizing exposed datapath control for efficient computing," Springer J. Signal Process. Syst., vol. 57, no. 1, pp. 5–19, Oct. 2009.
Tung Thanh Hoang received the B.S. degree in
electronic engineering from Hanoi University of
Science and Technology, Hanoi, Vietnam, in 2003
and the M.Sc. degree in electrical engineering from
Korea University, Seoul, in 2007. He is currently
working toward the Ph.D. degree at the Department
of Computer Science and Engineering, Chalmers
University of Technology, Sweden.
His research interests are in the areas of high-performance, low-power digital circuits and their application in embedded systems.