"A Low Power High Speed Configurable Adder For Approximate Computing"
Master of Technology
In the Discipline of
VLSI System Design
By
A.Mounika (18641D5710)
2019-2020
VAAGDEVI COLLEGE OF ENGINEERING
(Autonomous, Affiliated to JNTUH, Accredited By NBA & NAAC)
BOLLIKUNTA, WARANGAL - 506 005
CERTIFICATE
This is to certify that the Project Work Phase-I entitled "A Low Power High Speed Configurable Adder for Approximate Computing" is a bonafide work carried out by A. Mounika (18641D5710) in partial fulfillment of the requirements for the award of the degree of Master of Technology in VLSI System Design from Vaagdevi College of Engineering (Autonomous) during the academic year 2019-2020.
DECLARATION
I hereby declare that the work presented in this M.Tech VLSI System Design Project Phase-I report entitled "A LOW POWER HIGH-SPEED CONFIGURABLE ADDER FOR APPROXIMATE COMPUTING" is a record of my own work done in partial fulfillment of the requirements for the award of the degree of Master of Technology in VLSI System Design at VAAGDEVI COLLEGE OF ENGINEERING (Autonomous), Affiliated to JNTUH, Accredited by NBA & NAAC (A Grade), under the guidance of Mr. B. Ranjith Kumar, Asst. Professor, ECE Department. I also declare that this project work shall be carried forward in Project Work Phase-II in the next semester, and that the work in this report bears no resemblance to any other project submitted at Vaagdevi College of Engineering or any other university/college for the award of a degree.
A.Mounika (18641D5710)
ACKNOWLEDGEMENT
The development of this project, though an arduous task, was made possible with the help of many people. I am pleased to express my thanks to all those whose suggestions, comments, and criticism greatly encouraged me in the betterment of the project.
I would like to express my sincere gratitude and indebtedness to my project guide, Mr. B. Ranjith Kumar, Asst. Professor, for his valuable suggestions and interest throughout the course of this project.
I would like to express my sincere thanks and profound gratitude to Dr. K. Prakash, Principal of Vaagdevi College of Engineering, for his support, guidance, and encouragement over the course of this project.
I am also thankful to the Head of the Department, Mr. M. Shashidhar, Associate Professor, for providing excellent infrastructure and a pleasant atmosphere for completing this project successfully.
I am highly thankful to the Project Coordinators for their valuable suggestions, encouragement, and motivation for completing this project successfully.
I convey my heartfelt thanks to the lab staff for allowing me to use the required equipment
whenever needed.
Finally, I would like to take this opportunity to thank my family for their support throughout this work. I sincerely acknowledge and thank all those who directly or indirectly supported the completion of this work.
A.Mounika (18641D5710)
1.1 MOTIVATION
As the scale of integration keeps growing, more and more sophisticated signal
processing systems are being implemented on a VLSI chip. These signal processing applications
not only demand great computation capacity but also consume a considerable amount of energy. While performance and area remain the two major design goals, power consumption has become a critical concern in today's VLSI system design. The need for low-power VLSI
system arises from two main forces. First, with the steady growth of operating frequency and
processing capacity per chip, large currents have to be delivered and the heat due to large power
consumption must be removed by proper cooling techniques. Second, battery life in portable
electronic devices is limited. Low power design directly leads to prolonged operation time in
these portable devices.
Addition is a crucial arithmetic function and usually has a wide impact on the overall performance of digital systems. Adders are among the most widely used blocks in electronic applications; they appear in multipliers and in DSP systems executing algorithms such as the FFT and FIR/IIR filters. Wherever multiplication is involved, adders come into the picture. Microprocessors perform millions of instructions per second, so speed of operation is the most important constraint to be considered while designing multipliers. Because devices must be portable, miniaturization should be high and power consumption low; devices such as mobile phones and laptops require long battery backup.
So, a VLSI designer has to optimize these three parameters in a design. These constraints are very difficult to achieve simultaneously, so depending on the demand or application, some compromise between them has to be made. Ripple carry adders exhibit the most compact design but are the slowest in speed, whereas the carry look-ahead adder is the fastest but consumes more area. Carry select adders act as a compromise between the two. In 2002, Wang et al. presented a new concept of hybrid adders to speed up the addition process, giving a hybrid carry look-ahead/carry-select adder design. In 2008, low-power multipliers based on new hybrid full adders were presented.
The design of area- and power-efficient high-speed data-path logic systems is one of the
most substantial areas of research in VLSI system design. In digital adders, the speed of addition
is limited by the time required to propagate a carry through the adder. The sum for each bit
position in an elementary adder is generated sequentially only after the previous bit position has
been summed and a carry propagated into the next position.
The CSLA is used in many computational systems to alleviate the problem of carry propagation delay by independently generating multiple carries and then selecting a carry to generate the sum.
The design of portable devices requires consideration for peak power consumption to
ensure reliability and proper operation. However, the time averaged power is often more critical
as it is linearly related to the battery life. There are four sources of power dissipation in digital
CMOS circuits: switching power, short-circuit power, leakage power, and static power. The following equation describes these four components of power:

Ptotal = Pswitching + Pshortcircuit + Pleakage + Pstatic
       = α·CL·Vs·Vdd·fck + Isc·Vdd + Ileakage·Vdd + Istatic·Vdd
Pswitching is the switching power. For a properly designed CMOS circuit, this power component usually dominates and may account for more than 90% of the total power. α denotes the transition activity factor, defined as the average number of power-consuming transitions made at a node in one clock period. Vs is the voltage swing, which in most cases is the same as the supply voltage, Vdd. CL is the node capacitance; it can be broken into three components: the gate capacitance, the diffusion capacitance, and the interconnect capacitance. The interconnect capacitance is in general a function of placement and routing. fck is the clock frequency. The switching power for static CMOS is derived as follows.
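As a numerical illustration of the expression above, the following Python sketch (illustrative values, not measurements from this work) evaluates the switching power of a single node:

```python
# Numerical illustration of Pswitching = α·CL·Vs·Vdd·fck (illustrative
# values, not taken from this work).

def switching_power(alpha, c_load, v_swing, v_dd, f_clk):
    """Average switching power of a single node, in watts."""
    return alpha * c_load * v_swing * v_dd * f_clk

# α = 0.5, CL = 50 fF, rail-to-rail swing at Vdd = 1.2 V, fck = 1 GHz
p = switching_power(0.5, 50e-15, 1.2, 1.2, 1e9)   # 36 microwatts
```

Note how the rail-to-rail swing makes the result proportional to Vdd squared, which is why supply-voltage scaling is so effective.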
During the low-to-high output transition, the path from Vdd to the output node conducts to charge CL. Hence, the energy provided by the supply source is

E = ∫ Vdd·i(t) dt

where i(t) = CL·dVout/dt is the current drawn from the supply through the resistance R of the path between Vdd and the output node. Therefore, the energy can be rewritten as

E = CL·Vdd·∫ dVout = CL·Vs·Vdd    (2.4)

During the high-to-low transition, no energy is supplied by the source. Hence, the average power consumed during one clock cycle is

P = CL·Vs·Vdd·fck    (2.5)
Eq. (2.4) and Eq. (2.5) estimate the energy and the power of a single gate only. From a system point of view, the activity factor α is used to account for the actual number of gates switching at a point in time.
Pshortcircuit is the short-circuit power. It is a type of dynamic power and is typically
much smaller than Pswitching. Isc is known as the direct-path short circuit current. It refers to
the conducting current from power supply directly to ground when both the NMOS and PMOS
transistors are simultaneously active during switching.
Pleakage is the leakage power. Ileakage refers to the leakage current. It is primarily
determined by fabrication technology considerations and originates from two sources. The first is
the reverse leakage current of the parasitic drain-/source-substrate diodes. This current is in the
order of a few femtoamperes per diode, which translates into a few microwatts of power for a
million transistors. The second source is the subthreshold current of MOSFETs, which is in the
order of a few nanoamperes. For a million transistors, the total subthreshold leakage current
results in a few milliwatts of power.
Pstatic is the static power and Istatic the static current. This current arises from circuits that have a constant current path between the power supplies, such as bias circuitry and pseudo-NMOS logic families. For the CMOS logic family, power is ideally dissipated only when the circuits switch, with no static power consumption.
Energy is independent of the clock frequency. Reducing the frequency will lower the
power consumption but will not change the energy required to perform a given operation, as
depicted by Eq. (2.4) and Eq. (2.5). It is important to note that the battery life is determined by
energy consumption, whereas the heat dissipation considerations are related to the power
consumption.
There are four factors that influence the power dissipation of CMOS circuits. They are
technology, circuit design style, architecture, and algorithm. The challenge of meeting the
contradicting goals of high performance and low power system operation has motivated the
development of low power process technologies and the scaling of device feature sizes.
Design considerations for low power should be carried out at all steps of the design hierarchy, namely: 1) fundamental, 2) material, 3) device, 4) circuit, and 5) system.
Power consumption is linearly proportional to voltage swing (Vs) and supply voltage
(Vdd) as indicated in Eq. (2.5). For most CMOS logic families, the swing is typically rail-to-rail.
Hence, power consumption is also said to be proportional to the square of the supply voltage,
Vdd. Therefore, lowering Vdd is an efficient approach to reduce both energy and power, presuming that the signal voltage swing can be freely chosen. This is, however, at the expense of circuit delay. The delay, td, can be shown to be proportional to Vdd/(Vdd − Vth)^α. The exponent α is between 1 and 2. It tends to be closer to 1 for MOS transistors in the deep sub-micrometer region, where carrier velocity saturation may occur, and increases toward 2 for longer-channel transistors.
The current technology trends are to reduce feature size and lower supply voltage.
Lowering Vdd leads to increased circuit delays and therefore lower functional throughput.
Smaller feature size, however, reduces gate delay, as it is inversely proportional to the square of
the effective channel length of the devices. In addition, thinner gate oxides impose voltage
limitation for reliability reasons. Hence, the supply voltage must be lowered for smaller
geometries. The net effect is that circuit performance improves as CMOS technologies scale down, despite the Vdd reduction. Therefore, the new technology has made it possible to fulfill the contradicting requirements of low power and high throughput.
The various techniques that are currently used to scale the supply voltage include
optimizing the technology and device reliability, trading off area for low power in architecture
driven approach, and exploiting the concurrency possibility in algorithmic transformations.
Hence, the voltage scaling is limited by the threshold voltage Vth.
In applications such as digital processing, where the throughput is of more concern than
the speed, architecture can be designed to reduce the supply voltage at the expense of speed
without throughput degradation. Hence, the performance of the system can be maintained.
1.3 SOFTWARE TOOLS
Verilog (HDL)
ModelSim 6.4b
Xilinx ISE 10.1
1.4 ADVANTAGES
1.5 APPLICATIONS
CHAPTER 2
LITERATURE SURVEY
In electronics, an adder or summer is a digital circuit that performs addition of numbers. In many computers and other kinds of processors, adders are used not only in the arithmetic logic unit(s) but also in other parts of the processor, where they are used to calculate addresses, table indices, and similar values.
Although adders can be constructed for many numerical representations, such as binary-
coded decimal or excess-3, the most common adders operate on binary numbers. In cases
where two's complement or ones' complement is being used to represent negative numbers, it is
trivial to modify an adder into an adder–subtractor. Other signed number representations require
a more complex adder. The different types of adders are as follows.
2.1.1 HALF ADDER
Figure: Half adder.
2.1.2 FULL ADDER
Schematic symbol for a 1-bit full adder with Cin and Cout drawn on sides of block to
emphasize their use in a multi-bit adder. A full adder adds binary numbers and accounts for
values carried in as well as out. A one-bit full adder adds three one-bit numbers, often written
as A, B, and Cin; A and B are the operands, and Cin is a bit carried in from the next less significant
stage.[2] The full adder is usually a component in a cascade of adders, which add 8-, 16-, 32-bit, etc. binary numbers. The circuit produces a two-bit output sum typically represented by the signals Cout and S, where sum = 2·Cout + S. The one-bit full adder's truth table is given below.
A full adder can be implemented in many different ways, such as with a custom transistor-level circuit or composed of other gates. One example implementation is

S = A ⊕ B ⊕ Cin and Cout = (A·B) + (Cin·(A ⊕ B)).
In this implementation, the final OR gate before the carry-out output may be replaced by an XOR gate without altering the resulting logic. Using only two types of gates is convenient if the circuit is being implemented using simple IC chips which contain only one gate type per chip.

Inputs      Outputs
A  B  Cin   Cout  S
0 0 0 0 0
1 0 0 0 1
0 1 0 0 1
1 1 0 1 0
0 0 1 0 1
1 0 1 1 0
0 1 1 1 0
1 1 1 1 1
A full adder can be constructed from two half adders by connecting A and B to the input of one half adder, connecting the sum from that to an input of the second adder, connecting Ci to the other input, and ORing the two carry outputs. Equivalently, S could be made the three-bit XOR of A, B, and Ci, and Cout could be made the three-bit majority function of A, B, and Ci.
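The gate-level equations and the two-half-adder construction described above can be cross-checked with a small behavioral model (a Python sketch; the thesis itself targets Verilog):

```python
# Two behavioral models of the 1-bit full adder (Python sketches, not the
# Verilog used in this work): the gate-level equations, and the
# two-half-adder cascade. They agree on all eight input combinations.

def full_adder(a, b, cin):
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def full_adder_from_half_adders(a, b, cin):
    s1, c1 = a ^ b, a & b        # first half adder on A, B
    s2, c2 = s1 ^ cin, s1 & cin  # second half adder on S1, Cin
    return s2, c1 | c2           # OR of the two carry outputs

for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            assert full_adder(a, b, cin) == full_adder_from_half_adders(a, b, cin)
```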
2.1.3 RIPPLE CARRY ADDERS (RCA)
Concatenating N full adders forms an N-bit ripple carry adder, in which the carry-out of each full adder becomes the input carry of the next. It calculates sum and carry according to the following equations:

Si = Ai ⊕ Bi ⊕ Ci
Ci+1 = (Ai·Bi) + (Ci·(Ai ⊕ Bi))

As the carry ripples from one full adder to the next, it traverses the longest critical path and exhibits the worst-case delay. The RCA is the slowest of all adders (O(n) time) but is very compact in size (O(n) area). If the ripple carry adder is implemented by concatenating N full adders, the delay of such an adder is 2N gate delays from Cin to Cout; the delay therefore increases linearly with the number of bits. The block diagram of the RCA is shown in figure 1.
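The rippling behavior can be sketched as a behavioral Python model (LSB-first bit lists; a sketch, not the Verilog implementation used in this work):

```python
# Behavioral sketch of an N-bit ripple-carry adder (Python model; bit lists
# are LSB-first). The carry variable c is reused stage by stage, mirroring
# how the carry ripples through the chain of full adders.

def ripple_carry_add(a_bits, b_bits, cin=0):
    s_bits, c = [], cin
    for a, b in zip(a_bits, b_bits):
        s_bits.append(a ^ b ^ c)         # Si = Ai ^ Bi ^ Ci
        c = (a & b) | (c & (a ^ b))      # Ci+1 = Ai.Bi + Ci.(Ai ^ Bi)
    return s_bits, c

def to_bits(x, n):
    return [(x >> i) & 1 for i in range(n)]

def from_bits(bits):
    return sum(b << i for i, b in enumerate(bits))

s, cout = ripple_carry_add(to_bits(13, 4), to_bits(11, 4))
# 13 + 11 = 24: 4-bit sum 8 with carry-out 1
```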
A carry skip adder divides the words to be added into groups of equal size of k bits. The carry-propagate signals pi may be used within a group of bits to accelerate the carry propagation: if all the pi signals within the group are pi = 1, the carry bypasses the entire group, as shown in figure 2.
TCSKA = (b − 1) + 0.5 + (N/b − 2) + (b − 1) = 2b + N/b − 3.5 stages
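As a quick worked example of this stage-count expression (a Python sketch; the 0.5-stage term is the skip cost as given above):

```python
# Quick check of the carry-skip stage-count expression above:
# T = (b - 1) + 0.5 + (N/b - 2) + (b - 1), which simplifies to 2b + N/b - 3.5.

def carry_skip_stages(n_bits, block_width):
    b, n = block_width, n_bits
    return (b - 1) + 0.5 + (n / b - 2) + (b - 1)   # = 2b + N/b - 3.5

# For N = 32 and b = 4 this gives 12.5 stages, versus roughly 2N = 64
# gate delays for a plain ripple-carry adder.
```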
Block width strongly affects the latency of the adder: latency grows with block width, while more blocks (i.e., a smaller block width) means more skip stages and hence more delay as well. The idea behind the Variable Block Adder (VBA) is to minimize the critical path delay in the carry chain of a carry skip adder while allowing the groups to take different sizes. Compared with the fixed-block carry skip adder, this results in more skips between stages.
Such an adder design is called a variable block design, which is widely used to increase the speed of the adder. In the variable block carry skip adder design, a 32-bit adder is divided into 4 blocks or groups. The bit widths of the groups are taken as follows: the first block is 4 bits, the second is 6 bits, the third is 18 bits wide, and the last group consists of the most significant 4 bits.
Table 1 shows the logic utilization of the carry skip and variable-block carry skip 32-bit adders, along with the obtained power and delay. From the table it can be observed that the variable block design consumes more area, as the gate count and the number of LUTs consumed by the variable block design exceed those of the conventional carry skip adder.
The carry select adder belongs to the category of conditional sum adders, which work on a condition: sum and carry are calculated by assuming the input carry to be 1 and 0 before the actual input carry arrives. When the actual carry input arrives, the correct values of sum and carry are selected using a multiplexer. The conventional carry select adder consists of a k/2-bit adder for the lower half of the bits (the least significant bits) and, for the upper half (the most significant bits, MSBs), two k/2-bit adders. Of the MSB adders, one assumes a carry input of one and the other a carry input of zero. The carry-out calculated from the least significant stage is used to select, via a multiplexer, the actual values of the output carry and sum. This technique of dividing the adder into stages increases area utilization but speeds up the addition. The block diagram of a conventional k-bit adder is shown in figure 3.
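The select-between-precomputed-halves idea can be sketched behaviorally (a Python model; the two-block structure here is an assumed simplification of the k-bit adder described above):

```python
# Behavioral sketch of a k-bit carry-select adder (Python model): lower k/2
# bits in one RCA, upper k/2 bits computed twice for carry-in 0 and 1, with
# the real lower-half carry driving the selecting mux.

def rca(a, b, cin, n):
    """n-bit ripple-carry addition of integers a and b; returns (sum, cout)."""
    total = a + b + cin
    return total & ((1 << n) - 1), (total >> n) & 1

def carry_select_add(a, b, k):
    half = k // 2
    lo_mask = (1 << half) - 1
    lo_sum, lo_carry = rca(a & lo_mask, b & lo_mask, 0, half)
    hi_a, hi_b = a >> half, b >> half
    hi0 = rca(hi_a, hi_b, 0, half)            # upper half assuming cin = 0
    hi1 = rca(hi_a, hi_b, 1, half)            # upper half assuming cin = 1
    hi_sum, cout = hi1 if lo_carry else hi0   # mux: real carry selects
    return (hi_sum << half) | lo_sum, cout

# e.g. 200 + 100 in 8 bits: sum 44 with carry-out 1 (300 mod 256 = 44)
```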
Basics: The carry-select adder is a logic element that computes the (n + 1)-bit sum of two n-bit numbers. The carry-select adder generally consists of ripple-carry adders and a multiplexer.
The number of bits in each carry select block can be uniform or variable. In the uniform case, the optimal delay occurs for a block size of √n. When variable, each block should have a delay, from the addition inputs A and B to the carry out, equal to that of the multiplexer chain leading into it, so that the carry out is calculated just in time. The O(√n) delay is derived from uniform sizing, where the ideal number of full-adder elements per block is equal to the square root of the number of bits being added, since that yields an equal number of MUX delays.
A carry look-ahead adder (CLA) is a type of adder used in digital logic. A carry look-ahead adder improves speed by reducing the amount of time required to determine carry bits. It can be contrasted with the simpler, but usually slower, ripple carry adder, for which the carry bit is calculated alongside the sum bit, and each bit must wait until the previous carry has been calculated before it begins calculating its own result and carry bits. The carry look-ahead adder calculates one or more carry bits before the sum, which reduces the wait time to calculate the result of the larger-value bits. The Kogge-Stone adder and Brent-Kung adder are examples of this type of adder.
This means that no digit position can have an absolutely final value until it has been
established whether or not a carry is coming in from the right. Moreover, if the sum without a
carry is 9 (in pencil-and-paper methods) or 1 (in binary arithmetic), it is not even possible to tell
whether or not a given digit position is going to pass on a carry to the position on its left. At
worst, when a whole sequence of sums comes to ...99999999... (in decimal) or ...11111111... (in
binary), nothing can be deduced at all until the value of the carry coming in from the right is
known, and that carry is then propagated to the left, one step at a time, as each digit position
evaluated "9+1=0, carry 1" or "1+1=0, carry 1". It is the "rippling" of the carry from right to left
that gives a ripple-carry adder its name, and its slowness. When adding 32-bit integers, for
instance, allowance has to be made for the possibility that a carry could have to ripple through
every one of the 32 one-bit adders.
1. Calculating, for each digit position, whether that position is going to propagate a carry if
one comes in from the right.
2. Combining these calculated values to be able to deduce quickly whether, for each group
of digits, that group is going to propagate a carry that comes in from the right.
For each bit in a binary sequence to be added, the Carry Look Ahead Logic will
determine whether that bit pair will generate a carry or propagate a carry. This allows the circuit
to "pre-process" the two numbers being added to determine the carry ahead of time. Then, when
the actual addition is performed, there is no delay from waiting for the ripple carry effect (or time
it takes for the carry from the first Full Adder to be passed down to the last Full Adder). Below is
a simple 4-bit generalized Carry Look Ahead circuit that combines with the 4-bit Ripple Carry
Adder we used above with some slight adjustments:
For the example provided, the logic for the generate (g) and propagate (p) values is given below. Note that the numeric suffix identifies the signal in the circuit above, starting from 0 on the far left to 3 on the far right:

g0 = A0·B0, p0 = A0 + B0
g1 = A1·B1, p1 = A1 + B1
g2 = A2·B2, p2 = A2 + B2
g3 = A3·B3, p3 = A3 + B3

Substituting c1 into c2, then c2 into c3, then c3 into c4 yields the expanded equations:

c1 = g0 + p0·c0
c2 = g1 + p1·g0 + p1·p0·c0
c3 = g2 + p2·g1 + p2·p1·g0 + p2·p1·p0·c0
c4 = g3 + p3·g2 + p3·p2·g1 + p3·p2·p1·g0 + p3·p2·p1·p0·c0
To determine whether a bit pair will generate a carry, the following logic works:

gi = Ai·Bi

To determine whether a bit pair will propagate a carry, either of the following logic statements works:

pi = Ai + Bi    or    pi = Ai ⊕ Bi
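The generate/propagate scheme can be sketched as a behavioral Python model (the recurrence below is flattened into the expanded sum-of-products equations in real CLA hardware):

```python
# Behavioral sketch of a 4-bit carry look-ahead adder (Python model, not the
# thesis's Verilog). g_i = A_i.B_i (generate), p_i = A_i ^ B_i (propagate);
# each carry follows c_{i+1} = g_i + p_i.c_i, which hardware flattens into
# the expanded sum-of-products equations so no carry ripples.

def cla4(a, b, c0=0):
    """Add two 4-bit integers using generate/propagate carry logic."""
    ab = [((a >> i) & 1, (b >> i) & 1) for i in range(4)]
    g = [x & y for x, y in ab]           # generate
    p = [x ^ y for x, y in ab]           # propagate
    c = [c0]
    for i in range(4):
        c.append(g[i] | (p[i] & c[i]))   # expanded flat in real CLA hardware
    s = sum((p[i] ^ c[i]) << i for i in range(4))
    return s, c[4]

# 11 + 6 = 17: 4-bit sum 1 with carry-out 1
```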
Using basic arithmetic, we calculate right to left, "8+2=0, carry 1", "7+2+1=0, carry 1",
"6+3+1=0, carry 1", and so on to the end of the sum. Although we know the last digit of the
result at once, we cannot know the first digit until we have gone through every digit in the
calculation, passing the carry from each digit to the one on its left. Thus adding two n-digit
numbers has to take a time proportional to n, even if the machinery we are using would
otherwise be capable of performing many calculations simultaneously.
In electronic terms, using bits (binary digits), this means that even if we have n one-bit adders at our disposal, we still have to allow a time proportional to n for a possible carry to propagate from one end of the number to the other. Until we have done this, we do not know the result of the addition.
A carry look-ahead adder can reduce the delay. In principle the delay can be reduced so that it is proportional to log n, but for large numbers this is no longer the case, because even when carry look-ahead is implemented, the distances that signals have to travel on the chip increase in proportion to n, and propagation delays increase at the same rate. Once we get to the 512-bit to 2048-bit number sizes that are required in public-key cryptography, carry look-ahead is not of much help.
Drawbacks
Montgomery multiplication, which depends on the rightmost digit of the result, is one solution; though, rather like carry-save addition itself, it carries a fixed overhead, so that a sequence of Montgomery multiplications saves time but a single one does not. Fortunately, exponentiation, which is effectively a sequence of multiplications, is the most common operation in public-key cryptography.
The carry-save unit consists of n full adders, each of which computes a single sum and carry bit based solely on the corresponding bits of the three input numbers. Given the three n-bit numbers a, b, and c, it produces a partial sum ps and a shift-carry sc:

ps_i = a_i ⊕ b_i ⊕ c_i
sc_i = (a_i·b_i) + (a_i·c_i) + (b_i·c_i)
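This per-bit rule can be checked behaviorally (a Python sketch); since sc carries twice the weight of ps, the invariant a + b + c = ps + 2·sc holds:

```python
# Behavioral sketch of a carry-save stage (Python model): each bit position
# independently computes ps_i = a_i ^ b_i ^ c_i and the majority function
# sc_i = a_i.b_i + a_i.c_i + b_i.c_i. Since sc has weight 2, the invariant
# a + b + c == ps + 2*sc holds.

def carry_save(a, b, c, n):
    ps = sc = 0
    for i in range(n):
        ai, bi, ci = (a >> i) & 1, (b >> i) & 1, (c >> i) & 1
        ps |= (ai ^ bi ^ ci) << i                        # partial sum
        sc |= ((ai & bi) | (ai & ci) | (bi & ci)) << i   # shift-carry
    return ps, sc

ps, sc = carry_save(9, 5, 3, 4)
assert 9 + 5 + 3 == ps + 2 * sc
```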
CHAPTER 3
EXISTING METHOD
The design of area- and power-efficient high-speed data-path logic systems is one of the most substantial areas of research in VLSI system design. In digital adders, the speed of addition is limited by the time required to propagate a carry through the adder. The sum for each bit position in an elementary adder is generated sequentially, only after the previous bit position has been summed and a carry propagated into the next position. The CSLA is used in many computational systems to alleviate the problem of carry propagation delay by independently generating multiple carries and then selecting a carry to generate the sum. However, the CSLA is not area efficient because it uses multiple pairs of Ripple Carry Adders (RCA) to generate partial sum and carry by considering carry inputs cin = 0 and cin = 1; the final sum and carry are then selected by multiplexers (mux). The basic idea of this work is to use a Binary to Excess-1 Converter (BEC) instead of the RCA with cin = 1 in the regular CSLA to achieve lower area and power consumption. The main advantage of this BEC logic comes from the smaller number of logic gates than in the n-bit Full Adder (FA) structure.
Fig. Regular 16-b SQRT CSLA.
3.1.1 MULTIPLEXER
In electronics, a multiplexer (or MUX) is a device that selects one of several analog or
digital input signals and forwards the selected input into a single line.[1] A multiplexer of 2n
inputs has n select lines, which are used to select which input line to send to the output.[2]
Multiplexers are mainly used to increase the amount of data that can be sent over the network
within a certain amount of time and bandwidth.[1] A multiplexer is also called a data selector. Multiplexers are used in CCTV, and almost every business that has CCTV fitted will own one of these.
An electronic multiplexer makes it possible for several signals to share one device or resource,
for example one A/D converter or one communication line, instead of having one device per
input signal.
On the other hand, a demultiplexer (or demux) is a device taking a single input signal and
selecting one of many data-output-lines, which is connected to the single input. A multiplexer is
often used with a complementary demultiplexer on the receiving end.[1]
An electronic multiplexer can be considered as a multiple-input, single-output switch,
and a demultiplexer as a single-input, multiple-output switch.[3] The schematic symbol for a
multiplexer is an isosceles trapezoid with the longer parallel side containing the input pins and
the short parallel side containing the output pin.[4] The schematic on the right shows a 2-to-1
multiplexer on the left and an equivalent switch on the right. The wire connects the desired input
to the output.
In digital circuit design, the selector wires are of digital value. In the case of a 2-to-1 multiplexer, a logic value of 0 would connect A to the output while a logic value of 1 would connect B to the output. In larger multiplexers, the number of selector pins is equal to ⌈log2(n)⌉, where n is the number of inputs.
For example, 9 to 16 inputs would require no fewer than 4 selector pins and 17 to 32
inputs would require no fewer than 5 selector pins. The binary value expressed on these selector
pins determines the selected input pin.
A 2-to-1 multiplexer has the boolean equation

Z = (A·~S) + (B·S)

where A and B are the two inputs, S is the selector input, and Z is the output. This equation shows that when S = 0 then Z = A, but when S = 1 then Z = B. A straightforward realization of this 2-to-1 multiplexer would need 2 AND gates, an OR gate, and a NOT gate.
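The 2-to-1 multiplexer equation Z = A·~S + B·S (its standard form) can be verified over all inputs with a short Python check (a sketch of the AOI realization):

```python
# Truth check of the 2-to-1 multiplexer equation Z = A.~S + B.S (a sketch of
# the realization described above: 2 AND gates, 1 OR gate, 1 NOT gate).

def mux2(a, b, s):
    return (a & (1 - s)) | (b & s)

for a in (0, 1):
    for b in (0, 1):
        assert mux2(a, b, 0) == a   # S = 0 routes input A to the output
        assert mux2(a, b, 1) == b   # S = 1 routes input B to the output
```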
Larger multiplexers are also common and, as stated above, require ⌈log2(n)⌉ selector pins for n inputs. Other common sizes are 4-to-1, 8-to-1, and 16-to-1. Since digital logic uses binary values, powers of 2 (4, 8, 16) are used to maximally control a number of inputs for the given number of selector inputs.
4-to-1 mux
The AND, OR, and Inverter (AOI) implementation of an XOR gate is shown in the figure below. The gates between the dotted lines perform their operations in parallel, and the numeric label on each gate indicates the delay contributed by that gate. The delay and area evaluation methodology considers all gates to be made up of AND, OR, and Inverter, each having a delay of 1 unit and an area of 1 unit. We then add up the number of gates in the longest path of a logic block that contributes to the maximum delay.
The area evaluation is done by counting the total number of AOI gates required for each logic
block. Based on this approach, the CSLA adder blocks of 2:1 mux, Half Adder (HA), and FA are
evaluated and listed in Table I.
The structure of the 16-b regular SQRT CSLA has five groups of different-size RCA. The delay and area evaluation of each group are shown in the figure, in which the numerals specify the delay values; e.g., sum2 requires 10 gate delays. The steps leading to the evaluation are as follows.
1) Group2 has two sets of 2-b RCA. Based on the delay values of Table I, the arrival time of the selection input c1 [t = 7] of the 6:3 mux is earlier than s3 [t = 8] and later than s2 [t = 6]. Thus, sum3 [t = 11] is the summation of s3 and the mux delay [t = 3], and sum2 [t = 10] is the summation of c1 and the mux delay.
2) Except for group2, the arrival time of the mux selection input is always greater than the arrival time of the data outputs from the RCAs. Thus, the delay of group3 to group5 is determined, respectively, by the arrival time of the mux selection input plus the mux delay.
3) One set of 2-b RCA in group2 has 2 FA for cin = 1 and the other set has 1 FA and 1 HA for cin = 0. Based on the area count of Table I, the total number of gate counts in group2 is determined as follows:
gate count = 57 (FA + HA + mux) = 39 (3 × 13 FA) + 6 (1 × 6 HA) + 12 (3 × 4 mux)
4) Similarly, the estimated maximum delay and area of the other groups in the regular SQRT CSLA are evaluated and listed.
3.3.1 DISADVANTAGES
The number of full adder cells is larger, and thereby the power consumption of the design increases. The doubled number of full adder cells also increases the area of the design.
The figure above illustrates how the basic function of the CSLA is obtained by using a 4-bit BEC together with the mux. One input of the 8:4 mux gets the direct inputs (B3, B2, B1, and B0) and the other input of the mux is the BEC output. This produces the two possible partial results in parallel, and the mux is used to select either the BEC output or the direct inputs according to the control signal Cin. The importance of the BEC logic stems from the large silicon area reduction when CSLAs with a large number of bits are designed. The Boolean expressions of the 4-bit BEC are listed below (note the functional symbols: ~ NOT, & AND, ^ XOR):

X0 = ~B0
X1 = B0 ^ B1
X2 = B2 ^ (B0 & B1)
X3 = B3 ^ (B0 & B1 & B2)
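The 4-bit BEC expressions (as commonly given in the BEC-based CSLA literature; assumed here) can be checked behaviorally; the BEC must realize (input + 1) mod 16, which is exactly what the replaced cin = 1 RCA copy computed:

```python
# Behavioral check of the 4-bit Binary to Excess-1 Converter (expressions as
# commonly given in the BEC-based CSLA literature; assumed here):
# X0 = ~B0, X1 = B0 ^ B1, X2 = B2 ^ (B0 & B1), X3 = B3 ^ (B0 & B1 & B2).

def bec4(b):
    b0, b1, b2, b3 = [(b >> i) & 1 for i in range(4)]
    x = [1 - b0,
         b0 ^ b1,
         b2 ^ (b0 & b1),
         b3 ^ (b0 & b1 & b2)]
    return sum(bit << i for i, bit in enumerate(x))

# The BEC realizes (input + 1) mod 16, replacing the whole cin = 1 RCA copy.
assert all(bec4(b) == (b + 1) % 16 for b in range(16))
```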
Figure: Modified SQRT carry select adder. The parallel RCA with Cin = 1 is replaced with a BEC.
The delay and area estimation of each group are shown in Fig below. The steps leading to the
evaluation are given here.
Figure: Delay and area evaluation of modified SQRT CSLA: (a) group2, (b) group3, (c) group4,
and (d) group5. H is a Half Adder.
Group2 has one 2-b RCA, which has 1 FA and 1 HA for Cin = 0. Instead of another 2-b RCA with Cin = 1, a 3-b BEC is used, which adds one to the output of the 2-b RCA. Based on the delay values of Table I, the arrival time of the selection input C1 [t = 7] of the 6:3 mux is earlier than s3 [t = 9] and C3 [t = 10] and later than s2 [t = 4]. Thus, sum3 and the final C3 (output from the mux) depend on s3 plus the mux and the partial C3 (input to the mux) plus the mux, respectively. The sum2 depends on C1 and the mux.
1) The area count of group2 is determined as follows:
To remove the duplicate adder cells in the conventional CSLA, an area-efficient SQRT
CSLA is proposed by sharing a Common Boolean Logic (CBL) term. Analysis of the
truth table of a single-bit full adder shows that the sum output when the carry-in
signal is logic "0" is the inverse of the sum output when the carry-in signal is logic "1". This is
illustrated by red circles in Table II. To share the Common Boolean Logic term, we only need to
implement one XOR gate and one INV gate to generate the summation pair, and one OR gate
and one AND gate to generate the carry pair. In this way, the summation and carry circuits can
be kept in parallel.
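The CBL observation above can be sketched in a few lines. This is a hypothetical Python model (the thesis's actual design is in Verilog); it shows that one XOR plus one inverter yields both sum candidates, and one AND plus one OR yields both carry candidates:

```python
# Behavioral sketch of the Common Boolean Logic (CBL) cell: both conditional
# results of a full adder are produced from four shared gates instead of two
# duplicated full adders.
def cbl_cell(a, b):
    sum0 = a ^ b          # sum when carry-in = 0 (one XOR gate)
    sum1 = 1 - sum0       # sum when carry-in = 1 (one INV gate: inverse of sum0)
    carry0 = a & b        # carry-out when carry-in = 0 (one AND gate)
    carry1 = a | b        # carry-out when carry-in = 1 (one OR gate)
    return sum0, carry0, sum1, carry1

# Exhaustive check against a plain full adder for both carry-in values.
for a in (0, 1):
    for b in (0, 1):
        s0, c0, s1, c1 = cbl_cell(a, b)
        assert (c0 << 1) | s0 == a + b       # matches full adder with Cin = 0
        assert (c1 << 1) | s1 == a + b + 1   # matches full adder with Cin = 1
```

The assertions confirm the truth-table property the text describes: the Cin=1 sum is always the inverse of the Cin=0 sum.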
TABLE III
TRUTH TABLE OF SINGLE BIT FULL ADDER, WHERE THE UPPER HALF PART IS
THE CASE OF CIN=0 AND THE LOWER HALF PART IS THE CASE OF CIN=1
This method replaces the Binary to Excess-1 Converter (add-one circuit) with
common Boolean logic. Compared with the modified SQRT CSLA, the
proposed structure is slightly faster. The internal structure of the proposed CSLA is shown
in Fig. 8.
Fig. 8 Internal structure of the proposed area-efficient carry select adder is
constructed by sharing the common Boolean logic term.
Fig. 9 16-Bit Proposed SQRT CSLA using Common Boolean Logic.
CHAPTER 4
PROPOSED METHOD
I. INTRODUCTION
Applications that have recently emerged (such as image recognition and synthesis, digital signal
processing, which is computationally demanding, and wearable devices, which require battery
power) have created challenges relative to power consumption. Addition is a fundamental
arithmetic function for these applications [1] [2]. Most of these applications have an inherent
tolerance for insignificant inaccuracies. By exploiting the inherent tolerance feature, approximate
computing can be adopted for a tradeoff between accuracy and power. At present, this tradeoff
plays a significant role in such application domains [3]. As computation quality requirements of
an application may vary significantly at runtime, it is preferable to design quality-configurable
systems that are able to trade off computation quality and computational effort according to
application requirements [4] [5]. The previous proposals for configurability suffer the cost of an
increase in power [5] or in delay [12]. To benefit such applications, a low-power and
high-speed adder with configurable approximation is strongly required. In this paper, we propose
a configurable approximate adder that consumes less power than [5] with a comparable
delay and area. In addition, the delay of the proposed adder is much smaller than that
of [12] with a comparable power consumption. Our primary contribution is that the proposed
adder optimizes power and delay simultaneously, with no bias toward either, while achieving
accuracy configurability. We implemented the proposed adder, the
conventional carry look-ahead adder (CLA), and the ripple carry adder (RCA) in Verilog HDL
using a 45-nm library. Then we evaluated the power consumption, critical path delay, and design
area for each of these implementations. Compared with the conventional CLA, with 1.95% mean
relative error distance (MRED), the proposed adder reduced power consumption and critical path
delay by 42.7% and 56.9%, respectively. We provided a crosswise comparison to demonstrate
the superiority of the proposed adder. Moreover, we implemented two previously studied
configurable adders to evaluate power consumption, critical path delay, design area, and
accuracy. We also evaluated the quality of these accuracy configurable adders in a real image
processing application.
II. RELATED WORK
Gupta et al. [6] discussed how to simplify the complexity of a conventional mirror adder cell at
the transistor level. Mahdiani et al. [7] proposed a lower-part-OR adder, which utilizes OR gates
for addition of the lower bits and precise adders for addition of the upper bits. Venkatesan et al.
[8] proposed to construct an equivalent untimed circuit that represents the behavior of an
approximate circuit. The above static approximate designs [6-8] with fixed accuracy may fail to
meet the quality requirements of applications or result in wastage of power when high quality is
not required. Kahng et al. [4] proposed an accuracy-configurable adder (ACA), which is based
on a pipeline structure. The correction scheme of the ACA proceeds from stage 1 to stage 4; if
the most significant bits of the results are required to be correct, all four stages must be
performed. Motivated by the above, Ye et al. [5] proposed an accuracy gracefully-degrading
adder (GDA), which allows the accurate and approximate sums of its subadders to be selected at
any time. Similar to [5], our adder proposed in this paper does not consider a pipeline structure
either. To generate outputs with different levels of computation accuracy and to obtain the
configurability of accuracy, some multiplexers and additional logic blocks are required in [5].
However, the additional logic blocks require more area. Furthermore, these blocks will cause
power wastage when their outputs are not used to generate a sum. This problem was addressed
in [12] by a low-power configurable adder that generates an approximate sum by using
OR gates. Similar to [12], the proposed adder also uses OR gates to generate an approximate
sum, but [12] focuses on only power consumption and its delay is large. Thus, it may fail to meet
the speed requirement of an application.
III. PROPOSED ACCURACY-CONFIGURABLE ADDER
Typically, a CLA consists of three parts: (1) half adders for carry generation (G) and propagation
(P) signals preparation, (2) carry look-ahead units for carry generation, and (3) XOR gates for
sum generation. We focus on the half adders for G and P signals preparation in part 1.
Consider an n-bit CLA; each part of it can be obtained as follows:
Pi = Ai ⊕ Bi,  Gi = Ai · Bi,  (1)
Ci = Gi + Pi · Ci-1,  (2)
Si = Pi ⊕ Ci-1,  (3)
where i denotes the bit position counted from the least significant bit. Note that, owing to reuse
of the circuit of Ai XOR Bi for Si generation, here Pi is defined as Ai XOR Bi instead of Ai OR
Bi. Because
C0 is equal to G0 , if G0 is 0, C0 will be 0. From (2), we find that C1 is equal to G1 when C0 is
0. In other words, if G0 and G1 are equal to 0, C0 and C1 will be 0. By expanding the above to
bit i, Ci will be 0 when G0, G1, …, Gi are all 0. This means that the carry propagation from C0 to
Ci is masked. From (3), we can obtain that Si is equal to Pi when Ci-1 is 0. From the perspective
of approximate computing, if G is controllable and can be controlled to be 0, the carry
propagation will be masked and S (=P) can be considered as an approximate sum. In other
words, we can obtain the selectivity of S between the accurate and approximate sum if we can
control G to be A AND B or 0. Evidently, we can achieve selectivity by adding a select signal.
Figure 1(a) is a conventional half adder and Fig. 1(b) is a half adder to which the select signal
has been added. Compared with the conventional half adder, we add a signal named “M_X” as
the select signal and use a 3-input AND gate to replace the 2-input one. When M_X = 1, the
function of G is the same as that of a conventional half adder; when M_X = 0, G is equal to 0.
Consider the condition when the inputs Ai and Bi are both 1: when M_Xi = 1, the accurate sum
Si and carry Ci will be 0 and 1 ({Ci , Si} = {1,0}); when M_X0 , M_X1 , … , M_Xi are all 0, Si
is equal to Pi (= Ai XOR Bi = 0) as an approximate sum and Ci is equal to 0 ({Ci , Si} = {0, 0})
as discussed above. Here {,} denotes concatenation. This means that the difference between the
accurate and approximate sum is 2. Toward better accuracy results for the approximate sum, we
use an OR function instead of an XOR function for P generation when M_X = 0. Thus, the
difference will be reduced to 1. A 2-input XOR gate can be implemented by using a 2-input
NAND gate, a 2-input OR gate, and a 2-input AND gate. An equivalent circuit of the
conventional half adder is shown in Fig. 2. This is called a carry-maskable half adder (CMHA).
The dashed frame represents the equivalent circuit of a 2-input XOR (M_X = 1). We can obtain
the following: P is equal to A XOR B, and G is equal to A AND B when M_X = 1; when M_X =
0, P is equal to A OR B and G is 0. Thus, M_X can be considered as a carry mask signal.
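The CMHA behavior can be summarized in a small behavioral sketch. This is Python (the thesis implements the design in Verilog HDL), and the gate decomposition follows the NAND/OR/AND construction of the XOR described above; the function name is illustrative:

```python
# Behavioral sketch of the carry-maskable half adder (CMHA): with m_x = 1 it is
# a normal half adder (P = A XOR B, G = A AND B); with m_x = 0 the carry is
# masked (G = 0) and P degrades to A OR B for a closer approximate sum.
def cmha(a, b, m_x):
    nand_ab = 1 - (a & b)                # 2-input NAND
    or_ab = a | b                        # 2-input OR
    p = or_ab & (nand_ab | (1 - m_x))    # = A XOR B if m_x = 1, A OR B if m_x = 0
    g = a & b & m_x                      # 3-input AND replaces the 2-input one
    return p, g
```

For example, with A = B = 1 the accurate mode gives {G, P} = {1, 0}, while the masked mode gives {0, 1}, so the approximate sum differs from the accurate one by only 1, as the text argues.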
Consider an n-bit CLA, whose half adders for G and P signals preparation are replaced by
CMHAs. In this case, an n-bit carry mask signal (one bit for each CMHA) is required. To simplify the
structure for masking carry propagation, we group four CMHAs and use a 1-bit mask signal to
mask the carry propagation of the CMHAs in each group. The structure of a group with four
CMHAs is shown in Fig. 3 as an example. A3-0, B3-0, P3-0, and G3-0 are 4-bit signals and
represent {A3, A2, A1, A0}, {B3, B2, B1, B0}, {P3, P2, P1, P0}, and {G3, G2, G1, G0},
respectively. M_X0 is a 1-bit signal and is connected to the four CMHAs to mask the carry
propagation simultaneously. When M_X0 = 1, P3-0 = A3-0 XOR B3-0 and G3-0 = A3-0 AND
B3-0; when M_X0 = 0, P3-0 = A3-0 OR B3-0 and G3-0 = 0. We propose an accuracy-
configurable adder by using CMHAs to mask the carry propagation.
The structure of the proposed 16-bit adder is shown in Fig. 4 as an example. Four groups
(CMHA3-0, CMHA7-4, CMHA11-8, and CMHA15-12) are used to prepare the P and G signals.
Each group comprises four CMHAs. There is no mask signal for CMHA15-12 in this example;
therefore, accurate P15-12 (= A15-12 XOR B15-12) and G15-12 (= A15-12 AND B15-12) are
always obtained. P15-0 and G15-0 are the outputs from Part 1 and are connected to Part 2. Note
that P15-0 is also connected to Part 3 for sum generation. In Part 2, four 4-bit carry look-ahead
units (unit 0, 1, 2, 3) generate four PGs (PG0, PG1, PG2, and PG3), four GGs (GG0, GG1, GG2,
and GG3), and 12 carries (C2-0, C6-4, C10-8, and C14-12) first, and then the carry look-ahead
unit 4 generates the remaining four carries (C3, C7, C11, and C15) by using the PGs and GGs.
C15-0 is the output of Part 2 and is connected to Part 3. The fifteen 2-input XOR gates in Part 3
generate the sum.
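The end-to-end behavior of the 16-bit configurable adder can be modeled as follows. This is an assumed Python sketch, not the thesis's Verilog: the carry chain is evaluated as a simple ripple of the recurrence Ci = Gi + Pi·Ci-1, which is behaviorally equivalent to the 4-bit carry look-ahead units of Part 2, and the function and signal names are illustrative:

```python
# Behavioral sketch of the proposed 16-bit accuracy-configurable adder: three
# 1-bit mask signals control CMHA groups 3-0, 7-4, and 11-8; group 15-12 has no
# mask signal and is always accurate.
def config_add16(a, b, masks):
    """a, b: 16-bit operands; masks: (M_X0, M_X1, M_X2) for the three lower groups."""
    p, g = [0] * 16, [0] * 16
    for i in range(16):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        m = 1 if i >= 12 else masks[i // 4]   # top group always accurate
        p[i] = (ai ^ bi) if m else (ai | bi)  # CMHA propagate: XOR or OR
        g[i] = ai & bi & m                    # CMHA generate, masked to 0
    s, c = 0, 0
    for i in range(16):                       # Ci = Gi + Pi*Ci-1, Si = Pi XOR Ci-1
        s |= (p[i] ^ c) << i
        c = g[i] | (p[i] & c)
    return s
```

With all mask bits set to 1 the model reproduces exact 16-bit addition; clearing masks trades accuracy for the shorter, masked carry chain.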
VII. APPLICATIONS
In a complicated datapath system, the multiplier is a much bigger contributor to power
consumption. Our carry-prediction-based approximation uses the generate bit to predict the carry
from lower segments. The critical delay can be restrained to a smaller value with a shorter
critical path in carry propagation. Further extension of our technique to multipliers depends on
the multiplication structure used in the hardware implementation. There is a variety of hardware
designs for multiplication, according to the structure of the reduction tree. In this section, we
apply our technique to three kinds of multiplication structures: the array multiplier, the Wallace
multiplier, and the Dadda multiplier.
Fig. 24. Multiplier: worst case error versus PDP.
As shown in Fig. 22, the basic structure of multiplier employs a three-step process to multiply
two integers.
Step 1) Generate all partial products by using an AND gate array.
Step 2) Combine the partial products in k stages by layers of half/full adders until the matrix
height is reduced to two. The different types of structures depend on the reduction tree used to
reduce the number of partial products in this step.
Step 3) Sum the resulting numbers in the final stage with a conventional adder. In the array
multiplier, the carry bits in one stage are propagated diagonally downward, which follows the
basic shift-and-add multiplication algorithm.
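The three-step flow above can be sketched behaviorally. This is an illustrative Python model, not the thesis's hardware: Step 2's adder layers are abstracted as pairwise accumulation of rows, and the function name is an assumption:

```python
# Sketch of the three-step multiplication flow: partial products from an AND
# array, reduction until two operands remain, then a final conventional add.
def multiply_3step(a, b, n=4):
    # Step 1: AND-gate array of partial products, one shifted row per bit of b.
    rows = [(a if (b >> j) & 1 else 0) << j for j in range(n)]
    # Step 2: reduce the rows pairwise, stage by stage, until only two remain
    # (stands in for the layers of half/full adders in the reduction tree).
    while len(rows) > 2:
        rows = [rows[i] + rows[i + 1] for i in range(0, len(rows) - 1, 2)] + \
               ([rows[-1]] if len(rows) % 2 else [])
    # Step 3: final-stage addition with a conventional adder.
    return sum(rows)
```

In the approximate designs discussed here, the accurate adder in Step 3 is where SARA replaces the CRA.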
The Wallace multiplier, based on the Wallace tree, combines the partial products as early as
possible, which makes it faster than the array multiplier [20]. Dadda's strategy is to make the
combination take place as late as possible, which leads to a simpler reduction tree and a wider
adder in the final stage [20]. Thus, we can design approximate multipliers by using our SARA
design instead of a CRA in
the final stage. Three types of 16 × 16 multipliers (array multiplier, Wallace multiplier, and
Dadda multiplier) as well as behavioral multiplier are synthesized and implemented by using
Nangate 45-nm Open Cell Library. Their error data are obtained from 100-K-run Monte Carlo
simulation with uniform distribution of operands. In approximate multiplier, the final stage uses
SARA4, which consists of subadders with a bit-width of 4 bits, while the accurate one uses
CRA.
Figs. 23 and 24 present the tradeoff between error and PDP. Most approximate multipliers
configured in approximate mode have better PDP than the accurate multipliers. The variance of
error between different approximate modes in the approximate multipliers shows a similar trend
to SARA. Total error increases as more bits are configured in approximate mode. The
approximate array multiplier shows larger error than the approximate Wallace/Dadda multipliers
at the same PDP level. This is because the array multiplier has a larger critical delay from the
internal stages in step 2 than the Wallace/Dadda multipliers. Figs. 25 and 26 show the error
versus EDP for both accurate and approximate multipliers. As more MUXes are set to propagate
the approximate carry, the average error in the output increases to about 10^7, which also
achieves the best EDP. The
worst case error rate of approximate Dadda multiplier is about 30%, while it comes to about 17%
for approximate array multiplier and Wallace multiplier. As shown in Fig. 26, when approximate
multipliers are working in completely accurate mode (error rate equals 0), the EDP is larger than
that of their accurate counterparts. In summary, the experimental results show that our technique
can be successfully extended to high-speed multiplier designs.
Fig. 27. Comparison of image lenna. (a) Accurate adder. (b) SARA4. (c) SARA8. (d) SARA4-
DAR2. (e) GDA. (f) RAP-CLA.
Fig. 28. Comparison of image cameraman. (a) Accurate adder. (b) SARA4. (c) SARA8. (d)
SARA4-DAR2. (e) GDA. (f) RAP-CLA.
Fig. 29. Comparison of image kiel. (a) Accurate adder. (b) SARA4. (c) SARA8. (d) SARA4-
DAR2. (e) GDA. (f) RAP-CLA.
Due to the simple but effective structure of SARA, it provides an easy way to convert a
conventional multiplier into an approximate design.
The DCT has been recognized as the basis of many transform coding methods for image and
video signal processing. It is used to transform the pixel data of an image or video into the
corresponding coefficients in the frequency domain. Since the human visual system is more
sensitive to changes in low frequencies, the loss of accuracy in high-frequency components does
not heavily degrade the quality of an image processed by DCT. In addition, components in
different frequencies have different tolerances to degradation of the original data. Applying our
designs in a VLSI implementation of the DCT computation in JPEG image compression is a
good example to show their reconfigurability.
The 2-D DCT is implemented by the row-column decomposition technique, which contains two
stages of 1-D DCT [21]-[23]. The 2-D DCT of size N × N can be defined as
Z = C^T X C    (24)
where C is a normalized Nth-order matrix and X is the data matrix. Generally, the image is
divided into several N × N blocks and each block is transformed by 2-D DCT into frequency
domain components. The VLSI implementation of DCT computing contains a set of ROM and
accumulator components, which can be implemented by multipliers and adders [21]–[23]. In this
application, we use approximate adders to replace those accurate ones in CRAs to implement an
imprecise, but low-power circuit for image processing that contains DCT computing.
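Eq. (24) and the row-column decomposition can be illustrated numerically. This is a reference sketch, with an assumed orthonormal DCT-II convention for C (the exact convention of [21]-[23] is not stated here); the names are illustrative:

```python
import math

# Numerical sketch of Z = C^T X C via row-column decomposition: two passes of
# 1-D DCT, one along the columns of X and one along the rows of the result.
def dct_matrix(n):
    """Normalized n-th order DCT matrix; column k holds the k-th basis vector."""
    c = [[0.0] * n for _ in range(n)]
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        for i in range(n):
            c[i][k] = scale * math.cos((2 * i + 1) * k * math.pi / (2 * n))
    return c

def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transpose(a):
    return [list(col) for col in zip(*a)]

def dct2(x):
    c = dct_matrix(len(x))
    return matmul(matmul(transpose(c), x), c)   # stage 1: C^T X, stage 2: (...)C
```

In a hardware implementation, the multiplies and adds inside `matmul` are exactly the operations replaced by the approximate multipliers and adders discussed above.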
We replace the adders in the circuits with different configurations of SARA, SARA-DAR, GDA,
as well as RAP-CLA. The results are obtained by numerical simulations on four images (Figs.
27-30) in MATLAB. As we know, after the DCT process, data in different frequency bands have
different levels of error tolerance. As shown in Fig. 31, matrix components in the top-left corner
correspond to lower-frequency coefficients that are sensitive to human vision, while the
components in the bottom-right corner can tolerate more errors. To utilize this feature for a better
energy-accuracy tradeoff, we make the following configurations for the different designs. 1)
SARA4: SARA4 with 4, 3, 2, 1 consecutive segments working in accurate mode is used to
compute components in S1, S2, S3, and S4, respectively.
Fig. 30. Comparison of image house. (a) Accurate adder. (b) SARA4. (c) SARA8. (d) SARA4-
DAR2. (e) GDA. (f) RAP-CLA.
2) SARA8: SARA8 with one segment in accurate mode is used to compute components in S1
and S2, while another configuration with all segments in approximate mode is used for S3 and S4.
3) SARA4-DAR2: the DAR counterpart of SARA4 with a detection window of 2 bits.
4) GDA: GDA4,1, GDA3,1, GDA2,1, and GDA1,1 (same notation as [18]) are used to compute
components in S1, S2, S3, and S4, respectively.
5) RAP-CLA: Since RAP-CLA can work in one approximate mode, we use RAP-CLA with
window sizes of 20, 16, 12, and 8 to compute components in S1, S2, S3, and S4. The image
processing results are shown in Table IV.
The PSNR in the table is defined via the mean squared error (MSE). Given an m × n image I and
its restored image K, the MSE and PSNR are defined as
MSE = (1 / (m·n)) · Σ_{i=1}^{m} Σ_{j=1}^{n} [I(i, j) − K(i, j)]²    (25)
PSNR = 20 · log10(MAX_I) − 10 · log10(MSE)    (26)
where MAX_I is the maximum pixel value of the image. SARA4-DAR2 has the highest PSNR for
every image among all configurable adders, which is close to the quality of the accurate adder.
Comparing SARA8 with GDA, they have similar PSNR and similar delay, but SARA8 has less
power consumption according to the analysis in Section VI. SARA4-DAR2 achieves better
image quality than SARA4, but may consume more power due to the additional logic for self-
configuration. The image quality of the different adders in DCT computing can also be
demonstrated in Figs. 27–30. According to human vision, SARA and its DAR counterpart show
better image quality than GDA and RAP-CLA in JPEG compression processing.
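Eqs. (25) and (26) translate directly into a small quality-metric routine. This is a plain Python sketch (the thesis's evaluation was run in MATLAB); the function name and the 8-bit default for MAX_I are assumptions:

```python
import math

# Sketch of Eqs. (25)-(26): MSE and PSNR between an image I and its restored
# version K. max_i is the maximum pixel value (255 for 8-bit images).
def psnr(img_i, img_k, max_i=255.0):
    m, n = len(img_i), len(img_i[0])
    mse = sum((img_i[r][c] - img_k[r][c]) ** 2
              for r in range(m) for c in range(n)) / (m * n)
    if mse == 0:
        return float("inf")   # identical images: PSNR is unbounded
    return 20 * math.log10(max_i) - 10 * math.log10(mse)
```

A higher PSNR means the approximate-adder output is closer to the exact result, which is how the adders in Table IV are ranked.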
CHAPTER 5
HARDWARE REQUIREMENTS
4.1GENERAL
VLSI stands for "Very Large Scale Integration". This is the field which involves packing
more and more logic devices into smaller and smaller areas. With VLSI, circuits that would have
taken boardfuls of space can now be put into a space a few millimeters across. This has opened
up big opportunities to do things that were not possible before. VLSI circuits are everywhere:
your computer, your car, your brand-new state-of-the-art digital camera, your cell phone, and so
on. All this involves a lot of expertise on many fronts within the same field, which we will
look at in later sections. VLSI has been around for a long time, but as a side effect of advances in
the world of computers, there has been a dramatic proliferation of tools that can be used to
design VLSI circuits. Alongside, obeying Moore's law, the capability of an IC has increased
exponentially over the years in terms of computation power, utilisation of available area, and yield.
The combined effect of these two advances is that people can now put diverse functionality into
the IC's, opening up new frontiers. Examples are embedded systems, where intelligent devices
are put inside everyday objects, and ubiquitous computing where small computing devices
proliferate to such an extent that even the shoes you wear may actually do something useful like
monitoring your heartbeats. Integrated circuit (IC) technology is the enabling technology for a
whole host of innovative devices and systems that have changed the way we live. Jack Kilby
received the 2000 Nobel Prize in Physics for his part in the invention of the integrated circuit;
without the integrated circuit, neither transistors nor computers would be as important as
they are today. VLSI systems are much smaller and consume less power than the discrete
components used to build electronic systems before the 1960s. Integration allows us to build
systems with many more transistors, allowing much more computing power to be applied to
solving a problem. Integrated circuits are also much easier to design and manufacture and are
more reliable than discrete systems; that makes it possible to develop special-purpose systems
that are more efficient than general-purpose computers for the task at hand.
Three Categories:
1. Analog:
Small-transistor-count precision circuits such as amplifiers, data converters, filters, phase-locked
loops, sensors, etc.
2. ASICs (Application-Specific Integrated Circuits):
Progress in the fabrication of IC's has enabled us to create fast and powerful circuits in
smaller and smaller devices. This also means that we can pack a lot more of functionality into the
same area. The biggest application of this ability is found in the design of ASIC's. These are IC's
that are created for specific purposes - each device is created to do a particular job, and do it
well. The most common application area for this is DSP - signal filters, image compression, etc.
To go to extremes, consider the fact that the digital wristwatch normally consists of a single IC
doing all the time-keeping jobs as well as extra features like games,calendar, etc.
3. SoCs (Systems on Chip):
These are highly complex mixed-signal circuits (digital and analog on the same chip).
A network processor chip or a wireless radio chip is an example of an SoC.
Electronic systems now perform a wide variety of tasks in daily life. Electronic systems
in some cases have replaced mechanisms that operated mechanically, hydraulically, or by other
means; electronics are usually smaller, more flexible, and easier to service. In other cases
electronic systems have created totally new applications. Electronic systems perform a variety of
tasks, some of them visible, some more hidden:
Personal entertainment systems such as portable MP3 players and DVD players perform
sophisticated algorithms with remarkably little energy.
Electronic systems in cars operate stereo systems and displays; they also control fuel
injection systems, adjust suspensions to varying terrain, and perform the control functions
required for anti-lock braking (ABS) systems.
Digital electronics compress and decompress video, even at high definition data rates, on-
the-fly in consumer electronics.
Low-cost terminals for Web browsing still require sophisticated electronics, despite their
dedicated function.
Personal computers and workstations provide word-processing, financial analysis, and
games. Computers include both central processing units (CPUs) and special-purpose
hardware for disk access, faster screen display, etc.
Medical electronic systems measure bodily functions and perform complex processing
algorithms to warn about unusual conditions. The availability of these complex systems, far from
overwhelming consumers, only creates demand for even more complex systems. The growing
sophistication of applications continually pushes the design and manufacturing of integrated
circuits and electronic systems to new levels of complexity. Perhaps the most amazing
characteristic of this collection of systems is its variety: as systems become more complex, we
build not a few general-purpose computers but an ever-wider range of special-purpose systems.
Our ability to do so is a testament to our growing mastery of both integrated circuit
manufacturing and design, but the increasing demands of customers continue to test the limits of
design and manufacturing.
While we will concentrate on integrated circuits in this book, the properties of integrated circuits
(what we can and cannot efficiently put in an integrated circuit) largely determine the
architecture of the entire system. Integrated circuits improve system characteristics in several
critical ways. ICs have three key advantages over digital circuits built from discrete components:
• Size. Integrated circuits are much smaller—both transistors and wires are shrunk to micrometer
sizes, compared to the millimeter or centimeter scales of discrete components. Small size leads to
advantages in speed and power consumption, since smaller components have smaller parasitic
resistances, capacitances, and inductances.
• Speed. Signals can be switched between logic 0 and logic 1 much quicker within a chip than
they can between chips. Communication within a chip can occur hundreds of times faster than
communication between chips on a printed circuit board. The high speed of circuits on-chip is
due to their small size—smaller components and wires have smaller parasitic capacitances to
slow down the signal.
• Power consumption. Logic operations within a chip also take much less power. Once again,
lower power consumption is largely due to the small size of circuits on the chip—smaller
parasitic capacitances and resistances require less power to drive them.
These advantages of integrated circuits translate into advantages at the system level:
• Lower power consumption. Replacing a handful of standard parts with a single chip reduces
total power consumption. Reducing power consumption has a ripple effect on the rest of the
system: a smaller, cheaper power supply can be used; since less power consumption means less
heat, a fan may no longer be necessary; a simpler cabinet with less shielding for electromagnetic
shielding may be feasible, too.
• Reduced cost. Reducing the number of components, the power supply requirements, cabinet
costs, and so on, will inevitably reduce system cost. The ripple effect of integration is such that
the cost of a system built from custom ICs can be less, even though the individual ICs cost more
than the standard parts they replace. Understanding why integrated circuit technology has such
profound influence on the design of digital systems requires understanding both the technology
of IC manufacturing and the economics of ICs and digital systems.
FPGA, which carries digital ones and zeros on its internal programmable interconnect fabric, and
field-programmable analog array (FPAA), which carries analog values on its internal
programmable interconnect fabric.
The FPGA industry sprouted from programmable read-only memory (PROM) and
programmable logic devices (PLDs). PROMs and PLDs both had the option of being
programmed in batches in a factory or in the field (field programmable), however programmable
logic was hard-wired between logic gates. In the late 1980s the Naval Surface Warfare
Department funded an experiment proposed by Steve Casselman to develop a computer that
would implement 600,000 reprogrammable gates. Casselman was successful and a patent related
to the system was issued in 1992. Some of the industry’s foundational concepts and technologies
for programmable logic arrays, gates, and logic blocks are founded in patents awarded to David
W. Page and LuVerne R. Peterson in 1985. Xilinx co-founders Ross Freeman and Bernard
Vonderschmitt invented the first commercially viable field-programmable gate array in 1985:
the XC2064. The XC2064 had programmable gates and programmable interconnects between
gates, the beginnings of a new technology and market. The XC2064 boasted a mere 64
configurable logic blocks (CLBs), with two 3-input lookup tables (LUTs). More than 20 years
later, Freeman was entered into the National Inventors Hall of Fame for his invention. Xilinx
continued unchallenged, growing quickly from 1985 to the mid-1990s, when competitors
sprouted up, eroding significant market share. By 1993, Actel was serving about 18 percent of
the market. The 1990s were an explosive period of time for FPGAs, both in sophistication and
the volume of production. In the early 1990s, FPGAs were primarily used in telecommunications
and networking. By the end of the decade, FPGAs found their way into consumer, automotive,
and industrial applications. FPGAs got a glimpse of fame in 1997, when Adrian Thompson, a
researcher working at the University of Sussex, merged genetic algorithm technology and
FPGAs to create a sound recognition device. Thompson's algorithm configured an array of 10 ×
10 cells in a Xilinx FPGA chip to discriminate between two tones, utilising analogue features of
the digital chip. The application of genetic algorithms to the configuration of devices like FPGAs
is now referred to as evolvable hardware.
4.5.2 Modern developments
A recent trend has been to take the coarse-grained architectural approach a step further by
combining the logic blocks and interconnects of traditional FPGAs with embedded
microprocessors and related peripherals to form a complete "system on a programmable chip".
This work mirrors the architecture by Ron Perlof and Hana Potash of Burroughs Advanced
Systems Group which combined a reconfigurable CPU architecture on a single chip called the
SB24. That work was done in 1982. Examples of such hybrid technologies can be found in the
Xilinx Virtex-II PRO and Virtex-4 devices, which include one or more PowerPC processors
embedded within the FPGA's logic fabric. The Atmel FPSLIC is another such device, which uses
an AVR processor in combination with Atmel's programmable logic architecture. The Actel
SmartFusion devices incorporate an ARM architecture Cortex-M3 hard processor core (with up
to 512 kB of flash and 64 kB of RAM) and analog peripherals such as multi-channel ADCs and
DACs into their flash-based FPGA fabric. In 2010, an extensible processing platform was
introduced for FPGAs that fused features of an ARM high-end microcontroller (hard-core
implementations of a 32-bit processor, memory, and I/O) with an FPGA fabric to make FPGAs
easier for embedded designers to use. By incorporating the ARM processor-based platform into a
28 nm FPGA family, the extensible processing platform enables system architects and embedded
software developers to apply a combination of serial and parallel processing to address the
challenges they face in designing today's embedded systems, which must meet ever-growing
demands to perform highly complex functions. By allowing them to design in a familiar ARM
environment, embedded designers can benefit from the time-to-market advantages of an FPGA
platform compared to more traditional design cycles associated with ASICs. An alternate
approach to using hard-macro processors is to make use of soft processor cores that are
implemented within the FPGA logic. MicroBlaze and Nios II are examples of popular soft-core
processors. As previously mentioned, many modern FPGAs can be reprogrammed at "run time,"
and this is leading to the idea of reconfigurable computing or reconfigurable systems: CPUs that
reconfigure themselves to suit the task at hand. Additionally, new non-FPGA architectures are
beginning to emerge. Software-configurable microprocessors such as the
Stretch S5000 adopt a hybrid approach by providing an array of processor cores and FPGA-like
programmable cores on the same chip.
4.5.3 FPGA COMPARISONS
Historically, FPGAs have been slower, less energy efficient and generally achieved less
functionality than their fixed ASIC counterparts. A study has shown that designs implemented on
FPGAs need on average 18 times as much area, draw 7 times as much dynamic power, and are 3
times slower than the corresponding ASIC implementations. (Figure: an Altera Cyclone II FPGA
on a terasIC DE1 prototyping board.) Advantages of FPGAs include the ability to re-program in
the field to fix bugs, and may include a shorter time to market and lower non-recurring
engineering costs. Vendors can also take a middle road by developing their hardware on ordinary
FPGAs, but manufacture their final version so it can no longer be modified after the design has
been committed.
Xilinx claims that several market and technology dynamics are changing the ASIC/FPGA
paradigm:
These trends make FPGAs a better alternative than ASICs for a larger number of higher-
volume applications than they have been historically used for, to which the company attributes
the growing number of FPGA design starts (see History). Some FPGAs have the capability of
partial re-configuration that lets one portion of the device be re-programmed while other portions
continue running.
The primary differences between CPLDs (Complex Programmable Logic Devices) and
FPGAs are architectural. A CPLD has a somewhat restrictive structure consisting of one or more
programmable sum-of-products logic arrays feeding a relatively small number of clocked
registers. The result of this is less flexibility, with the advantage of more predictable timing
delays and a higher logic-to-interconnect ratio. The FPGA architectures, on the other hand, are
dominated by interconnect. This makes them far more flexible (in terms of the range of designs
that are practical for implementation within them) but also far more complex to design for. In practice, the distinction between FPGAs and CPLDs is often one of size, as FPGAs are usually much larger in terms of resources than CPLDs. Typically only FPGAs contain more advanced embedded functions such as adders, multipliers, memory, SerDes blocks, and other hardened functions.
Another common distinction is that CPLDs contain embedded flash to store their configuration
while FPGAs usually, but not always, require an external flash or other device to store their
configuration.
4.7 APPLICATIONS
are intermediate between ASICs and industry-standard integrated circuits like the 7400 or the 4000 series. As feature sizes have shrunk and design tools have improved over the years, the maximum
complexity (and hence functionality) possible in an ASIC has grown from 5,000 gates to over
100 million. Modern ASICs often include entire 32-bit processors, memory blocks including
ROM, RAM, EEPROM, Flash and other large building blocks. Such an ASIC is often termed a
SoC (system-on-chip). Designers of digital ASICs use a hardware description language (HDL),
such as Verilog or VHDL, to describe the functionality of ASICs. Field-programmable gate
arrays (FPGA) are the modern-day technology for building a breadboard or prototype from
standard parts; programmable logic blocks and programmable interconnects allow the same
FPGA to be used in many different applications. For smaller designs and/or lower production
volumes, FPGAs may be more cost effective than an ASIC design even in production. The non-
recurring engineering (NRE) cost of an ASIC can run into the millions of dollars.
The initial ASICs used gate array technology. Ferranti produced perhaps the first gate-
array, the ULA (Uncommitted Logic Array), around 1980. An early successful commercial
application was the ULA circuitry found in the 8-bit ZX81 and ZX Spectrum low-end personal
computers, introduced in 1981 and 1982. These were used by Sinclair Research (UK) essentially
as a low-cost I/O solution aimed at handling the computer's graphics. Some versions of
ZX81/Timex Sinclair 1000 used just four chips (ULA, 2Kx8 RAM, 8Kx8 ROM, Z80A CPU) to
implement an entire mass-market personal computer with built-in BASIC interpreter.
Customization occurred by varying the metal interconnect mask. ULAs had complexities of up to
a few thousand gates. Later versions became more generalized, with different base dies
customised by both metal and polysilicon layers. Some base dies include RAM elements.
In the mid 1980s, a designer would choose an ASIC manufacturer and implement their
design using the design tools available from the manufacturer. While third-party design tools
were available, there was not an effective link from the third-party design tools to the layout and
actual semiconductor process performance characteristics of the various ASIC manufacturers.
Most designers ended up using factory-specific tools to complete the implementation of their
designs. A solution to this problem, which also yielded a much higher density device, was the
implementation of standard cells. Every ASIC manufacturer could create functional blocks with
known electrical characteristics, such as propagation delay, capacitance and inductance, that
could also be represented in third-party tools. Standard-cell design is the utilization of these
functional blocks to achieve very high gate density and good electrical performance. Standard-
cell design fits between Gate Array and Full Custom design in terms of both its non-recurring
engineering and recurring component cost. By the late 1990s, logic synthesis tools became
available. Such tools could compile HDL descriptions into a gate-level netlist. Standard-cell
Integrated Circuits (ICs) are designed in the following conceptual stages, although these stages
overlap significantly in practice.
The routing tool takes the physical placement of the standard cells and uses the netlist to
create the electrical connections between them. Since the search space is large, this
process will produce a “sufficient” rather than “globally optimal” solution. The output is
a file which can be used to create a set of photomasks enabling a semiconductor
fabrication facility (commonly called a 'fab') to produce physical ICs.
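Because exhaustive search is infeasible, placement and routing tools rely on heuristics. The Python sketch below is purely illustrative (it is not how any commercial router works): it places named cells on a grid by simulated annealing, minimizing half-perimeter wirelength, and shows why the result is "sufficient" rather than globally optimal.

```python
import math
import random

def wirelength(placement, nets):
    """Half-perimeter wirelength: for each net, the size of the
    bounding box that encloses all of its cells."""
    total = 0
    for net in nets:
        xs = [placement[c][0] for c in net]
        ys = [placement[c][1] for c in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

def anneal_placement(cells, nets, grid, iters=2000, t0=5.0, seed=0):
    """Simulated-annealing placement: swap two cells, keep the swap if
    it helps, or with a temperature-dependent probability if it hurts."""
    rng = random.Random(seed)
    sites = [(x, y) for x in range(grid) for y in range(grid)]
    placement = dict(zip(cells, rng.sample(sites, len(cells))))
    cost = wirelength(placement, nets)
    for i in range(iters):
        t = t0 * (1 - i / iters) + 1e-9   # cooling schedule
        a, b = rng.sample(cells, 2)
        placement[a], placement[b] = placement[b], placement[a]
        new_cost = wirelength(placement, nets)
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / t):
            cost = new_cost               # accept the move
        else:
            placement[a], placement[b] = placement[b], placement[a]  # undo
    return placement, cost
```

Early in the run, high temperature lets the tool accept bad moves to escape local minima; as it cools, only improvements survive, converging on a good (not provably optimal) layout.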
Given the final layout, circuit extraction computes the parasitic resistances and
capacitances. In the case of a digital circuit, this will then be further mapped into delay
information, from which the circuit performance can be estimated, usually by static
timing analysis. This, and other final tests such as design rule checking and power
analysis (collectively called signoff) are intended to ensure that the device will function
correctly over all extremes of the process, voltage and temperature. When this testing is
complete the photomask information is released for chip fabrication.
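The longest-path core of static timing analysis can be sketched in a few lines of Python. This is a deliberate simplification: real STA also models setup/hold checks, clock networks, and the extracted RC delays discussed above, whereas here each gate simply has one lumped delay.

```python
from collections import defaultdict

def static_timing(gates, wires):
    """Critical-path (longest-path) delay through a combinational
    netlist.  `gates` maps gate name -> propagation delay; `wires` is
    a list of (driver, load) pairs.  Returns the arrival time at each
    gate and the worst-case delay over the whole circuit."""
    fanin = defaultdict(list)
    for driver, load in wires:
        fanin[load].append(driver)
    arrival = {}
    def arrive(g):
        # arrival = own delay + latest arrival among fan-in drivers
        if g not in arrival:
            arrival[g] = gates[g] + max((arrive(d) for d in fanin[g]),
                                        default=0)
        return arrival[g]
    for g in gates:
        arrive(g)
    return arrival, max(arrival.values())
```

The worst-case delay this computes is what signoff compares against the clock period across process, voltage, and temperature corners.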
These steps, implemented with a level of skill common in the industry, almost always
produce a final device that correctly implements the original design, unless flaws are later
introduced by the physical fabrication process. The design steps (or flow) are also common to
standard product design. The significant difference is that standard-cell design uses the
manufacturer's cell libraries that have been used in potentially hundreds of other design
implementations and therefore are of much lower risk than full custom design. Standard cells
produce a design density that is cost effective, and they can also integrate IP cores and SRAM
(Static Random Access Memory) effectively, unlike Gate Arrays.
Gate-array design is a manufacturing method in which the diffused layers, i.e. transistors
and other active devices, are predefined and wafers containing such devices are held in stock
prior to metallization; in other words, they are unconnected. The physical design process then defines the
interconnections of the final device. For most ASIC manufacturers, this consists of from two to
as many as nine metal layers, each metal layer running perpendicular to the one below it. Non-
recurring engineering costs are much lower, as photolithographic masks are required only for the
metal layers, and production cycles are much shorter, as metallization is a comparatively quick
process. Gate-array ASICs are always a compromise as mapping a given design onto what a
manufacturer held as a stock wafer never gives 100% utilization. Often difficulties in routing the
interconnect require migration onto a larger array device with consequent increase in the piece
part price. These difficulties are often a result of the layout software used to develop the
interconnect. Pure, logic-only gate-array design is rarely implemented by circuit designers today,
having been replaced almost entirely by field-programmable devices, such as field-
programmable gate arrays (FPGAs), which can be programmed by the user and thus offer
minimal tooling charges non-recurring engineering, only marginally increased piece part cost,
and comparable performance. Today, gate arrays are evolving into structured ASICs that consist
of a large IP core like a CPU, DSP unit, peripherals, standard interfaces, integrated memories
SRAM, and a block of reconfigurable, uncommitted logic. This shift is largely because ASIC
devices are capable of integrating such large blocks of system functionality and "system-on-a-
chip" requires far more than just logic blocks. In their frequent usages in the field, the terms
"gate array" and "semi-custom" are synonymous. Process engineers more commonly use the
term "semi-custom", while "gate-array" is more commonly used by logic (or gate-level)
designers.
By contrast, full-custom ASIC design defines all the photolithographic layers of the
device. Full-custom design is used for both ASIC design and for standard product design. The
benefits of full-custom design usually include reduced area (and therefore recurring component
cost), performance improvements, and also the ability to integrate analog components and other
pre-designed and thus fully verified components, such as microprocessor cores that form a
system-on-chip.
The disadvantages of full-custom design can include increased manufacturing and design
time, increased non-recurring engineering costs, more complexity in the computer-aided design
(CAD) system, and a much higher skill requirement on the part of the design team.
For digital-only designs, however, "standard-cell" cell libraries, together with modern
CAD systems, can offer considerable performance/cost benefits with low risk. Automated layout
tools are quick and easy to use and also offer the possibility to "hand-tweak" or manually
optimize any performance-limiting aspect of the design.
Structured design
Structured ASIC design (also referred to as "platform ASIC design"), is a relatively new
term in the industry, resulting in some variation in its definition. However, the basic premise of a
structured ASIC is that both manufacturing cycle time and design cycle time are reduced
compared to cell-based ASIC, by virtue of there being pre-defined metal layers (thus reducing
manufacturing time) and pre-characterization of what is on the silicon (thus reducing design
cycle time). One definition states that in a "structured ASIC" design, the logic mask-layers of a
device are predefined by the ASIC vendor (or in some cases by a third party). Design
differentiation and customization is achieved by creating custom metal layers that create custom
connections between predefined lower-layer logic elements. "Structured ASIC" technology is
seen as bridging the gap between field-programmable gate arrays and "standard-cell" ASIC
designs. Because only a small number of chip layers must be custom-produced, "structured
ASIC" designs have much smaller non-recurring expenditures (NRE) than "standard-cell" or
"full-custom" chips, which require that a full mask set be produced for every design. This is
effectively the same definition as a gate array. What makes a structured ASIC different is that in
a gate array, the predefined metal layers serve to make manufacturing turnaround faster. In a
structured ASIC, the use of predefined metallization is primarily to reduce cost of the mask sets
as well as making the design cycle time significantly shorter. For example, in a cell-based or
gate-array design the user must often design power, clock, and test structures themselves; these
are predefined in most structured ASICs and therefore can save time and expense for the
designer compared to gate-array. Likewise, the design tools used for structured ASIC can be
substantially lower cost and easier (faster) to use than cell-based tools, because they do not have
to perform all the functions that cell-based tools do.
4.10 SOFTWARE TOOLS
Verification tool: ModelSim 6.4c
Synthesis tool: Xilinx ISE 9.1
MODELSIM:
ModelSim is a useful tool that allows you to stimulate the inputs of your modules and view both outputs and internal signals. It supports both behavioural and timing simulation; however, this document focuses on behavioural simulation. Keep in mind that these simulations are based on models, and thus the results are only as accurate as the constituent
models. ModelSim/VHDL, ModelSim/VLOG, ModelSim/LNL, and ModelSim/PLUS are produced by Model Technology Incorporated, a Mentor Graphics company.
STANDARDS SUPPORTED
ModelSim VHDL supports the IEEE 1076-1987 and 1076-1993 VHDL standards, the IEEE 1164-1993 Standard Multivalue Logic System for VHDL Interoperability, and the IEEE 1076.2-1996 Standard VHDL Mathematical Packages. Any design developed with ModelSim will
be compatible with any other VHDL system that is compliant with either IEEE Standard 1076-
1987 or 1076-1993. ModelSim Verilog is based on IEEE Std 1364-1995 and a partial
implementation of 1364-2001, Standard Hardware Description Language Based on the Verilog
Hardware Description Language. The Open Verilog International Verilog LRM version 2.0 is
also applicable to a large extent. Both PLI (Programming Language Interface) and VCD (Value
Change Dump) are supported for ModelSim PE and SE users.
Mentor Graphics was the first to combine single kernel simulator (SKS) technology with
a unified debug environment for Verilog, VHDL, and SystemC. The combination of industry-
leading, native SKS performance with the best integrated debug and analysis environment make
ModelSim the simulator of choice for both ASIC and FPGA designs. The best standards and
platform support in the industry make it easy to adopt in the majority of process and tool flows.
ModelSim combines simulation performance and capacity with the code coverage and
debugging capabilities required to simulate multiple blocks and systems and attain ASIC gate-
level sign-off. Comprehensive support of Verilog, SystemVerilog for Design, VHDL, and
SystemC provide a solid foundation for single and multi-language design verification
environments. ModelSim’s easy-to-use, unified debug and simulation environment provides today’s FPGA designers with both the advanced capabilities they increasingly need and an environment that makes their work productive.
The ModelSim debug environment’s broad set of intuitive capabilities for Verilog,
VHDL, and SystemC make it the choice for ASIC and FPGA design. ModelSim eases the
process of finding design defects with an intelligently engineered debug environment. The
ModelSim debug environment efficiently displays design data for analysis and debug of all
languages. ModelSim allows many debug and analysis capabilities to be employed post-
simulation on saved results, as well as during live simulation runs. For example, the coverage
viewer analyzes and annotates source code with code coverage results, including FSM state and
transition, statement, expression, branch, and toggle coverage. Signal values can be annotated in the source window and viewed in the waveform viewer, with hyperlinked navigation between objects and their declarations and between visited files easing debug navigation. Race conditions, delta cycles, and event activity can be analyzed in the list and wave windows. User-defined
enumeration values can be easily defined for quicker understanding of simulation results. For
improved debug productivity, ModelSim also has graphical and textual dataflow capabilities.
SYNTHESIS TOOL
Xilinx ISE Simulator is a test bench and test fixture creation tool integrated into the Project Navigator framework. It includes a Waveform Editor, which can be used to graphically enter stimuli and the expected responses and then generate a VHDL test bench or Verilog test fixture. ISE controls all aspects of the design flow. Through the Project Navigator interface, we can access all of the design entry and design implementation tools, as well as the files and documents associated with the project.
To start ISE: double-click the ISE Project Navigator icon on the desktop, or select Start > All Programs > Xilinx ISE 10.1i > Project Navigator.
6. Click Next.
7. Declare the ports for the counter design by filling in the port information as shown below:
8. Click Next, and then Finish in the New Source Information dialog box to complete the new source file template. Click Next, then Next, then Finish.
SIMULATION
The design is now composed of Verilog elements and two cores. We can now synthesize the design using the Xilinx ISE simulator.
After the design is successfully defined, you will perform behavioral simulation, run
implementation with the Xilinx Implementation Tools, perform timing simulation, and configure
and download to the Spartan-3 FPGA board.
market quickly regardless of cost. Later an ASIC can be used in place of the FPGA when the
production volume increases, in
order to reduce cost.
Configurable Logic Blocks (CLBs) contain the logic of the FPGA. In a coarse-grained architecture, these CLBs contain enough logic to create a small state machine. In a fine-grained architecture, more like a true gate array ASIC, the CLB contains only very basic logic [26]. The block in Figure 5.2 would be considered coarse-grained. It contains RAM for creating arbitrary combinatorial logic functions. It also contains flip-flops for clocked storage elements, and
multiplexers in order to route the logic within the block and to and from external resources. The
multiplexers also allow polarity selection and reset and clear input selection.
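The configuration RAM used for arbitrary combinatorial logic is a lookup table (LUT). A minimal Python model (a conceptual sketch, not any vendor's actual primitive) shows how 2^k stored bits realize any k-input Boolean function: the inputs form a RAM address, and the bit at that address is the output.

```python
class LUT:
    """A k-input lookup table modeled as 2**k bits of configuration
    RAM: the inputs address the RAM, and the stored bit at that
    address is the output."""
    def __init__(self, k, truth_bits):
        assert len(truth_bits) == 2 ** k
        self.k = k
        self.ram = list(truth_bits)

    @classmethod
    def program(cls, k, fn):
        # "Configure" the LUT by evaluating fn on every input combination.
        bits = []
        for addr in range(2 ** k):
            ins = [(addr >> i) & 1 for i in range(k)]
            bits.append(fn(*ins) & 1)
        return cls(k, bits)

    def __call__(self, *ins):
        addr = sum(bit << i for i, bit in enumerate(ins))
        return self.ram[addr]
```

Reprogramming the FPGA amounts to rewriting these RAM bits, which is why the same fabric can implement entirely different circuits.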
Figure 5.3: FPGA Configurable I/O Block
5.2.3 Programmable Interconnect
The interconnect of an FPGA is very different from that of a CPLD, but is rather similar to that of a gate array ASIC. In Figure 5.4, a hierarchy of interconnect resources can be seen. There are
long lines which can be used to connect critical CLBs that are physically far from each other on
the chip without inducing much delay. They can also be used as buses within the chip. There are
also short lines which are used to connect individual CLBs which are located physically close to
each other. There is often one or several switch matrices, like that in a CPLD, to connect these
long and short lines together in specific ways. Programmable switches inside the chip allow the
connection of CLBs to interconnect lines and interconnect lines to each other and to the switch
matrix. Three-state buffers are used to connect many CLBs to a long line, creating a bus. Special
long lines, called global clock lines, are specially designed for low impedance and thus fast
propagation times. These are connected to the clock buffers and to each clocked element in each
CLB. This is how the clocks are distributed throughout the FPGA.
Figure 5.4: FPGA Programmable Interconnect
IMPLEMENTATION
In this part of the tutorial we give a short introduction to the FPGA design flow. A simplified version of the design flow is given in the following diagram.
There are different techniques for design entry: schematic-based, hardware description language (HDL) based, a combination of both, and so on. Selection of a method depends on the design and the designer. If the designer wants to deal more with the hardware, then schematic entry is the better choice. When the design is complex, or the designer thinks about the design in an algorithmic way, then HDL is the better choice. Language-based entry is faster but lags in performance and density.
HDLs represent a level of abstraction that can isolate the designers from the details of the
hardware implementation. Schematic based entry gives designers much more visibility into the
hardware. It is the better choice for those who are hardware oriented. Another, rarely used, method is state-machine entry. It is the better choice for designers who think of the design as a series of states, but the tools for state-machine entry are limited. In this documentation we deal with HDL-based design entry.
5.2.2 Synthesis
Synthesis is the process that translates VHDL/Verilog code into a device netlist format, i.e., a complete circuit with logical elements (gates, flip-flops, etc.) for the design. If the design contains more than one sub-design (for example, to implement a processor we need a CPU as one design element, RAM as another, and so on), then the synthesis process generates a netlist for each design element. The synthesis process also checks code syntax and analyzes the hierarchy of the design, which ensures that the design is optimized for the architecture the designer has selected. The resulting netlist(s) are saved to an NGC (Native Generic Circuit) file (for Xilinx® Synthesis Technology (XST)).
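As a conceptual illustration only (far simpler than what XST actually does), synthesis can be pictured as flattening an expression tree, such as one parsed from HDL, into a netlist of primitive gates:

```python
import itertools

def synthesize(expr):
    """Toy 'synthesis' sketch: flatten a nested (op, a, b) expression
    tree into a netlist of 2-input gates.  Leaves are signal-name
    strings (primary inputs); internal wires are named n0, n1, ..."""
    counter = itertools.count()
    netlist = []
    def emit(node):
        if isinstance(node, str):
            return node                # primary input, no gate needed
        op, a, b = node
        in_a, in_b = emit(a), emit(b)  # synthesize operands first
        out = f"n{next(counter)}"
        netlist.append((op, in_a, in_b, out))
        return out
    top = emit(expr)
    return netlist, top
```

For example, `(a and b) or c` lowers to an AND gate feeding an OR gate, with the OR gate's output as the top-level signal; each tuple in the netlist corresponds to one primitive cell.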
5.2.3 Implementation
Translate
Map
Place and Route
Translate:
The Translate process combines all the input netlists and constraints into a logic design file. This information is saved as an NGD (Native Generic Database) file; this can be done using the NGDBuild program. Here, defining constraints means assigning the ports in the design to the physical elements (e.g., pins, switches, buttons) of the targeted device and specifying the timing requirements of the design. This information is stored in a file named the UCF (User Constraints File). Tools used to create or modify the UCF include PACE, the Constraint Editor, etc.
Map:
The Map process divides the whole circuit of logical elements into sub-blocks so that they fit into the FPGA logic blocks. That is, the map process fits the logic defined by the NGD file into the targeted FPGA elements (Configurable Logic Blocks (CLBs) and Input/Output Blocks (IOBs)) and generates an NCD (Native Circuit Description) file, which physically represents the design mapped to the components of the FPGA. The MAP program is used for this purpose.
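The packing idea behind mapping can be sketched with a greedy first-fit heuristic (a toy model, much simpler than the actual MAP program): put each LUT into a CLB that already holds one of its connected neighbours, to keep related logic together.

```python
def map_to_clbs(luts, nets, capacity=4):
    """Greedy packing of LUTs into CLBs of fixed capacity.  A LUT goes
    into the first non-full CLB that already contains one of its
    net-neighbours; failing that, any non-full CLB; failing that, a
    new CLB is opened."""
    neighbours = {l: set() for l in luts}
    for net in nets:
        for a in net:
            neighbours[a].update(b for b in net if b != a)
    clbs = []
    for lut in luts:
        target = next((c for c in clbs
                       if len(c) < capacity and c & neighbours[lut]), None)
        if target is None:
            target = next((c for c in clbs if len(c) < capacity), None)
        if target is None:
            target = set()          # open a fresh CLB
            clbs.append(target)
        target.add(lut)
    return clbs
```

Keeping connected LUTs in the same CLB shortens the interconnect the subsequent place-and-route step must use, which is the same motivation the real tool follows with far more sophisticated algorithms.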
Figure 5.6 FPGA map
The PAR (Place and Route) program is used for this process. The place and route process places the sub-blocks from the map process into logic blocks according to the constraints and connects the logic blocks. For example, if a sub-block is placed in a logic block very near an I/O pin, it may save time but may violate some other constraint, so the place and route process takes the trade-off between all the constraints into account.
The PAR tool takes the mapped NCD file as input and produces a completely routed
NCD file as output. The output NCD file consists of the routing information.
Chapter 6
SIMULATION RESULTS
CONCLUSION
In this work, an accuracy-configurable adder that does not incur additional power or delay for its configurability was proposed. The proposed adder is based on the conventional carry look-ahead adder (CLA), and its accuracy configurability is realized by masking the carry propagation at runtime. The experimental results demonstrate that the proposed adder delivers significant power savings and speedup, with only a small area overhead, compared with the conventional CLA. Furthermore, compared with previously studied configurable adders, the experimental results demonstrate that the proposed adder achieves its original purpose of delivering an unbiased optimization between power and delay without sacrificing accuracy. It was also found that the quality requirements of the evaluated application were not compromised.
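The carry-masking idea can be sketched at the bit level in Python. This models only the masking mechanism, not the CLA circuit itself, and the block size and threshold parameters here are illustrative assumptions: any carry that would cross a block boundary in the low-order (approximate) region is forced to zero, cutting the carry chain and hence the critical path, at the cost of a bounded arithmetic error.

```python
def configurable_add(a, b, n=16, block=4, approx_below=8):
    """Bit-level sketch of an accuracy-configurable adder: carries are
    computed per bit, but any carry propagating across a block
    boundary below bit `approx_below` is masked to 0 (approximate
    mode).  Setting approx_below=0 restores exact addition."""
    carry = 0
    result = 0
    for i in range(n):
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        s = ai ^ bi ^ carry
        carry = (ai & bi) | (carry & (ai ^ bi))
        # mask the carry at block boundaries inside the approximate region
        if (i + 1) % block == 0 and (i + 1) <= approx_below:
            carry = 0
        result |= s << i
    return result
```

Because masking is a runtime decision, the same hardware can switch between exact and approximate modes; in the approximate configuration the carry chain is broken into short independent segments, which is the source of the speed and power gains.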
FUTURE SCOPE
REFERENCES
[1] Kuldeep Rawat, Tarek Darwish and Magdy Bayoumi, “A low power and reduced
area Carry Select Adder”, 45th Midwest Symposium on Circuits and Systems, vol.1, pp.
467-470, March 2002.
[2] Y. Kim and L.-S. Kim, “64-bit carry-select adder with reduced area”, Electron. Lett., vol. 37, no. 10, pp. 614-615, May 2001.
[3] J. M. Rabaey, Digital Integrated Circuits: A Design Perspective. Upper Saddle River, NJ: Prentice-Hall, 2001.
[4] Cadence, “Encounter User Guide”, Version 6.2.4, March 2008.
[5] R. Priya and J. Senthil Kumar, “Enhanced area efficient architecture for 128 bit Modified CSLA”, International Conference on Circuits, Power and Computing Technologies, 2013.
[6] Shivani Parmar and Kirat Pal Singh, “Design of high speed hybrid carry select adder”, IEEE, 2012.
[7] I-Chyn Wey, Cheng-Chen Ho, Yi-Sheng Lin, and Chien-Chang Peng, “An Area-Efficient Carry Select Adder Design by Sharing the Common Boolean Logic Term”, Proceedings of the International MultiConference of Engineers and Computer Scientists 2012, Vol. II, IMECS 2012, Hong Kong, March 14-16, 2012.
[8] B. Ramkumar and Harish M. Kittur, “Low-Power and Area-Efficient Carry Select Adder”, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 20, no. 2, February 2012.
[8] S. Manju and V. Sornagopal, “An Efficient SQRT Architecture of Carry Select Adder Design by Common Boolean Logic”, IEEE, 2013.
[9] Youngjoon Kim and Lee-Sup Kim, “64-bit carry-select adder with reduced area”,
Electronics Letters, vol.37, issue 10, pp.614-615, May 2001.
[10] Yajuan He, Chip-Hong Chang and Jiangmin Gu, “An area efficient 64-bit square root Carry-Select Adder for low power applications”, IEEE International Symposium on Circuits and Systems, vol. 4, pp. 4082-4085, May 2005.
[11] Youngjoon Kim and Lee-Sup Kim, “A low power carry select adder with reduced area”,
IEEE International Symposium on Circuits and Systems, vol.4, pp.218-221, May 2001.
[12] Hiroyuki Morinaka, Hiroshi Makino, Yasunobu Nakase, Hiroaki Suzuki and Koichiro Mashiko, “A 64 bit Carry Look-ahead CMOS Adder using Modified Carry Select”, Proceedings of the IEEE Custom Integrated Circuits Conference, pp. 585-588, May 1995.