VLSI Architecture :: MEL G642
Dr. A. Amalin Prince
BITS Pilani K.K. Birla Goa Campus
Department of Electrical and Electronics Engineering
MEL G642
Contents
Numerical representations
Finite length data
Signal processing under finite data precision
Special cases
MEL G642
Numerical Representation
MEL G642
Numerical Representation
Generic SoC Architecture
Snapdragon SoC Qualcomm
MEL G642
General
General
– Requirement on low silicon costs drives custom data types in
DSP applications.
– Word–16b; double word–32b;
– Custom word length = bus width = native
Two’s complement
– From HW point of view, the addition and subtraction can be
performed independent of the sign of the operand.
Fixed point two’s complement in our course
MEL G642
Fixed point
MEL G642
Fixed point numerical representation
What is fixed point
– Integer or fractional representation, Dominate in DSP.
– Integer: between -2n-1 and +2n-1-1 or [-2n-1, 2n-1 -1]
– Fractional: between -1 and 1-2-n+1 or [-1, 1-2-n+1]
Why fixed point (Important)
– Easy to implement data path HW
– Short physical critical path
– Low hardware (memory) costs, low power
– But, it must be the acceptable precision
Why not fixed point (sometimes)
– Cannot separate dynamic range and precision
– High firmware design costs
MEL G642
Integer numerical representation
The integer representation
-an-12n-1 + an-22n-2+...... + a121+ a0
the range is −2n-1 ≤ X < 2n-1
Example
01010011 = 83 (decimal)
= -0×27+1×26+0×25 +1×24+0×23+0×22+1×21+1×20
=26+24+21+20 = 64+16+2+1= 83
Example
10101101 = -83 (decimal)
= -1×27+0×26+1×25 +0×24+1×23+1×22+0×21+1×20
=-27+25 +23+22+20 = -128+32 +8+4+1 = -83
MEL G642
Fractional numerical representation
The fractional representation
– - a0 + a-12-1 + a-22-2+...... + a-(n-1)2-(n-1) + an2-n
– a0 is the sign bit, the range is −1 ≤ X < 1
Example
01010000 = 0.625 (decimal)
= -0×20+1×2-1+0×2-2 +1×2-3+0×2-4+0×2-5+0×2-6+0×2-7
=2-1+2-3 = 0.5 + 0.125 = 0.625
Example (10101111+00000001)
10110000 = -0.625 (decimal)
= -1×20+0×2-1+1×2-2 +1×2-3+0×2-4+0×2-5+0×2-6+0×2-7
=-20+2-2+2-3 = -1+0.25 +0.125 = -0.625
MEL G642
General
Precision
– The measurement of ability to distinguish between the closest
values, the distance between the smallest value that can be
represented.
Example:
– Using a 16 bits fixed-point processor, the precision of the data
with native data width is 0.000-0000-0000-0001which is the
distanced between 0 and 0.000-0000-0000-0001.
MEL G642
General
Dynamic range
– The ratio between the largest and the smallest numbers that can
be represented.
Quantization error
– The numerical error introduced when a longer numeric format is
converted to a shorter one
MEL G642
Dynamic range and precision of 16b
Dynamic range Dynamic range in DB Precision
Unsigned 1 to 65536 20 log10(216)=96dB 1
integer
Signed 1 to 32767 20 log10(215)=90dB 1
integer
Unsigned 2-16 to 0.99998474 96dB 2-16
fractional
Signed 2-15 to 0.99996948 90dB 2-15
fractional
dynamic _ range 20 log
positiveMAX
positiveMIN
MEL G642
Typical Sound Levels and Their
Decibel Levels
MEL G642
MEL G642
M.F format: QD(M,F) representation
M is the magnitude and F is the fractional
D stands for base, for binary system, D=2
An example of Q2(4.4), data width=M+F
M F
M F
.
23 = 8 2-4 =1/16=0.0625
Radix point
22 =4 2-3 =1/8 =0.125
21 =2 2-2 =1/4 =0.25
20 =1 2-1 =1/2 = 0.5
MEL G642
Integer or Fractional
MEL G642
Example
Multiply 16-bits integer 02FF16 by 16-bits integer 011016
using a 16-bit fixed-point datapath and save the result in
a 16-bit register. Discuss the result.
Answer:
– 32 bit result of the multiplication is 00032EF016
– The integer is right-aligned with the point at the right side of
LSB.
o the result after truncation must be aligned at the right side of the
result, and the truncation must be performed from the left side.
o The value of the result after truncation becomes 2EF016 or 0010
1110 1111 00002.
MEL G642
Dynamic range / precision
Format Largest positive value Largest negative value Precision
Q1.15 0.999969482421875 -1 0.00003051757813
Q2.14 1.99993896484375 -2 0.00006103515625
Q3.13 3.9998779296875 -4 0.00012207031250
Q4.12 7.999755859375 -8 0.00024414062500
Q5.11 15.99951171875 -16 0.00048828125000
Q6.10 31.9990234375 -32 0.00097656250000
Q7.9 63.998046875 -64 0.00195312500000
Q8.8 127.99609375 -128 0.00390625000000
Q9.7 255.9921875 -256 0.00781250000000
Q10.6 511.984375 -512 0.01562500000000
Q11.5 1023.96875 -1024 0.03125000000000
Q12.4 2047.9375 -2048 0.06250000000000
Q13.3 4095.875 -4096 0.12500000000000
Q14.2 8191.75 -8192 0.02500000000000
Q15.1 16383.5 -16384 0.05000000000000
Q16.0 32767 -32768 1.00000000000000
MEL G642
Example
Discussion:
– We find that the result of an integer multiplication is wrong and
overflow when the result keeps the same precision as operands.
o A reliable way is to limit the range of both operands to be |x |< 007F16
– For iterative
o If there are L steps of integer multiplications in an iterative loop,
there will be L+ 1 identical sign bits
– reduced dynamic range after multiplication
o except for an extreme case dealing with the maximum negative
value, (-max)×(-max)=(+max)
MEL G642
Example
Multiply 16-bits fractional 02FF16 by 16-bits fractional
011016 using a 16-bit fixed-point datapath and save the
result in a 16-bit register. Discuss the result.
Answer:
02FF 0000 0010 1111 1111 0.023406982
0110 0000 0001 0001 0000 0.008300781
00032EF0 0000 0000 0000 0011 0010 1110 111 000 0.000194296
– The fractional is left-aligned with the point at the left side of MSB.
o the result after truncation must be aligned at the leftt side of the
result, and the truncation must be performed from the right side.
o The value of the result after truncation becomes 000316 or 0000
0000 0000 00112 = 0.000091552.
MEL G642
Example
Calculate y=x8, where x=7F16( =0.9921875) is a fractional two’s
complement data, by using an 8-bit integer datapath. The output of
the multiplier is 16 bits when both operands are 8-bits. Discuss the
results.
the intermediate 8-bit operand for
multiplication has to be
yn[7:0] =2-8 × Round (FF 00×Yn)
to match the data type of an
operand. (It is not a good
way, yet the only way to manage
the iteration.)
Discussion
– The resolution of the result is lower after every step of the iteration. Integer-
based multiplication is not good for iterative computing.
MEL G642
Fraction MUL unit
How to design a fractional multiplier?
– Difficult?
– The integer two’s complement multiplier can be used for
fractional multiplication.
– After the multiplication, one of the two sign bits should be
removed. A saturation operation is required before removing one
of the sign bits to adapt the case when (-1)×(-1).
Refer class notes
MEL G642
MEL G642
Fixed point numerical representation Integer and fractional multiplication
Integer Fractional
Sign bit Sign bit Sign bit Sign bit
-23 22 21 20 × -23 22 21 20 -20 2-1 2-2 2-3 × -20 2-1 2-2 2-3
Sign bit After multiplication
-21 20 2-1 2-2 -2-3 2-4 2-5 2-6
-27 26 25 24 -23 22 21 20 Sign bit
MEL G642
Fractional result
Truncate off
-20 2-1 2-2 2-3 -2-4 2-5 2-6 0
Truncate off
Example : Example :
0111 × 0111 0111 × 0111
= 7 × 7 = 00110001 = 0.875 × 0.875 = 00.110001=0.1100010
= (-03+23+21+20) × (-03+23+21+20) = 0.5 + 0.25 + 0.015625
= (-07+06+25+24+03+02+01+20) = 0.765625 (7-bit precision)
= 32 + 16 + 1 = -01+00+2-1+2-2+0-3 (4-bit precision)
= 49 = 0.75 (4-bit precision after truncation).
=01 (4-bit precision) unacceptable (Acceptable, not so bad)
MEL G642
Example - Extension
Discussion
– it is not suitable to conduct integer multiplication if the ranges of operands are
unknown.
– The real result of y = (7F16)8 = 0.9391810; the radix point is right after the sign
bit.
– The result with finite precision hardware in Example is 0.9379.
– The error introduced by truncation is about 0.00128. The result of the fractional
multiplication is acceptable.
MEL G642
Floating point
MEL G642
Floating Point Numerical Representation
What is floating point
– Numbers are presented by a mantissa and an exponent
– value = (-1)S × mantissa × 2 exponent
Why floating point
– to get a larger dynamic range
– The computing HW automatically scales each data to utilize the
full word length of mantissa after each arithmetic operation.
o Hence best resolution is maintained for small numbers
– Reduce firmware development time
o Hence TTM advantage
Why not floating point
– complicated data path, longer critical path in the data path, could
be more silicon cost
MEL G642
Floating Point Numerical Representation
An example of IEEE 754 floating point numerical representation
Bit# 31 30 23 22 0
S EXPONENT FRACTION
MSB LSB MSB LSB
S is the sign bit of the mantissa (0 is positive and 1 is negative)
EXPONENT is an unsigned 8-bit field that determines the location of the
binary point of the number being encoded.
FRACTION is a 23-bit field containing the fractional part of the mantissa
without sign bit
LSB is the lest significant bit
MSB is the most significant bit
MEL G642
Floating Point Numerical Representation
DSP processor of Texas Instruments floating point numerical representation
S is the sign bit of the mantissa (0 is positive and 1 is negative)
EXPONENT is an unsigned 8-bit field that determines the location of the
binary point of the number being encoded.
FRACTION is a 23-bit field containing the fractional part of the mantissa
without sign bit
LSB is the lest significant bit
MSB is the most significant bit
MEL G642
MEL G642
Block floating point
A fixed-point processor does not have a floating-point
hardware unit
Floating-point computing can be emulated by fixed-point
DSP processors when larger dynamic range is required.
Block floating-point (also called dynamic signal scaling
technique) is an emulation method using a fixed-point
DSP processor hardware
MEL G642
Block floating point
Larger dynamic range
The average amplitude
40dB
30dB
20dB
Exponent=7
Exponent=4
10dB
Exponent=0
0dB
10dB
Exponent=3
20dB
30dB
Time
80dB
Scale Scale
down down
Scale down 42db 24dB No scaling 18dB
MEL G642
Finite length DSP
MEL G642
Finite length DSP: the problem
A challenge to get the best available precision on the
results under a precision-limited datapath and storage
system
– The first problem: the finite resolution given by the analog to
digital converter ADC
– The second problem: most DSP processors are “fixed point
processors” to minimize the silicon costs
– The third problem: extra quantization error introduced while
scaling down signals within a finite data length system
MEL G642
Definitions
Definition 1: Truncation:
– To convert a longer numerical format to a shorter one by
simple cutting off bits at the LSB part.
Definition 2: Quantization error
– The numerical error introduced when a longer numeric
format is converted to a shorter one
MEL G642
Round operation
Why: To eliminate bias error after truncation
To round up the truncated part or round-to-nearest
technique
– To add the smallest value after truncation if the MSB of
the discarded part is one.
Implementation
o extra addition operation is needed.
o the hardware availability and timing for rounding have to be
considered during the instruction set design
MEL G642
A rounding example
Round Arithmetic Example: Round 8 bits to 4 bits
Sign bit
Before round: 8 bits
A7 A6 A5 A4 A3 A2 A1 A0
Round to nearest
Sign bit b[3:0] = a[7:4] + {000, a[3]}
After round: 4 bits
b7 b6 b5 b4
MEL G642
Overflow, saturation, and guard
An overflow happens
– if X is not in the range -2N-1≤ X < 2N-1
When overflow happens
– In a normal computer, overflow can be managed by
o exception handling
o Re-execution of the algorithm after scaling input data.
– In a DSP processor running real-time applications
o there is no time to handle exceptions and re-execution of the
application.
o overflows in a real-time system need to be handled by other
means.
MEL G642
Overflow, saturation, and guard
Three ways to deal with overflow
– To use a floating point processor
o Higher hardware costs
– To scale down the input data
o Precision will be lower
– To use guard and saturation arithmetic
o Good idea
DSP react on overflow
– might be exception or saturation
MEL G642
Manage Overflow: Guard/saturate
Most popular way in DSP hardware
What is guard
– Add more sign extension bits to operands
– So that any results will/must be less than the maximum extended
value
– in the range that the guard can protect
2G×(–2N-1) ≤ X < 2G× (2N-1).
MEL G642
Guard bits requirements
In ASIC DSP Circuits
– design example discussed before?
In DSP ASIP
– In an ALU without iteration support we require one guard bit
(G=1)
– No of guard bits for iteration
MEL G642
How many Guard bits required
When accumulation needed, guard is needed.
Required Number of Guard bits is based on:
both the scaling factor or the guard factor:
MEL G642
How many Guard bits required
For example, if there are a maximum of 512 iteration
steps
– all the coefficients are checked and the sum of 512 coefficients is
not more than 250, the factor of 28 = 256 or eight guard bits will
be enough
MEL G642
Manage Overflow: saturation
Saturation: after executing an algorithm
if result ≥1 then final_result = 1-2-n+1
elseif result <-1 then final_result = -1
else final_resul <= result // no overflow.
– Performed after an iterative accumulation
– Do not do it during convolution (exceptions)
Discussion
– Better than exception for hard-real-time system
MEL G642
Manage Overflow: an example
An example: extension of 4 guard bits for accumulation and saturation
Sign bit Sign bit
Before guard: 4-bit OPA a3 a2 a1 a0 4-bit OPB b3 b2 b1 b0
Sign bit Sign bit
After guard: 8-bit ga4 ga3 ga2 ga1 a3 a2 a1 a0 gb4 gb3 gb2 gb1 b3 b2 b1 b0
guards=a3 Accumulator guards=b3
8-bit full adder + Accumulator register
Data could be out of range
gc4 gc3 gc2 gc1 c3 c2 c1 c0
during accumulation
After accumulation
Saturation
Dynamic range
d3 d2 d1 d0
The final result is correct After saturation, the
iteration final result is 4-bits
MEL G642
Manage Overflow: saturation
One guard bit means one more sign bit extension to
operands and the arithmetic hardware.
– At the same time, saturation circuit is required at the output of
the arithmetic circuit.
The saturation is enabled for single-precision computations
In the following cases, saturation should be disabled:
– For lower subword computing of a double precession
algorithm.
– Unsigned arithmetic computing.
– During the execution of an iteration algorithm.
MEL G642
Manage Overflow: saturation
When saturation is disabled,
– the carry-out bit (flag) should be used to link the upper and
the lower part computing of double-precision arithmetic.
MEL G642
Examples of special cases
Absolute operation on -1 in ALU.
Without guard and saturation operations it will become:
A [3:0] = 4’b1000
ABS (A [3:0]) <= INV (1000)+0001 <= 0111+0001 <= 1000
The right result should be +7 (4’b0111). The correct operation with
guard and saturation should be:
A [3:0] = 4’b1000; guarded A = GA [4:0] = 5’b11000
ABS (GA [4:0]) <= Truncate MSB (saturation (ABS (GA)))
<= Truncate MSB (saturation (INV(11000)+00001))
<= Truncate MSB (saturation (00111 + 00001))
<= Truncate MSB (saturation (01000))
<= Truncate MSB (00111) <= 0111
MEL G642
The right execution order
Guarding operands
Scaling and Rounding
Saturation and removing guards
Truncation
MEL G642
The End :: Thank you for your attention
Questions?
MEL G642