DSP Processors: We Have Seen That The Multiply and Accumulate (MAC) Operation Is Very Prevalent in DSP Computation
We have seen that the Multiply and Accumulate (MAC) operation is very prevalent in DSP computation:
  computation of energy
  MA filters
  AR filters
  correlation of two signals
  FFT
A Digital Signal Processor (DSP) is a CPU that can compute each MAC tap in 1 clock cycle
Thus the entire L coefficient MAC takes (about) L clock cycles.
For real-time operation, the time between the input of 2 consecutive x values must be more than L clock cycles.
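The real-time condition above can be checked with a few lines of arithmetic. The following is a minimal sketch, not a definitive tool: the function names, the `cycles_per_tap` parameter, and the example clock and sample rates are all assumptions made for illustration, and loop overhead is ignored.

```c
#include <stdbool.h>
#include <stdint.h>

/* Whole clock cycles available between two consecutive input samples. */
uint64_t cycles_per_sample(uint64_t clock_hz, uint64_t sample_rate_hz)
{
    return clock_hz / sample_rate_hz;
}

/*
 * True if a MAC over num_taps coefficients finishes before the next
 * sample arrives. cycles_per_tap is 1 for a DSP; a regular CPU needs more.
 * The comparison is strict because the time between samples must be
 * MORE than the cycles the MAC consumes.
 */
bool fits_real_time(uint64_t clock_hz, uint64_t sample_rate_hz,
                    uint64_t num_taps, uint64_t cycles_per_tap)
{
    return num_taps * cycles_per_tap
           < cycles_per_sample(clock_hz, sample_rate_hz);
}
```

For example, a hypothetical 100 MHz clock with a 48 kHz sample rate leaves 2083 cycles per sample, so 1000 coefficients fit at 1 cycle per tap but not at 10 cycles per tap, the figure counted later for a regular CPU.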
[Figure: block diagram of a CPU, showing XTAL (clock), ALU with ADD, MULT, etc., registers a, b, c, d, bus, memory, and PC]
DSP Slide 1
MACs
The basic MAC loop is:
  loop over all times n
    initialize yn ← 0
    loop over i from 1 to number of coefficients
      yn ← yn + ai * xj   (j related to i)
    output yn
In order to implement this in low-level programming:
  for real-time we need to update the static buffer
    from now on, we'll assume that the x values are in a pre-prepared vector
  for efficiency we don't use array indexing, but rather pointers
  we must explicitly increment the pointers
  we must place values into registers in order to do arithmetic
  loop over all times n
    clear y register
    set number of iterations to L (the number of coefficients)
    loop
      update a pointer
      update x pointer
      multiply z ← a * x   (indirect addressing)
      increment y ← y + z  (register operations)
    output y
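The low-level loop above can be sketched in C, where pointers stand in for the address registers. This is a minimal sketch under stated assumptions: the function name `mac_filter` and its parameters are hypothetical, the x values sit in a pre-prepared vector, and "j related to i" is taken to be j = n + i (a correlation-style sum), so both pointers move forward.

```c
#include <stddef.h>

/*
 * For every time n, compute y = sum over i of a[i] * x[n + i].
 * Produces num_inputs - num_taps + 1 outputs (none if the input is
 * shorter than the coefficient vector).
 */
void mac_filter(const double *a, size_t num_taps,
                const double *x, size_t num_inputs,
                double *y_out)
{
    for (size_t n = 0; n + num_taps <= num_inputs; n++) {
        const double *pa = a;      /* pointer to the coefficients */
        const double *px = x + n;  /* pointer to the current input window */
        double y = 0.0;            /* clear the y register */

        /* the loop counter is set to the number of coefficients */
        for (size_t count = num_taps; count > 0; count--) {
            /* multiply via indirect addressing, then step both pointers */
            double z = *pa++ * *px++;
            y += z;                /* increment y by z */
        }
        *y_out++ = y;              /* output y */
    }
}
```

A compiler may well turn array indexing into the same machine code; the pointer form is used here only to mirror the explicit pointer updates that the cycle counts depend on.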
Cycle counting
We still can't count cycles:
  we need to take fetch and decode into account
  we need to take loading and storing of registers into account
  we need to know the number of cycles for each arithmetic operation
    let's assume each takes 1 cycle (multiplication typically takes more)
  assume a zero-overhead loop (clearing the y register, setting the loop counter, etc.)
Then the operations inside the inner loop look something like this:
  1. Update pointer to ai
  2. Update pointer to xj
  3. Load contents of ai into register a
  4. Load contents of xj into register x
  5. Fetch operation (MULT)
  6. Decode operation (MULT)
  7. MULT a*x with result in register z
  8. Fetch operation (INC)
  9. Decode operation (INC)
  10. INC register y by contents of register z
So it takes at least 10 cycles to perform each MAC using a regular CPU
[Figure: CPU with a MAC instruction, showing an accumulator y, registers a and x, pointers pa and px, bus, memory, and PC]
  1. Update pointer to ai
  2. Update pointer to xj
  3. Load contents of ai into register a
  4. Load contents of xj into register x
  5. Fetch operation (MAC)
  6. Decode operation (MAC)
  7. MAC: a*x is added to accumulator y
[Figure: CPU with registers, pointers pa and px, bus, memory, and PC]
  1. Update pointer to ai || Update pointer to xj
  2. Load contents of ai into register a
  3. Load contents of xj into register x
  4. Fetch operation (MAC)
  5. Decode operation (MAC)
  6. MAC: a*x is added to accumulator y
However 6 > 1, so this is still NOT a DSP!
[Figure: memory split into bank 1 and bank 2, connected by the bus to registers a and x and accumulator y]
  1. Update pointer to ai || Update pointer to xj
  2. Load ai into a || Load xj into x
  3. Fetch operation (MAC)
  4. Decode operation (MAC)
  5. MAC: a*x is added to accumulator y
However 5 > 1, so this is still NOT a DSP!
One memory for both data and program (von Neumann architecture):
  can change the program during run-time
Separate memories (Harvard architecture):
  one memory for program
  one memory (or more) for data
  needn't count the fetch, since it happens in parallel
  we can remove the decode as well (see later)
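[Figure: separate program and data memories, with PC addressing the program memory and pointers pa and px addressing the data]
  1. Update pointer to ai || Update pointer to xj
  2. Load ai into a || Load xj into x
  3. MAC: a*x is added to accumulator y
However 3 > 1, so this is still NOT a DSP!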
Step 5 - pipelines
We seem to be stuck:
  Update MUST be before Load
  Load MUST be before MAC
But we can use a pipelined approach.
Then, on average, it takes 1 tick per tap
  actually, if the pipeline depth is D, N taps take N+D-1 ticks
For large N >> D, or once the pipeline is full, the number of ticks per tap is 1 (this is a DSP)
op \ t   1    2    3    4    5    6    7
Update   U1   U2   U3   U4   U5
Load          L1   L2   L3   L4   L5
MAC                M1   M2   M3   M4   M5
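The tick count for a pipelined MAC can be worked out numerically. The sketch below assumes one tick per stage and no stalls; the function names are hypothetical and chosen only for this illustration.

```c
#include <stdint.h>

/* Ticks for num_taps taps through a pipeline of the given depth: N + D - 1. */
uint64_t pipelined_ticks(uint64_t num_taps, uint64_t depth)
{
    if (num_taps == 0) {
        return 0;  /* nothing to compute */
    }
    return num_taps + depth - 1;
}

/* Without overlap, every tap passes through all stages one after another. */
uint64_t sequential_ticks(uint64_t num_taps, uint64_t depth)
{
    return num_taps * depth;
}

/* Average ticks per tap; the caller must pass num_taps > 0. */
double ticks_per_tap(uint64_t num_taps, uint64_t depth)
{
    return (double)pipelined_ticks(num_taps, depth) / (double)num_taps;
}
```

With 5 taps and the 3 stages Update, Load, and MAC, `pipelined_ticks(5, 3)` gives 7 ticks, against 15 without pipelining. For 1000 taps the average is 1.002 ticks per tap, which is why the cost approaches 1 when N >> D.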
Fixed point
Most DSPs are fixed point, i.e. they handle integer (2's complement) numbers only
  floating point is more expensive and slower
  floating point numbers can underflow
  fixed point numbers can overflow
When regular fixed point CPUs overflow:
  numbers greater than MAXINT become negative
  numbers smaller than -MAXINT become positive
Most fixed point DSPs have a saturation arithmetic mode:
  numbers larger than MAXINT become MAXINT
  numbers smaller than -MAXINT become -MAXINT
  this is still an error, but a smaller error
There is a tradeoff between safety from overflow and SNR
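The difference between wraparound and saturation can be illustrated with 16-bit additions. This is a sketch in portable C, not how any particular DSP implements its saturation mode; the function names are hypothetical. One assumption to note: in 16-bit 2's complement the most negative value is -32768, one below -MAXINT, and the saturating version here clamps to that value.

```c
#include <stdint.h>

/* Wraparound add, as on a regular fixed point CPU (modulo 2^16). */
int16_t add_wrap(int16_t a, int16_t b)
{
    /* Unsigned arithmetic wraps modulo 2^16 by definition. */
    uint16_t sum = (uint16_t)((uint16_t)a + (uint16_t)b);

    /* Reinterpret the 16 bits as a 2's complement value. */
    if (sum >= 0x8000u) {
        return (int16_t)((int32_t)sum - 65536);
    }
    return (int16_t)sum;
}

/* Saturating add: results outside the 16-bit range are clamped. */
int16_t add_saturate(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;  /* cannot overflow in 32 bits */

    if (sum > INT16_MAX) {
        return INT16_MAX;
    }
    if (sum < INT16_MIN) {
        return INT16_MIN;
    }
    return (int16_t)sum;
}
```

Adding 30000 and 10000 gives -25536 with wraparound but 32767 with saturation. The true sum is 40000, so the saturated result is off by 7233 rather than by 65536.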