CUSTOM SINGLE PURPOSE PROCESSOR DESIGN
General-purpose vs. single-purpose processors
Advantages of a single-purpose processor:
  Higher performance (fewer clock cycles, shorter clock cycle)
  Smaller size
  Less power consumption
Disadvantages of a single-purpose processor:
  High NRE cost
  Longer time-to-market
  Less flexible
Combinational logic design
A) Problem description
y is 1 if a is 1, or if b and c are both 1.
z is 1 if b or c is 1, but not both, or if all three inputs are 1.

B) Truth table
Inputs    Outputs
a b c     y z
0 0 0     0 0
0 0 1     0 1
0 1 0     0 1
0 1 1     1 0
1 0 0     1 0
1 0 1     1 1
1 1 0     1 1
1 1 1     1 1

C) Output equations (a prime denotes the complement)
y = a'bc + ab'c' + ab'c + abc' + abc
z = a'b'c + a'bc' + ab'c + abc' + abc

D) Minimized output equations (Karnaugh maps)
y     bc=00  01  11  10
a=0       0   0   1   0
a=1       1   1   1   1
y = a + bc

z     bc=00  01  11  10
a=0       0   1   0   1
a=1       0   1   1   1
z = ab + b'c + bc'

E) Logic gates
[Figure: gate-level circuit with inputs a, b, c and outputs y, z implementing the minimized equations.]
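The minimization step can be checked exhaustively, since there are only eight input rows. The following is a minimal sketch, assuming the minimized forms are y = a + bc and z = ab + b'c + bc' (primes denote complement); the function names are illustrative and not part of the original slides.

```c
#include <stdbool.h>

/* Unminimized sum-of-minterms forms from step C. */
bool y_full(bool a, bool b, bool c) {
    return (!a && b && c) || (a && !b && !c) || (a && !b && c) ||
           (a && b && !c) || (a && b && c);
}

bool z_full(bool a, bool b, bool c) {
    return (!a && !b && c) || (!a && b && !c) || (a && !b && c) ||
           (a && b && !c) || (a && b && c);
}

/* Minimized forms from step D (assumed: y = a + bc, z = ab + b'c + bc'). */
bool y_min(bool a, bool b, bool c) { return a || (b && c); }

bool z_min(bool a, bool b, bool c) {
    return (a && b) || (!b && c) || (b && !c);
}

/* Counts the (row, output) pairs on which the two forms disagree. */
int count_mismatches(void) {
    int mismatches = 0;
    for (int row = 0; row < 8; row++) {
        bool a = (row >> 2) & 1, b = (row >> 1) & 1, c = row & 1;
        if (y_full(a, b, c) != y_min(a, b, c)) mismatches++;
        if (z_full(a, b, c) != z_min(a, b, c)) mismatches++;
    }
    return mismatches;
}
```

A result of zero from count_mismatches() indicates that the minimized equations agree with the truth table on every row.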
RT level Combinational components
n-bit, m x 1 multiplexor
  Inputs I(m-1) ... I1, I0 (n bits each), log2(m) select lines S; output O (n bits).
  O = I0 if S=0..00
      I1 if S=0..01
      ...
      I(m-1) if S=1..11

log n x n decoder
  Inputs I(log n - 1) ... I0; outputs O(n-1) ... O1, O0.
  O0 = 1 if I=0..00
  O1 = 1 if I=0..01
  ...
  O(n-1) = 1 if I=1..11
  With enable input e: all Os are 0 if e=0.

n-bit adder
  Inputs A, B (n bits each); outputs sum (n bits) and carry.
  sum = A+B (first n bits)
  carry = (n+1)th bit of A+B
  With carry-in input Ci: sum = A + B + Ci.

n-bit comparator
  Inputs A, B (n bits each); outputs less, equal, greater.
  less = 1 if A<B
  equal = 1 if A=B
  greater = 1 if A>B

n-bit, m-function ALU
  Inputs A, B (n bits each), log2(m) select lines S; output O (n bits).
  O = A op B, where op is determined by S.
  May have status outputs: carry, zero, etc.
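The behaviour of these components can be sketched in software. The block below is an illustrative model only: the width N_BITS, the type names and the function names are assumptions made for the example, not definitions from the slides.

```c
#include <stdint.h>

#define N_BITS 8u                      /* assumed width n; must be below 32 here */
#define MASK   ((1u << N_BITS) - 1u)

/* m x 1 multiplexor: O = I[S]. */
uint32_t mux(const uint32_t *inputs, unsigned s) {
    return inputs[s] & MASK;
}

/* Decoder with enable: only output bit i is 1; all outputs are 0 if e = 0. */
uint32_t decoder(unsigned i, unsigned e) {
    return e ? (1u << i) : 0u;
}

typedef struct { uint32_t sum; unsigned carry; } adder_out;

/* n-bit adder with carry-in: sum is the low n bits, carry is the next bit. */
adder_out adder(uint32_t a, uint32_t b, unsigned ci) {
    uint32_t full = (a & MASK) + (b & MASK) + (ci & 1u);
    adder_out r = { full & MASK, (full >> N_BITS) & 1u };
    return r;
}

typedef struct { unsigned less, equal, greater; } cmp_out;

/* n-bit comparator: exactly one of the three outputs is 1. */
cmp_out comparator(uint32_t a, uint32_t b) {
    cmp_out r = { a < b, a == b, a > b };
    return r;
}
```

With the assumed 8-bit width, adding 200 and 100 overflows, so the model reports the low eight bits in sum and sets carry.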
RT level Sequential components
n-bit register
  Inputs I (n bits), load, clear; output Q (n bits).
  Q = 0 if clear=1,
      I if load=1 and clock=1,
      Q(previous) otherwise.

n-bit shift register
  Inputs I, shift; output Q.
  Q = lsb
  Content is shifted; I is stored in the msb.

n-bit counter
  Inputs count, clear; output Q (n bits).
  Q = 0 if clear=1,
      Q(prev)+1 if count=1 and clock=1.
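A small software model can make the clocked behaviour concrete. This is a sketch under stated assumptions: each function call stands for one rising clock edge, the width is fixed at 8 bits, clear is treated as taking priority over the other inputs, and all names are illustrative.

```c
#include <stdint.h>

typedef struct { uint8_t q; } reg8;   /* assumed 8-bit storage */

/* Register: Q = 0 if clear, I if load, otherwise unchanged. */
void register_clock(reg8 *r, uint8_t i, unsigned load, unsigned clear) {
    if (clear)     r->q = 0;
    else if (load) r->q = i;
}

/* Shift register: returns the lsb seen on Q before the edge, then shifts
   the content right and stores i_bit in the msb. */
unsigned shift_clock(reg8 *r, unsigned i_bit, unsigned shift) {
    unsigned out = r->q & 1u;
    if (shift) r->q = (uint8_t)((r->q >> 1) | ((i_bit & 1u) << 7));
    return out;
}

/* Counter: Q = 0 if clear, Q + 1 if count (wraps at 8 bits). */
void counter_clock(reg8 *r, unsigned count, unsigned clear) {
    if (clear)      r->q = 0;
    else if (count) r->q = (uint8_t)(r->q + 1u);
}
```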
Sequential logic design
A) Problem description
You want to construct a clock divider: slow down your preexisting clock so that you output a 1 for every four clock cycles.

B) State diagram
States 0, 1, 2 and 3. In every state, a=0 keeps the current state and a=1 advances to the next state (state 3 wraps to state 0). Output x=0 in states 0, 1 and 2; x=1 in state 3.

C) Implementation model
Combinational logic takes input a and the current state Q1 Q0 from the state register. It produces output x and the next state I1 I0, which the state register loads on each clock.

D) State table (Moore-type)
Inputs       Outputs
Q1 Q0 a      I1 I0 x
 0  0 0       0  0 0
 0  0 1       0  1 0
 0  1 0       0  1 0
 0  1 1       1  0 0
 1  0 0       1  0 0
 1  0 1       1  1 0
 1  1 0       1  1 1
 1  1 1       0  0 1

Given this implementation model, sequential logic design quickly reduces to combinational logic design.
Sequential logic design (cont.)
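E) Minimized output equations (a prime denotes the complement)
I1    Q1Q0=00  01  11  10
a=0         0   0   1   1
a=1         0   1   0   1
I1 = Q1'Q0a + Q1a' + Q1Q0'

I0    Q1Q0=00  01  11  10
a=0         0   1   1   0
a=1         1   0   0   1
I0 = Q0a' + Q0'a

x     Q1Q0=00  01  11  10
a=0         0   0   1   0
a=1         0   0   1   0
x = Q1Q0

F) Combinational logic
[Figure: gate-level circuit with inputs a, Q1, Q0 and outputs x, I1, I0 implementing the minimized equations.]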
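The resulting design can be simulated cycle by cycle. The sketch below assumes the minimized equations I1 = Q1'Q0a + Q1a' + Q1Q0', I0 = Q0a' + Q0'a and x = Q1Q0 (primes denote complement), and a state register that starts at 00; the struct and function names are illustrative.

```c
typedef struct { unsigned q1, q0; } divider;   /* state register Q1 Q0 */

/* Advances one clock cycle with input a. Returns the Moore output x of the
   state that was current during this cycle. */
unsigned divider_clock(divider *d, unsigned a) {
    unsigned q1 = d->q1 & 1u, q0 = d->q0 & 1u, ain = a & 1u;
    unsigned nq1 = q1 ^ 1u, nq0 = q0 ^ 1u, na = ain ^ 1u;   /* complements */

    unsigned x  = q1 & q0;
    unsigned i1 = (nq1 & q0 & ain) | (q1 & na) | (q1 & nq0);
    unsigned i0 = (q0 & na) | (nq0 & ain);

    d->q1 = i1;   /* state register loads I1 I0 on the clock edge */
    d->q0 = i0;
    return x;
}

/* Counts how many of the first `cycles` clock cycles have x = 1. */
unsigned count_pulses(unsigned cycles, unsigned a) {
    divider d = {0, 0};
    unsigned pulses = 0;
    for (unsigned k = 0; k < cycles; k++) pulses += divider_clock(&d, a);
    return pulses;
}
```

With a held at 1, the model yields one pulse on x in every group of four cycles; with a held at 0 the state never leaves 0 and x stays 0.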
Custom single-purpose processor basic model
Controller and datapath
  Controller: receives external control inputs and datapath control outputs; produces external control outputs and datapath control inputs.
  Datapath: receives external data inputs and datapath control inputs; produces external data outputs and datapath control outputs.

A view inside the controller and datapath
  Controller: next-state and control logic, state register.
  Datapath: registers, functional units.
Example: greatest common divisor
First create the algorithm.
Convert the algorithm to a "complex" state machine, known as an FSMD: finite-state machine with datapath.
Templates can be used to perform such a conversion.

(a) Black-box view
GCD block with inputs go_i, x_i, y_i and output d_o.

(b) Desired functionality
0: int x, y;
1: while (1) {
2:    while (!go_i);
3:    x = x_i;
4:    y = y_i;
5:    while (x != y) {
6:       if (x < y)
7:          y = y - x;
         else
8:          x = x - y;
      }
9:    d_o = x;
   }

(c) State diagram
State  Action       Transitions
1:                  1 -> 2; !1 -> past 1-J (never taken)
2:                  !go_i -> 2-J; !(!go_i) -> 3
2-J:                -> 2
3:     x = x_i      -> 4
4:     y = y_i      -> 5
5:                  x!=y -> 6; !(x!=y) -> 9
6:                  x<y -> 7; !(x<y) -> 8
7:     y = y - x    -> 6-J
8:     x = x - y    -> 6-J
6-J:                -> 5-J
5-J:                -> 5
9:     d_o = x      -> 1-J
1-J:                -> 1
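One way to see how the FSMD executes is to step through it in software, one state per clock cycle. The sketch below is an illustration under assumptions: go_i is held at 1, both inputs are positive (otherwise the subtraction loop does not terminate), the state names S1 ... S1J are invented for the example, and the function returns once state 9 has written d_o instead of looping forever.

```c
typedef enum { S1, S2, S2J, S3, S4, S5, S6, S7, S8, S6J, S5J, S9, S1J } fsmd_state;

/* Runs the GCD FSMD for one result. *cycles (optional) receives the number
   of states visited, i.e. clock cycles, up to and including state 9. */
unsigned gcd_fsmd(unsigned x_i, unsigned y_i, unsigned *cycles) {
    unsigned x = 0, y = 0, d_o = 0, n = 0;
    const int go_i = 1;                  /* assumption: start signal asserted */
    fsmd_state s = S1;

    for (;;) {
        n++;
        switch (s) {
        case S1:  s = S2;                      break;
        case S2:  s = go_i ? S3 : S2J;         break;
        case S2J: s = S2;                      break;
        case S3:  x = x_i;   s = S4;           break;
        case S4:  y = y_i;   s = S5;           break;
        case S5:  s = (x != y) ? S6 : S9;      break;
        case S6:  s = (x < y) ? S7 : S8;       break;
        case S7:  y = y - x; s = S6J;          break;
        case S8:  x = x - y; s = S6J;          break;
        case S6J: s = S5J;                     break;
        case S5J: s = S5;                      break;
        case S9:  d_o = x;
                  if (cycles) *cycles = n;
                  return d_o;                  /* hardware would go on to 1-J */
        case S1J: s = S1;                      break;  /* not reached here */
        }
    }
}
```

In this model, each pass through the inner loop costs five states (5, 6, 7 or 8, 6-J, 5-J), which suggests why removing join states is an attractive optimization.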
State diagram templates
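Assignment statement
  a = b
  next statement
  Template: one state performing a = b, followed by the state of the next statement.

Loop statement
  while (cond) {
     loop-body-statements
  }
  next statement
  Template: condition state C: takes the cond transition into the loop-body statements, which end at join state J:, and J: returns to C:. The !cond transition leaves C: for the next statement.

Branch statement
  if (c1)
     c1 stmts
  else if c2
     c2 stmts
  else
     other stmts
  next statement
  Template: condition state C: takes transition c1 to the c1 stmts, !c1*c2 to the c2 stmts, or !c1*!c2 to the other stmts. All three paths meet at join state J:, which leads to the next statement.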
Creating the datapath
Create a register for any declared variable.
Create a functional unit for each arithmetic operation.
Connect the ports, registers and functional units, based on reads and writes. Use multiplexors for multiple sources.

Datapath
  Registers: x and y (declaration 0:), loaded by x_ld and y_ld; d (state 9:), loaded by d_ld and driving d_o.
  n-bit 2x1 multiplexors: x_sel selects between x_i and x-y as the source of x; y_sel selects between y_i and y-x as the source of y.
  Comparators: != (state 5: x!=y) producing x_neq_y; < (state 6: x<y) producing x_lt_y.
  Subtractors: one for state 8: x-y, one for state 7: y-x.
Creating the controllers FSM
Same structure as the FSMD.
Replace complex actions/conditions with datapath configurations.

Controller
Encoding  State  Datapath configuration   Transitions
0000      1:                              -> 2
0001      2:                              !go_i -> 2-J; go_i -> 3
0010      2-J:                            -> 2
0011      3:     x_sel = 0, x_ld = 1      -> 4
0100      4:     y_sel = 0, y_ld = 1      -> 5
0101      5:                              x_neq_y -> 6; !x_neq_y -> 9
0110      6:                              x_lt_y -> 7; !x_lt_y -> 8
0111      7:     y_sel = 1, y_ld = 1      -> 6-J
1000      8:     x_sel = 1, x_ld = 1      -> 6-J
1001      6-J:                            -> 5-J
1010      5-J:                            -> 5
1011      9:     d_ld = 1                 -> 1-J
1100      1-J:                            -> 1
Splitting into a controller and datapath
Controller implementation model
  Combinational logic with inputs go_i, x_neq_y, x_lt_y and the current state Q3 Q2 Q1 Q0. It produces x_sel, y_sel, x_ld, y_ld, d_ld and the next state I3 I2 I1 I0, which is loaded into the state register.

(a) Controller
  The encoded FSM, with states 0000 (1:) through 1100 (1-J:) and transitions decided by go_i, x_neq_y and x_lt_y.

(b) Datapath
  Registers x, y and d, two n-bit 2x1 multiplexors, the != and < comparators and the two subtractors. It receives x_i, y_i, x_sel, y_sel, x_ld, y_ld, d_ld and produces x_neq_y, x_lt_y and d_o.
Controller state table for the GCD example
* = any input value, X = don't-care output

Inputs                                Outputs
Q3 Q2 Q1 Q0  x_neq_y  x_lt_y  go_i    I3 I2 I1 I0  x_sel  y_sel  x_ld  y_ld  d_ld
 0  0  0  0     *       *      *       0  0  0  1    X      X     0     0     0
 0  0  0  1     *       *      0       0  0  1  0    X      X     0     0     0
 0  0  0  1     *       *      1       0  0  1  1    X      X     0     0     0
 0  0  1  0     *       *      *       0  0  0  1    X      X     0     0     0
 0  0  1  1     *       *      *       0  1  0  0    0      X     1     0     0
 0  1  0  0     *       *      *       0  1  0  1    X      0     0     1     0
 0  1  0  1     0       *      *       1  0  1  1    X      X     0     0     0
 0  1  0  1     1       *      *       0  1  1  0    X      X     0     0     0
 0  1  1  0     *       0      *       1  0  0  0    X      X     0     0     0
 0  1  1  0     *       1      *       0  1  1  1    X      X     0     0     0
 0  1  1  1     *       *      *       1  0  0  1    X      1     0     1     0
 1  0  0  0     *       *      *       1  0  0  1    1      X     1     0     0
 1  0  0  1     *       *      *       1  0  1  0    X      X     0     0     0
 1  0  1  0     *       *      *       0  1  0  1    X      X     0     0     0
 1  0  1  1     *       *      *       1  1  0  0    X      X     0     0     1
 1  1  0  0     *       *      *       0  0  0  0    X      X     0     0     0
 1  1  0  1     *       *      *       0  0  0  0    X      X     0     0     0
 1  1  1  0     *       *      *       0  0  0  0    X      X     0     0     0
 1  1  1  1     *       *      *       0  0  0  0    X      X     0     0     0
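The split into controller and datapath can be illustrated with two functions that communicate only through the control and status signals. This is a sketch, not the original design: it assumes the state encoding 0000 (state 1) through 1100 (state 1-J) with state 5 going to state 9 (1011) when x_neq_y is 0, models don't-care outputs as 0, holds go_i at 1, and requires positive inputs. All C names are illustrative.

```c
typedef struct { unsigned next, x_sel, y_sel, x_ld, y_ld, d_ld; } ctrl_out;

/* Controller: pure combinational function of state and status inputs. */
ctrl_out controller(unsigned q, unsigned x_neq_y, unsigned x_lt_y, unsigned go_i) {
    ctrl_out o = {0, 0, 0, 0, 0, 0};          /* don't-cares modeled as 0 */
    switch (q) {
    case 0x0: o.next = 0x1;                              break;  /* 1:   */
    case 0x1: o.next = go_i ? 0x3 : 0x2;                 break;  /* 2:   */
    case 0x2: o.next = 0x1;                              break;  /* 2-J: */
    case 0x3: o.next = 0x4; o.x_sel = 0; o.x_ld = 1;     break;  /* 3:   */
    case 0x4: o.next = 0x5; o.y_sel = 0; o.y_ld = 1;     break;  /* 4:   */
    case 0x5: o.next = x_neq_y ? 0x6 : 0xB;              break;  /* 5:   */
    case 0x6: o.next = x_lt_y ? 0x7 : 0x8;               break;  /* 6:   */
    case 0x7: o.next = 0x9; o.y_sel = 1; o.y_ld = 1;     break;  /* 7:   */
    case 0x8: o.next = 0x9; o.x_sel = 1; o.x_ld = 1;     break;  /* 8:   */
    case 0x9: o.next = 0xA;                              break;  /* 6-J: */
    case 0xA: o.next = 0x5;                              break;  /* 5-J: */
    case 0xB: o.next = 0xC; o.d_ld = 1;                  break;  /* 9:   */
    default:  o.next = 0x0;                              break;  /* 1-J:, unused */
    }
    return o;
}

typedef struct { unsigned x, y, d; } datapath;

/* Datapath: one clock edge. Multiplexors choose each register's source and
   the load signals decide which registers actually update. */
void datapath_clock(datapath *dp, ctrl_out c, unsigned x_i, unsigned y_i) {
    unsigned x_src = c.x_sel ? dp->x - dp->y : x_i;   /* subtractor 8: x-y */
    unsigned y_src = c.y_sel ? dp->y - dp->x : y_i;   /* subtractor 7: y-x */
    unsigned d_src = dp->x;
    if (c.x_ld) dp->x = x_src;
    if (c.y_ld) dp->y = y_src;
    if (c.d_ld) dp->d = d_src;
}

/* Clocks both halves together until d is loaded, then returns d_o. */
unsigned gcd_split(unsigned x_i, unsigned y_i) {
    datapath dp = {0, 0, 0};
    unsigned q = 0;
    for (;;) {
        ctrl_out c = controller(q, dp.x != dp.y, dp.x < dp.y, 1u);
        datapath_clock(&dp, c, x_i, y_i);
        if (c.d_ld) return dp.d;
        q = c.next;
    }
}
```

The controller never sees x or y directly; it only sees x_neq_y and x_lt_y, which is the point of the split.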
Design a custom single-purpose processor for Fibonacci numbers up to n

int i, j, k, n, outp;
while (1) {
   while (!go_i);
   n = n_i;
   i = 0; j = 1; k = 0;
   outp = i;
   outp = j;
   while (k <= n) {
      k = i + j;
      i = j;
      j = k;
      outp = k;
   }
}
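Before deriving the FSMD, it can help to have a software reference model to compare the hardware against. The sketch below runs one pass of the program body for a given n and records every value written to outp. The function name, the output buffer and its capacity parameter are assumptions added for the example.

```c
#include <stddef.h>

/* Runs the program body once for input n. Each write to outp is appended
   to out[]; returns the number of values written (stops when out is full). */
size_t fib_outputs(int n, int *out, size_t cap) {
    size_t count = 0;
    int i = 0, j = 1, k = 0;

    if (count < cap) out[count++] = i;        /* outp = i */
    if (count < cap) out[count++] = j;        /* outp = j */
    while (k <= n && count < cap) {
        k = i + j;
        i = j;
        j = k;
        out[count++] = k;                     /* outp = k */
    }
    return count;
}
```

Note that, as the program is written, the loop condition is tested before k is updated, so the last value written to outp can exceed n. For n = 10 the model produces 0, 1, 1, 2, 3, 5, 8, 13.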
RT-level custom single-purpose processor design
We often start with a state machine rather than an algorithm.
  Cycle timing is often too central to functionality.
Example: bus bridge that converts a 4-bit bus to an 8-bit bus.
  Start with an FSMD.
  Known as register-transfer (RT) level.

Problem specification
  Sender -> Bridge -> Receiver, with a common clock.
  Sender to Bridge: rdy_in, data_in(4). Bridge to Receiver: rdy_out, data_out(8).
  Bridge: a single-purpose processor that converts two 4-bit inputs, arriving one at a time over data_in along with a rdy_in pulse, into one 8-bit output on data_out along with a rdy_out pulse.

FSMD
  Inputs: rdy_in: bit; data_in: bit[4];
  Outputs: rdy_out: bit; data_out: bit[8]
  Variables: data_lo, data_hi: bit[4];

  State             Action                                   Transitions
  WaitFirst4                                                 rdy_in=0 -> WaitFirst4; rdy_in=1 -> RecFirst4Start
  RecFirst4Start    data_lo=data_in                          -> RecFirst4End
  RecFirst4End                                               rdy_in=1 -> RecFirst4End; rdy_in=0 -> WaitSecond4
  WaitSecond4                                                rdy_in=0 -> WaitSecond4; rdy_in=1 -> RecSecond4Start
  RecSecond4Start   data_hi=data_in                          -> RecSecond4End
  RecSecond4End                                              rdy_in=1 -> RecSecond4End; rdy_in=0 -> Send8Start
  Send8Start        data_out=data_hi & data_lo, rdy_out=1    -> Send8End
  Send8End          rdy_out=0                                -> WaitFirst4
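The bridge FSMD can be exercised with a small cycle-level model. This is a sketch under assumptions: each call to bridge_clock stands for one clock cycle, the `&` in data_hi & data_lo is read as concatenation (data_hi in the upper four bits), and the sender holds rdy_in and data_in stable for two cycles per nibble. The helper functions and struct are illustrative names, not part of the slides.

```c
#include <stdint.h>

typedef enum {
    WaitFirst4, RecFirst4Start, RecFirst4End,
    WaitSecond4, RecSecond4Start, RecSecond4End,
    Send8Start, Send8End
} bridge_state;

typedef struct {
    bridge_state s;
    uint8_t data_lo, data_hi, data_out;
    unsigned rdy_out;
} bridge;

/* One clock cycle of the bridge FSMD. */
void bridge_clock(bridge *b, unsigned rdy_in, uint8_t data_in) {
    switch (b->s) {
    case WaitFirst4:      if (rdy_in)  b->s = RecFirst4Start;                 break;
    case RecFirst4Start:  b->data_lo = data_in & 0xF; b->s = RecFirst4End;    break;
    case RecFirst4End:    if (!rdy_in) b->s = WaitSecond4;                    break;
    case WaitSecond4:     if (rdy_in)  b->s = RecSecond4Start;                break;
    case RecSecond4Start: b->data_hi = data_in & 0xF; b->s = RecSecond4End;   break;
    case RecSecond4End:   if (!rdy_in) b->s = Send8Start;                     break;
    case Send8Start:      b->data_out = (uint8_t)((b->data_hi << 4) | b->data_lo);
                          b->rdy_out = 1; b->s = Send8End;                    break;
    case Send8End:        b->rdy_out = 0; b->s = WaitFirst4;                  break;
    }
}

/* Hypothetical sender: presents one nibble with a rdy_in pulse. */
void send_nibble(bridge *b, uint8_t nibble) {
    bridge_clock(b, 1, nibble);   /* Wait...  -> Rec...Start           */
    bridge_clock(b, 1, nibble);   /* Rec...Start latches the nibble    */
    bridge_clock(b, 0, 0);        /* Rec...End sees rdy_in=0, moves on */
}

/* Sends the low nibble, then the high nibble; returns the byte on data_out. */
uint8_t bridge_transfer(uint8_t lo, uint8_t hi) {
    bridge b = { WaitFirst4, 0, 0, 0, 0 };
    send_nibble(&b, lo);
    send_nibble(&b, hi);
    bridge_clock(&b, 0, 0);       /* Send8Start: data_out loaded, rdy_out=1 */
    return b.rdy_out ? b.data_out : 0;
}
```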
RT-level custom single-purpose processor design (cont)
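(a) Controller
  Inputs rdy_in and clk; outputs rdy_out, data_lo_ld, data_hi_ld and data_out_ld.

  State             Control outputs               Transitions
  WaitFirst4                                      rdy_in=0 -> WaitFirst4; rdy_in=1 -> RecFirst4Start
  RecFirst4Start    data_lo_ld=1                  -> RecFirst4End
  RecFirst4End                                    rdy_in=1 -> RecFirst4End; rdy_in=0 -> WaitSecond4
  WaitSecond4                                     rdy_in=0 -> WaitSecond4; rdy_in=1 -> RecSecond4Start
  RecSecond4Start   data_hi_ld=1                  -> RecSecond4End
  RecSecond4End                                   rdy_in=1 -> RecSecond4End; rdy_in=0 -> Send8Start
  Send8Start        data_out_ld=1, rdy_out=1      -> Send8End
  Send8End          rdy_out=0                     -> WaitFirst4

(b) Datapath
  Registers data_hi and data_lo are fed from data_in(4) and loaded by data_hi_ld and data_lo_ld. Register data_out is fed from data_hi and data_lo, loaded by data_out_ld, and drives data_out. clk goes to all registers.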
Optimizing single-purpose processors
Optimization is the task of making design metric values the best possible.
Optimization opportunities:
  original program
  FSMD
  datapath
  FSM
Optimizing the original program
Analyze program attributes and look for areas of possible improvement:
  number of computations
  size of variables
  time and space complexity
  operations used (multiplication and division are very expensive)
Optimizing the original program (cont)
original program
0: int x, y;
1: while (1) {
2:    while (!go_i);
3:    x = x_i;
4:    y = y_i;
5:    while (x != y) {
6:       if (x < y)
7:          y = y - x;
         else
8:          x = x - y;
      }
9:    d_o = x;
   }

GCD(42, 8): 9 iterations to complete the loop.
x and y values evaluated as follows: (42, 8), (34, 8), (26, 8), (18, 8), (10, 8), (2, 8), (2, 6), (2, 4), (2, 2).

Replace the subtraction operation(s) with the modulo operation in order to speed up the program.

optimized program
int x, y, r;
while (1) {
   while (!go_i);
   // x must be the larger number
   if (x_i >= y_i) {
      x = x_i;
      y = y_i;
   }
   else {
      x = y_i;
      y = x_i;
   }
   while (y != 0) {
      r = x % y;
      x = y;
      y = r;
   }
   d_o = x;
}

GCD(42, 8): 3 iterations to complete the loop.
x and y values evaluated as follows: (42, 8), (8, 2), (2, 0).
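The two loop counts can be reproduced in software. The sketch below counts how many times each loop condition is evaluated, which is how the (x, y) pairs are enumerated above; the function names and the counter parameter are assumptions for the example, and the subtraction version expects positive inputs.

```c
/* Subtraction-based GCD. *evals receives the number of times the loop
   condition (x != y) is evaluated. Inputs must be positive. */
unsigned gcd_sub(unsigned x, unsigned y, unsigned *evals) {
    *evals = 0;
    for (;;) {
        (*evals)++;
        if (x == y) break;
        if (x < y) y = y - x;
        else       x = x - y;
    }
    return x;
}

/* Modulo-based GCD. *evals receives the number of times the loop
   condition (y != 0) is evaluated. */
unsigned gcd_mod(unsigned x_i, unsigned y_i, unsigned *evals) {
    unsigned x, y, r;
    /* x must be the larger number */
    if (x_i >= y_i) { x = x_i; y = y_i; }
    else            { x = y_i; y = x_i; }

    *evals = 0;
    for (;;) {
        (*evals)++;
        if (y == 0) break;
        r = x % y;
        x = y;
        y = r;
    }
    return x;
}
```

For inputs 42 and 8, the model reports 9 condition evaluations for the subtraction version and 3 for the modulo version. Whether the modulo version is faster in hardware depends on the cost of the modulo unit, which this sketch does not model.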
Optimizing the FSMD
Areas of possible improvement:
  merge states
    states with constants on transitions can be eliminated, since the transition taken is already known
    states with independent operations can be merged
  separate states
    states which require complex operations (a*b*c*d) can be broken into smaller states to reduce hardware size
  scheduling
Optimizing the FSMD (cont.)
original FSMD
  int x, y;
  States 1, 2, 2-J, 3, 4, 5, 6, 7, 8, 6-J, 5-J, 9 and 1-J.

Optimization steps
  Eliminate state 1: transitions have constant values.
  Merge state 2 and state 2J: no loop operation in between them.
  Merge state 3 and state 4: assignment operations are independent of one another.
  Merge state 5 and state 6: transitions from state 6 can be done in state 5.
  Eliminate state 5J and 6J: transitions from each state can be done from state 7 and state 8, respectively.
  Eliminate state 1-J: transition from state 1-J can be done directly from state 9.

optimized FSMD
  int x, y;
  State  Action                 Transitions
  2:                            !go_i -> 2; go_i -> 3
  3:     x = x_i, y = y_i       -> 5
  5:                            x<y -> 7; x>y -> 8; !(x!=y) -> 9
  7:     y = y - x              -> 5
  8:     x = x - y              -> 5
  9:     d_o = x                -> 2
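The effect of these merges on latency can be estimated by stepping through the optimized FSMD in software. The sketch below assumes six states (2, 3, 5, 7, 8, 9) with state 5 branching three ways on x<y, x>y and x==y, holds go_i at 1, requires positive inputs, and returns once state 9 has produced d_o. The enum and function names are illustrative.

```c
typedef enum { ST2, ST3, ST5, ST7, ST8, ST9 } opt_state;

/* Runs the optimized GCD FSMD for one result. *cycles (optional) receives
   the number of states visited up to and including state 9. */
unsigned gcd_fsmd_opt(unsigned x_i, unsigned y_i, unsigned *cycles) {
    unsigned x = 0, y = 0, n = 0;
    const int go_i = 1;                  /* assumption: start signal asserted */
    opt_state s = ST2;

    for (;;) {
        n++;
        switch (s) {
        case ST2: if (go_i) s = ST3;                          break;
        case ST3: x = x_i; y = y_i; s = ST5;                  break;  /* merged 3 and 4 */
        case ST5: s = (x < y) ? ST7 : (x > y) ? ST8 : ST9;    break;  /* merged 5 and 6 */
        case ST7: y = y - x; s = ST5;                         break;
        case ST8: x = x - y; s = ST5;                         break;
        case ST9: if (cycles) *cycles = n;
                  return x;                                   /* d_o = x */
        }
    }
}
```

In this model each pass through the loop costs two states (5, then 7 or 8), so GCD(42, 8) with its eight subtractions completes in 20 state visits.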
Optimizing the datapath
Sharing of functional units
  One-to-one mapping, as done previously, is not necessary.
  If the same operation occurs in different states, they can share a single functional unit.
Multi-functional units
  ALUs support a variety of operations; an ALU can be shared among operations occurring in different states.
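As an illustration of sharing, the two subtractions in the GCD example (y - x in state 7 and x - y in state 8) never occur in the same state, so a single subtractor with multiplexed operands could serve both. The sketch below models that idea; the control signal sub_sel and the function name are hypothetical and do not appear in the original design.

```c
/* One subtractor shared between two states.
   sub_sel = 0 computes x - y (state 8); sub_sel = 1 computes y - x (state 7). */
unsigned shared_subtract(unsigned x, unsigned y, unsigned sub_sel) {
    unsigned left  = sub_sel ? y : x;   /* 2x1 multiplexor on the left operand  */
    unsigned right = sub_sel ? x : y;   /* 2x1 multiplexor on the right operand */
    return left - right;                /* the single shared subtractor         */
}
```

The trade-off is one subtractor saved at the cost of two extra multiplexors and one more control signal from the controller.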
Optimizing the FSM
State encoding
  Task of assigning a unique bit pattern to each state in an FSM.
  Size of state register and combinational logic vary.
  Can be treated as an ordering problem.
State minimization
  Task of merging equivalent states into a single state.
  State equivalent if for all possible input combinations