Introduction to Instruction Pipelining
Pipelining
Pipelining is an implementation technique in which multiple instructions are overlapped in execution.
Here we will consider a RISC architecture:
Memory is accessed only by load/store instructions
All other instructions operate on registers.
Pipelining is Natural!
Laundry Example
Ann, Brian, Cathy, Dave A B C D
each have one load of clothes
to wash, dry, and fold
Washer takes 30 minutes
Dryer takes 40 minutes
“Folder” takes 20 minutes
Sequential Laundry
[Figure: timeline from 6 PM to midnight. Loads A-D run one after another in task order; each takes 30 min wash + 40 min dry + 20 min fold = 90 min.]
Sequential laundry takes 6 hours for 4 loads
If they learned pipelining, how long would laundry take?
Pipelined Laundry: Start work ASAP
[Figure: timeline from 6 PM to 9:30 PM. Loads A-D overlap in task order: 30 min to fill the washer, then one load leaves the dryer every 40 min, plus a final 20 min fold.]
Pipelined laundry takes 3.5 hours for 4 loads
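The two totals above can be checked with a short sketch (times in minutes, taken straight from the example):

```python
# Laundry example: wash 30 min, dry 40 min, fold 20 min; four loads.
WASH, DRY, FOLD = 30, 40, 20
loads = 4

# Sequential: each load finishes all three stages before the next starts.
sequential = loads * (WASH + DRY + FOLD)

# Pipelined: the dryer (slowest stage, 40 min) sets the rate.
# The first wash fills the pipeline, then one load leaves the dryer
# every 40 min, and the last load still needs 20 min of folding.
pipelined = WASH + loads * DRY + FOLD

print(sequential / 60)  # 6.0 hours
print(pipelined / 60)   # 3.5 hours
```

Note that the pipelined time is dominated by the slowest stage, not by the sum of the stages, which is exactly the first pipelining lesson below.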
Pipelining Lessons
Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
Pipeline rate is limited by the slowest pipeline stage
Multiple tasks operate simultaneously using different resources
Potential speedup = number of pipeline stages
Unbalanced lengths of pipeline stages reduce speedup
Time to "fill" the pipeline and time to "drain" it reduces speedup
Stall for dependencies
The Five Stages of Load
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5
Load Ifetch Reg/Dec Exec Mem Wr
Ifetch: Instruction Fetch
Fetch the instruction from the Instruction Memory
Reg/Dec: Register Fetch and Instruction Decode
Exec: Calculate the memory address
Mem: Read the data from the Data Memory
Wr: Write the data back to the register file
Branch instruction – 2 stages
Store instruction – 4 stages
ALU instruction – 5 stages (4th stage is idle)
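The stage usage of each instruction class can be summarized in a small sketch (stage names follow the slide; this is an illustration of the stage counts above, not the actual datapath):

```python
# Stages used by each instruction class in this 5-stage design.
STAGES = {
    "load":   ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"],  # all 5 stages
    "store":  ["Ifetch", "Reg/Dec", "Exec", "Mem"],        # no write-back
    "branch": ["Ifetch", "Reg/Dec"],                       # decided in decode
    "alu":    ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"],  # Mem stage idle
}

for kind, stages in STAGES.items():
    print(f"{kind}: {len(stages)} stages -> {' '.join(stages)}")
```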
Pipelining
Improves performance by increasing throughput
Ideal speedup is the number of stages in the pipeline.
Do we achieve this? NO!
The pipeline stage time is limited by the slowest resource, either the ALU operation or the memory access
Fill and drain time also reduce the speedup
Single Cycle, Multiple Cycle, vs. Pipeline
Single Cycle Implementation (every instruction takes one long cycle):
Cycle 1: Load    Cycle 2: Store (finishes early; the rest of the cycle is wasted)

Multiple Cycle Implementation (one short cycle per step):
Cycles 1-5: Load (Ifetch Reg Exec Mem Wr)
Cycles 6-9: Store (Ifetch Reg Exec Mem)
Cycle 10:   R-type (Ifetch ...)

Pipeline Implementation (a new instruction starts every cycle):
Load    Ifetch Reg Exec Mem Wr
Store          Ifetch Reg Exec Mem Wr
R-type                Ifetch Reg Exec Mem Wr
Why Pipeline?
Suppose we execute 100 instructions
Single Cycle Machine
45 ns/cycle x 1 CPI x 100 inst = 4500 ns
Multicycle Machine
10 ns/cycle x 4.6 CPI (due to inst mix) x 100 inst = 4600 ns
Ideal pipelined machine
10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns
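The three totals can be reproduced directly from the numbers on the slide:

```python
# 100-instruction comparison from the slide.
n = 100

single_cycle = 45 * 1 * n        # 45 ns/cycle, CPI 1
multicycle   = 10 * 4.6 * n      # 10 ns/cycle, CPI 4.6 (instruction mix)
pipelined    = 10 * (1 * n + 4)  # 10 ns/cycle, CPI 1, plus 4-cycle drain

print(single_cycle, round(multicycle), pipelined)  # 4500 4600 1040
```

The pipelined machine wins by roughly 4.4x despite using the same 10 ns cycle as the multicycle machine, because (almost) one instruction completes every cycle.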
Why Pipeline? Because the resources are there!
[Figure: five instructions (Inst 0-4) flowing through the Im, Reg, ALU, Dm, Reg stages, each starting one clock cycle after the previous; every stage's hardware is busy every cycle.]
Speedup and Efficiency
A k-stage pipeline processes n tasks in k + (n-1) clock cycles:
k cycles for the first task and 1 cycle each for the remaining n-1 tasks
Total time to process n tasks (clock period t): Tk = [k + (n-1)] t
For the non-pipelined processor: T1 = n k t
Speedup factor: Sk = T1 / Tk = n k t / ([k + (n-1)] t) = n k / (k + n - 1)
If n is very large (n >> k), then Sk ≈ k
Efficiency and Throughput
Efficiency of the k-stage pipeline: Ek = Sk / k = n / (k + n - 1)
Pipeline throughput (the number of tasks per unit time, with f = 1/t):
Hk = n / ([k + (n-1)] t) = n f / (k + n - 1)
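The speedup, efficiency, and throughput formulas can be written out as small functions (t is the clock period, as in the formulas; purely illustrative):

```python
# k-stage pipeline formulas for n tasks with clock period tau.
def speedup(k, n):          # Sk = n*k / (k + n - 1)
    return n * k / (k + n - 1)

def efficiency(k, n):       # Ek = Sk / k = n / (k + n - 1)
    return n / (k + n - 1)

def throughput(k, n, tau):  # Hk = n / ((k + n - 1) * tau), tasks per unit time
    return n / ((k + n - 1) * tau)

# For n >> k the speedup approaches k and the efficiency approaches 1:
print(round(speedup(5, 1_000_000), 3))     # 5.0
print(round(efficiency(5, 1_000_000), 3))  # 1.0
```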
Can pipelining get us into trouble?
Yes: Pipeline Hazards
Structural hazards: attempt to use the same resource two different
ways at the same time
E.g., combined washer/dryer would be a structural hazard or folder busy
doing something else (watching TV)
A single memory for instructions and data causes structural hazards
Data hazards: attempt to use item before it is ready
E.g., one sock of pair in dryer and one in washer; can’t fold until you get
sock from washer through dryer
instruction depends on result of prior instruction still in the pipeline
Control hazards: attempt to make a decision before condition is
evaluated
E.g., washing football uniforms and need to get proper detergent level;
need to see after dryer before next load in
branch instructions
Can always resolve hazards by waiting
pipeline control must detect the hazard
take action (or delay action) to resolve hazards
Slow Down From Stalls
• Perfect pipelining with no hazards → an instruction completes every cycle (total cycles ≈ number of instructions): k + (n-1) ≈ n
  → speedup = increase in clock speed = number of pipeline stages: Sk ≈ nk/n ≈ k
• With hazards and stalls, some cycles (= stall time) go by during which no instruction completes, and then the stalled instruction completes
• Total cycles = number of instructions + stall cycles
• Slowdown because of stalls = 1 / (1 + stall cycles per instruction)
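The effect of stall cycles on the ideal speedup can be sketched as follows (assuming, as above, that total cycles = instructions + stalls and the ideal speedup equals the depth k):

```python
# Pipeline speedup degraded by stalls: ideal speedup k, scaled by
# the slowdown factor 1 / (1 + stall cycles per instruction).
def speedup_with_stalls(k, stalls_per_instr):
    return k / (1 + stalls_per_instr)

print(speedup_with_stalls(5, 0.0))  # 5.0 (ideal, no stalls)
print(speedup_with_stalls(5, 0.5))  # ~3.33 (half a stall per instruction)
```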
Speedup Equation for Pipelining
Compared to unpipelined,
Speedup = (CPI unpipelined x Clock cycle unpipelined) / (CPI pipelined x Clock cycle pipelined)
Now it is evident that
CPI pipelined = Ideal CPI + Pipeline stall cycles per instruction = 1 + Pipeline stall cycles per instruction
Putting this in the equation of speedup (with equal clock cycles and CPI unpipelined equal to the pipeline depth):
Speedup = Pipeline depth / (1 + Pipeline stall cycles per instruction)
Thus, if there are no stalls, the speedup is equal to the number
of pipeline stages, matching our intuition for the ideal case.
Single Memory is a Structural Hazard
[Figure: Load followed by Instr 1-4 in the pipeline; in one cycle the Load's Mem access and Instr 3's instruction fetch both need the single memory at the same time.]
Single Memory is a Structural Hazard
[Figure: same sequence, but Instr 3's fetch is delayed one cycle (a bubble) so it no longer competes with the Load's Mem access.]
One cycle stall for structural hazard
Example: Dual-port vs. Single-port
Machine A: Dual ported memory
Machine B: Single ported memory, but its pipelined implementation has
a 1.05 times faster clock rate
Ideal CPI = 1 for both
Data references make up 40% of the instruction mix
SpeedUpA = [1 / (1 + 0)] x (clock_unpipe / clock_pipe)
         = Pipeline Depth
SpeedUpB = [1 / (1 + 0.4 x 1)] x (clock_unpipe / (clock_unpipe / 1.05))
         = (Pipeline Depth / 1.4) x 1.05
         = 0.75 x Pipeline Depth
SpeedUpA / SpeedUpB = Pipeline Depth / (0.75 x Pipeline Depth) = 1.33
Machine A is 1.33 times faster
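The comparison can be reproduced numerically; the pipeline depth cancels in the ratio, so any value works:

```python
# Dual-ported (A) vs. single-ported (B) memory comparison.
depth = 8  # arbitrary; cancels in the ratio

# Machine A: no structural stalls.
speedup_a = (1 / (1 + 0)) * depth
# Machine B: 40% of instructions are data references, each costing a
# 1-cycle structural stall, but the clock is 1.05x faster.
speedup_b = (1 / (1 + 0.4 * 1)) * depth * 1.05

print(round(speedup_a / speedup_b, 2))  # 1.33
```

The lesson: a 5% faster clock does not come close to paying for a stall on 40% of instructions.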
Control Hazard
When a branch is executed, it may or may not change the
PC to something other than incrementing it.
If a branch changes the PC to its target address, it is
called a taken branch;
If it falls through, it is not taken, or untaken.
If instruction i is a taken branch, then the PC is normally
not changed until the end of ID, after the completion of the
address calculation and comparison.
Control Hazard Solution #1: Stall
[Figure: Add, Beq, then Load in instruction order; the Load's fetch waits two cycles until the Beq outcome is known, losing potential work.]
Stall: wait until decision is clear
Impact: 2 lost cycles (i.e., 3 clock cycles per branch instruction) => slow
Moving the decision to the end of decode by improving the hardware
saves 1 cycle per branch
If 20% of instructions are BEQ and all others have CPI 1, what is the average CPI?
Control Hazard Solution #2: Predict
[Figure: Add, Beq, then Load in instruction order; the Load is fetched immediately after the Beq on the predicted path, with no lost cycle when the prediction is right.]
Predict: guess one direction (taken/untaken) then back up if wrong
Impact: 0 lost cycles per branch instruction if right, 1 if wrong
(right 50% of time)
Need to “Squash” and restart following instruction if wrong
Produce CPI on branch of (1 *.5 + 2 * .5) = 1.5
Total CPI might then be: 1.5 * .2 + 1 * .8 = 1.1 (20% branch)
Control Hazard Solution #3: Delayed Branch
[Figure: Add, Beq, Misc, Load in instruction order; the Misc instruction fills the branch delay slot and executes regardless of the branch outcome.]
Delayed Branch: Redefine branch behavior (takes place after
next instruction)
Impact: 0 extra clock cycles per branch instruction if the compiler can find an
instruction to put in the "slot" (possible about 50% of the time)
The longer the pipeline, the harder to fill
Used by MIPS architecture
Scheduling Branch Delay Slots (Fig A.14)
A. From before branch:
    add $1,$2,$3
    if $2=0 then
        delay slot
becomes:
    if $2=0 then
        add $1,$2,$3

B. From branch target:
    sub $4,$5,$6
    add $1,$2,$3
    if $1=0 then
        delay slot
becomes:
    add $1,$2,$3
    if $1=0 then
        sub $4,$5,$6

C. From fall through:
    add $1,$2,$3
    if $1=0 then
        delay slot
    sub $4,$5,$6
becomes:
    add $1,$2,$3
    if $1=0 then
        sub $4,$5,$6
A is the best choice, fills delay slot & reduces instruction count (IC)
In B, the sub instruction may need to be copied, increasing IC
In B and C, must be okay to execute sub when branch fails
More On Delayed Branch
Compiler effectiveness for a single branch delay slot:
Fills about 60% of branch delay slots
About 80% of instructions executed in branch delay slots are useful in computation
About 50% (60% x 80%) of slots are usefully filled
Evaluating Branch Alternatives
A simplified pipeline speedup equation for Branch:
Pipeline speedup = Pipeline depth / (1 + Branch frequency x Branch penalty)
Assume that in a deeper pipeline, it takes at least three pipeline
stages before the branch-target address is known and an
additional cycle before the branch condition is evaluated
Assume 4% unconditional branch
6% conditional branch- untaken
10% conditional branch-taken
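The simplified speedup equation can be evaluated for this branch mix. The penalties below are illustrative assumptions, not values given on the slide:

```python
# speedup = depth / (1 + sum of branch_freq * branch_penalty)
# Branch mix from the slide: 4% unconditional, 6% conditional untaken,
# 10% conditional taken. Penalties here are assumed for illustration.
def speedup(depth, mix_and_penalties):
    stall = sum(freq * penalty for freq, penalty in mix_and_penalties)
    return depth / (1 + stall)

# Stall on every branch, assuming a 3-cycle penalty:
stall_always = [(0.04, 3), (0.06, 3), (0.10, 3)]
# Predict untaken: untaken conditionals are free; taken and
# unconditional branches pay the (assumed) 3-cycle penalty:
predict_untaken = [(0.04, 3), (0.06, 0), (0.10, 3)]

print(round(speedup(5, stall_always), 2))
print(round(speedup(5, predict_untaken), 2))
```

Any scheme that makes some branches free raises the speedup, since it shrinks the average stall term in the denominator.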
Data Hazard on r1
An instruction depends on the result of a previous instruction still in the pipeline
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or  r8,r1,r9
xor r10,r1,r11
Data Hazard on r1:
• Dependencies backwards in time are hazards
[Figure: add writes r1 in WB (cycle 5), but sub, and, and or read r1 in their ID/RF stages (cycles 3-5); their dependences reach backwards in time. Only xor reads r1 safely.]
Data Hazard Solution:
• “Forward” result from one stage to another
[Figure: same sequence, but the ALU result of add is forwarded from the pipeline latches straight to the ALU inputs of sub, and, and or, removing the hazards.]
• "or" is OK if register read/write within a cycle is defined properly (write in the first half of the cycle, read in the second half)
• Forwarding can't prevent all data hazards! What about lw followed by an R-type that uses the loaded value?
Forwarding (or Bypassing): What about Loads?
• Dependencies backwards in time are hazards
[Figure: lw r1,0(r2) produces r1 at the end of Mem (cycle 4), but sub r4,r1,r3 needs it at the ALU in cycle 4; the dependence still reaches backwards in time even with forwarding.]
• Can’t solve with forwarding:
• Must delay/stall instruction dependent on loads
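The distinction between forwardable hazards and load-use hazards can be sketched with a toy classifier. This is a simplification that only checks adjacent instruction pairs and assumes full forwarding otherwise; the instruction encoding (op, dest, sources) is invented for illustration:

```python
# Classify RAW hazards between adjacent instructions: an ALU result
# can be forwarded in time, but a load's result is only available
# after Mem, so a use in the very next instruction must stall.
def classify_hazards(instrs):
    events = []
    for i in range(1, len(instrs)):
        op, dest, _ = instrs[i - 1]
        _, _, use_srcs = instrs[i]
        if dest in use_srcs:
            if op == "lw":
                events.append((i, "stall then forward"))
            else:
                events.append((i, "forward"))
    return events

prog = [
    ("lw",  "r1", ["r2"]),        # lw  r1, 0(r2)
    ("sub", "r4", ["r1", "r3"]),  # needs r1 in the next cycle
]
print(classify_hazards(prog))  # [(1, 'stall then forward')]
```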
Forwarding (or Bypassing): What about Loads
• Dependencies backwards in time are hazards
[Figure: the same lw/sub pair, but sub is stalled one cycle; after the stall, the loaded value can be forwarded from Mem to the ALU in time.]
• Can’t solve with forwarding:
• Must delay/stall instruction dependent on loads
Software Scheduling to Avoid Load Hazards
Try producing fast code for
a = b + c;
d = e – f;
assuming a, b, c, d, e, and f are in memory.
Slow code:
    LW   Rb,b
    LW   Rc,c
    ADD  Ra,Rb,Rc
    SW   a,Ra
    LW   Re,e
    LW   Rf,f
    SUB  Rd,Re,Rf
    SW   d,Rd

Fast code:
    LW   Rb,b
    LW   Rc,c
    LW   Re,e
    ADD  Ra,Rb,Rc
    LW   Rf,f
    SW   a,Ra
    SUB  Rd,Re,Rf
    SW   d,Rd
The compiler optimizes for performance by reordering (scheduling) instructions at compile time.
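The benefit of the reordering can be counted mechanically. This sketch assumes, as on the previous slides, one stall whenever an instruction reads a register loaded by the immediately preceding LW; the (op, dest, sources) tuples are an invented encoding of the code above:

```python
# Count load-use stalls: one stall when an instruction reads a
# register loaded by the immediately preceding LW.
def count_load_stalls(prog):
    stalls = 0
    for i in range(1, len(prog)):
        op, dest, _ = prog[i - 1]
        if op == "LW" and dest in prog[i][2]:
            stalls += 1
    return stalls

slow = [
    ("LW", "Rb", []), ("LW", "Rc", []),
    ("ADD", "Ra", ["Rb", "Rc"]),  # uses Rc right after its load: stall
    ("SW", "a", ["Ra"]),
    ("LW", "Re", []), ("LW", "Rf", []),
    ("SUB", "Rd", ["Re", "Rf"]),  # uses Rf right after its load: stall
    ("SW", "d", ["Rd"]),
]
fast = [
    ("LW", "Rb", []), ("LW", "Rc", []), ("LW", "Re", []),
    ("ADD", "Ra", ["Rb", "Rc"]),
    ("LW", "Rf", []),
    ("SW", "a", ["Ra"]),
    ("SUB", "Rd", ["Re", "Rf"]),
    ("SW", "d", ["Rd"]),
]
print(count_load_stalls(slow), count_load_stalls(fast))  # 2 0
```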
Summary: Pipelining
What makes it easy
all instructions are the same length
just a few instruction formats
memory operands appear only in loads and stores; memory operands are aligned
What makes it hard?
structural hazards: suppose we had only one memory
control hazards: need to worry about branch instructions
data hazards: an instruction depends on a previous instruction
We’ll talk about modern processors and what really
makes it hard:
trying to improve performance with out-of-order execution, etc.
Summary
Pipelining is a fundamental concept
multiple steps using distinct resources
Utilize capabilities of the Datapath by pipelined
instruction processing
start next instruction while working on the current one
limited by length of longest stage (plus fill/flush)
detect and resolve hazards