Kien-Truc-May-Tinh-Nang-Cao - Tran-Ngoc-Thinh - Lec03-Pipelining - (Cuuduongthancong - Com)


3/19/2013

dce
2011

ADVANCED COMPUTER
ARCHITECTURE
Faculty of Computer Science and Engineering
Department of Computer Engineering
BK
TP.HCM

Trần Ngọc Thịnh


http://www.cse.hcmut.edu.vn/~tnthinh

©2013, dce


Pipelining

What is pipelining?
• Implementation technique in which multiple
instructions are overlapped in execution
• Real-life pipelining examples?
– Laundry
– Factory production lines
– Traffic??


Instruction Pipelining (1/2)


• Instruction pipelining is a CPU implementation technique where multiple operations on a number of instructions are overlapped.
• An instruction execution pipeline involves a number of steps,
where each step completes a part of an instruction. Each
step is called a pipeline stage or a pipeline segment.
• The stages or steps are connected in a linear fashion: one
stage to the next to form the pipeline -- instructions enter at
one end and progress through the stages and exit at the other
end.
• The time to move an instruction one step down the pipeline is equal to the machine cycle and is determined by the stage with the longest processing delay.


Instruction Pipelining (2/2)


• Pipelining increases the CPU instruction throughput: the number of instructions completed per unit of time.
– Under ideal conditions (no stall cycles), instruction
throughput is one instruction per machine cycle, or ideal
CPI = 1
• Pipelining does not reduce the execution time of an
individual instruction: The time needed to complete
all processing steps of an instruction (also called
instruction completion latency).
– Minimum instruction latency = n cycles, where n is the
number of pipeline stages
Pipelining Example: Laundry
• Laundry example: Ann, Brian, Cathy, Dave (A, B, C, D) each have one load of clothes to wash, dry, and fold
• Washer takes 30 minutes
• Dryer takes 40 minutes
• “Folder” takes 20 minutes
Sequential Laundry
[Timeline: 6 PM to midnight. Loads A–D run back-to-back in task order, each taking 30 (wash) + 40 (dry) + 20 (fold) minutes.]
Sequential laundry takes 6 hours for 4 loads
If they learned pipelining, how long would laundry take?
Pipelined Laundry: Start work ASAP
[Timeline: 6 PM onward. A new wash starts every 40 minutes; the dryer (slowest stage) sets the pace, and loads A–D overlap in different stages.]
Pipelined laundry takes 3.5 hours for 4 loads
Speedup = 6/3.5 = 1.7
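The arithmetic behind these two figures can be checked with a short sketch (stage times in minutes, as given above):

```python
# Laundry example: 4 loads through wash (30 min), dry (40 min), fold (20 min).
WASH, DRY, FOLD = 30, 40, 20
LOADS = 4

# Sequential: each load finishes all three stages before the next starts.
sequential = LOADS * (WASH + DRY + FOLD)   # 360 min = 6 hours

# Pipelined: after the first wash, the 40-min dryer (slowest stage) sets the
# pace; the last load still needs its 20-min fold at the end.
pipelined = WASH + LOADS * DRY + FOLD      # 210 min = 3.5 hours

print(sequential / 60, pipelined / 60, round(sequential / pipelined, 1))
```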
Pipelining Lessons
• Pipelining doesn’t help latency of a single task; it helps throughput of the entire workload
• Pipeline rate limited by slowest pipeline stage
• Multiple tasks operating simultaneously
• Potential speedup = number of pipe stages
• Unbalanced lengths of pipe stages reduces speedup
• Time to “fill” pipeline and time to “drain” it reduces speedup

Pipelining Example: Laundry
• Pipelined Laundry Observations:
– At some point, all stages of washing will be
operating concurrently
– Pipelining doesn’t reduce number of stages
• doesn’t help latency of single task
• helps throughput of entire workload

– As long as we have separate resources, we can pipeline the tasks
– Multiple tasks operating simultaneously use different resources

Pipelining Example: Laundry
• Pipelined Laundry Observations:
– Speedup due to pipelining depends on the number
of stages in the pipeline

– Pipeline rate limited by slowest pipeline stage
  • If dryer needs 45 min, time for all stages has to be 45 min to accommodate it
  • Unbalanced lengths of pipe stages reduces speedup
– Time to “fill” pipeline and time to “drain” it reduces speedup
– If one load depends on another, we will have to wait (Delay/Stall for Dependencies)
CPU Pipelining
• 5 stages of a MIPS instruction
– Fetch instruction from instruction memory
– Read registers while decoding instruction
– Execute operation or calculate address, depending on
the instruction type
– Access an operand from data memory
– Write result into a register
• We can reduce the cycles to fit the stages.

Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5

Load Ifetch Reg/Dec Exec Mem Wr

CPU Pipelining
• Example: Resources for Load Instruction
  – Fetch instruction from instruction memory (Ifetch) => Instruction memory (IM)
  – Read registers while decoding instruction (Reg/Dec) => Register file & decoder (Reg)
  – Execute operation or calculate address, depending on the instruction type (Exec) => ALU
  – Access an operand from data memory (Mem) => Data memory (DM)
  – Write result into a register (Wr) => Register file (Reg)
CPU Pipelining
• Note that accessing source & destination registers is performed in two
different parts of the cycle
• We need to decide upon which part of the cycle should reading and
writing to the register file take place.
[Pipeline diagram: Inst 0–4 each flow through Im, Reg (read), ALU, Dm, Reg (write), offset by one cycle; fill time at the start, sink (drain) time at the end.]
CPU Pipelining: Example
• Single-Cycle, non-pipelined execution
•Total time for 3 instructions: 24 ns

[Diagram: program execution order lw $1,100($0); lw $2,200($0); lw $3,300($0) executed sequentially; each takes 8 ns (instruction fetch, reg, ALU, data access, reg) before the next begins.]

CPU Pipelining: Example
• Single-cycle, pipelined execution
– Improve performance by increasing instruction throughput
– Total time for 3 instructions = 14 ns
– Each instruction adds 2 ns to total execution time
– Stage time limited by slowest resource (2 ns)
– Assumptions:
• Write to register occurs in 1st half of clock
• Read from register occurs in 2nd half of clock
[Diagram: the same three lw instructions overlapped; a new instruction starts every 2 ns, and each stage (instruction fetch, reg, ALU, data access, reg) takes 2 ns.]
CPU Pipelining: Example
• Assumptions:
– Only consider the following instructions:
lw, sw, add, sub, and, or, slt, beq
– Operation times for instruction classes are:
• Memory access 2 ns
• ALU operation 2 ns
• Register file read or write 1 ns
– Use a single- cycle (not multi-cycle) model
– Clock cycle must accommodate the slowest instruction (2 ns)
– Both pipelined & non-pipelined approaches use the same HW components

Instr. class            IFetch  RegRead  ALUOp  DataAccess  RegWrite  Total
lw                      2 ns    1 ns     2 ns   2 ns        1 ns      8 ns
sw                      2 ns    1 ns     2 ns   2 ns                  7 ns
add, sub, and, or, slt  2 ns    1 ns     2 ns               1 ns      6 ns
beq                     2 ns    1 ns     2 ns                         5 ns

CPU Pipelining Example: (1/2)
• Theoretically (n tasks, k stages, latency p per task):
  – Speedup = n·p / (p + (p/k)·(n − 1)) ≈ k (for large n)
• Practically:
– Stages are imperfectly balanced
– Pipelining needs overhead
– Speedup less than number of stages
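A minimal sketch of the theoretical formula above (n tasks, k balanced stages, per-task latency p; the function name is ours):

```python
def speedup(n, k, p):
    """Ideal pipelining speedup: n tasks, k balanced stages, latency p per task."""
    unpipelined = n * p
    pipelined = p + (p / k) * (n - 1)   # fill time p, then one result every p/k
    return unpipelined / pipelined

print(round(speedup(4, 3, 90), 2))       # 2.0 for only 4 tasks
print(round(speedup(10_000, 3, 90), 2))  # ~3.0: approaches k for large n
```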
CPU Pipelining Example: (2/2)
• If we have 3 consecutive instructions
– Non-pipelined needs 8 x 3 = 24 ns
– Pipelined needs 14 ns
=> Speedup = 24 / 14 = 1.7
• If we have 1003 consecutive instructions
  – Add 1000 more instructions (i.e. 1003 instructions) to the previous example
    • Non-pipelined total time = 1000 × 8 + 24 = 8024 ns
    • Pipelined total time = 1000 × 2 + 14 = 2014 ns
  => Speedup ≈ 3.98 ≈ (8 ns / 2 ns)
  ≈ near-perfect speedup
=> Performance increases for larger number of instructions
(throughput)
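These numbers can be reproduced directly (8 ns per unpipelined instruction, 2 ns per pipelined cycle, 14 ns for the first three pipelined instructions):

```python
def unpipelined_ns(n):
    return 8 * n                # 8 ns per instruction, one after another

def pipelined_ns(n):
    return 14 + 2 * (n - 3)    # first 3 take 14 ns; each extra adds one 2 ns cycle

print(unpipelined_ns(3), pipelined_ns(3))                   # 24 14
print(round(unpipelined_ns(1003) / pipelined_ns(1003), 2))  # ~3.98
```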
Pipelining MIPS Instruction Set
• MIPS was designed with pipelining in mind
=> Pipelining is easy in MIPS:
– All instructions are the same length
– Limited instruction format
– Memory operands appear only in lw & sw instructions
– Operands must be aligned in memory

1. All MIPS instructions are the same length
– Fetch instruction in 1st pipeline stage
– Decode instructions in 2nd stage
– If instruction length varies (e.g. 80x86), pipelining will be
more challenging
Pipelining MIPS Instruction Set
2. MIPS has limited instruction format
– Source register in the same place for each
instruction (symmetric)
– 2nd stage can begin reading at the same time as
decoding
– If instruction format wasn’t symmetric, stage 2
should be split into 2 distinct stages
=> Total stages = 6 (instead of 5)

Pipelining MIPS Instruction Set
3. Memory operands appear only in lw & sw
instructions
– We can use the execute stage to calculate
memory address
– Access memory in the next stage
– If we needed to operate on operands in memory
(e.g. 80x86), stages 3 & 4 would expand to
• Address calculation
• Memory access
• Execute

Pipelining MIPS Instruction Set
4. Operands must be aligned in memory
– Transfer of more than one data operand can be
done in a single stage with no conflicts
– Need not worry about single data transfer
instruction requiring 2 data memory accesses
– Requested data can be transferred between the
CPU & memory in a single pipeline stage

Instruction Pipelining Review
– MIPS In-Order Single-Issue Integer Pipeline
– Performance of Pipelines with Stalls
– Pipeline Hazards
• Structural hazards
• Data hazards
  – Minimizing Data Hazard Stalls by Forwarding
  – Data Hazard Classification
  – Data Hazards Present in Current MIPS Pipeline
• Control hazards
  – Reducing Branch Stall Cycles
  – Static Compiler Branch Prediction
  – Delayed Branch Slot
    » Canceling Delayed Branch Slot
MIPS In-Order Single-Issue Integer Pipeline: Ideal Operation (No stall cycles)
Fill Cycles = number of stages -1
Clock number (time in clock cycles →)

Instruction        1    2    3    4    5    6    7    8    9
Instruction I      IF   ID   EX   MEM  WB
Instruction I+1         IF   ID   EX   MEM  WB
Instruction I+2              IF   ID   EX   MEM  WB
Instruction I+3                   IF   ID   EX   MEM  WB
Instruction I+4                        IF   ID   EX   MEM  WB

Time to fill the pipeline = 4 cycles = n − 1; the first instruction (I) completes at cycle 5, the last (I+4) at cycle 9.

MIPS pipeline stages:
IF  = Instruction Fetch
ID  = Instruction Decode
EX  = Execution
MEM = Memory Access
WB  = Write Back

n = 5 pipeline stages; ideal CPI = 1; in-order = instructions executed in original program order.
Ideal pipeline operation without any stall cycles.
5 Steps of MIPS Datapath
Stages: Instruction Fetch | Instr. Decode / Reg. Fetch | Execute / Addr. Calc | Memory Access | Write Back

[Datapath figure: IF/ID, ID/EX, EX/MEM, MEM/WB pipeline registers separate the stages; next-PC mux, adder, register file, ALU, data memory, sign extend, and write-back mux.]

Per-stage register transfers:
  IF:  IR <= mem[PC]; PC <= PC + 4
  ID:  A <= Reg[IR.rs]; B <= Reg[IR.rt]
  EX:  rslt <= A op(IR.op) B
  MEM: WB <= rslt
  WB:  Reg[IR.rd] <= WB

• Data stationary control – local decode for each instruction phase / pipeline stage

Visualizing Pipelining

Figure A.2, Page A-8


Time (clock cycles): Cycle 1 … Cycle 7
[Pipeline diagram: four instructions in program order, each flowing through Ifetch, Reg, ALU, DMem, Reg (i.e. IF ID EX MEM WB), one cycle apart.]
• Write to the destination register in the first half of the WB cycle
• Read operand registers in the second half of the ID cycle
Operation of ideal integer in-order 5-stage pipeline

Pipelining Performance Example
• Example: For an unpipelined CPU:
  – Clock cycle = 1 ns; 4 cycles for ALU operations and branches and 5 cycles for memory operations, with instruction frequencies of 40%, 20% and 40%, respectively.
  – If pipelining adds 0.2 ns to the machine clock cycle, then the speedup in instruction execution from pipelining is:

    Non-pipelined average instruction execution time = Clock cycle × Average CPI
      = 1 ns × ((40% + 20%) × 4 + 40% × 5) = 1 ns × 4.4 = 4.4 ns

    In the pipelined implementation, five stages are used with an average instruction execution time of 1 ns + 0.2 ns = 1.2 ns

    Speedup from pipelining = Instruction time unpipelined / Instruction time pipelined
      = 4.4 ns / 1.2 ns = 3.7 times faster
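A quick sketch of the same calculation (frequencies and cycle counts as given above):

```python
freq   = {"alu": 0.40, "branch": 0.20, "mem": 0.40}   # instruction mix
cycles = {"alu": 4,    "branch": 4,    "mem": 5}      # cycles per class

cpi = sum(freq[k] * cycles[k] for k in freq)          # 4.4
unpipelined_time = 1.0 * cpi                          # 1 ns clock -> 4.4 ns
pipelined_time   = 1.0 + 0.2                          # CPI 1 plus 0.2 ns overhead

print(round(unpipelined_time / pipelined_time, 1))    # ~3.7x faster
```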

Pipeline Hazards
• Hazards are situations in pipelining which prevent the next
instruction in the instruction stream from executing during the
designated clock cycle possibly resulting in one or more stall
(or wait) cycles.
• Hazards reduce the ideal speedup (increase CPI > 1) gained
from pipelining and are classified into three classes:
– Structural hazards: Arise from hardware resource conflicts when the
available hardware cannot support all possible combinations of
instructions.
– Data hazards: Arise when an instruction depends on the result of a
previous instruction in a way that is exposed by the overlapping of
instructions in the pipeline
– Control hazards: Arise from the pipelining of conditional branches and
other instructions that change the PC
How do we deal with hazards?
• Often, pipeline must be stalled
• Stalling pipeline usually lets some instruction(s) in
pipeline proceed, another/others wait for data,
resource, etc.
• A note on terminology:
  – If we say an instruction was “issued later than instruction x”, we mean that it was issued after instruction x and is not as far along in the pipeline
  – If we say an instruction was “issued earlier than instruction x”, we mean that it was issued before instruction x and is further along in the pipeline

Stalls and performance
• Stalls impede progress of a pipeline and result in deviation
from 1 instruction executing/clock cycle
• Pipelining can be viewed to:
  – Decrease CPI or clock cycle time per instruction
  – Let’s see what effect stalls have on CPI…

• CPI pipelined = Ideal CPI + Pipeline stall cycles per instruction


= 1 + Pipeline stall cycles per instruction
• Ignoring overhead and assuming stages are balanced:

  Speedup = CPI unpipelined / (1 + Pipeline stall cycles per instruction)

Even more pipeline performance issues!
• This results in:
  Clock cycle pipelined = Clock cycle unpipelined / Pipeline depth
• Which leads to:
  Pipeline depth = Clock cycle unpipelined / Clock cycle pipelined
• So:
  Speedup from pipelining = [1 / (1 + Pipeline stall cycles per instruction)] × [Clock cycle unpipelined / Clock cycle pipelined]
                          = [1 / (1 + Pipeline stall cycles per instruction)] × Pipeline depth
• If no stalls, speedup is equal to the number of pipeline stages in the ideal case
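The final relation above, as a small function (the name is ours; balanced stages, overhead ignored):

```python
def pipeline_speedup(depth, stall_cycles_per_instr):
    """Speedup = pipeline depth / (1 + stall cycles per instruction)."""
    return depth / (1 + stall_cycles_per_instr)

print(pipeline_speedup(5, 0))    # 5.0: no stalls -> speedup = number of stages
print(pipeline_speedup(5, 1))    # 2.5: one stall per instruction halves it
```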

Structural Hazards
• In pipelined machines overlapped instruction execution
requires pipelining of functional units and duplication of
resources to allow all possible combinations of instructions in
the pipeline.

• If a resource conflict arises due to a hardware resource being required by more than one instruction in a single cycle, and one or more such instructions cannot be accommodated, then a structural hazard has occurred, for example:
– when a pipelined machine has a shared single-memory pipeline stage
for data and instructions.
  => stall the pipeline for one cycle for memory data access

An example of a structural hazard
[Pipeline diagram: Load followed by Instructions 1–4, each flowing through Mem, Reg, ALU, DM, Reg one cycle apart. With a single shared memory, Load's data access (DM) and Instruction 3's instruction fetch (Mem) fall in the same cycle.]

What’s the problem here?

How is it resolved?

[Pipeline diagram: Load and Instructions 1–2 proceed normally; a stall (bubble) is inserted before Instruction 3's fetch so it no longer conflicts with Load's data-memory access.]

Pipeline generally stalled by inserting a “bubble” or NOP

Or alternatively…
Clock Number

Inst. #     1    2    3    4      5    6    7    8    9    10
LOAD        IF   ID   EX   MEM    WB
Inst. i+1        IF   ID   EX     MEM  WB
Inst. i+2             IF   ID     EX   MEM  WB
Inst. i+3                  stall  IF   ID   EX   MEM  WB
Inst. i+4                         IF   ID   EX   MEM  WB
Inst. i+5                              IF   ID   EX   MEM
Inst. i+6                                   IF   ID   EX

The LOAD instruction “steals” an instruction fetch cycle, which will cause the pipeline to stall.
Thus, no instruction completes on clock cycle 8.


A Structural Hazard Example
• Given that data references are 40% for a specific instruction
mix or program, and that the ideal pipelined CPI ignoring
hazards is equal to 1.

• A machine with a data-memory-access structural hazard requires a single stall cycle for data references and has a clock rate 1.05 times higher than the ideal machine. Ignoring other performance losses for this machine:

  Average instruction time = CPI × Clock cycle time
                           = (1 + 0.4 × 1) × (Clock cycle time ideal / 1.05)
                           ≈ 1.3 × Clock cycle time ideal
Therefore the machine without the hazard is better.
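Checking the arithmetic (0.4 data-reference frequency, one stall per data reference, clock rate 1.05× higher on the hazard machine):

```python
cpi = 1 + 0.4 * 1            # ideal CPI 1, plus one stall cycle per data reference
cycle = 1 / 1.05             # clock rate 1.05x higher -> shorter cycle, in ideal-cycle units
relative_time = cpi * cycle  # average instruction time relative to the ideal machine

print(round(relative_time, 2))   # ~1.33 (the slide rounds to 1.3): hazard machine is slower
```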

Remember the common case!
• All things being equal, a machine without structural hazards will always have a lower CPI.
• But, in some cases it may be better to allow them than to eliminate them.
• These are situations a computer architect might have to consider:
– Is pipelining functional units or duplicating them costly in
terms of HW?
– Does structural hazard occur often?
– What’s the common case???
Data Hazards
• Data hazards occur when the pipeline changes the order of
read/write accesses to instruction operands in such a way
that the resulting access order differs from the original
sequential instruction operand access order of the
unpipelined machine resulting in incorrect execution.
• Data hazards may require one or more instructions to be
stalled to ensure correct execution.
• Example:
ADD R1, R2, R3
SUB R4, R1, R5
AND R6, R1, R7
OR R8,R1,R9
XOR R10, R1, R11
– All the instructions after ADD use the result of the ADD instruction
– SUB, AND instructions need to be stalled for correct execution.

Data Hazard on R1
Time (clock cycles): IF  ID/RF  EX  MEM  WB
[Pipeline diagram: add r1,r2,r3 writes r1 in WB (cycle 5); sub r4,r1,r3, and r6,r1,r7, and or r8,r1,r9 all read r1 in ID before that write completes; only xor r10,r1,r11 reads r1 safely.]

Minimizing Data hazard Stalls by Forwarding
• Forwarding is a hardware-based technique (also called register
bypassing or short-circuiting) used to eliminate or minimize data
hazard stalls.
• Using forwarding hardware, the result of an instruction is copied
directly from where it is produced (ALU, memory read port etc.), to
where subsequent instructions need it (ALU input register, memory
write port etc.)
• For example, in the MIPS integer pipeline with forwarding:
– The ALU result from the EX/MEM register may be forwarded or fed back to the
ALU input latches as needed instead of the register operand value read in the ID
stage.
– Similarly, the Data Memory Unit result from the MEM/WB register may be fed back
to the ALU input latches as needed .
– If the forwarding hardware detects that a previous ALU operation is to write the
register corresponding to a source for the current ALU operation, control logic
selects the forwarded result as the ALU input rather than the value read from the
register file.
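The forwarding condition just described can be sketched as follows; the field names (reg_write, rd) and the mux select values are illustrative assumptions, not actual MIPS signal names:

```python
# Sketch of a forwarding unit's select logic for one ALU input.
def forward_a(ex_mem, mem_wb, id_ex_rs):
    """Select ALU input A: 2 = EX/MEM result, 1 = MEM/WB result, 0 = register file."""
    if ex_mem["reg_write"] and ex_mem["rd"] != 0 and ex_mem["rd"] == id_ex_rs:
        return 2                      # most recent producer wins
    if mem_wb["reg_write"] and mem_wb["rd"] != 0 and mem_wb["rd"] == id_ex_rs:
        return 1
    return 0

# add r1,r2,r3 sits in EX/MEM; the next instruction reads r1 in EX.
print(forward_a({"reg_write": True, "rd": 1}, {"reg_write": False, "rd": 0}, 1))  # 2
```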

HW Change for Forwarding

[Datapath figure: multiplexers at each ALU input select among the ID/EX register values, the EX/MEM ALU result, and the MEM/WB result; the immediate path and data memory feed the same muxes.]

What circuit detects and resolves this hazard?


Forwarding to Avoid Data Hazard


[Pipeline diagram: the ALU result of add r1,r2,r3 is forwarded from the EX/MEM and MEM/WB registers straight to the ALU inputs of the following sub r4,r1,r3, and r6,r1,r7, or r8,r1,r9, and xor r10,r1,r11, so no stalls are needed.]

Forwarding to Avoid LW-SW Data Hazard
[Pipeline diagram: r1 from add r1,r2,r3 is forwarded to lw r4,0(r1)'s address calculation, and the loaded r4 is forwarded from MEM/WB to sw r4,12(r1)'s memory stage; or r8,r6,r9 and xor r10,r9,r11 proceed normally.]

Data Hazard Classification
Given two instructions I, J, with I occurring before J in an instruction stream:
• RAW (read after write): A true data dependence. J tries to read a source before I writes it, so J incorrectly gets the old value.
• WAW (write after write): A name dependence. J tries to write an operand before it is written by I; the writes end up being performed in the wrong order.
• WAR (write after read): A name dependence. J tries to write to a destination before it is read by I, so I incorrectly gets the new value.
• RAR (read after read): Not a hazard.
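The three hazard classes can be checked mechanically; this is an illustrative sketch (the set-based instruction encoding is ours, not from the lecture):

```python
# Encode each instruction as (registers written, registers read) and intersect.
def classify(i, j):
    """i occurs before j in program order; returns the hazard classes present."""
    hazards = set()
    if i[0] & j[1]: hazards.add("RAW")   # j reads what i writes
    if i[0] & j[0]: hazards.add("WAW")   # both write the same register
    if i[1] & j[0]: hazards.add("WAR")   # j writes what i reads
    return hazards or {"none (RAR at most)"}

add_i = ({"r1"}, {"r2", "r3"})   # ADD R1, R2, R3
sub_j = ({"r4"}, {"r1", "r5"})   # SUB R4, R1, R5
print(classify(add_i, sub_j))    # {'RAW'}
```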

Data Hazard Classification
[Diagram: I precedes J in program order, sharing an operand. I writes, J reads => RAW; I reads, J writes => WAR; both write => WAW; both read => RAR (not a hazard).]

Read after write (RAW) hazards
• With RAW hazard, instruction j tries to read a source operand
before instruction i writes it.
• Thus, j would incorrectly receive an old or incorrect value

• Graphically/Example:
    … j … i …      (i is a write instruction issued before j; j is a read instruction issued after i)
    i: ADD R1, R2, R3
    j: SUB R4, R1, R6
• Can use stalling or forwarding to resolve this hazard

Write after write (WAW) hazards
• With WAW hazard, instruction j tries to write an operand
before instruction i writes it.

• The writes are performed in the wrong order, leaving the value written by the earlier instruction
• Graphically/Example:
    … j … i …      (i issued before j; both write the same register)
    i: SUB R1, R4, R3
    j: ADD R1, R2, R3

Write after read (WAR) hazards
• With WAR hazard, instruction j tries to write an operand
before instruction i reads it.

• Instruction i would incorrectly receive the newer value of its operand;
  – Instead of getting the old value, it could receive some newer, undesired value
• Graphically/Example:
    … j … i …      (i reads the register that j, issued after it, writes)
    i: SUB R4, R1, R3
    j: ADD R1, R2, R3
Data Hazards Requiring Stall Cycles
• In some code sequence cases, potential data hazards cannot be handled
by bypassing. For example:
Lw R1, 0 (R2)
SUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
• The LW (load word) instruction has the data available at the end of clock cycle 4 (MEM cycle).
• The SUB instruction needs the data in R1 at the beginning of that cycle.
• The hazard is prevented by a hardware pipeline interlock, causing a stall cycle.


Data Hazard Even with Forwarding

[Pipeline diagram: lw r1,0(r2) produces r1 only at the end of MEM (cycle 4), but sub r4,r1,r6 needs it at the start of EX in that same cycle — forwarding alone cannot send a result backward in time; and r6,r1,r7 and or r8,r1,r9 follow.]

Data Hazard Even with Forwarding

[Pipeline diagram: a one-cycle bubble is inserted after lw r1,0(r2) so its MEM result can be forwarded to sub r4,r1,r6's EX stage; and r6,r1,r7 and or r8,r1,r9 are each delayed one cycle as well.]

Hardware Pipeline Interlocks
• A hardware pipeline interlock detects a data hazard and stalls
the pipeline until the hazard is cleared.
• The CPI for the stalled instruction increases by the length of
the stall.
• For the previous example (no stall cycle):
    LW  R1, 0(R1)   IF  ID  EX  MEM  WB
    SUB R4, R1, R5      IF  ID  EX   MEM  WB
    AND R6, R1, R7          IF  ID   EX   MEM  WB
    OR  R8, R1, R9              IF   ID   EX   MEM  WB

• With stall cycle (stall + forward):
    LW  R1, 0(R1)   IF  ID  EX     MEM    WB
    SUB R4, R1, R5      IF  ID     STALL  EX   MEM  WB
    AND R6, R1, R7          IF     STALL  ID   EX   MEM  WB
    OR  R8, R1, R9                 STALL  IF   ID   EX   MEM  WB

Data hazards and the compiler
• The compiler should be able to help eliminate some stalls caused by data hazards
• e.g. the compiler could avoid generating a LOAD instruction that is immediately followed by an instruction that uses the result of the LOAD’s destination register
• This technique is called “pipeline/instruction scheduling”
Some example situations
Situation: No dependence
  Example: LW R1, 45(R2) / ADD R5, R6, R7 / SUB R8, R6, R7 / OR R9, R6, R7
  Action: No hazard possible because no dependence exists on R1 in the immediately following three instructions.

Situation: Dependence requiring stall
  Example: LW R1, 45(R2) / ADD R5, R1, R7 / SUB R8, R6, R7 / OR R9, R6, R7
  Action: Comparators detect the use of R1 in the ADD and stall the ADD (and SUB and OR) before the ADD begins EX.

Situation: Dependence overcome by forwarding
  Example: LW R1, 45(R2) / ADD R5, R6, R7 / SUB R8, R1, R7 / OR R9, R6, R7
  Action: Comparators detect the use of R1 in SUB and forward the result of the LOAD to the ALU in time for SUB to begin EX.

Situation: Dependence with accesses in order
  Example: LW R1, 45(R2) / ADD R5, R6, R7 / SUB R8, R6, R7 / OR R9, R1, R7
  Action: No action required because the read of R1 by OR occurs in the second half of the ID phase, while the write of the loaded data occurred in the first half.
Static Compiler Instruction Scheduling (Re-Ordering) for Data Hazard Stall Reduction


• Many types of stalls resulting from data hazards are very frequent. For example:
    A = B + C
  produces a stall when loading the second data value (B).
• Rather than allow the pipeline to stall, the compiler could sometimes schedule the pipeline to avoid stalls.
• Compiler pipeline or instruction scheduling involves rearranging the code sequence (instruction reordering) to eliminate or reduce the number of stall cycles.

Static  = at compilation time, by the compiler
Dynamic = at run time, by hardware in the CPU

Static Compiler Instruction Scheduling Example
• For the code sequence below, where a, b, c, d, e, and f are in memory:
    a = b + c
    d = e - f
• Assuming loads have a latency of one clock cycle, the following
code or pipeline compiler schedule eliminates stalls:

Original code with stalls:          Scheduled code with no stalls:
    LW  Rb,b                            LW  Rb,b
    LW  Rc,c                            LW  Rc,c
    (stall)                             LW  Re,e
    ADD Ra,Rb,Rc                        ADD Ra,Rb,Rc
    SW  Ra,a                            LW  Rf,f
    LW  Re,e                            SW  Ra,a
    LW  Rf,f                            SUB Rd,Re,Rf
    (stall)                             SW  Rd,d
    SUB Rd,Re,Rf
    SW  Rd,d

2 stalls for the original code; no stalls for the scheduled code.
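The stall counts can be reproduced with a small sketch, assuming (as above) a load latency of one clock cycle, so a stall occurs only when the instruction immediately after a load reads the loaded register:

```python
# Count load-use stalls in a straight-line sequence.
def load_use_stalls(code):
    stalls = 0
    for (op, dst, srcs), (_, _, next_srcs) in zip(code, code[1:]):
        if op == "LW" and dst in next_srcs:
            stalls += 1
    return stalls

# (opcode, destination, sources); SW reads the register it stores.
original  = [("LW","Rb",()), ("LW","Rc",()), ("ADD","Ra",("Rb","Rc")),
             ("SW",None,("Ra",)), ("LW","Re",()), ("LW","Rf",()),
             ("SUB","Rd",("Re","Rf")), ("SW",None,("Rd",))]
scheduled = [("LW","Rb",()), ("LW","Rc",()), ("LW","Re",()),
             ("ADD","Ra",("Rb","Rc")), ("LW","Rf",()),
             ("SW",None,("Ra",)), ("SUB","Rd",("Re","Rf")), ("SW",None,("Rd",))]

print(load_use_stalls(original), load_use_stalls(scheduled))   # 2 0
```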
Performance of Pipelines with Stalls
• Hazard conditions in pipelines may make it necessary to stall the pipeline
by a number of cycles degrading performance from the ideal pipelined
CPU CPI of 1.
CPI pipelined = Ideal CPI + Pipeline stall clock cycles per instruction
= 1 + Pipeline stall clock cycles per instruction
• If pipelining overhead is ignored and we assume that the stages are
perfectly balanced then speedup from pipelining is given by:
Speedup = CPI unpipelined / CPI pipelined
= CPI unpipelined / (1 + Pipeline stall cycles per instruction)

• When all instructions in the multicycle CPU take the same number of
cycles equal to the number of pipeline stages then:

Speedup = Pipeline depth / (1 + Pipeline stall cycles per instruction)


Control Hazards
• When a conditional branch is executed it may change the PC and, without
any special measures, leads to stalling the pipeline for a number of cycles
until the branch condition is known (branch is resolved).
– Otherwise the PC may not be correct when needed in IF
• In current MIPS pipeline, the conditional branch is resolved in stage 4 (MEM
stage) resulting in three stall cycles as shown below:

Branch instruction     IF  ID     EX     MEM    WB
Branch successor           stall  stall  stall  IF  ID  EX  MEM  WB
Branch successor + 1                            IF  ID  EX  MEM  WB   (3 stall cycles)
Branch successor + 2                                IF  ID  EX   MEM
Branch successor + 3                                    IF  ID   EX
Branch successor + 4                                        IF   ID
Branch successor + 5                                             IF
Assuming we stall or flush the pipeline on a branch instruction:
Three clock cycles are wasted for every branch for current MIPS pipeline
Branch Penalty = stage number where branch is resolved - 1
here Branch Penalty = 4 - 1 = 3 Cycles
Control Hazard on Branches: Three-Stage Stall

10: beq r1,r3,36
[Pipeline diagram: the branch is resolved in MEM; the three sequential successors 14: and r2,r3,r5, 18: or r6,r1,r7, and 22: add r8,r1,r9 enter the pipeline behind it, and the taken-path target 36: xor r10,r1,r11 cannot be fetched until three cycles later.]

Reducing Branch Stall Cycles
Pipeline hardware measures to reduce branch stall cycles:
1- Find out whether a branch is taken earlier in the pipeline.
2- Compute the taken PC earlier in the pipeline.

In MIPS:
– In MIPS branch instructions BEQZ, BNE, test a register for equality to
zero.
– This can be completed in the ID cycle by moving the zero test into that
cycle.
– Both PCs (taken and not taken) must be computed early.
– Requires an additional adder because the current ALU is not useable until
EX cycle.
– This results in just a single cycle stall on branches.

Branch Stall Impact
• If CPI = 1, 30% branch,
Stall 3 cycles => new CPI = 1.9!
• Two part solution:
– Determine branch taken or not sooner, AND
– Compute taken branch address earlier
• MIPS branch tests if register = 0 or ≠ 0
• MIPS Solution:
– Move Zero test to ID/RF stage
– Adder to calculate new PC in ID/RF stage
– 1 clock cycle penalty for branch versus 3
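The CPI impact above can be checked directly (30% branch frequency; penalty 3 when resolved in MEM vs. 1 when resolved in ID):

```python
branch_freq = 0.30
cpi_mem = 1 + branch_freq * 3   # branches resolved in MEM -> CPI 1.9
cpi_id  = 1 + branch_freq * 1   # zero test and PC adder moved to ID -> CPI 1.3

print(round(cpi_mem, 1), round(cpi_id, 1))
```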


Pipelined MIPS Datapath

[Modified MIPS pipeline datapath: an additional adder and the zero test are moved into the ID stage, so conditional branches are completed in ID; the branch is resolved in stage 2 (ID), giving Branch Penalty = 2 − 1 = 1.]

• Interplay of instruction set design and cycle time.

Four Branch Hazard Alternatives
#1: Stall until branch direction is clear
#2: Predict Branch Not Taken
– Execute successor instructions in sequence
– “Squash” instructions in pipeline if branch actually taken
– Advantage of late pipeline state update
– 47% MIPS branches not taken on average
– PC+4 already calculated, so use it to get next instruction
#3: Predict Branch Taken
– 53% MIPS branches taken on average
– But haven’t calculated branch target address in MIPS
• MIPS still incurs 1 cycle branch penalty
• Other machines: branch target known before outcome
– What happens when hit not-taken branch?

Four Branch Hazard Alternatives

#4: Delayed Branch
– Define branch to take place AFTER a following instruction:
      branch instruction
      sequential successor 1
      sequential successor 2
      ........                      } branch delay of length n
      sequential successor n
      branch target if taken
– 1 slot delay allows proper decision and branch target address in the 5-stage pipeline
– MIPS uses this

Scheduling Branch Delay Slots
A. From before branch:
      add $1,$2,$3
      if $2=0 then
        (delay slot)
   becomes:
      if $2=0 then
        add $1,$2,$3

B. From branch target:
      sub $4,$5,$6
      ...
      add $1,$2,$3
      if $1=0 then
        (delay slot)
   becomes:
      add $1,$2,$3
      if $1=0 then
        sub $4,$5,$6

C. From fall through:
      add $1,$2,$3
      if $1=0 then
        (delay slot)
      sub $4,$5,$6
   becomes:
      add $1,$2,$3
      if $1=0 then
        sub $4,$5,$6
• A is the best choice, fills delay slot & reduces instruction count (IC)
• In B, the sub instruction may need to be copied, increasing IC
• In B and C, must be okay to execute sub when branch fails

Delayed Branch
• Compiler effectiveness for single branch delay slot:
– Fills about 60% of branch delay slots
– About 80% of instructions executed in branch delay slots
useful in computation
– About 50% (60% x 80%) of slots usefully filled
• Delayed branch downside: as processors go to deeper pipelines and multiple issue, the branch delay grows and needs more than one delay slot
– Delayed branching has lost popularity compared to more
expensive but more flexible dynamic approaches
– Growth in available transistors has made dynamic
approaches relatively cheaper

Evaluating Branch Alternatives

Pipeline speedup = Pipeline depth / (1 + Branch frequency × Branch penalty)

Assume: 4% unconditional branches, 6% conditional branches untaken, 10% conditional branches taken.

Scheduling scheme   Stall cycles/instr       CPI    Speedup vs. unpipelined   Speedup vs. stall
Stall pipeline      3 per branch             1.60   3.1                       1.0
Predict not taken   1×0.04 + 3×0.10 = 0.34   1.34   3.7                       1.19
Predict taken       1×0.14 + 2×0.06 = 0.26   1.26   4.0                       1.29
Delayed branch      0.5 per branch           1.10   4.5                       1.45
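The CPI column can be recomputed from the assumed branch frequencies (penalty assignments per scheme follow the table above):

```python
uncond, cond_untaken, cond_taken = 0.04, 0.06, 0.10
all_branches = uncond + cond_untaken + cond_taken          # 0.20

stall_per_instr = {
    "stall pipeline":    3 * all_branches,                 # every branch costs 3 cycles
    "predict not taken": 1 * uncond + 3 * cond_taken,      # untaken conditionals are free
    "predict taken":     1 * (uncond + cond_taken) + 2 * cond_untaken,
    "delayed branch":    0.5 * all_branches,               # 0.5-cycle average penalty
}
for scheme, stall in stall_per_instr.items():
    print(scheme, round(1 + stall, 2))   # CPI = 1 + stall cycles per instruction
```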

Pipelining Summary
• Pipelining overlaps the execution of multiple instructions.
• With an ideal pipeline, the CPI is one, and the speedup is
equal to the number of stages in the pipeline.
• However, several factors prevent us from achieving the ideal
speedup, including
– Not being able to divide the pipeline evenly
– The time needed to empty and flush the pipeline
– Overhead needed for pipelining
– Structural, data, and control hazards
• Just overlap tasks, and easy if tasks are independent

Pipelining Summary
• Speedup vs. pipeline depth; if ideal CPI is 1, then:
  Speedup = [Pipeline depth / (1 + Pipeline stall CPI)] × [Clock cycle unpipelined / Clock cycle pipelined]
• Hazards limit performance
– Structural: need more HW resources
– Data: need forwarding, compiler scheduling
– Control: early evaluation & PC, delayed branch, prediction
• Increasing length of pipe increases impact of hazards;
pipelining helps instruction bandwidth, not latency
• Compilers reduce cost of data and control hazards
– Load delay slots
– Branch delay slots
– Branch prediction
