0% found this document useful (0 votes)
10 views64 pages

Module 2

The document discusses pipelining in computer architecture, explaining its basic concepts, characteristics, and advantages, such as increased throughput and reduced cycles per instruction. It details single-cycle and multi-cycle datapaths, highlighting their respective benefits and drawbacks, as well as the various types of hazards that can occur during pipelining, such as structural, data, and control hazards. Additionally, it covers techniques for resolving these hazards, including forwarding and pipeline interlocks.

Uploaded by

anikaarajesh
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
10 views64 pages

Module 2

The document discusses pipelining in computer architecture, explaining its basic concepts, characteristics, and advantages, such as increased throughput and reduced cycles per instruction. It details single-cycle and multi-cycle datapaths, highlighting their respective benefits and drawbacks, as well as the various types of hazards that can occur during pipelining, such as structural, data, and control hazards. Additionally, it covers techniques for resolving these hazards, including forwarding and pipeline interlocks.

Uploaded by

anikaarajesh
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 64

Pipelining: Basic and

Intermediate Concepts
Chappathi Making
Chappathi Making (Contd.)
Chappathi Making (Contd.)
Pipelining Basics

• Pipelining overlaps execution of instructions to


improve performance.
• Pipelining partitions the system into multiple
independent stages with added buffers between
the stages.
• Similar to an assembly line where each stage
handles part of the task.
• Each stage works concurrently on different
instructions.
• Key metric: throughput, or instructions completed
per unit time.
Classic Five-Stage RISC
Pipeline

• Five stages: Instruction Fetch (IF),


Instruction Decode (ID), Execute
(EX), Memory Access (MEM), Write-
Back (WB).
• Each new instruction begins one
cycle after the previous.
• Multiple instructions progress
simultaneously through the pipeline.
Visualizing Pipelining
Pipelining Characteristics

Time Pipelining doesn’t reduce

30 40 40 40 40 20 latency of single task, it

A improves throughput of entire


workload
B
Pipeline rate limited by slowest
C pipeline stage

D Potential speedup = Number of


pipe stages

Unbalanced lengths of pipe stages reduces speedup

Time to “fill” pipeline and time to “drain” it reduces speedup


Datapaths – single cycle and multi-cycle


Datapath
– collection of functional units (like ALU,
registers, memory) and their connections
that execute instructions
– It’s the “hardware pathway” data travels
through during instruction execution.
– two kinds of Datapaths: Single Cycle
Datapath and Multiple Cycle Datapath

26 / 35
Single Cycle Datapath

Every instruction completes in exactly one clock
cycle — no matter if it’s a simple ADD or a complex
LOAD.

Key characteristics:

One long clock cycle that must be long enough for
the slowest instruction.

All steps (fetch, decode, execute, memory access,
write-back) happen in that one cycle.

Hardware must duplicate some resources (e.g.,
separate instruction and data memory) so multiple
steps can happen at once in that cycle.

27 / 35
Single Cycle Datapath (Contd.)

Advantages of Single Cycle Datapath

Simplicity: It is less complex than top-down since
its development is easier than that of other large
systems and as such perfect for such systems.

Consistent Timing: All of the instructions take
equal amounts of time.

28 / 35
Single Cycle Datapath (Contd.)

Disadvantages of Single Cycle Datapath

Inefficient for Complex Instructions: Since all
instructions, including the basic ones, have to wait
for the longest operation to be completed, this can
result in more time being taken.

Duplicate Hardware Requirement: It calls for
more hardware due to the fact that functional units
cannot be shared or utilized in the same instruction
cycle.

29 / 35
Multi-Cycle Datapath

Each instruction is split into multiple steps, with each
step taking one clock cycle.

Different instructions take different numbers of
cycles depending on complexity.

Key characteristics:

Shorter clock cycle (based on the slowest single step,
not the slowest whole instruction).

Can reuse hardware (same memory for instruction
and data in different cycles).

Control logic is more complex (often
microprogrammed).

30 / 35
Multi-Cycle Datapath (Contd.)

Advantages of Multi-Cycle Datapath

More efficient — clock period is shorter, simple
instructions finish quickly, complex ones take longer.

Less hardware duplication - Some functional units
can be utilized more than once during a single step
of the same instruction, as is the case with ALU.

31 / 35
Multi-Cycle Datapath (Contd.)

Disadvantages of Multi-Cycle Datapath

Complex Control Logic: Additionally, the control
unit has to be more complex to handle multiple
cycles per one instruction.

Extra Registers: In between the steps, more
registers are required to hold results that will be used
in later operations or computations.

Slower overall throughput compared to pipelining
(but better than single-cycle).

32 / 35
Single Vs Multi-Cycle Datapaths

33 / 35
Pipeline – a series of datapaths

34 / 35
Datapath with pipeline registers

35 / 35
Pipelined RISC Data path
Instruction Instr. Decode Execute Memory Write
Fetch Reg. Fetch Addr. Calc Access Back

Next PC
Next SEQ PC Next SEQ PC
Adder

4 RS1
Zero?

MEM/WB
Address

Reg File

EX/MEM
RS2

ID/EX
IF/ID

ALU
Memory

Memory

MUX

WB Data
Sign
Extend
Imm

RD RD RD
RISC MIPS Instruction Pipeline
 Each instruction can take at most 5 clock cycles
 Instruction fetch cycle (IF)
 Based on PC, fetch the instruction from memory
 Increment PC

Next PC
Instruction
Fetch Adder

4
Address

Memory
RISC MIPS Instruction Pipeline
 Instruction decode/register fetch cycle (ID)
 Decode the instruction + register read operation
 Fixed field decoding [ADD R1,R2,R2] OR [LW R1,8(R2) ]
Ex: A3.01.02.03 : 10100011 00000001 00000010 00000011
 Ex: 86.01.02.03 : 10000110 00000001 00001000 00000010

Next SEQ PC

Instr. Decode
RS1
Reg. Fetch
Reg File

RS2

Sign
Extend
Imm

RD
RISC MIPS Instruction Pipeline
 Execution/Effective address cycle (EX)
 Memory reference: Calculate the effective address
 [LW R1,8(R2) ] EFF ADDR= [R2] +8
 Register-register ALU instruction [ADD R1,R2,R2]

Execute
Zero?
Addr. Calc
ALU

RD
RISC MIPS Instruction Pipeline
 Memory access cycle (MEM)
 Load from memory and store in register [LW R1,8(R2)]
 Store the data from the register to memory [SW R3,16(R4)]

Memory
Access
Memory
RISC Instruction Pipeline
 Write-back cycle (WB)
 Register-register ALU instruction or load instruction
 Write to register file [LW R1,8(R2)] , [ADD R1,R2,R3]

Instr. Decode Write


Reg. Fetch Back

RS1

RS2
Reg File

RD

Imm

Sign
Extend
Visualizing Pipelining
CC1 CC2 CC3 CC4 CC5 CC6 CC7 CC8 CC9

ALU
IM REG DM REG

IM REG ALU DM REG

IM REG ALU DM REG

ALU
IM REG DM REG
Benefits of Pipelining

• Increases instruction throughput


significantly.
• Can approach ideal speedup =
number of pipeline stages.
• Reduction in CPI (cycles per
instruction) for many workloads.
• Not visible to the programmer;
handled entirely in hardware.
Limits to pipelining
 Hazards: circumstances that would cause incorrect execution if
next instruction is fetched and executed

Structural hazards: Attempting to use the same hardware to do


two different things at the same time

Data hazards: Instruction depends on result of prior instruction


still in the pipeline

Control hazards: Caused by delay between the fetching of


instructions and decisions about changes in control flow
(branches)
Structural Hazard

Eg: Uniport Memory


Time (clock cycles)
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7

ALU
Load Ifetch Reg DMem Reg

ALU
Instr 1 Ifetch Reg DMem Reg

ALU
Instr 2 Ifetch Reg DMem Reg

ALU
Instr 3 Ifetch Reg DMem Reg

Instr 4

Structural Hazard
Resolving Structural Hazard
 Eliminate the use same hardware for two different things at the
same time

 Solution 1: Wait

 must detect the hazard

 must have mechanism to stall

 Solution 2: Duplicate hardware

 Multiple such units will help both instruction to progress


Handling Structural Hazards

• Duplicate or pipeline functional units


to avoid conflicts.
• Use stalls (pipeline bubbles) when
conflicts can't be avoided.
• Example: shared memory causing
fetch/store conflict.
• Optimized designs use split caches
and buffers to reduce stalls.
Detecting & Resolving Structural Hazard

Time (clock cycles)


Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7

ALU
Load Ifetch Reg DMem Reg

ALU
Instr 1 Ifetch Reg DMem Reg

ALU
Instr 2 Ifetch Reg DMem Reg

Stall Bubble Bubble Bubble Bubble Bubble

ALU
Instr 3 Ifetch Reg DMem Reg
Eliminating Structural Hazards at Design

Adder
Next PC
Next SEQ PC Next SEQ PC

4 Zero?
RS1

MUX MUX

MEM/WB
Address

RS2

EX/MEM
Reg File

ID/EX
IF/ID

ALU

Cache
Cache
Instr

Data

MUX

WB Data
Sign
Extend
Imm

RD RD RD
Data Hazard

Time (clock cycles)

IF ID/RF EX MEM WB

ALU
add r1,r2,r3 Ifetch Reg DMem Reg

ALU
Ifetch Reg DMem Reg
sub r4,r1,r3

ALU
and r6,r1,r7 Ifetch Reg DMem Reg

ALU
Ifetch Reg DMem Reg
or r8,r1,r9

ALU
xor r10,r1,r11 Ifetch Reg DMem Reg
Data Hazards and
Forwarding
• Data hazard: later instruction reads a
register before it’s updated.
• Forwarding (bypassing) sends results
directly to dependent stages.
• Prevents stalls for most ALU
operations.
• More complex forwarding needed for
multi-cycle instructions.
Three Generic Data Hazards

 Read After Write (RAW)


InstrJ tries to read operand before InstrI writes it

I : add r1,r2,r3
J: sub r4,r1,r3

 Caused by a data dependence


 This hazard results from an actual need for communication.
Three Generic Data Hazards
 Write After Read (WAR)
InstrJ writes operand before InstrI reads it
I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7
 Called an anti-dependence by compiler writers.
 This results from reuse of the name r1
 Can’t happen in MIPS 5 stage pipeline because:
 All instructions take 5 stages, and
 Reads are always in stage 2, and
 Writes are always in stage 5
Three Generic Data Hazards
 Write After Write (WAW)
InstrJ writes operand before InstrI writes it.
I: sub r1,r4,r3

 Called an output dependence J: add r1,r2,r3


 This also results from the reuse of name r1. K: mul r6,r1,r7
 Can’t happen in MIPS 5 stage pipeline because:

 All instructions take 5 stages, and

 Writes are always in stage 5


 WAR and WAW happens in out of order pipes
Operand Forwarding to Avoid Data Hazard

Time (clock cycles)

ALU
add r1,r2,r3 Ifetch Reg DMem Reg

ALU
sub r4,r1,r3 Ifetch Reg DMem Reg

ALU
Ifetch Reg DMem Reg
and r6,r1,r7

ALU
Ifetch Reg DMem Reg
or r8,r1,r9

ALU
Ifetch Reg DMem Reg
xor r10,r1,r11
Pipeline Interlocks

• Hardware mechanism to detect and


resolve hazards.
• Stalls the pipeline when data isn't
available in time.
• Ensures correct execution without
programmer intervention.
• Example: stall after a load followed
by a dependent ALU op.
Branch Hazards

• Branches disrupt sequential


instruction flow.
• The target of the branch isn’t known
until later stage (ID or EX).
• Stalls or incorrect path execution can
occur.
• Common solution: freeze pipeline or
flush incorrect instructions.
Control Hazard on Branches

=> Three Stage Stall

ALU
10: beq r1,r3,36 Ifetch Reg DMem Reg

ALU
Ifetch Reg DMem Reg
14: and r2,r3,r5

ALU
18: or r6,r1,r7 Ifetch Reg DMem Reg

ALU
Ifetch Reg DMem Reg
22: add r8,r1,r9

ALU
36: xor r10,r1,r11 Ifetch Reg DMem Reg
Four Branch Hazard Alternatives

#1: Stall until branch direction is clear

#2: Predict Branch Not Taken

#3: Predict Branch Taken

#4: Delayed Branch


Four Branch Hazard Alternatives

#1: Stall until branch direction is clear

#2: Predict Branch Not Taken

Execute successor instructions in sequence

“Squash” instructions in pipeline if branch actually taken


Four Branch Hazard Alternatives
#3: Predict Branch Taken

But branch target address in is not known by IF stage

Target is known at same time as branch outcome (IDstage)

 MIPS still incurs 1 cycle branch penalty


Four Branch Hazard Alternatives
#4: Delayed Branch
Define branch to take place AFTER one instruction
following the branch instruction.
 1 slot delay allows proper decision and branch target
address in 5 stage pipeline (MIPS uses this approach)

Where to get instructions to fill branch delay slot?


Conditional Branches
 When do you know you have a branch?
 During ID cycle (Could you know before that?)
 When do you know if the branch is Taken or Not-Taken
 During EXE cycle/ ID stage depending on the design
 We need for sophisticated solutions for following cases
 Modern pipelines are deep ( 10 + stages)
 Several instructions issued/cycle
 Several predicted branches in-flight at the same time
Simple static predictive schemes
 Predict branch Not -Taken (easiest to implement; default
for dynamic branch prediction)
If prediction correct no problem;
If prediction incorrect, delay = number of stages
between ID and EXE
 Predict branch Taken
Interesting only if target address can be computed
early
 Prediction depends on the direction of branch
Backward-Taken-Forward-Not-Taken (BTFNT)
Rationale: Backward branches at end of loops:
mostly taken
Dynamic branch prediction
 Execution of a branch requires knowledge of:
Branch instruction - encode whether instruction is a
branch or not. Decide on taken or not taken (i.e.,
prediction can be done at IF stage)
Whether the branch is Taken/Not-Taken (hence a branch
prediction mechanism)
If the branch is taken what is the target address (can be
computed but can also be “precomputed”, i.e., stored in
some table)
If the branch is taken what is the instruction at the branch
target address (saves the fetch cycle for that instruction)
Dynamic branch prediction
 Use a Branch Prediction Buffer (BPB)
Also called Branch Prediction Table (BPT), Branch History
Table (BHT)
Records previous outcomes of the branch instruction.
How to index into the table is an issue.
 A prediction using BPB is attempted when the branch
instruction is fetched (IF stage or equivalent)
 It is acted upon during ID stage (when we know we have a
branch)
Dynamic branch prediction
 Has a prediction been made (Y/N)
If not use default “Not Taken”
 Is it correct or incorrect ?
 Two cases:
Case 1: Yes and the prediction was correct (known at
ID stage) or No but the default was correct: No delay
Case 2: Yes and the prediction was incorrect or No and
the default was incorrect: Delay
RISC Instruction Set

• MIPS
RISC-V 64-bit
32/64 bit is used as the
representative RISC architecture.
• Three main instruction classes: ALU,
Load/Store, Branches.
• Fixed instruction format simplifies
decoding and pipelining.
• Registers are the only operands for
arithmetic/logical ops.
Unpipelined Implementation

• Executes one instruction at a time,


over multiple cycles.
• Typical instruction takes 4-5 cycles
(CPI ≈ 4.54).
• Easy to understand and implement,
but inefficient.
• Used as a baseline to illustrate
pipelining benefits.
Unpipelined RISC Data path

Instruction Instr. Decode Execute Memory Write


Fetch Reg. Fetch Addr. Calc Access Back
Next PC
Adder

Next SEQ PC

4 RS1
Zero?

Reg File
Address

RS2

ALU
Inst

Memory
Memory

RD L
M
D
Sign
Imm Extend

WB Data
Implementing the Pipeline

• Split instruction/data memory or


cache to avoid memory conflicts.
• Use pipeline registers between
stages (e.g., IF/ID, ID/EX).
• Careful control needed to avoid
simultaneous hardware usage.
• Balance stage durations to prevent
bottlenecks.
Pipelined RISC Data path
Instruction Instr. Decode Execute Memory Write
Fetch Reg. Fetch Addr. Calc Access Back

Next PC
Next SEQ PC Next SEQ PC
Adder

4 RS1
Zero?

MEM/WB
Address

Reg File

EX/MEM
RS2

ID/EX
IF/ID

ALU
Memory

Memory

MUX

WB Data
Sign
Extend
Imm

RD RD RD
RISC MIPS Instruction Pipeline
RISC-V
 Each instruction can take at most 5 clock cycles
 Instruction fetch cycle (IF)
 Based on PC, fetch the instruction from memory
 Increment PC

Next PC
Instruction
Fetch Adder

4
Address

Memory
RISC MIPS Instruction Pipeline
RISC-V
 Instruction decode/register fetch cycle (ID)
 Decode the instruction + register read operation
 Fixed field decoding [ADD R1,R2,R2] OR [LW R1,8(R2) ]
Ex: A3.01.02.03 : 10100011 00000001 00000010 00000011
 Ex: 86.01.02.03 : 10000110 00000001 00001000 00000010

Next SEQ PC

Instr. Decode
RS1
Reg. Fetch
Reg File

RS2

Sign
Extend
Imm

RD
RISC MIPS Instruction Pipeline
RISC-V
 Execution/Effective address cycle (EX)
 Memory reference: Calculate the effective address
 [LW R1,8(R2) ] EFF ADDR= [R2] +8
 Register-register ALU instruction [ADD R1,R2,R2]

Execute
Zero?
Addr. Calc
ALU

RD
RISC MIPS Instruction Pipeline
RISC-V
 Memory access cycle (MEM)
 Load from memory and store in register [LW R1,8(R2)]
 Store the data from the register to memory [SW R3,16(R4)]

Memory
Access
Memory
RISC Instruction Pipeline
RISC-V
 Write-back cycle (WB)
 Register-register ALU instruction or load instruction
 Write to register file [LW R1,8(R2)] , [ADD R1,R2,R3]

Instr. Decode Write


Reg. Fetch Back

RS1

RS2
Reg File

RD

Imm

Sign
Extend
RISC-V
RISC MIPS Instruction Pipeline
Cycles required to implement different instructions
 Branch instructions – 4 cycles
 Store instructions – 4 cycles
 All other instructions – 5 cycles

EX
IF ID MEM WB
Visualizing Pipelining
CC1 CC2 CC3 CC4 CC5 CC6 CC7 CC8 CC9

ALU
IM REG DM REG

IM REG ALU DM REG

IM REG ALU DM REG

ALU
IM REG DM REG
Multi-cycle Operations
 Can EXE stage complete the operation in 1 cycle ?

 Some operations require more than 1 clock cycle to complete.

Floating Point/Integer Multiply

Floating Point/Integer Divide

Floating Point Add/Sub

 Dedicated hardware units are available on the processor for


performing these operations.

 FP-Mul and FP-Add are fully pipelined, but FP-Div is un-


pipelined.
Multi-cycle Operations

 Latency: The number of intervening cycles between an


instruction that produces a result and an instruction that uses
the result.
 Initiation / Repeat Interval: The number of cycles that must
elapse between issuing two operations of a given type.

You might also like