PIPELINING: BASIC AND
INTERMEDIATE CONCEPTS
CHAPTER # 03
WHAT IS PIPELINING?
• Pipelining is an implementation technique whereby multiple instructions are overlapped
in execution; it takes advantage of parallelism that exists among the actions needed to
execute an instruction.
• Today, pipelining is the key implementation technique used to make fast processors, and
even processors that cost less than a dollar are pipelined.
• A pipeline is like an assembly line. In an automobile assembly line, there are many steps,
each contributing something to the construction of the car.
WHAT IS PIPELINING?
• Each step operates in parallel with the other steps, although on a different car.
• In a computer pipeline, each step in the pipeline completes a part of an instruction. Like
the assembly line, different steps are completing different parts of different instructions in
parallel. Each of these steps is called a pipe stage or a pipe segments.
• The stages are connected one to the next to form a pipe—instructions enter at one end,
progress through the stages, and exit at the other end, just as cars would in an assembly
line.
WHAT IS PIPELINING
• Laundry Example
A B C D
• Ann, Brian, Cathy, Dave
each have one load of clothes
to wash, dry, and fold
• Washer takes 30 minutes
• Dryer takes 40 minutes
• “Folder” takes 20 minutes
6 PM 7 8 9 10 11 Midnight
Time
30 40 20 30 40 20 30 40 20 30 40 20
T
a A
s
k
B
O
r
C
d
e
r D
Sequential laundry takes 6 hours for 4 loads
If they learned pipelining, how long would laundry take?
WHAT IS PIPELINING
START WORK ASAP
6 PM 7 8 9 10 11 Midnight
Time
30 40 40 40 40 20
T
a A
s • Pipelined laundry takes 3.5
k
B hours for 4 loads
O
r
C
d
e
r D
WHAT IS PIPELINING
• In an automobile assembly line, throughput is defined as the number of cars per hour and
is determined by how often a completed car exits the assembly line.
• Likewise, the throughput of an instruction pipeline is determined by how often an
instruction exits the pipeline.
• Because the pipe stages are hooked together, all the stages must be ready to proceed at
the same time, just as we would require in an assembly line.
WHAT IS PIPELINING
• The time required between moving an instruction one step down the pipeline is a
processor cycle.
• Because all stages proceed at the same time, the length of a processor cycle is
determined by the time required for the slowest pipe stage, just as in an auto assembly
line the longest step would determine the time between advancing cars in the line.
• In a computer, this processor cycle is almost always 1 clock cycle.
WHAT IS PIPELINING
• The pipeline designer’s goal is to balance the length of each pipeline stage, just as the
designer of the assembly line tries to balance the time for each step in the process. If the
stages are perfectly balanced, then the time per instruction on the pipelined processor—
assuming ideal conditions—is equal to:
WHAT IS PIPELINING
• Under these conditions, the speedup from pipelining equals the number of pipe stages,
just as an assembly line with n stages can ideally produce cars n times as fast. Usually,
however, the stages will not be perfectly balanced; furthermore, pipelining does involve
some overhead.
• Thus, the time per instruction on the pipelined processor will not have its minimum
possible value, yet it can be close.
• Pipelining yields a reduction in the average execution time per instruction. If the starting
point is a processor that takes multiple clock cycles per instruction, then pipelining
reduces the CPI.This is the primary view we will take.
What Is PIPELINING
LESSONS
Pipelining
6 PM 7 8 9 • Pipelining doesn’t help latency of
single task, it helps throughput of
Time
entire workload
T
30 40 40 40 40 20 • Pipeline rate limited by slowest
a
pipeline stage
s
k A
• Multiple tasks operating
simultaneously
O
B • Potential speedup = Number pipe
r
d stages
e • Unbalanced lengths of pipe stages
C
r reduces speedup
D • Time to “fill” pipeline and time to
“drain” it reduces speedup
A SIMPLE IMPLEMENTATION OF A RISC
INSTRUCTION SET
• Every instruction in this RISC subset can be implemented in, at most, 5 clock cycles. The
5 clock cycles are as follows.
• Instruction fetch cycle (IF):
• Send the program counter (PC) to memory and fetch the current instruction from
memory. Update the PC to the next sequential instruction by adding 4 (because each
instruction is 4 bytes) to the PC.
A SIMPLE IMPLEMENTATION OF A RISC
INSTRUCTION SET
• Instruction decode/register fetch cycle (ID):
• Decode the instruction and read the registers corresponding to register source
specifiers from the register file.
• Do the equality test on the registers as they are read, for a possible branch. Sign-extend
the offset field of the instruction in case it is needed.
• Compute the possible branch target address by adding the sign-extended offset to the
incremented PC.
A SIMPLE IMPLEMENTATION OF A RISC
INSTRUCTION SET
• Execution/effective address cycle (EX):
• The ALU operates on the operands prepared in the prior cycle, performing one of three
functions, depending on the instruction type.
• ■ Memory reference—The ALU adds the base register and the offset to form the effective
address.
• ■ Register-Register ALU instruction—The ALU performs the operation specified by the ALU
opcode on the values read from the register file.
• ■ Register-Immediate ALU instruction—The ALU performs the operation specified by the ALU
opcode on the first value read from the register file and the sign-extended immediate.
• ■ Conditional branch—Determine whether the condition is true.
A SIMPLE IMPLEMENTATION OF A RISC
INSTRUCTION SET
• Memory access (MEM):
• If the instruction is a load, the memory does a read using the effective address computed
in the previous cycle. If it is a store, then the memory writes the data from the second
register read from the register file using the effective address.
• Write-back cycle (WB):
• Write the result into the register file, whether it comes from the memory system (for a
load) or from the ALU (for an ALU instruction).
A SIMPLE IMPLEMENTATION OF A RISC
INSTRUCTION SET
THE CLASSIC FIVE-STAGE PIPELINE FOR A RISC
PROCESSOR
• We can pipeline the execution described in the previous section with almost no changes
by simply starting a new instruction on each clock cycle.
• Each of the clock cycles from the previous section becomes a pipe stage—a cycle in the
pipeline. This results in the execution pattern shown in Figure, which is the typical way a
pipeline structure is drawn.
• Although each instruction takes 5 clock cycles to complete, during each clock cycle the
hardware will initiate a new instruction and will be executing some part of the five
different instructions.
THE CLASSIC FIVE-STAGE PIPELINE FOR A RISC
PROCESSOR
THE CLASSIC FIVE-STAGE PIPELINE FOR A RISC
PROCESSOR
BASIC PERFORMANCE ISSUES IN PIPELINING
• Pipelining increases the processor instruction throughput—the number of instructions
completed per unit of time—but it does not reduce the execution time of an individual
instruction. In fact, it usually slightly increases the execution time of each instruction due
to overhead in the control of the pipeline.
• In addition to limitations arising from pipeline latency, limits arise from imbalance among
the pipe stages and from pipelining overhead. Imbalance among the pipe stages reduces
performance because the clock can run no faster than the time needed for the slowest
pipeline stage. Pipeline overhead arises from the combination of pipeline register delay
and clock skew.
EXAMPLE
• Consider the un-pipelined processor in the previous section. Assume that it has a 4 GHz
clock (or a 0.5 ns clock cycle) and that it uses four cycles for ALU operations and
branches and five cycles for memory operations. Assume that the relative frequencies of
these operations are 40%, 20%, and 40%, respectively. Suppose that due to clock skew and
setup, pipelining the processor adds 0.1 ns of overhead to the clock. Ignoring any latency
impact, how much speedup in the instruction execution rate will we gain from a pipeline?
SOLUTION
• The average instruction execution time on the un-pipelined processor is:
In the pipelined implementation, the clock must run at the speed of the slowest stage plus overhead, which
will be 0.5 + 0.1 or 0.6 ns; this is the average instruction execution time. Thus, the speedup from pipelining
is
THE MAJOR HURDLE OF PIPELINING—PIPELINE HAZARDS
• There are situations, called hazards, that prevent the next instruction in the instruction
stream from executing during its designated clock cycle.
• Hazards reduce the performance from the ideal speedup gained by pipelining.
• There are three classes of hazards:
• 1. Structural hazards
• 2. Data hazards
• 3. Control hazards
THE MAJOR HURDLE OF PIPELINING—PIPELINE HAZARDS
• Structural hazards: HW cannot support this combination of instructions (single person to fold
and put clothes away).
• Data hazards: Instruction depends on result of prior instruction still in the pipeline (missing
sock)
• Control hazards: Pipelining of branches & other instructions that change the PC
• Common solution is to stall the pipeline until the hazard is resolved, inserting one or more
“bubbles” in the pipeline
STRUCTURAL HAZARDS:
Time (clock cycles)
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7
I Load
ALU
Ifetch Reg DMem Reg This is another
way of looking
n
at the effect of a
s Instr 1 stall.
ALU
Ifetch Reg DMem Reg
t
r.
ALU
Instr 2 Ifetch Reg DMem Reg
O
r Stall Bubble Bubble Bubble Bubble Bubble
d
e
r
ALU
Instr 3 Ifetch Reg DMem Reg
STALLS IN PIPELINE
DATA HAZARDS
• A data hazard occurs when the pipeline execution must be stalled because one step must
wait for another one to complete. This comes up when a planned instruction can not
execute in the planned clock cycle because the data needed is not yet available.
• Read After Write (RAW) Instrj tries to read operand before InstrIwrites it.
DATA HAZARDS
• Write After Read (WAR) InstrJ tries to write operand before InstrI reads it
• Write After Write (WAW) InstrJ tries to write operand before InstrI writes it
EXAMPLE
• Consider the pipelined execution of these instructions:
• add x1,x2,x3
• sub x4,x1,x5
• and x6,x1,x7
• or x8,x1,x9
• xor x10,x1,x11
Time (clock cycles)
IF ID/RF EX MEM WB
ALU
add r1,r2,r3 Ifetch Reg DMem Reg
n
s
ALU
Ifetch Reg DMem Reg
sub r4,r1,r3
t
r.
ALU
Ifetch Reg DMem Reg
and r6,r1,r7
O
r
ALU
Ifetch Reg DMem Reg
or r8,r1,r9
d
e
ALU
xor r10,r1,r11 Ifetch Reg DMem Reg
r
The use of the result of the ADD instruction in the next three instructions causes a
hazard, since the register is not written until after those instructions read it.
Forwarding is the concept of making data
available to the input of the ALU for
subsequent instructions, even though the
Forwarding To Avoid generating instruction hasn’t gotten to WB
Data Hazard in order to write the memory or registers.
Time (clock cycles)
I
n add r1,r2,r3 Ifetch
ALU
Reg DMem Reg
s
t
ALU
r sub r4,r1,r3 Ifetch Reg DMem Reg
ALU
Ifetch Reg DMem Reg
r and r6,r1,r7
d
e
ALU
Ifetch Reg DMem Reg
or r8,r1,r9
r
ALU
Ifetch Reg DMem Reg
xor r10,r1,r11
Time (clock cycles) The stall is necessary as shown here.
ALU
Ifetch Reg DMem Reg
lw r1, 0(r2)
I
n
s
ALU
sub r4,r1,r6 Ifetch Reg Bubble DMem Reg
t
r.
Bubble
ALU
Reg Reg
O and r6,r1,r7 Ifetch DMem
r
d
Bubble
ALU
e or r8,r1,r9
Ifetch Reg DMem
There are some instances where hazards occur, even with forwarding.
This is another representation of the stall.
LW R1, 0(R2) IF ID EX MEM WB
SUB R4, R1, R5 IF ID EX MEM WB
AND R6, R1, R7 IF ID EX MEM WB
OR R8, R1, R9 IF ID EX MEM WB
LW R1, 0(R2) IF ID EX MEM WB
SUB R4, R1, R5 IF ID stall EX MEM WB
AND R6, R1, R7 IF stall ID EX MEM WB
OR R8, R1, R9 stall IF ID EX MEM WB
Pipeline Scheduling
Instruction scheduled by compiler - move instruction in order to reduce stall.
lw Rb, b code sequence for a = b+c before scheduling
lw Rc, c
Add Ra, Rb, Rc stall
sw a, Ra
lw Re, e code sequence for d = e+f before scheduling
lw Rf, f
sub Rd, Re, Rf stall
sw d, Rd
Arrangement of code after scheduling.
lw Rb, b
lw Rc, c
lw Re, e
Add Ra, Rb, Rc
lw Rf, f
sw a, Ra
sub Rd, Re, Rf
sw d, Rd
Pipeline Scheduling
scheduled unscheduled
54%
gcc
31%
42%
spice
14%
65%
tex
25%
0% 20% 40% 60% 80%
% loads stalling pipeline
CONTROL HAZARDS
• A control hazard is when we need to find the destination of a branch, and can’t fetch any
new instructions until we know that destination.
• Control hazards can cause a greater performance loss for our RISC V pipeline than do
data hazards. When a branch is executed, it may or may not change the PC to something
other than its current value plus 4. Recall that if a branch changes the PC to its target
address, it is a taken branch; if it falls through, it is not taken, or untaken. If instruction i is
a taken branch, then the PC is usually not changed until the end of ID, after the
completion of the address calculation and comparison.
CONTROL HAZARD ON BRANCHES THREE STAGE STALL
ALU
10: beq r1,r3,36 Ifetch Reg DMem Reg
ALU
14: and r2,r3,r5 Ifetch Reg DMem Reg
ALU
18: or r6,r1,r7 Ifetch Reg DMem Reg
ALU
22: add r8,r1,r9 Ifetch Reg DMem Reg
ALU
36: xor r10,r1,r11 Ifetch Reg DMem Reg
REDUCING PIPELINE BRANCH PENALTIES
• 1. Freeze or flush the pipeline: the simplest scheme
• Hold or delete any instructions after the branch until the branch destination is known .
• This is the solution shown in Figure
• The branch penalty is fixed and cannot be reduced by software
REDUCING PIPELINE BRANCH PENALTIES
• 2.Treat every branch as not taken.
• Continue to fetch instructions as if there were no branch.
• Restart fetch at target address (and turn previously fetched instruction into a NOP) if
the branch is taken.
REDUCING PIPELINE BRANCH PENALTIES
• 3.Treat every branch as taken
• As soon as the branch is decoded and the target address is computed, we assume the
branch to be taken and begin fetching and executing at the target..
• This buys us a one-cycle improvement when the branch is actually taken, because we
know the target address at the end of ID, one cycle before we know whether the branch
condition is satisfied in the ALU stage. In either a predicted-taken or predicted-not-taken
scheme, the compiler can improve performance by organizing the code so that the most
frequent path matches the hardware’s choice.
REDUCING PIPELINE BRANCH PENALTIES
REDUCING PIPELINE BRANCH PENALTIES
• A fourth scheme, which was heavily used in early RISC processors is called delayed
branch. In a delayed branch, the execution cycle with a branch delay of one is:
• Although it is possible to have a branch delay longer than one, in practice almost all
processors with delayed branch have a single instruction delay; other techniques are used
if the pipeline has a longer potential branch penalty. The job of the compiler is to make
the successor instructions valid and useful.
REDUCING PIPELINE BRANCH PENALTIES
PERFORMANCE OF PIPELINES WITH STALLS
PERFORMANCE OF BRANCH SCHEMES
• What is the effective performance of each of these schemes?
• The effective pipeline speedup with branch penalties, assuming an ideal CPI of 1, is
Pipeline stall cycles from branches = Branch frequency x Branch penalty
A BASIC PIPELINE FOR RISC V
EXCEPTION IN PIPELINE
• The terminology used to describe exceptional situations where the normal execution
order of instruction is changed varies among processors. The terms interrupt, fault, and
exception are used, although not in a consistent fashion. We use the term exception to
cover all these mechanisms,
• ■ I/O device request ■ Invoking an operating system service from a user program.
• ■ Tracing instruction execution ■ Breakpoint (programmer-requested interrupt)
• ■ Integer arithmetic overflow■ FP arithmetic anomaly ■ Page fault (not in main memory)
■ Misaligned memory accesses (if alignment is required) ■ Memory protection violation
■ Using an undefined or unimplemented instruction ■ Hardware malfunctions ■ Power
failure
EXCEPTION IN PIPELINE
• Synchronous versus asynchronous
• User requested versus forced
• User maskable versus user nonmaskable
• Within versus between instructions
• Resume versus terminate
EXTENDING THE RISC V INTEGER PIPELINE TO
HANDLE MULTICYCLE OPERATIONS
• We now want to explore how our RISC V pipeline can be extended to handle floating-
point operations.
EXTENDING THE RISC V INTEGER PIPELINE TO
HANDLE MULTICYCLE OPERATIONS
EXTENDING THE RISC V INTEGER PIPELINE TO
HANDLE MULTICYCLE OPERATIONS
HAZARDS AND FORWARDING IN LONGER LATENCY
PIPELINES
• Because the divide unit is not fully pipelined, structural hazards can occur. These will need
to be detected and issuing instructions will need to be stalled.
• Because the instructions have varying running times, the number of register writes
required in a cycle can be larger than 1.
• Write after write (WAW) hazards are possible.
• Instructions can complete in a different order than they were issued, causing problems
with exceptions ; Imprecise exception.
• Because of longer latency of operations, stalls for RAW hazards will be more frequent.
HAZARDS AND FORWARDING IN LONGER
LATENCY PIPELINES
HAZARDS AND FORWARDING IN LONGER
LATENCY PIPELINES
CLASS ACTIVITY-I
• For the code sequence below, choose the statement that best describes requirements
for correctness:
A No stalls as is
B No stalls with forwarding
C Must stall
CLASS ACTIVITY-II
• For the code sequence below, choose the statement that best describes requirements
for correctness
A No stalls as is
B No stalls with forwarding
C Must stall
CLASS ACTIVITY-III
• For the code sequence below, choose the statement that best describes requirements
for correctness
A No stalls as is
B No stalls with forwarding
C Must stall
EXERCISE
• If any dependencies exist where are they and what type are they?
• How many cycles does it take to execute the code fragment?
SOLUTION
EXERCISE
• If any dependencies exist where are they and what type are they?
• How many cycles does it take to execute the code fragment?
EXERCISE
How many cycles does it take to execute the code fragment? Draw Pipeline Diagram to support your
answer
NUMERICAL
• The time delay of various segments in a 5 stage pipeline are t1=35 ns , t2= 30ns , t3=
40ns ,t4= 45 ns and t5= 35 ns . The interface register delay time is t= 5ns. How long
would it take to complete 150 instructions in the pipeline? (Assuming all instructions are
independent.
NUMERICAL
• Given a non-pipelined architecture , running at 1 GHz, that takes 5 cycles to complete an
instruction . It was later converted to a 5 stage pipeline operating at 800 MHz. A stall of
70 cycles happens in 2% of memory instructions and a stall of 2 cycles happens in 20% of
the branch instructions. 30 % instructions are of memory and 20% are of branches.
• What is actual speedup obtained by pipelining ?
THE MIPS R4000 PIPELINE
• The MIPS architecture and RISC V are very similar, differing only in a few instructions, including
a delayed branch in the MIPS ISA.
• The R4000 is a 64 bit instruction set .
• However, it uses an 8-stage integer pipeline as opposed to the 5-stage pipeline.
• The extra stages are incorporated in to the instruction fetch and memory access stages.
• The strategy of using a deeper pipeline for speeding up memory access is often called super
pipelining.
• Instruction and data memory are fully pipelined, so a new instruction can start on every clock
cycle.
THE MIPS R4000 PIPELINE
• The Pipeline Stages
• IF : First half of instruction fetch; PC selection actually happens here, together with initiation of instruction cache
access
• IS : Second half of instruction fetch, complete instruction cache access.
• RF : Instruction decode and register fetch, hazard checking, and also instruction cache hit detection
• EX : Execution, which includes effective address calculation, ALU operation, and branch target completion of
data cache access
• DF : Data fetch, first half of data cache access
• DS : Second half of data fetch, completion of data cache access
• TC :Tag check, determine whether the data cache access hit
• WB :Write back for loads and register-register operation.
THE MIPS R4000 PIPELINE
• Load Delays:
WHAT IS SUPERSCALAR PROCESSOR?
• A type of microprocessor that is used to implement a type of parallelism known as
instruction-level parallelism in a single processor to execute more than one instruction
during a CLK cycle by dispatching simultaneously various instructions to special
execution units on the processor. A scalar processor executes single instruction for
each clock cycle; a superscalar processor can execute more than one instruction during a
clock cycle.
WHAT IS SUPERSCALAR PROCESSOR?
• Superscalar processing is the ability to initiate multiple instructions during the same clock
cycle.
• A typical Superscalar processor fetches and decodes the incoming instruction stream
several instructions at a time.
• Superscalar architecture exploit the potential of ILP(Instruction Level Parallelism).
WHAT IS SUPERSCALAR PROCESSOR?
WHAT IS SUPERSCALAR PROCESSOR?
SUPER PIPELINE VS SUPER SCALAR
WHAT IS GOOD WITH SUPERSCALARS?
• The hardware solves everything
• Hardware detects potential parallelism between instructions.
• Hardware tries to issue as many instructions as possible in parallel.
• Hardware solves register renaming.
WHAT IS BAD WITH SUPERSCALARS?
• Very complex
• Much hardware is needed for run-time detection. There is a limit in how far we can go
with this technique.
• Power consumption can be very large!
• The instruction window is limited this limits the capacity to detect
potentially parallel instructions.
VLIW (VERY LONG INSTRUCTION WORD) PROCESSORS
• In this style of architectures, the compiler formats a fixed number of operations as one
big instruction (called a bundle) and schedules them.
• With few numbers of instructions, say 3, it is usually called LIW (Long Instruction Word).
• There is a change in the instruction set architecture, i.e., 1 program counter points to 1
bundle (not 1 operation).
• The operations in a bundle are issued in parallel.
• The bundles follow a fixed format and so the decode operations are done in parallel.
VLIW (VERY LONG INSTRUCTION WORD) PROCESSORS
RECALL FROM PIPELINING REVIEW
• Pipeline CPI = Ideal pipeline CPI + Structural Stalls + Data Hazard Stalls + Control Stalls
• Ideal pipeline CPI: measure of the maximum performance attainable by the
implementation
• Structural hazards: HW cannot support this combination of instructions
• Data hazards: Instruction depends on result of prior instruction still in the pipeline
• Control hazards: Caused by delay between the fetching of instructions and decisions
about changes in control flow (branches and jumps)
INSTRUCTION LEVEL PARALLELISM
• Instruction-Level Parallelism (ILP): overlap the execution of instructions to improve
performance
• 2 approaches to exploit ILP:
• 1) Rely on hardware to help discover and exploit the parallelism dynamically (e.g.,
Pentium 4,AMD Opteron, IBM Power) , and
• 2) Rely on software technology to find parallelism, statically at compile-time (e.g.,
Itanium 2)
HAZARD VS DEPENDENCE
• Dependence: fixed property of instruction stream (i.e., program)
• Hazard: property of program and processor organization
• Definition: a hazard is created whenever there is a dependence between instructions, and
they are close enough that the overlap during execution would change the order of access to
the operand involved in the dependence.
• – implies potential for executing things in wrong order .potential only exists if instructions can
be simultaneously “in-flight” (i.e. in the pipeline simultaneously)
• For example, can have RAW dependence with or without hazard – When distance between
RAW instructions is larger than the pipeline depth
ASSUMPTION OF FP LATENCY
BASIC PIPELINE SCHEDULING AND LOOP UNROLLING
LOOP UNROLLING
• Loop overhead (instructions that do book-keeping for the loop): 2
• Actual work (the ld, add.d, and s.d): 3 instructions
• Can we somehow get execution time to be 3 cycles per iteration?
• A simple scheme to increase the number of instructions relative to the branch overhead
instructions.
• Replicates the loop body multiple times, adjusting the loop termination code
•
LOOP UNROLLING (STRAIGHTFORWARD WAY)
Eliminates 3 branches
Eliminates 3 decrements of X1
1 cycle stall for FLD
2 cycles stall for FADD
LOOP UNROLLING (SCHEDULING THAT MINIMIZES STALLS )