Chapter # 03 Pipelining

Pipelining is a technique that allows multiple instructions to be executed simultaneously by overlapping their execution stages, similar to an assembly line. It increases instruction throughput but does not reduce the execution time of individual instructions due to overhead and potential hazards such as structural, data, and control hazards. The classic five-stage pipeline for RISC processors involves instruction fetch, decode, execution, memory access, and write-back, optimizing performance while managing the challenges posed by hazards.

Uploaded by

fatimasalahuddinmirza

PIPELINING: BASIC AND INTERMEDIATE CONCEPTS
CHAPTER # 03
WHAT IS PIPELINING?

• Pipelining is an implementation technique whereby multiple instructions are overlapped
in execution; it takes advantage of parallelism that exists among the actions needed to
execute an instruction.
• Today, pipelining is the key implementation technique used to make fast processors, and
even processors that cost less than a dollar are pipelined.
• A pipeline is like an assembly line. In an automobile assembly line, there are many steps,
each contributing something to the construction of the car.
WHAT IS PIPELINING?

• Each step operates in parallel with the other steps, although on a different car.
• In a computer pipeline, each step in the pipeline completes a part of an instruction. Like
the assembly line, different steps are completing different parts of different instructions in
parallel. Each of these steps is called a pipe stage or a pipe segment.
• The stages are connected one to the next to form a pipe—instructions enter at one end,
progress through the stages, and exit at the other end, just as cars would in an assembly
line.
WHAT IS PIPELINING
• Laundry Example
• Ann, Brian, Cathy, and Dave (A, B, C, D) each have one load of clothes to wash, dry, and fold
• Washer takes 30 minutes
• Dryer takes 40 minutes
• “Folder” takes 20 minutes
[Figure: sequential laundry timeline from 6 PM to midnight. Loads A, B, C, D run one after another in task order, each taking 30 + 40 + 20 minutes.]
Sequential laundry takes 6 hours for 4 loads
If they learned pipelining, how long would laundry take?
WHAT IS PIPELINING
START WORK ASAP
[Figure: pipelined laundry timeline starting at 6 PM. Loads A, B, C, D overlap in task order; the timeline segments are 30, 40, 40, 40, 40, and 20 minutes.]
• Pipelined laundry takes 3.5 hours for 4 loads
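The two laundry totals can be checked with a short calculation. The following is a minimal Python sketch, not part of the original slides; the function names are hypothetical, and it assumes a load may wait between machines for as long as needed.

```python
def sequential_minutes(stage_minutes, loads):
    """Each load finishes all stages before the next load starts."""
    return loads * sum(stage_minutes)


def pipelined_minutes(stage_minutes, loads):
    """Each load starts a stage as soon as the load and the machine are both free."""
    # finish[s] = time at which stage s finished its most recent load
    finish = [0] * len(stage_minutes)
    for _ in range(loads):
        ready = 0  # time at which this load leaves the previous stage
        for s, minutes in enumerate(stage_minutes):
            start = max(ready, finish[s])
            finish[s] = start + minutes
            ready = finish[s]
    return finish[-1]


STAGE_MINUTES = [30, 40, 20]  # washer, dryer, "folder"
print(sequential_minutes(STAGE_MINUTES, 4) / 60)  # hours, sequential
print(pipelined_minutes(STAGE_MINUTES, 4) / 60)   # hours, pipelined
```

The pipelined total is 30 + 4 × 40 + 20 = 210 minutes: the 40-minute dryer is the slowest stage, so it sets the rate once the pipeline is full.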
WHAT IS PIPELINING

• In an automobile assembly line, throughput is defined as the number of cars per hour and
is determined by how often a completed car exits the assembly line.
• Likewise, the throughput of an instruction pipeline is determined by how often an
instruction exits the pipeline.
• Because the pipe stages are hooked together, all the stages must be ready to proceed at
the same time, just as we would require in an assembly line.
WHAT IS PIPELINING

• The time required between moving an instruction one step down the pipeline is a
processor cycle.
• Because all stages proceed at the same time, the length of a processor cycle is
determined by the time required for the slowest pipe stage, just as in an auto assembly
line the longest step would determine the time between advancing cars in the line.
• In a computer, this processor cycle is almost always 1 clock cycle.
WHAT IS PIPELINING

• The pipeline designer’s goal is to balance the length of each pipeline stage, just as the
designer of the assembly line tries to balance the time for each step in the process. If the
stages are perfectly balanced, then the time per instruction on the pipelined processor,
assuming ideal conditions, is equal to:
Time per instruction on unpipelined machine / Number of pipe stages
WHAT IS PIPELINING

• Under these conditions, the speedup from pipelining equals the number of pipe stages,
just as an assembly line with n stages can ideally produce cars n times as fast. Usually,
however, the stages will not be perfectly balanced; furthermore, pipelining does involve
some overhead.
• Thus, the time per instruction on the pipelined processor will not have its minimum
possible value, yet it can be close.
• Pipelining yields a reduction in the average execution time per instruction. If the starting
point is a processor that takes multiple clock cycles per instruction, then pipelining
reduces the CPI. This is the primary view we will take.
WHAT IS PIPELINING: LESSONS
• Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload
• Pipeline rate is limited by the slowest pipeline stage
• Multiple tasks operate simultaneously
• Potential speedup = number of pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to “fill” the pipeline and time to “drain” it reduce speedup
[Figure: pipelined laundry timeline from 6 PM to 9 PM for loads A, B, C, D in task order; the timeline segments are 30, 40, 40, 40, 40, and 20 minutes.]
A SIMPLE IMPLEMENTATION OF A RISC
INSTRUCTION SET
• Every instruction in this RISC subset can be implemented in, at most, 5 clock cycles. The
5 clock cycles are as follows.
• Instruction fetch cycle (IF):
• Send the program counter (PC) to memory and fetch the current instruction from
memory. Update the PC to the next sequential instruction by adding 4 (because each
instruction is 4 bytes) to the PC.
A SIMPLE IMPLEMENTATION OF A RISC
INSTRUCTION SET
• Instruction decode/register fetch cycle (ID):
• Decode the instruction and read the registers corresponding to register source
specifiers from the register file.
• Do the equality test on the registers as they are read, for a possible branch. Sign-extend
the offset field of the instruction in case it is needed.
• Compute the possible branch target address by adding the sign-extended offset to the
incremented PC.
A SIMPLE IMPLEMENTATION OF A RISC
INSTRUCTION SET
• Execution/effective address cycle (EX):
• The ALU operates on the operands prepared in the prior cycle, performing one of four
functions, depending on the instruction type.
■ Memory reference: The ALU adds the base register and the offset to form the effective
address.
■ Register-Register ALU instruction: The ALU performs the operation specified by the ALU
opcode on the values read from the register file.
■ Register-Immediate ALU instruction: The ALU performs the operation specified by the ALU
opcode on the first value read from the register file and the sign-extended immediate.
■ Conditional branch: Determine whether the condition is true.
A SIMPLE IMPLEMENTATION OF A RISC
INSTRUCTION SET
• Memory access (MEM):
• If the instruction is a load, the memory does a read using the effective address computed
in the previous cycle. If it is a store, then the memory writes the data from the second
register read from the register file using the effective address.
• Write-back cycle (WB):
• Write the result into the register file, whether it comes from the memory system (for a
load) or from the ALU (for an ALU instruction).
THE CLASSIC FIVE-STAGE PIPELINE FOR A RISC
PROCESSOR
• We can pipeline the execution described in the previous section with almost no changes
by simply starting a new instruction on each clock cycle.
• Each of the clock cycles from the previous section becomes a pipe stage (a cycle in the
pipeline). This results in the execution pattern shown in the figure, which is the typical way a
pipeline structure is drawn.
• Although each instruction takes 5 clock cycles to complete, during each clock cycle the
hardware will initiate a new instruction and will be executing some part of five
different instructions.
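The staggered execution pattern can be reproduced in text form. The following is a minimal Python sketch, not taken from the slides; the helper names and the instruction labels i, i+1, ... are hypothetical, and it assumes an ideal pipeline with no stalls.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_diagram(instructions):
    """Return (name, cells) rows; instruction number i enters IF in cycle i + 1."""
    total_cycles = len(instructions) + len(STAGES) - 1
    rows = []
    for i, name in enumerate(instructions):
        cells = [""] * total_cycles
        for s, stage in enumerate(STAGES):
            cells[i + s] = stage  # one stage per clock cycle, no stalls
        rows.append((name, cells))
    return rows


def render(rows):
    """Format the rows as fixed-width text, one column per clock cycle."""
    width = max(len(name) for name, _ in rows)
    lines = []
    for name, cells in rows:
        lines.append(name.ljust(width) + "  " + " ".join(c.ljust(3) for c in cells))
    return "\n".join(lines)


rows = pipeline_diagram(["i", "i+1", "i+2", "i+3", "i+4"])
print(render(rows))
```

Reading down the fifth column of the output shows WB, MEM, EX, ID, and IF all active in the same clock cycle, one for each of five different instructions.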
BASIC PERFORMANCE ISSUES IN PIPELINING

• Pipelining increases the processor instruction throughput (the number of instructions
completed per unit of time), but it does not reduce the execution time of an individual
instruction. In fact, it usually slightly increases the execution time of each instruction due
to overhead in the control of the pipeline.
• In addition to limitations arising from pipeline latency, limits arise from imbalance among
the pipe stages and from pipelining overhead. Imbalance among the pipe stages reduces
performance because the clock can run no faster than the time needed for the slowest
pipeline stage. Pipeline overhead arises from the combination of pipeline register delay
and clock skew.
EXAMPLE

• Consider the un-pipelined processor in the previous section. Assume that it has a 2 GHz
clock (or a 0.5 ns clock cycle) and that it uses four cycles for ALU operations and
branches and five cycles for memory operations. Assume that the relative frequencies of
these operations are 40%, 20%, and 40%, respectively. Suppose that due to clock skew and
setup, pipelining the processor adds 0.1 ns of overhead to the clock. Ignoring any latency
impact, how much speedup in the instruction execution rate will we gain from a pipeline?
SOLUTION

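• The average instruction execution time on the un-pipelined processor is:
Average instruction execution time = Clock cycle × Average CPI
= 0.5 ns × [(40% + 20%) × 4 + 40% × 5]
= 0.5 ns × 4.4
= 2.2 ns
• In the pipelined implementation, the clock must run at the speed of the slowest stage plus overhead, which
will be 0.5 + 0.1 or 0.6 ns; this is the average instruction execution time. Thus, the speedup from pipelining
is:
Speedup from pipelining = 2.2 ns / 0.6 ns ≈ 3.7 times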
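The arithmetic of this example can be reproduced in a few lines. The following is a minimal Python sketch; the function name is hypothetical, and the numbers are the ones stated in the example (0.5 ns clock cycle, 0.1 ns pipelining overhead).

```python
def average_time_ns(cycle_ns, mix):
    """mix is a list of (fraction of instructions, cycles per instruction) pairs."""
    return cycle_ns * sum(fraction * cycles for fraction, cycles in mix)


# ALU operations 40% at 4 cycles, branches 20% at 4 cycles, memory 40% at 5 cycles
unpipelined_ns = average_time_ns(0.5, [(0.4, 4), (0.2, 4), (0.4, 5)])
pipelined_ns = 0.5 + 0.1  # slowest stage plus clock overhead
speedup = unpipelined_ns / pipelined_ns

print(round(unpipelined_ns, 2), round(pipelined_ns, 2), round(speedup, 2))
```

The speedup is well below the ideal value of 5 for two reasons visible in the numbers: the unpipelined machine already averages fewer than 5 cycles per instruction, and the overhead stretches the pipelined clock.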
THE MAJOR HURDLE OF PIPELINING—PIPELINE HAZARDS

• There are situations, called hazards, that prevent the next instruction in the instruction
stream from executing during its designated clock cycle.
• Hazards reduce the performance from the ideal speedup gained by pipelining.
• There are three classes of hazards:
• 1. Structural hazards
• 2. Data hazards
• 3. Control hazards
THE MAJOR HURDLE OF PIPELINING—PIPELINE HAZARDS

• Structural hazards: HW cannot support this combination of instructions (single person to fold
and put clothes away).
• Data hazards: Instruction depends on result of prior instruction still in the pipeline (missing
sock)
• Control hazards: Pipelining of branches & other instructions that change the PC
• Common solution is to stall the pipeline until the hazard is resolved, inserting one or more
“bubbles” in the pipeline
STRUCTURAL HAZARDS:
[Figure: pipeline diagram over clock cycles 1 to 7 with the instructions in order: Load, Instr 1, Instr 2, Stall (a row of bubbles), Instr 3. Each instruction passes through Ifetch, Reg, ALU, DMem, Reg. This is another way of looking at the effect of a stall.]
STALLS IN PIPELINE
DATA HAZARDS

• A data hazard occurs when the pipeline execution must be stalled because one step must
wait for another one to complete. This comes up when a planned instruction cannot
execute in the planned clock cycle because the data needed is not yet available.
• Read After Write (RAW): InstrJ tries to read an operand before InstrI writes it.
• Write After Read (WAR): InstrJ tries to write an operand before InstrI reads it.
• Write After Write (WAW): InstrJ tries to write an operand before InstrI writes it.
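The three definitions can be expressed as a check on the registers two instructions read and write. The following is a minimal Python sketch; the `Instr` record and the `dependences` function are hypothetical names. It only reports which dependences exist between an earlier and a later instruction; whether a dependence becomes an actual hazard depends on the pipeline organization.

```python
from collections import namedtuple

# dest is the register written (or None); sources are the registers read
Instr = namedtuple("Instr", ["dest", "sources"])


def dependences(first, second):
    """Classify dependences of `second` on `first`, where `first` is earlier in program order."""
    found = set()
    if first.dest is not None and first.dest in second.sources:
        found.add("RAW")  # second reads what first writes
    if second.dest is not None and second.dest in first.sources:
        found.add("WAR")  # second writes what first reads
    if first.dest is not None and first.dest == second.dest:
        found.add("WAW")  # both write the same register
    return found


add = Instr(dest="x1", sources=("x2", "x3"))  # add x1,x2,x3
sub = Instr(dest="x4", sources=("x1", "x5"))  # sub x4,x1,x5
print(dependences(add, sub))
```

For the pair shown, the result is a RAW dependence on x1, because sub reads the register that add writes.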
EXAMPLE

• Consider the pipelined execution of these instructions:

• add x1,x2,x3
• sub x4,x1,x5
• and x6,x1,x7
• or x8,x1,x9
• xor x10,x1,x11
[Figure: pipeline diagram over time (clock cycles) with stages IF, ID/RF, EX, MEM, WB. The instructions add r1,r2,r3; sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11 each pass through Ifetch, Reg, ALU, DMem, Reg in instruction order.]
The use of the result of the ADD instruction in the next three instructions causes a
hazard, since the register is not written until after those instructions read it.
Forwarding To Avoid Data Hazard
Forwarding is the concept of making data available to the input of the ALU for
subsequent instructions, even though the generating instruction hasn’t gotten to WB
in order to write the memory or registers.
[Figure: pipeline diagram over time (clock cycles) for add r1,r2,r3; sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11, each passing through Ifetch, Reg, ALU, DMem, Reg in instruction order.]
[Figure: pipeline diagram over time (clock cycles) for lw r1, 0(r2); sub r4,r1,r6; and r6,r1,r7; or r8,r1,r9. A bubble is inserted into each of the three instructions that follow the load. The stall is necessary as shown here.]
There are some instances where hazards occur, even with forwarding.
This is another representation of the stall.

Without the stall:
LW  R1, 0(R2)   IF    ID    EX    MEM   WB
SUB R4, R1, R5        IF    ID    EX    MEM   WB
AND R6, R1, R7              IF    ID    EX    MEM   WB
OR  R8, R1, R9                    IF    ID    EX    MEM   WB

With the stall:
LW  R1, 0(R2)   IF    ID    EX    MEM   WB
SUB R4, R1, R5        IF    ID    stall EX    MEM   WB
AND R6, R1, R7              IF    stall ID    EX    MEM   WB
OR  R8, R1, R9                    stall IF    ID    EX    MEM   WB

Pipeline Scheduling

The compiler schedules instructions, moving them in order to reduce stalls.

Code sequence for a = b + c before scheduling:
lw Rb, b
lw Rc, c
add Ra, Rb, Rc (stall)
sw a, Ra
Code sequence for d = e + f before scheduling:
lw Re, e
lw Rf, f
sub Rd, Re, Rf (stall)
sw d, Rd
Arrangement of code after scheduling:
lw Rb, b
lw Rc, c
lw Re, e
add Ra, Rb, Rc
lw Rf, f
sw a, Ra
sub Rd, Re, Rf
sw d, Rd
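The effect of the rearrangement can be checked by counting load-use stalls in both orderings. The following is a minimal Python sketch; the tuple encoding and the function name are hypothetical. It assumes forwarding, so the only stall counted is one cycle when an instruction reads a register loaded by the lw immediately before it.

```python
def load_use_stalls(program):
    """Count stalls where an instruction reads the destination of the lw just before it.

    Each instruction is encoded as (opcode, destination register or None, source registers).
    """
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, curr_sources = curr
        if prev_op == "lw" and prev_dest in curr_sources:
            stalls += 1
    return stalls


unscheduled = [
    ("lw", "Rb", ()), ("lw", "Rc", ()),
    ("add", "Ra", ("Rb", "Rc")), ("sw", None, ("Ra",)),
    ("lw", "Re", ()), ("lw", "Rf", ()),
    ("sub", "Rd", ("Re", "Rf")), ("sw", None, ("Rd",)),
]
scheduled = [
    ("lw", "Rb", ()), ("lw", "Rc", ()), ("lw", "Re", ()),
    ("add", "Ra", ("Rb", "Rc")), ("lw", "Rf", ()),
    ("sw", None, ("Ra",)), ("sub", "Rd", ("Re", "Rf")), ("sw", None, ("Rd",)),
]
print(load_use_stalls(unscheduled), load_use_stalls(scheduled))
```

Under these assumptions the original order incurs two stalls (one before add, one before sub) and the scheduled order incurs none, because an independent instruction now sits between each load and its first use.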
Pipeline Scheduling

% loads stalling pipeline:
Program   Unscheduled   Scheduled
gcc       54%           31%
spice     42%           14%
tex       65%           25%
CONTROL HAZARDS

• A control hazard is when we need to find the destination of a branch, and can’t fetch any
new instructions until we know that destination.
• Control hazards can cause a greater performance loss for our RISC V pipeline than do
data hazards. When a branch is executed, it may or may not change the PC to something
other than its current value plus 4. Recall that if a branch changes the PC to its target
address, it is a taken branch; if it falls through, it is not taken, or untaken. If instruction i is
a taken branch, then the PC is usually not changed until the end of ID, after the
completion of the address calculation and comparison.
CONTROL HAZARD ON BRANCHES: THREE STAGE STALL
[Figure: pipeline diagram in which each instruction below passes through Ifetch, Reg, ALU, DMem, Reg.]
10: beq r1,r3,36
14: and r2,r3,r5
18: or r6,r1,r7
22: add r8,r1,r9
36: xor r10,r1,r11
REDUCING PIPELINE BRANCH PENALTIES

• 1. Freeze or flush the pipeline: the simplest scheme
• Hold or delete any instructions after the branch until the branch destination is known.
• This is the solution shown in the figure.
• The branch penalty is fixed and cannot be reduced by software.
REDUCING PIPELINE BRANCH PENALTIES

• 2. Treat every branch as not taken.
• Continue to fetch instructions as if there were no branch.
• Restart fetch at the target address (and turn the previously fetched instruction into a NOP) if
the branch is taken.
REDUCING PIPELINE BRANCH PENALTIES

• 3. Treat every branch as taken.
• As soon as the branch is decoded and the target address is computed, we assume the
branch to be taken and begin fetching and executing at the target.
• This buys us a one-cycle improvement when the branch is actually taken, because we
know the target address at the end of ID, one cycle before we know whether the branch
condition is satisfied in the ALU stage. In either a predicted-taken or predicted-not-taken
scheme, the compiler can improve performance by organizing the code so that the most
frequent path matches the hardware’s choice.
REDUCING PIPELINE BRANCH PENALTIES

• A fourth scheme, which was heavily used in early RISC processors, is called delayed
branch. In a delayed branch, the execution cycle with a branch delay of one is:
branch instruction
sequential successor 1
branch target if taken
• Although it is possible to have a branch delay longer than one, in practice almost all
processors with delayed branch have a single instruction delay; other techniques are used
if the pipeline has a longer potential branch penalty. The job of the compiler is to make
the successor instructions valid and useful.
PERFORMANCE OF PIPELINES WITH STALLS
PERFORMANCE OF BRANCH SCHEMES

• What is the effective performance of each of these schemes?

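• The effective pipeline speedup with branch penalties, assuming an ideal CPI of 1, is

Pipeline speedup = Pipeline depth / (1 + Pipeline stall cycles from branches)

Pipeline stall cycles from branches = Branch frequency × Branch penalty

Pipeline speedup = Pipeline depth / (1 + Branch frequency × Branch penalty)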
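The relationship between branch stalls and speedup can be tried with numbers. The following is a minimal Python sketch; the function name is hypothetical, and the 20% branch frequency and the penalties of 3 and 1 cycles are illustrative assumptions, not measurements from the slides. It assumes an ideal CPI of 1, so speedup is the pipeline depth divided by one plus the branch stall cycles per instruction.

```python
def pipeline_speedup(depth, branch_frequency, branch_penalty):
    """Speedup over the unpipelined machine, assuming an ideal CPI of 1."""
    stall_cycles = branch_frequency * branch_penalty  # average stall cycles per instruction
    return depth / (1 + stall_cycles)


# Hypothetical workload: 5-stage pipeline, 20% of instructions are branches
for penalty in (3, 1):
    print(penalty, round(pipeline_speedup(5, 0.20, penalty), 2))
```

With these assumed values, cutting the branch penalty from 3 cycles to 1 raises the speedup from about 3.1 to about 4.2, which is why the schemes for reducing branch penalties matter.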

A BASIC PIPELINE FOR RISC V
EXCEPTION IN PIPELINE

• The terminology used to describe exceptional situations where the normal execution
order of instructions is changed varies among processors. The terms interrupt, fault, and
exception are used, although not in a consistent fashion. We use the term exception to
cover all these mechanisms, including the following:
■ I/O device request
■ Invoking an operating system service from a user program
■ Tracing instruction execution
■ Breakpoint (programmer-requested interrupt)
■ Integer arithmetic overflow
■ FP arithmetic anomaly
■ Page fault (not in main memory)
■ Misaligned memory accesses (if alignment is required)
■ Memory protection violation
■ Using an undefined or unimplemented instruction
■ Hardware malfunctions
■ Power failure
EXCEPTION IN PIPELINE

• Synchronous versus asynchronous

• User requested versus forced
• User maskable versus user nonmaskable
• Within versus between instructions
• Resume versus terminate
EXTENDING THE RISC V INTEGER PIPELINE TO
HANDLE MULTICYCLE OPERATIONS
• We now want to explore how our RISC V pipeline can be extended to handle
floating-point operations.
HAZARDS AND FORWARDING IN LONGER LATENCY
PIPELINES
• Because the divide unit is not fully pipelined, structural hazards can occur. These will need
to be detected and issuing instructions will need to be stalled.
• Because the instructions have varying running times, the number of register writes
required in a cycle can be larger than 1.
• Write after write (WAW) hazards are possible.
• Instructions can complete in a different order than they were issued, causing problems
with exceptions (imprecise exceptions).
• Because of longer latency of operations, stalls for RAW hazards will be more frequent.
CLASS ACTIVITY-I

• For the code sequence below, choose the statement that best describes requirements
for correctness:

A No stalls as is
B No stalls with forwarding
C Must stall
CLASS ACTIVITY-II

• For the code sequence below, choose the statement that best describes requirements
for correctness

A No stalls as is
B No stalls with forwarding
C Must stall
CLASS ACTIVITY-III

• For the code sequence below, choose the statement that best describes requirements
for correctness

A No stalls as is
B No stalls with forwarding
C Must stall
EXERCISE

• If any dependencies exist where are they and what type are they?
• How many cycles does it take to execute the code fragment?
SOLUTION
EXERCISE

• If any dependencies exist where are they and what type are they?
• How many cycles does it take to execute the code fragment?
EXERCISE

How many cycles does it take to execute the code fragment? Draw Pipeline Diagram to support your
answer
NUMERICAL

• The time delays of the segments in a 5-stage pipeline are t1 = 35 ns, t2 = 30 ns, t3 =
40 ns, t4 = 45 ns, and t5 = 35 ns. The interface register delay time is t = 5 ns. How long
would it take to complete 150 instructions in the pipeline? (Assume all instructions are
independent.)
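One way to work this problem is sketched below in Python. The function name is hypothetical, and the approach assumes the usual model: the clock period is the slowest segment delay plus the interface register delay, and n independent instructions on a k-stage pipeline need k + n - 1 cycles.

```python
def pipeline_total_time_ns(stage_delays_ns, register_delay_ns, instructions):
    """Total time for independent instructions on a linear pipeline with no stalls."""
    cycle_ns = max(stage_delays_ns) + register_delay_ns  # slowest stage sets the clock
    cycles = len(stage_delays_ns) + instructions - 1     # fill the pipe, then one per cycle
    return cycle_ns * cycles


total_ns = pipeline_total_time_ns([35, 30, 40, 45, 35], 5, 150)
print(total_ns)
```

Under these assumptions the clock period is 45 + 5 = 50 ns and the run takes 5 + 150 - 1 = 154 cycles, for a total of 7700 ns.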
NUMERICAL

• Given a non-pipelined architecture running at 1 GHz that takes 5 cycles to complete an
instruction. It was later converted to a 5-stage pipeline operating at 800 MHz. A stall of
70 cycles happens in 2% of memory instructions and a stall of 2 cycles happens in 20% of
the branch instructions. 30% of instructions are memory instructions and 20% are branches.
• What is the actual speedup obtained by pipelining?
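One possible working of this problem is sketched below in Python. The function names are hypothetical, and the sketch assumes the pipelined processor has a base CPI of 1 with the stated stalls added on top, which the problem statement does not say explicitly.

```python
def effective_cpi(base_cpi, stall_sources):
    """stall_sources: (fraction of instructions, fraction of those that stall, stall cycles)."""
    return base_cpi + sum(mix * rate * cycles for mix, rate, cycles in stall_sources)


def time_per_instruction_ns(cpi, clock_ghz):
    """Cycles per instruction divided by cycles per nanosecond."""
    return cpi / clock_ghz


unpipelined_ns = time_per_instruction_ns(5, 1.0)
pipelined_cpi = effective_cpi(1.0, [(0.30, 0.02, 70),   # memory instructions
                                    (0.20, 0.20, 2)])   # branch instructions
pipelined_ns = time_per_instruction_ns(pipelined_cpi, 0.8)
speedup = unpipelined_ns / pipelined_ns

print(round(pipelined_cpi, 2), round(pipelined_ns, 3), round(speedup, 2))
```

With that assumption the stalls add 0.42 + 0.08 = 0.5 cycles per instruction, giving a CPI of 1.5, 1.875 ns per instruction at 800 MHz, and a speedup of about 2.67 over the 5 ns unpipelined time.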
THE MIPS R4000 PIPELINE

• The MIPS architecture and RISC V are very similar, differing only in a few instructions, including
a delayed branch in the MIPS ISA.
• The R4000 is a 64-bit processor.
• It uses an 8-stage integer pipeline as opposed to the 5-stage pipeline.
• The extra stages are incorporated into the instruction fetch and memory access stages.
• The strategy of using a deeper pipeline for speeding up memory access is often called
superpipelining.
• Instruction and data memory are fully pipelined, so a new instruction can start on every clock
cycle.
THE MIPS R4000 PIPELINE

• The Pipeline Stages

• IF: First half of instruction fetch; PC selection actually happens here, together with initiation of instruction cache
access
• IS: Second half of instruction fetch, completion of instruction cache access
• RF: Instruction decode and register fetch, hazard checking, and also instruction cache hit detection
• EX: Execution, which includes effective address calculation, ALU operation, and branch target computation
and condition evaluation
• DF: Data fetch, first half of data cache access
• DS: Second half of data fetch, completion of data cache access
• TC: Tag check, determine whether the data cache access hit
• WB: Write back for loads and register-register operations
THE MIPS R4000 PIPELINE

• Load Delays:
WHAT IS SUPERSCALAR PROCESSOR?

• A superscalar processor is a type of microprocessor that implements instruction-level
parallelism within a single processor: it executes more than one instruction
during a clock cycle by simultaneously dispatching multiple instructions to different
execution units on the processor. A scalar processor executes at most a single instruction
per clock cycle; a superscalar processor can execute more than one instruction during a
clock cycle.
WHAT IS SUPERSCALAR PROCESSOR?

• Superscalar processing is the ability to initiate multiple instructions during the same clock
cycle.
• A typical Superscalar processor fetches and decodes the incoming instruction stream
several instructions at a time.
• Superscalar architectures exploit the potential of ILP (Instruction-Level Parallelism).
SUPER PIPELINE VS SUPER SCALAR
WHAT IS GOOD WITH SUPERSCALARS?

• The hardware solves everything

• Hardware detects potential parallelism between instructions.
• Hardware tries to issue as many instructions as possible in parallel.
• Hardware solves register renaming.
WHAT IS BAD WITH SUPERSCALARS?

• Very complex
• Much hardware is needed for run-time detection. There is a limit in how far we can go
with this technique.
• Power consumption can be very large!
• The instruction window is limited, which limits the capacity to detect
potentially parallel instructions.
VLIW (VERY LONG INSTRUCTION WORD) PROCESSORS

• In this style of architectures, the compiler formats a fixed number of operations as one
big instruction (called a bundle) and schedules them.
• With a small number of operations per bundle, say 3, it is usually called LIW (Long Instruction Word).
• There is a change in the instruction set architecture, i.e., 1 program counter points to 1
bundle (not 1 operation).
• The operations in a bundle are issued in parallel.
• The bundles follow a fixed format and so the decode operations are done in parallel.
RECALL FROM PIPELINING REVIEW

• Pipeline CPI = Ideal pipeline CPI + Structural Stalls + Data Hazard Stalls + Control Stalls

• Ideal pipeline CPI: measure of the maximum performance attainable by the
implementation
• Structural hazards: HW cannot support this combination of instructions
• Data hazards: Instruction depends on result of prior instruction still in the pipeline
• Control hazards: Caused by delay between the fetching of instructions and decisions
about changes in control flow (branches and jumps)
INSTRUCTION LEVEL PARALLELISM

• Instruction-Level Parallelism (ILP): overlap the execution of instructions to improve
performance
• 2 approaches to exploit ILP:
• 1) Rely on hardware to help discover and exploit the parallelism dynamically (e.g.,
Pentium 4, AMD Opteron, IBM Power), and
• 2) Rely on software technology to find parallelism statically at compile time (e.g.,
Itanium 2)
HAZARD VS DEPENDENCE

• Dependence: fixed property of instruction stream (i.e., program)

• Hazard: property of program and processor organization
• Definition: a hazard is created whenever there is a dependence between instructions, and
they are close enough that the overlap during execution would change the order of access to
the operand involved in the dependence.
• A hazard implies the potential for executing things in the wrong order. The potential only exists if the
instructions can be simultaneously “in-flight” (i.e., in the pipeline simultaneously).
• For example, a RAW dependence can exist with or without a hazard: there is no hazard when the
distance between the RAW instructions is larger than the pipeline depth.
ASSUMPTION OF FP LATENCY
BASIC PIPELINE SCHEDULING AND LOOP UNROLLING
LOOP UNROLLING

• Loop overhead (instructions that do book-keeping for the loop): 2

• Actual work (the ld, add.d, and s.d): 3 instructions
• Can we somehow get execution time to be 3 cycles per iteration?
• A simple scheme to increase the number of instructions relative to the branch overhead
instructions.
• Replicates the loop body multiple times, adjusting the loop termination code
LOOP UNROLLING (STRAIGHTFORWARD WAY)

Eliminates 3 branches

Eliminates 3 decrements of X1

1 cycle stall for FLD

2 cycles stall for FADD
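The savings from unrolling can be estimated by counting instructions. The following is a minimal Python sketch; the function name is hypothetical, and the unroll factor of 4 is inferred from the three eliminated branches and decrements rather than stated in the slides. It counts instructions only and ignores the FLD and FADD stall cycles.

```python
def instructions_per_iteration(work, overhead, unroll_factor):
    """Average instruction count per original loop iteration after unrolling."""
    return (work * unroll_factor + overhead) / unroll_factor


WORK = 3      # the load, the add, and the store
OVERHEAD = 2  # loop book-keeping: pointer decrement and branch

print(instructions_per_iteration(WORK, OVERHEAD, 1))  # original loop
print(instructions_per_iteration(WORK, OVERHEAD, 4))  # body replicated four times
```

Replicating the body four times amortizes the two overhead instructions over four iterations, so the average drops from 5 to 3.5 instructions per iteration and approaches the 3 instructions of actual work as the unroll factor grows.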
LOOP UNROLLING (SCHEDULING THAT MINIMIZES STALLS )
