
Lecture 9: Pipelined Processor Design

Drawbacks of Single Cycle Processor


❑ Long cycle time
➢ All instructions take as much time as the slowest instruction

Worst Case Timing


❑ Slowest instruction: load

➢ Cycle time is longer than needed for other instructions

Multicycle Implementation
❑ Break instruction execution into five steps
➢ Instruction fetch
➢ Instruction decode, register read, target address for jump/branch
➢ Execution, memory address calculation, or branch outcome
➢ Memory access or ALU instruction completion
➢ Load instruction completion
❑ One clock cycle per step (clock cycle is reduced)
➢ First 2 steps are the same for all instructions

Single cycle vs. multicycle example


❑ Single cycle

❑ Multicycle
➢ Shorter clock cycle time: constrained by longest step, not longest instruction
➢ Higher overall performance: simpler instructions take fewer cycles, less waste
❑ Assume the following operation times for components:
➢ Instruction and data memories: 200 ps
➢ ALU and adders: 180 ps
➢ Decode and Register file access (read or write): 150 ps
➢ Ignore the delays in PC, mux, extender, and wires
❑ Assume the following instruction mix:
➢ 40% ALU, 20% loads, 10% stores, 20% branches, and 10% jumps
❑ Which of the following would be faster and by how much?
➢ Single-cycle implementation for all instructions
➢ Multicycle implementation optimized for every class of instructions

=> Example solution


❑ For fixed single-cycle implementation:
➢ Clock cycle = 880 ps determined by longest delay (load instruction)
❑ For multi-cycle implementation:
➢ Clock cycle = max (200, 150, 180) = 200 ps (maximum delay at any step)
➢ Average CPI = 0.4×4 + 0.2×5 + 0.1×4 + 0.2×3 + 0.1×2 = 3.8
❑ Speedup = 880 ps / (3.8 × 200 ps) = 880 / 760 = 1.16
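The arithmetic above can be checked with a short script. The component delays, cycle counts per instruction class, and instruction mix are all taken from the example; only the variable names are my own.

```python
# Recomputing the single-cycle vs. multicycle comparison from the example.
MEM, ALU, REG = 200, 180, 150  # memory, ALU/adder, decode/register-file delays (ps)

# Single-cycle: the clock must cover the slowest instruction (load):
# fetch + decode/register read + ALU + data memory + register write
single_cycle = MEM + REG + ALU + MEM + REG          # 880 ps

# Multicycle: the clock only needs to cover the slowest single step
multi_cycle = max(MEM, ALU, REG)                    # 200 ps

# (fraction of mix, cycles needed) per instruction class
mix = {"alu": (0.40, 4), "load": (0.20, 5), "store": (0.10, 4),
       "branch": (0.20, 3), "jump": (0.10, 2)}
avg_cpi = sum(frac * cycles for frac, cycles in mix.values())  # 3.8

speedup = single_cycle / (avg_cpi * multi_cycle)    # 880 / 760 ≈ 1.16
print(single_cycle, multi_cycle, avg_cpi, round(speedup, 2))
```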

The idea of pipelining


❑ Multicycle improves performance over single cycle, but can you see limitations of the
multi-cycle design?
➢ Some HW resources are idle during different phases of the instruction cycle, e.g.
“Fetch” logic is idle when an instruction is being “decoded” or “executed”
➢ Most of the datapath is idle when a memory access is happening

❑ Can we do better?
➢ Yes: More concurrency → Higher instruction throughput (i.e., more “work” completed
in one cycle)

❑ Idea: when an instruction is using some resources in its processing phase, process
other instructions on idle resources
➢ E.g., when an instruction is being decoded, fetch the next instruction
➢ E.g., when an instruction is being executed, decode another instruction
➢ E.g., when an instruction is accessing data memory (lw/sw), execute the next
instruction
➢ E.g., when an instruction is writing its result into the register file, access data
memory for the next instruction
Single-cycle vs multi-cycle vs pipeline
❑ Five stages, one step per stage
➢ Each step requires 1 clock cycle → instructions enter/leave the pipeline at the rate
of one per clock cycle

Pipeline performance
❑ Ideal pipeline assumptions
➢ Identical operations, e.g. four laundry steps are repeated for all loads
➢ Independent operations, e.g. no dependency between laundry steps
➢ Uniformly partitionable sub operations (that do not share resources), e.g. laundry
steps have uniform latency.
❑ Ideal pipeline speedup

➢ Speedup comes from increased throughput; the latency of a single instruction does not decrease

❑ Speedup for non-ideal pipelines is less


➢ External/internal fragmentation, pipeline stalls.
✓ Latency = execution time (delay or response time) = the total time from start to
finish of ONE instruction
✓ Throughput (or execution bandwidth) = the total amount of work done in a given
amount of time
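The throughput/latency distinction can be made concrete with a small model of an ideal k-stage pipeline. The stage time and instruction count below are illustrative assumptions, not values from the lecture.

```python
# Throughput vs. latency for an ideal k-stage pipeline (no stalls).

def exec_time(n_instr, n_stages, stage_time):
    """Total time for n_instr instructions: fill the pipeline once
    (n_stages cycles for the first instruction), then one per cycle."""
    return (n_stages + (n_instr - 1)) * stage_time

def unpipelined_time(n_instr, n_stages, stage_time):
    """One instruction at a time, each taking all n_stages steps."""
    return n_instr * n_stages * stage_time

k, stage_time = 5, 200          # illustrative: 5 stages of 200 ps each
n = 1_000_000
speedup = unpipelined_time(n, k, stage_time) / exec_time(n, k, stage_time)
print(round(speedup, 3))        # approaches k = 5 for large n

# Latency of ONE instruction is unchanged: still k * stage_time = 1000 ps
print(exec_time(1, k, stage_time))
```

For large instruction counts the fill time is negligible and the speedup approaches the number of stages, which is the ideal-pipeline claim above.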

=> Example: A MIPS pipelined processor's performance


❑ Assume time for stages is
✓ 100ps for register read or write
✓ 200ps for other stages

❑ Compare pipelined datapath with single-cycle datapath


=> Example solution
❑ Time between the start of the 1st and 5th instructions: single-cycle = 3200 ps (4 × 800 ps)
vs. pipelined = 800 ps (4 × 200 ps)
→ speedup = 4.
➢ Execution time for 5 instructions: 4000ps vs 1800ps ≈ 2.22 times speedup
→ Why shouldn't the speedup be 5 (#stages)? What’s wrong?
➢ Think of real programs which execute billions of instructions.
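A quick way to answer that question is to redo the comparison for a growing instruction count, using the example's stage times (100 ps register read/write, 200 ps for the other stages, so the single-cycle clock is 800 ps and the pipeline clock is 200 ps):

```python
# Speedup of the 5-stage pipeline over single-cycle, as n grows.

def single_cycle_time(n):
    return n * 800                  # ps: 100 + 200 + 200 + 200 + 100 each

def pipelined_time(n):
    return (5 + (n - 1)) * 200      # fill 5 stages, then 1 instruction/cycle

for n in (5, 1_000_000):
    print(n, round(single_cycle_time(n) / pipelined_time(n), 3))
```

Two effects keep the speedup below 5: for short runs, the pipeline fill time matters (5 instructions give only ≈ 2.22); and even for billions of instructions the speedup only approaches 800/200 = 4, not 5, because the stages are unbalanced — the clock is set by the slowest stage (200 ps), not by one fifth of the single-cycle delay (160 ps). This is the internal fragmentation discussed above.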

MIPS ISA supports for pipelining


❑ What makes it easy
➢ All instructions are 32-bits
• Easier to fetch and decode in one cycle: fetch in the 1st stage and decode in the 2nd stage
c.f. x86: 1- to 17-byte instructions
➢ Few and regular instruction formats
• Can decode and read registers in one step
➢ Memory operations occur only in loads and stores
• Can calculate address in 3rd stage, access memory in 4th stage
➢ Operands must be aligned in memory
• Memory access takes only one cycle
➢ Each instruction writes at most one result (i.e., changes the machine state) and does
it in the last few pipeline stages (MEM or WB)

Ideas from the Single-Cycle Datapath


❑ How to pipeline a single-cycle datapath? Think of the simple datapath as a linear
sequence of stages.
Pipelined Datapath
❑ Add state registers between each pipeline stage
➢ To isolate information between cycles
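One way to picture the state registers is as records copied one stage forward on every clock edge. The sketch below uses the conventional IF/ID, ID/EX, EX/MEM, MEM/WB names; the field names inside each record are illustrative assumptions.

```python
# Pipeline registers modeled as dictionaries advanced on each clock edge.

def clock_edge(if_id, id_ex, ex_mem, mem_wb, fetched):
    """Advance every inter-stage register on the same clock edge.
    Each record moves one stage to the right; IF/ID takes the newly
    fetched instruction. Returning fresh dicts models the edge-triggered
    update: each stage only ever sees last cycle's values."""
    return dict(fetched), dict(if_id), dict(id_ex), dict(ex_mem)

regs = ({}, {}, {}, {})                      # IF/ID, ID/EX, EX/MEM, MEM/WB
for pc, instr in enumerate(["lw", "add", "sub"]):
    regs = clock_edge(*regs, {"pc": pc, "instr": instr})

print(regs)  # after three edges, "lw" has reached the EX/MEM register
```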

Pipeline operation
❑ Cycle-by-cycle flow of instructions through the pipelined datapath
➢ Same clock edge updates all pipeline registers, register file, and data memory (for
store instruction)
➢ “Single-clock-cycle” pipeline diagram
✓ Shows pipeline usage in a single cycle
✓ Highlights the resources used
➢ c.f. “multi-clock-cycle” diagram (later)
✓ Graph of operation over time
❑ We’ll look at “single-clock-cycle” diagrams for load to verify the proposed datapath

IF for Load, Store, …


EX for Load

MEM for Load


WB for Load

Corrected Datapath for Load


Multi-Cycle Pipeline Diagram
❑ Shows the complete execution of instructions in a single figure
➢ Instructions are listed in instruction execution order from top to bottom
➢ Clock cycles move from left to right
➢ Figure shows the use of resources at each stage and each cycle

❑ Can help with answering questions like:


➢ How many cycles does it take to execute this code?
➢ What is the ALU doing during cycle 4?
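A textual version of such a diagram is easy to generate for an ideal (stall-free) 5-stage pipeline; the instruction sequence below is illustrative.

```python
# Print a multi-clock-cycle diagram: instructions top to bottom,
# clock cycles left to right, for an ideal 5-stage pipeline.

STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def diagram(instrs):
    rows = []
    for i, name in enumerate(instrs):
        # Instruction i occupies stage s during cycle i + s (0-based).
        cells = ["    "] * i + [f"{s:>4}" for s in STAGES]
        rows.append(f"{name:<6}" + "".join(cells))
    return rows

for row in diagram(["lw", "add", "sub"]):
    print(row)

def total_cycles(n, stages=5, stalls=0):
    """Cycles to execute n instructions: fill + one per cycle + stalls."""
    return stages + (n - 1) + stalls

print(total_cycles(3))  # 7 cycles for three instructions with no stalls
```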

Pipelined control: control points


❑ Same control points as in the single-cycle datapath

Pipelined control: settings


❑ Control signals derived from instruction & determined during ID
➢ As the instruction moves, pipeline its control signals with it → extend the pipeline
registers to include the control signals
➢ Each stage uses some of the control signals
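The idea can be sketched as follows: ID produces the full set of control signals once, and each inter-stage register carries forward only the signals that later stages still need. The signal names follow the usual single-cycle MIPS control, but the exact grouping here is an assumption for illustration.

```python
# Control signals generated in ID, then thinned out stage by stage.

CONTROL = {
    "lw": {"RegDst": 0, "ALUSrc": 1, "ALUOp": "add", "Branch": 0,
           "MemRead": 1, "MemWrite": 0, "RegWrite": 1, "MemtoReg": 1},
    "sw": {"RegDst": 0, "ALUSrc": 1, "ALUOp": "add", "Branch": 0,
           "MemRead": 0, "MemWrite": 1, "RegWrite": 0, "MemtoReg": 0},
}

# Which signals each stage consumes; the rest are dropped along the way.
EX_USES  = {"RegDst", "ALUSrc", "ALUOp"}
MEM_USES = {"Branch", "MemRead", "MemWrite"}
WB_USES  = {"RegWrite", "MemtoReg"}

def control_per_register(opcode):
    """Return the control fields of ID/EX, EX/MEM, and MEM/WB."""
    sig = CONTROL[opcode]
    id_ex  = {k: sig[k] for k in EX_USES | MEM_USES | WB_USES}
    ex_mem = {k: sig[k] for k in MEM_USES | WB_USES}
    mem_wb = {k: sig[k] for k in WB_USES}
    return id_ex, ex_mem, mem_wb

id_ex, ex_mem, mem_wb = control_per_register("lw")
print(ex_mem, mem_wb)
```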

Pipelined control: complete


Can Pipelining Get Us Into Trouble?
❑ Yes: the instruction pipeline is not an ideal pipeline
➢ different instructions → not all need the same stages: some pipe stages idle for some
instructions → external fragmentation
➢ different pipeline stages → not the same latency: some pipe stages are too fast but
all take the same clock cycle time → internal fragmentation
➢ instructions are not independent of each other → pipeline stalls: pipeline is not
always moving

❑ Issues in pipeline design: pipeline hazards


➢ structural hazards: attempt to use the same resource by two different instructions at
the same time
➢ data hazards: attempt to use data before it is ready, e.g. an instruction’s source
operand(s) are produced by a prior instruction still in the pipeline
➢ control hazards: attempt to make a decision about program control flow before the
condition has been evaluated and the new PC target address calculated (e.g. branch and
jump instructions, exceptions)
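As a small illustration of the second category, a data hazard can be detected by scanning for an instruction that reads a register written by a still-in-flight earlier instruction. The instruction encoding (destination, sources) is a simplification for this sketch, and the 2-instruction window assumes a 5-stage pipeline with no forwarding.

```python
# Detect simple read-after-write data hazards in a short program.

def data_hazards(program, window=2):
    """Flag reads of a register written by one of the previous `window`
    instructions (still in the pipeline, result not yet written back)."""
    hazards = []
    for i, (op, dest, srcs) in enumerate(program):
        for j in range(max(0, i - window), i):
            prev_dest = program[j][1]
            if prev_dest is not None and prev_dest in srcs:
                hazards.append((j, i, prev_dest))
    return hazards

prog = [("lw",  "$2", ("$1",)),        # $2 <- mem[$1]
        ("add", "$4", ("$2", "$3")),   # reads $2 too soon: hazard
        ("sw",  None, ("$4", "$5"))]   # reads $4 too soon: hazard
print(data_hazards(prog))  # [(0, 1, '$2'), (1, 2, '$4')]
```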
Example: structural hazards

Summary
❑ Multi-cycle processor
➢ Uses one clock cycle per step → shorter clock cycle time, set by the longest step rather
than the longest instruction.
➢ Higher performance over single-cycle processor: simpler instructions take fewer
cycles → less waste

❑ Pipeline processor design


➢ Employs instruction parallelism: process the next instruction on resources left idle
as the current instruction moves to later phases.
➢ Speedup is due to increased throughput: once the pipeline is full, CPI=1.
➢ Datapath can be derived from that of single-cycle processor, with additional buffer
registers
➢ Control signals remain the same as in the single-cycle case but some of them are
moved along the pipeline via inter-stage buffers.

❑ As the instruction pipeline is not ideal, various issues may occur including structural,
data, and control hazards.
