Computer Architecture Basics Overview

18-447 Computer Architecture
Lecture 1: Introduction and Basics
Prof. Onur Mutlu
Carnegie Mellon University
Spring 2015, 1/12/2015

A Key Question
How Was Wright Able To Design Fallingwater?
Can have many guesses
(Ultra) hard work, perseverance, dedication (over decades)
Experience of decades
Creativity
Out-of-the-box thinking
Principled design
A good understanding of past designs
Good judgment and intuition
Strong combination of skills (math, architecture, art, …)
…
(You will be exposed to and hopefully develop/enhance
many of these skills in this course)
A Quote from The Architect Himself
"architecture […] based upon principle, and not upon precedent"

Major High-Level Goals of This Course
Understand the principles
Understand the precedents
In Computer Architecture
[Figure: a modern multi-core chip — four cores with private L2 caches, a shared L3 cache, and a DRAM memory controller/interface to DRAM banks]

to understand how a processor works underneath the
software layer and how decisions made in hardware affect the
software/programmer

to enable you to be comfortable in making design and
optimization decisions that cross the boundaries of different
layers and system components

Memory Performance Hog
[Figure: two applications sharing the memory system — a low-priority program on Core 0 and another program on Core 1]
Can you fix the problem without knowing what is
happening "underneath"?
Moscibroda and Mutlu, "Memory performance attacks: Denial of memory service
in multi-core systems," USENIX Security 2007.
[Figure: DRAM bank organization — a row decoder selects a row (e.g., Row 1, Column 0), the row is read into the row buffer, and a column mux selects the requested column; an access to the already-open row is a row buffer HIT, an access to a different row is a row buffer CONFLICT]

[Figure: two cores with private L2 caches connected over an interconnect to the shared DRAM memory system — a DRAM memory controller in front of DRAM Banks 0-3; unfairness arises in this shared memory system]
STREAM                                  RANDOM
for (j=0; j<N; j++) {                   for (j=0; j<N; j++) {
  index = j*linesize;  // streaming       index = rand();    // random
  A[index] = B[index];                    A[index] = B[index];
  …                                       …
}                                       }

STREAM                                  RANDOM
- Sequential memory access              - Random memory access
- Very high row buffer locality         - Very low row buffer locality
  (96% hit rate)                          (3% hit rate)
- Memory intensive                      - Similarly memory intensive

Row size: 8KB, cache block size: 64B
128 (8KB/64B) requests of T0 (STREAM) serviced before T1 (RANDOM)

[Figure: memory request buffer and row buffer — T0's requests keep hitting the open Row 0 while T1's requests go to different rows (e.g., Rows 5, 111, 16) and conflict, so a row-hit-first scheduler services all of T0's requests first]

Moscibroda and Mutlu, "Memory Performance Attacks," USENIX Security 2007.
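For concreteness, a minimal, self-contained C sketch of the two kernels above (the array size, linesize, and surrounding setup are illustrative assumptions, not the exact code from the paper):

    #include <stdlib.h>

    #define N (1 << 20)            /* illustrative array length           */
    #define LINESIZE 16            /* 64B cache block / 4B element        */

    static int A[N], B[N];

    /* STREAM-like kernel: sequential, cache-block-strided accesses, so
       consecutive requests fall into the same open DRAM row.            */
    void stream_kernel(void) {
        for (int j = 0; j < N / LINESIZE; j++) {
            int index = j * LINESIZE;
            A[index] = B[index];
        }
    }

    /* RANDOM-like kernel: each access goes to an arbitrary location, so
       almost every request opens a different DRAM row.                  */
    void random_kernel(void) {
        for (int j = 0; j < N / LINESIZE; j++) {
            int index = rand() % N;
            A[index] = B[index];
        }
    }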
Now That We Know What Happens Underneath
How would you solve the problem?
What is the right place to solve the problem?
Programmer?
System software?
Compiler?
Hardware (Memory controller)?
Hardware (DRAM)?
Circuits?

(Levels of transformation: Problem, Algorithm, Program/Language,
Runtime System (VM, OS, MM), ISA (Architecture), Microarchitecture,
Logic, Circuits, Electrons)

Two other goals of this course:
Enable you to think critically
Enable you to think broadly

Reading on Memory Performance Attacks
Thomas Moscibroda and Onur Mutlu,
"Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems"
Proceedings of the 16th USENIX Security Symposium (USENIX SECURITY),
pages 257-274, Boston, MA, August 2007. Slides (ppt)
One potential reading for your Homework 1 assignment
[Figure: multi-core chip die photo — cores, private L2 caches, shared L3 cache, DRAM interface/controller and DRAM banks]
[Figure: DRAM cells on a bitline]
A DRAM cell consists of a capacitor and an access transistor
It stores data in terms of charge in the capacitor
A DRAM chip consists of (10s of 1000s of) rows of such cells

The capacitor charge leaks, so every row must be refreshed periodically
Typical refresh interval N = 64 ms
Downsides of refresh
-- Energy consumption: Each refresh consumes energy
-- Performance degradation: DRAM rank/bank unavailable while refreshed
-- QoS/predictability impact: (Long) pause times during refresh
-- Refresh rate limits DRAM capacity scaling
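As a rough back-of-the-envelope check (assumed numbers, not from the slides): with 8192 rows per bank, a 64 ms refresh interval, and roughly 100 ns to refresh one row, the bank spends about 8192 × 100 ns ≈ 0.82 ms of every 64 ms refreshing, i.e., around 1.3% of the time — and this fraction grows as chips add more rows, which is why refresh limits capacity scaling.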
[Figure: measured DRAM retention-time distribution — only a small fraction of rows actually needs the worst-case 64 ms refresh rate]
Part of your Homework 1
Liu et al., "RAIDR: Retention-Aware Intelligent DRAM Refresh," ISCA 2012.
Eliminate or minimize it: Replace DRAM with a different
technology that does not have the problem
…

Source Code to Induce Errors in Modern DRAM Chips
https://github.com/CMU-SAFARI/rowhammer
Recap: Some Goals of 447
Teach/enable/empower you to:
Understand how a computing platform (processor + memory +
interconnect) works
Implement a simple platform (with not so simple parts), with a
focus on the processor and memory
Understand how decisions made in hardware affect the
software/programmer as well as hardware designer
Think critically (in solving problems)
Think broadly across the levels of transformation
Understand how to analyze and make tradeoffs in design

Review: Major High-Level Goals of This Course
Understand the principles
Understand the precedents
Based on such understanding:
Enable you to evaluate tradeoffs of different designs and ideas
Enable you to develop principled designs
Enable you to develop novel, out-of-the-box designs
The focus is on:
Principles, precedents, and how to use them for new designs
In Computer Architecture
Goal 2: To provide the necessary background and experience to
design, implement, and evaluate a modern processor by
performing hands-on RTL and C-level implementation.
Strong emphasis on functionality and hands-on design,
implementation, and efficiency.

How to design, implement, and evaluate a functional modern
processor
Semester-long lab assignments
A combination of RTL implementation and higher-level simulation
Strong emphasis on making things work, realizing ideas

This course covers the HW/SW interface (ISA) and
microarchitecture
We will focus on tradeoffs and how they affect software
Read: Patt, "Requirements, Bottlenecks, and Good Fortune: Agents for
Microprocessor Evolution," Proceedings of the IEEE 2001.

How to dig out information, think critically and broadly
How to work even harder and more efficiently!
Understand why computers work the way they do
No clear, definitive answers to these problems
You can invent new paradigms for computation,
communication, and storage

Recommended book: Thomas Kuhn, "The Structure of
Scientific Revolutions" (1962)
Pre-paradigm science: no clear consensus in the field
Normal science: dominant theory used to explain/improve
things (business as usual); exceptions considered anomalies
Revolutionary science: underlying assumptions re-examined
… but, first …
Let's understand the fundamentals…
Fundamental Concepts

You can change the world only if you understand it well
enough…
Especially the past and present dominant paradigms
And, their advantages and shortcomings – tradeoffs
And, what remains fundamental across generations
And, what techniques you can use and develop to solve
problems
Computation
Communication
Storage (memory)

[Figure: the von Neumann model — control (sequencing), datapath (processing), memory (program and data), and I/O]
The Von Neumann Model (of a Computer)
[Figure: MEMORY with a Memory Address Register; CONTROL UNIT with an Instruction Pointer (IP) and an Instruction Register]
Q: Is this the only way that a computer can operate?
ISA vs. Microarchitecture Level Tradeoff
A similar tradeoff (control vs. data-driven execution) can be
made at the microarchitecture level
ISA: Specifies how the programmer sees instructions to be executed
Programmer sees a sequential, control-flow execution order vs.
Programmer sees a data-flow execution order
Microarchitecture: How the underlying implementation actually executes instructions
Multiple instructions at a time: Intel Pentium uarch
Out-of-order execution: Intel Pentium Pro uarch
Separate instruction and data caches
But, what happens underneath that is not consistent with
the von Neumann model is not exposed to software
Difference between ISA and microarchitecture

Let's Get Back to the Von Neumann Model
But, if you want to learn more about dataflow…
Dennis and Misunas, "A preliminary architecture for a basic
data-flow processor," ISCA 1974.
Gurd et al., "The Manchester prototype dataflow
computer," CACM 1985.
A later 447 lecture, 740/742

Traditional (ISA-only) definition: "The term
architecture is used here to describe the attributes of a
system as seen by the programmer, i.e., the conceptual
structure and functional behavior as distinct from the
organization of the data flow and controls, the logic design,
and the physical implementation." Gene Amdahl, IBM
Journal of R&D, April 1964

Microprocessor: ISA, uarch, circuits
"Architecture" = ISA + microarchitecture
Microarchitecture usually changes faster than ISA
Few ISAs (x86, ARM, SPARC, MIPS, Alpha) but many uarchs
Why?
ISA
Instructions
Opcodes, Addressing Modes, Data Types
Instruction Types and Formats
Registers, Condition Codes
Memory
Address space, Addressability, Alignment
Virtual memory management
Call, Interrupt/Exception Handling
Access Control, Priority/Privilege
I/O: memory-mapped vs. instr.
Task/thread Management
Power and Thermal Management
Multi-threading support, Multiprocessor support

Microarchitecture
Implementation of the ISA under specific design constraints
and goals
Anything done in hardware without exposure to software
Pipelining
In-order versus out-of-order instruction execution
Memory access scheduling policy
Speculative execution
Superscalar processing (multiple instruction issue?)
Clock gating
Caching? Levels, size, associativity, replacement policy
Prefetching?
Voltage/frequency scaling?
Error correction?
Computer architecture is the science and art of making the
appropriate trade-offs to meet a design point
Why art?
We do not (fully) know the future (applications, users, market)

Performance
System and Task-level tradeoffs
How to divide the labor between hardware and software

[Figure: the levels of transformation stack — Program, Runtime System (VM, OS, MM), ISA, Microarchitecture, Logic, Circuits, Electrons; new issues and capabilities at the bottom (Look Down); changing issues and capabilities at the bottom (Look Down and Forward)]
Instruction (MIPS)
Basic element of the HW/SW interface
Consists of
opcode: what the instruction does
operands: who it is to do it to

0 | rs | rt | rd | shamt | funct    R-type
6-bit 5-bit 5-bit 5-bit 5-bit 6-bit
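A minimal C sketch of how the R-type fields above pack into a 32-bit word (the helper name is ours; the opcode/funct values shown in the example are the standard MIPS ADD encoding):

    #include <stdint.h>
    #include <stdio.h>

    /* Pack an R-type MIPS instruction: opcode | rs | rt | rd | shamt | funct */
    static uint32_t encode_rtype(uint32_t opcode, uint32_t rs, uint32_t rt,
                                 uint32_t rd, uint32_t shamt, uint32_t funct) {
        return ((opcode & 0x3Fu) << 26) | ((rs & 0x1Fu) << 21) |
               ((rt & 0x1Fu) << 16)     | ((rd & 0x1Fu) << 11) |
               ((shamt & 0x1Fu) << 6)   |  (funct & 0x3Fu);
    }

    int main(void) {
        /* add $8, $9, $10 : opcode 0, funct 0x20 -> prints 0x012a4020 */
        printf("0x%08x\n", encode_rtype(0, 9, 10, 8, 0, 0x20));
        return 0;
    }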
Levy, "The Intel iAPX 432," 1981.
http://www.cs.washington.edu/homes/levy/capabook/Chapter9.pdf

Why is having registers a good idea?
Because programs exhibit a characteristic called data locality
A recently produced/accessed value is likely to be used more
than once (temporal locality)
Storing that value in a register eliminates the need to go to
memory each time that value is needed

Memory: array of storage locations (M[0] … M[N-1]) indexed by an address
Program Counter: memory address of the current instruction
What Are the Elements of An ISA?
Load/store vs. memory/memory architectures
Load/store architecture: operate instructions operate only on registers
E.g., MIPS, ARM and many RISC ISAs
Memory/memory architecture: operate instructions can operate on memory locations
E.g., x86, VAX and many CISC ISAs

What Are the Elements of An ISA?
Addressing modes specify how to obtain the operands
Absolute: LW rt, 10000
  use immediate value as address
Register Indirect: LW rt, (rbase)
  use GPR[rbase] as address
Displaced or based: LW rt, offset(rbase)
  use offset+GPR[rbase] as address
Indexed: LW rt, (rbase, rindex)
  use GPR[rbase]+GPR[rindex] as address
Memory Indirect: LW rt, ((rbase))
  use value at M[ GPR[rbase] ] as address
Auto inc/decrement: LW rt, (rbase)
  use GPR[rbase] as address, but inc. or dec. GPR[rbase] each time
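A small C sketch of how a simulator might compute the effective address for each mode above (the register-file and memory representations, and the function names, are illustrative assumptions):

    #include <stdint.h>

    #define NUM_REGS 32
    static uint32_t gpr[NUM_REGS];     /* register file                  */
    static uint32_t mem[1 << 20];      /* word-addressed toy memory      */

    /* Effective-address computation for the addressing modes above.
       Indices are assumed to stay in range in this sketch.              */
    uint32_t ea_absolute(uint32_t imm)               { return imm; }
    uint32_t ea_register_indirect(int rbase)         { return gpr[rbase]; }
    uint32_t ea_displaced(int32_t offset, int rbase) { return gpr[rbase] + offset; }
    uint32_t ea_indexed(int rbase, int rindex)       { return gpr[rbase] + gpr[rindex]; }
    uint32_t ea_memory_indirect(int rbase)           { return mem[gpr[rbase]]; }
    uint32_t ea_autoincrement(int rbase, uint32_t size) {
        uint32_t ea = gpr[rbase];
        gpr[rbase] += size;            /* post-increment by access size  */
        return ea;
    }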
Disadvantage:
More work for the compiler
More work for the microarchitect

Tradeoffs?
Which one is more general purpose?
Virtual memory
Each program has the illusion of the entire memory space, which is greater
than physical memory
Access protection
X86: Small Semantic Gap: String Operations
An instruction operates on a string
REP MOVS (DEST SRC)

Small Semantic Gap Examples in VAX
FIND FIRST
Find the first set bit in a bit field
Helps OS resource allocation operations
SAVE CONTEXT, LOAD CONTEXT
Special context switching instructions
INSQUEUE, REMQUEUE
Operations on doubly linked list
INDEX
Array access with bounds checking
STRING Operations
Compare strings, find substrings, …
Cyclic Redundancy Check Instruction
EDITPC
Implements editing functions to display fixed format output
Digital Equipment Corp., "VAX11 780 Architecture Handbook," 1977-78.

Small versus Large Semantic Gap
CISC vs. RISC
Complex instruction set computer: complex instructions
Initially motivated by "not good enough" code generation
Reduced instruction set computer: simple instructions
John Cocke, mid 1970s, IBM 801
Goal: enable better compiler control and optimization
RISC motivated by
Memory stalls (no work done in a complex instruction when
there is a memory stall?)
When is this correct?
Simplifying the hardware: lower cost, higher frequency
Enabling the compiler to optimize the code better
Find fine-grained parallelism to reduce stalls
Klaiber, "The Technology Behind Crusoe Processors," Transmeta White Paper 2000.
18-447 Computer Architecture
Lecture 4: ISA Tradeoffs (Continued) and MIPS ISA
Prof. Onur Mutlu
Kevin Chang
Carnegie Mellon University
Spring 2015, 1/21/2015

Elements of an ISA
Instructions, data types, memory organizations, registers, etc
0, 1, 2, 3 address machines
Addressing modes
Complex (CISC) vs. simple (RISC) instructions
Semantic gap
ISA translation
Simple Decoding
4 bytes per instruction, regardless of format
must be 4-byte aligned (2 lsb of PC must be 2'b00)
format and fields easy to extract in hardware
More modes:
+ help better support programming constructs (arrays, pointer-based accesses)
-- make it harder for the architect to design
-- too many choices for the compiler?
Many ways to do the same thing complicates compiler design
Wulf, "Compilers and Computer Architecture," IEEE Computer 1981
x86
[Figure: x86 addressing modes for memory and register operands — register, register indirect, register + displacement, absolute, indexed (base + index), scaled (base + index*4), and SIB + displacement forms]

X86 SIB-D Addressing Mode
X86 Manual: Suggested Uses of Addressing Modes
Static address, Dynamic storage, Arrays, Records
x86 Manual Vol. 1, page 3-22 -- see course resources on website
Also, see Section 3.7.3 and 3.7.5
X86 Manual: Suggested Uses of Addressing Modes
Static arrays w/ fixed-size elements
2D arrays

Other Example ISA-level Tradeoffs
Condition codes vs. not
VLIW vs. single instruction
Precise vs. imprecise exceptions
Virtual memory vs. not
Unaligned access vs. not
Hardware interlocks vs. software-guaranteed interlocking
Software vs. hardware managed page fault handling
Cache coherence (hardware vs. software)
…

Cons of having no restrictions on alignment
LWL/LWR is slower
Note LWL and LWR still fetch within word boundary

James C. Hoe, Dept of ECE, CMU
ADD instruction
Machine encoding:
0 | rs | rt | rd | 0 | ADD    R-type
6-bit 5-bit 5-bit 5-bit 5-bit 6-bit
Semantics
GPR[rd] <- GPR[rs] + GPR[rt]
PC <- PC + 4
Exception on "overflow"
[MIPS R4000 Microprocessor User's Manual]

ADDI instruction
Semantics
GPR[rt] <- GPR[rs] + sign-extend (immediate)
PC <- PC + 4
Variations
Arithmetic: {signed, unsigned} x {ADD, SUB}
Logical: {AND, OR, XOR, LUI}

Delayed loads
R2000 load has an architectural latency of 1 inst*.
the instruction immediately following a load (in the "delay
slot") still sees the old register value
the load instruction no longer has an atomic semantics
Why would you do it this way?
Is this a good idea? (hint: R4000 redefined LW to
complete atomically)

Delayed branches
R2000 branch instructions also have an architectural
latency of 1 instruction
the instruction immediately after a branch is always
executed (in fact PC-offset is computed from the delay
slot instruction)
branch target takes effect on the 2nd instruction

E.g., High-level Code
if (i == j) then
  e = g
else
  e = h
f = e

[Figure: the corresponding control-flow graph — test X==Y, execute code B or code C, then join at code D]

Assembly Code
suppose e, f, g, h, i, j are in re, rf, rg, rh, ri, rj
      bne ri rj L1      # L1 and L2 are addr labels
      add re rg r0      # e = g
      j L2              # assembler computes offset
L1:   add re rh r0      # e = h
L2:   add rf re r0      # f = e

[Figure: the same code rewritten for the delayed-branch R2000 — nops (or useful instructions) placed in the delay slots after bne and j]
"Load-store" architecture
Simple branches
limited varieties of branch conditions and targets

[Calling-convention fragment: 5. callee saves callee-saved registers to stack (r4~r7, old r29, r31); ....... body of callee (can "nest" additional calls) .......; 6. callee loads results to r2, r3; epilogue]

Microprogrammed Microarchitectures
Food for Thought for You
How would you design a new ISA?
Where would you place it?
What design choices would you make in terms of ISA properties?
What would be the first question you ask in this process?
"What is my design point?"

Review: Other Example ISA-level Tradeoffs
Condition codes vs. not
VLIW vs. single instruction
SIMD (single instruction multiple data) vs. SISD
Precise vs. imprecise exceptions
Virtual memory vs. not
Unaligned access vs. not
Hardware interlocks vs. software-guaranteed interlocking
Software vs. hardware managed page fault handling
Cache coherence (hardware vs. software)
…
A Very Basic Instruction Processing Engine
Single-cycle machine: AS' <- Process instruction (AS)
[Figure: combinational logic transforms the architectural state AS into AS'; sequential logic (state) holds AS]
What is the clock cycle time determined by?
What is the critical path of the combinational logic determined by?

Remember: Programmer Visible (Architectural) State
Memory: array of storage locations (M[0] … M[N-1]) indexed by an address
Registers
- given special names in the ISA (as opposed to addresses)
- general vs. special purpose
Program Counter: memory address of the current instruction
Instructions (and programs) specify how to transform
the values of programmer visible state
Instruction Processing "Cycle" vs. Machine Clock Cycle
Single-cycle machine:
All six phases of the instruction processing cycle take a single
machine clock cycle to complete

Instruction Processing Viewed Another Way
Instructions transform Data (AS) to Data' (AS')
This transformation is done by functional units
Units that "operate" on data

Single-cycle vs. Multi-cycle: Control & Data
Single-cycle machine:
Control signals are generated in the same clock cycle as the
one during which data signals are operated on
Everything related to an instruction happens in one clock cycle
(serialized processing)
Multi-cycle machine:
Control signals needed in the next cycle can be generated in
the current cycle
Latency of control processing can be overlapped with latency
of datapath operation (more parallelism)
We will see the difference clearly in microprogrammed
multi-cycle microarchitectures

Many Ways of Datapath and Control Design
There are many ways of designing the data path and control logic
Single-cycle, multi-cycle, pipelined datapath and control
Single-bus vs. multi-bus datapaths
See your homework 2 question
Hardwired/combinational vs. microcoded/microprogrammed control
Control signals generated by combinational logic versus
control signals stored in a memory structure
Control signals and structure depend on the datapath design
Execution time of a program =
  Sum over all instructions [ {CPI of instruction} x {clock cycle time} ]
  = {# of instructions} x {Average CPI} x {clock cycle time}

A Closer Look
Single-cycle microarchitecture performance
CPI = 1
Clock cycle time = long
Multi-cycle microarchitecture performance
CPI = different for each instruction
Average CPI: hopefully small
Clock cycle time = short
Now, we have two degrees of freedom to optimize independently
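A quick worked example with assumed (illustrative) numbers: a 1000-instruction program on a single-cycle machine with a 5 ns cycle takes 1000 x 1 x 5 ns = 5000 ns; on a multi-cycle machine with a 1 ns cycle and an average CPI of 4.2, it takes 1000 x 4.2 x 1 ns = 4200 ns — shorter, even though each instruction now takes several cycles.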
[Figure: single-cycle datapath skeleton — the PC addresses the instruction memory; separate instruction and data memories with MemRead/MemWrite; all state updates happen combinationally within one cycle. Based on original figure from P&H CO&D, 2004 Elsevier.]

Single-cycle, synchronous memory
Contrast this with memory that tells you when the data is ready
i.e., a Ready bit: indicating the read or write is done

JAL, JR, JALR omitted
ADD: machine encoding
6-bit 5-bit 5-bit 5-bit 5-bit 6-bit

Semantics
if MEM[PC] == ADD rd rs rt
  GPR[rd] <- GPR[rs] + GPR[rt]
  PC <- PC + 4

[Figure: single-cycle datapath for an R-type ADD — IF, ID, EX, MEM, WB all happen in one cycle; combinational state update logic. Based on original figure from P&H CO&D, 2004 Elsevier.]
ADDI: machine encoding
ADDI | rs | rt | immediate    I-type

Semantics
if MEM[PC] == ADDI rt rs immediate
  GPR[rt] <- GPR[rs] + sign-extend (immediate)
  PC <- PC + 4

[Figure: single-cycle datapath for I-type ALU instructions — an isItype/ALUSrc mux selects the sign-extended immediate as the second ALU operand. Based on original figure from P&H CO&D, 2004 Elsevier.]

Single-Cycle Datapath for Data Movement Instructions

Load Instructions
Assembly (e.g., load 4-byte word)
LW rt offset16 (base)

Machine encoding
LW | base | rt | offset    I-type
6-bit 5-bit 5-bit 16-bit

Semantics
if MEM[PC]==LW rt offset16 (base)
  EA = sign-extend(offset) + GPR[base]
  GPR[rt] <- MEM[ translate(EA) ]
  PC <- PC + 4
Store Instructions
Machine encoding
SW | base | rt | offset    I-type
6-bit 5-bit 5-bit 16-bit

Semantics
if MEM[PC]==SW rt offset16 (base)
  EA = sign-extend(offset) + GPR[base]
  MEM[ translate(EA) ] <- GPR[rt]
  PC <- PC + 4

[Figure: single-cycle datapath for LW and SW — IF, ID, EX, MEM, WB in one cycle; MemRead/MemWrite, ALUSrc, RegDest and RegWrite configured by isItype/isLoad/isStore; combinational state update logic]
[Figure: single-cycle datapath for SW and the combined load/store datapath — control signals isStore, isLoad, isItype, MemRead, MemWrite, ALUSrc, RegDest, RegWrite, MemtoReg select the appropriate paths. Based on original figure from P&H CO&D, 2004 Elsevier.]
Jump Instruction
Machine encoding
J | immediate    J-type

Semantics
if MEM[PC]==J immediate26
  target = { PC[31:28], immediate26, 2'b00 }
  PC <- target

[Figure: single-cycle datapath with an isJ/PCSrc mux selecting the jump target as the next PC. Based on original figure from P&H CO&D, 2004 Elsevier.]

What about JR, JAL, JALR?
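Pulling the semantics above together, a compact C sketch of one "cycle" of the single-cycle machine as an interpreter step (memory sizes are illustrative, translate() is omitted and addresses are used directly; the opcode/funct constants are the standard MIPS values):

    #include <stdint.h>

    #define OP_ADDI 0x08
    #define OP_LW   0x23
    #define OP_SW   0x2B
    #define OP_J    0x02

    static uint32_t PC, GPR[32];          /* note: a real machine keeps GPR[0]==0 */
    static uint32_t MEM[1 << 20];         /* word-addressed toy memory            */

    /* Process exactly one instruction, as the single-cycle machine does. */
    void step(void) {
        uint32_t inst   = MEM[PC >> 2];
        uint32_t opcode = inst >> 26;
        uint32_t rs = (inst >> 21) & 0x1F, rt = (inst >> 16) & 0x1F;
        uint32_t rd = (inst >> 11) & 0x1F, funct = inst & 0x3F;
        int32_t  imm = (int16_t)(inst & 0xFFFF);        /* sign-extend       */
        uint32_t ea  = GPR[rs] + imm;                   /* EA for LW/SW      */

        if (opcode == 0x00 && funct == 0x20)  GPR[rd] = GPR[rs] + GPR[rt]; /* ADD  */
        else if (opcode == OP_ADDI)           GPR[rt] = GPR[rs] + imm;     /* ADDI */
        else if (opcode == OP_LW)             GPR[rt] = MEM[ea >> 2];      /* LW   */
        else if (opcode == OP_SW)             MEM[ea >> 2] = GPR[rt];      /* SW   */

        if (opcode == OP_J)                   /* PC <- {PC[31:28], imm26, 2'b00}   */
            PC = (PC & 0xF0000000) | ((inst & 0x03FFFFFF) << 2);
        else
            PC = PC + 4;
    }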
[Figure: single-cycle datapath with the instruction field breakdown]

R-type instruction fields (bit positions):
[31:26] opcode = 0, [25:21] rs, [20:16] rt, [15:11] rd, [10:6] shamt, [5:0] funct
6-bit, 5-bit, 5-bit, 5-bit, 5-bit, 6-bit

Consider
All R-type and I-type ALU instructions
LW and SW
JAL and JALR require additional RegDest and MemtoReg options
JR and JALR require additional PCSrc options
__ don't care

[Figure/table: control signal settings (RegDest, ALUSrc, MemRead, MemWrite, RegWrite, MemtoReg, ALUOp, PCSrc1=Jump, PCSrc2=Br Taken) shown on the single-cycle datapath for R-type ALU, I-type ALU, LW, SW, Branch, and Jump instructions. Based on original figures from P&H CO&D, 2004 Elsevier.]
[Figure: single-cycle datapath annotated with example component latencies (values of 100ps-550ps per component appear in the figure) for LW, SW, and the other instruction classes; the longest instruction path determines the clock cycle time. Based on original figure from P&H CO&D, 2004 Elsevier.]

Food for thought for you:
Can control logic be on the critical path?
A note on CDC 5600: control store access too long…
What if memory sometimes takes 100ms to access?
Does it make sense to have a simple register to register
add or jump to take {100ms + all else to do a memory operation}?
Optional reading:
Maurice Wilkes, "The Best Way to Design an Automatic
Calculating Machine," Manchester Univ. Computer Inaugural Conf., 1951.
Review: (Micro)architecture Design Principles
Critical path design
Find and decrease the maximum combinational logic delay
Break a path into multiple cycles if it takes too long
Bread and butter (common case) design

Review: Single-Cycle Design vs. Design Principles
Critical path design
Bread and butter (common case) design

Multi-Cycle Microarchitectures
Goal: Let each instruction take (close to) only as much time
as it really needs

A Closer Look
A multi-cycle microarchitecture sequences from state to
state to process an instruction
The behavior of the machine in a state is completely determined by
control signals in that state
The behavior of the entire processor is specified fully by a
finite state machine
In a state (clock cycle), control signals control two things:
How the datapath should process the data
How to generate the control signals for the next clock cycle

Microsequencing
Act of transitioning from one state to another
Determining the next state and the microinstruction for the
next state

Control store: stores control signals for every possible state
Store for microinstructions for the entire FSM
Microsequencer: determines which set of control signals will
be used in the next clock cycle (i.e., next state)
What Determines Next-State Control Signals?
What is happening in the current clock cycle
See the 9 control signals coming from the "Control" block
What are these for?

A Simple LC-3b Control and Datapath

[Figure: the LC-3b instruction processing FSM (Patt and Patel, Appendix C) — fetch states (MAR <- PC; MDR <- M; IR <- MDR), decode (BEN <- IR[11]&N + IR[10]&Z + IR[9]&P, dispatch on IR[15:12]), and per-opcode states for ADD, AND, XOR, BR, JMP, JSR, TRAP, SHF, LEA, LDB, LDW, STB, STW; loads/stores compute MAR <- Base + (shifted) offset, access memory into/from MDR, and write back.
NOTES: B+off6 : Base + SEXT[offset6]; PC+off9 : PC + SEXT[offset9]; *OP2 may be SR2 or SEXT[imm5]; ** [15:8] or [7:0] depending on MAR[0]]

How many cycles does the fastest instruction take?
How many cycles does the slowest instruction take?
Why does the BR take as long as it takes in the FSM?
What determines the clock cycle time?

LC-3b Datapath
Patt and Patel, Appendix C, Figure C.3
[Figure: LC-3b control structure (Patt and Patel, Appendix C) — DRMUX/SR1MUX select DR and SR1 from IR fields; BEN logic combines IR[11:9] with N, Z, P; the microsequencer takes J, COND, IRD along with BEN, R, IR[15:11] and produces the 6-bit address of the next state; the control store is 2^6 x 35 bits, each 35-bit microinstruction holding 9 bits of next-state information (J, COND, IRD) and 26 bits of datapath control signals (LD.MAR, LD.MDR, LD.IR, LD.BEN, LD.REG, LD.CC, LD.PC, gate and mux selects, MIO.EN, R.W, DATA.SIZE, etc.)]

The Microsequencer: Some Questions
When is the IRD signal asserted?

An Exercise in Microprogramming

Handouts
7 pages of Microprogrammed LC-3b design
http://www.ece.cmu.edu/~ece447/s14/doku.php?id=techdocs
http://www.ece.cmu.edu/~ece447/s14/lib/exe/fetch.php?media=lc3b-figures.pdf
[Figure: the full LC-3b state machine again, for reference during the microprogramming exercise — states 18/19 (MAR <- PC), 33 (MDR <- M), 35 (IR <- MDR), 32 (decode, BEN), then the per-opcode states; e.g., tracing LDW through state 6 (MAR <- B+LSHF(off6,1)), 25 (MDR <- M[MAR]), and 27 (DR <- MDR, set CC); state encodings: 18 = 010010, 33 = 100001, 35 = 100011, 32 = 100000, 6 = 000110, 25 = 011001, 27 = 011011]
[Figure: LC-3b microsequencer and control store contents — one 35-bit microinstruction per state (states 0 through 63), to be filled in as part of the microprogramming exercise]
How can you implement a complex instruction using this
control structure?
Think REP MOVS

The designer can translate any desired operation to a
sequence of microinstructions
All the designer needs to provide is
The sequence of microinstructions needed to implement the
desired operation
The ability for the control logic to correctly sequence through
the microinstructions
Any additional datapath control signals needed (no need if the
operation can be "translated" into existing control signals)
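To make the idea concrete, a toy C sketch of a microsequencer loop (the microinstruction encoding, the state count, and the two stubbed helpers are invented for illustration — this is not the LC-3b encoding):

    #include <stdint.h>

    /* Toy microinstruction: next-state info plus datapath control bits. */
    typedef struct {
        uint8_t  next_state;     /* J field: default next state               */
        uint8_t  cond;           /* which condition may modify the next state */
        uint32_t control_bits;   /* signals driven into the datapath          */
    } microinstruction_t;

    static microinstruction_t control_store[64];   /* one entry per state */

    static void   drive_datapath(uint32_t bits)        { (void)bits; /* stub */ }
    static uint8_t apply_conditions(uint8_t j, uint8_t c){ (void)c; return j; /* stub */ }

    void run_fsm(void) {
        uint8_t state = 0;                              /* e.g., start of fetch */
        for (;;) {
            microinstruction_t u = control_store[state];
            drive_datapath(u.control_bits);             /* what the datapath does this cycle */
            /* Microsequencing: pick the next state. A complex instruction
               (think REP MOVS) is just a longer sequence of states, possibly
               looping back on itself until a condition is met.              */
            state = apply_conditions(u.next_state, u.cond);
        }
    }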
LC-3b has byte load and byte store instructions that move
data not aligned at the word-address boundary
Convenience to the programmer/compiler
How does the hardware ensure this works correctly?
Take a look at state 29 for LDB
States 24 and 17 for STB
Additional logic to handle unaligned accesses

What changes, if any, do you make to the
state machine?
datapath?
control store?
microsequencer?
Show all changes and microinstructions
Coming up in Homework 2
Agenda for Today & Next Few Lectures
Single-cycle Microarchitectures
Multi-cycle and Microprogrammed Microarchitectures
Pipelining
Issues in Pipelining: Control & Data Dependence Handling,
State Maintenance and Recovery, …
Out-of-Order Execution
Issues in OoO Execution: Load-Store Handling, …

Recap of Last Lecture
Multi-cycle and Microprogrammed Microarchitectures
Benefits vs. Design Principles
When to Generate Control Signals
Microprogrammed Control: uInstruction, uSequencer, Control Store
LC-3b State Machine, Datapath, Control Structure
An Exercise in Microprogramming
Variable Latency Memory, Alignment, Memory Mapped I/O, …
Microprogramming
Power of abstraction (for the HW designer)
Advantages of uProgrammed Control
Update of Machine Behavior

[Figure: the microsequencer (6-bit next-state address) and the 2^6 x 35 control store — a simple design of the control structure; each 35-bit microinstruction provides 9 bits of sequencing information and 26 bits of control signals]
[Figure: hardwired control logic for the MIPS multi-cycle FSM — inputs from the instruction register opcode field; outputs such as PCWriteCond, PCWrite, IRWrite, IorD, ALUSrcA, …]

Control Logic for MIPS FSM / Microprogrammed Control for MIPS FSM
Disadvantages
You should be very familiar with this right now
What limitations do you see with the multi-cycle design?
Limited concurrency
Some hardware resources are idle during different phases of
the instruction processing cycle
"Fetch" logic is idle when an instruction is being "decoded" or
"executed"
Most of the datapath is idle when a memory access is
happening

Goal: More concurrency -> Higher instruction throughput
(i.e., more "work" completed in one cycle)
Idea: When an instruction is using some resources in its
processing phase, process other instructions on idle
resources not needed by that instruction
E.g., when an instruction is being decoded, fetch the next
instruction
E.g., when an instruction is being executed, decode another
instruction
E.g., when an instruction is accessing data memory (ld/st),
execute the next instruction
E.g., when an instruction is writing its result into the register
file, access data memory for the next instruction
Idea:
Divide the instruction processing cycle into distinct "stages" of
processing
Ensure there are enough hardware resources to process one
instruction in each stage
Process a different instruction in each stage
Instructions consecutive in program order are processed in
consecutive stages

[Figure: pipelined execution diagram — tasks A, B, C, D each go through F, D, E, W in consecutive cycles; pipelined: 4 cycles per 4 instructions in steady state. Is life always this beautiful? Based on original figure from P&H CO&D, 2004 Elsevier.]

The laundry analogy
"place one dirty load of clothes in the washer"
"when the washer is finished, place the wet load in the dryer"
"when the dryer is finished, take out the dry load and fold"
"when folding is finished, ask your roommate (??) to put the clothes away"
- steps to do a load are sequentially dependent
- no dependence between different loads
- different steps do not share resources
[Figure: pipelined laundry timeline, 6 PM to 2 AM — 4 loads of laundry in parallel; no additional resources; throughput increased by 4; with a slower dryer, throughput restored (2 loads per hour) using 2 dryers. Based on original figure from P&H CO&D, 2004 Elsevier.]

What about the instruction processing "cycle"?

[Figure: ideal pipelining — combinational logic of total delay T (gate delay G) split into k stages of G/k each, so a new item can enter every G/k instead of every G]
Remember: the instruction processing "cycle"
1. Instruction fetch (IF)
2. Instruction decode and register operand fetch (ID/RF)
3. Evaluate address
4. Fetch operands
5. Execute
6. Store result

[Figure: the single-cycle MIPS datapath with these phases marked; total combinational delay T, BW = ~(1/T). Based on original figure from P&H CO&D, 2004 Elsevier.]
[Figure: executing three loads (lw $1, 100($0); lw $2, 200($0); lw $3, 300($0)) — in the single-cycle design each instruction takes 800ps; in the pipelined design the stages (instruction fetch, register access, ALU, data access, register write) each take 200ps, so a new instruction starts every 200ps. Based on original figure from P&H CO&D, 2004 Elsevier.]
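From the figure's numbers: the three loads take 3 x 800 ps = 2400 ps unpipelined, versus 5 x 200 ps + 2 x 200 ps = 1400 ps pipelined; for a long instruction stream the pipelined machine approaches one instruction every 200 ps, a speedup approaching 800/200 = 4 rather than 5, because the 800 ps single-cycle path does not divide into five perfectly balanced stages.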
[Figure: the MIPS datapath divided into 5 pipeline stages by the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers; values needed later (PC+4, register read values, immediate, ALU result, memory data, destination register) are carried down the pipeline in these registers. With k stages, the T ps single-cycle path becomes roughly T/k ps per stage. Based on original figure from P&H CO&D, 2004 Elsevier.]
[Figure: lw $10, 20($1) and sub $11, $2, $3 flowing through the pipelined datapath cycle by cycle (instruction fetch, instruction decode, execution, memory, write back)]

[Figure: pipeline occupancy diagram — Inst0 through Inst4 each in IF, ID, EX, MEM, WB in consecutive cycles; in steady state the pipeline is full and one instruction completes every cycle. Based on original figure from P&H CO&D, 2004 Elsevier.]
[Figure: pipelined datapath with control — the control signals decoded in ID are carried in the ID/EX, EX/MEM, and MEM/WB pipeline registers to the stage that uses them (EX, M, WB groups). Based on original figure from P&H CO&D, 2004 Elsevier.]

Pipelined control options:
Option 1: decode once using the same logic as single-cycle and
buffer the control signals until they are consumed
Option 2: carry the relevant "instruction word/field" down the pipeline
and decode locally in the stage that needs it

Data dependence types
Flow dependence (Read-after-Write, RAW)
Flow dependences always need to be obeyed because they
constitute true dependence on a value
Anti dependence (Write-after-Read, WAR)
r3 <- r1 op r2
r1 <- r4 op r5
Output dependence (Write-after-Write, WAW)
r3 <- r1 op r2
r5 <- r3 op r4
r3 <- r6 op r7
Anti and output dependences exist due to limited number of
architectural registers
They are dependence on a name, not a value
We will later see what we can do about them
[Figure: lw $10, 20($1) and sub $11, $2, $3 in the pipelined datapath, cycle by cycle. Based on original figure from P&H CO&D, 2004 Elsevier.]

Data Dependence Handling

Approaches to Dependence Detection (I): Scoreboarding
Each register in the register file has a Valid bit; an instruction that
will write a register resets its Valid bit, and an instruction in the
Decode stage stalls until all of its source (and destination)
registers are Valid
Advantage: simple — 1 bit per register
Disadvantage:
Need to stall for all types of dependences, not only flow dep.
Not Stalling on Anti and Output Dependences
What changes would you make to the scoreboard to enable this?

Approaches to Dependence Detection (II)
Combinational dependence check logic
Special logic that checks if any instruction in later stages is
supposed to write to any source register of the instruction that
is being decoded
Yes: stall the instruction/pipeline
No: no need to stall… no flow dependence
Advantage:
No need to stall on anti and output dependences
Disadvantage:
Logic is more complex than a scoreboard
Logic becomes more complex as we make the pipeline deeper
and wider (flash-forward: think superscalar execution)
Agenda for Today & Next Few Lectures
Single-cycle Microarchitectures
Multi-cycle and Microprogrammed Microarchitectures
Pipelining
Issues in Pipelining: Control & Data Dependence Handling,
State Maintenance and Recovery, …
Out-of-Order Execution

Readings for Next Few Lectures (I)
P&H Chapter 4.9-4.11
Smith and Sohi, "The Microarchitecture of Superscalar
Processors," Proceedings of the IEEE, 1995
More advanced pipelining
Interrupt and exception handling
Out-of-order and superscalar execution concepts
McFarling, "Combining Branch Predictors," DEC WRL
Technical Report, 1993.

Pipelining
Basic Idea and Characteristics of An Ideal Pipeline
Pipelined Datapath and Control
Issues in Pipeline Design
Resource Contention
Dependences and Their Types
Control vs. data (flow, anti, output)
Five Fundamental Ways of Handling Data Dependences
Dependence Detection
Interlocking
Scoreboarding vs. Combinational
Control Dependence
Question: What should the fetch PC be in the next cycle?
Answer: The address of the next instruction
All instructions are control dependent on previous ones. Why?
If the fetched instruction is a non-control-flow instruction:
Next Fetch PC is the address of the next-sequential instruction
Easy to determine if we know the size of the fetched instruction
No need to detect

Data Dependence Handling:
More Depth & Implementation
[Figure: pipelined datapath with control, annotated for dependence analysis. Based on original figure from P&H CO&D, 2004 Elsevier.]

RAW dependence that requires interlocking:
Instructions IA and IB (where IA comes before IB) have RAW
dependence iff
IB (R/I, LW, SW, Br or JR) reads a register written by IA (R/I or LW)
and dist(IA, IB) <= dist(ID, WB) = 3

Stall
disable PC and IR latching; ensure stalled instruction stays in its stage
Insert "invalid" instructions/nops into the stage following the stalled one
(called "bubbles")
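A small C sketch of a stall check in the spirit of the condition above (the pipeline bookkeeping is simplified and the struct/field names are illustrative):

    #include <stdbool.h>

    /* One in-flight instruction's destination register, or -1 if none. */
    typedef struct { int dest_reg; } inflight_t;

    /* dist(ID, WB) = 3: an instruction currently in EX, MEM, or WB may still
       be producing a register the instruction now in ID wants to read.     */
    bool must_stall(int src1, int src2, inflight_t in_ex,
                    inflight_t in_mem, inflight_t in_wb) {
        const inflight_t older[3] = { in_ex, in_mem, in_wb };
        for (int i = 0; i < 3; i++) {
            int d = older[i].dest_reg;
            if (d > 0 && (d == src1 || d == src2))  /* r0 is never a real dest */
                return true;                         /* RAW within distance 3  */
        }
        return false;
    }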
Data Forwarding (or Data Bypassing)
It is intuitive to think of RF as state
"add rx ry rz" literally means get values from RF[ry] and RF[rz]
respectively and put result in RF[rx]
But, RF is just a part of a communication abstraction
"add rx ry rz" means
1. get the results of the last instructions to define the values of
RF[ry] and RF[rz], respectively,
2. until another instruction redefines RF[rx], younger instructions
that refer to RF[rx] should use this instruction's result
What matters is to maintain the correct "data flow"
between operations, thus
retrieve operand from datapath instead of the RF
retrieve operand from the youngest definition if multiple
definitions are outstanding

Resolving RAW Dependence with Forwarding
Instructions IA and IB (where IA comes before IB) have RAW
dependence iff
IB (R/I, LW, SW, Br or JR) reads a register written by IA (R/I or LW)
and dist(IA, IB) <= dist(ID, WB) = 3
In other words, if IB in ID stage reads a register written by
IA in EX, MEM or WB stage, then the operand required by IB
is not yet in RF: retrieve the operand from the datapath instead

[Pipeline diagram: add rz r- r- followed by a dependent addi r- rz r- — without forwarding the addi must stall in ID until the add writes back]

[Figure: forwarding paths — muxes in front of the ALU (ForwardA/ForwardB) select between the register-file value, the EX/MEM.RegisterRd result (dist(i,j)=1), and the MEM/WB.RegisterRd result (dist(i,j)=2); dist(i,j)=3 is covered because the RF forwards internally. Based on original figure from P&H CO&D, 2004 Elsevier.]

Why doesn't use_rs( ) appear in the forwarding logic?
What does the above not take into account?

Even with data-forwarding, RAW dependence on an
immediately preceding LW instruction requires a stall
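A hedged C sketch of the ForwardA selection implied by the figure above (the return encoding and the struct fields are illustrative; the conditions mirror the classic P&H forwarding check):

    #include <stdbool.h>

    typedef struct {
        bool reg_write;     /* will this older instruction write a register? */
        int  dest_reg;      /* which register it writes                      */
    } stage_info_t;

    /* Select the first ALU operand for the instruction now in EX:
       0 = value read from the register file (no forwarding needed)
       1 = forward from EX/MEM (result of the instruction 1 ahead)
       2 = forward from MEM/WB (result of the instruction 2 ahead)  */
    int forward_a(int rs, stage_info_t ex_mem, stage_info_t mem_wb) {
        if (ex_mem.reg_write && ex_mem.dest_reg != 0 && ex_mem.dest_reg == rs)
            return 1;                           /* youngest definition wins */
        if (mem_wb.reg_write && mem_wb.dest_reg != 0 && mem_wb.dest_reg == rs)
            return 2;
        return 0;
    }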
Stall Signals
Ensure the pipeline operates correctly in the presence of dependencies

End of Pipelining the LC-3b

Questions to Ponder
What is the role of the hardware vs. the software in data
dependence handling?
Software based interlocking
Hardware based interlocking
Who inserts/manages the pipeline bubbles?
Who finds the independent instructions to fill "empty" pipeline slots?
What are the advantages/disadvantages of each?
Stall Fetch Until Next PC is Available: Good Idea?
[Pipeline diagram: each fetch waits until the previous instruction resolves the next PC, so a new instruction enters the pipeline only every other cycle]

Doing Better than Stalling Fetch …
Rather than waiting for true-dependence on PC to resolve,
just guess nextPC = PC+4 to keep fetching every cycle
Is this a good guess?
What do you lose if you guessed incorrectly?
~20% of the instruction mix is control flow
~50% of "forward" control flow (i.e., if-then-else) is taken
~90% of "backward" control flow (i.e., loop back) is taken
Agenda for Today & Next Few Lectures Reminder: Readings for Next Few Lectures (I)
Single-cycle Microarchitectures P&H Chapter 4.9-4.11
Multi-cycle and Microprogrammed Microarchitectures Smith and Sohi, “The Microarchitecture of Superscalar
Processors,” Proceedings of the IEEE, 1995
Pipelining More advanced pipelining
Interrupt and exception handling
Out-of-order and superscalar execution concepts
Issues in Pipelining: Control & Data Dependence Handling,
State Maintenance and Recovery, …
McFarling, “Combining Branch Predictors,” DEC WRL
Technical Report, 1993. HW3 summary paper
Out-of-Order Execution
Reminder: Readings for Next Few Lectures (II)
  Smith and Pleszkun, “Implementing Precise Interrupts in Pipelined Processors,” IEEE Trans. on Computers, 1988 (earlier version in ISCA 1985). HW3 summary paper

Reminder: Relevant Seminar Tomorrow
  Practical Data Value Speculation for Future High-End Processors
  Arthur Perais, INRIA (France)
  Thursday, Feb 5, 4:30-5:30pm, CIC Panther Hollow Room
  Summary: Value prediction (VP) was proposed to enhance the performance of superscalar processors by breaking RAW dependencies. However, it has generally been considered too complex to implement. During this presentation, we will review different sources of additional complexity and propose solutions to address them.
  http://www.ece.cmu.edu/~calcm/doku.php?id=seminars:seminars
Disadvantages
  Useless work: some instructions fetched/executed but discarded (especially bad for easy-to-predict branches)
  Requires additional ISA support
[Figure: delay-slot example — 6 cycles vs. 5 cycles]
Filling the branch delay slot
  Unconditional branch: easier to find instructions to fill the delay slot
  Conditional branch: condition computation should not depend on instructions in delay slots → difficult to fill the delay slot
[Figure: delay slot filling — an instruction moved into the delay slot (e.g., sub $t4, $t5, $t6 or add $s1, $s2, $s3) must not change program semantics (no WAR hazard) when taken from before the branch; when filled from the taken or not-taken path within the same basic block, correctness may require adding a new instruction to the other path — is this safe? Based on original figure from P&H CO&D, copyright 2004 Elsevier. All rights reserved.]

Potential solutions if the instruction is a control-flow instruction:
  Stall the pipeline until we know the next fetch address
  Guess the next fetch address (branch prediction)
  Employ delayed branching (branch delay slot)
  Do something else (fine-grained multithreading)
  Eliminate control-flow instructions (predicated execution)
  Fetch from both possible paths (if you know the addresses of both possible paths) (multipath execution)
Kongetira et al., “Niagara: A 32-Way Multithreaded Sparc Processor,” IEEE Micro 2005.
Slide credit: Joel Emer
Branch Prediction
[Figure: fetch engine (I-$, DEC, RF, WB, D-$) running the stream LD R1, MEM[R0]; ADD R2, R2, #1; BRZERO 0x0001; ADD R3, R2, #1; MUL R1, R2, R3; LD R2, MEM[R2]; LD R0, MEM[R2] at addresses 0x0001–0x0007. Stalling fetch until the branch resolves takes 12 cycles; with branch prediction the same sequence takes 8 cycles.]
[Figure: two pipeline timelines in which Insth is a branch whose condition and target are evaluated in the ALU stage. Left: fetch stalls; when the branch resolves, the branch target (Instk) is fetched. Right: fetch continues down the fall-through path (Insti at PC+4, Instj at PC+8); when the branch resolves, all instructions fetched since Insth (the so-called “wrong-path” instructions) must be flushed and fetch redirects to the target.]
[Figure: hazard detection unit operating in the decode stage, examining the IF/ID and ID/EX pipeline registers.]
Branch Prediction (Enhanced)
  Idea: Predict the next fetch address (to be used in the next cycle)
  Requires three things to be predicted at fetch stage:
    Whether the fetched instruction is a branch
    (Conditional) branch direction
    Branch target address (if taken)
  Idea: Store the target address from the previous instance and access it with the PC
    Called Branch Target Buffer (BTB) or Branch Target Address Cache

Fetch Stage with BTB and Direction Prediction
[Figure: the Program Counter indexes a direction predictor (taken?) and a cache of target addresses (BTB: Branch Target Buffer); on a hit and a taken prediction, the target address becomes the next fetch address, otherwise PC + inst size.]
  Always-taken CPI = 1 + (0.20*0.3)*2 = 1.12 — 70% of branches are taken, so with 20% of instructions being branches, 30% of them are misfetched at a 2-cycle penalty each
First (1.): Is the fetched instruction a branch? A hit in the cache of target addresses (BTB: Branch Target Buffer) indicates it is; alternatively, a partially decoded instruction stored in the I-cache can record this.
Second (2.): How do we predict the direction?
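To make the fetch-stage logic above concrete, here is a minimal C sketch of next-fetch-address selection with a direct-mapped BTB and a separate direction predictor; table sizes, the tag scheme, and the 2-bit counter array are illustrative assumptions.

#include <stdbool.h>
#include <stdint.h>

#define BTB_ENTRIES 1024
#define INST_SIZE   4

typedef struct { bool valid; uint32_t tag; uint32_t target; } BTBEntry;

static BTBEntry btb[BTB_ENTRIES];
static uint8_t  dir_ctr[BTB_ENTRIES];   // 2-bit direction counters, values 0..3

static bool predict_taken(uint32_t pc) {
    return dir_ctr[(pc / INST_SIZE) % BTB_ENTRIES] >= 2;
}

// Fetch-stage next-PC selection: a BTB hit says "this is a branch" and
// supplies the stored target; the direction predictor decides whether to
// use it. Otherwise fall back to the sequential guess PC + inst size.
uint32_t next_fetch_address(uint32_t pc) {
    uint32_t idx = (pc / INST_SIZE) % BTB_ENTRIES;
    uint32_t tag = (pc / INST_SIZE) / BTB_ENTRIES;
    bool hit = btb[idx].valid && btb[idx].tag == tag;
    if (hit && predict_taken(pc))
        return btb[idx].target;      // predicted-taken branch
    return pc + INST_SIZE;           // default guess: sequential fetch
}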
[Figure: BTB with one tag and one target address per entry, plus an N-bit BHT entry (here 1 bit) per entry; on a tag match (=) and a taken prediction, nextPC = target address, otherwise nextPC = PC+4. State diagram for the 1-bit history: predict taken ↔ predict not taken, switching whenever the actual outcome differs from the prediction.]
The 1-bit BHT (Branch History Table) entry is updated with the correct outcome after each execution of a branch
Solution Idea: Add hysteresis to the predictor so that the prediction does not change on a single different outcome
  Use two bits to track the history of predictions for a branch instead of a single bit
With the 2-bit counter: accuracy for a loop with N iterations = (N-1)/N; for the pattern TNTNTNTNTNTNTNTNTNTN → 50% accuracy (assuming the counter is initialized to weakly taken)
State Machine for 2-bit Saturating Counter / Hysteresis Using a 2-bit Counter
  Counter using saturating arithmetic (arithmetic with maximum and minimum values)
[State diagram: four states — 11 “strongly taken”, 10 “weakly taken”, 01 “weakly !taken”, 00 “strongly !taken”. Predict taken in states 11 and 10, predict !taken in 01 and 00; an actually-taken outcome moves the counter toward 11, an actually-!taken outcome moves it toward 00. Change prediction after 2 consecutive mistakes.]
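A minimal C sketch of the 2-bit saturating counter state machine above (the encoding 0–3 matches the four states in the diagram; function names are illustrative).

#include <stdbool.h>
#include <stdint.h>

// 0 = strongly !taken, 1 = weakly !taken, 2 = weakly taken, 3 = strongly taken
typedef uint8_t Counter2;

bool counter2_predict(Counter2 c) { return c >= 2; }

// Saturating update: increment toward 3 on a taken outcome, decrement toward
// 0 on a not-taken outcome. The prediction flips only after two consecutive
// mispredictions, which is the hysteresis the 1-bit scheme lacks.
Counter2 counter2_update(Counter2 c, bool actually_taken) {
    if (actually_taken) return (Counter2)(c < 3 ? c + 1 : 3);
    else                return (Counter2)(c > 0 ? c - 1 : 0);
}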
Is this good enough? How big is the branch problem?
  Problem: The next fetch address after a control-flow instruction is not determined for N cycles in a pipelined processor
  N cycles: (minimum) branch resolution latency
How long does it take to fetch 500 instructions?
  100% accuracy: 100 cycles (all instructions fetched on the correct path); no wasted work
  99% accuracy: 100 (correct path) + 20 (wrong path) = 120 cycles; 20% extra instructions fetched
  98% accuracy: 100 (correct path) + 20 * 2 (wrong path) = 140 cycles; 40% extra instructions fetched
  95% accuracy: 100 (correct path) + 20 * 5 (wrong path) = 200 cycles; 100% extra instructions fetched

Realization 1: A branch’s outcome can be correlated with other branches’ outcomes → Global branch correlation
Realization 2: A branch’s outcome can be correlated with past outcomes of the same branch (other than the outcome of the branch “last-time” it was executed) → Local branch correlation
Reminder: Readings for Next Few Lectures (I)
  P&H Chapter 4.9-4.11
  Smith and Sohi, “The Microarchitecture of Superscalar Processors,” Proceedings of the IEEE, 1995
    More advanced pipelining
    Interrupt and exception handling
    Out-of-order and superscalar execution concepts

Reminder: Readings for Next Few Lectures (II)
  Smith and Pleszkun, “Implementing Precise Interrupts in Pipelined Processors,” IEEE Trans. on Computers, 1988 (earlier version in ISCA 1985). HW3 summary paper
Two Level Global Branch Prediction
  First level: Global branch history register (GHR, N bits) — the directions of the last N branches
  Second level: Table of saturating counters for each history entry (the Pattern History Table, PHT) — the direction the branch took the last time the same history was seen
[Figure: the GHR (global branch history register, holding previous branches’ directions) indexes the PHT; entries 00…00 through 11…11 each hold a 2-bit counter; the last outcome is shifted into the GHR.]

How Does the Global Predictor Work?
  Example: this branch tests i; the last 4 branches test j
    History: TTTN → predict taken for i; next history: TTNT (shift in last outcome)
  McFarling, “Combining Branch Predictors,” DEC WRL TR 1993.
  Yeh and Patt, “Two-Level Adaptive Training Branch Prediction,” MICRO 1991.
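A minimal C sketch of the two-level global scheme just described: an N-bit GHR indexes a PHT of 2-bit saturating counters. The table size and the plain-GHR indexing are illustrative choices (a gshare-style variant would XOR in PC bits).

#include <stdbool.h>
#include <stdint.h>

#define HIST_BITS 12
#define PHT_SIZE  (1u << HIST_BITS)

static uint16_t ghr;                 // global history register: last HIST_BITS outcomes
static uint8_t  pht[PHT_SIZE];       // pattern history table of 2-bit counters

bool global_predict(void) {
    return pht[ghr & (PHT_SIZE - 1)] >= 2;
}

void global_update(bool taken) {
    uint8_t *c = &pht[ghr & (PHT_SIZE - 1)];
    if (taken)  { if (*c < 3) (*c)++; }          // saturating increment
    else        { if (*c > 0) (*c)--; }          // saturating decrement
    ghr = (uint16_t)((ghr << 1) | (taken ? 1 : 0));   // shift in last outcome
}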
[Figure: fetch stage in which the direction predictor is indexed using the global branch history (together with the PC); the Program Counter indexes the cache of target addresses (BTB: Branch Target Buffer); next fetch address = predicted target on a hit and taken prediction, else PC + inst size.]
Two Level Local Branch Prediction
  First level: A set of local history registers (N bits each) — which directions earlier instances of *this branch* went
    Select the history register based on the PC of the branch
  Second level: Table of saturating counters for each history entry (the Pattern History Table, PHT) — the direction the branch took the last time the same history was seen

Two-Level Local History Branch Predictor
[Figure: the address of the current instruction selects a local history register; the selected history indexes the direction predictor (PHT of 2-bit counters: taken?); the PC also indexes the BTB for the target address; next fetch address = target on a hit and taken prediction, else PC + inst size.]
Hybrid branch predictors
  Advantages:
    + Better accuracy: different predictors are better for different branches
    + Reduced warmup time (faster-warmup predictor used until the slower-warmup predictor warms up)
  Disadvantages:
    -- Need “meta-predictor” or “selector”
    -- Longer access latency
  McFarling, “Combining Branch Predictors,” DEC WRL Tech Report, 1993.

Alpha 21264 example (Kessler, “The Alpha 21264 Microprocessor,” IEEE Micro 1999):
  Minimum branch penalty: 7 cycles; typical branch penalty: 11+ cycles
  48K bits of target addresses stored in the I-cache
  Predictor tables are reset on a context switch
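For contrast with the global scheme, a minimal C sketch of the two-level local predictor described above: a per-branch local history register, selected by PC bits, indexes the PHT of 2-bit counters. All sizes and the PC hashing are illustrative assumptions.

#include <stdbool.h>
#include <stdint.h>

#define LHIST_BITS 10
#define NUM_LHRS   1024
#define LPHT_SIZE  (1u << LHIST_BITS)

static uint16_t lhr[NUM_LHRS];       // first level: one local history register per PC hash
static uint8_t  lpht[LPHT_SIZE];     // second level: 2-bit counters indexed by local history

bool local_predict(uint32_t pc) {
    uint16_t hist = lhr[(pc >> 2) % NUM_LHRS] & (LPHT_SIZE - 1);
    return lpht[hist] >= 2;
}

void local_update(uint32_t pc, bool taken) {
    uint16_t *h = &lhr[(pc >> 2) % NUM_LHRS];
    uint8_t  *c = &lpht[*h & (LPHT_SIZE - 1)];
    if (taken)  { if (*c < 3) (*c)++; }
    else        { if (*c > 0) (*c)--; }
    *h = (uint16_t)((*h << 1) | (taken ? 1 : 0));   // update this branch's own history
}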
More advanced direction prediction
  Perceptron branch predictor: assigns weights to correlations
    Jimenez and Lin, “Dynamic Branch Prediction with Perceptrons,” HPCA 2001.
  Geometric history length predictor
  Your predictor?

Recall the potential solutions if the instruction is a control-flow instruction:
  Stall the pipeline until we know the next fetch address
  Guess the next fetch address (branch prediction)
  Employ delayed branching (branch delay slot)
  Do something else (fine-grained multithreading)
  Eliminate control-flow instructions (predicated execution)
  Fetch from both possible paths (if you know the addresses of both possible paths) (multipath execution)
Predicated execution example (source: if (a == 5) b = 4; else b = 3;):
  CMPEQ condition, a, 5;
  CMOV condition, b ← 4;
  CMOV !condition, b ← 3;
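The C-level view of this if-conversion, under the assumption that the source condition is a == 5 (which is what the CMPEQ above encodes): the control dependence becomes a data dependence that a compiler can lower to conditional moves or predicated instructions.

int select_b(int a) {
    // Branchy version (control dependence):
    //   if (a == 5) return 4; else return 3;

    // If-converted version (data dependence on the predicate p);
    // a compiler can lower this to CMOV or fully predicated code:
    int p = (a == 5);
    return p ? 4 : 3;
}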
[Figure: instructions A–F flowing through a Fetch, Decode, Rename, Schedule, RegisterRead, Execute pipeline; when a branch resolves as mispredicted, the wrong-path instructions behind it are flushed (pipeline flush!!) and fetch restarts from the correct target — the cost that branch prediction tries to avoid.]

Predicated Execution — Disadvantages:
  -- Causes useless work for branches that are easy to predict
  -- Reduces performance if misprediction cost < useless work
  -- Adaptivity: Static predication is not adaptive to run-time branch behavior. Branch behavior changes based on input set, program phase, control-flow path.
  -- Additional hardware and ISA support
  -- Cannot eliminate all hard-to-predict branches
     -- Loop branches
Wouldn’t it be nice if predication did not require ISA support?
[Figure: wish jump/join code generation.
  Normal branch code:   A: p1 = (cond); branch p1, TARGET;  B: (1) mov b, 1; jmp JOIN;  C: TARGET: (1) mov b, 0;  D: JOIN:
  Predicated code:      A: p1 = (cond);  B: (!p1) mov b, 1;  C: (p1) mov b, 0;  D:
  Wish jump/join code:  A: p1 = (cond); wish.jump p1 TARGET;  B: (!p1) mov b, 1; wish.join JOIN;  C: TARGET: (p1) mov b, 0;  D: JOIN:]

Disadvantages compared to predicated execution
  Extra branch instructions use machine resources
  Extra branch instructions increase the contention for branch predictor table entries
  Constrains the compiler’s scope for code optimizations
Indirect branches — Idea 2: Use history-based target prediction
  E.g., index the BTB with the GHR XORed with the indirect branch PC
  Chang et al., “Target Prediction for Indirect Jumps,” ISCA 1997.
  + More accurate
  -- An indirect branch maps to (too) many entries in the BTB
     -- Conflict misses with other branches (direct or indirect)
     -- Inefficient use of space if the branch has few target addresses
Curious? Kim et al., “VPC Prediction: Reducing the Cost of Indirect Branches via Hardware-Based Dynamic Devirtualization,” ISCA 2007.
Delayed branching
Fine-grained multithreading
Branch prediction

18-447 Computer Architecture
Agenda for Today & Next Few Lectures
  Single-cycle Microarchitectures
  Multi-cycle and Microprogrammed Microarchitectures
  Pipelining
  Issues in Pipelining: Control & Data Dependence Handling, State Maintenance and Recovery, …
  Out-of-Order Execution

Reminder: Readings for Next Few Lectures (I)
  P&H Chapter 4.9-4.11
  Smith and Sohi, “The Microarchitecture of Superscalar Processors,” Proceedings of the IEEE, 1995
    More advanced pipelining
    Interrupt and exception handling
    Out-of-order and superscalar execution concepts
  McFarling, “Combining Branch Predictors,” DEC WRL Technical Report, 1993. HW3 summary paper

Reminder: Readings for Next Few Lectures (II)
  Smith and Pleszkun, “Implementing Precise Interrupts in Pipelined Processors,” IEEE Trans. on Computers, 1988 (earlier version in ISCA 1985). HW3 summary paper

Readings Specifically for Today
  Smith and Pleszkun, “Implementing Precise Interrupts in Pipelined Processors,” IEEE Trans. on Computers, 1988 (earlier version in ISCA 1985). HW3 summary paper
Pipelining and Precise Exceptions: Preserving Sequential Semantics

Multi-Cycle Execution
  Not all instructions take the same amount of time for “execution”
  Idea: Have multiple different functional units that take different numbers of cycles
    Can be pipelined or not pipelined
    Can let independent instructions start execution on a different functional unit before a previous long-latency instruction finishes execution
[Figure: out-of-order completion — FMUL R4 ← R1, R2 occupies the FP unit for many E cycles while the following ADD R3 ← R1, R2 and other short instructions write back (W) earlier; similarly for FMUL R2 ← R5, R6 and ADD R7 ← R5, R6.]
  What is wrong with this picture? Sequential semantics of the ISA NOT preserved!
  What if FMUL incurs an exception?

Exceptions vs. Interrupts
  When to handle
    Exceptions: when detected (and known to be non-speculative)
    Interrupts: when convenient, except for very high priority ones (power failure, machine check/error)
  Priority: process (exception), depends (interrupt)
  Handling context: process (exception), system (interrupt)
Precise exceptions (continued)
  2. No later instruction should be retired. (Retire = commit = finish execution and update arch. state)
  Why precise exceptions?
    Enables (easy) recovery from exceptions, e.g., page faults
    Enables (easily) restartable processes

One solution: make every instruction take the same number of cycles
[Figure: FMUL R3 ← R1, R2, ADD R4 ← R1, R2, and all following instructions are padded to the same (worst-case) number of E cycles before W.]
  Downside: Worst-case instruction latency determines all instructions’ latency
    What about memory operations? Each functional unit takes the worst-case number of cycles?

Other approaches: reorder buffer, history buffer, future register file, checkpointing
  Readings:
    Smith and Pleszkun, “Implementing Precise Interrupts in Pipelined Processors,” IEEE Trans. on Computers, 1988 and ISCA 1985.
    Hwu and Patt, “Checkpoint Repair for Out-of-order Execution Machines,” ISCA 1987.
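A minimal C sketch of the reorder-buffer approach listed above: entries are allocated in program order at dispatch, results are written at completion (possibly out of order), and entries retire strictly in order, so an excepting instruction is handled only when it reaches the head. Structure and field names are illustrative, and bookkeeping such as full/empty checks is omitted.

#include <stdbool.h>

#define ROB_SIZE 64

typedef struct {
    bool valid, done, exception;
    int  dest_reg;          // architectural destination register
    long value;             // result, filled in at completion
} RobEntry;

static RobEntry rob[ROB_SIZE];
static int rob_head = 0, rob_tail = 0;   // retire from head, allocate at tail

int rob_allocate(int dest_reg) {         // at dispatch (program order)
    int id = rob_tail;
    rob[id] = (RobEntry){ .valid = true, .done = false,
                          .exception = false, .dest_reg = dest_reg };
    rob_tail = (rob_tail + 1) % ROB_SIZE;
    return id;                           // this ID is the value's "tag"
}

void rob_complete(int id, long value, bool exc) {   // out-of-order completion
    rob[id].value = value; rob[id].exception = exc; rob[id].done = true;
}

// In-order retirement: architectural state is updated only at the head, so
// when an exception is taken, every older instruction has already retired
// and no younger one has — exactly the precise-exception condition.
void rob_retire(long arch_rf[]) {
    while (rob[rob_head].valid && rob[rob_head].done) {
        if (rob[rob_head].exception) { /* flush pipeline, run handler */ break; }
        arch_rf[rob[rob_head].dest_reg] = rob[rob_head].value;
        rob[rob_head].valid = false;
        rob_head = (rob_head + 1) % ROB_SIZE;
    }
}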
[Figure: reorder buffer organization — the Instruction Cache feeds the Register File, which feeds several Functional Units; results are written into the Reorder Buffer and retired in order to the register file.]
Future File
  The future file is used for fast access to the latest register values (speculative state): the frontend register file
  The architectural register file is used only on exceptions
  Advantage: No need to read the new values from the ROB (no CAM or indirection) or the old value of the destination register

In-Order Pipeline with Future File and Reorder Buffer
  Decode (D): Access future file, allocate entry in ROB, check if instruction can execute; if so, dispatch instruction
  Execute (E): Instructions can complete out-of-order
  Completion (R): Write result to reorder buffer and future file
  Retirement/Commit (W): Check for exceptions; if none, write result to architectural register file or memory; else, flush pipeline, copy architectural file to future file, and start from the exception handler
  In-order dispatch/execution, out-of-order completion, in-order retirement
[Figure: F and D feed functional units of different latencies (integer add, integer mul, FP mul, load/store), followed by R (completion) and W (retirement).]

Can We Reduce the Overhead of Two Register Files?
  Idea: Use indirection, i.e., pointers to data, in frontend and retirement
    Have a single storage that stores register data values
    Keep two register maps (speculative and architectural), also called register alias tables (RATs)
  The future map is used for fast access to the latest register values (speculative state): the frontend register map
  The architectural map is used for state recovery on exceptions (architectural state): the backend register map
Future Map in Intel Pentium 4 — many modern processors are similar: MIPS R10K, Alpha 21264
Reorder Buffer vs. Future Map Comparison [table not recoverable]
Difference between exceptions and branch mispredictions?
  Branch mispredictions are much more common → need fast state recovery to minimize the performance impact of mispredictions
How do the three state maintenance methods fare in terms of recovery latency?
  Reorder buffer
  History buffer
  Future file
Checkpointing
  When a branch is decoded: make a copy of the future file/map and associate it with the branch
  When an instruction produces a register value: all future file/map checkpoints that are younger than the instruction are updated with the value
  When a branch misprediction is detected:
    Restore the checkpointed future file/map for the mispredicted branch when the branch misprediction is resolved
    Flush instructions in the pipeline younger than the branch
    Deallocate checkpoints younger than the branch
  Advantages
    Correct frontend register state is available right after checkpoint restoration → low state recovery latency
  Disadvantages
    Storage overhead
    Complexity in managing checkpoints
Important: Register Renaming with a Reorder Buffer
  Output and anti dependences are not true dependences
    WHY? The same register refers to values that have nothing to do with each other
    They exist due to lack of register IDs (i.e., names) in the ISA
  The register ID is renamed to the reorder buffer entry that will hold the register’s value
    Register ID → ROB entry ID
    Architectural register ID → Physical register ID
    After renaming, the ROB entry ID is used to refer to the register

Review: Register Renaming Examples

Review: Checkpointing for branch misprediction recovery
  Idea: Checkpoint the frontend register state/map at the time a branch is decoded and keep the checkpointed state updated with results of instructions older than the branch
    When an instruction produces a register value, all future file/map checkpoints that are younger than the instruction are updated with the value
  Upon branch misprediction, restore the checkpoint associated with the branch
    Restore the checkpointed future file/map for the mispredicted branch when the branch misprediction is resolved
    Flush instructions in the pipeline younger than the branch
    Deallocate checkpoints younger than the branch
  Hwu and Patt, “Checkpoint Repair for Out-of-order Execution Machines,” ISCA 1987.
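A minimal C sketch of the renaming step described above: a frontend map records, for each architectural register, which ROB entry will produce its latest value; source operands are renamed through the map and the destination is pointed at the newly allocated ROB entry. The rob_allocate call refers to the illustrative ROB sketch earlier; all names here are assumptions.

#define NUM_ARCH_REGS 32

// Frontend register map: -1 means the latest value is already in the
// architectural register file; otherwise it names the producing ROB entry.
// (Initialize all entries to -1 at reset.)
static int rename_table[NUM_ARCH_REGS];

typedef struct { int src1_tag, src2_tag, dest_rob; } RenamedInst;

int rob_allocate(int dest_reg);          // from the ROB sketch above (assumed)

RenamedInst rename(int dest, int src1, int src2) {
    RenamedInst r;
    r.src1_tag = rename_table[src1];     // read the current producers of the sources
    r.src2_tag = rename_table[src2];
    r.dest_rob = rob_allocate(dest);     // new ROB entry will hold dest's value
    rename_table[dest] = r.dest_rob;     // later readers of 'dest' use this tag,
    return r;                            // so anti/output deps on 'dest' disappear
}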
Out-of-Order Execution (Dynamic Instruction Scheduling)
  Load/store addresses
[Example result: 16 vs. 12 cycles for in-order vs. out-of-order dispatch.]
[Figure: Tomasulo’s machine organization — instructions arrive from the instruction unit; FP registers and load buffers supply operands over the operation bus to reservation stations in front of the FP functional units; store buffers hold data going to memory; FP FU results are broadcast back.]
[Figure: Tomasulo example walkthrough, cycles 0–8.]
Questions
  Does the tag have to be the ID of the Reservation Station entry?
  What can potentially become the critical path?
    Tag broadcast → value capture → instruction wake up
  How can you reduce the potential critical paths?

Exercise
  Assume ADD (4-cycle execute), MUL (6-cycle execute); assume one adder and one multiplier
  How many cycles
    in a non-pipelined machine
    in an in-order-dispatch pipelined machine with reorder buffer (no forwarding and full forwarding)
    in an out-of-order dispatch pipelined machine with reorder buffer (full forwarding)
Out-of-Order Execution with Precise Exceptions
  Idea: Use a reorder buffer to reorder instructions before updating architectural state (a TAG and VALUE broadcast bus connects producers to consumers)
  3. Keep track of readiness of source values of an instruction
    Broadcast the “tag” when the value is produced
    Instructions compare their “source tags” to the broadcast tag; if they match, the source value becomes ready
  4. When all source values of an instruction are ready, dispatch the instruction to the functional unit (FU)
    Wakeup and select/schedule the instruction
  Tag broadcast enables communication (of readiness of the produced value) between instructions; wakeup and select enables out-of-order dispatch
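A minimal C sketch of tag broadcast and wakeup/select over a small reservation-station array; the structures are illustrative (in the scheme above the tags would be ROB entry IDs).

#include <stdbool.h>

#define RS_SIZE 16

typedef struct {
    bool busy;
    int  src_tag[2];      // producing tag, or -1 if the value is already available
    long src_val[2];
    bool ready[2];
} RSEntry;

static RSEntry rs[RS_SIZE];

// Tag/value broadcast: every waiting instruction compares its source tags
// against the broadcast tag; on a match it captures the value and marks
// that source ready (this is the "wakeup").
void broadcast(int tag, long value) {
    for (int i = 0; i < RS_SIZE; i++) {
        if (!rs[i].busy) continue;
        for (int s = 0; s < 2; s++) {
            if (!rs[i].ready[s] && rs[i].src_tag[s] == tag) {
                rs[i].src_val[s] = value;
                rs[i].ready[s]   = true;
            }
        }
    }
}

// Select: pick an entry whose sources are all ready and dispatch it to a
// free functional unit (real selects often prefer the oldest ready entry).
int select_ready(void) {
    for (int i = 0; i < RS_SIZE; i++)
        if (rs[i].busy && rs[i].ready[0] && rs[i].ready[1])
            return i;
    return -1;            // nothing ready this cycle
}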
Tomasulo Template [slide content not recoverable]

18-447 Computer Architecture
Lecture 13: Out-of-Order Execution and Data Flow
Agenda for Today & Next Few Lectures
  Single-cycle Microarchitectures
  Multi-cycle and Microprogrammed Microarchitectures
  Pipelining
  …

Readings Specifically for Today
  Smith and Sohi, “The Microarchitecture of Superscalar Processors,” Proceedings of the IEEE, 1995
    More advanced pipelining
    Interrupt and exception handling
    Out-of-order and superscalar execution concepts
Review: In-order vs. Out-of-order Dispatch
  Example: IMUL R3 ← R1, R2; ADD R3 ← R3, R1; ADD R1 ← R6, R7; IMUL R5 ← R6, R8; ADD R7 ← R3, R5
[Figure: with in-order dispatch + precise exceptions, instructions stall (F D STALL E R W) behind the long-latency IMUL; with out-of-order dispatch + precise exceptions, the independent ADD R1 ← R6, R7 and IMUL R5 ← R6, R8 execute under the shadow of the first IMUL.]

Review: Out-of-Order Execution with Precise Exceptions
[Figure: F and D feed a SCHEDULE stage; functional units of different latencies (integer add, integer mul, FP mul, load/store) execute out of order; results go on the TAG and VALUE broadcast bus and are put back in order by the REORDER stage before retirement (R, W).]
Review: Enabling OoO Execution, Revisited
  1. Link the consumer of a value to the producer
    Register renaming: Associate a “tag” with each data value
  3. Keep track of readiness of source values of an instruction
    Broadcast the “tag” when the value is produced
    Instructions compare their “source tags” to the broadcast tag; if they match, the source value becomes ready
  4. When all source values of an instruction are ready, dispatch the instruction to the functional unit (FU)
    Wakeup and select/schedule the instruction

Review: Summary of OOO Execution Concepts
  Register renaming eliminates false dependencies, enables linking of producer to consumers
  Tag broadcast enables communication (of readiness of the produced value) between instructions
  Wakeup and select enables out-of-order dispatch
MUL R3 R1, R2
ADD R5 R3, R4
ADD R7 R2, R6
ADD R10 R8, R9
MUL R11 R7, R10
ADD R5 R5, R11
How does the OoO engine treat the scheduling of a load instruction with respect to previous stores — i.e., whether the load address matches a previous store address?
  Option 1: Assume the load is dependent on all previous stores
    Conservative: Stall the load until all previous stores have computed their addresses (or even retired from the machine)
  Option 2: Assume the load is independent of all previous stores
    Aggressive: Assume the load is independent of unknown-address stores and schedule the load right away
  Option 3: Predict the dependence of a load on an outstanding store
    Intelligent: Predict (with a more sophisticated predictor) if the load is dependent on the/any unknown-address store
Data Forwarding Between Stores and Loads
  We cannot update memory out of program order
    Need to buffer all store and load instructions in the instruction window
  Even if we know all addresses of past stores when we generate the address of a load, two questions still remain:
    1. How do we check whether or not it is dependent on a store?
    2. How do we forward data to the load if it is dependent on a store?
  Modern processors use an LQ (load queue) and an SQ (store queue) for this
    Can be combined or separate between loads and stores
    A load searches the SQ after it computes its address. Why?
    A store searches the LQ after it computes its address. Why?

Food for Thought for You
  Many other design choices
  Should reservation stations be centralized or distributed across functional units? What are the tradeoffs?
  Should reservation stations and the ROB store data values, or should there be a centralized physical register file where all data values are stored? What are the tradeoffs?
  Exactly when does an instruction broadcast its tag?
  …
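A minimal C sketch of a load searching the store queue after computing its address, in the conservative style (it waits on any older store whose address is still unknown and forwards from the youngest older matching store). Structure and field names are illustrative assumptions.

#include <stdbool.h>
#include <stdint.h>

#define SQ_SIZE 32

typedef struct {
    bool     valid, addr_ready, data_ready;
    uint64_t addr;
    long     data;
    unsigned age;                       // program-order age (smaller = older)
} SQEntry;

static SQEntry sq[SQ_SIZE];

// Returns  1 and writes *out if the youngest older matching store can forward,
//          0 if the load must wait (unknown or not-yet-ready older store),
//         -1 if no older store matches and the load may go to the cache.
int search_store_queue(uint64_t load_addr, unsigned load_age, long *out) {
    int best = -1;
    for (int i = 0; i < SQ_SIZE; i++) {
        if (!sq[i].valid || sq[i].age >= load_age) continue;   // only older stores
        if (!sq[i].addr_ready) return 0;            // conservative: unknown address
        if (sq[i].addr == load_addr &&
            (best < 0 || sq[i].age > sq[best].age)) best = i;  // youngest match
    }
    if (best < 0) return -1;
    if (!sq[best].data_ready) return 0;
    *out = sq[best].data;                           // store-to-load forwarding
    return 1;
}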
More Food for Thought for You
  How can you implement branch prediction in an out-of-order execution machine?
    Think about branch history register and PHT updates
    Think about recovery from mispredictions; how to do this fast?
  How can you combine superscalar + out-of-order + branch prediction?

General Organization of an OOO Processor
  Smith and Sohi, “The Microarchitecture of Superscalar Processors,” Proc. IEEE, Dec. 1995.
Boggs et al., “The Microarchitecture of the Pentium 4 Processor,” Intel Technology Journal, 2001.
Kessler, “The Alpha 21264 Microprocessor,” IEEE Micro, March-April 1999.
Yeager, “The MIPS R10000 Superscalar Microprocessor,” IEEE Micro, April 1996.
[Figure: dataflow execution with token/frame storage — operand tokens (e.g., 3, 7, 5 feeding an operator) are matched in frames; need to provide storage for only one operand/operator; tokens travel between networks.]
Data Flow at the ISA level has not been (as) successful
Data Flow implementations under the hood (while preserving sequential ISA semantics) have been very successful
  Out-of-order execution
  Hwu and Patt, “HPSm, a high performance restricted data flow architecture having minimal functionality,” ISCA 1986.

Microarchitecture-level dataflow:
  Patt, Hwu, Shebanow, “HPS, a new microarchitecture: rationale and introduction,” MICRO 1985.
  Patt et al., “Critical issues regarding HPS, a high performance microarchitecture,” MICRO 1985.
  Hwu and Patt, “HPSm, a high performance restricted data flow architecture having minimal functionality,” ISCA 1986.
18-447 Computer Architecture: Out-of-Order Execution; Issues in OoO Execution: Load-Store Handling, …; Alternative Approaches to Instruction Level Parallelism
Prof. Onur Mutlu, Carnegie Mellon University, Spring 2015, 2/18/2015

Kessler, “The Alpha 21264 Microprocessor,” IEEE Micro, March-April 1999.
Review: Pure Data Flow Pros and Cons
  Advantages
    Very good at exploiting irregular parallelism
    Only real dependencies constrain processing
  Disadvantages
    Debugging difficult (no precise state)
    Interrupt/exception handling is difficult (what is precise state semantics?)
    Implementing dynamic data structures difficult in pure data flow models
    Too much parallelism? (Parallelism control needed)
    High bookkeeping overhead (tag matching, data storage)
    Instruction cycle is inefficient (delay between dependent instructions), memory locality is not exploited

Review: Combining Data Flow and Control Flow
  Can we get the best of both worlds? Two possibilities
    Model 1: Keep control flow at the ISA level, do dataflow underneath, preserving sequential semantics
    Model 2: Keep the dataflow model, but incorporate some control flow at the ISA level to improve efficiency, exploit locality, and ease resource management
      Incorporate threads into dataflow: statically ordered instructions; when the first instruction is fired, the remaining instructions execute without interruption in control flow order (e.g., one can pipeline them)
Fisher, “Very Long Instruction Word architectures and the ELI-512,” ISCA 1983.
Vector Processing in More Depth

Vector Registers
  Each vector data register holds N M-bit values
  Vector control registers: VLEN, VSTR, VMASK
    Maximum VLEN can be N (the maximum number of elements stored in a vector register)
    Vector Mask Register (VMASK): indicates which elements of the vector to operate on
[Figure: vector registers V0 and V1, each holding elements 0 through N-1.]
[Figure: address generator feeding 16 interleaved memory banks (0–F). Picture credit: Krste Asanovic]

Scalar loop body (each instruction and its latency):
  MOVA R2 = B              1
  MOVA R3 = C              1
  X: LD R4 = MEM[R1++]     11   ;autoincrement addressing
  LD R5 = MEM[R2++]        11
  ADD R6 = R4 + R5         4
  SHFR R7 = R6 >> 1        1
  ST MEM[R3++] = R7        11
  DECBNZ --R0, X           2    ;decrement and branch if NZ
Loop: For i = 0 to 49: C[i] = (A[i] + B[i]) / 2

Scalar execution time on an in-order processor with 16 banks (word-interleaved: consecutive words are stored in consecutive banks)
  7 dynamic instructions per iteration; the first two loads in the loop can be pipelined
  4 + 50*30 = 1504 cycles
  Why 16 banks? 11-cycle memory access latency; having 16 (>11) banks ensures there are enough banks to overlap enough memory operations to cover the memory latency

Vectorized loop (each instruction and its latency):
  MOVI VLEN = 50        1
  MOVI VSTR = 1         1
  VLD V0 = A            11 + VLN - 1
  VLD V1 = B            11 + VLN - 1
  VADD V2 = V0 + V1     4 + VLN - 1
  VSHFR V3 = V2 >> 1    1 + VLN - 1
  VST C = V3            11 + VLN - 1
[Figure: vector chaining — data is forwarded from the Load Unit through the Multiply and Add units; without chaining, the vector code above takes 285 cycles. Slide credit: Krste Asanovic]
Vector Code Performance – Chaining
  Vector chaining: data forwarding from one vector functional unit to another
  Strict assumption: each memory bank has a single port (memory bandwidth bottleneck)
  These two VLDs cannot be pipelined. WHY? VLD and VST cannot be pipelined. WHY?
  182 cycles

Vector Code Performance – Multiple Memory Ports
  Chaining and 2 load ports, 1 store port in each bank
  79 cycles → 19X perf. improvement!
[Figure: timing diagrams with segments of 1, 11, 4, and 49 cycles for the chained and multi-port cases.]
Scatter example
  Index Vector   Data Vector (to Store)   Stored Vector (in Memory)
  0              3.14                     Base+0  3.14
  2              6.5                      Base+1  X
  6              71.2                     Base+2  6.5
  7              2.71                     Base+3  X
                                          Base+4  X
                                          Base+5  X
                                          Base+6  71.2
                                          Base+7  2.71

Masked operations
  Idea: the VMASK register is a bit mask determining which data elements should not be acted upon
    VLD V0 = A
    VLD V1 = B
    VMASK = (V0 != 0)
    VMUL V1 = V0 * V1
    VST B = V1
  Does this look familiar? This is essentially predicated execution.
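The scalar C loop that the masked vector sequence above corresponds to; the if-condition plays the role of VMASK (array names follow the vector code, the function wrapper is added for illustration).

void masked_multiply(int *A, int *B, int vlen) {
    for (int i = 0; i < vlen; i++) {
        if (A[i] != 0)              // plays the role of VMASK = (V0 != 0)
            B[i] = A[i] * B[i];     // VMUL V1 = V0 * V1 under the mask, then VST B = V1
    }
}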
Some Issues
Stride and banking
As long as they are relatively prime to each other and there
are enough banks to cover bank access latency, we can
sustain 1 element/cycle throughput
Storage of a matrix
Row major: Consecutive elements in a row are laid out
consecutively in memory
Column major: Consecutive elements in a column are laid out
consecutively in memory
You need to change the stride when accessing a row versus
column
[Figures: array vs. vector processor execution in space and time (LD/AD/MU/ST operations over elements A[i], B[i], producing C[i]); and a vector unit organized into parallel lanes with shared instruction issue and a memory subsystem. Slide credit: Krste Asanovic]
Vectorization is a compile-time reordering of operation sequencing; requires extensive loop dependence analysis (Slide credit: Krste Asanovic)
Many existing ISAs include (vector-like) SIMD operations
  Intel MMX/SSEn/AVX, PowerPC AltiVec, ARM Advanced SIMD
MMX Example: Image Overlaying (I) and (II)
  Goal: Overlay the human in image 1 on top of the background in image 2
18-447 Computer Architecture: Out-of-Order Execution; Issues in OoO Execution: Load-Store Handling, …; Alternative Approaches to Instruction Level Parallelism
Prof. Onur Mutlu, Carnegie Mellon University, Spring 2015, 2/20/2015
Review: Vector Processing
  Vector Registers, Stride, Masks, Length
  Memory Banking
  Vectorizable Code
  Scalar vs. Vector Code Execution
  Vector Chaining
  Vector Stripmining
  Gather/Scatter Operations
[Figure: a vector instruction executes the load/add/store of many iterations together, instead of iteration by iteration.]
  Performance improvement limited by vectorizability of code
    Scalar operations limit vector machine performance
    Remember Amdahl’s Law
    CRAY-1 was the fastest SCALAR machine at its time!
  Many existing ISAs include SIMD operations
    Intel MMX/SSEn/AVX, PowerPC AltiVec, ARM Advanced SIMD

To understand this, let’s go back to our parallelizable code example.
But, before that, let’s distinguish between the Programming Model (Software) vs. the Execution Model (Hardware).
Programming Model vs. Hardware Execution Model
  Programming Model refers to how the programmer expresses the code
    E.g., Sequential (von Neumann), Data Parallel (SIMD), Dataflow, Multi-threaded (MIMD, SPMD), …
  Execution Model refers to how the hardware executes the code underneath
    E.g., Out-of-order execution, Vector processor, Array processor, Dataflow processor, Multiprocessor, Multithreaded processor, …
  Execution Model can be very different from the Programming Model
    E.g., von Neumann model implemented by an OoO processor
    E.g., SPMD model implemented by a SIMD processor (a GPU)

How Can You Exploit Parallelism Here?
  for (i=0; i < N; i++)
      C[i] = A[i] + B[i];
[Figure: scalar sequential code — each iteration is a load, load, add, store chain (Iter. 1, Iter. 2, …).]
  Let’s examine three programming options to exploit instruction-level parallelism present in this sequential code:
    1. Sequential (SISD)
    2. Data-Parallel (SIMD)
    3. Multithreaded (MIMD/SPMD)
Prog. Model 1: Sequential (SISD)
  Different iterations are present in the instruction window and can execute in parallel in multiple functional units when their operands are ready; in other words, the loop is dynamically unrolled by the hardware
  Superscalar or VLIW processor: can fetch and execute multiple instructions per cycle

Prog. Model 2: Data Parallel (SIMD)
  Realization: Each iteration is independent
  Idea: Programmer or compiler generates a SIMD instruction to execute the same instruction from all iterations across different data (e.g., VLD, VLD, VADD, VST over the whole vector)
  Best executed by a SIMD processor (vector, array)

Prog. Model 3: Multithreaded
  Realization: Each iteration is independent
  Idea: Programmer or compiler generates a thread to execute each iteration. Each thread does the same thing (but on different data)
  Can be executed on a MIMD machine
  This particular model is also called SPMD: Single Program Multiple Data
  Can also be executed on a SIMT machine: Single Instruction Multiple Thread
SIMD not Exposed to Programmer (SIMT)
  It is programmed using threads (SPMD programming model)
  SIMT: Multiple instruction streams of scalar instructions; threads grouped dynamically into warps
    [LD, LD, ADD, ST], NumThreads
[Figure: scalar threads W, X, Y, Z sharing a common PC form a thread warp feeding a SIMD pipeline; e.g., Warp 0 at PC X, Warp 1 at PC X+1, Warp 20 at PC X+2 containing threads 20*32+1, 20*32+2, …; Thread Warp 3, Thread Warp 7, Thread Warp 8 wait to issue.]
Warp-based SIMD: latency hiding
  Fine-grained multithreading of warps: one instruction per thread in the pipeline at a time (no interlocking); interleave warp execution to hide latencies
  Register values of all threads stay in the register file
[Figure: SIMD pipeline (I-Fetch, Decode, per-lane RF and ALUs, D-Cache); warps whose accesses miss in the memory hierarchy are descheduled while warps whose accesses all hit continue (Thread Warp 1, Thread Warp 2, Thread Warp 7, …). Slide credit: Krste Asanovic]
[Figure: warp issue over time — let’s assume N=16 and 4 threads per warp → 4 warps; warps W0–W5 issue, over time, to the Load Unit, Multiply Unit, and Add Unit, each warp covering its threads/data elements 0–15.]
Sample GPU SIMT Code (Simplified)
  CPU code:
    for (ii = 0; ii < 100000; ++ii) {
      C[ii] = A[ii] + B[ii];
    }
  CUDA code:
    // there are 100000 threads
    __global__ void KernelFunction(…) {
      int tid = blockDim.x * blockIdx.x + threadIdx.x;
      int varA = aa[tid];
      int varB = bb[tid];
      C[tid] = varA + varB;
    }
Sample GPU Program (Less Simplified) [figure not recoverable]
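For completeness, a hedged sketch of how such a kernel is typically launched from the host side; the block size is an illustrative assumption and the argument list is elided here just as it is in the slide’s kernel signature.

// Host-side launch (illustrative): one thread per element, grouped into blocks.
int threadsPerBlock = 256;                                   // assumed block size
int numBlocks = (100000 + threadsPerBlock - 1) / threadsPerBlock;
KernelFunction<<<numBlocks, threadsPerBlock>>>(/* aa, bb, C device pointers */);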
SIMD vs. SIMT Execution Model
  SIMD: A single sequential instruction stream of SIMD instructions; each instruction specifies multiple data inputs
    [VLD, VLD, VADD, VST], VLEN
  SIMT: Multiple instruction streams of scalar instructions; threads grouped dynamically into warps
    [LD, LD, ADD, ST], NumThreads

Threads Can Take Different Paths in Warp-based SIMD
  Each thread can have conditional control flow instructions
  Threads can execute different control flow paths
  Branch divergence occurs when threads inside warps branch to different execution paths
[Figure: control-flow graph A → B, C → D, F → E → G with per-block active masks for threads 1–4 (e.g., C/1001 and D/0110 on Path A, E/1111, G/1111 on Path B).]
  This is the same as conditional execution. Recall the Vector Mask and Masked Vector Operations?
  Slide credit: Tor Aamodt
Dynamic Warp Formation Example
[Figure: execution of Warp x and Warp y through basic blocks A–G; in the baseline, divergent blocks execute with partial masks (e.g., x/1110 and y/0011 at B, x/1000, x/0110, x/0001 at C, D, F); with dynamic warp formation, a new warp is created from scalar threads of both Warp x and Warp y executing at Basic Block D, reducing the number of partially-full warp slots. Slide credit: Tor Aamodt]

Hardware Constraints Limit Flexibility of Warp Grouping
[Figure: SIMD lanes, each with a functional unit and a register file slice holding registers for thread IDs 0, 4, 8, … / 1, 5, 9, … / 2, 6, 10, … / 3, 7, 11, …, connected to the memory subsystem. Can you move any thread flexibly to any lane? Slide credit: Krste Asanovic]
Need techniques to
  Tolerate memory divergence
  Integrate solutions to branch and memory divergence
Generic speak: 30 cores (Slide credit: Kayvon Fatahalian)
NVIDIA GeForce GTX 285 “core”
[Figure: 64 KB of storage for thread contexts (registers), SIMD functional units, and texture (Tex) units.]

Recommended readings
  Narasiman et al., “Improving GPU Performance via Large Warps and Two-Level Warp Scheduling,” MICRO 2011.
  Fung et al., “Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow,” MICRO 2007.
  Jog et al., “Orchestrated Scheduling and Prefetching for GPGPUs,” ISCA 2013.
VLIW: Traditional Characteristics
  Multiple functional units
  Each instruction in a bundle executed in lock step
  Instructions in a bundle statically aligned to be directly fed into the functional units
  Fisher, “Very Long Instruction Word architectures and the ELI-512,” ISCA 1983.
    ELI: Enormously longword instructions (512 bits)
Intel IA-64
  Not fully VLIW, but based on VLIW principles
  EPIC (Explicitly Parallel Instruction Computing)
  Instruction bundles can have dependent instructions
  A few bits in the instruction format specify explicitly which instructions in the bundle are dependent on which other ones

Fisher, “Very Long Instruction Word architectures and the ELI-512,” ISCA 1983.
Disadvantages:
-- Compiler support to partition the program and manage queues
-- Determines the amount of decoupling
-- Branch instructions require synchronization between A and E
-- Multiple instruction streams (can be done with a single one,
though)
18-447 Computer Architecture
Lecture 16: Systolic Arrays & Static Scheduling

Loop Unrolling
  -- What if the iteration count is not a multiple of the unroll factor? (need extra code to detect this)
  -- Increases code size
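A small C illustration of unrolling by a factor of 4, including the cleanup code the first disadvantage above refers to (the loop body and names are illustrative, not from the slides).

void saxpy_unrolled(float *y, const float *x, float a, int n) {
    int i = 0;
    // Unrolled by 4: fewer branch/induction-variable overheads and a larger
    // block of independent operations for the static scheduler to pack.
    for (; i + 3 < n; i += 4) {
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
        y[i + 2] += a * x[i + 2];
        y[i + 3] += a * x[i + 3];
    }
    // Cleanup loop: needed when n is not a multiple of the unroll factor.
    for (; i < n; i++)
        y[i] += a * x[i];
}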
Agenda for Today & Next Few Lectures
  Single-cycle Microarchitectures
  Multi-cycle and Microprogrammed Microarchitectures
  Pipelining
  Issues in Pipelining: Control & Data Dependence Handling, State Maintenance and Recovery, …
  Out-of-Order Execution
  Issues in OoO Execution: Load-Store Handling, …
  Static Instruction Scheduling

Approaches to (Instruction-Level) Concurrency
  Pipelining
  Out-of-order execution
  Dataflow (at the ISA level)
  SIMD Processing (Vector and array processors, GPUs)
  VLIW
  Decoupled Access Execute
  Systolic Arrays
  Static Instruction Scheduling
Systolic array convolution example: y2 = w1*x2 + w2*x3 + w3*x4; y3 = w1*x3 + w2*x4 + w3*x5
Worthwhile to implement the adder and multiplier separately to allow overlapping of add/mul executions
Taken further
  Each PE can have its own data and instruction memory
    Data memory to store partial/temporary results, constants
  Leads to stream processing, pipeline parallelism
    More generally, staged execution

Pipeline-parallel (staged) execution example:
  loop {
    Compute1   → stage A
    Compute2   → stage B
    Compute3   → stage C
  }
Agenda (continued): Pipelining, Out-of-Order Execution, Static Instruction Scheduling
Readings:
  Rau and Fisher, “Instruction-level parallel processing: history, overview, and perspective,” Journal of Supercomputing, 1993.
  Faraboschi et al., “Instruction Scheduling for Instruction Level Parallel Processors,” Proc. IEEE, Nov. 2001.
Static Instruction Scheduling (with a Slight Focus on VLIW)

Agenda
  Static Scheduling
  Key Questions and Fundamentals
  Enabler of Better Static Scheduling: Block Enlargement
    Predicated Execution
    Loop Unrolling
    Trace
    Superblock
    Hyperblock
    Block-structured ISA
Some Terminology: Basic vs. Atomic Block
  Basic block: A sequence (block) of instructions with a single control flow entry point and a single control flow exit point
    A basic block executes uninterrupted (if no exceptions/interrupts)
  Atomic block: A block of instructions where either all instructions complete or none complete
    In most modern ISAs, the atomic unit of execution is at the granularity of an instruction
    A basic block can be considered atomic (if there are no exceptions/interrupts and no side effects observable in the middle of execution)
    One can reorder instructions freely within an atomic block, subject only to true data dependences

VLIW: Finding Independent Operations
  Within a basic block, there is limited instruction-level parallelism (if the basic block is small)
  To find multiple instructions to be executed in parallel, the compiler needs to consider multiple basic blocks
  Problem: Moving an instruction above a branch is unsafe because the instruction is not guaranteed to be executed
  Idea: Enlarge blocks at compile time by finding the frequently-executed paths
    Trace scheduling
    Superblock scheduling
    Hyperblock scheduling
Upward code motion (moving an operation from a BB to its source BB)
  Register values required by the other destination BBs must not be destroyed
  The movement must not cause new exceptions
[Figure: (a) safe and legal vs. (b) illegal upward motion of r1 = r2 & r3 past a use r4 = r1.]
[Figures: trace scheduling example — instructions Instr 1–5 on the trace are reordered (e.g., later instructions hoisted above Instr 1); compensation/bookkeeping code is added on the off-trace entries and exits (blocks B, C, D with side paths X, Y) to preserve correctness.]
Trace Scheduling: build a data precedence graph for a whole trace
Hwu+, “The Superblock: An Effective Technique for VLIW and superscalar compilation,” J of SC 1991.
Hyperblock formation: Block selection
  Select a subset of BBs for inclusion in the hyperblock (HB) — a difficult problem
  + Reduces the effect of unbiased branches on the scheduling block
[Figure: control-flow graph with branch frequencies (e.g., 80/20, 10) guiding which basic blocks (e.g., BB4) to include.]
Disadvantages
  -- “Fault operations” can lead to wasted work (atomicity)
  -- Code bloat (multiple copies of the same basic block exist in the binary and possibly in the I-cache)
  -- Need to predict which enlarged block comes next
Melvin and Patt, “Enhancing Instruction Scheduling with a Block-Structured ISA,” IJPP 1995.
BS-ISA blocks
Single-entry, single exit
Atomic
Need to roll back to the beginning of the block on fault
Multiple paths optimized (hardware has a choice to pick)
IA-64: A “Complicated” VLIW ISA
  Recommended reading: Huck et al., “Introducing the IA-64 Architecture,” IEEE Micro, Sep/Oct 2000.
  + No lock-step execution
  + Static reordering of stores and loads + dynamic checking
  -- Hardware needs to perform dependency checking (albeit aided by software)
  -- Other disadvantages of VLIW still exist
  IA-64 Instruction
    Fixed-length, 41 bits long
    Contains three 7-bit register specifiers
    Contains a 6-bit field for specifying one of the 64 one-bit predicate registers

18-447 Computer Architecture
Lecture 17: Memory Hierarchy and Caches
Three Things That Hinder Static Scheduling
  Dynamic events (static unknowns)
    Branch direction
    Load hit/miss status
    Memory address
  Let’s see how the IA-64 ISA has support to aid scheduling in the presence of statically-unknown load-store addresses

Non-Faulting Loads and Exception Propagation in IA-64
  Idea: Support unsafe code motion
[Figure: the load ld r1=[a] is hoisted above the branch as a speculative ld.s r1=[a]; a chk.s r1 check remains after the branch, followed by the original ld r1=[a] on the recovery path and the use of r1.]
Non-Faulting Loads and Exception Propagation in IA-64 (load and its use)
  Idea: Support unsafe code motion of both the load and its use above the branch
[Figure: ld.s r1=[a] and use=r1 are hoisted above the branch; a chk.s use remains after the branch, followed by the recovery ld r1=[a] and use=r1.]
  Load data can be speculatively consumed (used) prior to the check
  “Speculation” status is propagated with the speculated data
  Any instruction that uses a speculative result also becomes speculative itself (i.e., suppressed exceptions)
  chk.s checks the entire dataflow sequence for exceptions

Aggressive ST-LD Reordering in IA-64
  Idea: Reorder LD/STs in the presence of unknown addresses (potential aliasing)
[Figure: the load ld r1=[x] and its use are hoisted above a store st [?] as an advanced load ld.a r1=[x]; after the store, ld.c r1=[x] re-checks before use=r1.]
  ld.a (advanced load) starts the monitoring of any store to the same address as the advanced load
  If no aliasing has occurred since ld.a, ld.c is a NOP
  If aliasing has occurred, ld.c re-loads from memory
The Memory Hierarchy

Idealism
  Instruction supply, the pipeline (instruction execution), and data supply — ideally each with zero cost
[Figure: die photo of a 4-core chip — CORE 0–3, private L2 caches, shared L3 cache, DRAM interface, DRAM memory controller, DRAM banks.]
DRAM cell needs to be refreshed.
Need more banks, more ports, higher frequency, or faster technology.
SRAM cell
  Two cross-coupled inverters store a single bit
  Feedback path enables the stored value to persist in the “cell”
  4 transistors for storage, 2 transistors for access

Memory array access steps (abbreviated)
  1. Decode row address & drive word-lines
  2. Selected bits drive bit-lines (entire row read, sent to output)
  5. Precharge bit-lines (for next access)
SRAM (Static Random Access Memory)
  Read sequence: 1. address decode; 2. drive row select; 3. selected bit-cells drive bitlines; …
[Figure: SRAM cell with row select, bitline and _bitline.]

DRAM (Dynamic Random Access Memory)
  Bits stored as charges on node capacitance (non-restorative)
    Bit cell loses charge when read
    Bit cell loses charge over time
[Figure: 1T-1C DRAM cell with row enable and a single bitline.]
Idea: Have multiple levels of storage (progressively bigger and slower as the levels are farther from the processor). With good locality of reference, the memory hierarchy gives the illusion of a memory that is both large and fast.
Locality
  Temporal: A program tends to reference the same memory location many times and all within a small window of time
  Spatial: A program tends to reference a cluster of memory locations at a time
    Most notable examples: 1. instruction memory references; 2. array/data structure references

Caching basics: exploit temporal locality
  Temporal locality principle: recently accessed data will be accessed again in the near future
  This is what Maurice Wilkes had in mind:
    Wilkes, “Slave Memories and Dynamic Storage Allocation,” IEEE Trans. on Electronic Computers, 1965.
    “The use is discussed of a fast core memory of, say 32000 words as a slave to a slower core memory of, say, one million words in such a way that in practical cases the effective access time is nearer that of the fast memory than that of the slow memory.”
High frequency pipeline → cannot make the cache large
  But, we want a large cache AND a pipelined design
  Idea: Cache hierarchy
[Figure: CPU with RF and a Level 1 cache, backed by a Level 2 cache, backed by main memory (DRAM).]

Manual vs. automatic management
  Manual: still done in some embedded processors (on-chip scratch pad SRAM in lieu of a cache)
  Automatic: Hardware manages data movement across levels, transparently to the programmer
    ++ programmer’s life is easier
    The average programmer doesn’t need to know about it
      You don’t need to know how big the cache is and how it works to write a “correct” program! (What if you want a “fast” program?)
A modern memory hierarchy (example numbers)
  L2 cache: 512 KB ~ 1 MB, many nsec      — automatic HW cache management
  L3 cache, …
  Main memory (DRAM): GB, ~100 nsec       — automatic demand paging
  Swap disk: 100 GB, ~10 msec

Wilkes, on the slave memory (cache):
  “By a slave memory I mean one which automatically accumulates to itself words that come from a slower main memory, and keeps them available for subsequent use without it being necessary for the penalty of main memory access to be incurred again.”
Hierarchical performance analysis
  hi + mi = 1
  Ti = hi*ti + mi*(ti + Ti+1)
  Ti = ti + mi*Ti+1
  hi and mi are defined to be the hit-rate and miss-rate of just the references that missed at L(i-1)

  Keep mi low
    Increasing capacity Ci lowers mi, but beware of increasing ti
    Lower mi by smarter management (replacement: anticipate what you don’t need; prefetching: anticipate what you will need)
  Keep Ti+1 low
    Faster lower hierarchies, but beware of increasing cost
    Introduce intermediate hierarchies as a compromise
Prof. Onur Mutlu, Carnegie Mellon University, Spring 2015, 2/27/2015

Cache hit rate = (# hits) / (# hits + # misses) = (# hits) / (# accesses)
Average memory access time (AMAT) = (hit-rate * hit-latency) + (miss-rate * miss-latency)
Aside: Can reducing AMAT reduce performance?
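A quick worked example with the formula as written above (the numbers are illustrative, not from the slides): with hit-rate = 0.9, hit-latency = 2 cycles, and miss-latency = 100 cycles, AMAT = 0.9*2 + 0.1*100 = 1.8 + 10 = 11.8 cycles. On the aside: a lower AMAT need not mean higher performance, e.g., if the misses that were eliminated were ones whose latency overlapped with other misses anyway (see the MLP-aware replacement discussion later in this section).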
Agenda for the Rest of 447
  The memory hierarchy
  Caches, caches, more caches (high locality, high bandwidth)
  Virtualizing the memory hierarchy
  Main memory: DRAM

Readings for Today and Next Lecture
  Memory Hierarchy and Caches
  Required: Cache chapters from P&H: 5.1-5.3
Blocks and Addressing the Cache
  Memory is logically divided into fixed-size blocks
  Each block maps to a location in the cache, determined by the index bits in the address (tag | index | byte in block), used to index into the tag and data stores

Direct-Mapped Cache: Placement and Access
  Assume byte-addressable memory: 256 bytes, 8-byte blocks → 32 blocks
  Assume cache: 64 bytes, 8 blocks
    Direct-mapped: A block can go to only one location
    Address split: 2-bit tag, 3-bit index, 3-bit byte-in-block
  Access: index the tag and data stores with the index bits, compare the tag bits in the address with the stored tag in the tag store (=?), and select the byte in block (MUX)
  If a block is in the cache (cache hit), the stored tag should be valid and match the tag of the block
  Addresses with the same index contend for the same location → cause conflict misses
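A minimal C sketch of the address decomposition for exactly the parameters above (256-byte memory, 64-byte direct-mapped cache, 8-byte blocks): 3 byte-in-block bits, 3 index bits, 2 tag bits. Names are illustrative.

#include <stdint.h>

#define BLOCK_BYTES 8      // 3 offset bits
#define NUM_BLOCKS  8      // 3 index bits (direct-mapped: one block per index)

typedef struct { unsigned tag, index, offset; } CacheAddr;

CacheAddr split_address(uint8_t addr) {                 // 8-bit address (256-byte memory)
    CacheAddr a;
    a.offset = addr % BLOCK_BYTES;                      // byte in block (bits 2..0)
    a.index  = (addr / BLOCK_BYTES) % NUM_BLOCKS;       // cache index (bits 5..3)
    a.tag    = addr / (BLOCK_BYTES * NUM_BLOCKS);       // tag (bits 7..6)
    return a;
}
// A hit requires: tag_store[a.index].valid && tag_store[a.index].tag == a.tag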
[Figures: set-associative and fully-associative caches — multiple (V, tag) entries per set in the tag store are compared in parallel (=?); hit logic selects the block from the data store via MUXes, then the byte in block.]
  Direct-mapped caches can lead to a 0% hit rate if more than one block accessed in an interleaved manner maps to the same index
  Higher associativity:
    + Likelihood of conflict misses even lower
    -- More tag comparators and wider data mux; larger tags
Examples:
Not MRU (not most recently used)
Hierarchical LRU: divide the 4-way set into 2-way “groups”,
track the MRU group and the MRU way in each group
Victim-NextVictim Replacement: Only keep track of the victim
and the next victim
Hierarchical LRU (not MRU) Example
Victim/NextVictim replacement (example actions)
  On a cache hit to V: demote NV to V; randomly pick an O block as NV; turn V to O
  On a cache hit to O: do nothing
Cache Replacement Policy: LRU or Random
  LRU vs. Random: Which one is better?
    Example: 4-way cache, cyclic references to A, B, C, D, E → 0% hit rate with LRU policy
  Set thrashing: When the “program working set” in a set is larger than the set associativity
    Random replacement policy is better when thrashing occurs
  In practice: depends on the workload; average hit rates of LRU and Random are similar
  Best of both worlds: Hybrid of LRU and Random
    How to choose between the two? Set sampling
    See Qureshi et al., “A Case for MLP-Aware Cache Replacement,” ISCA 2006.

What Is the Optimal Replacement Policy?
  Belady’s OPT
    Replace the block that is going to be referenced furthest in the future by the program
    Belady, “A study of replacement algorithms for a virtual-storage computer,” IBM Systems Journal, 1966.
    How do we implement this? Simulate?
  Is this optimal for minimizing miss rate?
  Is this optimal for minimizing execution time?
    No. Cache miss latency/cost varies from block to block!
    Two reasons: remote vs. local caches, and miss overlapping
    Qureshi et al., “A Case for MLP-Aware Cache Replacement,” ISCA 2006.
Other cache design decisions: associativity, replacement policy, insertion/placement policy
Readings for Today and Next Lecture: Memory Hierarchy and Caches
How to Improve Cache Performance: three fundamental goals
[Figure: two-way skewed-associative cache — the two ways are indexed with different hash functions of the address (tag | index | byte in block), with an =? tag comparison per way.]
Seznec, “A Case for Two-Way Skewed-Associative Caches,” ISCA 1993.
Software Approaches for Higher Hit Rate
  Restructuring data access patterns
  Restructuring data layout
  Loop interchange
  Data structure separation/merging
  Blocking

Restructuring Data Access Patterns (I)
  Idea: Restructure data layout or data access patterns
  Example: If column-major
    x[i+1,j] follows x[i,j] in memory
    x[i,j+1] is far away from x[i,j]
  Poor code:
    for i = 1, rows
      for j = 1, columns
        sum = sum + x[i,j]
  Better code:
    for j = 1, columns
      for i = 1, rows
        sum = sum + x[i,j]
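The loop-interchange example is shown above; as a companion, here is a hedged C sketch of blocking (tiling), the last technique in the list, applied to matrix multiply, where tile reuse pays off. The tile size is an illustrative tuning parameter chosen so the working set of a tile fits in the cache.

#define TILE 32   // tile size (illustrative; pick so three TILE x TILE tiles fit in cache)

// Blocked (tiled) matrix multiply: C += A * B. Each tile of A, B, and C is
// reused many times while it is resident in the cache, instead of streaming
// entire rows and columns through it.
void matmul_blocked(const double *A, const double *B, double *C, int n) {
    for (int ii = 0; ii < n; ii += TILE)
        for (int jj = 0; jj < n; jj += TILE)
            for (int kk = 0; kk < n; kk += TILE)
                for (int i = ii; i < ii + TILE && i < n; i++)
                    for (int k = kk; k < kk + TILE && k < n; k++) {
                        double a = A[(long)i * n + k];
                        for (int j = jj; j < jj + TILE && j < n; j++)
                            C[(long)i * n + j] += a * B[(long)k * n + j];
                    }
}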
Memory-level parallelism (MLP) and cost-aware replacement
[Figure: memory access timeline — misses to blocks P1, P2, P3, P4 can be serviced in parallel; misses to blocks S1, S2, and S3 are isolated.]
  Implicit assumption of traditional replacement: reducing miss count reduces memory-related stall time
  Misses with varying cost/MLP break this assumption!
    Eliminating an isolated miss helps performance more than eliminating a parallel miss
    Eliminating a higher-latency miss could help performance more

  Two replacement algorithms compared (for a fully associative cache containing 4 blocks):
    1. Minimizes miss count (Belady’s OPT): Hit/Miss H H H M H H H H M M M → Misses = 4, Stalls = 4
    2. Reduces isolated misses (MLP-Aware): Hit/Miss H M M M H M M M H H H → Misses = 6, Stalls = 2 (saved stall cycles)
  Qureshi et al., “A Case for MLP-Aware Cache Replacement,” ISCA 2006. (Required reading for this week)
Multiple Instructions per Cycle
  Can generate multiple cache/memory accesses per cycle
  How do we ensure the cache/memory can handle multiple accesses in the same clock cycle?
  Solutions: true multi-porting; virtual multi-porting (time sharing a port); banking (interleaving)

Handling Multiple Accesses per Cycle (I): True multiporting
  Each memory cell has multiple read or write ports
  + Truly concurrent accesses (no conflicts on read accesses)
  -- Expensive in terms of latency, power, area
  What about a read and a write to the same location at the same time?
    Peripheral logic needs to handle this
Peripheral Logic for True Multiporting
Handling Multiple Accesses per Cycle (II): Virtual multiporting
  Time-share a single port
  Each access needs to be (significantly) shorter than the clock cycle
  Used in Alpha 21264
  Is this scalable?

Handling Multiple Accesses per Cycle (III): Multiple cache copies
  Stores update both caches; loads proceed in parallel
  Used in Alpha 21164
[Figure: load port 1 reads cache copy 1, load port 2 reads cache copy 2; a store writes both copies.]
  Scalability? Store operations form a bottleneck; area proportional to “ports”
1105
How do we design the caches in a multi-core system? CORE 0 CORE 1 CORE 2 CORE 3 CORE 0 CORE 1 CORE 2 CORE 3
Many decisions L2 L2 L2 L2
L2
Shared vs. private caches CACHE CACHE CACHE CACHE
CACHE
Disadvantages
Slower access
L2 L2 L2 L2
CACHE CACHE CACHE CACHE L2 Cores incur conflict misses due to other cores’ accesses
CACHE
Misses due to inter-core interference
Some cores can destroy the hit rate of other cores
DRAM MEMORY CONTROLLER
Guaranteeing a minimum level of service (or fairness) to each core is harder
DRAM MEMORY CONTROLLER
(how much space, how much bandwidth?)
1111 1112
Shared Caches: How to Share?
  Free-for-all sharing
    Placement/replacement policies are the same as in a single-core system (usually LRU or pseudo-LRU)
    Not thread/application aware
    An incoming block evicts a block regardless of which threads the blocks belong to

Example: Utility-Based Shared Cache Partitioning
  Goal: Maximize system throughput
  Observation: Not all threads/applications benefit equally from caching → simple LRU replacement is not good for system throughput
  Idea: Allocate more cache space to applications that obtain the most benefit from more space

The Multi-Core System: A Shared Resource View / Need for QoS and Shared Resource Mgmt.
  Why is unpredictable performance (or lack of QoS) bad?
Virtual Memory
  Hardware and software cooperatively and automatically manage the physical memory space to provide the illusion of a large address space
    The illusion is maintained for each independent process
  An “address translation” mechanism maps this (virtual) address to a “physical address” (called a “real address” in x86)
    The address translation mechanism can be implemented in hardware and software together
A System with Virtual Memory (Page based)
  (Figure: the CPU issues virtual addresses; a page table maps them to physical addresses in memory.)
  Address Translation: the hardware converts virtual addresses into physical addresses via an OS-managed lookup table (page table)

Virtual Pages, Physical Frames
  Virtual address space divided into pages
  Physical address space divided into frames
  A virtual page is mapped to
    a physical frame, if the page is in physical memory
    a location on disk, otherwise
  The page table is the table that stores the mapping of virtual pages to physical frames
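A minimal sketch of this translation step, assuming a hypothetical flat (single-level) page table with 4 KB pages:

  #include <stdint.h>
  #include <stdbool.h>

  #define PAGE_SHIFT 12                        /* 4 KB pages */
  #define PAGE_SIZE  (1u << PAGE_SHIFT)

  typedef struct {
      uint64_t pfn;        /* physical frame number */
      bool     valid;      /* page present in physical memory? */
  } pte_t;

  static pte_t page_table[1u << 20];            /* covers a 4 GB virtual space, for illustration */

  /* Translate a virtual address; returns false on a page fault. */
  bool translate(uint64_t va, uint64_t *pa)
  {
      uint64_t vpn    = va >> PAGE_SHIFT;       /* virtual page number: table index */
      uint64_t offset = va & (PAGE_SIZE - 1);

      if (!page_table[vpn].valid)
          return false;                         /* page fault: OS must bring the page in */

      *pa = (page_table[vpn].pfn << PAGE_SHIFT) | offset;
      return true;
  }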
Some System Software Jobs for VM
  Keeping track of which physical frames are free
  Allocating free physical frames to virtual pages
  Page replacement policy
    When no physical frame is free, what should be swapped out?
  Sharing pages between processes
  Copy-on-write optimization
  Page-flip optimization

Page Fault ("A Miss in Physical Memory")
  If a page is not in physical memory but on disk
    The page table entry indicates the virtual page is not in memory
    An access to such a page triggers a page fault exception
    The OS trap handler is invoked to move data from disk into memory
      Other processes can continue executing
      The OS has full control over placement
  (Figure: before the fault, the page table entry points to disk; after the fault, it points to a frame in physical memory.)
The VPN acts as the index into the address translation table (the page table).
What Is in a Page Table Entry (PTE)?
  The page table is the "tag store" for the physical memory data store
    A mapping table between virtual memory and physical memory
  A PTE is the "tag store entry" for a virtual page in memory
    Need a valid bit to indicate validity/presence in physical memory
    Need tag bits (PFN) to support translation
    Need bits to support replacement
    Need a dirty bit to support "write back caching"
    Need protection bits to enable access control and protection

Remember: Cache versus Page Replacement
  Physical memory (DRAM) is a cache for disk
    Usually managed by system software via the virtual memory subsystem
  Page replacement is similar to cache replacement
    The page table is the "tag store" for the physical memory data store
  What is the difference?
    Required speed of access to cache vs. physical memory
    Number of blocks in a cache vs. physical memory
    "Tolerable" amount of time to find a replacement candidate (disk versus memory access latency)
    Role of hardware versus software
Some Issues in Virtual Memory
Virtual Memory Issue I: How large is the page table and where do we store it?
  In hardware?
  In physical memory? (Where is the PTBR?)
  In virtual memory? (Where is the PTBR?)
  How can we store it efficiently without requiring physical memory that can store all page tables?

  (Figure: with a 64-bit VA, a 40-bit PA, and 4KB pages, the 52-bit VPN indexes the page table, which yields a 28-bit frame number; concatenated with the 12-bit page offset, this forms the 40-bit PA.)

  Suppose a 64-bit VA and a 40-bit PA: how large is the page table?
    2^52 entries x ~4 bytes ≈ 16x10^15 bytes, and that is for just one process! And the process may not be using the entire VM space!

  Idea: multi-level page tables
    Only the first-level page table has to be in physical memory
    Remaining levels are in virtual memory (but get cached in physical memory when accessed)

Virtual Memory Issue II (addressed below): How can we speed up translation & the access control check?
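A minimal sketch of such a multi-level walk, assuming an x86-64-like layout with four levels of 9-bit indices and a simplified PTE format (illustrative only, not the real x86 encoding):

  #include <stdint.h>

  #define PAGE_SHIFT 12
  #define LEVEL_BITS 9                          /* 512 entries per level */
  #define LEVELS     4
  #define PTE_VALID  1ull

  /* Walk a multi-level page table rooted at 'root'; each entry keeps a valid
   * bit in bit 0 and the next-level (or final) frame address in bits 12+.
   * Returns 0 to signal a page fault at any level. */
  uint64_t walk(const uint64_t *root, uint64_t va)
  {
      const uint64_t *table = root;
      for (int level = LEVELS - 1; level >= 0; level--) {
          uint64_t idx = (va >> (PAGE_SHIFT + level * LEVEL_BITS)) & ((1u << LEVEL_BITS) - 1);
          uint64_t pte = table[idx];
          if (!(pte & PTE_VALID))
              return 0;                          /* missing table or missing page */
          uint64_t frame = pte & ~0xFFFull;      /* next-level table or final frame */
          if (level == 0)
              return frame | (va & 0xFFF);       /* final translation */
          table = (const uint64_t *)(uintptr_t)frame;
      }
      return 0;
  }

Only the root table must be resident; any lower-level pointer can be marked invalid until that part of the address space is actually used.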
More on x86 Page Tables (I): Small Pages
More on x86 Page Tables (II): Large Pages
Four-level Paging and Extended Physical Address Space in x86

Virtual Memory Issue II
  How fast is the address translation?
  How can we make it fast?
(Figure: possible orderings of address translation and cache access: access the TLB before the cache (physically-addressed cache), access the cache before the TLB (virtually-addressed cache), or access the TLB and cache in parallel; in a physically-tagged cache the PPN returned by the TLB is compared against the cache tag to determine a hit.)
An Exercise (Concluded)
On a write to a block, search all possible indices that can contain the same physical block, and update/invalidate
  Used in Alpha 21264, MIPS R10K
Restrict page placement in the OS so that index(VA) = index(PA)
  Called page coloring
  Used in many SPARC processors

Review questions
  What levels of the memory hierarchy do the system software's page mapping algorithms influence?
  What are the potential benefits and downsides of page coloring?
Aside: Protection w/o Virtual Memory
  Question: Do we need virtual memory for protection?
  Answer: No
  Other ways of providing memory protection
    Base and bound registers
    Segmentation

Very Quick Overview: Base and Bound
  In a multi-tasking system
    Each process is given a non-overlapping, contiguous physical memory region; everything belonging to a process must fit in that region
    When a process is swapped in, the OS sets base to the start of the process's memory region and bound to the end of the region
    HW translation and protection check (on each memory reference):
      PA = EA + base, provided (PA < bound), else violation
    Each process sees a private and uniform address space (0 .. max)
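A minimal sketch of the per-reference base-and-bound check described above (structure and function names are hypothetical):

  #include <stdint.h>
  #include <stdbool.h>

  /* Per-process relocation registers, loaded by the OS on a context switch. */
  struct base_bound { uint32_t base, bound; };

  /* Hardware check performed on every memory reference:
   * relocate the effective address and verify it stays inside the region. */
  bool translate_base_bound(struct base_bound bb, uint32_t ea, uint32_t *pa)
  {
      uint32_t candidate = ea + bb.base;
      if (candidate >= bb.bound)
          return false;          /* protection violation: trap to the OS */
      *pa = candidate;
      return true;
  }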
Very Quick Overview: Base and Bound (II)
  Limitations of the base and bound scheme

Segmented Address Space
  segment == a base and bound pair
  Segment tables must be privileged data structures and private/unique to each process
  (Figure: the SEG # field of the EA indexes the segment table to obtain a base and bound; PA = EA + base, and the bound check determines whether the access is okay.)
State of the Main Memory System
  Recent technology, architecture, and application trends
    lead to new requirements
    exacerbate old requirements
  Mobile: interactive + non-interactive consolidation
  ...

Major Trends Affecting Main Memory (I)
  Need for main memory capacity, bandwidth, QoS increasing
    Memory capacity per core expected to drop by 30% every two years
    Trends are worse for memory bandwidth per core!
Major Trends Affecting Main Memory (II)
  Need for main memory capacity, bandwidth, QoS increasing
    Multi-core: increasing number of cores/agents
    Data-intensive applications: increasing demand/hunger for data
    Consolidation: cloud computing, GPUs, mobile, heterogeneity
  Main memory energy/power is a key system design concern

Major Trends Affecting Main Memory (III)
  Need for main memory capacity, bandwidth, QoS increasing
  Main memory energy/power is a key system design concern
    IBM servers: ~50% of energy spent in the off-chip memory hierarchy [Lefurgy, IEEE Computer 2003]
    DRAM consumes power when idle and needs periodic refresh
Major Trends Affecting Main Memory (IV)
  Need for main memory capacity, bandwidth, QoS increasing

The DRAM Scaling Problem
  DRAM stores charge in a capacitor (charge-based memory)
    Capacitor must be large enough for reliable sensing
    Access transistor should be large enough for low leakage and high retention time
  Scaling beyond 40-35nm (2013) is challenging [ITRS, 2009]
Evidence of the DRAM Scaling Problem: Row Hammer
  (Figure: an aggressor row is repeatedly opened (VHIGH) and closed (VLOW); the victim rows adjacent to it experience disturbance.)
  Repeatedly opening and closing a row enough times within a refresh interval induces disturbance errors in adjacent rows in most real DRAM chips you can buy today

Most DRAM Modules Are At Risk
  A company: errors in 37 of 43 modules tested (up to ~10^7 errors)
  B company: errors in 45 of 54 modules tested (up to ~10^6 errors)
  C company: errors in 28 of 32 modules tested (up to ~10^5 errors)

Kim+, "Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors," ISCA 2014.
  loop:
    mov (X), %eax    # read from address X
    mov (Y), %ebx    # read from address Y (a different row of the same bank)
    clflush (X)      # flush X from the cache so the next read goes to DRAM
    clflush (Y)      # flush Y from the cache
    mfence           # order the flushes before the next iteration
    jmp loop
• A real reliability & security issue
• In a more controlled environment, we can induce as many as ten million disturbance errors

  CPU                        Errors   Access rate
  Intel Sandy Bridge (2011)  16.1K    11.6M/sec
  AMD Piledriver (2012)      59       6.1M/sec

All modules from 2012-2013 are vulnerable
Kim+, "Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors," ISCA 2014.
(Figure: multiple cores share the main memory.)
Cores interfere with each other when accessing shared main memory
Uncontrolled interference leads to many problems (QoS, performance)
Review: Memory Bank Organization
  Read access sequence:
    1. Decode row address & drive word-lines
    2. Selected bits drive bit-lines (entire row read together)
    3. Amplify row data
    4. Decode column address & select subset of row (send to output); n ≈ m to minimize overall latency
    5. Precharge bit-lines (for next access)

Review: SRAM (Static Random Access Memory)
  Read sequence (2^n rows x 2^m columns bit-cell array, with differential sense amps and a column mux):
    1. address decode
    2. drive row select
    3. selected bit-cells drive bitlines (entire row is read together)
    4. differential sensing and column select (data is ready)
    5. precharge all bitlines (for next read or write)
  Access latency dominated by steps 2 and 3
  Cycling time dominated by steps 2, 3 and 5
    step 2 proportional to 2^m
    steps 3 and 5 proportional to 2^n
Review: DRAM (Dynamic Random Access Memory)
  Bits stored as charges on node capacitance (non-restorative)
    bit cell loses charge when read
  (Figure: one-transistor cell with a row-enable line and a bitline.)

Review: DRAM vs. SRAM
  DRAM
    Slower access (capacitor)
    ...

Physical addressability
  Minimum size of data in memory that can be addressed
    Byte-addressable, word-addressable, 64-bit-addressable
  Microarchitectural addressability depends on the abstraction level of the implementation
  Alignment
    Does the hardware support unaligned access transparently to software?
  Interleaving

Memory banking
  Goal: Reduce the latency of memory array access and enable multiple accesses in parallel
  Idea: Divide the array into multiple banks that can be accessed independently (in the same cycle or in consecutive cycles)
    Each bank is smaller than the entire memory storage
    Accesses to different banks can be overlapped
  A Key Issue: How do you map data to different banks? (i.e., how do you interleave data across banks?)
(Figure: DRAM bank operation: an access to (Row 0, Column 0) loads Row 0 into the row buffer and hits in it; a subsequent access to (Row 1, Column 0) is a row buffer conflict, so Row 1 must be opened and replaces Row 0 in the row buffer before the column mux selects the data.)
DIMM (Dual in-line memory module)
  (Figure: a DIMM has Rank 0 on the front and Rank 1 on the back; each rank consists of 8 chips (Chip 0 ... Chip 7). Chip 0 supplies data bits <0:7>, Chip 1 supplies <8:15>, ..., Chip 7 supplies <56:63>; together they drive the 64-bit Data <0:63> bus of the memory channel.)
  (Figure: inside each chip, a bank (e.g., Bank 0) is an array of rows (row 0 ... row 16k-1) with a row buffer; each chip contributes 1 byte (<0:7>) per transfer.)
DRAM Subsystem Organization
  Channel → DIMM → Rank → Chip → Bank → Row/Column → Cell

Example: Transferring a cache block
  (Figure: a 64B cache block at physical address 0x40 maps to Channel 0, DIMM 0, Rank 0; the block is spread across all 8 chips of the rank.)
  (Figure: the 64B cache block is read out in 8B beats over the Data <0:63> bus; each beat is supplied by all 8 chips in parallel, one byte (<0:7>, <8:15>, ..., <56:63>) per chip.)
A 64B cache block takes 8 I/O cycles to transfer.
Latency Components: Basic DRAM Operation
  CPU → controller transfer time
  Controller latency
    Queuing & scheduling delay at the controller
    Access converted to basic commands
  Controller → DRAM transfer time
  DRAM bank latency
    Simple CAS (column address strobe) if row is "open" OR
    RAS (row address strobe) + CAS if array precharged OR
    PRE + RAS + CAS (worst case)
  DRAM → Controller transfer time
    Bus latency (BL)
  Controller to CPU transfer time

Multiple Banks (Interleaving) and Channels
  Multiple banks
    Enable concurrent DRAM accesses
    Bits in the address determine which bank an address resides in
  Multiple independent channels serve the same purpose
    But they are even better because they have separate data buses
    Increased bus bandwidth
  Enabling more concurrency requires reducing
    Bank conflicts
    Channel conflicts
  How to select/randomize bank/channel indices in the address?
    Lower-order bits have more entropy
    Randomizing hash functions (XOR of different address bits)
Disadvantages
Higher cost than a single channel
More board wires
More pins (if on-chip memory controller)
Address mapping (single channel) examples

Row interleaving: consecutive rows of memory in consecutive banks
  Row (14 bits) | Bank (3 bits) | Column (11 bits) | Byte in bus (3 bits)

Cache block interleaving: consecutive cache block addresses in consecutive banks
  Accesses to consecutive cache blocks are serviced in a pipelined manner
  Row (14 bits) | High Column (8 bits) | Bank (3 bits) | Low Col. (3 bits) | Byte in bus (3 bits)
  The bank index can also be formed by XORing address bits (a randomizing hash) to spread conflicting addresses across banks

Interaction with virtual memory
  The physical address used above is formed from the Physical Frame Number (19 bits) and the Page offset (12 bits), so the operating system can influence which bank/channel/rank a virtual page is mapped to
  Where are consecutive cache blocks placed?
  The OS can perform page coloring to
    Minimize bank conflicts
    Minimize inter-application interference [Muralidhara+ MICRO'11]
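A minimal sketch of how a controller might extract the bank index under the cache-block-interleaved layout above, optionally XOR-hashing it with low-order row bits (field widths follow the example; the function name is hypothetical):

  #include <stdint.h>

  /* Field widths from the example layout:
   * | Row (14) | High Col (8) | Bank (3) | Low Col (3) | Byte in bus (3) | */
  #define BYTE_BITS   3
  #define LOWCOL_BITS 3
  #define BANK_BITS   3
  #define HICOL_BITS  8

  static inline uint32_t bank_index(uint64_t pa, int xor_hash)
  {
      uint32_t bank = (pa >> (BYTE_BITS + LOWCOL_BITS)) & ((1u << BANK_BITS) - 1);
      if (xor_hash) {
          /* XOR with the low 3 row bits so addresses that share a bank
           * index but differ in row map to different banks. */
          uint32_t row = pa >> (BYTE_BITS + LOWCOL_BITS + BANK_BITS + HICOL_BITS);
          bank ^= row & ((1u << BANK_BITS) - 1);
      }
      return bank;
  }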
Multiprocessors
Coherence and consistency
Interconnection networks
Multi-core issues
Cai+, "Flash Correct-and-Refresh: Retention-Aware Error Management for Increased Flash Memory Lifetime," ICCD 2012.
Required Reading (for the Next Few Lectures)
  Onur Mutlu, Justin Meza, and Lavanya Subramanian,
  "The Main Memory System: Challenges and Opportunities"
  Invited Article in Communications of the Korean Institute of Information Scientists and Engineers (KIISE), 2015.
  http://users.ece.cmu.edu/~omutlu/pub/main-memory-system_kiise15.pdf

Required Readings on DRAM
  DRAM Organization and Operation Basics
    Sections 1 and 2 of: Lee et al., "Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture," HPCA 2013.
    http://users.ece.cmu.edu/~omutlu/pub/tldram_hpca13.pdf
    Sections 1 and 2 of: Kim et al., "A Case for Subarray-Level Parallelism (SALP) in DRAM," ISCA 2012.
    http://users.ece.cmu.edu/~omutlu/pub/salp-dram_isca12.pdf
(Figure: DRAM bank with a row buffer: a request to the currently open row (e.g., Row 0) is a row buffer hit; a request to a different row (e.g., Row 1) is a row buffer conflict.)

FR-FCFS (first ready, first come first served) scheduling policy
  1. Row-hit first
  2. Oldest first
Goal 1: Maximize row buffer hit rate → maximize DRAM throughput
Goal 2: Prioritize older requests → ensure forward progress
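A minimal sketch of an FR-FCFS pick over a per-bank request queue (structure and function names are hypothetical):

  #include <stdint.h>
  #include <stddef.h>
  #include <stdbool.h>

  struct request {
      uint32_t row;          /* row this request targets       */
      uint64_t arrival;      /* arrival time (smaller = older) */
  };

  /* Pick the next request for a bank under FR-FCFS:
   * prefer row-buffer hits; break ties (and the no-hit case) by age. */
  int frfcfs_pick(const struct request *q, size_t n, uint32_t open_row)
  {
      int best = -1;
      bool best_hit = false;
      for (size_t i = 0; i < n; i++) {
          bool hit = (q[i].row == open_row);
          if (best < 0 ||
              (hit && !best_hit) ||
              (hit == best_hit && q[i].arrival < q[best].arrival)) {
              best = (int)i;
              best_hit = hit;
          }
      }
      return best;   /* index of the selected request, or -1 if the queue is empty */
  }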
(Figure: a multi-core chip in which one core runs stream and another runs random; their L2 caches connect through the interconnect and the DRAM memory controller to the shared DRAM memory system. The random application ends up with low priority, creating unfairness.)
STREAM (streaming):
  for (j=0; j<N; j++) {
    index = j*linesize;
    A[index] = B[index];
    ...
  }
  - Sequential memory access
  - Very high row buffer locality (96% hit rate)
  - Memory intensive

RANDOM (random):
  for (j=0; j<N; j++) {
    index = rand();
    A[index] = B[index];
    ...
  }
  - Random memory access
  - Very low row buffer locality (3% hit rate)
  - Similarly memory intensive

(Figure: row size 8KB, cache block size 64B: all 128 (8KB/64B) row-hit requests of T0 (STREAM) waiting in the memory request buffer are serviced before T1's (RANDOM) requests.)

Moscibroda and Mutlu, "Memory Performance Attacks," USENIX Security 2007.
(Figure: slowdowns of the two applications on the shared memory system: the memory performance hog is barely slowed down, while the other cores make very slow progress.)
Additional delays due to DRAM constraints
  Called "protocol overhead"
  Examples
    Row conflicts
    Read-to-write and write-to-read delays

Existing memory controllers simply aim to maximize DRAM throughput
  Thread-unaware and thread-unfair
  No intent to service each thread's requests in parallel
  FR-FCFS policy: 1) row-hit first, 2) oldest first
    Unfairly prioritizes threads with high row-buffer locality
    Unfairly prioritizes threads that are memory intensive (many outstanding memory accesses)
  Loss of intra-thread parallelism
    A thread's concurrent requests are serviced serially instead of in parallel
Problems due to uncontrolled interference
  Unfair slowdown of different threads
  Low system performance
  Vulnerability to denial of service
  Priority inversion: unable to enforce priorities/SLAs
  Poor performance predictability (no performance isolation)
  Uncontrollable, unpredictable system

Stall-Time Fair Memory Scheduling
  Idea: the memory controller estimates each thread's slowdown due to interference and schedules requests in a way to balance the slowdowns
  Mutlu and Moscibroda, "Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors," MICRO 2007.
STFM mechanism
  DRAM-related stall-time: the time a thread spends waiting for DRAM memory
    ST_shared: DRAM-related stall-time when the thread runs with other threads
    ST_alone: DRAM-related stall-time when the thread runs alone
  Memory-slowdown = ST_shared / ST_alone
    Relative increase in stall-time
  STFM aims to equalize memory-slowdown for interfering threads, without sacrificing performance
    Considers the inherent DRAM performance of each thread
    Aims to allow proportional progress of threads

  Each cycle, the DRAM controller
    Computes Slowdown = ST_shared / ST_alone for threads with legal requests
    Computes unfairness = MAX Slowdown / MIN Slowdown
  If unfairness < a: use the DRAM-throughput-oriented scheduling policy
  If unfairness ≥ a: use the fairness-oriented scheduling policy
    (1) requests from the thread with MAX Slowdown first
    (2) row-hit first, (3) oldest-first
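A minimal sketch of that per-cycle policy decision, assuming the controller already maintains the ST_shared/ST_alone estimates (names hypothetical):

  #include <stddef.h>

  struct thread_stats {
      double st_shared;   /* estimated stall time running with others */
      double st_alone;    /* estimated stall time if it ran alone     */
  };

  /* Returns the index of the thread whose requests should be prioritized,
   * or -1 to fall back to the throughput-oriented (row-hit-first) policy. */
  int stfm_decide(const struct thread_stats *t, size_t n, double alpha)
  {
      double max_sd = 0.0, min_sd = 1e18;
      int slowest = -1;
      for (size_t i = 0; i < n; i++) {
          double sd = (t[i].st_alone > 0.0) ? t[i].st_shared / t[i].st_alone : 1.0;
          if (sd > max_sd) { max_sd = sd; slowest = (int)i; }
          if (sd < min_sd)   min_sd = sd;
      }
      double unfairness = (min_sd > 0.0) ? max_sd / min_sd : 0.0;
      return (unfairness >= alpha) ? slowest : -1;
  }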
Request batching (PAR-BS)
  Each memory request has a bit ("marked") associated with it
  Batch formation:
    Mark up to Marking-Cap oldest requests per bank for each thread
    Marked requests constitute the batch
    Form a new batch when no marked requests are left
  Marked requests are prioritized over unmarked ones
    No reordering of requests across batches: no starvation, high fairness
  Within-batch scheduling is parallelism aware
Within-batch scheduling example: ranking threads
  Rank threads by max-bank-load (lower is better); break ties by ranking the thread with the lower total-load higher
  Example: the computed (max-bank-load, total-load) values give the ranking T0 > T1 > T2 > T3
  Without ranking: per-thread stall times are 4, 4, 5, 7 → average 5 bank access latencies
  With ranking T0 > T1 > T2 > T3: stall times are 1, 2, 4, 7 → average 3.5 bank access latencies
(Figure: unfairness and system throughput of FR-FCFS, FCFS, NFQ, STFM, and PAR-BS on 4-core, 8-core, and 16-core systems.)
Downsides: does not always prioritize the latency-sensitive applications

More on TCM
  Yoongu Kim, Michael Papamichael, Onur Mutlu, and Mor Harchol-Balter,
  "Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior"
  43rd International Symposium on Microarchitecture (MICRO), pages 65-76, Atlanta, GA, December 2010. Slides (pptx) (pdf)

Throughput vs. Fairness
  (Figure: maximum slowdown vs. weighted speedup for FCFS, FR-FCFS, STFM, PAR-BS, and ATLAS. Throughput-biased policies are good for throughput but do not prioritize less memory-intensive threads, causing unfairness and starvation; fairness-biased policies avoid starvation but reduce throughput.)
  No previous memory scheduling algorithm provides both the best fairness and system throughput
  A single policy for all threads is insufficient
Achieving the Best of Both Worlds
  For throughput
    Prioritize memory-non-intensive threads
  For fairness
    Unfairness is caused by memory-intensive threads being prioritized over each other
    Shuffle the thread ranking
    Memory-intensive threads have different vulnerability to interference: shuffle asymmetrically

Thread Cluster Memory Scheduling [Kim+ MICRO'10]
  1. Group threads into two clusters
  2. Prioritize the non-intensive cluster (higher priority) over the intensive cluster
  3. Use different policies for each cluster
  (Figure: the threads in the system are split into a memory-non-intensive cluster, optimized for throughput, and a memory-intensive cluster, optimized for fairness.)
Clustering threads (TCM operation)
  During a quantum, monitor each thread's behavior:
    1. Memory intensity
    2. Bank-level parallelism
    3. Row-buffer locality
  At the beginning of the next quantum:
    Perform clustering: with T = total memory bandwidth usage, the threads that together consume aT of the bandwidth (a = ClusterThreshold, e.g., a < 10%) form the non-intensive cluster; the rest form the intensive cluster
    Compute the niceness of intensive threads
  The intensive cluster's ranking is shuffled every shuffle interval (~1K cycles)
TCM prioritization
  Non-intensive cluster > intensive cluster
  Non-intensive cluster: lower intensity → higher rank
  Intensive cluster: rank shuffling

(Figure: maximum slowdown vs. weighted speedup for STFM, PAR-BS, ATLAS, and TCM; TCM improves both fairness and system throughput, and adjusting ClusterThreshold trades one off against the other.)

TCM is (relatively) simple
Downsides:
  Scalability to large buffer sizes?
  Robustness of the clustering and shuffling algorithms?
Required Reading (for the Next Few Lectures)
  Onur Mutlu, Justin Meza, and Lavanya Subramanian,
  "The Main Memory System: Challenges and Opportunities"
  Invited Article in Communications of the Korean Institute of Information Scientists and Engineers (KIISE), 2015.
  http://users.ece.cmu.edu/~omutlu/pub/main-memory-system_kiise15.pdf

Required Readings on DRAM
  DRAM Organization and Operation Basics
    Sections 1 and 2 of: Lee et al., "Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture," HPCA 2013.
    http://users.ece.cmu.edu/~omutlu/pub/tldram_hpca13.pdf
    Sections 1 and 2 of: Kim et al., "A Case for Subarray-Level Parallelism (SALP) in DRAM," ISCA 2012.
    http://users.ece.cmu.edu/~omutlu/pub/salp-dram_isca12.pdf
Other Ways of Handling Memory Interference
  1. Prioritization or request scheduling
  2. Data mapping (e.g., memory channel partitioning)
  3. Core/source throttling
  4. Application/thread scheduling
Observation: Modern Systems Have Multiple Channels
  (Figure: a red app on one core and a blue app on another core, each able to reach Memory Channel 0 and Memory Channel 1 through the memory controllers.)
  Muralidhara et al., "Memory Channel Partitioning," MICRO'11.

Data Mapping in Current Systems
  (Figure: pages of both applications are interleaved across both channels.)

Basic Idea
  Map the data of badly-interfering applications to different channels
    Eliminates interference between those applications' requests
  Key principles
    Separate low and high memory-intensity applications
    Separate low and high row-buffer-locality applications
  Muralidhara et al., "Memory Channel Partitioning," MICRO'11.
Key Insight 1: Separate by Memory Intensity
  High memory-intensity applications interfere with low memory-intensity applications in shared memory channels
  (Figure: mapping the data of the low and high memory-intensity applications to different channels saves cycles for the low-intensity application.)

Key Insight 2: Separate by Row-Buffer Locality
  High row-buffer-locality applications interfere with low row-buffer-locality applications in shared memory channels
  (Figure: mapping the data of the low and high row-buffer-locality applications to different channels changes the service order and saves cycles for the low-locality application.)
Integrating partitioning with scheduling
  Applications with very low memory-intensity rarely access memory
    Dedicating channels to them results in precious memory bandwidth waste
    They have the most potential to keep their cores busy; we would really like to prioritize them
  Therefore: always prioritize very low memory-intensity applications in the memory scheduler
  Use memory channel partitioning to mitigate interference between the other applications
  (Figure: normalized system performance relative to FRFCFS and ATLAS, with labeled gains of 7% and 1%; 1.5 KB storage cost for a 24-core, 4-channel system.)
4. Application/thread scheduling
Source Throttling: A Fairness Substrate
  Key idea: Manage inter-thread interference at the cores (sources), not at the shared resources
  Dynamically estimate unfairness in the memory system
  Feed back this information into a controller
  Throttle cores' memory access rates accordingly
    Whom to throttle and by how much depends on the performance target (throughput, fairness, per-thread QoS, etc.)
    E.g., if unfairness > system-software-specified target, then throttle down the core causing unfairness and throttle up the core that was unfairly treated
  Ebrahimi et al., "Fairness via Source Throttling," ASPLOS'10, TOCS'12.

Fairness via Source Throttling (FST) [ASPLOS'10]
  Time is divided into intervals (Interval 1, 2, 3, ...). During each interval, FST performs slowdown estimation and runtime unfairness evaluation:
    1. Estimate system unfairness
    2. Find the app with the highest slowdown (App-slowest)
    3. Find the app causing the most interference for App-slowest (App-interfering)
  Dynamic request throttling at the end of the interval:
    if (Unfairness Estimate > Target) {
      1. Throttle down App-interfering (limit injection rate and parallelism)
      2. Throttle up App-slowest
    }
How to dynamically schedule VMs onto hosts?
  Distributed Resource Management (DRM) policies decide based on, e.g., CPU utilization and memory capacity demand
  VMs within a host also compete for:
    Shared cache capacity
    Shared memory bandwidth
  (Figure: two hosts, each with two cores, a shared LLC, and DRAM. Conventional DRM sees similar CPU utilization (92-98%) and memory capacity usage (~350-370 MB) for a STREAM VM and a gromacs VM, so it may co-locate or swap them without realizing that STREAM saturates shared memory bandwidth.)
  Problem: conventional DRM lacks microarchitecture-level shared resource interference awareness
  Key Idea:
    Monitor and detect microarchitecture-level shared resource interference
    Balance microarchitecture-level resource usage across the cluster to minimize memory interference while maximizing system performance
A-DRM: Architecture-aware Distributed Resource Manager
  (Figure: each host runs an OS+hypervisor with a profiling engine (CPU/memory capacity profiler and architectural resource profiler); a global controller runs an architecture-aware interference detector, architecture-aware distributed resource management policies, and a migration engine.)
  Wang et al., "A-DRM: Architecture-aware Distributed Resource Management of Virtualized Clusters," VEE 2015.
  http://users.ece.cmu.edu/~omutlu/pub/architecture-aware-distributed-resource-management_vee15.pdf
Prioritizing Requests from Limiter Threads
  (Figure: four threads (A-D) of a parallel application with non-critical sections, critical sections 1 and 2, barriers, and time spent waiting for a sync variable or lock; the critical path runs through the most contended critical section. Limiter thread identification finds the most contended critical section (critical section 1) and the limiter thread (thread C); prioritizing its memory requests saves cycles for the whole application.)

More on Parallel Application Memory Scheduling
  Optional reading
  Ebrahimi et al., "Parallel Application Memory Scheduling," MICRO 2011.
  http://users.ece.cmu.edu/~omutlu/pub/parallel-memory-scheduling_micro11.pdf
DRAM Refresh
DRAM capacitor charge leaks over time
Downsides of refresh
-- Energy consumption: Each refresh consumes energy
-- Performance degradation: DRAM rank/bank unavailable while
refreshed
-- QoS/predictability impact: (Long) pause times during refresh
-- Refresh rate limits DRAM capacity scaling
(Figure: a batch of rows in a DRAM bank is periodically refreshed via the auto-refresh command issued by the DRAM controller; the labeled performance overheads of refresh are 8% today and a projected 46% at higher densities.)
Liu et al., "RAIDR: Retention-Aware Intelligent DRAM Refresh," ISCA 2012.
(Corresponding energy overheads: 15% today and a projected 47% at higher densities.)
Observation: Most rows can be refreshed much less often without losing data [Kim+, EDL'09]
Problem: No support in DRAM for different refresh rates per row
Liu et al., "RAIDR: Retention-Aware Intelligent DRAM Refresh," ISCA 2012.
Bloom, "Space/Time Trade-offs in Hash Coding with Allowable Errors," CACM 1970.
Bloom Filters: Pros and Cons
  Advantages
    + Enables storage-efficient representation of set membership
    + Insertion and testing for set membership (presence) are fast
    + No false negatives: if the Bloom filter says an element is not present in the set, the element must not have been inserted
    + Enables tradeoffs between time, storage efficiency, and false positive rate (via sizing and hashing)

Benefits of Bloom Filters as Refresh Rate Bins
  False positives: a row may be declared present in the Bloom filter even if it was never inserted
    Not a problem: some rows are simply refreshed more frequently than needed
  No false negatives: rows are never refreshed less frequently than needed (no correctness problems)
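A minimal Bloom-filter sketch of the insert/test pair (the sizes and the multiplicative hash are illustrative choices, not RAIDR's exact configuration):

  #include <stdint.h>
  #include <stdbool.h>

  #define FILTER_BITS 8192
  #define NUM_HASHES  3

  static uint8_t filter[FILTER_BITS / 8];

  static uint32_t bf_hash(uint32_t row, uint32_t seed)
  {
      return ((row ^ seed) * 2654435761u) % FILTER_BITS;   /* simple illustrative hash */
  }

  /* Insert a (weak) row into the bin: set k bits. */
  void bloom_insert(uint32_t row)
  {
      for (uint32_t i = 0; i < NUM_HASHES; i++) {
          uint32_t b = bf_hash(row, i);
          filter[b / 8] |= (uint8_t)(1u << (b % 8));
      }
  }

  /* Test membership: true may be a false positive (row refreshed more often
   * than needed), but false is never wrong (no under-refresh). */
  bool bloom_test(uint32_t row)
  {
      for (uint32_t i = 0; i < NUM_HASHES; i++) {
          uint32_t b = bf_hash(row, i);
          if (!(filter[b / 8] & (1u << (b % 8))))
              return false;
      }
      return true;
  }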
  ...characterization_isca13.pdf
  Chang+, "Improving DRAM Performance by Parallelizing Refreshes with Accesses," HPCA 2014.
  http://users.ece.cmu.edu/~omutlu/pub/dram-access-refresh-parallelization_hpca14.pdf

Lecture 24: Simulation and Memory Latency Tolerance
Prof. Onur Mutlu
Carnegie Mellon University
Spring 2015, 3/30/2015
Simulation enables
The exploration of many dreams
A reality check of the dreams
Deciding which dream is better
Explore the design space quickly (and decide what you should propose as the next big idea to advance the state of the art)
  Especially over a large number of workloads
  Especially if you want to predict the performance of a good chunk of a workload on a particular design
  Especially if you want to consider many design choices
    Cache size, associativity, block size, algorithms
    Memory control and scheduling algorithms
    In-order vs. out-of-order execution
    Reservation station sizes, ld/st queue size, register file size, …
    …
  Goal: Explore design choices quickly to see their impact on the workloads we are designing the platform for
  The goal is mainly to see relative effects of design decisions

Match the behavior of an existing system so that you can debug and verify it at cycle-level accuracy
  Propose small tweaks to the design that can make a difference in performance or energy
  The goal is very high accuracy

Other goals in-between:
  Refine the explored design space without going into a full detailed, cycle-accurate design
  Gain confidence in your design decisions made by higher-level design space exploration
Why are DRAM Controllers Difficult to Design?
  Need to obey DRAM timing constraints for correctness
    There are many (50+) timing constraints in DRAM
    tWTR: minimum number of cycles to wait before issuing a read command after a write command is issued
    tRC: minimum number of cycles between the issuing of two consecutive activate commands to the same bank
    …
  Need to keep track of many resources to prevent conflicts
    Channels, banks, ranks, data bus, address bus, row buffers
  Need to handle DRAM refresh
  Need to manage power consumption
  Need to optimize performance & QoS (in the presence of constraints)
    Reordering is not simple
    Fairness and QoS needs complicate the scheduling problem

Many DRAM Timing Constraints
  From Lee et al., "DRAM-Aware Last-Level Cache Writeback: Reducing Write-Caused Interference in Memory Systems," HPS Technical Report, April 2010.

Ipek+, "Self Optimizing Memory Controllers: A Reinforcement Learning Approach," ISCA 2008.
(Figure: the reinforcement-learning formulation of the memory controller [Ipek+ ISCA 2008]. Reward: the goal is to maximize data bus utilization, so the reward is given when data is transferred and is 0 at all other times. State attributes: e.g., the number of pending writes, ROB heads waiting for the referenced row, the request's relative ROB order, and the transaction queue contents. Actions: e.g., read for a load miss, read for a store miss, precharge of a pending row, preemptive precharge, and NOP.)
Optional
Mutlu et al., “Efficient Runahead Execution: Power-Efficient
Memory Latency Tolerance,” ISCA 2005, IEEE Micro Top Picks
2006.
Mutlu et al., “Address-Value Delta (AVD) Prediction,” MICRO
2005.
Armstrong et al., “Wrong Path Events,” MICRO 2004.
(Figure: normalized performance and fraction of time stalled on L2 misses for a 128-entry window vs. a 2048-entry window; 500-cycle DRAM latency, aggressive stream-based prefetcher; data averaged over 147 memory-intensive benchmarks on a high-end x86 processor model.)
As main memory latency increases, instruction window size should also increase to fully tolerate the memory latency.
Building a large instruction window is a challenging task if we would like to achieve
  Low power/energy consumption (tag matching logic, ld/st buffers)
  Short cycle time (access, wakeup/select latencies)
  Low design and verification complexity
Runahead Execution (I)
  A technique to obtain the memory-level parallelism benefits of a large instruction window
  Checkpoint architectural state and enter runahead mode
  In runahead mode, the oldest instruction is examined for pseudo-retirement
    An INV instruction is removed from the window immediately
    A VALID instruction is removed when it completes execution
    Pseudo-retired instructions free their allocated resources; this allows the processing of later instructions
  Pseudo-retired stores communicate their data to dependent loads
    A pseudo-retired store writes its data and INV status to a dedicated memory, called the runahead cache
    Purpose: data communication through memory in runahead mode
    A dependent load reads its data from the runahead cache
    Does not need to be always correct → the runahead cache can be very small

Runahead Example
  Perfect caches: Compute, Load 1 Hit, Compute, Load 2 Hit, Compute
  With runahead: on Load 1 Miss, the processor checkpoints and keeps executing, so Load 2's miss is generated while Miss 1 is still outstanding.
Runahead: Disadvantages/Limitations
  -- Extra executed instructions
  -- Limited by branch prediction accuracy
  -- Cannot prefetch dependent cache misses
  -- Effectiveness limited by available "memory-level parallelism" (MLP)
  -- Prefetch distance limited by memory latency
  Implemented in IBM POWER6, Sun "Rock"
  (Figure: performance of runahead across the S95, FP00, INT00, WEB, MM, PROD, SERV, and WS suites and their average, with labeled gains including 13%, 16%, and 52%.)
Runahead Execution vs. Large Windows
  (Figure: micro-operations per cycle for a 128-entry window (baseline), a 128-entry window with runahead, and 256-, 384-, and 512-entry windows, across the S95, FP00, INT00, WEB, MM, PROD, SERV, and WS suites and their average.)

Runahead vs. A (Real) Large Window
  When is one beneficial, when is the other?
  Pros and cons of each
  Which can tolerate FP operation latencies better?
(Figure: further runahead results across the same benchmark suites, with per-suite improvement labels ranging from 13% to 73%.)
Ways to tolerate memory latency
  Prefetching [initially in IBM 360/91, 1967]
    Works well for regular memory access patterns
    Prefetching irregular access patterns is difficult, inaccurate, and hardware-intensive
  Multithreading [initially in CDC 6600, 1964]
    Works well if there are multiple threads
    Improving single-thread performance using multithreading hardware is an ongoing research effort
  Out-of-order execution [initially by Tomasulo, 1967]
    Tolerates irregular cache misses that cannot be prefetched
    Requires extensive hardware resources for tolerating long latencies
    Runahead execution alleviates this problem (as we will see today)

Runahead execution operation
  When the oldest instruction is a long-latency cache miss:
    Checkpoint architectural state and enter runahead mode
  In runahead mode:
    Speculatively pre-execute instructions
    The purpose of pre-execution is to generate prefetches
    L2-miss dependent instructions are marked INV and dropped
  Runahead mode ends when the original miss returns
    The checkpoint is restored and normal execution resumes
  Mutlu et al., "Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors," HPCA 2003.
Runahead Example
  Perfect caches: Compute, Load 1 Hit, Compute, Load 2 Hit, Compute
  Runahead: Compute, Load 1 Miss, Runahead (Load 2 Miss initiated, so Miss 2 overlaps with Miss 1), Load 1 Hit, Load 2 Hit, Compute
  (Figure: the runahead timeline: the speculatively executed instructions during Miss 1 generate Miss 2 early, saving cycles once normal execution resumes with Load 1 Hit and Load 2 Hit.)

Address-Value Delta (AVD) Prediction
  If the predictor returns a stable (confident) AVD for a load, the value of the load is predicted:
    Predicted Value = Effective Address - Predicted AVD
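As a hypothetical illustration (numbers invented for clarity): if a pointer load at effective address 0x1000 returns the value 0x0F00 and a later instance at address 0x2000 returns 0x1F00, the address-value delta is stably 0x100; a subsequent instance of that load at address A can then be predicted to return A - 0x100, so instructions dependent on the L2-missing load can be pre-executed in runahead mode instead of being dropped.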
(Figure: AVD prediction results across pointer-intensive benchmarks, with improvement labels of 14.3% and 15.5%.)

Prefetching
How a HW Prefetcher Fits in the Memory System

Prefetching: The Four Questions
  What
    What addresses to prefetch
  When
    When to initiate a prefetch request
  Where
    Where to place the prefetched data
  How
    Software, hardware, execution-based, cooperative
(Software prefetching: there are different prefetch instructions/hints for different cache levels.)
Cache-Block Address Based Stride Prefetching
  A table records, per entry: an address tag, the detected stride, and control/confidence bits for the last block address seen
  Can detect patterns of the form A, A+N, A+2N, A+3N, …

Stream Buffers (Jouppi, ISCA 1990)
  Each stream buffer holds one stream of sequentially prefetched cache lines (a FIFO)
  On a load miss, check the head of all stream buffers for an address match
    if hit, pop the entry from the FIFO and update the cache with the data
    if not, allocate a new stream buffer to the new miss address (may have to recycle a stream buffer following an LRU policy)
  (Figure: several FIFO stream buffers sit between the memory interface and the data cache.)
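Returning to the stride prefetcher above, a minimal detector sketch (the table size, PC-based indexing, and confidence threshold are illustrative assumptions):

  #include <stdint.h>

  #define TABLE_ENTRIES  64
  #define CONF_THRESHOLD 2

  struct stride_entry {
      uint64_t last_block;   /* last cache-block address seen        */
      int64_t  stride;       /* last observed stride (in blocks)     */
      int      confidence;   /* how many times the stride repeated   */
  };

  static struct stride_entry table[TABLE_ENTRIES];

  /* On each demand block address (indexed by the load PC), update the detector;
   * returns the block address to prefetch, or 0 if no confident stride exists yet. */
  uint64_t stride_access(uint64_t block_addr, uint64_t pc)
  {
      struct stride_entry *e = &table[pc % TABLE_ENTRIES];
      int64_t stride = (int64_t)(block_addr - e->last_block);

      if (stride == e->stride && stride != 0)
          e->confidence++;                 /* pattern A, A+N, A+2N, ... continues */
      else {
          e->stride = stride;
          e->confidence = 0;
      }
      e->last_block = block_addr;

      if (e->confidence >= CONF_THRESHOLD)
          return block_addr + (uint64_t)e->stride;   /* prefetch the next block */
      return 0;
  }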
Feedback-Directed Prefetcher Throttling (I)
  (Figure: performance of a "middle-of-the-road" vs. a very aggressive prefetcher configuration across SPEC benchmarks, with labeled differences of 29% and 48%.)
  Idea: dynamically adjust prefetcher aggressiveness, and change the location prefetches are inserted in the cache, based on past performance
  (Table: based on measured prefetch accuracy (high/medium/low) and whether prefetches are polluting, late, or not late, the aggressiveness is increased, decreased, or left unchanged.)
  Srinath et al., "Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers," HPCA 2007.
Feedback-Directed Prefetcher Throttling (II)
  (Figure: results of feedback-directed throttling, with labeled improvements of 11% and 13%.)
  Srinath et al., "Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers," HPCA 2007.

How to Prefetch More Irregular Access Patterns?
  Regular patterns: stride and stream prefetchers do well
  More irregular access patterns
    Indirect array accesses
    Linked data structures
    Multiple regular strides (1,2,3,1,2,3,1,2,3,…)
    Random patterns?
    Generalized prefetcher for all patterns?
  Correlation-based prefetchers
  Content-directed prefetchers
  Precomputation- or execution-based prefetchers
Required Reading
  Onur Mutlu, Justin Meza, and Lavanya Subramanian,
  "The Main Memory System: Challenges and Opportunities"
  Invited Article in Communications of the Korean Institute of Information Scientists and Engineers (KIISE), 2015.
  http://users.ece.cmu.edu/~omutlu/pub/main-memory-system_kiise15.pdf

Prefetching
Review: Outline of Prefetching Lecture(s)
  Why prefetch? Why could/does it work?
  The four questions
    What (to prefetch), when, where, how
  Software prefetching
  Hardware prefetching algorithms
  Execution-based prefetching
  Prefetching performance

Review: How to Prefetch More Irregular Access Patterns?
  Regular patterns: stride and stream prefetchers do well
  More irregular access patterns
    Indirect array accesses
    Linked data structures
    Multiple regular strides (1,2,3,1,2,3,1,2,3,…)
    Random patterns?
    Generalized prefetcher for all patterns?
Address Correlation Based Prefetching (I)
  Consider the following history of cache block addresses:
    A, B, C, D, C, E, A, C, F, F, E, A, A, B, C, D, E, A, B, C, D, C
  After referencing a particular address (say A or E), are some addresses more likely to be referenced next?

Address Correlation Based Prefetching (II)
  A correlation table, indexed by cache block address (tag), stores up to N prefetch candidates per block, each with a confidence value
  Also called "Markov prefetchers"
  Joseph and Grunwald, "Prefetching using Markov Predictors," ISCA 1997.
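A minimal correlation-table sketch in the spirit of such a Markov prefetcher (two candidates per entry; the replacement and confidence handling are simplified assumptions):

  #include <stdint.h>
  #include <string.h>

  #define ENTRIES    1024
  #define CANDIDATES 2

  struct corr_entry {
      uint64_t tag;                        /* miss address this entry tracks      */
      uint64_t next[CANDIDATES];           /* likely successor addresses          */
      uint16_t count[CANDIDATES];          /* how often each successor followed   */
  };

  static struct corr_entry ctab[ENTRIES];
  static uint64_t last_miss;

  /* On each cache miss: record that 'addr' followed 'last_miss', then return
   * the most likely successor of 'addr' as a prefetch candidate (0 if none). */
  uint64_t markov_miss(uint64_t addr)
  {
      struct corr_entry *prev = &ctab[last_miss % ENTRIES];
      if (prev->tag == last_miss) {
          int slot = 0;
          for (int i = 0; i < CANDIDATES; i++)
              if (prev->next[i] == addr || prev->next[i] == 0) { slot = i; break; }
          prev->next[slot] = addr;
          prev->count[slot]++;
      } else {
          memset(prev, 0, sizeof *prev);
          prev->tag = last_miss;
          prev->next[0] = addr;
          prev->count[0] = 1;
      }
      last_miss = addr;

      struct corr_entry *cur = &ctab[addr % ENTRIES];
      if (cur->tag != addr)
          return 0;
      int best = (cur->count[1] > cur->count[0]) ? 1 : 0;   /* highest confidence */
      return cur->next[best];
  }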
(Figure: content-directed prefetching: the upper bits [31:20] of each value in a fetched cache block (e.g., 0x80011100) are compared against the [31:20] bits of the block's virtual address; values that match are predicted to be pointers, and prefetches are generated for them.)

Execution-based Prefetchers
  A pre-executed program piece can be considered a (speculative) "thread"
Review: Runahead Execution (Mutlu et al., HPCA 2003)
  How?
Multiprocessors
Coherence and consistency
Interconnection networks
Multi-core issues (e.g., heterogeneous multi-core)
The Main Memory System

Major Trends Affecting Main Memory (I)
  Need for main memory capacity, bandwidth, QoS increasing

Most DRAM Modules Are At Risk (RowHammer)
  A company: errors in 37 of 43 modules; B company: 45 of 54; C company: 28 of 32 (up to ~10^7, ~10^6, and ~10^5 errors respectively)
  (Figure: error counts plotted against each module's "First Appearance" date.)
  All modules from 2012-2013 are vulnerable
  Kim+, "Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors," ISCA 2014.
  http://users.ece.cmu.edu/~omutlu/pub/dram-row-hammer_isca14.pdf
  http://googleprojectzero.blogspot.com/2015/03/exploiting-dram-rowhammer-bug-to-gain.html
Solutions (to memory scaling) require software/hardware/device cooperation
Trends: Problems with DRAM as Main Memory
  Need for main memory capacity increasing
  DRAM capacity hard to scale

Solutions to the DRAM Scaling Problem
  Two potential solutions
    Tolerate DRAM (by taking a fresh look at it)
    Enable emerging memory technologies to eliminate/minimize DRAM
Solution 1: Tolerate DRAM
  System-DRAM co-design
  Novel DRAM architectures, interfaces, functions
  Better waste management (efficient utilization)
  Key issues to tackle
    Reduce energy
    Enable reliability at low cost
    Improve bandwidth and latency

  Kim+, "A Case for Exploiting Subarray-Level Parallelism in DRAM," ISCA 2012.
  Lee+, "Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture," HPCA 2013.
  Liu+, "An Experimental Study of Data Retention Behavior in Modern DRAM Devices," ISCA 2013.
  Seshadri+, "RowClone: Fast and Efficient In-DRAM Copy and Initialization of Bulk Data," MICRO 2013.
  Pekhimenko+, "Linearly Compressed Pages: A Main Memory Compression Framework," MICRO 2013.
  Chang+, "Improving DRAM Performance by Parallelizing Refreshes with Accesses," HPCA 2014.
  Khan+, "The Efficacy of Error Mitigation Techniques for DRAM Retention Failures: A Comparative Experimental Study," SIGMETRICS 2014.
  Luo+, "Characterizing Application Memory Error Vulnerability to Optimize Data Center Cost," DSN 2014.
  Kim+, "Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors," ISCA 2014.
  Lee+, "Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case," HPCA 2015.
  Qureshi+, "AVATAR: A Variable-Retention-Time (VRT) Aware Refresh for DRAM Systems," DSN 2015.
  Meza+, "Revisiting Memory Errors in Large-Scale Production Data Centers: Analysis and Modeling of New Trends from the Field," DSN 2015.
  Kim+, "Ramulator: A Fast and Extensible DRAM Simulator," IEEE CAL 2015.
  Seshadri+, "The Evicted-Address Filter: A Unified Mechanism to Address Both Cache Pollution and Thrashing," PACT 2012.
  Pekhimenko+, "Base-Delta-Immediate Compression: Practical Data Compression for On-Chip Caches," PACT 2012.
  Seshadri+, "The Dirty-Block Index," ISCA 2014.
  Pekhimenko+, "Exploiting Compressed Block Size as an Indicator of Future Reuse," HPCA 2015.
STT-MRAM
Inject current to change magnet polarity
Resistance determined by polarity
Memristors/RRAM/ReRAM
Inject current to change atomic structure
Resistance determined by atom distance
PCM is resistive memory: High resistance (0), Low resistance (1)
PCM cell can be switched between states reliably and quickly
How a PCM cell works
  (Figure: a PCM cell is a memory element in series with an access device. A large RESET current quenches the material into the amorphous state (high resistance, 10^6-10^7 ohms); a smaller SET current crystallizes it (low resistance, 10^3-10^4 ohms).)
  Can be denser than DRAM
    Can store multiple bits per cell due to the large resistance range
    Prototypes with 2 bits/cell in ISSCC'08, 4 bits/cell by 2012
  Non-volatile
    Retains data for >10 years at 85C
    No refresh needed, low idle power
  Photo Courtesy: Bipin Rajendran, IBM. Slide Courtesy: Moinuddin Qureshi, IBM.
PCM latency and bandwidth
  Read latency: 50ns (4x DRAM, 10^-3x NAND Flash)
  Write latency: 150ns (12x DRAM)
  Write bandwidth: 5-10 MB/s (0.1x DRAM, 1x NAND Flash)

PCM endurance and cell size
  Endurance
    Writes induce phase change at 650C
    Contacts degrade from thermal expansion/contraction
    ~10^8 writes per cell (10^-8x DRAM, 10^3x NAND Flash)
  Cell size
    9-12F^2 using BJT, single-level cells (1.5x DRAM, 2-3x NAND; will scale with feature size)
Phase Change Memory: Pros and Cons
  Pros over DRAM
    Better technology scaling
    Non-volatility
    Low idle power (no refresh)
  Cons
    Higher latencies: ~4-15x DRAM (especially write)
    Higher active energy: ~2-50x DRAM (especially write)
    Lower endurance (a cell dies after ~10^8 writes)
    Reliability issues (resistance drift)
  Challenges in enabling PCM as a DRAM replacement/helper:
    Mitigate PCM shortcomings
    Find the right way to place PCM in the system
    Ensure secure and fault-tolerant PCM operation

PCM-based Main Memory: Some Questions
  Where to place PCM in the memory hierarchy?
    Hybrid OS-controlled PCM-DRAM
    Hybrid OS-controlled PCM and hardware-controlled DRAM
    Pure PCM main memory
  How to mitigate the shortcomings of PCM?
  How to take advantage of (byte-addressable and fast) non-volatile main memory?

Hybrid PCM+DRAM [Qureshi+ ISCA'09, Dhiman+ DAC'09, Meza+ IEEE CAL'12]
  Mitigate PCM shortcomings, exploit PCM advantages
  How to partition/migrate data between PCM and DRAM
  Design of the cache hierarchy, memory controllers, OS
  Design of PCM/DRAM chips and modules
    Rethink the design of PCM/DRAM with the new requirements
Aside: STT-MRAM: Pros and Cons
  Pros over DRAM
    Better technology scaling
    Non-volatility
    Low idle power (no refresh)
  Cons
    Higher write latency
    Higher write energy
    Reliability?

An Initial Study: Replace DRAM with PCM
  Lee, Ipek, Mutlu, Burger, "Architecting Phase Change Memory as a Scalable DRAM Alternative," ISCA 2009.
    Surveyed prototypes from 2003-2008 (e.g., IEDM, VLSI, ISSCC)
    Derived "average" PCM parameters for F=90nm
Results: Naïve Replacement of DRAM with PCM
  Replace DRAM with PCM in a 4-core, 4MB L2 system
  PCM organized the same as DRAM: row buffers, banks, peripherals
  1.6x delay, 2.2x energy, 500-hour average lifetime

Architecting PCM to Mitigate Shortcomings
  Idea 1: Use multiple narrow row buffers in each PCM chip (unlike DRAM)
    Reduces array reads/writes → better endurance, latency, energy
  Idea 2: Write into the array at cache-block or word granularity
    Reduces unnecessary wear
One Option: DRAM as a Cache for PCM
  PCM is main memory; DRAM caches memory rows/blocks
    Benefits: reduced latency on a DRAM cache hit; write filtering
  Memory controller hardware manages the DRAM cache
    Benefit: eliminates system software overhead
  Three issues:
    What data should be placed in DRAM versus kept in PCM?
    What is the granularity of data movement?
    How to design a huge (DRAM) cache at low cost?
  Two solutions:
    Locality-aware data placement [Yoon+, ICCD 2012]
    Cheap tag stores and dynamic granularity [Meza+, IEEE CAL 2012]

DRAM vs. PCM: An Observation
  Row buffers are the same in DRAM and PCM
    Row buffer hit latency is the same in DRAM and PCM
    Row buffer miss latency is small in DRAM, large in PCM
  (Figure: the CPU drives a DRAM controller and a PCM controller; each memory consists of banks fronted by row buffers: a row hit takes N ns in both, but a row miss is fast in DRAM and slow in PCM.)
  Accessing the row buffer in PCM is fast; what incurs high latency is the PCM array access → avoid this
(Figure: results of row-buffer-locality-aware data placement across server and cloud workloads and their average; memory energy-efficiency and fairness also improve correspondingly.)
Yoon et al., "Row Buffer Locality-Aware Data Placement in Hybrid Memories," ICCD 2012 Best Paper Award.
(Figure: normalized weighted speedup, maximum slowdown, and performance per watt: the hybrid design achieves 31% better performance than all-PCM and comes within 29% of all-DRAM performance.)

Byte-addressable non-volatile main memory also enables
  More robust system design, e.g., reducing data loss
  Processing tightly-coupled with memory, e.g., enabling efficient search and filtering
Coordinated Memory and Storage with NVM (I)
  The traditional two-level storage model is a bottleneck with NVM
    Volatile data in memory: accessed through a load/store interface (processor and caches, address translation / virtual memory)
    Persistent data in storage: accessed through the operating system and file system
  (Figure: today the OS and file system sit between the processor and persistent storage, adding work to locate, transfer, and translate data.)

Coordinated Memory and Storage with NVM (II)
  Goal: Unify memory and storage management in a single unit to eliminate wasted work to locate, transfer, and translate data
  (Figure: the processor and caches issue loads/stores to a Persistent Memory Manager, with feedback between them.)

The Persistent Memory Manager (PMM)
  Exposes a load/store interface to access persistent data
    Applications can directly access persistent memory: no conversion, translation, or location overhead for persistent data
  (Figure: persistent objects; the evaluation labels speedups of ~5x, ~5x, ~16x, and ~24x.)
Enabling and Exploiting NVM: Issues
  Many issues and ideas from the technology layer to the algorithms layer
  Enabling NVM and hybrid memory
    How to tolerate errors?
    How to enable secure operation?
    How to tolerate performance and power shortcomings?
    How to minimize cost?
  Exploiting emerging technologies
    How to exploit non-volatility?
    How to minimize energy consumption?
    How to exploit NVM on chip?
  (Figure: the transformation hierarchy: Problems, Algorithms, Programs/Languages, User, Runtime System (VM, OS, MM), ISA, Microarchitecture, Logic, Devices.)

Three Principles for (Memory) Scaling
  Better cooperation between devices and the system
    Expose more information about devices to upper layers
    More flexible interfaces
  Better-than-worst-case design
    Do not optimize for the worst case
    The worst case should not determine the common case
  Heterogeneity in design (specialization, asymmetry)
    Enables a more efficient design (no one size fits all)
  These principles are related and sometimes coupled
Security challenges of emerging technologies (and potential solution directions)
  2. Non-volatility: data persists in memory after powerdown
    → Easy retrieval of privileged or private information
    Solution directions: efficient encryption/decryption of whole main memory; hybrid memory system management
  3. Multiple bits per cell: information leakage (via side channel)
    Solution direction: system design to hide side channel information
Multiprocessors
Coherence and consistency
Interconnection networks
Multi-core issues (e.g., heterogeneous multi-core)
Recommended
Mike Flynn, “Very High-Speed Computing Systems,” Proc. of IEEE,
1966
Hill, Jouppi, Sohi, “Multiprocessors and Multicomputers,” pp. 551-
560 in Readings in Computer Architecture.
Hill, Jouppi, Sohi, “Dataflow and Multithreading,” pp. 309-314 in
Readings in Computer Architecture.
Why parallel computers?
  Power: (4N units at frequency F/4) consume less power than (N units at frequency F). Why?
  Improve cost efficiency and scalability, reduce complexity
    Harder to design a single unit that performs as well as N simpler units
  Improve dependability: redundant execution in space

Task-Level Parallelism
  Different "tasks/threads" can be executed in parallel
    Multithreading
    Multiprocessing (multi-core)
Synchronization and communication
  How to synchronize (efficiently) between tasks
  How to communicate between tasks
    Locks, barriers, pipeline stages, condition variables, semaphores, atomic operations, …

Fine-grained multithreading
  Switch threads cycle by cycle
  Thornton, "CDC 6600: Design of a Computer," 1970.
  Burton Smith, "A pipelined, shared resource MIMD computer," ICPP 1978.
Superlinear Speedup
  Can speedup be greater than P with P processing elements?
  Unfair comparisons
    Comparing the best parallel algorithm to a wimpy serial algorithm is unfair
  Cache/memory effects
    More processors → more cache or memory → fewer misses in cache/memory
Efficiency
  E = (Time with 1 processor) / (Number of processors x Time with P processors)
  E = U/R (utilization divided by redundancy)
Amdahl, "Validity of the single processor approach to achieving large scale computing capabilities," AFIPS 1967.
(Figure: speedup as a function of the parallel fraction f for N = 1000 processors; speedup remains small unless f is very close to 1.)

Bottlenecks in the parallel portion
  Load imbalance overhead (imperfect parallelization)
  Resource sharing overhead (contention among N processors)
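The shape of that curve follows directly from Amdahl's law; with parallel fraction f and N processors:

\[
\text{Speedup}(f, N) \;=\; \frac{1}{(1 - f) + \dfrac{f}{N}},
\qquad
\lim_{N \to \infty} \text{Speedup}(f, N) \;=\; \frac{1}{1 - f}.
\]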
(Figure: a loop whose body is split into pipeline stages A (Compute1), B (Compute2), C, executed by different threads.)
Difficulty is in
  Getting parallel programs to work correctly
  Optimizing performance in the presence of bottlenecks
Within a single processor
  Two operations can be executed and retired in any order if they have no dependency
  Advantage: lots of parallelism → high performance
  Disadvantage: the order can change across runs of the same program → very hard to debug

Across processors
  Multiple processors execute memory operations concurrently
  How does the memory see the order of operations from all processors?
  In other words, what is the ordering of operations across different processors?
Why Does This Even Matter?
  Ease of debugging
    It is nice for executions done at different times to have the same order of operations → repeatability
  Correctness
    Can we have incorrect execution if the order of memory operations is different from the point of view of different processors?

When Could Order Affect Correctness?
  When protecting shared data
The Problem
The two processors did NOT see the same order of
operations to memory
Disadvantage
More burden on the programmer or software (need to get the
“fences” correct)
Caching in Multiprocessors
  Caching not only complicates the ordering of all operations…
  A memory location can be present in multiple caches
    Prevents the effect of a store or a load from being seen by other processors → makes it difficult for all processors to see the same global order of (all) memory operations

Cache Coherence
(Figure: the cache coherence problem: processors P1 and P2 connect through an interconnection network to main memory; location x initially holds 1000. P2 executes ld r2, x and caches the value; when P1 later updates x, P2's cached copy becomes stale unless something keeps the copies coherent.)

Hardware cache coherence
  Simplifies software's job
  One idea: invalidate all other copies of block A when a processor writes to it
Update vs. invalidate tradeoffs
  Update
    + Avoids the cost of invalidate-reacquire (broadcast update pattern)
    - If data is rewritten without intervening reads by other cores, updates were useless
    - A write-through cache policy is required → the bus becomes a bottleneck
  Invalidate
    + After the invalidation broadcast, the writing core has exclusive access rights
    + Only cores that keep reading after each write retain a copy
    - If write contention is high, leads to ping-ponging (rapid mutual invalidation-reacquire)

Snoopy bus vs. directory
  Snoopy bus: processors observe other processors' actions
    E.g.: P1 makes a "read-exclusive" request for A on the bus; P0 sees this and invalidates its own copy of A
  Directory [Censier and Feautrier, IEEE ToC 1978]
    Single point of serialization per block, distributed among nodes
    Processors make explicit requests for blocks
    The directory tracks which caches have each block
    The directory coordinates invalidation and updates
      E.g.: P1 asks the directory for an exclusive copy; the directory asks P0 to invalidate, waits for the ACK, then responds to P1
Directory Based Coherence Example (I)
Multiprocessors
Coherence and consistency
Interconnection networks
Multi-core issues (e.g., heterogeneous multi-core)
Referenced Readings
Papamarcos and Patel, "A low-overhead coherence solution for multiprocessors with private cache memories," ISCA 1984.
Recommended
Censier and Feautrier, "A new solution to coherence problems in multicache systems," IEEE Trans. Computers, 1978.
Goodman, "Using cache memory to reduce processor-memory traffic," ISCA 1983.
Laudon and Lenoski, "The SGI Origin: a ccNUMA highly scalable server," ISCA 1997.
Martin et al., "Token coherence: decoupling performance and correctness," ISCA 2003.
Baer and Wang, "On the inclusion properties for multi-level cache hierarchies," ISCA 1988.
[Figure: MSI/MESI state transition diagram, with transitions labeled by processor and bus events such as PrRd(S)/BusRd, PrRd(S')/BusRd, PrWr/BusRdX, PrWr/--, BusRd/Flush; adapted from Culler and Singh, 1996]
MESI State Machine from Lab 8
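The Lab 8 code itself is not reproduced here; the following is a minimal, hypothetical software sketch of MESI transitions for a single cache block, driven by local processor events and snooped bus events.

/* Hypothetical MESI transition function for one cache block.
 * Events: PrRd/PrWr from the local processor, BusRd/BusRdX
 * observed on the snoopy bus from other processors.            */
#include <stdio.h>

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;
typedef enum { PR_RD, PR_WR, BUS_RD, BUS_RDX } event_t;

mesi_t mesi_next(mesi_t s, event_t e, int other_sharers) {
    switch (s) {
    case INVALID:
        if (e == PR_RD)   return other_sharers ? SHARED : EXCLUSIVE; /* BusRd  */
        if (e == PR_WR)   return MODIFIED;                           /* BusRdX */
        return INVALID;
    case SHARED:
        if (e == PR_WR)   return MODIFIED;  /* upgrade: invalidate other copies */
        if (e == BUS_RDX) return INVALID;   /* another core wants to write      */
        return SHARED;
    case EXCLUSIVE:
        if (e == PR_WR)   return MODIFIED;  /* silent upgrade, no bus traffic   */
        if (e == BUS_RD)  return SHARED;    /* another core reads               */
        if (e == BUS_RDX) return INVALID;
        return EXCLUSIVE;
    case MODIFIED:
        if (e == BUS_RD)  return SHARED;    /* flush dirty data, then share     */
        if (e == BUS_RDX) return INVALID;   /* flush dirty data, then invalidate */
        return MODIFIED;
    }
    return INVALID;
}

int main(void) {
    mesi_t s = INVALID;
    s = mesi_next(s, PR_RD, 0);   /* -> EXCLUSIVE */
    s = mesi_next(s, PR_WR, 0);   /* -> MODIFIED  */
    s = mesi_next(s, BUS_RD, 1);  /* -> SHARED    */
    printf("final state = %d\n", s);
    return 0;
}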
Tradeoffs in Sophisticated Cache Coherence Protocols
The protocol can be optimized with more states and prediction mechanisms to
+ Reduce unnecessary invalidates and transfers of blocks

Revisiting Two Cache Coherence Methods
How do we ensure that the proper caches are updated?
Snoopy Bus [Goodman ISCA 1983, Papamarcos+ ISCA 1984]
Bus-based, single point of serialization for all memory requests
Directory
- Adds indirection to miss latency (critical path): request → directory → memory
- Requires extra storage space to track sharer sets
  (Can be approximate: false positives are OK for correctness)
- Protocols and race conditions are more complex (for high performance)
+ Does not require broadcast to all caches
+ Exactly as scalable as the interconnect and directory storage (much more scalable than a bus)
An example mechanism:
For each cache block in memory, store P+1 bits in the directory
One bit for each cache, indicating whether the block is in that cache
Exclusive bit: indicates that the cache holding the block has the only copy and can update it without notifying others
On a read: set the requesting cache's bit and arrange the supply of data
On a write: invalidate all caches that have the block and reset their bits
Also have an "exclusive bit" associated with each block in each cache
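A small sketch of the P+1-bit directory entry described above, with P assumed to be 8 and the invalidation messages replaced by prints; the names are illustrative.

/* Illustrative directory entry: one presence bit per cache (P caches)
 * plus an exclusive bit, as in the mechanism described above.         */
#include <stdint.h>
#include <stdio.h>

#define P 8   /* number of caches (assumed) */

typedef struct {
    uint32_t present;   /* bit i set => cache i has the block */
    int      exclusive; /* set => exactly one cache may write */
} dir_entry_t;

/* On a read by cache c: record the sharer and arrange the supply of data. */
void dir_read(dir_entry_t *e, int c) {
    e->exclusive = 0;          /* block becomes (possibly) shared */
    e->present  |= 1u << c;    /* set this cache's presence bit   */
}

/* On a write by cache c: invalidate all other copies, grant exclusivity. */
void dir_write(dir_entry_t *e, int c) {
    for (int i = 0; i < P; i++)
        if (i != c && (e->present & (1u << i)))
            printf("send invalidate to cache %d\n", i);  /* stand-in for a message */
    e->present   = 1u << c;    /* only the writer keeps a copy */
    e->exclusive = 1;
}

int main(void) {
    dir_entry_t e = {0, 0};
    dir_read(&e, 0);
    dir_read(&e, 3);
    dir_write(&e, 1);          /* invalidates caches 0 and 3 */
    printf("present=0x%x exclusive=%d\n", e.present, e.exclusive);
    return 0;
}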
[Figure: directory protocol message flows among P0, Home, Owner, and P1, with messages such as 2. Invl, 3b. DatEx, 5a. Rev, 5b. DatEx]
Fairness: which requestor should be preferred in a conflict? Interconnect delivery order and distance both matter
Ping-ponging is a higher-level issue, with solutions like combining trees (for locks/barriers) and better shared-data-structure design
Heterogeneity
How can we scale the system to thousands of nodes?

Can we get the best of snooping and directory protocols?
E.g., token coherence [Martin+, ISCA 2003]
Token coherence: key insight
Step #1: Enforce the coherence invariant directly using tokens
Fixed number of tokens per block
One token to read, all tokens to write
Step #2:
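A rough sketch of the token-counting invariant (one token to read, all tokens to write); it omits token coherence's persistent-request and token-recreation machinery, and all names are illustrative.

/* Token coherence invariant, per block: T tokens exist in total;
 * a cache may read if it holds >= 1 token and may write only if it
 * holds all T tokens (so no other copy can be readable).             */
#include <stdio.h>

#define T 4   /* total tokens per block = number of caches (assumed) */

int tokens[T];   /* tokens[i] = tokens currently held by cache i */

int can_read(int c)  { return tokens[c] >= 1; }
int can_write(int c) { return tokens[c] == T; }

/* Move tokens from cache 'from' to cache 'to' (stand-in for a message). */
void send_tokens(int from, int to, int n) {
    tokens[from] -= n;
    tokens[to]   += n;
}

int main(void) {
    tokens[0] = T;                               /* cache 0 initially owns the block */
    send_tokens(0, 1, 1);                        /* cache 1 obtains a read copy      */
    printf("c1 read ok:  %d\n", can_read(1));    /* 1                                */
    printf("c0 write ok: %d\n", can_write(0));   /* 0: must collect all tokens first */
    send_tokens(1, 0, 1);                        /* reclaim the token                */
    printf("c0 write ok: %d\n", can_write(0));   /* 1                                */
    return 0;
}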
Onur Mutlu, "Asymmetry Everywhere (with Automatic Resource Management)," CRA Workshop on Advancing Computer Architecture Research: Popular Parallel Programming, San Diego, CA, February 2010. Position paper.

Vivek Seshadri, Carnegie Mellon University, Spring 2015, 4/13/2015
Problems with the DRAM cell capacitor:
1. Small – cannot drive circuits
2. Reading destroys the state
[Figure: inverter with top and bottom terminals, the building block of the sense amplifier]
[Figure: sense amplifier operation — bitline precharged to ½VDD, perturbed to ½VDD + δ by charge sharing, then amplified to full VDD or 0 once the sense amplifier is enabled (VT > VB)]
[Figures: DRAM organization — a DRAM chip is built from tiles/subarrays; each subarray consists of cell arrays with row decoders/drivers and an array of sense amplifiers (an 8 Kb row buffer); subarrays connect through a shared internal bus to the bank I/O (64 bits). A DRAM bank is accessed with a row address and a column address in three steps: 1. ACTIVATE the row, 2. READ/WRITE the column, 3. PRECHARGE.]
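A simplified sketch of how a memory controller might sequence these three commands under an open-row policy; the 8 KB row size comes from the figures above, while the address split and policy details are assumptions.

/* Illustrative bank-level command sequencing for an open-row policy. */
#include <stdio.h>
#include <stdint.h>

#define ROW_SIZE 8192            /* 8 KB row / row buffer, as in the figures */

typedef struct { int open_row; } bank_t;   /* -1 = row buffer empty/precharged */

void access_bank(bank_t *b, uint32_t addr) {
    int row = addr / ROW_SIZE;
    int col = addr % ROW_SIZE;

    if (b->open_row == row) {
        printf("row-buffer hit:  READ/WRITE col %d\n", col);      /* step 2 only */
    } else {
        if (b->open_row != -1)
            printf("row-buffer conflict: PRECHARGE row %d\n", b->open_row); /* step 3 */
        printf("ACTIVATE row %d\n", row);                          /* step 1 */
        printf("READ/WRITE col %d\n", col);                        /* step 2 */
        b->open_row = row;
    }
}

int main(void) {
    bank_t bank = { -1 };
    access_bank(&bank, 0);            /* empty row buffer: ACTIVATE + READ */
    access_bank(&bank, 64);           /* same row: row-buffer hit          */
    access_bank(&bank, 5 * ROW_SIZE); /* different row: PRECHARGE first    */
    return 0;
}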
How is it built?
How does it operate?
What are the trade-offs?
Can we use DRAM for more than just storage?
In-DRAM copying
In-DRAM bitwise operations

RowClone: Fast and Energy-Efficient In-DRAM Bulk Data Copy and Initialization
Vivek Seshadri, Y. Kim, C. Fallin, D. Lee, R. Ausavarungnirun, G. Pekhimenko, Y. Luo, O. Mutlu, P. B. Gibbons, M. A. Kozuch, T. C. Mowry
Bulk Data Copy and Initialization
Examples: zeroing pages (e.g., for security), checkpointing
Today, the source data crosses the memory channel and traverses the memory controller and cache hierarchy to reach the destination
Problems: limited memory channel bandwidth, high energy, high latency, interference with other requests
Goal: reduce unnecessary data movement
[Figures: core, cache, memory controller (MC), memory channel, and memory, with the src data crossing the channel]
RowClone copies a row entirely inside DRAM by activating the source row and then the destination row back-to-back, using the row buffer (the sense amplifiers, driven from ½VDD to ½VDD + δ and then to full rail) as the intermediary: 11X lower latency and 74X lower energy than copying the data over the memory channel
[Charts: results compared to baseline and fraction of memory traffic due to copy/initialization, across workloads bootup, compile, forkbench, mcached, mysql, shell]
Can we use DRAM for more than just storage?
In-DRAM copying
In-DRAM bitwise operations

In-DRAM bitwise AND/OR: simultaneously activating three rows A, B, and C makes each bitline settle to the majority value
MAJ(A, B, C) = AB + BC + AC = C(A + B) + ~C(AB)
so setting control row C to all 1s computes A OR B, and setting C to all 0s computes A AND B
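A small host-side check (plain C, not DRAM) of the majority identity above.

/* Verifies that bitwise MAJ(A,B,C) = AB | BC | AC equals A & B when
 * C is all zeros and A | B when C is all ones, mirroring in-DRAM
 * triple-row activation.                                             */
#include <stdio.h>
#include <stdint.h>
#include <assert.h>

uint8_t maj(uint8_t a, uint8_t b, uint8_t c) {
    return (a & b) | (b & c) | (a & c);
}

int main(void) {
    uint8_t a = 0xC5, b = 0x3A;
    assert(maj(a, b, 0x00) == (a & b));   /* C = all zeros -> AND */
    assert(maj(a, b, 0xFF) == (a | b));   /* C = all ones  -> OR  */
    printf("AND = 0x%02X, OR = 0x%02X\n", a & b, a | b);
    return 0;
}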
[Example/figures: bitmap-index query — one bitmap per age range (age < 18, 18 < age < 25, 25 < age < 60, age > 60); the query ORs many bitmaps in DRAM and, as a final step (5), RowClones the result into row C. Chart: performance relative to baseline vs. number of OR'ed bins (3 to 128), across cache sizes from 8KB to 32MB.]
Lavanya Subramanian, Carnegie Mellon University, Spring 2015, 4/15/2015
Problems in systems with shared caches and shared main memory:
1. High application slowdowns due to shared resource interference
2. An application's performance depends on which application it is running with
[Charts: slowdowns of leslie3d (core 0) running with gcc (core 1) vs. running with mcf (core 1)]
Need to meet the performance requirements of critical jobs in the presence of shared resources
Key Observation 1: For memory-bound applications, performance is proportional to the rate at which their memory requests are served; the request service rate in the shared system is easy to measure
[Chart: performance (shared) vs. normalized request service rate (shared), showing a near-linear relationship]
Key Observation 2: An application's slowdown can be expressed using its request service rate when run alone (RSRAlone) and its request service rate when run with other applications (RSRShared)
[Figures: request buffer state and service order when an application 1. runs alone vs. 2. runs with another application]

Memory Interference-induced Slowdown Estimation (MISE) model for memory-bound applications:
    Slowdown = RSRAlone / RSRShared

MISE model for non-memory-bound applications (α = fraction of execution time spent in memory phases; the remaining 1 − α compute phase does not slow down under memory interference):
    Slowdown = (1 − α) + α · (RSRAlone / RSRShared)
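A minimal sketch of the reconstructed MISE slowdown computation; the RSR values and the memory-phase fraction alpha would come from hardware counters, and the numbers below are illustrative only.

/* MISE slowdown estimate.  alpha = fraction of time in memory phases
 * (alpha ~ 1 for memory-bound applications, which reduces the formula
 * to the simple RSRAlone / RSRShared ratio).                           */
#include <stdio.h>

double mise_slowdown(double rsr_alone, double rsr_shared, double alpha) {
    return (1.0 - alpha) + alpha * (rsr_alone / rsr_shared);
}

int main(void) {
    /* Illustrative numbers only: requests served per kilo-cycle. */
    double alone = 8.0, shared = 2.0;
    printf("memory-bound:     %.2fx\n", mise_slowdown(alone, shared, 1.0));  /* 4.00x */
    printf("50%% memory phase: %.2fx\n", mise_slowdown(alone, shared, 0.5)); /* 2.50x */
    return 0;
}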
Advantage 2 of MISE over STFM: STFM does not take the compute phase into account for non-memory-bound applications; MISE accounts for the compute phase, giving better accuracy

Evaluation workloads: SPEC CPU2006 applications, 300 multiprogrammed workloads
[Charts: actual vs. estimated slowdowns over time (million cycles) for cactusADM, GemsFDTD, soplex, wrf, calculix, and povray, comparing the Actual slowdown with STFM and MISE estimates]
Average error of MISE: 8.2%; average error of STFM: 29.4% (across 300 workloads)
2. Control Slowdown: Providing Soft Slowdown Guarantees
Methodology
Each of 25 applications in turn is considered the QoS-critical application
Run with 12 sets of co-runners of different memory intensities; 300 multiprogrammed workloads in total
Each workload run with 10 slowdown bound values
Baseline memory scheduling mechanism: always prioritize the QoS-critical application (AlwaysPrioritize)

MISE is effective in
1. meeting the slowdown bound for the QoS-critical application
2. improving performance of the non-QoS-critical applications
[Charts: QoS-critical application slowdown and system performance for slowdown bounds of 10, 3.33, and 2, comparing AlwaysPrioritize with MISE-QoS-10/1 through MISE-QoS-10/9]

Fraction of cases:
                    Met     Not Met
QoS Bound Met       78.8%   2.1%
QoS Bound Not Met   2.2%    16.9%
Extending the observation to the shared cache: slowdown correlates with the ratio of cache access rates
    Slowdown ≈ Cache Access Rate Alone / Cache Access Rate Shared
[Charts and figures: cores sharing a cache and main memory; slowdown vs. cache access rate ratio for astar, lbm, and bzip2, showing a near-linear relationship]
[Chart: slowdown estimation error (%) for select applications — GemsFDTD, perlbench, omnetpp, calculix, soplex, NPBua, tonto, namd, sjeng, NPBft, povray, lbm, libq, leslie3d, NPBis, milc, dealII, gobmk, sphinx3, mcf, NPBbt, and the average]
Average error of ASM's slowdown estimates: 10%

Slowdown-aware cache allocation
Goal: Partition the shared cache among applications to mitigate contention
Previous way-partitioning schemes optimize for miss count
Problem: they are not aware of performance and slowdowns
ASM-Cache key idea: allocate each cache way to the application whose slowdown reduces the most
[Charts: fairness (lower is better) and performance vs. number of cores (4, 8, 16) for NoPart, UCP, and ASM-Cache way allocations]
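A hedged sketch of that greedy idea: repeatedly give the next way to the application whose estimated slowdown would drop the most. The est_slowdown() function below is a placeholder, not ASM's actual slowdown model.

/* Greedy way allocation: at each step, assign one way to the application
 * whose estimated slowdown decreases the most.                            */
#include <stdio.h>

#define APPS 4
#define WAYS 16

/* Placeholder model: slowdown shrinks as an application gets more ways. */
double est_slowdown(int app, int ways) {
    static const double sensitivity[APPS] = { 3.0, 2.0, 1.2, 0.4 };
    return 1.0 + sensitivity[app] / (1.0 + ways);
}

int main(void) {
    int alloc[APPS] = { 0 };
    for (int w = 0; w < WAYS; w++) {
        int best = 0;
        double best_gain = -1.0;
        for (int a = 0; a < APPS; a++) {
            double gain = est_slowdown(a, alloc[a]) - est_slowdown(a, alloc[a] + 1);
            if (gain > best_gain) { best_gain = gain; best = a; }
        }
        alloc[best]++;    /* give the way to the app that benefits most */
    }
    for (int a = 0; a < APPS; a++)
        printf("app %d: %d ways\n", a, alloc[a]);
    return 0;
}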
ASM-Mem: cache-capacity-aware memory bandwidth allocation
[Charts: fairness (lower is better) and performance vs. number of cores (4, 8, 16) for FRFCFS, TCM, PARBS, and ASM-Mem]
Significant fairness benefits across different systems

Coordinated cache and memory management:
1. Employ ASM-Cache to partition cache capacity
2. Drive ASM-Mem with slowdowns from ASM-Cache
[Charts: fairness and performance of the coordinated scheme vs. FRFCFS+UCP]
18-447 Computer Architecture
Lecture 31: Predictable Performance
Lecture 32: Heterogeneous Systems

Multiprocessors
Coherence and consistency
In-memory computation and predictable performance
Multi-core issues (e.g., heterogeneous multi-core)
Interconnection networks
We Have Another Course for Collaboration
740 is the next course in sequence
Tentative time: Lectures MW 7:30-9:20pm (recitation T 7:30pm)
Content:
Lectures: more advanced, with a different perspective
Recitations: delving deeper into papers, advanced topics
Readings: many fundamental and research readings; will do many reviews
Project: more open-ended research project; proposal, milestones, final poster and presentation; done in groups of 1-3
Focus of the course is the project (and papers)
Exams: lighter and fewer
Homeworks: none

A Note on Testing Your Own Code
We provide the reference simulator to aid you
Do not expect it to be given, and do not rely on it much
In real life, there are no reference simulators
The architect designs the reference simulator
The architect verifies it
The architect tests it
The architect fixes it
The architect makes sure there are no bugs
The architect ensures the simulator matches the specification
Asymmetry Enables Customization

We Have Already Seen Examples Before (in 447)
CRAY-1 design: scalar + vector pipelines
Remember: Throughput vs. Fairness
Throughput-biased approach: prioritize less memory-intensive threads
Good for throughput, but the non-prioritized (memory-intensive) threads can starve: unfairness
Fairness-biased approach: take turns accessing memory
Does not starve threads, but reduces throughput
A single policy for all threads is insufficient

Remember: Achieving the Best of Both Worlds
For throughput: prioritize memory-non-intensive threads
For fairness: unfairness is caused by memory-intensive threads being prioritized over each other
• Shuffle thread ranking
Memory-intensive threads have different vulnerability to interference
• Shuffle asymmetrically
Amdahl's Law (serial bottleneck)
What we get: Speedup = 1 / ((1 - f) + f/N), where f is the parallelizable fraction of the program and N is the number of processors
Amdahl, "Validity of the single processor approach to achieving large scale computing capabilities," AFIPS 1967.
Maximum speedup limited by the serial portion: the serial bottleneck

Bottlenecks in the parallel portion
The parallel portion is usually not perfectly parallel
Synchronization overhead (e.g., updates to shared data)
Load imbalance overhead (imperfect parallelization)
Resource sharing overhead (contention among N processors)
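A quick numeric check of the formula, with illustrative values of f and N.

/* Amdahl's Law: Speedup(f, N) = 1 / ((1 - f) + f / N). */
#include <stdio.h>

double amdahl(double f, int n) {
    return 1.0 / ((1.0 - f) + f / n);
}

int main(void) {
    /* Even with 95% parallel code, speedup saturates near 20x. */
    printf("f=0.95, N=16:   %.2fx\n", amdahl(0.95, 16));     /* ~9.1x  */
    printf("f=0.95, N=1024: %.2fx\n", amdahl(0.95, 1024));   /* ~19.6x */
    return 0;
}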
Causes of serialized code sections
Sequential portions (Amdahl's "serial part")
Critical sections
Barriers
Limiter stages in pipelined programs
Serialized code sections reduce performance
[Example/chart: a database workload where threads serialize in the "Access Open Tables Cache" critical section before performing the operations in parallel; speedup vs. number of threads (0 to 32), with today's scaling limited by that critical section]
“Tile-Small” approach (many small cores; contrast with “Tile-Large”)
+ High throughput on the parallel part (16 units)
- Low performance on the serial part, single thread (1 unit); reduced single-thread performance compared to existing single-thread processors
ACMP Approach: provide one large core and many small cores
+ Accelerate the serial part using the large core (2 units)
+ Execute the parallel part on the small cores and the large core for high throughput (12+2 units)
[Figure: asymmetric chip multiprocessor with one large core and many small cores]
Baseline assumption: a small core takes an area budget of 1 and has performance of 1
Idea: dynamically identify the code portions that cause serialization and execute them on a large core
M. Aater Suleman, Onur Mutlu, Moinuddin K. Qureshi, and Yale N. Patt, "Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures," Proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2009.
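A software-level sketch of the idea (the real ACS mechanism uses CSCALL/CSRET instructions and a hardware critical-section request buffer, none of which appear here): small-core worker threads ship their critical-section work to a single "large core" server thread instead of executing it locally.

/* Sketch of ACS in plain pthreads: workers ("small cores") enqueue
 * critical-section requests; one server thread ("large core") drains
 * and executes them one at a time.                                     */
#include <pthread.h>
#include <stdio.h>

#define WORKERS 4
#define ITERS   1000

long shared_counter = 0;                 /* data protected by the critical section */
pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  q_cv   = PTHREAD_COND_INITIALIZER;
int pending = 0, done_workers = 0;

/* "Large core": drains critical-section requests. */
void *large_core(void *arg) {
    pthread_mutex_lock(&q_lock);
    while (done_workers < WORKERS || pending > 0) {
        while (pending > 0) { shared_counter++; pending--; }  /* the critical section */
        if (done_workers < WORKERS)
            pthread_cond_wait(&q_cv, &q_lock);
    }
    pthread_mutex_unlock(&q_lock);
    return NULL;
}

/* "Small core": instead of executing the critical section, enqueue a request. */
void *small_core(void *arg) {
    for (int i = 0; i < ITERS; i++) {
        pthread_mutex_lock(&q_lock);
        pending++;                        /* analogue of CSCALL: ship the work */
        pthread_cond_signal(&q_cv);
        pthread_mutex_unlock(&q_lock);
    }
    pthread_mutex_lock(&q_lock);
    done_workers++;
    pthread_cond_signal(&q_cv);
    pthread_mutex_unlock(&q_lock);
    return NULL;
}

int main(void) {
    pthread_t big, w[WORKERS];
    pthread_create(&big, NULL, large_core, NULL);
    for (int i = 0; i < WORKERS; i++) pthread_create(&w[i], NULL, small_core, NULL);
    for (int i = 0; i < WORKERS; i++) pthread_join(w[i], NULL);
    pthread_join(big, NULL);
    printf("counter = %ld (expected %d)\n", shared_counter, WORKERS * ITERS);
    return 0;
}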
Suleman et al., "Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures," ASPLOS 2009.
Evaluation methodology
Multi-core x86 simulator
1 large and 28 small cores
Aggressive stream prefetcher employed at each core
[Figure: chip with one large core and many small cores]
ACS Performance: Equal-Area Comparisons
Chip area = 32 small cores; SCMP = 32 small cores; ACMP = 1 large and 28 small cores
Number of threads = number of cores (or the best number of threads)
[Charts: speedup vs. chip area (in small cores, 0 to 32) for SCMP, ACMP, and ACS on
(a) ep, (b) is, (c) pagemine, (d) puzzle, (e) qsort, (f) tsp with coarse-grain locks, and
(g) sqlite, (h) iplookup, (i) oltp-1, (j) oltp-2, (k) specjbb, (l) webcache with fine-grain locks;
ACS gains come from accelerating sequential kernels and accelerating critical sections]
ACS Summary
Critical sections reduce performance and limit scalability
Accelerate critical sections by executing them on a powerful core
ACS reduces average execution time by:
34% compared to an equal-area SCMP
23% compared to an equal-area ACMP

18-447 Computer Architecture
Lecture 33: Interconnection Networks
Multiprocessors
Coherence and consistency
In-memory computation and predictable performance
Multi-core issues (e.g., heterogeneous multi-core)
Interconnection networks
Interconnection network
Point-to-Point
Every node connected to every other with direct/isolated links
+ Lowest contention
+ Potentially lowest latency
- Expensive: O(N^2) links
-- Not scalable
-- How to lay out on chip?

Crossbar
Every node connected to every other with a shared link for each destination
Enables concurrent transfers to non-conflicting destinations
Could be cost-effective for a small number of nodes
Used in core-to-cache-bank networks in
- IBM POWER5
- Sun Niagara I/II
Example crossbar implementation details:
4-stage pipeline: request, arbitration, selection, transmission
2-deep queue for each src/dest pair to hold data transfer requests
Bufferless and Buffered Crossbars
Buffered crossbar:
+ Simpler arbitration/scheduling
+ Efficient support for variable-size packets
- Requires N^2 buffers
[Figure: bufferless vs. buffered crossbar, with NI flow control at each input and an arbiter at each output]

Can We Get Lower Cost than a Crossbar?
Yet still have low contention compared to a bus?
Idea: Multistage networks
Multistage (Butterfly) network, built from 2-by-2 crossbars
A multistage network has more restrictions on feasible concurrent Tx-Rx pairs vs. a crossbar (some pairs conflict on internal links)
But more scalable than a crossbar in cost, e.g., O(N logN) for a Butterfly
[Figure: 8-node butterfly with binary-labeled inputs/outputs (010, 011, 100, 101, 110, 111) and a conflict example]
Packet switching: routing decisions are made per packet in each router
Route each packet individually (possibly via different paths)
If a link is free, any packet can use it
Ring
Single directional pathway
Simple topology and implementation
Reasonable performance if N and the performance needs (bandwidth & latency) are still moderately low
O(N) cost
N/2 average hops; latency depends on utilization
[Figure: unidirectional ring of 2x2 routers (R) connecting nodes 0, 1, ..., N-2, N-1]

Using multiple rings:
+ Reduces latency
+ Improves scalability
- Slightly more complex injection policy (need to select which ring to inject a packet into)
+ More scalable
+ Lower latency
- More complex
Mesh
Each node connected to 4 neighbors (N, E, S, W)
O(N) cost
Average latency: O(sqrt(N))
Easy to lay out on-chip: regular and equal-length links
Path diversity: many ways to get from one node to another
Used in Tilera 100-core and many on-chip network prototypes

Torus
Mesh is not symmetric on edges: performance very sensitive to placement of task on edge vs. middle
Torus avoids this problem
+ Higher path diversity (and bisection bandwidth) than mesh
- Higher cost
- Harder to lay out on-chip
- Unequal link lengths
S S S S S S S S + Easy to Layout
MP MP MP MP MP MP MP MP
- Root can become a bottleneck
Fat trees avoid this problem (CM-5)
Fat Tree
1993 1994
Handling contention: two packets trying to use the same link at the same time
What do you do?
Buffer one
Drop one
Misroute one (deflection)¹
Tradeoffs?
¹ Baran, "On Distributed Communication Networks," RAND Tech. Report, 1962 / IEEE Trans. Comm., 1964.
+ Simple
+ Deadlock freedom (no cycles in resource allocation)
- Could lead to high contention
- Does not exploit path diversity
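These pluses and minuses match deterministic dimension-order routing; assuming that is what is being described, below is a minimal XY-routing sketch for a 2D mesh (route fully in X, then in Y), with hypothetical names.

/* Dimension-order (XY) routing on a 2D mesh: go fully along X first,
 * then along Y.  Deterministic and deadlock-free, but offers no path
 * diversity.                                                           */
#include <stdio.h>

typedef struct { int x, y; } node_t;

void route_xy(node_t cur, node_t dst) {
    while (cur.x != dst.x) {              /* X dimension first */
        cur.x += (dst.x > cur.x) ? 1 : -1;
        printf("hop to (%d,%d)\n", cur.x, cur.y);
    }
    while (cur.y != dst.y) {              /* then Y dimension  */
        cur.y += (dst.y > cur.y) ? 1 : -1;
        printf("hop to (%d,%d)\n", cur.x, cur.y);
    }
}

int main(void) {
    node_t src = {0, 0}, dst = {2, 3};
    route_xy(src, dst);                   /* 2 hops in X, then 3 hops in Y */
    return 0;
}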
On-Chip Networks
2D mesh: most commonly used topology
Primarily serve cache misses and memory requests
R: Router; PE: Processing Element (cores, L2 banks, memory controllers, etc.)
[Figure: 2D mesh of routers (R), each attached to a PE]