
18-447
Computer Architecture
Lecture 1: Introduction and Basics

Prof. Onur Mutlu
Carnegie Mellon University
Spring 2015, 1/12/2015

A Key Question
 How Was Wright Able To Design Fallingwater?
 Can have many guesses
 (Ultra) hard work, perseverance, dedication (over decades)
 Experience of decades
 Creativity
 Out-of-the-box thinking
 Principled design
 A good understanding of past designs
 Good judgment and intuition
 Strong combination of skills (math, architecture, art, …)
 …
 (You will be exposed to and hopefully develop/enhance many of these skills in this course)

A Quote from The Architect Himself
 “architecture […] based upon principle, and not upon precedent”

Major High-Level Goals of This Course
 Understand the principles
 Understand the precedents
 Based on such understanding:
 Enable you to evaluate tradeoffs of different designs and ideas
 Enable you to develop principled designs
 Enable you to develop novel, out-of-the-box designs
 The focus is on:
 Principles, precedents, and how to use them for new designs
 In Computer Architecture

Role of The (Computer) Architect
 Look backward (to the past)
 Understand tradeoffs and designs, upsides/downsides, past workloads. Analyze and evaluate the past.
 Look forward (to the future)
 Be the dreamer and create new designs. Listen to dreamers.
 Push the state of the art. Evaluate new design choices.
 Look up (towards problems in the computing stack)
 Understand important problems and their nature.
 Develop architectures and ideas to solve important problems.
 Look down (towards device/circuit technology)
 Understand the capabilities of the underlying technology.
 Predict and adapt to the future of technology (you are designing for N years ahead). Enable the future technology.

Takeaways
 Being an architect is not easy
 You need to consider many things in designing a new system + have good intuition/insight into ideas/tradeoffs
 But, it is fun and can be very technically rewarding
 And, enables a great future
 E.g., many scientific and everyday-life innovations would not have been possible without architectural innovation that enabled very high performance systems
 E.g., your mobile phones
 This course will teach you how to become a good computer architect

So, I Hope You Are Here for This
 How does an assembly program end up executing as digital logic?
 What happens in-between?
 How is a computer designed using logic gates and wires to satisfy specific goals?
 “C” as a model of computation (18-213): Programmer’s view of how a computer system works
 Architect/microarchitect’s view: How to design a computer that meets system design goals. Choices critically affect both the SW programmer and the HW designer
 Digital logic as a model of computation (18-240): HW designer’s view of how a computer system works

Levels of Transformation
 “The purpose of computing is insight” (Richard Hamming)
 We gain and generate insight by solving problems
 How do we ensure problems are solved by electrons?
 Problem
 Algorithm
 Program/Language
 Runtime System (VM, OS, MM)
 ISA (Architecture)
 Microarchitecture
 Logic
 Circuits
 Electrons

The Power of Abstraction
 Levels of transformation create abstractions
 Abstraction: A higher level only needs to know about the interface to the lower level, not how the lower level is implemented
 E.g., high-level language programmer does not really need to know what the ISA is and how a computer executes instructions
 Abstraction improves productivity
 No need to worry about decisions made in underlying levels
 E.g., programming in Java vs. C vs. assembly vs. binary vs. by specifying control signals of each transistor every cycle
 Then, why would you want to know what goes on underneath or above?

Crossing the Abstraction Layers
 As long as everything goes well, not knowing what happens in the underlying level (or above) is not a problem.
 What if
 The program you wrote is running slow?
 The program you wrote does not run correctly?
 The program you wrote consumes too much energy?
 What if
 The hardware you designed is too hard to program?
 The hardware you designed is too slow because it does not provide the right primitives to the software?
 What if
 You want to design a much more efficient and higher performance system?

Crossing the Abstraction Layers
 Two key goals of this course are
 to understand how a processor works underneath the software layer and how decisions made in hardware affect the software/programmer
 to enable you to be comfortable in making design and optimization decisions that cross the boundaries of different layers and system components

An Example: Multi-Core Systems
[Die photo: a multi-core chip with CORE 0–3, per-core L2 caches (L2 CACHE 0–3), a SHARED L3 CACHE, the DRAM INTERFACE, the DRAM MEMORY CONTROLLER, and DRAM BANKS. *Die photo credit: AMD Barcelona]

Unexpected Slowdowns in Multi-Core
[Figure: a high priority application (Core 0) and a low priority application (Core 1) share the memory system; the low priority application acts as a memory performance hog and slows down the high priority one]
Moscibroda and Mutlu, “Memory performance attacks: Denial of memory service in multi-core systems,” USENIX Security 2007.

A Question or Two
 Can you figure out why there is a disparity in slowdowns if you do not know how the system executes the programs?
 Can you fix the problem without knowing what is happening “underneath”?

Why the Disparity in Slowdowns?
[Figure: a multi-core chip in which two cores (one running matlab, the other gcc), each with a private L2 cache, share an interconnect, the DRAM memory controller, and DRAM banks 0–3; the unfairness arises in the shared DRAM memory system]

DRAM Bank Operation
[Figure: a DRAM bank organized as rows and columns. The row decoder activates the row named by the row address (e.g., Row 0) into the row buffer; the column mux then selects the addressed column (e.g., Column 0, 1, 85) and returns the data. Accesses to the already-open row (Row 0, Columns 0, 1, 85) are row-buffer hits; an access to a different row (Row 1, Column 0) is a row-buffer conflict: the open row must be closed and the new row activated before data can be read.]

DRAM Controllers
 A row-conflict memory access takes significantly longer than a row-hit access
 Current controllers take advantage of the row buffer
 Commonly used scheduling policy (FR-FCFS) [Rixner 2000]*
(1) Row-hit first: Service row-hit memory accesses first
(2) Oldest-first: Then service older accesses first
 This scheduling policy aims to maximize DRAM throughput
*Rixner et al., “Memory Access Scheduling,” ISCA 2000.
*Zuravleff and Robinson, “Controller for a synchronous DRAM …,” US Patent 5,630,096, May 1997.

The Problem
 Multiple applications share the DRAM controller
 DRAM controllers designed to maximize DRAM data throughput
 DRAM scheduling policies are unfair to some applications
 Row-hit first: unfairly prioritizes apps with high row buffer locality
 Threads that keep on accessing the same row
 Oldest-first: unfairly prioritizes memory-intensive applications
 DRAM controller vulnerable to denial of service attacks
 Can write programs to exploit unfairness
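To make the FR-FCFS policy above concrete, here is a minimal C sketch (the request-queue structure and its fields are hypothetical, purely for illustration; a real controller also handles multiple banks, DRAM timing constraints, and writes):

#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical pending-request descriptor (illustration only). */
typedef struct {
    bool     valid;
    uint64_t row;           /* DRAM row targeted by the request   */
    uint64_t arrival_time;  /* when the request entered the queue */
} dram_req_t;

/* FR-FCFS sketch: (1) prefer requests that hit the currently open row,
 * (2) break ties by age (oldest first). Returns SIZE_MAX if no request. */
size_t frfcfs_pick(const dram_req_t *q, size_t n, uint64_t open_row)
{
    size_t best = SIZE_MAX;
    for (size_t i = 0; i < n; i++) {
        if (!q[i].valid)
            continue;
        if (best == SIZE_MAX) { best = i; continue; }
        bool i_hit = (q[i].row == open_row);
        bool b_hit = (q[best].row == open_row);
        if (i_hit != b_hit) {
            if (i_hit) best = i;                        /* row-hit first     */
        } else if (q[i].arrival_time < q[best].arrival_time) {
            best = i;                                   /* then oldest first */
        }
    }
    return best;
}

A thread that keeps generating hits to the open row keeps winning rule (1), which is exactly the unfairness the memory performance hog described next exploits.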

A Memory Performance Hog

STREAM (sequential memory access):
// initialize large arrays A, B
for (j=0; j<N; j++) {
  index = j*linesize;    // streaming
  A[index] = B[index];
  …
}
- Very high row buffer locality (96% hit rate)
- Memory intensive

RANDOM (random memory access):
// initialize large arrays A, B
for (j=0; j<N; j++) {
  index = rand();        // random
  A[index] = B[index];
  …
}
- Very low row buffer locality (3% hit rate)
- Similarly memory intensive

Moscibroda and Mutlu, “Memory Performance Attacks,” USENIX Security 2007.

What Does the Memory Hog Do?
[Figure: the memory request buffer holds requests from T0 (STREAM) and T1 (RANDOM). With Row 0 open in the row buffer, T0’s requests keep hitting it, so 128 (= 8KB row / 64B cache block) requests of T0 are serviced before T1’s. Row size: 8KB, cache block size: 64B.]
Moscibroda and Mutlu, “Memory Performance Attacks,” USENIX Security 2007.

Now That We Know What Happens Underneath
 How would you solve the problem?
 What is the right place to solve the problem?
[Levels of transformation shown alongside: Problem, Algorithm, Program/Language, Runtime System (VM, OS, MM), ISA (Architecture), Microarchitecture, Logic, Circuits, Electrons]
 Programmer?
 System software?
 Compiler?
 Hardware (Memory controller)?
 Hardware (DRAM)?
 Circuits?
 Two other goals of this course:
 Enable you to think critically
 Enable you to think broadly

Reading on Memory Performance Attacks
 Thomas Moscibroda and Onur Mutlu,
"Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems"
Proceedings of the 16th USENIX Security Symposium (USENIX SECURITY), pages 257-274, Boston, MA, August 2007. Slides (ppt)
 One potential reading for your Homework 1 assignment

If You Are Interested … Further Readings
 Onur Mutlu and Thomas Moscibroda,
"Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors"
Proceedings of the 40th International Symposium on Microarchitecture (MICRO), pages 146-158, Chicago, IL, December 2007. Slides (ppt)
 Sai Prashanth Muralidhara, Lavanya Subramanian, Onur Mutlu, Mahmut Kandemir, and Thomas Moscibroda,
"Reducing Memory Interference in Multicore Systems via Application-Aware Memory Channel Partitioning"
Proceedings of the 44th International Symposium on Microarchitecture (MICRO), Porto Alegre, Brazil, December 2011. Slides (pptx)

Takeaway
 Breaking the abstraction layers (between components and transformation hierarchy levels) and knowing what is underneath enables you to solve problems

Another Example
 DRAM Refresh

DRAM in the System
[Die photo: the same multi-core chip as before, with CORE 0–3, per-core L2 caches, a SHARED L3 CACHE, the DRAM INTERFACE, the DRAM MEMORY CONTROLLER, and DRAM BANKS. *Die photo credit: AMD Barcelona]

A DRAM Cell
[Figure: a DRAM cell; the wordline (row enable) gates an access transistor that connects the storage capacitor to the bitline]
 A DRAM cell consists of a capacitor and an access transistor
 It stores data in terms of charge in the capacitor
 A DRAM chip consists of (10s of 1000s of) rows of such cells

DRAM Refresh
 DRAM capacitor charge leaks over time
 The memory controller needs to refresh each row periodically to restore charge
 Activate each row every N ms
 Typical N = 64 ms
 Downsides of refresh
-- Energy consumption: Each refresh consumes energy
-- Performance degradation: DRAM rank/bank unavailable while refreshed
-- QoS/predictability impact: (Long) pause times during refresh
-- Refresh rate limits DRAM capacity scaling
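As a rough sense of scale (illustrative numbers, not from the slide): if a bank holds, say, 8192 rows and all of them must be refreshed within the typical 64 ms window, the controller has to issue a refresh about every 64 ms / 8192 ≈ 7.8 µs, and the bank is unavailable during each one; this is where the performance, energy, and QoS costs listed above come from.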

First, Some Analysis
 Imagine a system with 1 ExaByte DRAM
 Assume a row size of 8 KiloBytes
 How many rows are there?
 How many refreshes happen in 64ms?
 What is the total power consumption of DRAM refresh?
 What is the total energy consumption of DRAM refresh during a day?
 Part of your Homework 1

Refresh Overhead: Performance
[Figure: fraction of performance lost to refresh, roughly 8% at present-day densities and projected to reach about 46% at future (64 Gb) densities]
Liu et al., “RAIDR: Retention-Aware Intelligent DRAM Refresh,” ISCA 2012.
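A back-of-the-envelope sketch in C for the first two analysis questions above, assuming 1 ExaByte means 2^60 bytes and that every row is refreshed exactly once per 64 ms window (the power and energy questions additionally need per-refresh energy figures, so they are left to the homework):

#include <stdio.h>

int main(void)
{
    /* Assumptions: 1 EB = 2^60 bytes of DRAM, 8 KB rows, 64 ms refresh window. */
    const double capacity_bytes = 1152921504606846976.0;  /* 2^60 */
    const double row_bytes      = 8192.0;                 /* 2^13 */
    const double window_s       = 0.064;

    double rows = capacity_bytes / row_bytes;              /* = 2^47 rows */
    printf("rows                 : %.0f\n", rows);
    printf("refreshes per 64 ms  : %.0f\n", rows);         /* one per row */
    printf("refreshes per second : %.3e\n", rows / window_s);
    return 0;
}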

Refresh Overhead: Energy
[Figure: fraction of DRAM energy spent on refresh, roughly 15% at present-day densities and projected to reach about 47% at future (64 Gb) densities]
Liu et al., “RAIDR: Retention-Aware Intelligent DRAM Refresh,” ISCA 2012.

How Do We Solve the Problem?
 Do we need to refresh all rows every 64ms?
 What if we knew what happened underneath and exposed that information to upper layers?

Underneath: Retention Time Profile of DRAM
[Figure: measured retention time profile of DRAM rows]

Taking Advantage of This Profile


 Expose this retention time profile information to
 the memory controller
 the operating system
 the programmer?
 the compiler?

 How much information to expose?


 Affects hardware/software overhead, power consumption,
verification complexity, cost

 How to determine this profile information?


 Also, who determines it?

Liu et al., “RAIDR: Retention-Aware Intelligent DRAM Refresh,” ISCA 2012.

An Example: RAIDR
 Observation: Most DRAM rows can be refreshed much less often without losing data [Kim+, EDL’09][Liu+ ISCA’13]
 Key idea: Refresh rows containing weak cells more frequently, other rows less frequently
1. Profiling: Profile retention time of all rows
2. Binning: Store rows into bins by retention time in memory controller
   Efficient storage with Bloom Filters (only 1.25KB for 32GB memory)
3. Refreshing: Memory controller refreshes rows in different bins at different rates
 Results: 8-core, 32GB, SPEC, TPC-C, TPC-H
 74.6% refresh reduction @ 1.25KB storage
 ~16%/20% DRAM dynamic/idle power reduction
 ~9% performance improvement
 Benefits increase with DRAM capacity
Liu et al., “RAIDR: Retention-Aware Intelligent DRAM Refresh,” ISCA 2012.

Reading on RAIDR
 Jamie Liu, Ben Jaiyen, Richard Veras, and Onur Mutlu,
"RAIDR: Retention-Aware Intelligent DRAM Refresh"
Proceedings of the 39th International Symposium on Computer Architecture (ISCA), Portland, OR, June 2012. Slides (pdf)
 One potential reading for your Homework 1 assignment
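Going back to RAIDR’s binning step: below is a minimal C sketch of the Bloom-filter idea, assuming a single filter that records which rows are “weak” and must keep the default (frequent) refresh rate. The filter size and hash functions are illustrative, not the ones used in the RAIDR paper.

#include <stdbool.h>
#include <stdint.h>

#define FILTER_BITS 8192                    /* illustrative size only */

static uint8_t filter[FILTER_BITS / 8];

/* Two simple illustrative hash functions over the row address. */
static uint32_t h1(uint64_t row) { return (uint32_t)((row * 2654435761u) % FILTER_BITS); }
static uint32_t h2(uint64_t row) { return (uint32_t)(((row >> 7) ^ (row * 40503u)) % FILTER_BITS); }

static void set_bit(uint32_t i) { filter[i / 8] |= (uint8_t)(1u << (i % 8)); }
static bool get_bit(uint32_t i) { return (filter[i / 8] >> (i % 8)) & 1u; }

/* Profiling step: record a row whose retention time is below the relaxed interval. */
void mark_weak_row(uint64_t row)
{
    set_bit(h1(row));
    set_bit(h2(row));
}

/* Refresh step: weak rows (and any false positives) get the frequent rate;
 * all other rows can be refreshed less often.                              */
bool needs_frequent_refresh(uint64_t row)
{
    return get_bit(h1(row)) && get_bit(h2(row));
}

A useful property of this structure is that it can never forget a row once inserted (no false negatives); a false positive only refreshes a strong row more often than necessary, which wastes a little energy but never loses data.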

If You Are Interested … Further Readings
 Onur Mutlu,
"Memory Scaling: A Systems Architecture Perspective"
Technical talk at MemCon 2013 (MEMCON), Santa Clara, CA, August 2013. Slides (pptx) (pdf) Video
 Kevin Chang, Donghyuk Lee, Zeshan Chishti, Alaa Alameldeen, Chris Wilkerson, Yoongu Kim, and Onur Mutlu,
"Improving DRAM Performance by Parallelizing Refreshes with Accesses"
Proceedings of the 20th International Symposium on High-Performance Computer Architecture (HPCA), Orlando, FL, February 2014. Slides (pptx) (pdf)

Takeaway
 Breaking the abstraction layers (between components and transformation hierarchy levels) and knowing what is underneath enables you to solve problems and design better future systems
 Cooperation between multiple components and layers can enable more effective solutions and systems

Yet Another Example
 DRAM Row Hammer (or, DRAM Disturbance Errors)

Disturbance Errors in Modern DRAM
[Figure: a row of DRAM cells driven by its wordline; repeatedly opening (wordline to VHIGH) and closing (wordline to VLOW) an aggressor row disturbs the victim rows adjacent to it]
 Repeatedly opening and closing a row enough times within a refresh interval induces disturbance errors in adjacent rows in most real DRAM chips you can buy today
Kim+, “Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors,” ISCA 2014.

Most DRAM Modules Are At Risk
 A company: 37/43 modules vulnerable, up to ~10^7 errors
 B company: 45/54 modules vulnerable, up to ~10^6 errors
 C company: 28/32 modules vulnerable, up to ~10^5 errors

x86 CPU / DRAM Module
[Figure: an x86 CPU repeatedly accessing two addresses X and Y in a DRAM module using the loop below; X and Y map to different rows of the same bank]
loop:
  mov (X), %eax
  mov (Y), %ebx
  clflush (X)
  clflush (Y)
  mfence
  jmp loop
Kim+, “Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors,” ISCA 2014.
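A hedged C-level rendering of the same access pattern, using the x86 SSE2 intrinsics for clflush and mfence (choosing X and Y so that they map to different rows of the same DRAM bank is the hard part and is not shown here):

#include <emmintrin.h>   /* _mm_clflush, _mm_mfence */
#include <stdint.h>

/* Repeatedly activate the two rows holding *x and *y by reading them and
 * flushing them from the cache so every iteration goes to DRAM.          */
void hammer(volatile uint64_t *x, volatile uint64_t *y, long iterations)
{
    for (long i = 0; i < iterations; i++) {
        (void)*x;                      /* mov (X), %eax */
        (void)*y;                      /* mov (Y), %ebx */
        _mm_clflush((const void *)x);  /* clflush (X)   */
        _mm_clflush((const void *)y);  /* clflush (Y)   */
        _mm_mfence();                  /* mfence        */
    }
}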

Observed Errors in Real Systems
CPU Architecture            Errors    Access-Rate
Intel Haswell (2013)        22.9K     12.3M/sec
Intel Ivy Bridge (2012)     20.7K     11.7M/sec
Intel Sandy Bridge (2011)   16.1K     11.6M/sec
AMD Piledriver (2012)       59        6.1M/sec
• A real reliability & security issue
• In a more controlled environment, we can induce as many as ten million disturbance errors
Kim+, “Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors,” ISCA 2014.

Errors vs. Vintage
[Figure: observed errors versus DRAM module manufacture date, with the first appearance of errors marked]
 All modules from 2012–2013 are vulnerable

How Do We Solve The Problem?
 Do business as usual but better: Improve circuit and device technology such that disturbance does not happen. Use stronger error correcting codes.
 Tolerate it: Make DRAM and controllers more intelligent so that they can proactively fix the errors
 Eliminate or minimize it: Replace DRAM with a different technology that does not have the problem
 Embrace it: Design heterogeneous-reliability memories that map error-tolerant data to less reliable portions
 …

More on DRAM Disturbance Errors
 Yoongu Kim, Ross Daly, Jeremie Kim, Chris Fallin, Ji Hye Lee, Donghyuk Lee, Chris Wilkerson, Konrad Lai, and Onur Mutlu,
"Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors"
Proceedings of the 41st International Symposium on Computer Architecture (ISCA), Minneapolis, MN, June 2014. Slides (pptx) (pdf) Lightning Session Slides (pptx) (pdf) Source Code and Data
 Source Code to Induce Errors in Modern DRAM Chips
 https://github.com/CMU-SAFARI/rowhammer
 One potential reading for your Homework 1 assignment

Recap: Some Goals of 447 Review: Major High-Level Goals of This Course
 Teach/enable/empower you to:  Understand the principles
 Understand how a computing platform (processor + memory +  Understand the precedents
interconnect) works
 Implement a simple platform (with not so simple parts), with a
 Based on such understanding:
focus on the processor and memory
 Enable you to evaluate tradeoffs of different designs and ideas
 Understand how decisions made in hardware affect the
software/programmer as well as hardware designer  Enable you to develop principled designs
 Think critically (in solving problems)  Enable you to develop novel, out-of-the-box designs
 Think broadly across the levels of transformation
 Understand how to analyze and make tradeoffs in design  The focus is on:
 Principles, precedents, and how to use them for new designs

 In Computer Architecture

47 48
10-11-2023

A Note on Hardware vs. Software What Do I Expect From You?


 This course is classified under “Computer Hardware”  Required background: 240 (digital logic, RTL implementation,
Verilog), 213 (systems, virtual memory, assembly)
 However, you will be much more capable if you master
both hardware and software (and the interface between  Learn the material thoroughly
them)  attend lectures, do the readings, do the homeworks
 Can develop better software if you understand the underlying  Do the work & work hard
hardware
 Ask questions, take notes, participate
 Can design better hardware if you understand what software
it will execute  Perform the assigned readings
 Can design a better computing system if you understand both  Come to class on time
 Start early – do not procrastinate
 This course covers the HW/SW interface and  If you want feedback, come to office hours
microarchitecture
 We will focus on tradeoffs and how they affect software  Remember “Chance favors the prepared mind.” (Pasteur)
49 50

What Do I Expect From You? Required Readings for This Week


 How you prepare and manage your time is very important  Patt, “Requirements, Bottlenecks, and Good Fortune: Agents for
Microprocessor Evolution,” Proceedings of the IEEE 2001.
 One of
 There will be an assignment due almost every week  Moscibroda and Mutlu, “Memory Performance Attacks: Denial of Memory
 8 Labs and 7 Homework Assignments Service in Multi-Core Systems,” USENIX Security 2007.
 Liu+, “RAIDR: Retention-Aware Intelligent DRAM Refresh,” ISCA 2012.
 Kim+, “Flipping Bits in Memory Without Accessing Them: An Experimental
 This will be a heavy course Study of DRAM Disturbance Errors,” ISCA 2014.
 However, you will learn a lot of fascinating topics and
understand how a microprocessor actually works (and how it  P&P Chapter 1 (Fundamentals)
can be made to work better)
 P&H Chapters 1 and 2 (Intro, Abstractions, ISA, MIPS)
 And, it will hopefully change how you look at and think about
designs around you
 Reference material throughout the course
 MIPS ISA Reference Manual + x86 ISA Reference Manual
 http://www.ece.cmu.edu/~ece447/s15/doku.php?id=techdocs
51 52
10-11-2023

What Will You Learn


 Computer Architecture: The science and art of
18-447 designing, selecting, and interconnecting hardware
components and designing the hardware/software interface
Computer Architecture to create a computing system that meets functional,
performance, energy consumption, cost, and other specific
Lecture 1: Introduction and Basics goals.

 Traditional definition: “The term architecture is used


here to describe the attributes of a system as seen by the
Prof. Onur Mutlu programmer, i.e., the conceptual structure and functional
Carnegie Mellon University behavior as distinct from the organization of the dataflow
and controls, the logic design, and the physical
Spring 2015, 1/12/2015
implementation.” Gene Amdahl, IBM Journal of R&D, April
1964
54

Computer Architecture in Levels of Transformation Levels of Transformation, Revisited


 A user-centric view: computer designed for users
Problem Problem
Algorithm Algorithm
Program/Language Program/Language User
Runtime System
(VM, OS, MM)
Runtime System
ISA (Architecture)
(VM, OS, MM)
Microarchitecture
ISA
Logic
Microarchitecture
Circuits
Logic
Electrons
Circuits
 Read: Patt, “Requirements, Bottlenecks, and Good Fortune: Agents for Electrons
Microprocessor Evolution,” Proceedings of the IEEE 2001.
 The entire stack should be optimized for user
55 56
10-11-2023

What Will You Learn? Course Goals


 Fundamental principles and tradeoffs in designing the  Goal 1: To familiarize those interested in computer system
hardware/software interface and major components of a design with both fundamental operation principles and design
modern programmable microprocessor tradeoffs of processor, memory, and platform architectures in
 Focus on state-of-the-art (and some recent research and trends) today’s systems.
 Trade-offs and how to make them  Strong emphasis on fundamentals and design tradeoffs.

 How to design, implement, and evaluate a functional modern  Goal 2: To provide the necessary background and experience to
processor design, implement, and evaluate a modern processor by
 Semester-long lab assignments performing hands-on RTL and C-level implementation.
 A combination of RTL implementation and higher-level simulation  Strong emphasis on functionality and hands-on design.

 Focus is functionality first (some on “how to do even better”)

 How to dig out information, think critically and broadly


 How to work even harder!
57 58

Agenda for Today


 Finish up logistics from last lecture
18-447
 Why study computer architecture?
Computer Architecture
Lecture 2: Fundamental Concepts and ISA  Some fundamental concepts in computer architecture

 ISA

Prof. Onur Mutlu


Carnegie Mellon University
Spring 2015, 1/14/2014

60
10-11-2023

Last Lecture Recap Review: Key Takeaway (from 3 Problems)


 What it means/takes to be a good (computer) architect  Breaking the abstraction layers (between components and
 Roles of a computer architect (look everywhere!) transformation hierarchy levels) and knowing what is
 Goals of 447 and what you will learn in this course underneath enables you to solve problems and design
better future systems
 Levels of transformation
 Abstraction layers, their benefits, and the benefits of
comfortably crossing them  Cooperation between multiple components and layers can
enable more effective solutions and systems
 Three example problems and solution ideas
 Memory Performance Attacks
 DRAM Refresh
 Row Hammer: DRAM Disturbance Errors
 Hamming Distance and Bloom Filters
 Course Logistics
 Assignments: HW0 (Jan 16), Lab1 (Jan 23), HW1 (Jan 28)
61 62

A Note on Hardware vs. Software Computer Architecture in Levels of Transformation


 This course is classified under “Computer Hardware”
Problem
 However, you will be much more capable if you master Algorithm
both hardware and software (and the interface between Program/Language
them) Runtime System
 Can develop better software if you understand the underlying (VM, OS, MM)
hardware ISA (Architecture)
 Can design better hardware if you understand what software Microarchitecture
it will execute Logic
 Can design a better computing system if you understand both Circuits
Electrons

 This course covers the HW/SW interface and  Read: Patt, “Requirements, Bottlenecks, and Good Fortune: Agents for
microarchitecture Microprocessor Evolution,” Proceedings of the IEEE 2001.
 We will focus on tradeoffs and how they affect software
63 64
10-11-2023

Aside: What Is An Algorithm? Levels of Transformation, Revisited


 Step-by-step procedure where each step has three  A user-centric view: computer designed for users
properties: Problem
 Definite (precisely defined) Algorithm
 Effectively computable (by a computer) Program/Language User
 Terminates
Runtime System
(VM, OS, MM)
ISA
Microarchitecture
Logic
Circuits
Electrons

 The entire stack should be optimized for user


65 66

What Will You Learn? Course Goals


 Fundamental principles and tradeoffs in designing the  Goal 1: To familiarize those interested in computer system
hardware/software interface and major components of a design with both fundamental operation principles and design
modern programmable microprocessor tradeoffs of processor, memory, and platform architectures in
 Focus on state-of-the-art (and some recent research and trends) today’s systems.
 Trade-offs and how to make them  Strong emphasis on fundamentals, design tradeoffs, key
current/future issues
 Strong emphasis on looking backward, forward, up and down
 How to design, implement, and evaluate a functional modern
processor
 Semester-long lab assignments  Goal 2: To provide the necessary background and experience to
 A combination of RTL implementation and higher-level simulation design, implement, and evaluate a modern processor by
 Focus is functionality first (then, on “how to do even better”) performing hands-on RTL and C-level implementation.
 Strong emphasis on functionality, hands-on design &

 How to dig out information, think critically and broadly implementation, and efficiency.
 Strong emphasis on making things work, realizing ideas
 How to work even harder and more efficiently!
67 68
10-11-2023

What is Computer Architecture?

 The science and art of designing, selecting, and


Why Study Computer interconnecting hardware components and designing the
hardware/software interface to create a computing system
Architecture? that meets functional, performance, energy consumption,
cost, and other specific goals.

 We will soon distinguish between the terms architecture,


and microarchitecture.

69 70

An Enabler: Moore’s Law

Moore, “Cramming more components onto integrated circuits,”


Electronics Magazine, 1965. Component counts double every other year
Number of transistors on an integrated circuit doubles ~ every two years
Image source: Intel
71 Image source: Wikipedia
72
10-11-2023

Recommended Reading What Do We Use These Transistors for?


 Moore, “Cramming more components onto integrated  Your readings for this week should give you an idea…
circuits,” Electronics Magazine, 1965.
 Patt, “Requirements, Bottlenecks, and Good Fortune:
 Only 3 pages Agents for Microprocessor Evolution,” Proceedings of the
IEEE 2001.
 A quote:
“With unit cost falling as the number of components per  One of:
circuit rises, by 1975 economics may dictate squeezing as  Moscibroda and Mutlu, “Memory Performance Attacks: Denial
many as 65 000 components on a single silicon chip.” of Memory Service in Multi-Core Systems,” USENIX Security
2007.
 Liu+, “RAIDR: Retention-Aware Intelligent DRAM Refresh,”
 Another quote: ISCA 2012.
“Will it be possible to remove the heat generated by tens of  Kim+, “Flipping Bits in Memory Without Accessing Them: An
thousands of components in a single silicon chip?” Experimental Study of DRAM Disturbance Errors,” ISCA 2014.
73 74

Why Study Computer Architecture? Computer Architecture Today (I)


 Enable better systems: make computers faster, cheaper,  Today is a very exciting time to study computer architecture
smaller, more reliable, …
 By exploiting advances and changes in underlying technology/circuits  Industry is in a large paradigm shift (to multi-core and
beyond) – many different potential system designs possible
 Enable new applications
 Many difficult problems motivating and caused by the shift
 Life-like 3D visualization 20 years ago?
 Power/energy constraints  multi-core?
 Virtual reality?
 Complexity of design  multi-core?
 Personalized genomics? Personalized medicine?
 Difficulties in technology scaling  new technologies?
 Memory wall/gap
 Enable better solutions to problems  Reliability wall/issues
 Software innovation is built into trends and changes in computer architecture  Programmability wall/problem
 > 50% performance improvement per year has enabled this innovation
 Huge hunger for data and new data-intensive applications

 Understand why computers work the way they do  No clear, definitive answers to these problems
75 76
10-11-2023

Computer Architecture Today (II) Computer Architecture Today (III)


 These problems affect all parts of the computing stack – if  Computing landscape is very different from 10-20 years ago
we do not change the way we design systems  Both UP (software and humanity trends) and DOWN
Many new demands Problem (technologies and their issues), FORWARD and BACKWARD,
from the top Algorithm and the resulting requirements and constraints
(Look Up) Program/Language User Fast changing
demands and
personalities
of users Hybrid Main Memory
Runtime System
(VM, OS, MM) (Look Up)

ISA Heterogeneous Persistent Memory/Storage


Microarchitecture Processors and
Accelerators Every component and its
Many new issues Logic interfaces, as well as
at the bottom Circuits entire system designs
(Look Down)
Electrons are being re-examined
General Purpose GPUs
 No clear, definitive answers to these problems
77 78

Computer Architecture Today (IV) Computer Architecture Today (IV)


 You can revolutionize the way computers are built, if you  You can revolutionize the way computers are built, if you
understand both the hardware and the software (and understand both the hardware and the software (and
change each accordingly) change each accordingly)

 You can invent new paradigms for computation,  You can invent new paradigms for computation,
communication, and storage communication, and storage

 Recommended book: Thomas Kuhn, “The Structure of  Recommended book: Thomas Kuhn, “The Structure of
Scientific Revolutions” (1962) Scientific Revolutions” (1962)
 Pre-paradigm science: no clear consensus in the field  Pre-paradigm science: no clear consensus in the field
 Normal science: dominant theory used to explain/improve  Normal science: dominant theory used to explain/improve
things (business as usual); exceptions considered anomalies things (business as usual); exceptions considered anomalies
 Revolutionary science: underlying assumptions re-examined  Revolutionary science: underlying assumptions re-examined

79 80
10-11-2023

… but, first …
 Let’s understand the fundamentals…

 You can change the world only if you understand it well Fundamental Concepts
enough…
 Especially the past and present dominant paradigms
 And, their advantages and shortcomings – tradeoffs
 And, what remains fundamental across generations
 And, what techniques you can use and develop to solve
problems

81 82

What is A Computer? What is A Computer?


 Three key components  We will cover all three components

 Computation
 Communication Processing
 Storage (memory)
control Memory
(sequencing) (program I/O
and data)
datapath

83 84
10-11-2023

The Von Neumann Model/Architecture The Von Neumann Model/Architecture


 Also called stored program computer (instructions in  Recommended reading
memory). Two key properties:  Burks, Goldstein, von Neumann, “Preliminary discussion of the
logical design of an electronic computing instrument,” 1946.
 Stored program  Patt and Patel book, Chapter 4, “The von Neumann Model”
 Instructions stored in a linear memory array
 Memory is unified between instructions and data  Stored program
 The interpretation of a stored value depends on the control
signals When is a value interpreted as an instruction?  Sequential instruction processing

 Sequential instruction processing


 One instruction processed (fetched, executed, and completed) at a
time
 Program counter (instruction pointer) identifies the current instr.
 Program counter is advanced sequentially except for control transfer
instructions
85 86

The Von Neumann Model (of a Computer) The Von Neumann Model (of a Computer)
MEMORY  Q: Is this the only way that a computer can operate?
Mem Addr Reg

Mem Data Reg  A: No.


 Qualified Answer: But, it has been the dominant way
 i.e., the dominant paradigm for computing
PROCESSING UNIT
INPUT OUTPUT  for N decades
ALU TEMP

CONTROL UNIT

IP Inst Register

87 88
10-11-2023

The Dataflow Model (of a Computer)
 Von Neumann model: An instruction is fetched and executed in control flow order
 As specified by the instruction pointer
 Sequential unless explicit control flow instruction
 Dataflow model: An instruction is fetched and executed in data flow order
 i.e., when its operands are ready
 i.e., there is no instruction pointer
 Instruction ordering specified by data flow dependence
 Each instruction specifies “who” should receive the result
 An instruction can “fire” whenever all operands are received
 Potentially many instructions can execute at the same time
 Inherently more parallel

Von Neumann vs Dataflow
 Consider a Von Neumann program
 What is the significance of the program order?
 What is the significance of the storage locations?
Sequential program:
  v <= a + b;
  w <= b * 2;
  x <= v - w;
  y <= v + w;
  z <= x * y
[Figure: the same computation as a dataflow graph: inputs a and b feed a + node and a *2 node; their outputs feed - and + nodes producing x and y, which feed a * node producing z]
 Which model is more natural to you as a programmer?

More on Data Flow Data Flow Nodes


 In a data flow machine, a program consists of data flow
nodes
 A data flow node fires (fetched and executed) when all it
inputs are ready
 i.e. when all inputs have tokens

 Data flow node and its ISA representation

91 92
10-11-2023

An Example Data Flow Program ISA-level Tradeoff: Instruction Pointer


 Do we need an instruction pointer in the ISA?
 Yes: Control-driven, sequential execution
 An instruction is executed when the IP points to it
 IP automatically changes sequentially (except for control flow
instructions)
 No: Data-driven, parallel execution
 An instruction is executed when all its operand values are
available (data flow)

 Tradeoffs: MANY high-level ones


 Ease of programming (for average programmers)?
 Ease of compilation?
OUT  Performance: Extraction of parallelism?
 Hardware complexity?

93 94

ISA vs. Microarchitecture Level Tradeoff Let’s Get Back to the Von Neumann Model
 A similar tradeoff (control vs. data-driven execution) can be  But, if you want to learn more about dataflow…
made at the microarchitecture level
 Dennis and Misunas, “A preliminary architecture for a basic
 ISA: Specifies how the programmer sees instructions to be data-flow processor,” ISCA 1974.
executed  Gurd et al., “The Manchester prototype dataflow
 Programmer sees a sequential, control-flow execution order vs. computer,” CACM 1985.
 Programmer sees a data-flow execution order  A later 447 lecture, 740/742

 Microarchitecture: How the underlying implementation  If you are really impatient:


actually executes instructions  http://www.youtube.com/watch?v=D2uue7izU2c
 Microarchitecture can execute instructions in any order as long
 http://www.ece.cmu.edu/~ece740/f13/lib/exe/fetch.php?medi
as it obeys the semantics specified by the ISA when making the
a=onur-740-fall13-module5.2.1-dataflow-part1.ppt
instruction results visible to software
 Programmer should see the order specified by the ISA
95 96
10-11-2023

The Von-Neumann Model What is Computer Architecture?


 All major instruction set architectures today use this model  ISA+implementation definition: The science and art of
 x86, ARM, MIPS, SPARC, Alpha, POWER designing, selecting, and interconnecting hardware
components and designing the hardware/software interface
 Underneath (at the microarchitecture level), the execution to create a computing system that meets functional,
model of almost all implementations (or, microarchitectures) performance, energy consumption, cost, and other specific
is very different goals.
 Pipelined instruction execution: Intel 80486 uarch

 Multiple instructions at a time: Intel Pentium uarch  Traditional (ISA-only) definition: “The term
 Out-of-order execution: Intel Pentium Pro uarch architecture is used here to describe the attributes of a
 Separate instruction and data caches system as seen by the programmer, i.e., the conceptual
structure and functional behavior as distinct from the
 But, what happens underneath that is not consistent with organization of the dataflow and controls, the logic design,
the von Neumann model is not exposed to software and the physical implementation.” Gene Amdahl, IBM
 Difference between ISA and microarchitecture Journal of R&D, April 1964
97 98

ISA vs. Microarchitecture ISA vs. Microarchitecture


 What is part of ISA vs. Uarch?
 ISA  Gas pedal: interface for “acceleration”
 Agreed upon interface between software Problem
 Internals of the engine: implement “acceleration”
and hardware Algorithm
 SW/compiler assumes, HW promises  Implementation (uarch) can be various as long as it
Program
 What the software writer needs to know ISA satisfies the specification (ISA)
to write and debug system/user programs Microarchitecture  Add instruction vs. Adder implementation
 Microarchitecture Circuits
 Bit serial, ripple carry, carry lookahead adders are all part of
microarchitecture
 Specific implementation of an ISA Electrons
 x86 ISA has many implementations: 286, 386, 486, Pentium,
 Not visible to the software Pentium Pro, Pentium 4, Core, …

 Microprocessor
 Microarchitecture usually changes faster than ISA
 ISA, uarch, circuits
 Few ISAs (x86, ARM, SPARC, MIPS, Alpha) but many uarchs
 “Architecture” = ISA + microarchitecture  Why?
99 100
10-11-2023

ISA Microarchitecture
 Instructions  Implementation of the ISA under specific design constraints
 Opcodes, Addressing Modes, Data Types and goals
Instruction Types and Formats

 Anything done in hardware without exposure to software
 Registers, Condition Codes
 Pipelining
 Memory
 In-order versus out-of-order instruction execution
 Address space, Addressability, Alignment
 Virtual memory management  Memory access scheduling policy
 Call, Interrupt/Exception Handling  Speculative execution
Superscalar processing (multiple instruction issue?)
 Access Control, Priority/Privilege 

 Clock gating
 I/O: memory-mapped vs. instr.
 Caching? Levels, size, associativity, replacement policy
 Task/thread Management
 Prefetching?
 Power and Thermal Management  Voltage/frequency scaling?
 Multi-threading support, Multiprocessor support  Error correction?
101 102

Last Lecture Recap


 Levels of Transformation
18-447  Algorithm, ISA, Microarchitecture
 Moore’s Law
Computer Architecture  What is Computer Architecture
Lecture 3: ISA Tradeoffs  Why Study Computer Architecture
 Fundamental Concepts
 Von Neumann Model
 Dataflow Model
Prof. Onur Mutlu  ISA vs. Microarchitecture
Carnegie Mellon University
Spring 2015, 1/16/2015  Assignments: HW0 (today!), Lab1 (Jan 23), HW1 (Jan 28)

104
10-11-2023

Review: ISA vs. Microarchitecture Review: ISA


 Instructions
 ISA  Opcodes, Addressing Modes, Data Types
 Agreed upon interface between software Problem
 Instruction Types and Formats
and hardware Algorithm
 Registers, Condition Codes
 SW/compiler assumes, HW promises
Program
 Memory
 What the software writer needs to know ISA
 Address space, Addressability, Alignment
to write and debug system/user programs  Virtual memory management
Microarchitecture
 Call, Interrupt/Exception Handling
 Microarchitecture Circuits
 Access Control, Priority/Privilege
 Specific implementation of an ISA Electrons
 I/O: memory-mapped vs. instr.
 Not visible to the software
 Task/thread Management
 Microprocessor  Power and Thermal Management
 ISA, uarch, circuits
 Multi-threading support, Multiprocessor support
 “Architecture” = ISA + microarchitecture
105 106

Microarchitecture Property of ISA vs. Uarch?


 Implementation of the ISA under specific design constraints  ADD instruction’s opcode
and goals  Number of general purpose registers
 Anything done in hardware without exposure to software  Number of ports to the register file
 Pipelining  Number of cycles to execute the MUL instruction
 In-order versus out-of-order instruction execution  Whether or not the machine employs pipelined instruction
 Memory access scheduling policy execution
 Speculative execution
 Superscalar processing (multiple instruction issue?)
 Clock gating
 Remember
 Caching? Levels, size, associativity, replacement policy
 Microarchitecture: Implementation of the ISA under specific
 Prefetching?
design constraints and goals
 Voltage/frequency scaling?
 Error correction?
107 108
10-11-2023

Design Point Application Space


 A set of design considerations and their importance  Dream, and they will appear…
 leads to tradeoffs in both ISA and uarch
 Considerations Problem
 Cost Algorithm

 Performance Program

 Maximum power consumption ISA


Microarchitecture
 Energy consumption (battery life)
Circuits
 Availability
Electrons
 Reliability and Correctness
 Time to Market

 Design point determined by the “Problem” space


(application space), the intended users/market
109 110

Tradeoffs: Soul of Computer Architecture Why Is It (Somewhat) Art?


New demands Problem
 ISA-level tradeoffs
from the top Algorithm
(Look Up) New demands and
Program/Language User
 Microarchitecture-level tradeoffs personalities of users
(Look Up)

Runtime System
 System and Task-level tradeoffs (VM, OS, MM)
 How to divide the labor between hardware and software ISA
Microarchitecture
New issues and Logic
capabilities
Circuits
 Computer architecture is the science and art of making the at the bottom
Electrons
appropriate trade-offs to meet a design point (Look Down)

 Why art?
 We do not (fully) know the future (applications, users, market)
111 112
10-11-2023

Why Is It (Somewhat) Art? Analogue from Macro-Architecture


 Future is not constant in macro-architecture, either
Changing demands Problem
at the top Algorithm
(Look Up and Forward)
Program/Language User Changing demands and  Example: Can a power plant boiler room be later used as a
personalities of users classroom?
(Look Up and Forward)

Runtime System
(VM, OS, MM)
ISA
Microarchitecture
Changing issues and Logic
capabilities
Circuits
at the bottom
(Look Down and Forward) Electrons

 And, the future is not constant (it changes)!


113 114

Macro-Architecture: Boiler Room How Can We Adapt to the Future


 This is part of the task of a good computer architect

 Many options (bag of tricks)


 Keen insight and good design
 Good use of fundamentals and principles
 Efficient design
 Heterogeneity
 Reconfigurability
 …
 Good use of the underlying technology
 …

115 116
10-11-2023

Many Different ISAs Over Decades


 x86
 PDP-x: Programmed Data Processor (PDP-11)
ISA Principles and Tradeoffs  VAX
 IBM 360
 CDC 6600
 SIMD ISAs: CRAY-1, Connection Machine
 VLIW ISAs: Multiflow, Cydrome, IA-64 (EPIC)
 PowerPC, POWER
 RISC ISAs: Alpha, MIPS, SPARC, ARM

 What are the fundamental differences?


 E.g., how instructions are specified and what they do
 E.g., how complex are the instructions
117 118

Instruction MIPS
 Basic element of the HW/SW interface
 Consists of
 opcode: what the instruction does
 operands: who it is to do it to
0 rs rt rd shamt funct R-type
6-bit 5-bit 5-bit 5-bit 5-bit 6-bit

 Example from the Alpha ISA: opcode rs rt immediate I-type


6-bit 5-bit 5-bit 16-bit

opcode immediate J-type


6-bit 26-bit

119 120
10-11-2023

ARM Set of Instructions, Encoding, and Spec


 Example from LC-3b ISA
 http://www.ece.utexas.e
du/~patt/11s.460N/hand
outs/new_byte.pdf
 x86 Manual

 Why unused instructions?


 Aside: concept of “bit
steering”
 A bit in the instruction
determines the
interpretation of other
bits

121 122

Bit Steering in Alpha What Are the Elements of An ISA?


 Instruction sequencing model
 Control flow vs. data flow
 Tradeoffs?

 Instruction processing style


 Specifies the number of “operands” an instruction “operates”
on and how it does so
 0, 1, 2, 3 address machines
 0-address: stack machine (op, push A, pop A)
 1-address: accumulator machine (op ACC, ld A, st A)
 2-address: 2-operand machine (op S,D; one is both source and dest)
 3-address: 3-operand machine (op S1,S2,D; source and dest separate)
 Tradeoffs? See your homework question
 Larger operate instructions vs. more executed operations
 Code size vs. execution time vs. on-chip memory space
123 124
10-11-2023

An Example: Stack Machine
[Figure: stack machine organization. Koopman, “Stack Computers: The New Wave,” 1989. http://www.ece.cmu.edu/~koopman/stack_computers/sec3_2.html]

An Example: Stack Machine (II)
+ Small instruction size (no operands needed for operate instructions)
 Simpler logic
 Compact code
+ Efficient procedure calls: all parameters on stack
 No additional cycles for parameter passing
-- Computations that are not easily expressible with “postfix notation” are difficult to map to stack machines
 Cannot perform operations on many values at the same time (only top N values on the stack at the same time)
 Not flexible
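A toy C sketch of the 0-address idea behind a stack machine (purely illustrative; real stack ISAs such as those surveyed by Koopman differ in many details): operate “instructions” carry no operand specifiers because they implicitly pop their inputs and push their result.

#include <stdio.h>

#define STACK_DEPTH 64

static long stack[STACK_DEPTH];
static int  sp = 0;                       /* next free slot */

static void push(long v) { stack[sp++] = v; }
static long pop(void)    { return stack[--sp]; }

/* 0-address operate instructions: no operand fields needed at all. */
static void op_add(void) { long b = pop(), a = pop(); push(a + b); }
static void op_mul(void) { long b = pop(), a = pop(); push(a * b); }

int main(void)
{
    /* Evaluate (2 + 3) * 4 in postfix: 2 3 add 4 mul */
    push(2); push(3); op_add();
    push(4); op_mul();
    printf("%ld\n", pop());               /* prints 20 */
    return 0;
}

The example also shows the limitation noted above: only values near the top of the stack are directly reachable.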

An Example: Stack Machine Operation Other Examples


 PDP-11: A 2-address machine
 PDP-11 ADD: 4-bit opcode, 2 6-bit operand specifiers
 Why? Limited bits to specify an instruction
 Disadvantage: One source operand is always clobbered with
Koopman, “Stack Computers: the result of the instruction
The New Wave,” 1989.
 How do you ensure you preserve the old value of the source?
http://www.ece.cmu.edu/~koo
pman/stack_computers/sec3
_2.html

 X86: A 2-address (memory/memory) machine


 Alpha: A 3-address (load/store) machine
 MIPS?
 ARM?

127 128
10-11-2023

What Are the Elements of An ISA? Data Type Tradeoffs


 Instructions  What is the benefit of having more or high-level data types
 Opcode in the ISA?
 Operand specifiers (addressing modes)  What is the disadvantage?
 How to obtain the operand? Why are there different addressing modes?

 Think compiler/programmer vs. microarchitect


 Data types
 Definition: Representation of information for which there are
instructions that operate on the representation  Concept of semantic gap
 Integer, floating point, character, binary, decimal, BCD  Data types coupled tightly to the semantic level, or complexity
 Doubly linked list, queue, string, bit vector, stack of instructions
 VAX: INSQUEUE and REMQUEUE instructions on a doubly linked
list or queue; FINDFIRST  Example: Early RISC architectures vs. Intel 432
 Digital Equipment Corp., “VAX11 780 Architecture Handbook,”  Early RISC: Only integer data type
1977.
 Intel 432: Object data type, capability based machine
 X86: SCAN opcode operates on character strings; PUSH/POP
129 130

An Example: BCD
 Each decimal digit is encoded with a fixed number of bits
[Figures: a binary clock and a digital BCD clock. Image credits: “Binary clock” by Alexander Jones & Eric Pierce (CC BY-SA 3.0, via Wikimedia Commons, http://commons.wikimedia.org/wiki/File:Binary_clock.svg); “Digital-BCD-clock” by Julo (Public Domain, via Wikimedia Commons, http://commons.wikimedia.org/wiki/File:Digital-BCD-clock.jpg)]

What Are the Elements of An ISA?
 Memory organization
 Address space: How many uniquely identifiable locations in memory
 Addressability: How much data does each uniquely identifiable location store
 Byte addressable: most ISAs, characters are 8 bits
 Bit addressable: Burroughs 1700. Why?
 64-bit addressable: Some supercomputers. Why?
 32-bit addressable: First Alpha
 Food for thought
 How do you add 2 32-bit numbers with only byte addressability?
 How do you add 2 8-bit numbers with only 32-bit addressability?
 Big endian vs. little endian? MSB at low or high byte.
 Support for virtual memory

Some Historical Readings
 If you want to dig deeper
 Wilner, “Design of the Burroughs 1700,” AFIPS 1972.
 Levy, “The Intel iAPX 432,” 1981.
 http://www.cs.washington.edu/homes/levy/capabook/Chapter9.pdf

What Are the Elements of An ISA?
 Registers
 How many
 Size of each register
 Why is having registers a good idea?
 Because programs exhibit a characteristic called data locality
 A recently produced/accessed value is likely to be used more than once (temporal locality)
 Storing that value in a register eliminates the need to go to memory each time that value is needed
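A tiny illustration of the temporal-locality point above (hypothetical code, any ISA): the scale factor is read once and then reused every iteration, so a compiler would normally keep it, the loop index, and the pointer in registers rather than going back to memory each time.

/* *scale is needed in every iteration; loading it once into a register
 * (temporal locality) avoids an extra memory access per element.       */
void scale_vector(double *v, long n, const double *scale)
{
    double s = *scale;            /* one load; the value then lives in a register */
    for (long i = 0; i < n; i++)  /* i and v are also typical register candidates */
        v[i] *= s;
}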

Programmer Visible (Architectural) State Aside: Programmer Invisible State


 Microarchitectural state
 Programmer cannot access this directly
M[0]
M[1]
M[2]
 E.g. cache state
M[3] Registers
M[4] - given special names in the ISA  E.g. pipeline registers
(as opposed to addresses)
- general vs. special purpose

M[N-1]
Memory Program Counter
array of storage locations memory address
indexed by an address of the current instruction

Instructions (and programs) specify how to transform


the values of programmer visible state
135 136
10-11-2023

Evolution of Register Architecture Instruction Classes


 Accumulator  Operate instructions
 a legacy from the “adding” machine days  Process data: arithmetic and logical operations
 Fetch operands, compute result, store result
 Accumulator + address registers  Implicit sequential control flow
 need register indirection
 initially address registers were special-purpose, i.e., can only  Data movement instructions
be loaded with an address for indirection  Move data between memory, registers, I/O devices
 eventually arithmetic on addresses became supported  Implicit sequential control flow

 General purpose registers (GPR)  Control flow instructions


 all registers good for all purposes  Change the sequence of instructions that are executed
 grew from a few registers to 32 (common for RISC) to 128 in
Intel IA-64
137 138

What Are the Elements of An ISA? What Are the Elements of An ISA?
 Load/store vs. memory/memory architectures  Addressing modes specify how to obtain the operands
 Absolute LW rt, 10000
 Load/store architecture: operate instructions operate only on use immediate value as address
registers  Register Indirect: LW rt, (rbase)
 E.g., MIPS, ARM and many RISC ISAs use GPR[rbase] as address
 Displaced or based: LW rt, offset(rbase)
 Memory/memory architecture: operate instructions can use offset+GPR[rbase] as address
operate on memory locations
 Indexed: LW rt, (rbase, rindex)
 E.g., x86, VAX and many CISC ISAs
use GPR[rbase]+GPR[rindex] as address
 Memory Indirect LW rt ((rbase))
use value at M[ GPR[ rbase ] ] as address
 Auto inc/decrement LW Rt, (rbase)
use GRP[rbase] as address, but inc. or dec. GPR[rbase] each time
139 140
10-11-2023

What Are the Benefits of Different Addressing Modes?
 Another example of programmer vs. microarchitect tradeoff
 Advantage of more addressing modes:
 Enables better mapping of high-level constructs to the machine: some accesses are better expressed with a different mode
 Think array accesses (autoincrement mode)
 Think indirection (pointer chasing)
 Sparse matrix accesses
 Disadvantage:
 More work for the compiler
 More work for the microarchitect

ISA Orthogonality
 Orthogonal ISA:
 All addressing modes can be used with all instruction types
 Example: VAX
 (~13 addressing modes) x (>300 opcodes) x (integer and FP formats)
 reduced number of instructions and code size
 Who is this good for?
 Who is this bad for?
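Returning to the addressing-mode benefits above, two small C fragments that motivate them (how a particular compiler actually maps them is, of course, ISA- and compiler-dependent):

#include <stddef.h>

/* Array sweep: a[i] maps naturally to a based/indexed mode
 * (address = base of a + scaled i), or to autoincrement if the
 * ISA has it, since the address advances by a fixed stride.    */
long sum_array(const long *a, int n)
{
    long s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Pointer chasing: p->next maps to a displacement (based) mode
 * (address = p + offset of the next field); each load produces
 * the base address used by the next load.                      */
struct node { long value; struct node *next; };

long sum_list(const struct node *p)
{
    long s = 0;
    for (; p != NULL; p = p->next)
        s += p->value;
    return s;
}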

Is the LC-3b ISA Orthogonal?
[Figure: LC-3b instruction encodings]

LC-3b: Addressing Modes of ADD
[Figure: the encodings of ADD in LC-3b (register and immediate operand forms)]

LC-3b: Addressing Modes of JSR(R)
[Figure: the JSR (PC-relative) and JSRR (base register) encodings in LC-3b]

What Are the Elements of An ISA?


 How to interface with I/O devices
 Memory mapped I/O
 A region of memory is mapped to I/O devices
 I/O operations are loads and stores to those locations

 Special I/O instructions


 IN and OUT instructions in x86 deal with ports of the chip

 Tradeoffs?
 Which one is more general purpose?
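For the memory-mapped option above, a common C idiom looks like the sketch below (the device, its base address 0x4000A000, and its register layout are all made up for illustration): I/O is performed with ordinary loads and stores through a volatile pointer.

#include <stdint.h>

/* Hypothetical UART mapped into the address space; the base address and
 * register offsets are illustrative, not a real device's map.            */
#define UART_BASE 0x4000A000u

typedef volatile struct {
    uint32_t data;     /* write: transmit a byte           */
    uint32_t status;   /* read:  bit 0 = transmitter ready */
} uart_regs_t;

#define UART ((uart_regs_t *)UART_BASE)

void uart_putc(char c)
{
    while ((UART->status & 1u) == 0)  /* ordinary load from an I/O address   */
        ;                             /* spin until the transmitter is ready */
    UART->data = (uint32_t)c;         /* ordinary store performs the I/O     */
}

With special I/O instructions (e.g., x86 IN/OUT), the same accesses would instead need dedicated opcodes and a separate port address space.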


What Are the Elements of An ISA? Another Question or Two


 Privilege modes
 Does the LC-3b ISA contain complex instructions?
 User vs supervisor
 Who can execute what instructions?
 How complex can an instruction be?
 Exception and interrupt handling
 What procedure is followed when something goes wrong with an
instruction?
 What procedure is followed when an external device requests the
processor?
 Vectored vs. non-vectored interrupts (early MIPS)

 Virtual memory
 Each program has the illusion of the entire memory space, which is greater
than physical memory

 Access protection

 We will talk about these later 147 148


10-11-2023

Complex vs. Simple Instructions Complex vs. Simple Instructions


 Complex instruction: An instruction does a lot of work, e.g.  Advantages of Complex instructions
many operations + Denser encoding  smaller code size  better memory
 Insert in a doubly linked list utilization, saves off-chip bandwidth, better cache hit rate
 Compute FFT (better packing of instructions)
 String copy + Simpler compiler: no need to optimize small instructions as
much

 Simple instruction: An instruction does small amount of


 Disadvantages of Complex Instructions
work, it is a primitive using which complex operations can
be built - Larger chunks of work  compiler has less opportunity to
optimize (limited in fine-grained optimizations it can do)
 Add
- More complex hardware  translation from a high level to
 XOR control signals and optimization needs to be done by hardware
 Multiply

149 150

ISA-level Tradeoffs: Semantic Gap ISA-level Tradeoffs: Semantic Gap


 Where to place the ISA? Semantic gap  Some tradeoffs (for you to think about)
 Closer to high-level language (HLL)  Small semantic gap,
complex instructions  Simple compiler, complex hardware vs.
 Closer to hardware control signals?  Large semantic gap, complex compiler, simple hardware
simple instructions
 Caveat: Translation (indirection) can change the tradeoff!

 RISC vs. CISC machines


 Burden of backward compatibility
 RISC: Reduced instruction set computer
 CISC: Complex instruction set computer
 FFT, QUICKSORT, POLY, FP instructions?  Performance? Energy Consumption?
 VAX INDEX instruction (array access with bounds checking)  Optimization opportunity: Example of VAX INDEX instruction:
who (compiler vs. hardware) puts more effort into
optimization?
 Instruction size, code size
151 152
10-11-2023

X86: Small Semantic Gap: String Operations
 An instruction operates on a string

 Move one string of arbitrary length to another location


 Compare two strings

 Enabled by the ability to specify repeated execution of an


instruction (in the ISA)
 Using a “prefix” called REP prefix

 Example: REP MOVS instruction


 Only two bytes: REP prefix byte and MOVS opcode byte (F3 A4)
 Implicit source and destination registers pointing to the two
strings (ESI, EDI)
 Implicit count register (ECX) specifies how long the string is
X86: Small Semantic Gap: String Operations
[Figure: REP MOVS (DEST, SRC) moving a string from SRC to DEST]
 How many instructions does this take in MIPS?
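For comparison, a hedged C sketch of what the two-byte REP MOVSB sequence accomplishes; on a load/store ISA such as MIPS, the loop body below compiles to several instructions per byte (roughly a load, a store, pointer increments, a count decrement, and a branch) rather than one short instruction for the whole copy.

#include <stddef.h>

/* Copy count bytes from src to dst, advancing both pointers --
 * essentially what REP MOVSB does with ESI, EDI, and ECX.      */
void string_move(unsigned char *dst, const unsigned char *src, size_t count)
{
    while (count--)
        *dst++ = *src++;   /* one load + one store per byte */
}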

Small Semantic Gap Examples in VAX

- FIND FIRST
  - Find the first set bit in a bit field
  - Helps OS resource allocation operations
- SAVE CONTEXT, LOAD CONTEXT
  - Special context switching instructions
- INSQUEUE, REMQUEUE
  - Operations on a doubly linked list
- INDEX
  - Array access with bounds checking
- STRING operations
  - Compare strings, find substrings, ...
- Cyclic Redundancy Check instruction
- EDITPC
  - Implements editing functions to display fixed-format output

Digital Equipment Corp., "VAX11 780 Architecture Handbook," 1977-78.

Small versus Large Semantic Gap

- CISC vs. RISC
  - Complex instruction set computer -> complex instructions
    - Initially motivated by "not good enough" code generation
  - Reduced instruction set computer -> simple instructions
    - John Cocke, mid 1970s, IBM 801
    - Goal: enable better compiler control and optimization
- RISC motivated by
  - Memory stalls (no work done in a complex instruction when there is a memory stall?)
    - When is this correct?
  - Simplifying the hardware -> lower cost, higher frequency
  - Enabling the compiler to optimize the code better
    - Find fine-grained parallelism to reduce stalls

An Aside How High or Low Can You Go?


 An Historical Perspective on RISC Development at IBM  Very large semantic gap
 http://www-03.ibm.com/ibm/history/ibm100/us/en/icons/risc/  Each instruction specifies the complete set of control signals in
the machine
 Compiler generates control signals
 Open microcode (John Cocke, circa 1970s)
 Gave way to optimizing compilers

 Very small semantic gap


 ISA is (almost) the same as high-level language
 Java machines, LISP machines, object-oriented machines,
capability-based machines

157 158

A Note on ISA Evolution Effect of Translation


 ISAs have evolved to reflect/satisfy the concerns of the day  One can translate from one ISA to another ISA to change
the semantic gap tradeoffs
 Examples:  ISA (virtual ISA)  Implementation ISA
 Limited on-chip and off-chip memory size
 Limited compiler optimization technology  Examples
 Limited memory bandwidth  Intel’s and AMD’s x86 implementations translate x86
 Need for specialization in important applications (e.g., MMX) instructions into programmer-invisible microoperations (simple
instructions) in hardware
 Transmeta’s x86 implementations translated x86 instructions
 Use of translation (in HW and SW) enabled underlying into “secret” VLIW instructions in software (code morphing
implementations to be similar, regardless of the ISA software)
 Concept of dynamic/static interface: translation/interpretation
 Contrast it with hardware/software interface  Think about the tradeoffs

159 160

Hardware-Based Translation / Software-Based Translation

[Figures from Klaiber, "The Technology Behind Crusoe Processors," Transmeta White Paper 2000: hardware-based translation of x86 instructions into internal microoperations, and Transmeta's software-based (code morphing) translation into VLIW instructions]

Last Lecture Recap


Instruction processing style
18-447 

 0, 1, 2, 3 address machines
Computer Architecture  Elements of an ISA
Lecture 4: ISA Tradeoffs (Continued) and  Instructions, data types, memory organizations, registers, etc
 Addressing modes
MIPS ISA  Complex (CISC) vs. simple (RISC) instructions
 Semantic gap
 ISA translation
Prof. Onur Mutlu
Kevin Chang
Carnegie Mellon University
Spring 2015, 1/21/2015
163 164

ISA-level Tradeoffs: Instruction Length ISA-level Tradeoffs: Uniform Decode


 Fixed length: Length of all instructions the same  Uniform decode: Same bits in each instruction correspond
+ Easier to decode single instruction in hardware to the same meaning
+ Easier to decode multiple instructions concurrently  Opcode is always in the same location
-- Wasted bits in instructions (Why is this bad?)  Ditto operand specifiers, immediate values, …
-- Harder-to-extend ISA (how to add new instructions?)
 Many “RISC” ISAs: Alpha, MIPS, SPARC
 Variable length: Length of instructions different + Easier decode, simpler hardware
(determined by opcode and sub-opcode) + Enables parallelism: generate target address before knowing the
+ Compact encoding (Why is this good?) instruction is a branch
Intel 432: Huffman encoding (sort of). 6 to 321 bit instructions. How? -- Restricts instruction format (fewer instructions?) or wastes space
-- More logic to decode a single instruction
-- Harder to decode multiple instructions concurrently
 Non-uniform decode
 Tradeoffs  E.g., opcode can be the 1st-7th byte in x86
 Code size (memory space, bandwidth, latency) vs. hardware complexity + More compact and powerful instruction format
 ISA extensibility and expressiveness vs. hardware complexity -- More complex decode logic
 Performance? Energy? Smaller code vs. ease of decode
165 166

x86 vs. Alpha Instruction Formats MIPS Instruction Format


 x86:  R-type, 3 register operands
0 rs rt rd shamt funct R-type
6-bit 5-bit 5-bit 5-bit 5-bit 6-bit

 I-type, 2 register operands and 16-bit immediate operand


opcode rs rt immediate I-type
6-bit 5-bit 5-bit 16-bit

 J-type, 26-bit immediate operand


 Alpha: opcode immediate J-type
6-bit 26-bit

 Simple Decoding
 4 bytes per instruction, regardless of format
 must be 4-byte aligned (2 lsb of PC must be 2b’00)
 format and fields easy to extract in hardware
167 168

ARM A Note on Length and Uniformity


 Uniform decode usually goes with fixed length

 In a variable length ISA, uniform decode can be a property


of instructions of the same length
 It is hard to think of it as a property of instructions of different
lengths

169 170

A Note on RISC vs. CISC ISA-level Tradeoffs: Number of Registers


 Usually, …  Affects:
 Number of bits used for encoding register address
 RISC  Number of values kept in fast storage (register file)
 Simple instructions  (uarch) Size, access time, power consumption of register file
 Fixed length
 Uniform decode  Large number of registers:
 Few addressing modes + Enables better register allocation (and optimizations) by
compiler  fewer saves/restores
 CISC -- Larger instruction size
 Complex instructions -- Larger register file size
 Variable length
 Non-uniform decode
 Many addressing modes
171 172

ISA-level Tradeoffs: Addressing Modes

- An addressing mode specifies how to obtain an operand of an instruction
  - Register
  - Immediate
  - Memory (displacement, register indirect, indexed, absolute, memory indirect, autoincrement, autodecrement, ...)
- More modes:
  + help better support programming constructs (arrays, pointer-based accesses) -- see the sketch below
  -- make it harder for the architect to design
  -- too many choices for the compiler?
    - Many ways to do the same thing complicates compiler design
    - Wulf, "Compilers and Computer Architecture," IEEE Computer 1981

x86 vs. Alpha Instruction Formats
[Figure: x86 (variable-length) and Alpha (fixed-length) instruction format diagrams]
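MIPS, in contrast, offers essentially only register + displacement addressing, so an indexed access such as A[i] (with 4-byte elements) has to be synthesized from simple instructions. A minimal sketch, assuming &A is in rA, i is in ri, and rtemp/rdest are free registers (illustrative names):

    sll rtemp ri 2          # rtemp = i x 4, the byte offset of element i
    add rtemp rA rtemp      # rtemp = &A + i x 4 = &A[i]
    lw  rdest 0(rtemp)      # rdest = A[i]

An ISA with an indexed or scaled mode (such as the x86 base + index*4 + displacement form on the following slides) folds the shift and the add into the memory operand itself, which is the programming-construct support referred to above.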

x86

[Figure: x86 operand addressing examples -- register, register indirect, register + displacement, memory absolute, indexed (base + index), and scaled SIB + displacement (base + index*4 + displacement) forms]

X86 SIB-D Addressing Mode

[Figure: the x86 scale-index-base plus displacement (SIB-D) operand form]

X86 Manual: Suggested Uses of Addressing Modes

[Figure: tables from the x86 manual pairing addressing-mode components with their suggested uses -- static addresses, dynamically allocated storage, arrays, records, static arrays with fixed-size elements, and 2D arrays]
x86 Manual Vol. 1, page 3-22 -- see course resources on website
Also, see Section 3.7.3 and 3.7.5

Other Example ISA-level Tradeoffs

- Condition codes vs. not
- VLIW vs. single instruction
- Precise vs. imprecise exceptions
- Virtual memory vs. not
- Unaligned access vs. not
- Hardware interlocks vs. software-guaranteed interlocking
- Software vs. hardware managed page fault handling
- Cache coherence (hardware vs. software)
- ...

Back to Programmer vs. (Micro)architect

- Many ISA features designed to aid programmers
- But, they complicate the hardware designer's job
- Virtual memory
  - vs. overlay programming
  - Should the programmer be concerned about the size of code blocks fitting physical memory?
- Addressing modes
- Unaligned memory access
  - Compiler/programmer needs to align data

MIPS: Aligned Access

    MSB  byte-3 byte-2 byte-1 byte-0  LSB
         byte-7 byte-6 byte-5 byte-4

- LW/SW alignment restriction: 4-byte word alignment
  - not designed to fetch memory bytes not within a word boundary
  - not designed to rotate unaligned bytes into registers
- Provide separate opcodes for the "infrequent" case (see the sketch below)
  - Register initially holds A B C D
  - LWL rd 6(r0) -> register becomes byte-6 byte-5 byte-4 D
  - LWR rd 3(r0) -> register becomes byte-6 byte-5 byte-4 byte-3
  - LWL/LWR is slower
  - Note LWL and LWR still fetch within a word boundary
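Putting the pair together: loading the unaligned word that spans bytes 3 through 6 takes both instructions, as sketched below. This follows the slide's own example; the exact byte steering depends on the endianness convention, and rd stands for any free register.

    lwl rd 6(r0)       # brings in byte-6 byte-5 byte-4, leaving the low byte untouched
    lwr rd 3(r0)       # fills in byte-3, completing byte-6 byte-5 byte-4 byte-3

Two dependent memory instructions (each still confined to a word boundary) replace the single unaligned LW that the ISA deliberately does not provide -- which is why LWL/LWR is slower.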

X86: Unaligned Access X86: Unaligned Access


 LD/ST instructions automatically align data that spans a
“word” boundary
 Programmer/compiler does not need to worry about where
data is stored (whether or not in a word-aligned location)

183 184

Aligned vs. Unaligned Access


 Pros of having no restrictions on alignment
18-447 MIPS ISA

James C. Hoe
 Cons of having no restrictions on alignment Dept of ECE, CMU

 Filling in the above: an exercise for you…

185

MIPS R2000 Program Visible State Data Format


 Most things are 32 bits
Program Counter
32-bit memory address **Note** r0=0  instruction and data addresses
r1
of the current instruction r2  signed and unsigned integers
 just bits
General Purpose
M[0] Register File  Also 16-bit word and 8-bit word (aka byte)
M[1] 32 32-bit words
named r0...r31  Floating-point numbers
M[2]
M[3]  IEEE standard 754
M[4]  float: 8-bit exponent, 23-bit significand
Memory  double: 11-bit exponent, 52-bit significand
232 by 8-bit locations (4 Giga Bytes)
32-bit address
(there is some magic going on)
M[N-1]

Big Endian vs. Little Endian
(Part I, Chapter 4, Gulliver's Travels)

- A 32-bit signed or unsigned integer comprises 4 bytes
    MSB (most significant)  8-bit 8-bit 8-bit 8-bit  (least significant) LSB
- On a byte-addressable machine . . .

    Big Endian                           Little Endian
    MSB                      LSB         MSB                      LSB
    byte 0  byte 1  byte 2  byte 3       byte 3  byte 2  byte 1  byte 0
    byte 4  byte 5  byte 6  byte 7       byte 7  byte 6  byte 5  byte 4
    byte 8  byte 9  byte 10 byte 11      byte 11 byte 10 byte 9  byte 8
    byte 12 byte 13 byte 14 byte 15      byte 15 byte 14 byte 13 byte 12
    byte 16 byte 17 byte 18 byte 19      byte 19 byte 18 byte 17 byte 16
    pointer points to the big end        pointer points to the little end

- What difference does it make? Check out htonl(), ntohl() in in.h (see the sketch below)

Instruction Formats

- 3 simple formats
  - R-type, 3 register operands:   0 | rs | rt | rd | shamt | funct   R-type   (6-bit, 5-bit, 5-bit, 5-bit, 5-bit, 6-bit)
  - I-type, 2 register operands and 16-bit immediate operand:   opcode | rs | rt | immediate   I-type   (6-bit, 5-bit, 5-bit, 16-bit)
  - J-type, 26-bit immediate operand:   opcode | immediate   J-type   (6-bit, 26-bit)
- Simple decoding
  - 4 bytes per instruction, regardless of format
  - must be 4-byte aligned (2 lsb of PC must be 2b'00)
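A small way to observe the difference on the machine itself: store a known word, then read back the byte at its lowest address. A minimal sketch in MIPS assembly -- the register names are illustrative, rbuf is assumed to hold a valid word-aligned address, and LB/ORI are the standard MIPS load-byte and OR-immediate instructions (not listed on these slides):

    lui rtemp 0x1122         # rtemp = 0x11220000
    ori rtemp rtemp 0x3344   # rtemp = 0x11223344
    sw  rtemp 0(rbuf)        # store the 32-bit word
    lb  rbyte 0(rbuf)        # read back the byte at the lowest address
    # big endian:    rbyte = 0x11  (the "big end" is stored first)
    # little endian: rbyte = 0x44  (the "little end" is stored first)

This is exactly the situation htonl()/ntohl() exist for: data exchanged between machines (or with the network byte order) may disagree on which end comes first.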

ALU Instructions

- Assembly (e.g., register-register signed addition)
    ADD rdreg rsreg rtreg
- Semantics
  - GPR[rd] <- GPR[rs] + GPR[rt]
  - PC <- PC + 4
  - Exception on "overflow"
- Variations
  - Arithmetic: {signed, unsigned} x {ADD, SUB}
  - Logical: {AND, OR, XOR, NOR}
  - Shift: {Left, Right-Logical, Right-Arithmetic}

Reg-Reg Instruction Encoding

- Machine encoding:   0 | rs | rt | rd | 0 | ADD   R-type   (6-bit, 5-bit, 5-bit, 5-bit, 5-bit, 6-bit)

[Figure: R-type encodings from the MIPS R4000 Microprocessor User's Manual]

What patterns do you see? Why are they there? (A worked encoding follows below.)
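As a hand-worked instance of the uniform R-type layout, consider encoding an add of r1 and r2 into r3. The field placement comes straight from the format above; the specific funct value 100000 (0x20) for ADD is the standard MIPS value, quoted here from memory rather than from the slide:

    add r3 r1 r2             # GPR[r3] <- GPR[r1] + GPR[r2]
    # field:  opcode  rs     rt     rd     shamt  funct
    # value:  000000  00001  00010  00011  00000  100000
    # width:  6       5      5      5      5      6       -> word = 0x00221820

One pattern to notice: every R-type ALU instruction shares opcode 0 and differs only in funct, which is what lets the decoder treat the whole group uniformly.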

ALU Instructions Reg-Immed Instruction Encoding

 Assembly (e.g., regi-immediate signed additions)


ADDI rtreg rsreg immediate16
 Machine encoding
ADDI rs rt immediate I-type
6-bit 5-bit 5-bit 16-bit

 Semantics
 GPR[rt]  GPR[rs] + sign-extend (immediate)
 PC  PC + 4
[MIPS R4000 Microprocessor User’s Manual]
 Exception on “overflow”
 Variations
 Arithmetic: {signed, unsigned} x {ADD, SUB}
 Logical: {AND, OR, XOR, LUI}
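A practical consequence of the 16-bit immediate field: a full 32-bit constant cannot fit in one I-type instruction and is conventionally built in two steps with LUI plus an OR-immediate. A minimal sketch (the constant and register name are illustrative; ORI is the standard MIPS OR-immediate instruction, not listed above):

    lui r8 0x1234        # r8 = 0x12340000, upper 16 bits
    ori r8 r8 0x5678     # r8 = 0x12345678, fill in the lower 16 bits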

Assembly Programming 101 Load Instructions


 Break down high-level program constructs into a sequence  Assembly (e.g., load 4-byte word)
of elemental operations LW rtreg offset16 (basereg)
 Machine encoding
 E.g. High-level Code
f = ( g + h ) – ( i + j ) LW base rt offset I-type
 Semantics
6-bit 5-bit 5-bit 16-bit

 Assembly Code  effective_address = sign-extend(offset) + GPR[base]


 suppose f, g, h, i, j are in rf, rg, rh, ri, rj  GPR[rt]  MEM[ translate(effective_address) ]
 suppose rtemp is a free register  PC  PC + 4
add rtemp rg rh # rtemp = g+h  Exceptions
add rf ri rj # rf = i+j  address must be “word-aligned”
sub rf rtemp rf # f = rtemp – rf What if you want to load an unaligned word?
 MMU exceptions

Store Instructions Assembly Programming 201


 Assembly (e.g., store 4-byte word)  E.g. High-level Code
SW rtreg offset16 (basereg) A[ 8 ] = h + A[ 0 ]
 Machine encoding
where A is an array of integers (4–byte each)
SW base rt offset I-type  Assembly Code
 Semantics
6-bit 5-bit 5-bit 16-bit  suppose &A, h are in rA, rh
 effective_address = sign-extend(offset) + GPR[base]  suppose rtemp is a free register
 MEM[ translate(effective_address) ]  GPR[rt]
 PC  PC + 4 LW rtemp 0(rA) # rtemp = A[0]
 Exceptions add rtemp rh rtemp # rtemp = h + A[0]
 address must be “word-aligned” SW rtemp 32(rA) # A[8] = rtemp
 MMU exceptions # note A[8] is 32 bytes
# from A[0]

Load Delay Slots

- Assembly code
    LW   ra  ---
    addi r-  ra  r-      # in the load delay slot: still sees the OLD value of ra
    addi r-  ra  r-      # sees the newly loaded value of ra
- R2000 load has an architectural latency of 1 instruction*
  - the instruction immediately following a load (in the "delay slot") still sees the old register value
  - the load instruction no longer has an atomic semantics
- Why would you do it this way?
- Is this a good idea? (hint: R4000 redefined LW to complete atomically)
- (A concrete hazard example follows below.)

*BTW, notice that latency is defined in "instructions", not cycles or seconds.

Control Flow Instructions

- C code:
    { code A }
    if X==Y then
        { code B }
    else
        { code C }
    { code D }
- Control flow graph: code A -> if X==Y -> code B (true) or code C (false) -> code D
- Linearized (as laid out in memory): code A; a conditional branch on X==Y; one of code B / code C with a goto around the other; code D
- These units of straight-line code are called basic blocks
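To make the load delay slot concrete, suppose r1 holds 100 and the loaded memory word is 7 (illustrative values and register names):

    lw  r1 0(rA)         # load the new value of r1
    add r2 r1 r0         # delay slot: on the R2000 this still sees the OLD r1, so r2 = 100
    add r3 r1 r0         # one instruction later: r3 = 7, the freshly loaded value

If the old value is not wanted, the compiler or assembler must fill the slot with an independent instruction (or a nop) -- which is exactly the burden the questions above point at.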

(Conditional) Branch Instructions

- Assembly (e.g., branch if equal)
    BEQ rsreg rtreg immediate16
- Machine encoding:   BEQ | rs | rt | immediate   I-type   (6-bit, 5-bit, 5-bit, 16-bit)
- Semantics
  - target = PC + sign-extend(immediate) x 4     (PC + 4 with a branch delay slot)
  - if GPR[rs]==GPR[rt] then PC <- target
    else PC <- PC + 4
- How far can you jump?
- Variations: BEQ, BNE, BLEZ, BGTZ
- Why isn't there a BLE or BGT instruction? (See the sketch below.)

Jump Instructions

- Assembly
    J immediate26
- Machine encoding:   J | immediate   J-type   (6-bit, 26-bit)
- Semantics
  - target = PC[31:28] x 2^28  bitwise-OR  zero-extend(immediate) x 4     (PC + 4 with a branch delay slot)
  - PC <- target
- How far can you jump?
- Variations: Jump and Link, Jump Register
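One way to see why BLE/BGT can be omitted: a two-register magnitude comparison can be synthesized from the MIPS set-less-than instruction (SLT, not listed on these slides) followed by a branch that only tests against zero. A minimal sketch with an illustrative temporary register:

    slt rtemp rb ra      # rtemp = 1 if rb < ra; rtemp = 0 exactly when ra <= rb
    beq rtemp r0 L       # branch to L when ra <= rb  (the effect of a "BLE ra rb L")

Keeping the branch condition to simple equality/zero tests keeps a full magnitude compare off the branch's timing path; whether that was the designers' actual reasoning is the kind of tradeoff this slide asks you to think about.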

Assembly Programming 301

- E.g. high-level code
    if (i == j) then
        e = g
    else
        e = h
    f = e
  (control-flow graph: fork into a then-part and an else-part, which join before f = e)
- Assembly code
  - suppose e, f, g, h, i, j are in re, rf, rg, rh, ri, rj
        bne ri rj L1        # L1 and L2 are addr labels;
        add re rg r0        # e = g   (the assembler computes the branch offsets)
        j   L2
    L1: add re rh r0        # e = h
    L2: add rf re r0        # f = e

Branch Delay Slots

- R2000 branch instructions also have an architectural latency of 1 instruction
  - the instruction immediately after a branch is always executed (in fact the PC-offset is computed from the delay slot instruction)
  - the branch target takes effect on the 2nd instruction
- The code above therefore needs its delay slots filled (here, with nops):

        bne ri rj L1                     bne ri rj L1
        add re rg r0                     nop
        j   L2                           add re rg r0
    L1: add re rh r0                     j   L2
    L2: add rf re r0                     nop
        . . .                        L1: add re rh r0
                                     L2: add rf re r0
                                         . . .

  (A delay slot can often be filled with useful work instead of a nop; see the sketch below.)
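A delay slot does not have to hold a nop: an instruction that is safe to execute on both paths can be hoisted into it. In the example above, the then-path assignment can sit in the branch's delay slot, since the else path overwrites re anyway. A sketch of this common scheduling trick (not taken from the slides):

        bne ri rj L1
        add re rg r0         # delay slot: e = g, executed whether or not the branch is taken
        j   L2
        nop                  # j's delay slot: nothing independent left to move here
    L1: add re rh r0         # else path: e = h overwrites the speculative e = g
    L2: add rf re r0         # f = e

This removes one of the two nops; whether a compiler can always find such a filler depends on the surrounding code.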

Strangeness in the Semantics Function Call and Return


 Jump and Link: JAL offset26
 return address = PC + 8
Where do you think you will end up?
 target = PC[31:28]x2
28 |
bitwise-or zero-
extend(immediate)x4
_s: j L1
 PC  target
j L2
j L3  GPR[r31]  return address

On a function call, the callee needs to know where to go


L1: j L4 back to afterwards
L2: j L5
 Jump Indirect: JR rsreg
L3: foo  target = GPR [rs]
L4: bar  PC  target
L5: baz
PC-offset jumps and branches always jump to the same
target every time the same instruction is executed
Jump Indirect allows the same instruction to jump to any
location specified by rs (usually r31)

Assembly Programming 401 Caller and Callee Saved Registers

Callee  Callee-Saved Registers


Caller
... code A ... _myfxn: ... code B ...  Caller says to callee, “The values of these registers
JAL _myfxn JR r31 should not change when you return to me.”
... code C ...  Callee says, “If I need to use these registers, I promise
JAL _myfxn
to save the old values to memory first and restore them
before I return to you.”
... code D ...

 ..... A call B return C call B return D .....  Caller-Saved Registers


 How do you pass argument between caller and callee?  Caller says to callee, “If there is anything I care about
in these registers, I already saved it myself.”
 If A set r10 to 1, what is the value of r10 when B returns
to C?  Callee says to caller, “Don’t count on them staying the
same values after I am done.
 What registers can B use?
 What happens to r31 if B calls another function

R2000 Register Usage Convention R2000 Memory Usage Convention


high address
 r0: always 0
stack space
 r1: reserved for the assembler
 r2, r3: function return values grow down

 r4~r7: function call arguments free space stack pointer


 r8~r15: “caller-saved” temporaries GPR[r29]
grow up
 r16~r23 “callee-saved” temporaries
 r24~r25 “caller-saved” temporaries dynamic data
 r26, r27: reserved for the operating system
static data
 r28: global pointer binary executable
 r29: stack pointer text
 r30: callee-saved temporaries
reserved
 r31: return address low address

Calling Convention

    .......
    1. caller saves caller-saved registers
    2. caller loads arguments into r4~r7
    3. caller jumps to callee using JAL
  prologue:
    4. callee allocates space on the stack (dec. stack pointer)
    5. callee saves callee-saved registers to the stack (also r4~r7, old r29, r31)
    ....... body of callee (can "nest" additional calls) .......
  epilogue:
    6. callee loads results into r2, r3
    7. callee restores saved register values
    8. JR r31
    9. caller continues with return values in r2, r3
    ........

(A sketch of a matching prologue/epilogue in MIPS assembly follows below.)

To Summarize: MIPS RISC

- Simple operations
  - 2-input, 1-output arithmetic and logical operations
  - few alternatives for accomplishing the same thing
- Simple data movements
  - ALU ops are register-to-register (need a large register file)
  - "load-store" architecture
- Simple branches
  - limited varieties of branch conditions and targets
- Simple instruction encoding
  - all instructions encoded in the same number of bits
  - only a few formats
- Loosely speaking, an ISA intended for compilers rather than assembly programmers
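A minimal sketch of what steps 3-8 look like in R2000 assembly, for a callee that uses one callee-saved register (r16) and may itself make calls. The 8-byte frame and the offsets are illustrative assumptions; real conventions add argument-save and alignment rules, and branch/jump delay slots are ignored for clarity:

        jal  _myfxn          # step 3: call; the return address is placed in r31
        ...                  # step 9: execution resumes here, results in r2, r3

    _myfxn:
        addi r29 r29 -8      # step 4: allocate an 8-byte stack frame (stack grows down)
        sw   r31 4(r29)      # step 5: save the return address (needed if we JAL again)
        sw   r16 0(r29)      #         save a callee-saved register we intend to use
        ...                  # body: free to use r16 and to nest further calls
        add  r2  r16 r0      # step 6: place the return value in r2
        lw   r16 0(r29)      # step 7: restore the callee-saved register
        lw   r31 4(r29)      #         restore the return address
        addi r29 r29 8       #         free the stack frame
        jr   r31             # step 8: return to the caller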

Agenda for Today & Next Few Lectures


Start Microarchitecture
18-447 

Computer Architecture  Single-cycle Microarchitectures


Lecture 5: Intro to Microarchitecture:
Multi-cycle Microarchitectures
Single-Cycle 

 Microprogrammed Microarchitectures

Prof. Onur Mutlu  Pipelining


Carnegie Mellon University
Spring 2015, 1/26/2015  Issues in Pipelining: Control & Data Dependence Handling,
State Maintenance and Recovery, …
214

Recap of Two Weeks and Last Lecture Assignment for You


 Computer Architecture Today and Basics (Lectures 1 & 2)  Not to be turned in
 Fundamental Concepts (Lecture 3)
 ISA basics and tradeoffs (Lectures 3 & 4)  As you learn the MIPS ISA, think about what tradeoffs the
designers have made
 Last Lecture: ISA tradeoffs continued + MIPS ISA  in terms of the ISA properties we talked about
 Instruction length  And, think about the pros and cons of design choices
 Uniform vs. non-uniform decode  In comparison to ARM, Alpha
 Number of registers  In comparison to x86, VAX
 Addressing modes
 Aligned vs. unaligned access  And, think about the potential mistakes
RISC vs. CISC properties  Branch delay slot?

 MIPS ISA Overview  Load delay slot? Look Backward


 No FP, no multiply, MIPS (initial)
215 216

Food for Thought for You Review: Other Example ISA-level Tradeoffs
 How would you design a new ISA?  Condition codes vs. not
 VLIW vs. single instruction
 Where would you place it?  SIMD (single instruction multiple data) vs. SISD
 What design choices would you make in terms of ISA  Precise vs. imprecise exceptions
properties?  Virtual memory vs. not
 Unaligned access vs. not
 What would be the first question you ask in this process?  Hardware interlocks vs. software-guaranteed interlocking
 “What is my design point?”  Software vs. hardware managed page fault handling
 Cache coherence (hardware vs. software)
 …

Look Forward & Up Think Programmer vs. (Micro)architect


217 218

Review: A Note on RISC vs. CISC Now That We Have an ISA


 Usually, …  How do we implement it?

 RISC  i.e., how do we design a system that obeys the


 Simple instructions hardware/software interface?
 Fixed length
 Uniform decode  Aside: “System” can be solely hardware or a combination of
 Few addressing modes hardware and software
 Remember “Translation of ISAs”
 CISC  A virtual ISA can be converted by “software” into an
 Complex instructions implementation ISA
 Variable length
 Non-uniform decode  We will assume “hardware” for most lectures
 Many addressing modes
219 220

How Does a Machine Process Instructions?


 What does processing an instruction mean?
 Remember the von Neumann model
Implementing the ISA:
Microarchitecture Basics AS = Architectural (programmer visible) state before an
instruction is processed

Process instruction

AS’ = Architectural (programmer visible) state after an


instruction is processed

 Processing an instruction: Transforming AS to AS’ according


to the ISA specification of the instruction
221 222

The “Process instruction” Step A Very Basic Instruction Processing Engine


 ISA specifies abstractly what AS’ should be, given an
 Each instruction takes a single clock cycle to execute
instruction and AS
 Only combinational logic is used to implement instruction
 It defines an abstract finite state machine where
execution
 State = programmer-visible state
 No intermediate, programmer-invisible state updates
 Next-state logic = instruction execution specification
 From ISA point of view, there are no “intermediate states”
between AS and AS’ during instruction execution AS = Architectural (programmer visible) state
 One state transition per instruction at the beginning of a clock cycle
 Microarchitecture implements how AS is transformed to AS’
 There are many choices in implementation Process instruction in one clock cycle
 We can have programmer-invisible state to optimize the speed of
instruction execution: multiple state transitions per instruction AS’ = Architectural (programmer visible) state
 Choice 1: AS  AS’ (transform AS to AS’ in a single clock cycle)
at the end of a clock cycle
 Choice 2: AS  AS+MS1  AS+MS2  AS+MS3  AS’ (take multiple
clock cycles to transform AS to AS’)
223 224

A Very Basic Instruction Processing Engine Remember: Programmer Visible (Architectural) State
 Single-cycle machine

M[0]
M[1]
M[2]
AS’ Sequential AS M[3] Registers
Combinational M[4] - given special names in the ISA
Logic Logic
(as opposed to addresses)
(State) - general vs. special purpose

M[N-1]
Memory Program Counter
array of storage locations memory address
 What is the clock cycle time determined by? of the current instruction
indexed by an address
 What is the critical path of the combinational logic
determined by? Instructions (and programs) specify how to transform
the values of programmer visible state
225 226

Single-cycle vs. Multi-cycle Machines Instruction Processing “Cycle”


 Single-cycle machines  Instructions are processed under the direction of a “control
 Each instruction takes a single clock cycle unit” step by step.
All state updates made at the end of an instruction’s execution

 Instruction cycle: Sequence of steps to process an instruction
 Big disadvantage: The slowest instruction determines cycle time 
long clock cycle time  Fundamentally, there are six phases:

 Multi-cycle machines  Fetch


 Instruction processing broken into multiple cycles/stages  Decode
 State updates can be made during an instruction’s execution  Evaluate Address
 Architectural state updates made only at the end of an instruction’s
execution  Fetch Operands
 Advantage over single-cycle: The slowest “stage” determines cycle time  Execute
 Store Result
 Both single-cycle and multi-cycle machines literally follow the
von Neumann model at the microarchitecture level
 Not all instructions require all six stages (see P&P Ch. 4)
227 228

Instruction Processing “Cycle” vs. Machine Clock Cycle Instruction Processing Viewed Another Way
 Instructions transform Data (AS) to Data’ (AS’)
 Single-cycle machine:
 All six phases of the instruction processing cycle take a single  This transformation is done by functional units
Units that “operate” on data
machine clock cycle to complete 

 These units need to be told what to do to the data


 Multi-cycle machine:  An instruction processing engine consists of two components
 All six phases of the instruction processing cycle can take  Datapath: Consists of hardware elements that deal with and
multiple machine clock cycles to complete transform data signals
 In fact, each phase can take multiple clock cycles to complete  functional units that operate on data
 hardware structures (e.g. wires and muxes) that enable the flow of
data into the functional units and registers
 storage units that store data (e.g., registers)
 Control logic: Consists of hardware elements that determine
control signals, i.e., signals that specify what the datapath
elements should do to the data
229 230

Single-cycle vs. Multi-cycle: Control & Data Many Ways of Datapath and Control Design
 Single-cycle machine:  There are many ways of designing the data path and
 Control signals are generated in the same clock cycle as the control logic
one during which data signals are operated on
 Everything related to an instruction happens in one clock cycle  Single-cycle, multi-cycle, pipelined datapath and control
(serialized processing)
 Single-bus vs. multi-bus datapaths
 See your homework 2 question
 Multi-cycle machine:
 Hardwired/combinational vs. microcoded/microprogrammed
 Control signals needed in the next cycle can be generated in
control
the current cycle
 Control signals generated by combinational logic versus
 Latency of control processing can be overlapped with latency
of datapath operation (more parallelism)  Control signals stored in a memory structure

 We will see the difference clearly in microprogrammed  Control signals and structure depend on the datapath
multi-cycle microarchitectures design
231 232

Flash-Forward: Performance Analysis


 Execution time of an instruction
 {CPI} x {clock cycle time}
 Execution time of a program A Single-Cycle Microarchitecture


Sum over all instructions [{CPI} x {clock cycle time}]
{# of instructions} x {Average CPI} x {clock cycle time}
A Closer Look
 Single cycle microarchitecture performance
 CPI = 1
 Clock cycle time = long
 Multi-cycle microarchitecture performance
 CPI = different for each instruction Now, we have
 Average CPI  hopefully small two degrees of freedom
to optimize independently
 Clock cycle time = short
233 234

Remember… Let’s Start with the State Elements


 Single-cycle machine  Data and control inputs 5 Read
register 1
Read
5 data 1
Read
register 2
Registers
PC 5 Write
register
AS’
Read

Sequential AS Write data 2

Combinational data

Logic Logic RegWrite

(State)
MemWrite

Instruction
address
Address Read
data
Instruction
Write Data
Instruction
data memory
memory

MemRead

235 236
**Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]

For Now, We Will Assume Instruction Processing


 “Magic” memory and register file  5 generic steps (P&H book)
 Instruction fetch (IF)
 Combinational read
 Instruction decode and register operand fetch (ID/RF)
 output of the read data port is a combinational function of the
 Execute/Evaluate memory address (EX/AG)
register file contents and the corresponding read select port
 Memory operand fetch (MEM)
 Synchronous write  Store/writeback result (WB)
the selected register is updated on the positive edge clock

WB
transition when write enable is asserted
 Cannot affect read output in between clock edges
IF Data

Register #
PC Address Instruction Registers ALU Address
Register #
 Single-cycle, synchronous memory
Instruction
memory ID/RF Data
Register # EX/AG memory
 Contrast this with memory that tells when the data is ready
 i.e., Ready bit: indicating the read or write is done Data
MEM
237 **Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.] 238

What Is To Come: The Full MIPS Datapath

[Figure: the complete single-cycle MIPS datapath -- PC, instruction memory, register file, sign extender, ALU, data memory, and the PCSrc1=Jump / PCSrc2=Br Taken next-PC muxes -- with the control unit driving RegDst, Jump, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, and RegWrite]
**Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
JAL, JR, JALR omitted

Single-Cycle Datapath for Arithmetic and Logical Instructions

R-Type ALU Instructions ALU Datapath


 Assembly (e.g., register-register signed addition)
ADD rdreg rsreg rtreg
Add

4
 Machine encoding 25:21 3
ALU operation
Read
Read register 1
PC address Read
20:16 Read data 1
register 2 Zero

0 rs rt rd 0 ADD R-type Instruction


Instruction
15:11 Write
Registers

register
ALU ALU
result
6-bit 5-bit 5-bit 5-bit 5-bit 6-bit memory
Read
Write data 2
data

RegWrite

 Semantics 1

if MEM[PC] == ADD rd rs rt
IF ID EX MEM WB
GPR[rd]  GPR[rs] + GPR[rt] if MEM[PC] == ADD rd rs rt
GPR[rd]  GPR[rs] + GPR[rt]
Combinational
PC  PC + 4
PCfrom
**Based on original figure [P&HPC + 4 2004 Elsevier. ALL RIGHTS RESERVED.]
CO&D, COPYRIGHT
state update logic
241 242
**Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]

I-Type ALU Instructions Datapath for R and I-Type ALU Insts.


 Assembly (e.g., register-immediate signed additions)
ADDI rtreg rsreg immediate16 Add

4
ALU operatio
 Machine encoding PC
Read
address
25:21
Read
register 1
Read
3

data 1
Read
20:16 Zero
register 2
Instruction Registers ALU ALU
ADDI rs rt immediate I-type Instruction
15:11
Write
register
Read
result

6-bit 5-bit 5-bit 16-bit memory


Write data 2
RegDest data

isItype RegWrite
ALUSrc
 Semantics 116
Sign
32
isItype
if MEM[PC] == ADDI rt rs immediate extend

GPR[rt]  GPR[rs] + sign-extend (immediate)


PC  PC + 4 IF ID EX MEM WB
if MEM[PC] == ADDI rt rs immediate
GPR[rt]  GPR[rs] + sign-extend (immediate)
Combinational
243 PC  PC + 4 state update logic244
**Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]

Load Instructions
 Assembly (e.g., load 4-byte word)
LW rtreg offset16 (basereg)
Single-Cycle Datapath for
Data Movement Instructions  Machine encoding
LW base rt offset I-type
6-bit 5-bit 5-bit 16-bit

 Semantics
if MEM[PC]==LW rt offset16 (base)
EA = sign-extend(offset) + GPR[base]
GPR[rt]  MEM[ translate(EA) ]
PC  PC + 4

245 246

LW Datapath Store Instructions


 Assembly (e.g., store 4-byte word)
SW rtreg offset16 (basereg)
Add
0
add
Machine encoding
4

Read 3 ALU operation MemWrite 


Read register 1
PC address Read
data 1
Read
register 2 Zero Address Read

Instruction
Instruction
Write
Registers
register
ALU ALU
result
data
SW base rt offset I-type
memory
Write
Read
data 2 Write Data
memory
6-bit 5-bit 5-bit 16-bit
data
data
RegDest RegWrite
isItype ALUSrc
116 Sign
32
isItype
MemRead
 Semantics
extend
1
if MEM[PC]==SW rt offset16 (base)
EA = sign-extend(offset) + GPR[base]
MEM[ translate(EA) ]  GPR[rt]
if MEM[PC]==LW rt offset16 (base) IF ID EX MEM WB PC  PC + 4
EA = sign-extend(offset) + GPR[base]
GPR[rt]  MEM[ translate(EA) ]
Combinational
PC  PC + 4 state update logic247 248

SW Datapath / Load-Store Datapath

[Figure: the single-cycle datapath executing SW, and the combined load-store datapath with the MemWrite/MemRead, ALUSrc (isItype), RegDest, RegWrite (!isStore), and MemtoReg (isLoad) control signals]

    if MEM[PC]==SW rt offset16 (base)
        EA = sign-extend(offset) + GPR[base]
        MEM[ translate(EA) ] <- GPR[rt]
        PC <- PC + 4
    (IF | ID | EX | MEM | WB; combinational state update logic)

**Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]

Datapath for Non-Control-Flow Insts.

[Figure: the combined single-cycle datapath for ALU, load, and store instructions, with MemtoReg steered by isLoad]
**Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]

Single-Cycle Datapath for Control Flow Instructions

Unconditional Jump Instructions Unconditional Jump Datapath


 Assembly
J immediate26

isJ Add
 Machine encoding PCSrc
4
X ALU operation
Read 3 0
J immediate J-type PC Read
address
register 1

Read
Read
data 1
MemWrite

6-bit 26-bit register 2 Zero


Instruction Registers ALU ALU
Write Read
result Address
Instruction register data
Read
memory data 2
Semantics
concat Write Data
 data
? RegWrite Write
memory

if MEM[PC]==J immediate26
data
ALUSrc
0 16 32
Sign X MemRead
target = { PC[31:28], immediate26, 2’b00 } extend
0
PC  target
**Based on original figure from [P&H CO&D, COPYRIGHT
2004 Elsevier. ALL RIGHTS RESERVED.]

if MEM[PC]==J immediate26
253 PC = { PC[31:28], immediate26, 2’b00 } What about JR, JAL, JALR?
254

Aside: MIPS Cheat Sheet Conditional Branch Instructions


 http://www.ece.cmu.edu/~ece447/s15/lib/exe/fetch.php?m  Assembly (e.g., branch if equal)
edia=mips_reference_data.pdf BEQ rsreg rtreg immediate16

 On the 447 website  Machine encoding

BEQ rs rt immediate I-type


6-bit 5-bit 5-bit 16-bit

 Semantics (assuming no branch delay slot)


if MEM[PC]==BEQ rs rt immediate16
target = PC + 4 + sign-extend(immediate) x 4
if GPR[rs]==GPR[rt] then PC  target
else PC  PC + 4

255 256

Conditional Branch Datapath (for you to finish)

[Figure: the branch datapath -- PC + 4 from the instruction datapath, shift-left-2 of the sign-extended immediate, an adder producing the branch target, the ALU generating bcond, and PCSrc selecting the next PC; watch out for where PC + 4 comes from]
**Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
How to uphold the delayed branch semantics?

Putting It All Together

[Figure: the full single-cycle datapath and control, combining the ALU, load-store, branch, and jump paths]
**Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
JAL, JR, JALR omitted

Single-Cycle Control Logic

Single-Cycle Hardwired Control

- As a combinational function of Inst = MEM[PC]

    31       26      21      16      11      6       0
    0        rs      rt      rd      shamt   funct        R-type
    6-bit    5-bit   5-bit   5-bit   5-bit   6-bit

    31       26      21      16                      0
    opcode   rs      rt      immediate                    I-type
    6-bit    5-bit   5-bit   16-bit

    31       26                                      0
    opcode   immediate                                    J-type
    6-bit    26-bit

- Consider
  - All R-type and I-type ALU instructions
  - LW and SW
  - BEQ, BNE, BLEZ, BGTZ
  - J, JR, JAL, JALR

(A software analogy of the field extraction follows below.)
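In hardware these fields are just fixed groups of wires, so "extracting" them costs nothing extra; as a software analogy (what an ISA simulator or disassembler would do), slicing an R-type word held in rinst looks like the sketch below. Register names are illustrative; SRL and ANDI are the standard MIPS shift-right-logical and AND-immediate instructions:

    srl  rop rinst 26        # opcode field = bits 31:26
    srl  rsf rinst 21
    andi rsf rsf   0x1F      # rs field     = bits 25:21
    srl  rtf rinst 16
    andi rtf rtf   0x1F      # rt field     = bits 20:16
    srl  rdf rinst 11
    andi rdf rdf   0x1F      # rd field     = bits 15:11
    andi rfn rinst 0x3F      # funct field  = bits  5:0

Because every format places the opcode in bits 31:26, that one slice is enough for the control logic to decide what the remaining bits mean.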

Single-Bit Control Signals

  RegDest
    - de-asserted: GPR write select according to rt, i.e., inst[20:16]
    - asserted:    GPR write select according to rd, i.e., inst[15:11]
    - equation:    opcode==0
  ALUSrc
    - de-asserted: 2nd ALU input from 2nd GPR read port
    - asserted:    2nd ALU input from sign-extended 16-bit immediate
    - equation:    (opcode!=0) && (opcode!=BEQ) && (opcode!=BNE)
  MemtoReg
    - de-asserted: steer ALU result to GPR write port
    - asserted:    steer memory load value to GPR write port
    - equation:    opcode==LW
  RegWrite
    - de-asserted: GPR write disabled
    - asserted:    GPR write enabled
    - equation:    (opcode!=SW) && (opcode!=Bxx) && (opcode!=J) && (opcode!=JR)

  JAL and JALR require additional RegDest and MemtoReg options

Single-Bit Control Signals

  MemRead
    - de-asserted: memory read disabled
    - asserted:    memory read port returns load value
    - equation:    opcode==LW
  MemWrite
    - de-asserted: memory write disabled
    - asserted:    memory write enabled
    - equation:    opcode==SW
  PCSrc1
    - de-asserted: according to PCSrc2
    - asserted:    next PC is based on 26-bit immediate jump target
    - equation:    (opcode==J) || (opcode==JAL)
  PCSrc2
    - de-asserted: next PC = PC + 4
    - asserted:    next PC is based on 16-bit immediate branch target
    - equation:    (opcode==Bxx) && "bcond is satisfied"

  JR and JALR require additional PCSrc options

ALU Control

- case opcode
  - '0'    -> select operation according to funct
  - 'ALUi' -> select operation according to opcode
  - 'LW'   -> select addition
  - 'SW'   -> select addition
  - 'Bxx'  -> select bcond generation function
  - __     -> don't care
- Example ALU operations
  - ADD, SUB, AND, OR, XOR, NOR, etc.
  - bcond on equal, not equal, LE zero, GT zero, etc.

R-Type ALU

[Figure: the full single-cycle datapath annotated for an R-type ALU instruction -- the funct field drives the ALU control, and the control signals select the rd destination and enable the register write]
**Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]

I-Type ALU / LW

[Figure: the full single-cycle datapath annotated with the control signal settings for an I-type ALU instruction and for LW]
**Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]

SW / Branch (Not Taken)

Some control signals are dependent on the processing of data
[Figure: the same datapath annotated with the control signal settings for SW and for a not-taken branch]
**Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]

Branch (Taken) / Jump

Some control signals are dependent on the processing of data
[Figure: the full single-cycle datapath annotated with the control signal settings for a taken branch (PCSrc2=Br Taken) and for a jump (PCSrc1=Jump)]
**Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]

What is in That Control Box?


 Combinational Logic  Hardwired Control
 Idea: Control signals generated combinationally based on
instruction Evaluating the Single-Cycle
 Necessary in a single-cycle microarchitecture…
Microarchitecture
 Sequential Logic  Sequential/Microprogrammed Control
 Idea: A memory structure contains the control signals
associated with an instruction
 Control Store

271 272

A Single-Cycle Microarchitecture A Single-Cycle Microarchitecture: Analysis


 Is this a good idea/design?  Every instruction takes 1 cycle to execute
 CPI (Cycles per instruction) is strictly 1
 When is this a good design?
 How long each instruction takes is determined by how long
 When is this a bad design? the slowest instruction takes to execute
 Even though many instructions do not need that long to
execute
 How can we design a better microarchitecture?
 Clock cycle time of the microarchitecture is determined by
how long it takes to complete the slowest instruction
 Critical path of the design is determined by the processing
time of the slowest instruction

273 274

What is the Slowest Instruction to Process?

- Let's go back to the basics
- All six phases of the instruction processing cycle take a single machine clock cycle to complete:
  Fetch, Decode, Evaluate Address, Fetch Operands, Execute, Store Result
  (mapped onto the five generic steps: 1. Instruction fetch (IF), 2. Instruction decode and register operand fetch (ID/RF), 3. Execute/Evaluate memory address (EX/AG), 4. Memory operand fetch (MEM), 5. Store/writeback result (WB))
- Does each of the above phases take the same time (latency) for all instructions?

Single-Cycle Datapath Analysis

- Assume
  - memory units (read or write): 200 ps
  - ALU and adders: 100 ps
  - register file (read or write): 50 ps
  - other combinational logic: 0 ps
- Delay (ps) per step:

    steps:       IF     ID     EX     MEM    WB     total
    resources:   mem    RF     ALU    mem    RF
    R-type       200    50     100           50     400
    I-type       200    50     100           50     400
    LW           200    50     100    200    50     600
    SW           200    50     100    200           550
    Branch       200    50     100                  350
    Jump         200                                200

  (A worked example of the resulting execution time follows below.)
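Reading the table together with the earlier execution-time equation: the single-cycle clock has to accommodate the slowest instruction, so every instruction is effectively charged for an LW. A worked example -- the 100-instruction program and its mix (25% LW, 10% SW, 45% R/I-type ALU, 15% branches, 5% jumps) are illustrative assumptions, not from the slides:

    clock cycle time >= max(400, 400, 600, 550, 350, 200) ps = 600 ps
    Execution time (single-cycle) = {# of instructions} x {CPI} x {clock cycle time}
                                  = 100 x 1 x 600 ps = 60,000 ps

    If each instruction could instead be charged only its own latency:
    25 x 600 + 10 x 550 + 45 x 400 + 15 x 350 + 5 x 200 = 44,750 ps

The roughly 25% gap is what the multi-cycle microarchitecture later in these lectures tries to recover.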

Let's Find the Critical Path / R-Type and I-Type ALU

[Figure: the single-cycle datapath annotated with cumulative delays for R-type and I-type ALU instructions -- instruction memory at 200 ps, register file read at 250 ps, ALU at 350 ps, and the register write completing at 400 ps, while the 100 ps next-PC adders proceed in parallel]
[Based on original figure from P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]

LW / SW

[Figure: the same annotation for loads and stores -- the data memory access completes at 550 ps, the register write brings LW to 600 ps, and SW finishes at 550 ps]
[Based on original figure from P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]

Branch Taken / Jump

[Figure: the same annotation for a taken branch, whose target-resolution path completes at 350 ps, and for a jump, which needs only the 200 ps instruction fetch plus next-PC selection]
[Based on original figure from P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]

What About Control Logic? What is the Slowest Instruction to Process?


 How does that affect the critical path?  Memory is not magic

 Food for thought for you:  What if memory sometimes takes 100ms to access?
 Can control logic be on the critical path?
 A note on CDC 5600: control store access too long…  Does it make sense to have a simple register to register
add or jump to take {100ms+all else to do a memory
operation}?

 And, what if you need to access memory more than once to


process an instruction?
 Which instructions need this?
 Do you provide multiple ports to memory?

283 284

Single Cycle uArch: Complexity (Micro)architecture Design Principles


 Contrived  Critical path design
 All instructions run as slow as the slowest instruction
 Find and decrease the maximum combinational logic delay
 Inefficient  Break a path into multiple cycles if it takes too long
 All instructions run as slow as the slowest instruction
 Must provide worst-case combinational resources in parallel as required  Bread and butter (common case) design
by any instruction
 Spend time and resources on where it matters most
 Need to replicate a resource if it is needed more than once by an
 i.e., improve what the machine is really designed to do
instruction during different parts of the instruction processing cycle
 Common case vs. uncommon case
 Not necessarily the simplest way to implement an ISA
 Single-cycle implementation of REP MOVS (x86) or INDEX (VAX)?  Balanced design
 Balance instruction/data flow through hardware components
 Not easy to optimize/improve performance  Design to eliminate bottlenecks: balance the hardware for the
 Optimizing the common case does not work (e.g. common instructions) work
 Need to optimize the worst case all the time
285 286

Single-Cycle Design vs. Design Principles Aside: System Design Principles


 Critical path design  When designing computer systems/architectures, it is
important to follow good principles
 Bread and butter (common case) design
 Remember: “principled design” from our first lecture
 Balanced design  Frank Lloyd Wright: “architecture […] based upon principle,
and not upon precedent”

How does a single-cycle microarchitecture fare in light of


these principles?

287 288
10-11-2023

Aside: From Lecture 1 Aside: System Design Principles


 “architecture […] based upon principle, and not upon  We will continue to cover key principles in this course
precedent”  Here are some references where you can learn more

 Yale Patt, “Requirements, Bottlenecks, and Good Fortune: Agents for


Microprocessor Evolution,” Proc. of IEEE, 2001. (Levels of
transformation, design point, etc)
 Mike Flynn, “Very High-Speed Computing Systems,” Proc. of IEEE,
1966. (Flynn’s Bottleneck  Balanced design)
 Gene M. Amdahl, "Validity of the single processor approach to achieving
large scale computing capabilities," AFIPS Conference, April 1967.
(Amdahl’s Law  Common-case design)
 Butler W. Lampson, “Hints for Computer System Design,” ACM
Operating Systems Review, 1983.
 http://research.microsoft.com/pubs/68221/acrobat.pdf

289 290

Aside: One Important Principle


 Keep it simple

 “Everything should be made as simple as possible, but no


simpler.”
Multi-Cycle Microarchitectures
 Albert Einstein

 And, do not forget: “An engineer is a person who can do


for a dime what any fool can do for a dollar.”

 For more, see:


 Butler W. Lampson, “Hints for Computer System Design,” ACM
Operating Systems Review, 1983.
 http://research.microsoft.com/pubs/68221/acrobat.pdf
291 292

Agenda for Today & Next Few Lectures


Single-cycle Microarchitectures
18-447 

Computer Architecture  Multi-cycle and Microprogrammed Microarchitectures


Lecture 6: Multi-Cycle and
Pipelining
Microprogrammed Microarchitectures 

 Issues in Pipelining: Control & Data Dependence Handling,


State Maintenance and Recovery, …
Prof. Onur Mutlu
Carnegie Mellon University  Out-of-Order Execution
Spring 2015, 1/28/2015
 Issues in OoO Execution: Load-Store Handling, …
294

Readings for Today Readings for Next Lecture


 P&P, Revised Appendix C  Pipelining
 Microarchitecture of the LC-3b  P&H Chapter 4.5-4.8
 Appendix A (LC-3b ISA) will be useful in following this
 Pipelined LC-3b Microarchitecture
 P&H, Appendix D  http://www.ece.cmu.edu/~ece447/s14/lib/exe/fetch.php?medi
 Mapping Control to Hardware a=18447-lc3b-pipelining.pdf

 Optional
 Maurice Wilkes, “The Best Way to Design an Automatic
Calculating Machine,” Manchester Univ. Computer Inaugural
Conf., 1951.

295 296

Recap of Last Lecture Review: A Key System Design Principle


 Intro to Microarchitecture: Single-cycle Microarchitectures
 Keep it simple
 Single-cycle vs. multi-cycle
 Instruction processing “cycle”
 “Everything should be made as simple as possible, but no
 Datapath vs. control logic
simpler.”
 Hardwired vs. microprogrammed control
 Albert Einstein
 Performance analysis: Execution time equation
 Power analysis: Dynamic power equation
 And, keep it low cost: “An engineer is a person who can do
 Detailed walkthrough of a single-cycle MIPS implementation for a dime what any fool can do for a dollar.”
 Datapath
 Control logic  For more, see:
 Critical path analysis  Butler W. Lampson, “Hints for Computer System Design,” ACM
Operating Systems Review, 1983.
 (Micro)architecture design principles  http://research.microsoft.com/pubs/68221/acrobat.pdf
297 298

Review: (Micro)architecture Design Principles Review: Single-Cycle Design vs. Design Principles
 Critical path design  Critical path design
 Find and decrease the maximum combinational logic delay
 Break a path into multiple cycles if it takes too long  Bread and butter (common case) design

 Bread and butter (common case) design  Balanced design


 Spend time and resources on where it matters most
 i.e., improve what the machine is really designed to do
 Common case vs. uncommon case

How does a single-cycle microarchitecture fare in light of


 Balanced design these principles?
 Balance instruction/data flow through hardware components
 Design to eliminate bottlenecks: balance the hardware for the
work
299 300

Multi-Cycle Microarchitectures
 Goal: Let each instruction take (close to) only as much time
it really needs

Multi-Cycle Microarchitectures  Idea


 Determine clock cycle time independently of instruction
processing time
 Each instruction takes as many clock cycles as it needs to take
 Multiple state transitions per instruction
 The states followed by each instruction is different

301 302

Remember: The "Process instruction" Step
 ISA specifies abstractly what AS' should be, given an instruction and AS
  It defines an abstract finite state machine where
   State = programmer-visible state
   Next-state logic = instruction execution specification
  From ISA point of view, there are no "intermediate states" between AS and AS' during instruction execution
   One state transition per instruction
 Microarchitecture implements how AS is transformed to AS'
  There are many choices in implementation
  We can have programmer-invisible state to optimize the speed of instruction execution: multiple state transitions per instruction
   Choice 1: AS → AS' (transform AS to AS' in a single clock cycle)
   Choice 2: AS → AS+MS1 → AS+MS2 → AS+MS3 → AS' (take multiple clock cycles to transform AS to AS')

Multi-Cycle Microarchitecture
 AS = Architectural (programmer visible) state at the beginning of an instruction
 Step 1: Process part of instruction in one clock cycle
 Step 2: Process part of instruction in the next clock cycle
 …
 AS' = Architectural (programmer visible) state at the end of a clock cycle

Benefits of Multi-Cycle Design
 Critical path design
  Can keep reducing the critical path independently of the worst-case processing time of any instruction
 Bread and butter (common case) design
  Can optimize the number of states it takes to execute "important" instructions that make up much of the execution time
 Balanced design
  No need to provide more capability or resources than really needed
   An instruction that needs resource X multiple times does not require multiple X's to be implemented
   Leads to more efficient hardware: Can reuse hardware components needed multiple times for an instruction

Remember: Performance Analysis
 Execution time of an instruction
  {CPI} x {clock cycle time}
 Execution time of a program
  Sum over all instructions [{CPI} x {clock cycle time}]
  {# of instructions} x {Average CPI} x {clock cycle time}
 Single-cycle microarchitecture performance
  CPI = 1
  Clock cycle time = long
 Multi-cycle microarchitecture performance
  CPI = different for each instruction; Average CPI → hopefully small
  Clock cycle time = short
 Now, we have two degrees of freedom to optimize independently
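To make the "two degrees of freedom" concrete, here is a small, hypothetical back-of-the-envelope calculation in C. The latencies, per-instruction state counts, and instruction mix are invented for illustration only; the point is that the multi-cycle design wins only when the shorter cycle outweighs the larger average CPI.

#include <stdio.h>

int main(void) {
    /* Assumed instruction mix and state counts -- illustrative, not measured */
    double f_load = 0.25, f_store = 0.10, f_alu = 0.45, f_branch = 0.20;

    /* Single-cycle design: CPI = 1, cycle time set by the slowest instruction (assume the load) */
    double cycle_single_ps = 800.0;
    double time_single = 1.0 * cycle_single_ps;

    /* Multi-cycle design: short cycle, CPI differs per instruction (assumed state counts) */
    double cycle_multi_ps = 200.0;
    double avg_cpi = f_load * 5 + f_store * 4 + f_alu * 4 + f_branch * 3;
    double time_multi = avg_cpi * cycle_multi_ps;

    printf("single-cycle: %.0f ps per instruction\n", time_single);
    printf("multi-cycle : average CPI = %.2f, %.0f ps per instruction\n", avg_cpi, time_multi);
    return 0;
}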

A Multi-Cycle Microarchitecture: A Closer Look

How Do We Implement This?
 Maurice Wilkes, "The Best Way to Design an Automatic Calculating Machine," Manchester Univ. Computer Inaugural Conf., 1951.
 The concept of microcoded/microprogrammed machines

Microprogrammed Multi-Cycle uArch
 Key Idea for Realization
  One can implement the "process instruction" step as a finite state machine that sequences between states and eventually returns back to the "fetch instruction" state
  A state is defined by the control signals asserted in it
  Control signals for the next state are determined in the current state

The Instruction Processing Cycle
 Fetch
 Decode
 Evaluate Address
 Fetch Operands
 Execute
 Store Result

A Basic Multi-Cycle Microarchitecture
 Instruction processing cycle divided into "states"
  A stage in the instruction processing cycle can take multiple states
 A multi-cycle microarchitecture sequences from state to state to process an instruction
  The behavior of the machine in a state is completely determined by control signals in that state
 The behavior of the entire processor is specified fully by a finite state machine
 In a state (clock cycle), control signals control two things:
  How the datapath should process the data
  How to generate the control signals for the next clock cycle

Microprogrammed Control Terminology
 Control signals associated with the current state
  Microinstruction
 Act of transitioning from one state to another
  Determining the next state and the microinstruction for the next state
  Microsequencing
 Control store stores control signals for every possible state
  Store for microinstructions for the entire FSM
 Microsequencer determines which set of control signals will be used in the next clock cycle (i.e., next state)
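The terminology maps naturally onto a small software model. The sketch below is a hypothetical C rendering of a control store plus microsequencer loop; the field names, widths, and the two example states are invented for illustration and are not the LC-3b encoding.

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* One microinstruction: datapath control bits plus next-state info (hypothetical fields) */
typedef struct {
    uint32_t datapath_signals;   /* e.g., register loads, mux selects, ALU operation        */
    uint8_t  next_state;         /* default next control-store address                      */
    bool     wait_for_ready;     /* stay in this state until memory asserts ready (R)       */
} microinstruction_t;

#define NUM_STATES 64
static microinstruction_t control_store[NUM_STATES];   /* filled in by the microprogrammer */

/* Stand-in for the datapath: here we just show which signals are asserted */
static void drive_datapath(uint32_t signals) {
    printf("asserting control signals 0x%08x\n", (unsigned)signals);
}

/* One clock cycle: apply the current microinstruction, then microsequence */
static uint8_t step(uint8_t state, bool mem_ready) {
    microinstruction_t u = control_store[state];   /* read the control store               */
    drive_datapath(u.datapath_signals);            /* control the datapath this cycle      */
    if (u.wait_for_ready && !mem_ready)
        return state;                              /* variable-latency memory: re-enter    */
    return u.next_state;                           /* microsequencer picks the next address */
}

int main(void) {
    /* Two made-up states: one that advances unconditionally, one that waits for memory */
    control_store[18] = (microinstruction_t){ .datapath_signals = 0x1, .next_state = 33, .wait_for_ready = false };
    control_store[33] = (microinstruction_t){ .datapath_signals = 0x2, .next_state = 35, .wait_for_ready = true  };

    uint8_t state = 18;
    for (int cycle = 0; cycle < 4; cycle++)
        state = step(state, /*mem_ready=*/cycle >= 2);
    printf("ended in state %d\n", (int)state);
    return 0;
}

In a real microcoded machine the next-state choice also depends on opcode and condition bits (as in the LC-3b microsequencer discussed in the following slides); this sketch only captures the control-store-plus-sequencer skeleton.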

What Happens In A Clock Cycle?
 The control signals (microinstruction) for the current state control two things:
  Processing in the data path
  Generation of control signals (microinstruction) for the next cycle
  See Supplemental Figure 1 (next slide)
 Datapath and microsequencer operate concurrently
 Question: why not generate control signals for the current cycle in the current cycle?
  This will lengthen the clock cycle
  Why would it lengthen the clock cycle?
  See Supplemental Figure 2

A Clock Cycle
[Figure: datapath processing and next-state control signal generation happening concurrently within one cycle]

A Bad Clock Cycle!
[Figure: control signal generation placed on the critical path of the same cycle]

A Simple LC-3b Control and Datapath
 Read Appendix C under Technical Docs

What Determines Next-State Control Signals?
 What is happening in the current clock cycle
  See the 9 control signals coming from the "Control" block
  What are these for?
 The instruction that is being executed
  IR[15:11] coming from the Data Path
 Whether the condition of a branch is met, if the instruction being processed is a branch
  BEN bit coming from the datapath
 Whether the memory operation is completing in the current cycle, if one is in progress
  R bit coming from memory

A Simple LC-3b Control and Datapath
[Figure: the LC-3b control structure and single-bus datapath]

The State Machine for Multi-Cycle Processing
 The behavior of the LC-3b uarch is completely determined by
  the 35 control signals and
  additional 7 bits that go into the control logic from the datapath
 35 control signals completely describe the state of the control structure
 We can completely describe the behavior of the LC-3b as a state machine, i.e., a directed graph of
  Nodes (one corresponding to each state)
  Arcs (showing flow from each state to the next state(s))

An LC-3b State Machine
 Patt and Patel, Appendix C, Figure C.2
 Each state must be uniquely specified
  Done by means of state variables
 31 distinct states in this LC-3b state machine
  Encoded with 6 state variables
 Examples
  States 18 and 19 correspond to the beginning of the instruction processing cycle
  Fetch phase: state 18, 19 → state 33 → state 35
  Decode phase: state 32

[Figure: LC-3b state machine (Patt and Patel, Appendix C, Figure C.2) – fetch states 18/19, 33, 35; decode state 32; per-opcode execution states]

LC-3b State Machine: Some Questions
 How many cycles does the fastest instruction take?
 How many cycles does the slowest instruction take?
 Why does the BR take as long as it takes in the FSM?
 What determines the clock cycle time?

LC-3b Datapath
 Patt and Patel, Appendix C, Figure C.3
 Single-bus datapath design
  At any point only one value can be "gated" on the bus (i.e., can be driving the bus)
  Advantage: Low hardware cost: one bus
  Disadvantage: Reduced concurrency – if an instruction needs the bus twice for two different things, these need to happen in different states
 Control signals (26 of them) determine what happens in the datapath in one clock cycle
  Patt and Patel, Appendix C, Table C.1

[Figure: LC-3b DRMUX and SR1MUX selection logic and BEN generation from N, Z, P and IR[11:9]]

Remember the MIPS datapath
[Figure: single-cycle MIPS datapath, for comparison]

LC-3b Datapath: Some Questions
 How does instruction fetch happen in this datapath according to the state machine?
 What is the difference between gating and loading?
 Is this the smallest hardware you can design?

LC-3b Microprogrammed Control Structure
 Patt and Patel, Appendix C, Figure C.4
 Three components:
  Microinstruction, control store, microsequencer
 Microinstruction: control signals that control the datapath (26 of them) and help determine the next state (9 of them)
 Each microinstruction is stored in a unique location in the control store (a special memory structure)
  Unique location: address of the state corresponding to the microinstruction
  Remember each state corresponds to one microinstruction
 Microsequencer determines the address of the next microinstruction (i.e., next state)

[Figure: LC-3b microsequencer and 2^6 x 35 control store (Patt and Patel, Appendix C, Figure C.4) – inputs BEN, R, IR[11], IR[15:12]; sequencing fields J[5:0], COND, IRD; microinstruction = 9 sequencing + 26 datapath control signals]

[Figure: control store contents for the 64 states (states 0-63), listing the 35 control signals per state]

LC-3b Microsequencer
 Patt and Patel, Appendix C, Figure C.5
 The purpose of the microsequencer is to determine the address of the next microinstruction (i.e., next state)
 Next address depends on 9 control signals (plus 7 data signals)
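As a rough illustration of how those 9 sequencing signals (J[5:0], COND[1:0], IRD) could combine with the datapath signals (BEN, R, IR[11], IR[15:12]) into the next control-store address, here is a hedged C sketch. The COND encoding and the bit positions used below follow the usual Appendix C scheme, but treat them as assumptions to be checked against the handouts rather than a definitive specification.

#include <stdint.h>
#include <stdio.h>

/* Assumed COND encoding: 00 = unconditional, 01 = memory ready (R),
 * 10 = branch (BEN), 11 = addressing mode (IR[11]).
 */
static uint8_t next_address(uint8_t J, uint8_t COND, int IRD,
                            int BEN, int R, int IR11, uint8_t IR15_12) {
    if (IRD)                                 /* decode state: 16-way branch on the opcode */
        return (uint8_t)(IR15_12 & 0x0F);    /* address = 0,0,IR[15:12]                   */

    uint8_t addr = J;
    if (COND == 1 && R)    addr |= 0x02;     /* OR the ready bit into J[1]     */
    if (COND == 2 && BEN)  addr |= 0x04;     /* OR the branch enable into J[2] */
    if (COND == 3 && IR11) addr |= 0x01;     /* OR IR[11] into J[0]            */
    return (uint8_t)(addr & 0x3F);           /* 6-bit control store address    */
}

int main(void) {
    /* Example: a memory-wait state with J=33 and COND=ready loops on itself while R=0
       and moves to state 35 once memory asserts R=1. */
    printf("R=0 -> %d, R=1 -> %d\n",
           next_address(33, 1, 0, 0, 0, 0, 0),
           next_address(33, 1, 0, 0, 1, 0, 0));
    return 0;
}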

The Microsequencer: Some Questions
 When is the IRD signal asserted?
 What happens if an illegal instruction is decoded?
 What are condition (COND) bits for?
 How is variable latency memory handled?
 How do you do the state encoding?
  Minimize number of state variables (~ control store size)
  Start with the 16-way branch
  Then determine constraint tables and states dependent on COND

An Exercise in Microprogramming

Handouts
 7 pages of Microprogrammed LC-3b design
  http://www.ece.cmu.edu/~ece447/s14/doku.php?id=techdocs
  http://www.ece.cmu.edu/~ece447/s14/lib/exe/fetch.php?media=lc3b-figures.pdf

A Simple LC-3b Control and Datapath
[Figure: LC-3b control structure and single-bus datapath, repeated for the exercise]

[Figure: LC-3b state machine (Patt and Patel, Appendix C, Figure C.2), repeated for the exercise]

State Machine for LDW
 State sequence for LDW: state 18 (010010) → 33 (100001) → 35 (100011) → 32 (100000) → 6 (000110) → 25 (011001) → 27 (011011)

Microsequencer
[Figure: LC-3b microsequencer with DRMUX, SR1MUX, and BEN generation logic]

[Figure: control store contents for the 64 states, walked through during the exercise]

End of the Exercise in Microprogramming

The Control Store: Some Questions
 What control signals can be stored in the control store?
  vs.
 What control signals have to be generated in hardwired logic?
  i.e., what signal cannot be available without processing in the datapath?
 Remember the MIPS datapath
  One PCSrc signal depends on processing that happens in the datapath (bcond logic)

Variable-Latency Memory
 The ready signal (R) enables memory read/write to execute correctly
  Example: transition from state 33 to state 35 is controlled by the R bit asserted by memory when memory data is available
 Could we have done this in a single-cycle microarchitecture?

The Microsequencer: Advanced Questions
 What happens if the machine is interrupted?
 What if an instruction generates an exception?
 How can you implement a complex instruction using this control structure?
  Think REP MOVS

The Power of Abstraction
 The concept of a control store of microinstructions provides the hardware designer with a new abstraction: microprogramming
 The designer can translate any desired operation to a sequence of microinstructions
 All the designer needs to provide is
  The sequence of microinstructions needed to implement the desired operation
  The ability for the control logic to correctly sequence through the microinstructions
  Any additional datapath control signals needed (no need if the operation can be "translated" into existing control signals)

Let's Do Some More Microprogramming
 Implement REP MOVS in the LC-3b microarchitecture
 What changes, if any, do you make to the
  state machine?
  datapath?
  control store?
  microsequencer?
 Show all changes and microinstructions
 Coming up in Homework 2

Aside: Alignment Correction in Memory
 Remember unaligned accesses
 LC-3b has byte load and byte store instructions that move data not aligned at the word-address boundary
  Convenience to the programmer/compiler
 How does the hardware ensure this works correctly?
  Take a look at state 29 for LDB
  States 24 and 17 for STB
  Additional logic to handle unaligned accesses

Aside: Memory Mapped I/O
 Address control logic determines whether the address specified by LDx and STx refers to memory or to I/O devices
 Correspondingly enables memory or I/O devices and sets up muxes
 Another instance where the final control signals (e.g., MEM.EN or INMUX/2) cannot be stored in the control store
  These signals are dependent on the address

Advantages of Microprogrammed Control
 Allows a very simple design to do powerful computation by controlling the datapath (using a sequencer)
  High-level ISA translated into microcode (sequence of microinstructions)
  Microcode (ucode) enables a minimal datapath to emulate an ISA
  Microinstructions can be thought of as a user-invisible ISA (micro ISA)
 Enables easy extensibility of the ISA
  Can support a new instruction by changing the microcode
  Can support complex instructions as a sequence of simple microinstructions
 If I can sequence an arbitrary instruction then I can sequence an arbitrary "program" as a microprogram sequence
  will need some new state (e.g., loop counters) in the microcode for sequencing more elaborate programs

Update of Machine Behavior
 The ability to update/patch microcode in the field (after a processor is shipped) enables
  Ability to add new instructions without changing the processor!
  Ability to "fix" buggy hardware implementations
 Examples
  IBM 370 Model 145: microcode stored in main memory, can be updated after a reboot
  IBM System z: Similar to 370/145.
   Heller and Farrell, "Millicode in an IBM zSeries processor," IBM JR&D, May/Jul 2004.
  B1700 microcode can be updated while the processor is running
   User-microprogrammable machine!

18-447 Computer Architecture
Lecture 7: Pipelining

Prof. Onur Mutlu
Carnegie Mellon University
Spring 2015, 1/30/2015

Agenda for Today & Next Few Lectures
 Single-cycle Microarchitectures
 Multi-cycle and Microprogrammed Microarchitectures
 Pipelining
 Issues in Pipelining: Control & Data Dependence Handling, State Maintenance and Recovery, …
 Out-of-Order Execution
 Issues in OoO Execution: Load-Store Handling, …

Recap of Last Lecture
 Multi-cycle and Microprogrammed Microarchitectures
  Benefits vs. Design Principles
  When to Generate Control Signals
  Microprogrammed Control: uInstruction, uSequencer, Control Store
  LC-3b State Machine, Datapath, Control Structure
  An Exercise in Microprogramming
  Variable Latency Memory, Alignment, Memory Mapped I/O, …
 Microprogramming
  Power of abstraction (for the HW designer)
  Advantages of uProgrammed Control
  Update of Machine Behavior

Review: A Simple LC-3b Control and Datapath
[Figure: LC-3b microsequencer and 2^6 x 35 control store – a simple design of the control structure]

Review: The Power of Abstraction
 The concept of a control store of microinstructions provides the hardware designer with a new abstraction: microprogramming
 The designer can translate any desired operation to a sequence of microinstructions
 All the designer needs to provide is
  The sequence of microinstructions needed to implement the desired operation
  The ability for the control logic to correctly sequence through the microinstructions
  Any additional datapath elements and control signals needed (no need if the operation can be "translated" into existing control signals)
 A simple datapath can become very powerful

Review: Advantages of Microprogrammed Control
 Allows a very simple design to do powerful computation by controlling the datapath (using a sequencer)
  High-level ISA translated into microcode (sequence of u-instructions)
  Microcode (u-code) enables a minimal datapath to emulate an ISA
  Microinstructions can be thought of as a user-invisible ISA (u-ISA)
 Enables easy extensibility of the ISA
  Can support a new instruction by changing the microcode
  Can support complex instructions as a sequence of simple microinstructions
 Enables update of machine behavior
  A buggy implementation of an instruction can be fixed by changing the microcode in the field

Wrap Up Microprogrammed Control
 Horizontal vs. Vertical Microcode
 Nanocode vs. Microcode vs. Millicode
 Microprogrammed MIPS: An Example

Horizontal Microcode
 A single control store provides the control signals
 Control store: 2^n x k bits (not including sequencing), with an n-bit uPC input and a k-bit control signal output
[Figure: microcode storage feeding the datapath control outputs (e.g., ALUSrcA, IorD, IRWrite, PCWrite, PCWriteCond), microprogram counter, adder, and address select logic driven by the instruction register opcode field (based on P&H CO&D)]

Vertical Microcode
 Two-level control store: the first level specifies abstract operations
  a 1-bit signal means "do this RT" (or combination of RTs), e.g. "PC ← PC+4", "PC ← ALUOut", "PC ← PC[31:28],IR[25:0],2'b00", "IR ← MEM[PC]", "A ← RF[IR[25:21]]", "B ← RF[IR[20:16]]", …
  the m-bit first-level output indexes a second ROM that produces the k control signals
 If done right (i.e., m << n and m << k), the two ROMs together (2^n x m + 2^m x k bits) should be smaller than the horizontal microcode ROM (2^n x k bits)
 P&H, Appendix D
[Figure: two-level vertical microcode organization (based on P&H CO&D)]
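To make the size comparison concrete, here is a small hypothetical calculation; the values of n, m, and k are made up for illustration (a wide control word and a modest number of states), not taken from any particular machine.

#include <stdio.h>

int main(void) {
    /* Assumed parameters: n = uPC width, k = raw control signal count,
       m = width of the abstract-operation encoding in vertical microcode */
    unsigned n = 6, k = 64, m = 4;

    unsigned long horizontal = (1UL << n) * k;                    /* 2^n * k bits       */
    unsigned long vertical   = (1UL << n) * m + (1UL << m) * k;   /* 2^n*m + 2^m*k bits */

    printf("horizontal control store: %lu bits\n", horizontal);   /* 64*64 = 4096       */
    printf("vertical control store  : %lu bits\n", vertical);     /* 256 + 1024 = 1280  */
    return 0;
}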

Nanocode and Millicode
 Nanocode: a level below traditional microcode
  microprogrammed control for sub-systems (e.g., a complicated floating-point module) that acts as a slave in a microcontrolled datapath
 Millicode: a level above traditional microcode
  ISA-level subroutines that can be called by the microcontroller to handle complicated operations and system functions
  E.g., Heller and Farrell, "Millicode in an IBM zSeries processor," IBM JR&D, May/Jul 2004.
 In both cases, we avoid complicating the main u-controller
 You can think of these as "microcode" at different levels of abstraction

Nanocode Concept Illustrated
 a "ucoded" processor implementation: ROM + uPC driving the processor datapath
 a "ucoded" FPU implementation: ROM + uPC driving the arithmetic datapath
 We refer to this as "nanocode" when a ucoded subsystem is embedded in a ucoded system

Microcoded Multi-Cycle MIPS Design
 Any ISA can be implemented with a microprogrammed microarchitecture
 P&H, Appendix D: Microprogrammed MIPS design
 We will not cover this in class
 However, you can do an extra credit assignment for Lab 2

Microcoded Multi-Cycle MIPS Design
[Figure: multi-cycle MIPS datapath (based on P&H CO&D)]

Control Logic for MIPS FSM
[Figure: hardwired FSM control logic (based on P&H CO&D)]

Microprogrammed Control for MIPS FSM
[Figure: microprogrammed control for the same FSM (based on P&H CO&D)]

Multi-Cycle vs. Single-Cycle uArch
 Advantages
 Disadvantages
 You should be very familiar with this right now

Microprogrammed vs. Hardwired Control
 Advantages
 Disadvantages
 You should be very familiar with this right now

Can We Do Better?
 What limitations do you see with the multi-cycle design?
 Limited concurrency
  Some hardware resources are idle during different phases of the instruction processing cycle
  "Fetch" logic is idle when an instruction is being "decoded" or "executed"
  Most of the datapath is idle when a memory access is happening

Can We Use the Idle Hardware to Improve Concurrency?
 Goal: More concurrency → Higher instruction throughput (i.e., more "work" completed in one cycle)
 Idea: When an instruction is using some resources in its processing phase, process other instructions on idle resources not needed by that instruction
  E.g., when an instruction is being decoded, fetch the next instruction
  E.g., when an instruction is being executed, decode another instruction
  E.g., when an instruction is accessing data memory (ld/st), execute the next instruction
  E.g., when an instruction is writing its result into the register file, access data memory for the next instruction

Pipelining

Pipelining: Basic Idea
 More systematically:
  Pipeline the execution of multiple instructions
  Analogy: "Assembly line processing" of instructions
 Idea:
  Divide the instruction processing cycle into distinct "stages" of processing
  Ensure there are enough hardware resources to process one instruction in each stage
  Process a different instruction in each stage
   Instructions consecutive in program order are processed in consecutive stages
 Benefit: Increases instruction processing throughput (1/CPI)
 Downside: Start thinking about this…

Example: Execution of Four Independent ADDs
 Multi-cycle: 4 cycles per instruction
 Pipelined: 4 cycles per 4 instructions (steady state)
 Is life always this beautiful?
[Figure: F, D, E, W timing for four ADDs, multi-cycle vs. pipelined (based on P&H CO&D)]

The Laundry Analogy
 "place one dirty load of clothes in the washer"
 "when the washer is finished, place the wet load in the dryer"
 "when the dryer is finished, take out the dry load and fold"
 "when folding is finished, ask your roommate (??) to put the clothes away"
 - steps to do a load are sequentially dependent
 - no dependence between different loads
 - different steps do not share resources
[Figure: laundry timeline from 6 PM to 2 AM (based on P&H CO&D)]

Pipelining Multiple Loads of Laundry
 - 4 loads of laundry in parallel
 - no additional resources
 - throughput increased by 4
 - latency per load is the same
[Figure: pipelined laundry timeline, 6 PM to 2 AM (based on P&H CO&D)]

Pipelining Multiple Loads of Laundry: In Practice
 the slowest step decides throughput
[Figure: pipelined laundry with unequal step latencies (based on P&H CO&D)]

Pipelining Multiple Loads of Laundry: In Practice
 throughput restored (2 loads per hour) using 2 dryers
[Figure: pipelined laundry with a second dryer (based on P&H CO&D)]

An Ideal Pipeline
 Goal: Increase throughput with little increase in cost (hardware cost, in case of instruction processing)
 Repetition of identical operations
  The same operation is repeated on a large number of different inputs (e.g., all laundry loads go through the same steps)
 Repetition of independent operations
  No dependencies between repeated operations
 Uniformly partitionable suboperations
  Processing can be evenly divided into uniform-latency suboperations (that do not share resources)
 Fitting examples: automobile assembly line, doing laundry
  What about the instruction processing "cycle"?

Ideal Pipelining
 combinational logic (F,D,E,M,W), delay T ps → BW = ~(1/T)
 two stages of T/2 ps each (F,D,E | M,W) → BW = ~(2/T)
 three stages of T/3 ps each → BW = ~(3/T)

More Realistic Pipeline: Throughput
 Nonpipelined version with delay T
  BW = 1/(T+S), where S = latch delay
 k-stage pipelined version
  BW_k-stage = 1 / (T/k + S)
  BW_max = 1 / (1 gate delay + S)
 Latch delay reduces throughput (switching overhead between stages)
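A quick sanity check of the k-stage formula, with made-up numbers (T and S are assumptions chosen only to show how latch overhead caps the benefit of deeper pipelining):

#include <stdio.h>

int main(void) {
    double T = 800.0;   /* assumed total combinational delay, ps           */
    double S = 20.0;    /* assumed latch (pipeline register) delay, ps     */

    for (int k = 1; k <= 32; k *= 2) {
        double cycle = T / k + S;           /* per-stage delay plus latch overhead   */
        double bw    = 1000.0 / cycle;      /* instructions per ns in steady state   */
        printf("k=%2d  cycle=%6.1f ps  throughput=%.2f inst/ns\n", k, cycle, bw);
    }
    /* As k grows, throughput approaches 1/S rather than k/T: latches limit the gain. */
    return 0;
}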

More Realistic Pipeline: Cost
 Nonpipelined version with combinational cost G
  Cost = G + L, where L = latch cost
 k-stage pipelined version
  Cost_k-stage = G + L*k
 Latches increase hardware cost

Pipelining Instruction Processing

Remember: The Instruction Processing Cycle
 1. Instruction fetch (IF)
 2. Instruction decode and register operand fetch (ID/RF)
 3. Execute / Evaluate memory address (EX/AG)
 4. Memory operand fetch (MEM)
 5. Store / writeback result (WB)

Remember the Single-Cycle Uarch
 Combinational delay T → BW = ~(1/T)
[Figure: single-cycle MIPS datapath with its control signals (PCSrc, RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, RegWrite) (based on P&H CO&D)]

Dividing Into Stages
 IF: Instruction fetch (200 ps)
 ID: Instruction decode / register file read (100 ps)
 EX: Execute / address calculation (200 ps)
 MEM: Memory access (200 ps)
 WB: Write back (100 ps)
 Is this the correct partitioning?
  Why not 4 or 6 stages? Why not different boundaries?
[Figure: single-cycle datapath divided into the five stages (based on P&H CO&D)]

Instruction Pipeline Throughput
 Non-pipelined: one lw completes every 800 ps
 Pipelined: a new lw starts every 200 ps in steady state
 5-stage speedup is 4, not 5 as predicted by the ideal model. Why?
[Figure: execution timing of three lw instructions, non-pipelined vs. pipelined (based on P&H CO&D)]
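The gap between the ideal 5x and the observed 4x follows directly from the unbalanced stage latencies: the pipelined clock must accommodate the slowest stage. A small check using the stage latencies from the slide:

#include <stdio.h>

int main(void) {
    int stage_ps[5] = {200, 100, 200, 200, 100};   /* IF, ID, EX, MEM, WB from the slide */
    int total = 0, slowest = 0;
    for (int i = 0; i < 5; i++) {
        total += stage_ps[i];
        if (stage_ps[i] > slowest) slowest = stage_ps[i];
    }
    /* Non-pipelined cycle = sum of stage delays; pipelined cycle = slowest stage */
    printf("speedup = %d ps / %d ps = %.1fx (not 5x: internal fragmentation)\n",
           total, slowest, (double)total / slowest);
    return 0;
}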

Enabling Pipelined Processing: Pipeline Registers
 No resource is used by more than 1 stage!
 Pipeline registers (IF/ID, ID/EX, EX/MEM, MEM/WB) latch the values an instruction carries between stages (e.g., incremented PC, IR, register operands A and B, sign-extended immediate, ALU output, memory data)
 Each stage now takes ~T/k ps instead of T ps
 Any performance impact?
[Figure: pipelined datapath with the four pipeline registers (based on P&H CO&D)]

Pipelined Operation Example
 lw: all instruction classes must follow the same path and timing through the pipeline stages (Instruction fetch, Instruction decode, Execution, Memory, Write back)
[Figure: lw flowing through the pipelined datapath stage by stage (based on P&H CO&D)]
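As a rough software analogue of the figure (field names are invented, not the slide's exact signal names), the pipeline registers can be pictured as per-boundary structs that are written at the end of one stage and read by the next; control bits decoded in ID simply ride along in the same structs:

#include <stdint.h>

/* Hypothetical pipeline registers for a 5-stage MIPS-like pipeline.
 * Each struct is written by the stage on its left and read by the stage on its right. */
typedef struct { uint32_t pc_plus4, ir; } if_id_t;

typedef struct {
    uint32_t pc_plus4, rs_val, rt_val, imm_sext;
    uint8_t  rd;                          /* destination register number                    */
    uint8_t  ctrl_ex, ctrl_mem, ctrl_wb;  /* control bits decoded once in ID, buffered here  */
} id_ex_t;

typedef struct {
    uint32_t alu_out, store_val;
    uint8_t  rd;
    uint8_t  ctrl_mem, ctrl_wb;           /* EX-stage control already consumed and dropped   */
} ex_mem_t;

typedef struct {
    uint32_t alu_out, mem_data;
    uint8_t  rd;
    uint8_t  ctrl_wb;                     /* only write-back control remains                 */
} mem_wb_t;

/* One machine state: architectural state plus the four pipeline registers */
typedef struct {
    uint32_t pc, regs[32];
    if_id_t  if_id;
    id_ex_t  id_ex;
    ex_mem_t ex_mem;
    mem_wb_t mem_wb;
} pipeline_t;

This also previews "Option 1" of the control-signal discussion later in this lecture: decode once in ID and let the control bits travel down the pipeline with the instruction.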

Pipelined Operation Example
 lw $10, 20($1) followed by sub $11, $2, $3: in each cycle the two instructions occupy different stages (Instruction fetch, Instruction decode, Execution, Memory, Write back)
 Is life always this beautiful?
[Figure: lw and sub moving through the pipelined datapath over clock cycles 1-6 (based on P&H CO&D)]

Illustrating Pipeline Operation: Operation View
         t0   t1   t2   t3   t4   t5
 Inst0   IF   ID   EX   MEM  WB
 Inst1        IF   ID   EX   MEM  WB
 Inst2             IF   ID   EX   MEM ...
 Inst3                  IF   ID   EX  ...
 Inst4                       IF   ID  ...
 steady state (full pipeline)

Illustrating Pipeline Operation: Resource View
        t0  t1  t2  t3  t4  t5  t6  t7  t8  t9  t10
 IF     I0  I1  I2  I3  I4  I5  I6  I7  I8  I9  I10
 ID         I0  I1  I2  I3  I4  I5  I6  I7  I8  I9
 EX             I0  I1  I2  I3  I4  I5  I6  I7  I8
 MEM                I0  I1  I2  I3  I4  I5  I6  I7
 WB                     I0  I1  I2  I3  I4  I5  I6

Control Points in a Pipeline
 Identical set of control points as the single-cycle datapath!!
[Figure: pipelined datapath with its control points (based on P&H CO&D)]

Control Signals in a Pipeline
 For a given instruction
  same control signals as single-cycle, but
  control signals required at different cycles, depending on stage
 Option 1: decode once using the same logic as single-cycle and buffer signals until consumed
 Option 2: carry relevant "instruction word/field" down the pipeline and decode locally within each stage or in a previous stage
 Which one is better?

Pipelined Control Signals
[Figure: control signal groups (WB, M, EX) generated in ID and buffered through the ID/EX, EX/MEM, MEM/WB pipeline registers (based on P&H CO&D)]

Remember: An Ideal Pipeline
 Goal: Increase throughput with little increase in cost (hardware cost, in case of instruction processing)
 Repetition of identical operations
  The same operation is repeated on a large number of different inputs (e.g., all laundry loads go through the same steps)
 Repetition of independent operations
  No dependencies between repeated operations
 Uniformly partitionable suboperations
  Processing can be evenly divided into uniform-latency suboperations (that do not share resources)
 Fitting examples: automobile assembly line, doing laundry
  What about the instruction processing "cycle"?

Instruction Pipeline: Not An Ideal Pipeline
 Identical operations ... NOT!
  different instructions → not all need the same stages
  Forcing different instructions to go through the same pipe stages
  → external fragmentation (some pipe stages idle for some instructions)
 Uniform suboperations ... NOT!
  different pipeline stages → not the same latency
  Need to force each stage to be controlled by the same clock
  → internal fragmentation (some pipe stages are too fast but all take the same clock cycle time)
 Independent operations ... NOT!
  instructions are not independent of each other
  Need to detect and resolve inter-instruction dependencies to ensure the pipeline provides correct results
  → pipeline stalls (pipeline is not always moving)

Issues in Pipeline Design
 Balancing work in pipeline stages
  How many stages and what is done in each stage
 Keeping the pipeline correct, moving, and full in the presence of events that disrupt pipeline flow
  Handling dependences
   Data
   Control
  Handling resource contention
  Handling long-latency (multi-cycle) operations
 Handling exceptions, interrupts
 Advanced: Improving pipeline throughput
  Minimizing stalls

Causes of Pipeline Stalls
 Stall: A condition when the pipeline stops moving
 Resource contention
 Dependences (between instructions)
  Data
  Control
 Long-latency (multi-cycle) operations

Dependences and Their Types
 Also called "dependency" or less desirably "hazard"
 Dependences dictate ordering requirements between instructions
 Two types
  Data dependence
  Control dependence
 Resource contention is sometimes called resource dependence
  However, this is not fundamental to (dictated by) program semantics, so we will treat it separately

Handling Resource Contention
 Happens when instructions in two pipeline stages need the same resource
 Solution 1: Eliminate the cause of contention
  Duplicate the resource or increase its throughput
  E.g., use separate instruction and data memories (caches)
  E.g., use multiple ports for memory structures
 Solution 2: Detect the resource contention and stall one of the contending stages
  Which stage do you stall?
  Example: What if you had a single read and write port for the register file?

Data Dependences
 Types of data dependences
  Flow dependence (true data dependence – read after write)
  Output dependence (write after write)
  Anti dependence (write after read)
 Which ones cause stalls in a pipelined machine?
  For all of them, we need to ensure semantics of the program is correct
  Flow dependences always need to be obeyed because they constitute true dependence on a value
  Anti and output dependences exist due to limited number of architectural registers
   They are dependence on a name, not a value
   We will later see what we can do about them

Data Dependence Types
 Flow dependence (Read-after-Write, RAW)
  r3 ← r1 op r2
  r5 ← r3 op r4
 Anti dependence (Write-after-Read, WAR)
  r3 ← r1 op r2
  r1 ← r4 op r5
 Output dependence (Write-after-Write, WAW)
  r3 ← r1 op r2
  r5 ← r3 op r4
  r3 ← r6 op r7

Pipelined Operation Example
 lw $10, 20($1) followed by sub $11, $2, $3 moving through Instruction fetch, Instruction decode, Execution, Memory, and Write back
 What if the SUB were dependent on LW?
[Figure: lw and sub in the pipelined datapath, cycle by cycle (based on P&H CO&D)]

Data Dependence Handling

Readings for Next Few Lectures
 P&H Chapter 4.9-4.11
 Smith and Sohi, "The Microarchitecture of Superscalar Processors," Proceedings of the IEEE, 1995
  More advanced pipelining
  Interrupt and exception handling
  Out-of-order and superscalar execution concepts

How to Handle Data Dependences
 Anti and output dependences are easier to handle
  write to the destination in one stage and in program order
 Flow dependences are more interesting
 Five fundamental ways of handling flow dependences
  Detect and wait until value is available in register file
  Detect and forward/bypass data to dependent instruction
  Detect and eliminate the dependence at the software level
   No need for the hardware to detect dependence
  Predict the needed value(s), execute "speculatively", and verify
  Do something else (fine-grained multithreading)
   No need to detect

Interlocking
 Detection of dependence between instructions in a pipelined processor to guarantee correct execution
 Software based interlocking vs. Hardware based interlocking
 MIPS acronym?

Approaches to Dependence Detection (I)
 Scoreboarding
  Each register in the register file has a Valid bit associated with it
  An instruction that is writing to the register resets the Valid bit
  An instruction in Decode stage checks if all its source and destination registers are Valid
   Yes: No need to stall… No dependence
   No: Stall the instruction
 Advantage:
  Simple. 1 bit per register
 Disadvantage:
  Need to stall for all types of dependences, not only flow dep.
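A tiny, hypothetical sketch of the scoreboarding idea in C: one Valid bit per register, cleared by the in-flight writer and checked in Decode. The register count and the exact stall policy are assumptions for illustration.

#include <stdbool.h>
#include <stdint.h>

#define NUM_REGS 32
static bool reg_valid[NUM_REGS];  /* true = no in-flight instruction will still write this register */

/* At reset, every architectural register is valid */
void scoreboard_reset(void) {
    for (int i = 0; i < NUM_REGS; i++) reg_valid[i] = true;
}

/* Decode-stage check: stall if any source OR the destination is not valid.
 * Checking the destination too is why a scoreboard stalls on anti and output
 * dependences as well as flow dependences. */
bool scoreboard_must_stall(uint8_t src1, uint8_t src2, uint8_t dest) {
    return !reg_valid[src1] || !reg_valid[src2] || !reg_valid[dest];
}

/* When an instruction that writes 'dest' leaves Decode, mark the register pending... */
void scoreboard_issue(uint8_t dest)     { if (dest != 0) reg_valid[dest] = false; }
/* ...and mark it valid again when its write-back completes */
void scoreboard_writeback(uint8_t dest) { if (dest != 0) reg_valid[dest] = true;  }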

Not Stalling on Anti and Output Dependences
 What changes would you make to the scoreboard to enable this?

Approaches to Dependence Detection (II)
 Combinational dependence check logic
  Special logic that checks if any instruction in later stages is supposed to write to any source register of the instruction that is being decoded
   Yes: stall the instruction/pipeline
   No: no need to stall… no flow dependence
 Advantage:
  No need to stall on anti and output dependences
 Disadvantage:
  Logic is more complex than a scoreboard
  Logic becomes more complex as we make the pipeline deeper and wider (flash-forward: think superscalar execution)

Once You Detect the Dependence in Hardware
 What do you do afterwards?
 Observation: Dependence between two instructions is detected before the communicated data value becomes available
 Option 1: Stall the dependent instruction right away
 Option 2: Stall the dependent instruction only when necessary → data forwarding/bypassing
 Option 3: …

18-447 Computer Architecture
Lecture 8: Pipelining II: Data and Control Dependence Handling

Prof. Onur Mutlu
Carnegie Mellon University
Spring 2015, 2/2/2015

Agenda for Today & Next Few Lectures
 Single-cycle Microarchitectures
 Multi-cycle and Microprogrammed Microarchitectures
 Pipelining
 Issues in Pipelining: Control & Data Dependence Handling, State Maintenance and Recovery, …
 Out-of-Order Execution
 Issues in OoO Execution: Load-Store Handling, …

Readings for Next Few Lectures (I)
 P&H Chapter 4.9-4.11
 Smith and Sohi, "The Microarchitecture of Superscalar Processors," Proceedings of the IEEE, 1995
  More advanced pipelining
  Interrupt and exception handling
  Out-of-order and superscalar execution concepts
 McFarling, "Combining Branch Predictors," DEC WRL Technical Report, 1993.
 Kessler, "The Alpha 21264 Microprocessor," IEEE Micro, 1999.

Readings for Next Few Lectures (II)
 Smith and Pleszkun, "Implementing Precise Interrupts in Pipelined Processors," IEEE Trans. on Computers, 1988 (earlier version in ISCA 1985).

Recap of Last Lecture
 Wrap Up Microprogramming
  Horizontal vs. Vertical Microcode
  Nanocode vs. Millicode
 Pipelining
  Basic Idea and Characteristics of An Ideal Pipeline
  Pipelined Datapath and Control
  Issues in Pipeline Design
   Resource Contention
   Dependences and Their Types
    Control vs. data (flow, anti, output)
   Five Fundamental Ways of Handling Data Dependences
  Dependence Detection
   Interlocking
   Scoreboarding vs. Combinational

Review: Issues in Pipeline Design
 Balancing work in pipeline stages
  How many stages and what is done in each stage
 Keeping the pipeline correct, moving, and full in the presence of events that disrupt pipeline flow
  Handling dependences
   Data
   Control
  Handling resource contention
  Handling long-latency (multi-cycle) operations
 Handling exceptions, interrupts
 Advanced: Improving pipeline throughput
  Minimizing stalls

Review: Dependences and Their Types
 Also called "dependency" or less desirably "hazard"
 Dependences dictate ordering requirements between instructions
 Two types
  Data dependence
  Control dependence
 Resource contention is sometimes called resource dependence
  However, this is not fundamental to (dictated by) program semantics, so we will treat it separately

Review: Interlocking
 Detection of dependence between instructions in a pipelined processor to guarantee correct execution
 Software based interlocking vs. Hardware based interlocking
 MIPS acronym?

Review: Once You Detect the Dependence in Hardware
 What do you do afterwards?
 Observation: Dependence between two instructions is detected before the communicated data value becomes available
 Option 1: Stall the dependent instruction right away
 Option 2: Stall the dependent instruction only when necessary → data forwarding/bypassing
 Option 3: …

Data Forwarding/Bypassing
 Problem: A consumer (dependent) instruction has to wait in decode stage until the producer instruction writes its value in the register file
 Goal: We do not want to stall the pipeline unnecessarily
 Observation: The data value needed by the consumer instruction can be supplied directly from a later stage in the pipeline (instead of only from the register file)
 Idea: Add additional dependence check logic and data forwarding paths (buses) to supply the producer's value to the consumer right after the value is available
 Benefit: Consumer can move in the pipeline until the point the value can be supplied → less stalling

A Special Case of Data Dependence
 Control dependence
  Data dependence on the Instruction Pointer / Program Counter

Control Dependence
 Question: What should the fetch PC be in the next cycle?
 Answer: The address of the next instruction
  All instructions are control dependent on previous ones. Why?
 If the fetched instruction is a non-control-flow instruction:
  Next Fetch PC is the address of the next-sequential instruction
  Easy to determine if we know the size of the fetched instruction
 If the instruction that is fetched is a control-flow instruction:
  How do we determine the next Fetch PC?
 In fact, how do we know whether or not the fetched instruction is a control-flow instruction?

Data Dependence Handling: More Depth & Implementation

Remember: Data Dependence Types
 Flow dependence (Read-after-Write, RAW)
  r3 ← r1 op r2
  r5 ← r3 op r4
 Anti dependence (Write-after-Read, WAR)
  r3 ← r1 op r2
  r1 ← r4 op r5
 Output dependence (Write-after-Write, WAW)
  r3 ← r1 op r2
  r5 ← r3 op r4
  r3 ← r6 op r7

Remember: How to Handle Data Dependences
 Anti and output dependences are easier to handle
  write to the destination in one stage and in program order
 Flow dependences are more interesting
 Five fundamental ways of handling flow dependences
  Detect and wait until value is available in register file
  Detect and forward/bypass data to dependent instruction
  Detect and eliminate the dependence at the software level
   No need for the hardware to detect dependence
  Predict the needed value(s), execute "speculatively", and verify
  Do something else (fine-grained multithreading)
   No need to detect

Aside: Relevant Seminar Announcement
 Practical Data Value Speculation for Future High-End Processors
  Arthur Perais, INRIA (France)
  Thursday, Feb 5, 4:30-5:30pm, CIC Panther Hollow Room
 Summary:
  Value prediction (VP) was proposed to enhance the performance of superscalar processors by breaking RAW dependencies. However, it has generally been considered too complex to implement. During this presentation, we will review different sources of additional complexity and propose solutions to address them.
 http://www.ece.cmu.edu/~calcm/doku.php?id=seminars:seminars

RAW Dependence Handling
 Which of the following flow dependences lead to conflicts in the 5-stage pipeline?
  addi ra, r-, -   IF ID EX MEM WB
  addi r-, ra, -      IF ID EX MEM WB
  addi r-, ra, -         IF ID EX MEM
  addi r-, ra, -            IF ID EX
  addi r-, ra, -               IF ID?
  addi r-, ra, -                  IF

Register Data Dependence Analysis
 For each instruction type (R/I-Type, LW, SW, Br, J, Jr): which stage reads the register file and which stage writes it
  ID: read RF (R/I-Type, LW, SW, Br, Jr)
  WB: write RF (R/I-Type, LW)
 For a given pipeline, when is there a potential conflict between two data dependent instructions?
  dependence type: RAW, WAR, WAW?
  instruction types involved?
  distance between the two instructions?

Safe and Unsafe Movement of Pipeline
 i is the older instruction, j the younger one; stage X is where j accesses the register, stage Y where i does
  RAW: j reads a register that i writes
  WAR: j writes a register that i reads
  WAW: j writes a register that i writes
 dist(i,j) ≤ dist(X,Y) → Unsafe to keep j moving
 dist(i,j) > dist(X,Y) → Safe

RAW Dependence Analysis Example
 The register file is read in ID (by R/I-Type, LW, SW, Br, JR) and written in WB (by R/I-Type, LW)
 Instructions IA and IB (where IA comes before IB) have RAW dependence iff
  IB (R/I, LW, SW, Br or JR) reads a register written by IA (R/I or LW)
  dist(IA, IB) ≤ dist(ID, WB) = 3
 What about WAW and WAR dependence?
 What about memory data dependence?

Pipeline Stall: Resolving Data Dependence
 i: rx ← _ followed by j: _ ← rx
  dist(i,j)=1, 2, or 3: j must wait in ID (bubbles inserted) until i's value reaches the register file
  dist(i,j)=4: no stall needed
 Stall = make the dependent instruction wait until its source data value is available
  1. stop all up-stream stages
  2. drain all down-stream stages
[Figure: cycle-by-cycle pipeline diagram showing the bubbles inserted for each dependence distance]

How to Implement Stalling
 Stall
  disable PC and IR latching; ensure stalled instruction stays in its stage
  Insert "invalid" instructions/nops into the stage following the stalled one (called "bubbles")
[Figure: pipelined datapath with control, showing where PC and IF/ID latching is disabled (based on P&H CO&D)]

Stall Conditions
 Instructions IA and IB (where IA comes before IB) have RAW dependence iff
  IB (R/I, LW, SW, Br or JR) reads a register written by IA (R/I or LW)
  dist(IA, IB) ≤ dist(ID, WB) = 3
 In other words, must stall the ID stage when IB in ID stage wants to read a register to be written by IA in EX, MEM or WB stage

Stall Condition Evaluation Logic
 Helper functions
  rs(I) returns the rs field of I
  use_rs(I) returns true if I requires RF[rs] and rs != r0
 Stall when
  (rs(IR_ID) == dest_EX) && use_rs(IR_ID) && RegWrite_EX or
  (rs(IR_ID) == dest_MEM) && use_rs(IR_ID) && RegWrite_MEM or
  (rs(IR_ID) == dest_WB) && use_rs(IR_ID) && RegWrite_WB or
  (rt(IR_ID) == dest_EX) && use_rt(IR_ID) && RegWrite_EX or
  (rt(IR_ID) == dest_MEM) && use_rt(IR_ID) && RegWrite_MEM or
  (rt(IR_ID) == dest_WB) && use_rt(IR_ID) && RegWrite_WB
 It is crucial that the EX, MEM and WB stages continue to advance normally during stall cycles

Impact of Stall on Performance
 Each stall cycle corresponds to one lost cycle in which no instruction can be completed
 For a program with N instructions and S stall cycles, Average CPI = (N+S)/N
 S depends on
  frequency of RAW dependences
  exact distance between the dependent instructions
  distance between dependences
   suppose i1, i2 and i3 all depend on i0; once i1's dependence is resolved, i2 and i3 must be okay too
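The stall condition above translates almost directly into code. The sketch below is a hypothetical C rendering; the helper names mirror the slide's functions, while the MIPS field extraction and the simplified use_rs/use_rt definitions are assumptions for illustration (the real logic depends on the opcode).

#include <stdbool.h>
#include <stdint.h>

/* Per-stage bookkeeping for the instructions currently in EX, MEM, and WB */
typedef struct {
    uint8_t dest;        /* destination register number                 */
    bool    reg_write;   /* will this instruction write the register file? */
} stage_info_t;

static uint8_t rs(uint32_t ir) { return (ir >> 21) & 0x1F; }   /* MIPS rs field */
static uint8_t rt(uint32_t ir) { return (ir >> 16) & 0x1F; }   /* MIPS rt field */
static bool use_rs(uint32_t ir) { return rs(ir) != 0; }         /* simplified: should depend on opcode */
static bool use_rt(uint32_t ir) { return rt(ir) != 0; }         /* simplified: should depend on opcode */

/* Stall the instruction in ID if an older instruction still in the pipeline
 * will write one of its source registers. */
bool must_stall(uint32_t ir_id, stage_info_t ex, stage_info_t mem, stage_info_t wb) {
    bool rs_hazard =
        use_rs(ir_id) && ((rs(ir_id) == ex.dest  && ex.reg_write)  ||
                          (rs(ir_id) == mem.dest && mem.reg_write) ||
                          (rs(ir_id) == wb.dest  && wb.reg_write));
    bool rt_hazard =
        use_rt(ir_id) && ((rt(ir_id) == ex.dest  && ex.reg_write)  ||
                          (rt(ir_id) == mem.dest && mem.reg_write) ||
                          (rt(ir_id) == wb.dest  && wb.reg_write));
    return rs_hazard || rt_hazard;
}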

Sample Assembly (P&H)
 for (j=i-1; j>=0 && v[j] > v[j+1]; j-=1) { ...... }
  addi $s1, $s0, -1         3 stalls
  for2tst: slti $t0, $s1, 0 3 stalls
  bne $t0, $zero, exit2
  sll $t1, $s1, 2           3 stalls
  add $t2, $a0, $t1         3 stalls
  lw $t3, 0($t2)
  lw $t4, 4($t2)            3 stalls
  slt $t0, $t4, $t3         3 stalls
  beq $t0, $zero, exit2
  .........
  addi $s1, $s1, -1
  j for2tst
  exit2:

Reducing Stalls with Data Forwarding
 Also called Data Bypassing
 We have already seen the basic idea before
  Forward the value to the dependent instruction as soon as it is available
 Remember dataflow?
  Data value supplied to dependent instruction as soon as it is available
  Instruction executes when all its operands are available
 Data forwarding brings a pipeline closer to data flow execution principles

Data Forwarding (or Data Bypassing)
 It is intuitive to think of RF as state
  "add rx ry rz" literally means get values from RF[ry] and RF[rz] respectively and put result in RF[rx]
 But, RF is just a part of a communication abstraction
  "add rx ry rz" means
   1. get the results of the last instructions to define the values of RF[ry] and RF[rz], respectively,
   2. until another instruction redefines RF[rx], younger instructions that refer to RF[rx] should use this instruction's result
 What matters is to maintain the correct "data flow" between operations

Resolving RAW Dependence with Forwarding
 Instructions IA and IB (where IA comes before IB) have RAW dependence iff
  IB (R/I, LW, SW, Br or JR) reads a register written by IA (R/I or LW)
  dist(IA, IB) ≤ dist(ID, WB) = 3
 In other words, if IB in ID stage reads a register written by IA in EX, MEM or WB stage, then the operand required by IB is not yet in RF
  → retrieve operand from datapath instead of the RF
  → retrieve operand from the youngest definition if multiple definitions are outstanding
 Example: add rz ← r-, r- immediately followed by addi r- ← rz, r-

Data Forwarding Paths (v1)
 Forwarding paths cover dist(i,j) = 1, 2, and 3 (including internal forwarding in the register file)
[Figure: forwarding muxes into the ALU inputs, selected by a forwarding unit comparing Rs/Rt against EX/MEM.RegisterRd and MEM/WB.RegisterRd (based on P&H CO&D)]

Data Forwarding Paths (v2)
 Assumes RF forwards internally; forwarding paths needed only for dist(i,j) = 1 and 2
[Figure: same structure with the dist(i,j)=3 path handled inside the register file (based on P&H CO&D)]

Data Forwarding Logic (for v2)

    if (rs_EX != 0) && (rs_EX == dest_MEM) && RegWrite_MEM then
        forward operand from MEM stage       // dist = 1
    else if (rs_EX != 0) && (rs_EX == dest_WB) && RegWrite_WB then
        forward operand from WB stage        // dist = 2
    else
        use operand from register file       // dist >= 3

 Ordering matters!! Must check the youngest match first
 Why doesn't use_rs( ) appear in the forwarding logic?
 What does the above not take into account?

Data Forwarding (Dependence Analysis)

 [Table: for each instruction type (R/I-Type, LW, SW, Br, J, Jr), the pipeline stage in which each source register is used and the stage in which the result is produced]
 Even with data-forwarding, RAW dependence on an immediately preceding LW instruction requires a stall
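To make the selection order concrete, below is a small C sketch (illustrative names, not the actual control equations from the slides) of the v2 forwarding-unit decision for one ALU source operand; the same check is repeated for rt. It tests the youngest producer (EX/MEM) before the older one (MEM/WB), which is the ordering requirement noted above.

    // Illustrative forwarding-unit sketch for one ALU source operand (rs).
    // FWD_FROM_MEM / FWD_FROM_WB select a bypass path; FWD_NONE reads the RF value.
    typedef enum { FWD_NONE, FWD_FROM_MEM, FWD_FROM_WB } FwdSel;

    FwdSel forward_rs(unsigned rs_ex,
                      unsigned dest_mem, int regwrite_mem,
                      unsigned dest_wb,  int regwrite_wb)
    {
        if (rs_ex != 0 && regwrite_mem && rs_ex == dest_mem)
            return FWD_FROM_MEM;   // dist = 1: youngest definition wins
        if (rs_ex != 0 && regwrite_wb && rs_ex == dest_wb)
            return FWD_FROM_WB;    // dist = 2
        return FWD_NONE;           // dist >= 3: value is already in the register file
    }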

Sample Assembly, No Forwarding (P&H)

 for (j=i-1; j>=0 && v[j] > v[j+1]; j-=1) { ...... }

          addi $s1, $s0, -1
 for2tst: slti $t0, $s1, 0          3 stalls
          bne  $t0, $zero, exit2    3 stalls
          sll  $t1, $s1, 2
          add  $t2, $a0, $t1        3 stalls
          lw   $t3, 0($t2)          3 stalls
          lw   $t4, 4($t2)
          slt  $t0, $t4, $t3        3 stalls
          beq  $t0, $zero, exit2    3 stalls
          .........
          addi $s1, $s1, -1
          j    for2tst
 exit2:

Sample Assembly, Revisited (P&H)

 for (j=i-1; j>=0 && v[j] > v[j+1]; j-=1) { ...... }

          addi $s1, $s0, -1
 for2tst: slti $t0, $s1, 0
          bne  $t0, $zero, exit2
          sll  $t1, $s1, 2
          add  $t2, $a0, $t1
          lw   $t3, 0($t2)
          lw   $t4, 4($t2)
          nop
          slt  $t0, $t4, $t3
          beq  $t0, $zero, exit2
          .........
          addi $s1, $s1, -1
          j    for2tst
 exit2:

Pipelining the LC-3b


 Let’s remember the single-bus datapath

Pipelining the LC-3b  We’ll divide it into 5 stages


 Fetch
 Decode/RF Access
 Address Generation/Execute
 Memory
 Store Result

 Conservative handling of data and control dependences


 Stall on branch
 Stall on flow dependence

An Example LC-3b Pipeline

[Figures: the example LC-3b pipeline datapath, shown over several slides]

Control of the LC-3b Pipeline


 Three types of control signals

 Datapath Control Signals


 Control signals that control the operation of the datapath

 Control Store Signals


 Control signals (microinstructions) stored in control store to be
used in pipelined datapath (can be propagated to stages later
than decode)

 Stall Signals
 Ensure the pipeline operates correctly in the presence of
dependencies
461 462

Control Store in a Pipelined Machine

[Figure: control store microinstruction bits read at decode and carried down the pipeline with the instruction]

Stall Signals

 Pipeline stall: Pipeline does not move because an operation in a stage cannot complete
 Stall Signals: Ensure the pipeline operates correctly in the presence of such an operation
 Why could an operation in a stage not complete?

Pipelined LC-3b

 http://www.ece.cmu.edu/~ece447/s14/lib/exe/fetch.php?media=18447-lc3b-pipelining.pdf

End of Pipelining the LC-3b

Questions to Ponder

 What is the role of the hardware vs. the software in data dependence handling?
   Software based interlocking
   Hardware based interlocking
   Who inserts/manages the pipeline bubbles?
   Who finds the independent instructions to fill "empty" pipeline slots?
   What are the advantages/disadvantages of each?
10-11-2023

Questions to Ponder

 What is the role of the hardware vs. the software in the order in which instructions are executed in the pipeline?
   Software based instruction scheduling → static scheduling
   Hardware based instruction scheduling → dynamic scheduling

More on Software vs. Hardware

 Software based scheduling of instructions → static scheduling
   Compiler orders the instructions, hardware executes them in that order
   Contrast this with dynamic scheduling (in which hardware can execute instructions out of the compiler-specified order)
 How does the compiler know the latency of each instruction?
 What information does the compiler not know that makes static scheduling difficult?
   Answer: Anything that is determined at run time
   Variable-length operation latency, memory addr, branch direction
 How can the compiler alleviate this (i.e., estimate the unknown)?
   Answer: Profiling

Control Dependence Handling

Review: Control Dependence

 Question: What should the fetch PC be in the next cycle?
 Answer: The address of the next instruction
   All instructions are control dependent on previous ones. Why?
 If the fetched instruction is a non-control-flow instruction:
   Next Fetch PC is the address of the next-sequential instruction
   Easy to determine if we know the size of the fetched instruction
 If the instruction that is fetched is a control-flow instruction:
   How do we determine the next Fetch PC?
 In fact, how do we even know whether or not the fetched instruction is a control-flow instruction?
10-11-2023

Branch Types

 Type           Direction at    Number of possible      When is next
                fetch time      next fetch addresses?   fetch address resolved?
 Conditional    Unknown         2                       Execution (register dependent)
 Unconditional  Always taken    1                       Decode (PC + offset)
 Call           Always taken    1                       Decode (PC + offset)
 Return         Always taken    Many                    Execution (register dependent)
 Indirect       Always taken    Many                    Execution (register dependent)

 Different branch types can be handled differently

How to Handle Control Dependences

 Critical to keep the pipeline full with correct sequence of dynamic instructions.
 Potential solutions if the instruction is a control-flow instruction:
   Stall the pipeline until we know the next fetch address
   Guess the next fetch address (branch prediction)
   Employ delayed branching (branch delay slot)
   Do something else (fine-grained multithreading)
   Eliminate control-flow instructions (predicated execution)
   Fetch from both possible paths (if you know the addresses of both possible paths) (multipath execution)

Stall Fetch Until Next PC is Available: Good Idea?

          t0  t1  t2  t3  t4  t5
 Insth    IF  ID  ALU MEM WB
 Insti        IF  IF  ID  ALU MEM WB
 Instj            IF  IF  ID  ALU
 Instk                IF  IF
 Instl

 This is the case with non-control-flow and unconditional br instructions!

Doing Better than Stalling Fetch …

 Rather than waiting for true-dependence on PC to resolve, just guess nextPC = PC+4 to keep fetching every cycle
   Is this a good guess?
   What do you lose if you guessed incorrectly?
 ~20% of the instruction mix is control flow
   ~50% of "forward" control flow (i.e., if-then-else) is taken
   ~90% of "backward" control flow (i.e., loop back) is taken
 Overall, typically ~70% taken and ~30% not taken [Lee and Smith, 1984]
 Expect "nextPC = PC+4" ~86% of the time, but what about the remaining 14%?

Guessing NextPC = PC + 4 Guessing NextPC = PC + 4


 Always predict the next sequential instruction is the next  How else can you make this more effective?
instruction to be executed
 This is a form of next fetch address prediction (and branch  Idea: Get rid of control flow instructions (or minimize their
prediction) occurrence)

 How can you make this more effective?


 How?
1. Get rid of unnecessary control flow instructions 
 Idea: Maximize the chances that the next sequential combine predicates (predicate combining)
instruction is the next instruction to be executed 2. Convert control dependences into data dependences 
 Software: Lay out the control flow graph such that the “likely predicated execution
next instruction” is on the not-taken path of a branch
 Profile guided code positioning  Pettis & Hansen, PLDI 1990.
 Hardware: ??? (how can you do this in hardware…)
 Cache traces of executed instructions  Trace cache
477 478

Predicate Combining (not Predicated Execution)


 Complex predicates are converted into multiple branches
 if ((a == b) && (c < d) && (a > 5000)) { … } 18-447
3 conditional branches

Problem: This increases the number of control


Computer Architecture
dependencies Lecture 9: Branch Prediction I
 Idea: Combine predicate operations to feed a single branch
instruction instead of having one branch for each
 Predicates stored and operated on using condition registers
 A single branch checks the value of the combined predicate
Prof. Onur Mutlu
+ Fewer branches in code  fewer mispredictions/stalls
Carnegie Mellon University
-- Possibly unnecessary work
Spring 2015, 2/4/2015
-- If the first predicate is false, no need to compute other predicates
 Condition registers exist in IBM RS6000 and the POWER architecture
479
10-11-2023

Agenda for Today & Next Few Lectures Reminder: Readings for Next Few Lectures (I)
 Single-cycle Microarchitectures  P&H Chapter 4.9-4.11

 Multi-cycle and Microprogrammed Microarchitectures  Smith and Sohi, “The Microarchitecture of Superscalar
Processors,” Proceedings of the IEEE, 1995
 Pipelining  More advanced pipelining
 Interrupt and exception handling
 Out-of-order and superscalar execution concepts
 Issues in Pipelining: Control & Data Dependence Handling,
State Maintenance and Recovery, …
 McFarling, “Combining Branch Predictors,” DEC WRL
Technical Report, 1993. HW3 summary paper
 Out-of-Order Execution

 Kessler, “The Alpha 21264 Microprocessor,” IEEE Micro


 Issues in OoO Execution: Load-Store Handling, …
1999.
481 482

Reminder: Readings for Next Few Lectures (II) Reminder: Relevant Seminar Tomorrow
 Smith and Pleszkun, “Implementing Precise Interrupts in  Practical Data Value Speculation for Future High-End
Pipelined Processors,” IEEE Trans on Computers 1988 Processors
(earlier version in ISCA 1985). HW3 summary paper  Arthur Perais, INRIA (France)
 Thursday, Feb 5, 4:30-5:30pm, CIC Panther Hollow Room

 Summary:
 Value prediction (VP) was proposed to enhance the
performance of superscalar processors by breaking RAW
dependencies. However, it has generally been considered too
complex to implement. During this presentation, we will
review different sources of additional complexity and propose
solutions to address them.

 http://www.ece.cmu.edu/~calcm/doku.php?id=seminars:se
minars
483 484
10-11-2023

Recap of Last Lecture Tentative Plan for Friday and Monday


Data Dependence Handling

 I will be out of town
 Data Forwarding/Bypassing
 Attending the HPCA Conference
 In-depth Implementation
 Register dependence analysis
 Stalling  We will finish Branch Prediction on either of these days
 Performance analysis with and without forwarding
 LC-3b Pipelining
 Questions to Ponder  Lab 2 is due Friday
 HW vs. SW handling of data dependences  Step 1: Get the baseline functionality correct
 Static versus dynamic scheduling
 What makes compiler based instruction scheduling difficult?
 Step 2: Do the extra credit portion (it will be rewarding)
 Profiling (representative input sets needed; dynamic adaptation difficult)
 Introduction to static instruction scheduling (e.g., fix-up code)  Tentative Plan:
 Control Dependence Handling  Friday: Recitation session  Come with questions on Lab 2,
 Six ways of handling control dependences HW 2, lectures, concepts, etc
 Stalling until next fetch address is available: Bad idea
 Predicting the next-sequential instruction as next fetch address
 Monday: Finish branch prediction (Rachata)
485 486

Sample Papers from HPCA


 Donghyuk Lee+, “Adaptive Latency DRAM: Optimizing
DRAM Timing for the Common Case,” HPCA 2015.
 http://users.ece.cmu.edu/~omutlu/pub/adaptive-latency-
dram_hpca15.pdf
Control Dependence Handling
 Gennady Pekhimenko+, “Exploiting Compressed Block Size
as an Indicator of Future Reuse,” HPCA 2015.
 http://users.ece.cmu.edu/~omutlu/pub/compression-aware-
cache-management_hpca15.pdf

 Yu Cai, Yixin Luo+, “Data Retention in MLC NAND Flash


Memory: Characterization, Optimization and Recovery,”
HPCA 2015.
 http://users.ece.cmu.edu/~omutlu/pub/flash-memory-data-
retention_hpca15.pdf
487 488
10-11-2023

Review: Control Dependence How to Handle Control Dependences


 Question: What should the fetch PC be in the next cycle?  Critical to keep the pipeline full with correct sequence of
dynamic instructions.
 If the instruction that is fetched is a control-flow instruction:
 How do we determine the next Fetch PC?  Potential solutions if the instruction is a control-flow
instruction:
 In fact, how do we even know whether or not the fetched
instruction is a control-flow instruction?  Stall the pipeline until we know the next fetch address
 Guess the next fetch address (branch prediction)
 Employ delayed branching (branch delay slot)
 Do something else (fine-grained multithreading)
 Eliminate control-flow instructions (predicated execution)
 Fetch from both possible paths (if you know the addresses
of both possible paths) (multipath execution)
489 490

Review: Guessing NextPC = PC + 4 Review: Guessing NextPC = PC + 4


 Always predict the next sequential instruction is the next  How else can you make this more effective?
instruction to be executed
 This is a form of next fetch address prediction (and branch  Idea: Get rid of control flow instructions (or minimize their
prediction) occurrence)

 How can you make this more effective?


 How?
1. Get rid of unnecessary control flow instructions 
 Idea: Maximize the chances that the next sequential combine predicates (predicate combining)
instruction is the next instruction to be executed 2. Convert control dependences into data dependences 
 Software: Lay out the control flow graph such that the “likely predicated execution
next instruction” is on the not-taken path of a branch
 Profile guided code positioning  Pettis & Hansen, PLDI 1990.
 Hardware: ??? (how can you do this in hardware…)
 Cache traces of executed instructions  Trace cache
491 492
10-11-2023

Review: Predicate Combining (not Predicated Execution) Predicated Execution


 Complex predicates are converted into multiple branches  Idea: Convert control dependence to data dependence
 if ((a == b) && (c < d) && (a > 5000)) { … }
 3 conditional branches  Simple example: Suppose we had a Conditional Move
instruction…
 Problem: This increases the number of control
dependencies  CMOV condition, R1  R2
 R1 = (condition == true) ? R2 : R1
 Idea: Combine predicate operations to feed a single branch
instruction instead of having one branch for each  Employed in most modern ISAs (x86, Alpha)
 Predicates stored and operated on using condition registers
 A single branch checks the value of the combined predicate  Code example with branches vs. CMOVs
+ Fewer branches in code  fewer mispredictions/stalls
-- Possibly unnecessary work
CMPEQ condition, a, 5;
-- If the first predicate is false, no need to compute other predicates
CMOV condition, b  4;
 Condition registers exist in IBM RS6000 and the POWER architecture
CMOV !condition, b  3;
493 494

Conditional Execution in ARM Predicated Execution


 Same as predicated execution  Eliminates branches  enables straight line code (i.e.,
larger basic blocks in code)
 Every instruction is conditionally executed  Advantages
 Always-not-taken prediction works better (no branches)
 Compiler has more freedom to optimize code (no branches)
 control flow does not hinder inst. reordering optimizations
 code optimizations hindered only by data dependencies

 Disadvantages
 Useless work: some instructions fetched/executed but
discarded (especially bad for easy-to-predict branches)
 Requires additional ISA support

 Can we eliminate all branches this way?


495 496
10-11-2023

Predicated Execution How to Handle Control Dependences


 We will get back to this…  Critical to keep the pipeline full with correct sequence of
dynamic instructions.
 Some readings (optional):
 Allen et al., “Conversion of control dependence to data  Potential solutions if the instruction is a control-flow
dependence,” POPL 1983. instruction:
 Kim et al., “Wish Branches: Combining Conditional Branching
and Predication for Adaptive Predicated Execution,” MICRO  Stall the pipeline until we know the next fetch address
2005.
 Guess the next fetch address (branch prediction)
 Employ delayed branching (branch delay slot)
 Do something else (fine-grained multithreading)
 Eliminate control-flow instructions (predicated execution)
 Fetch from both possible paths (if you know the addresses
of both possible paths) (multipath execution)
497 498

Delayed Branching (I)

 Change the semantics of a branch instruction
   Branch after N instructions
   Branch after N cycles
 Idea: Delay the execution of a branch. N instructions (delay slots) that come after the branch are always executed regardless of branch direction.
 Problem: How do you find instructions to fill the delay slots?
   Branch must be independent of delay slot instructions
 Unconditional branch: Easier to find instructions to fill the delay slot
 Conditional branch: Condition computation should not depend on instructions in delay slots → difficult to fill the delay slot

Delayed Branching (II)

 [Figure: instruction timelines for normal code (6 cycles) vs. delayed branch code (5 cycles); in the delayed branch version, a branch-independent instruction (A) from before the branch fills the delay slot after the branch BC X, so the cycle after the branch does useful work]
10-11-2023

Fancy Delayed Branching (III) Delayed Branching (IV)


 Delayed branch with squashing  Advantages:
 In SPARC + Keeps the pipeline full with useful instructions in a simple way assuming
 Semantics: If the branch falls through (i.e., it is not taken), 1. Number of delay slots == number of instructions to keep the pipeline
full before the branch resolves
the delay slot instruction is not executed
2. All delay slots can be filled with useful instructions
 Why could this help?
Normal code: Delayed branch code: Delayed branch w/ squashing:
 Disadvantages:
X: A X: A A -- Not easy to fill the delay slots (even with a 2-stage pipeline)
B B X: B 1. Number of delay slots increases with pipeline depth, superscalar
C C C execution width
BC X BC X BC X 2. Number of delay slots should be variable with variable latency
operations. Why?
D NOP A
-- Ties ISA semantics to hardware implementation
E D D
-- SPARC, MIPS, HP-PA: 1 delay slot
E E
-- What if pipeline implementation changes with the next design?
501 502

An Aside: Filling the Delay Slot

 [Figure: three ways to fill the branch delay slot (a. from before the branch, b. from the branch target, c. from the fall-through path). Reordering a data-independent (no RAW, WAW, WAR) instruction from within the same basic block does not change program semantics; filling from the target or the fall-through path is only safe if, for correctness, a new instruction is added to the not-taken path or the taken path, respectively.]
 [Based on original figure from P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]

How to Handle Control Dependences

 Critical to keep the pipeline full with correct sequence of dynamic instructions.
 Potential solutions if the instruction is a control-flow instruction:
   Stall the pipeline until we know the next fetch address
   Guess the next fetch address (branch prediction)
   Employ delayed branching (branch delay slot)
   Do something else (fine-grained multithreading)
   Eliminate control-flow instructions (predicated execution)
   Fetch from both possible paths (if you know the addresses of both possible paths) (multipath execution)

Fine-Grained Multithreading Fine-grained Multithreading (II)


 Idea: Hardware has multiple thread contexts. Each cycle,  Idea: Switch to another thread every cycle such that no two
fetch engine fetches from a different thread. instructions from a thread are in the pipeline concurrently
 By the time the fetched branch/instruction resolves, no
instruction is fetched from the same thread  Tolerates the control and data dependency latencies by
 Branch/instruction resolution latency overlapped with overlapping the latency with useful work from other threads
execution of other threads’ instructions
 Improves pipeline utilization by taking advantage of multiple
threads
+ No logic needed for handling control and
data dependences within a thread  Thornton, “Parallel Operation in the Control Data 6600,” AFIPS
-- Single thread performance suffers 1964.
-- Extra logic for keeping thread contexts  Smith, “A pipelined, shared resource MIMD computer,” ICPP 1978.
-- Does not overlap latency if not enough
threads to cover the whole pipeline
505 506

Fine-grained Multithreading: History Fine-grained Multithreading in HEP


 CDC 6600’s peripheral processing unit is fine-grained  Cycle time: 100ns
multithreaded
 Thornton, “Parallel Operation in the Control Data 6600,” AFIPS 1964.  8 stages  800 ns to
 Processor executes a different I/O thread every cycle complete an
 An operation from the same thread is executed every 10 cycles instruction
 assuming no memory
 Denelcor HEP (Heterogeneous Element Processor) access
 Smith, “A pipelined, shared resource MIMD computer,” ICPP 1978.
 120 threads/processor  No control and data
 available queue vs. unavailable (waiting) queue for threads dependency checking
 each thread can have only 1 instruction in the processor pipeline; each thread
independent
 to each thread, processor looks like a non-pipelined machine
 system throughput vs. single thread performance tradeoff

507 508
10-11-2023

Multithreaded Pipeline Example Sun Niagara Multithreaded Pipeline

Kongetira et al., “Niagara: A 32-Way Multithreaded Sparc Processor,” IEEE Micro 2005.
Slide credit: Joel Emer 509 510

Fine-grained Multithreading How to Handle Control Dependences


 Advantages  Critical to keep the pipeline full with correct sequence of
+ No need for dependency checking between instructions dynamic instructions.
(only one instruction in pipeline from a single thread)
+ No need for branch prediction logic
 Potential solutions if the instruction is a control-flow
+ Otherwise-bubble cycles used for executing useful instructions from
instruction:
different threads
+ Improved system throughput, latency tolerance, utilization
 Stall the pipeline until we know the next fetch address
 Disadvantages  Guess the next fetch address (branch prediction)
- Extra hardware complexity: multiple hardware contexts (PCs, register  Employ delayed branching (branch delay slot)
files, …), thread selection logic  Do something else (fine-grained multithreading)
- Reduced single thread performance (one instruction fetched every N
cycles from the same thread)  Eliminate control-flow instructions (predicated execution)
- Resource contention between threads in caches and memory  Fetch from both possible paths (if you know the addresses
- Some dependency checking logic between threads remains (load/store) of both possible paths) (multipath execution)
511 512
10-11-2023

Branch Prediction: Guess the Next Instruction to Fetch

Branch Prediction PC ??
0x0006
0x0008
0x0007
0x0005
0x0004

I-$ DEC RF WB
0x0001
LD R1, MEM[R0]
0x0002 D-$
ADD R2, R2, #1
0x0003
BRZERO 0x0001
0x0004
ADD R3, R2, #1 12 cycles
0x0005
MUL R1, R2, R3
0x0006
LD R2, MEM[R2] Branch prediction
0x0007
LD R0, MEM[R2]
8 cycles

513

Misprediction Penalty Branch Prediction


 Processors are pipelined to increase concurrency
 How do we keep the pipeline full in the presence of branches?
PC
 Guess the next instruction when a branch is fetched

I-$ DEC RF WB  Requires guessing the direction and target of a branch


0x0001
LD R1, MEM[R0] 0x0007 0x0006 0x0005 0x0004 0x0003
0x0002
ADD R2, R2, #1 D-$ A Branch condition, TARGET
0x0003
BRZERO 0x0001
B1 B3 Pipeline
0x0004
ADD R3, R2, #1 Fetch Decode Rename Schedule RegisterRead Execute
0x0005 D
MUL R1, R2, R3 AD B1
E
B3
B1
F D
E
F A
A B1
F
D A
E B1
E B1
D
F F
D F
E B1
A DE F
A B1
A F
E B1
D D
A E B1
E B1
F
D
A D
E B1
A A
A B1
D A
0x0006
LD R2, MEM[R2]
E What
Target
Fetchtofrom
Misprediction
fetch
thenext? Detected!
correct Verify
target Flush the the Prediction
pipeline
0x0007
LD R0, MEM[R2]
F

516
10-11-2023

Branch Prediction: Always PC+4 Pipeline Flush on a Misprediction

t0 t1 t2 t3 t4 t5 t0 t1 t2 t3 t4 t5
Insth IFPC ID ALU MEM Insth IFPC ID ALU MEM WB
Insti IFPC+4 ID ALU Insti IFPC+4 ID killed
Instj IFPC+8 ID Instj IFPC+8 killed
Instk IFtarget Instk IFtarget ID ALU WB
Instl Insth branch condition and target Instl IF ID ALU
evaluated in ALU IF ID
When a branch resolves IF
- branch target (Instk) is fetched
- all instructions fetched since
insth (so called “wrong-path”
Insth is a branch instructions) must be flushed517 Insth is a branch 518

Performance Analysis

 correct guess → no penalty              ~86% of the time
 incorrect guess → 2 bubbles
 Assume
   no data dependency related stalls
   20% control flow instructions
   70% of control flow instructions are taken
 CPI = [ 1 + (0.20*0.7) * 2 ] = [ 1 + 0.14 * 2 ] = 1.28
         (probability of a wrong guess) x (penalty for a wrong guess)
 Can we reduce either of the two penalty terms?

Reducing Branch Misprediction Penalty

 Resolve branch condition and target address early
 [Figure: pipelined datapath modified to resolve branches in the ID stage -- an IF.Flush signal, with the register comparator and the branch-target adder moved next to the register file. Based on original figure from P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
 Is this a good idea?
 CPI = [ 1 + (0.2*0.7) * 1 ] = 1.14
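The CPI numbers above all come from one formula: CPI = 1 + (fraction of control-flow instructions) x (misprediction rate) x (misprediction penalty in bubbles). A small, purely illustrative C check of the two cases on this slide:

    #include <stdio.h>

    // CPI with no data stalls: 1 + branch_fraction * mispredict_rate * penalty_cycles
    static double cpi(double branch_fraction, double mispredict_rate, double penalty_cycles)
    {
        return 1.0 + branch_fraction * mispredict_rate * penalty_cycles;
    }

    int main(void)
    {
        // Always guess PC+4, branch resolved in EX: 70% of branches mispredicted, 2 bubbles
        printf("%.2f\n", cpi(0.20, 0.7, 2));   // 1.28
        // Branch resolved in ID instead: penalty drops to 1 bubble
        printf("%.2f\n", cpi(0.20, 0.7, 1));   // 1.14
        return 0;
    }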

Branch Prediction (Enhanced)

 Idea: Predict the next fetch address (to be used in the next cycle)
 Requires three things to be predicted at fetch stage:
   Whether the fetched instruction is a branch
   (Conditional) branch direction
   Branch target address (if taken)
 Observation: Target address remains the same for a conditional direct branch across dynamic instances
   Idea: Store the target address from previous instance and access it with the PC
   Called Branch Target Buffer (BTB) or Branch Target Address Cache

Fetch Stage with BTB and Direction Prediction

 [Figure: the Program Counter (address of the current branch) indexes both a direction predictor (taken?) and a Cache of Target Addresses (BTB: Branch Target Buffer); on a BTB hit with a taken prediction, the stored target address becomes the Next Fetch Address, otherwise PC + inst size is used]
 Always taken CPI = [ 1 + (0.20*0.3) * 2 ] = 1.12 (70% of branches taken)

More Sophisticated Branch Direction Prediction

 [Figure: fetch engine in which the global branch history (which direction earlier branches went) is XORed with the address of the current branch to index the direction predictor, alongside the Cache of Target Addresses (BTB: Branch Target Buffer) that supplies the target address; a hit plus a taken prediction overrides PC + inst size as the Next Fetch Address]

Three Things to Be Predicted

 Requires three things to be predicted at fetch stage:
   1. Whether the fetched instruction is a branch
   2. (Conditional) branch direction
   3. Branch target address (if taken)
 Third (3.) can be accomplished using a BTB
   Remember target address computed last time branch was executed
 First (1.) can be accomplished using a BTB
   If BTB provides a target address for the program counter, then it must be a branch
   Or, we can store "branch metadata" bits in instruction cache/memory → partially decoded instruction stored in I-cache
 Second (2.): How do we predict the direction?

Simple Branch Direction Prediction Schemes More Sophisticated Direction Prediction


 Compile time (static)  Compile time (static)
 Always not taken  Always not taken
 Always taken  Always taken
 BTFN (Backward taken, forward not taken)  BTFN (Backward taken, forward not taken)
 Profile based (likely direction)  Profile based (likely direction)
 Program analysis based (likely direction)
 Run time (dynamic)
 Last time prediction (single-bit)  Run time (dynamic)
 Last time prediction (single-bit)
 Two-bit counter based prediction
 Two-level prediction (global vs. local)
 Hybrid

525 526

Static Branch Prediction (I) Static Branch Prediction (II)


 Always not-taken  Profile-based
 Simple to implement: no need for BTB, no direction prediction  Idea: Compiler determines likely direction for each branch
 Low accuracy: ~30-40% (for conditional branches) using a profile run. Encodes that direction as a hint bit in the
 Remember: Compiler can layout code such that the likely path branch instruction format.
is the “not-taken” path  more effective prediction
+ Per branch prediction (more accurate than schemes in
 Always taken previous slide)  accurate if profile is representative!
 No direction prediction -- Requires hint bits in the branch instruction format
 Better accuracy: ~60-70% (for conditional branches) -- Accuracy depends on dynamic branch behavior:
 Backward branches (i.e. loop branches) are usually taken TTTTTTTTTTNNNNNNNNNN  50% accuracy
 Backward branch: target address lower than branch PC TNTNTNTNTNTNTNTNTNTN  50% accuracy
-- Accuracy depends on the representativeness of profile input
 Backward taken, forward not taken (BTFN) set
 Predict backward (loop) branches as taken, others not-taken
527 528
10-11-2023

Static Branch Prediction (III) Static Branch Prediction (IV)


 Program-based (or, program analysis based)  Programmer-based
 Idea: Use heuristics based on program analysis to determine statically-  Idea: Programmer provides the statically-predicted direction
predicted direction  Via pragmas in the programming language that qualify a branch as
 Example opcode heuristic: Predict BLEZ as NT (negative integers used likely-taken versus likely-not-taken
as error values in many programs)
 Example loop heuristic: Predict a branch guarding a loop execution as
taken (i.e., execute the loop)
+ Does not require profiling or program analysis
 Pointer and FP comparisons: Predict not equal
+ Programmer may know some branches and their program better than
other analysis techniques
+ Does not require profiling -- Requires programming language, compiler, ISA support
-- Heuristics might be not representative or good -- Burdens the programmer?
-- Requires compiler analysis and ISA support (ditto for other static methods)

 Ball and Larus, ”Branch prediction for free,” PLDI 1993.


 20% misprediction rate
529 530

Pragmas Static Branch Prediction


 Idea: Keywords that enable a programmer to convey hints  All previous techniques can be combined
to lower levels of the transformation hierarchy  Profile based
 Program based
 if (likely(x)) { ... }  Programmer based
 if (unlikely(error)) { … }
 How would you do that?
 Many other hints and optimizations can be enabled with
pragmas  What is the common disadvantage of all three techniques?
 E.g., whether a loop can be parallelized  Cannot adapt to dynamic changes in branch behavior
 #pragma omp parallel  This can be mitigated by a dynamic compiler, but not at a fine
granularity (and a dynamic compiler has its overheads…)
 Description
 What is a Dynamic Compiler?
 The omp parallel directive explicitly instructs the compiler to
parallelize the chosen segment of code.  Remember Transmeta? Code Morphing Software?
 Java JIT (just in time) compiler, Microsoft CLR (common lang. runtime)
531 532
10-11-2023

Dynamic Branch Prediction

 Idea: Predict branches based on dynamic information (collected at run-time)
 Advantages
   + Prediction based on history of the execution of branches
   + It can adapt to dynamic changes in branch behavior
   + No need for static profiling: input set representativeness problem goes away
 Disadvantages
   -- More complex (requires additional hardware)

Last Time Predictor

 Last time predictor
   Single bit per branch (stored in BTB)
   Indicates which direction branch went last time it executed
   TTTTTTTTTTNNNNNNNNNN → 90% accuracy
   Always mispredicts the last iteration and the first iteration of a loop branch
   Accuracy for a loop with N iterations = (N-2)/N
   + Loop branches for loops with large N (number of iterations)
   -- Loop branches for loops with small N (number of iterations)
   TNTNTNTNTNTNTNTNTNTN → 0% accuracy

 Last-time predictor CPI = [ 1 + (0.20*0.15) * 2 ] = 1.06 (Assuming 85% accuracy)
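A minimal C sketch of the last-time idea (table size and indexing are illustrative choices, not from the slides): one bit per entry, indexed by the low-order PC bits, predicting whatever the branch did last time.

    #include <stdbool.h>
    #include <stdint.h>

    #define BHT_ENTRIES 4096                   // illustrative size

    static bool bht[BHT_ENTRIES];              // 1 bit per entry: taken last time?

    static unsigned bht_index(uint32_t pc) {
        return (pc >> 2) & (BHT_ENTRIES - 1);  // drop byte offset, mask to table size
    }

    bool predict_last_time(uint32_t pc) {
        return bht[bht_index(pc)];             // predict same direction as last time
    }

    void update_last_time(uint32_t pc, bool taken) {
        bht[bht_index(pc)] = taken;            // remember the actual outcome
    }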

Implementing the Last-Time Predictor

 [Figure: the PC indexes a tagged BTB (one target address per entry) and an N-entry BHT with one bit per entry; on a tag match, the bit selects between the stored target address and PC+4 as nextPC]
 The 1-bit BHT (Branch History Table) entry is updated with the correct outcome after each execution of a branch

State Machine for Last-Time Prediction

 [Figure: two-state machine with "predict not taken" (0) and "predict taken" (1) states; "actually taken" moves to the taken state, "actually not taken" moves to the not-taken state]

Improving the Last Time Predictor

 Problem: A last-time predictor changes its prediction from T→NT or NT→T too quickly
   even though the branch may be mostly taken or mostly not taken
 Solution Idea: Add hysteresis to the predictor so that prediction does not change on a single different outcome
   Use two bits to track the history of predictions for a branch instead of a single bit
   Can have 2 states for T or NT instead of 1 state for each
 + Better prediction accuracy
 -- More hardware cost (but counter can be part of a BTB entry)

Two-Bit Counter Based Prediction

 Each branch associated with a two-bit counter
   One more bit provides hysteresis
   A strong prediction does not change with one single different outcome
 Accuracy for a loop with N iterations = (N-1)/N
   TNTNTNTNTNTNTNTNTNTN → 50% accuracy (assuming counter initialized to weakly taken)
 Smith, “A Study of Branch Prediction Strategies,” ISCA 1981.

 2BC predictor CPI = [ 1 + (0.20*0.10) * 2 ] = 1.04 (90% accuracy)
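A small C sketch of a 2-bit saturating counter predictor (again with an illustrative table size): the top half of the counter range gives the prediction, and the counter moves one step toward the actual outcome on each update, so a single atypical outcome does not flip a strong prediction.

    #include <stdbool.h>
    #include <stdint.h>

    #define PHT2_ENTRIES 4096                      // illustrative size

    // 0 = strongly not-taken, 1 = weakly not-taken, 2 = weakly taken, 3 = strongly taken
    static uint8_t pht2[PHT2_ENTRIES];

    static unsigned pht2_index(uint32_t pc) {
        return (pc >> 2) & (PHT2_ENTRIES - 1);
    }

    bool predict_2bc(uint32_t pc) {
        return pht2[pht2_index(pc)] >= 2;          // upper half of counter = predict taken
    }

    void update_2bc(uint32_t pc, bool taken) {
        uint8_t *ctr = &pht2[pht2_index(pc)];
        if (taken  && *ctr < 3) (*ctr)++;          // saturate at 3
        if (!taken && *ctr > 0) (*ctr)--;          // saturate at 0
    }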

State Machine for 2-bit Saturating Counter

 Counter using saturating arithmetic
   Arithmetic with maximum and minimum values
 [Figure: four-state machine over counter values 11 ("strongly taken"), 10 ("weakly taken"), 01 ("weakly !taken"), 00 ("strongly !taken"); "actually taken" moves the counter toward 11, "actually !taken" moves it toward 00]

Hysteresis Using a 2-bit Counter

 [Figure: the same four states grouped into pred taken / pred !taken regions, showing that a strong state tolerates a single wrong outcome]
 Change prediction after 2 consecutive mistakes

Is This Good Enough? Rethinking the Branch Problem


 ~85-90% accuracy for many programs with 2-bit counter  Control flow instructions (branches) are frequent
based prediction (also called bimodal prediction)  15-25% of all instructions

 Is this good enough?  Problem: Next fetch address after a control-flow instruction
is not determined after N cycles in a pipelined processor
 How big is the branch problem?  N cycles: (minimum) branch resolution latency

 If we are fetching W instructions per cycle (i.e., if the


pipeline is W wide)
 A branch misprediction leads to N x W wasted instruction slots

541 542

Importance of The Branch Problem

 Assume N = 20 (20 pipe stages), W = 5 (5 wide fetch)
 Assume: 1 out of 5 instructions is a branch
 Assume: Each 5 instruction-block ends with a branch

 How long does it take to fetch 500 instructions?
   100% accuracy
     100 cycles (all instructions fetched on the correct path)
     No wasted work
   99% accuracy
     100 (correct path) + 20 (wrong path) = 120 cycles
     20% extra instructions fetched
   98% accuracy
     100 (correct path) + 20 * 2 (wrong path) = 140 cycles
     40% extra instructions fetched
   95% accuracy
     100 (correct path) + 20 * 5 (wrong path) = 200 cycles
     100% extra instructions fetched

Can We Do Better?

 Last-time and 2BC predictors exploit "last-time" predictability
 Realization 1: A branch's outcome can be correlated with other branches' outcomes
   Global branch correlation
 Realization 2: A branch's outcome can be correlated with past outcomes of the same branch (other than the outcome of the branch "last-time" it was executed)
   Local branch correlation
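The cycle counts above come from a simple model: 100 blocks of 5 instructions are fetched in 100 cycles, and each mispredicted block-ending branch wastes N = 20 cycles of wrong-path fetch. A short, purely illustrative C check of those numbers:

    #include <stdio.h>

    int main(void)
    {
        const int blocks = 100;          // 500 instructions, 5-wide fetch -> 100 fetch cycles
        const int pipe_depth = 20;       // N = 20 wasted cycles per misprediction
        const double accuracies[] = { 1.00, 0.99, 0.98, 0.95 };

        for (int i = 0; i < 4; i++) {
            double mispredictions = blocks * (1.0 - accuracies[i]);
            double cycles = blocks + mispredictions * pipe_depth;
            printf("accuracy %.0f%% -> %.0f cycles\n", accuracies[i] * 100, cycles);
        }
        return 0;                        // prints 100, 120, 140 and 200 cycles
    }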
10-11-2023

Agenda for Today & Next Few Lectures


 Single-cycle Microarchitectures
18-447
Computer Architecture  Multi-cycle and Microprogrammed Microarchitectures

Lecture 10: Branch Prediction II  Pipelining

 Issues in Pipelining: Control & Data Dependence Handling,


State Maintenance and Recovery, …
Prof. Onur Mutlu
Rachata Ausavarungnirun  Out-of-Order Execution
Carnegie Mellon University
Spring 2015, 2/6/2015  Issues in OoO Execution: Load-Store Handling, …
546

Reminder: Readings for Next Few Lectures (I) Reminder: Readings for Next Few Lectures (II)
 P&H Chapter 4.9-4.11  Smith and Pleszkun, “Implementing Precise Interrupts in
Pipelined Processors,” IEEE Trans on Computers 1988
 Smith and Sohi, “The Microarchitecture of Superscalar (earlier version in ISCA 1985). HW3 summary paper
Processors,” Proceedings of the IEEE, 1995
 More advanced pipelining
 Interrupt and exception handling
 Out-of-order and superscalar execution concepts

 McFarling, “Combining Branch Predictors,” DEC WRL


Technical Report, 1993. HW3 summary paper

 Kessler, “The Alpha 21264 Microprocessor,” IEEE Micro


1999.
547 548
10-11-2023

Recap of Last Lecture Review: More Sophisticated Direction Prediction


 Predicated Execution Primer  Compile time (static)
 Delayed Branching  Always not taken
 With and without squashing  Always taken
 Branch Prediction  BTFN (Backward taken, forward not taken)
 Profile based (likely direction)
 Reducing misprediction penalty (branch resolution latency)
 Program analysis based (likely direction)
 Branch target buffer (BTB)
 Static Branch Prediction
 Run time (dynamic)
 Dynamic Branch Prediction
 Last time prediction (single-bit)
 How Big Is the Branch Problem?
 Two-bit counter based prediction
 Two-level prediction (global vs. local)
 Hybrid

549 550

Review: Importance of The Branch Problem Review: Can We Do Better?


 Assume N = 20 (20 pipe stages), W = 5 (5 wide fetch)
 Last-time and 2BC predictors exploit “last-time”
 Assume: 1 out of 5 instructions is a branch
predictability
 Assume: Each 5 instruction-block ends with a branch

 How long does it take to fetch 500 instructions?  Realization 1: A branch’s outcome can be correlated with
 100% accuracy other branches’ outcomes
 100 cycles (all instructions fetched on the correct path)
 No wasted work  Global branch correlation
 99% accuracy
100 (correct path) + 20 (wrong path) = 120 cycles

 Realization 2: A branch’s outcome can be correlated with
 20% extra instructions fetched
past outcomes of the same branch (other than the outcome
 98% accuracy
 100 (correct path) + 20 * 2 (wrong path) = 140 cycles of the branch “last-time” it was executed)
 40% extra instructions fetched  Local branch correlation
 95% accuracy
 100 (correct path) + 20 * 5 (wrong path) = 200 cycles
 100% extra instructions fetched
551 552
10-11-2023

Global Branch Correlation (I) Global Branch Correlation (II)


 Recently executed branch outcomes in the execution path
is correlated with the outcome of the next branch

 If Y and Z both taken, then X also taken


 If first branch not taken, second also not taken  If Y or Z not taken, then X also not taken

 If first branch taken, second definitely not taken

553 554

Global Branch Correlation (III) Capturing Global Branch Correlation


 Eqntott, SPEC 1992  Idea: Associate branch outcomes with “global T/NT history”
of all branches
if (aa==2) ;; B1  Make a prediction based on the outcome of the branch the
aa=0; last time the same global branch history was encountered
if (bb==2) ;; B2
bb=0;  Implementation:
if (aa!=bb) { ;; B3  Keep track of the “global T/NT history” of all branches in a
…. register  Global History Register (GHR)
}  Use GHR to index into a table that recorded the outcome that
was seen for each GHR value in the recent past  Pattern
History Table (table of 2-bit counters)
If B1 is not taken (i.e., aa==0@B3) and B2 is not taken (i.e.
bb=0@B3) then B3 is certainly taken
 Global history/branch predictor
 Uses two levels of history (GHR + history at that GHR)
555 556
10-11-2023

Two Level Global Branch Prediction

 First level: Global branch history register (N bits)
   The direction of last N branches
 Second level: Table of saturating counters for each history entry
   The direction the branch took the last time the same history was seen
 [Figure: the GHR (global branch history register, holding previous branches' directions, e.g. "1 1 ... 1 0") indexes a Pattern History Table (PHT) of 2-bit counters, one entry per history pattern from 00...00 to 11...11]

 Yeh and Patt, “Two-Level Adaptive Training Branch Prediction,” MICRO 1991.
 McFarling, “Combining Branch Predictors,” DEC WRL TR 1993.

How Does the Global Predictor Work?

 Example: this branch tests i; the last 4 branches test j
   History: TTTN
   Predict taken for i
   Next history: TTNT (shift in last outcome)
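A compact C sketch of this two-level global scheme: a global history register indexing a PHT of 2-bit counters. The history length and table size are illustrative choices, not taken from the slides.

    #include <stdbool.h>
    #include <stdint.h>

    #define HIST_BITS 12
    #define PHT_SIZE  (1u << HIST_BITS)

    static uint16_t ghr;                    // global history: 1 bit per recent branch outcome
    static uint8_t  pht[PHT_SIZE];          // 2-bit counters, one per history pattern

    bool predict_global(void) {
        return pht[ghr & (PHT_SIZE - 1)] >= 2;
    }

    void update_global(bool taken) {
        uint8_t *ctr = &pht[ghr & (PHT_SIZE - 1)];
        if (taken  && *ctr < 3) (*ctr)++;
        if (!taken && *ctr > 0) (*ctr)--;
        ghr = (uint16_t)((ghr << 1) | taken);   // shift the outcome into the history
    }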

Intel Pentium Pro Branch Predictor

 4-bit global history register
 Multiple pattern history tables (of 2 bit counters)
   Which pattern history table to use is determined by lower order bits of the branch address

Improving Global Predictor Accuracy

 Idea: Add more context information to the global predictor to take into account which branch is being predicted
 Gshare predictor: GHR hashed with the Branch PC
   + More context information
   + Better utilization of PHT
   -- Increases access latency
 McFarling, “Combining Branch Predictors,” DEC WRL Tech Report, 1993.
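Relative to the two-level global sketch above, gshare changes only the PHT index. A hedged one-liner (reusing PHT_SIZE and the counter table from that sketch) illustrating the hash:

    // gshare: combine branch PC and global history to form the PHT index
    unsigned gshare_index(uint32_t pc, uint16_t ghr_value) {
        return ((pc >> 2) ^ ghr_value) & (PHT_SIZE - 1);   // XOR hash, as in McFarling 1993
    }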
10-11-2023

Review: One-Level Branch Predictor / Two-Level Global History Branch Predictor

 [Figures: fetch-stage block diagrams. In the one-level predictor, the Program Counter (address of the current instruction) indexes a direction predictor of 2-bit counters and the Cache of Target Addresses (BTB: Branch Target Buffer); a hit plus a taken prediction selects the target address over PC + inst size as the Next Fetch Address. In the two-level global history version, the global branch history (which direction earlier branches went) indexes the direction predictor instead.]

Two-Level Gshare Branch Predictor

 [Figure: same fetch-stage organization, except that the global branch history is XORed with the Program Counter to index the direction predictor (2-bit counters); the BTB (Cache of Target Addresses) still supplies the target address]

Can We Do Better?

 Last-time and 2BC predictors exploit only "last-time" predictability for a given branch
 Realization 1: A branch's outcome can be correlated with other branches' outcomes
   Global branch correlation
 Realization 2: A branch's outcome can be correlated with past outcomes of the same branch (in addition to the outcome of the branch "last-time" it was executed)
   Local branch correlation
10-11-2023

Local Branch Correlation Capturing Local Branch Correlation


 Idea: Have a per-branch history register
 Associate the predicted outcome of a branch with “T/NT history”
of the same branch
 Make a prediction based on the outcome of the branch the
last time the same local branch history was encountered

 Called the local history/branch predictor


 Uses two levels of history (Per-branch history register +
history at that history register value)

 McFarling, “Combining Branch Predictors,” DEC WRL TR 1993.

565 566

Two Level Local Branch Prediction

 First level: A set of local history registers (N bits each)
   Select the history register based on the PC of the branch
 Second level: Table of saturating counters for each history entry
   The direction the branch took the last time the same history was seen
 [Figure: the branch PC selects one of the local history registers, whose contents (e.g. "1 1 ... 1 0") index the Pattern History Table (PHT) of 2-bit counters]
 Yeh and Patt, “Two-Level Adaptive Training Branch Prediction,” MICRO 1991.

Two-Level Local History Branch Predictor

 [Figure: fetch-stage block diagram in which the per-branch local history ("which directions earlier instances of *this branch* went", selected by the address of the current instruction) indexes the direction predictor (2-bit counters), alongside the BTB (Cache of Target Addresses) that provides the target address; a taken prediction plus a BTB hit selects the target over PC + inst size as the Next Fetch Address]
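A C sketch of the two-level local scheme (sizes are illustrative): a per-branch history register, selected by the PC, indexes a second-level table of 2-bit counters.

    #include <stdbool.h>
    #include <stdint.h>

    #define LHIST_ENTRIES 1024                   // first level: local history registers
    #define LHIST_BITS    10
    #define LPHT_SIZE     (1u << LHIST_BITS)

    static uint16_t local_hist[LHIST_ENTRIES];   // per-branch T/NT history
    static uint8_t  local_pht[LPHT_SIZE];        // second level: 2-bit counters

    static unsigned hist_index(uint32_t pc) { return (pc >> 2) & (LHIST_ENTRIES - 1); }

    bool predict_local(uint32_t pc) {
        uint16_t hist = local_hist[hist_index(pc)] & (LPHT_SIZE - 1);
        return local_pht[hist] >= 2;
    }

    void update_local(uint32_t pc, bool taken) {
        unsigned hi   = hist_index(pc);
        uint16_t hist = local_hist[hi] & (LPHT_SIZE - 1);
        uint8_t *ctr  = &local_pht[hist];
        if (taken  && *ctr < 3) (*ctr)++;
        if (!taken && *ctr > 0) (*ctr)--;
        local_hist[hi] = (uint16_t)((local_hist[hi] << 1) | taken);  // update this branch's history
    }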
10-11-2023

Hybrid Branch Predictors

 Idea: Use more than one type of predictor (i.e., multiple algorithms) and select the "best" prediction
   E.g., hybrid of 2-bit counters and global predictor
 Advantages:
   + Better accuracy: different predictors are better for different branches
   + Reduced warmup time (faster-warmup predictor used until the slower-warmup predictor warms up)
 Disadvantages:
   -- Need "meta-predictor" or "selector"
   -- Longer access latency
 McFarling, “Combining Branch Predictors,” DEC WRL Tech Report, 1993.

Alpha 21264 Tournament Predictor

 Minimum branch penalty: 7 cycles
 Typical branch penalty: 11+ cycles
 48K bits of target addresses stored in I-cache
 Predictor tables are reset on a context switch
 Kessler, “The Alpha 21264 Microprocessor,” IEEE Micro 1999.
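A minimal C sketch of the selection ("meta-predictor") idea, not the 21264's exact structures: a table of 2-bit chooser counters, indexed here by PC, picks between the global and local component predictions and is trained only when the two components disagree.

    #include <stdbool.h>
    #include <stdint.h>

    #define CHOOSER_SIZE 4096
    static uint8_t chooser[CHOOSER_SIZE];   // 0-1: prefer local, 2-3: prefer global

    bool predict_tournament(uint32_t pc, bool global_pred, bool local_pred) {
        unsigned i = (pc >> 2) & (CHOOSER_SIZE - 1);
        return (chooser[i] >= 2) ? global_pred : local_pred;
    }

    void update_tournament(uint32_t pc, bool global_pred, bool local_pred, bool taken) {
        unsigned i = (pc >> 2) & (CHOOSER_SIZE - 1);
        if (global_pred != local_pred) {            // train the chooser only on disagreement
            bool global_correct = (global_pred == taken);
            if (global_correct  && chooser[i] < 3) chooser[i]++;
            if (!global_correct && chooser[i] > 0) chooser[i]--;
        }
        // the global and local component predictors are updated separately
    }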

Branch Prediction Accuracy (Example) Biased Branches


 Bimodal: table of 2bc indexed by branch address  Observation: Many branches are biased in one direction
(e.g., 99% taken)

 Problem: These branches pollute the branch prediction


structures  make the prediction of other branches difficult
by causing “interference” in branch prediction tables and
history registers

 Solution: Detect such biased branches, and predict them


with a simpler predictor (e.g., last time, static, …)

 Chang et al., “Branch classification: a new mechanism for improving


branch predictor performance,” MICRO 1994.

571 572
10-11-2023

Some Other Branch Predictor Types How to Handle Control Dependences


 Loop branch detector and predictor  Critical to keep the pipeline full with correct sequence of
 Works well for loops with small number of iterations, where dynamic instructions.
iteration count is predictable
 Potential solutions if the instruction is a control-flow
 Perceptron branch predictor instruction:
 Learns the direction correlations between individual branches

 Assigns weights to correlations  Stall the pipeline until we know the next fetch address
 Jimenez and Lin, “Dynamic Branch Prediction with  Guess the next fetch address (branch prediction)
Perceptrons,” HPCA 2001.
 Employ delayed branching (branch delay slot)
 Do something else (fine-grained multithreading)
 Geometric history length predictor
 Eliminate control-flow instructions (predicated execution)
 Fetch from both possible paths (if you know the addresses
 Your predictor?
of both possible paths) (multipath execution)
573 574

Review: Predicate Combining (not Predicated Execution)

 Complex predicates are converted into multiple branches
   if ((a == b) && (c < d) && (a > 5000)) { … }
   3 conditional branches
 Problem: This increases the number of control dependencies
 Idea: Combine predicate operations to feed a single branch instruction instead of having one branch for each
   Predicates stored and operated on using condition registers
   A single branch checks the value of the combined predicate
 + Fewer branches in code  fewer mispredictions/stalls
 -- Possibly unnecessary work
   -- If the first predicate is false, no need to compute other predicates
 Condition registers exist in IBM RS6000 and the POWER architecture

Predication (Predicated Execution)

 Idea: Compiler converts control dependence into data dependence → branch is eliminated
   Each instruction has a predicate bit set based on the predicate computation
   Only instructions with TRUE predicates are committed (others turned into NOPs)

   if (cond) {
       b = 0;
   }
   else {
       b = 1;
   }

   (normal branch code)          (predicated code)
   A: p1 = (cond)                A: p1 = (cond)
      branch p1, TARGET          B: (!p1) mov b, 1
   B: mov b, 1                   C: (p1)  mov b, 0
      jmp JOIN                   D: add x, b, 1
   C: TARGET:
      mov b, 0
   D: JOIN:
      add x, b, 1
10-11-2023

Conditional Move Operations

 Very limited form of predicated execution
 CMOV R1  R2
   R1 = (ConditionCode == true) ? R2 : R1
   Employed in most modern ISAs (x86, Alpha)

Review: CMOV Operation

 Suppose we had a Conditional Move instruction…
   CMOV condition, R1  R2
   R1 = (condition == true) ? R2 : R1
   Employed in most modern ISAs (x86, Alpha)
 Code example with branches vs. CMOVs
   if (a == 5) {b = 4;} else {b = 3;}

   CMPEQ condition, a, 5;
   CMOV condition, b  4;
   CMOV !condition, b  3;
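For intuition at the source level, here is a hedged C illustration: both functions compute the same result, but the second is written in the branchless, select-style form that a compiler can map directly onto a conditional-move instruction (whether it actually emits CMOV depends on the compiler and target).

    // Branch-based version: control dependence on (a == 5)
    int set_b_branchy(int a) {
        int b;
        if (a == 5) { b = 4; } else { b = 3; }
        return b;
    }

    // Select-style version: the condition feeds a data dependence instead,
    // which maps naturally onto a conditional move (e.g., x86 CMOV)
    int set_b_branchless(int a) {
        int cond = (a == 5);
        return cond ? 4 : 3;
    }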

Predicated Execution (II)

 Predicated execution can be high performance and energy-efficient
 [Figure: pipeline timelines (Fetch Decode Rename Schedule RegisterRead Execute) contrasting Predicated Execution, which keeps the pipeline full, with Branch Prediction, where a misprediction forces a pipeline flush]

Predicated Execution (III)

 Advantages:
   + Eliminates mispredictions for hard-to-predict branches
   + No need for branch prediction for some branches
   + Good if misprediction cost > useless work due to predication
   + Enables code optimizations hindered by the control dependency
   + Can move instructions more freely within predicated code

 Disadvantages:
   -- Causes useless work for branches that are easy to predict
   -- Reduces performance if misprediction cost < useless work
   -- Adaptivity: Static predication is not adaptive to run-time branch behavior. Branch behavior changes based on input set, program phase, control-flow path.
   -- Additional hardware and ISA support
   -- Cannot eliminate all hard to predict branches
     -- Loop branches

Predicated Execution in Intel Itanium Conditional Execution in the ARM ISA


 Each instruction can be separately predicated  Almost all ARM instructions can include an optional
 64 one-bit predicate registers condition code.
each instruction carries a 6-bit predicate field
 An instruction is effectively a NOP if its predicate is false  An instruction with a condition code is executed only if the
condition code flags in the CPSR meet the specified
condition.
cmp p1 p2 cmp
br p2 else1
p1 then1
else1
join1
else2
p1 then2
br
p2 else2
then1
join2
then2
join1
join2
581 582

Conditional Execution in ARM ISA Conditional Execution in ARM ISA

583 584
10-11-2023

Conditional Execution in ARM ISA Conditional Execution in ARM ISA

585 586

Conditional Execution in ARM ISA Idealism


 Wouldn’t it be nice
 If the branch is eliminated (predicated) only when it would
actually be mispredicted
 If the branch were predicted when it would actually be
correctly predicted

 Wouldn’t it be nice
 If predication did not require ISA support

587 588
10-11-2023

Improving Predicated Execution Wish Branches


 Three major limitations of predication  The compiler generates code (with wish branches) that
1. Adaptivity: non-adaptive to branch behavior can be executed either as predicated code or non-
2. Complex CFG: inapplicable to loops/complex control flow graphs predicated code (normal branch code)
3. ISA: Requires large ISA changes
 The hardware decides to execute predicated code or
A
normal branch code at run-time based on the confidence of
 Wish Branches [Kim+, MICRO 2005]
branch prediction
 Solve 1 and partially 2 (for loops)
 Easy to predict: normal branch code
 Dynamic Predicated Execution  Hard to predict: predicated code
 Diverge-Merge Processor [Kim+, MICRO 2006]
 Solves 1, 2 (partially), 3  Kim et al., “Wish Branches: Enabling Adaptive and
Aggressive Predicated Execution,” MICRO 2006, IEEE Micro
Top Picks, Jan/Feb 2006.
589 590

Wish Jump/Join

 [Figure: the same if/else example compiled three ways -- normal branch code (p1 = (cond); branch p1, TARGET; mov b, 1; jmp JOIN; TARGET: mov b, 0), predicated code (p1 = (cond); (!p1) mov b, 1; (p1) mov b, 0), and wish jump/join code, where a wish.jump to TARGET and a wish.join around the predicated blocks let the hardware execute the region as branch code under high confidence and as predicated code under low confidence]

Wish Branches vs. Predicated Execution

 Advantages compared to predicated execution
   Reduces the overhead of predication
   Increases the benefits of predicated code by allowing the compiler to generate more aggressively-predicated code
   Makes predicated code less dependent on machine configuration (e.g. branch predictor)

 Disadvantages compared to predicated execution
   Extra branch instructions use machine resources
   Extra branch instructions increase the contention for branch predictor table entries
   Constrains the compiler's scope for code optimizations
10-11-2023

How to Handle Control Dependences Multi-Path Execution


 Critical to keep the pipeline full with correct sequence of  Idea: Execute both paths after a conditional branch
For all branches: Riseman and Foster, “The inhibition of potential parallelism
dynamic instructions. 

by conditional jumps,” IEEE Transactions on Computers, 1972.


 For a hard-to-predict branch: Use dynamic confidence estimation
 Potential solutions if the instruction is a control-flow
instruction:  Advantages:
+ Improves performance if misprediction cost > useless work
 Stall the pipeline until we know the next fetch address + No ISA change needed
 Guess the next fetch address (branch prediction)
 Employ delayed branching (branch delay slot)  Disadvantages:
-- What happens when the machine encounters another hard-to-predict
 Do something else (fine-grained multithreading) branch? Execute both paths again?
 Eliminate control-flow instructions (predicated execution) -- Paths followed quickly become exponential
 Fetch from both possible paths (if you know the addresses -- Each followed path requires its own context (registers, PC, GHR)
of both possible paths) (multipath execution) -- Wasted work (and reduced performance) if paths merge
593 594

Dual-Path Execution versus Predication

 [Figure: a hard-to-predict branch leading to blocks B and C that merge at a control-flow merge (CFM) point and continue through D, E, F; shown under dual-path execution (both path 1 and path 2 are fetched) and under predicated execution]

Remember: Branch Types

 Type           Direction at    Number of possible      When is next
                fetch time      next fetch addresses?   fetch address resolved?
 Conditional    Unknown         2                       Execution (register dependent)
 Unconditional  Always taken    1                       Decode (PC + offset)
 Call           Always taken    1                       Decode (PC + offset)
 Return         Always taken    Many                    Execution (register dependent)
 Indirect       Always taken    Many                    Execution (register dependent)

 How can we predict an indirect branch with many target addresses?
10-11-2023

Call and Return Prediction

 Direct calls are easy to predict
   Always taken, single target
   Call marked in BTB, target predicted by BTB
 Returns are indirect branches
   A function can be called from many points in code
   A return instruction can have many target addresses
     Next instruction after each call point for the same function
   Observation: Usually a return matches a call
   Idea: Use a stack to predict return addresses (Return Address Stack)
     A fetched call: pushes the return (next instruction) address on the stack
     A fetched return: pops the stack and uses the address as its predicted target
     Accurate most of the time: 8-entry stack → > 95% accuracy

Indirect Branch Prediction (I)

 Register-indirect branches have multiple targets
   [Figure: a conditional (direct) branch "br.cond TARGET" has only two possible next addresses (TARG, A+1), whereas an indirect jump "R1 = MEM[R2]; branch R1" can go to many targets (a, b, d, r)]
 Used to implement
   Switch-case statements
   Virtual function calls
   Jump tables (of function pointers)
   Interface calls
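A C sketch of a return address stack (RAS) as described above. The 8-entry size matches the slide; the wrap-around policy on overflow (overwrite the oldest entry) is an assumption of this sketch, not something stated on the slide.

    #include <stdint.h>

    #define RAS_ENTRIES 8

    static uint32_t ras[RAS_ENTRIES];
    static unsigned ras_top;                    // index of the next free slot (mod 8)

    // On fetching a call: push the address of the instruction after the call
    void ras_push_call(uint32_t call_pc, uint32_t inst_size) {
        ras[ras_top % RAS_ENTRIES] = call_pc + inst_size;
        ras_top++;                              // overflow silently overwrites the oldest entry
    }

    // On fetching a return: pop and use the popped address as the predicted target
    uint32_t ras_predict_return(void) {
        if (ras_top > 0) ras_top--;
        return ras[ras_top % RAS_ENTRIES];
    }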

Indirect Branch Prediction (II) More Ideas on Indirect Branches?


 No direction prediction needed  Virtual Program Counter prediction
 Idea 1: Predict the last resolved target as the next fetch address  Idea: Use conditional branch prediction structures iteratively
+ Simple: Use the BTB to store the target address to make an indirect branch prediction
-- Inaccurate: 50% accuracy (empirical). Many indirect branches switch  i.e., devirtualize the indirect branch in hardware
between different targets

 Curious?
 Idea 2: Use history based target prediction  Kim et al., “VPC Prediction: Reducing the Cost of Indirect
 E.g., Index the BTB with GHR XORed with Indirect Branch PC Branches via Hardware-Based Dynamic Devirtualization,” ISCA
 Chang et al., “Target Prediction for Indirect Jumps,” ISCA 1997. 2007.
+ More accurate
-- An indirect branch maps to (too) many entries in BTB
-- Conflict misses with other branches (direct or indirect)
-- Inefficient use of space if branch has few target addresses

599 600
10-11-2023

Issues in Branch Prediction (I)

 Need to identify a branch before it is fetched
 How do we do this?
 BTB hit  indicates that the fetched instruction is a branch
 BTB entry contains the "type" of the branch
 Pre-decoded "branch type" information stored in the instruction cache identifies type of branch
 What if no BTB?
 Bubble in the pipeline until target address is computed
 E.g., IBM POWER4

Issues in Branch Prediction (II)

 Latency: Prediction is latency critical
 Need to generate next fetch address for the next cycle
 Bigger, more complex predictors are more accurate but slower

[Figure: the next fetch address is selected among PC + inst size, the BTB target, the Return Address Stack target, the Indirect Branch Predictor target, and the resolved target from the backend.]
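The next-fetch-address selection in the figure can be written as a simple priority choice. The ordering below, the `btb_entry` dictionary format, and the use of the toy RAS and indirect predictor sketched earlier are all assumptions made for this illustration, not a specific machine's policy.

```python
def next_fetch_address(pc, inst_size, btb_entry, ras, indirect_pred, ghr,
                       resolved_redirect=None):
    """Pick the next fetch address from the candidate sources in the figure.

    btb_entry is None on a BTB miss, otherwise something like
    {"type": "cond" | "call" | "return" | "indirect",
     "target": ..., "predicted_taken": True/False}.
    resolved_redirect is set when the backend reports a misprediction.
    """
    if resolved_redirect is not None:           # backend redirect wins
        return resolved_redirect
    if btb_entry is None:                       # not known to be a branch
        return pc + inst_size
    kind = btb_entry["type"]
    if kind == "return":
        return ras.on_return_fetched()          # return address stack
    if kind == "indirect":
        return indirect_pred.predict(pc, ghr)   # indirect target predictor
    if kind == "cond" and not btb_entry["predicted_taken"]:
        return pc + inst_size                   # predicted not taken
    return btb_entry["target"]                  # taken direct branch or call
```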


Complications in Superscalar Processors Multiple Instruction Fetch: Concepts


 Superscalar processors
 attempt to execute more than 1 instruction-per-cycle
 must fetch multiple instructions per cycle
 What if there is a branch in the middle of fetched instructions?

 Consider a 2-way superscalar fetch scenario


(case 1) Neither inst is a taken control flow inst
 nPC = PC + 8
(case 2) One of the insts is a taken control flow inst
 nPC = predicted target addr
 *NOTE* both instructions could be control-flow; prediction based on
the first one predicted taken
 If the 1st instruction is the predicted taken branch
 nullify 2nd instruction fetched
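The two cases above can be captured in a few lines (assuming 4-byte instructions, so sequential 2-wide fetch advances the PC by 8; the function and argument names are made up for this sketch).

```python
def two_wide_next_pc(pc, pred_target0, pred_target1):
    """pred_target0/1: predicted-taken target of each fetched slot,
    or None if that slot is not a predicted-taken control-flow inst.

    Returns (next_pc, slot1_valid). The prediction is based on the first
    slot predicted taken; if that is slot 0, the second fetched
    instruction is nullified."""
    if pred_target0 is not None:
        return pred_target0, False    # case 2: 1st inst predicted taken
    if pred_target1 is not None:
        return pred_target1, True     # case 2: 2nd inst predicted taken
    return pc + 8, True               # case 1: no taken control flow inst
```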

Review of Last Few Lectures


Control dependence handling in pipelined machines

 Delayed branching
18-447
 Fine-grained multithreading Computer Architecture
Branch prediction

 Compile time (static)


Lecture 11: Precise Exceptions,
 Always NT, Always T, Backward T Forward NT, Profile based State Maintenance, State Recovery
 Run time (dynamic)
 Last time predictor
 Hysteresis: 2BC predictor
 Global branch correlation  Two-level global predictor
 Local branch correlation  Two-level local predictor Prof. Onur Mutlu
 Hybrid branch predictors Carnegie Mellon University
 Predicated execution Spring 2015, 2/11/2015
 Multipath execution
 Return address stack & Indirect branch prediction

Agenda for Today & Next Few Lectures Reminder: Readings for Next Few Lectures (I)
 Single-cycle Microarchitectures  P&H Chapter 4.9-4.11

 Multi-cycle and Microprogrammed Microarchitectures  Smith and Sohi, “The Microarchitecture of Superscalar
Processors,” Proceedings of the IEEE, 1995
 Pipelining  More advanced pipelining
 Interrupt and exception handling
 Out-of-order and superscalar execution concepts
 Issues in Pipelining: Control & Data Dependence Handling,
State Maintenance and Recovery, …
 McFarling, “Combining Branch Predictors,” DEC WRL
Technical Report, 1993. HW3 summary paper
 Out-of-Order Execution

 Kessler, “The Alpha 21264 Microprocessor,” IEEE Micro


 Issues in OoO Execution: Load-Store Handling, …
1999.

Reminder: Readings for Next Few Lectures (II) Readings Specifically for Today
 Smith and Pleszkun, “Implementing Precise Interrupts in  Smith and Pleszkun, “Implementing Precise Interrupts in
Pipelined Processors,” IEEE Trans on Computers 1988 Pipelined Processors,” IEEE Trans on Computers 1988
(earlier version in ISCA 1985). HW3 summary paper (earlier version in ISCA 1985). HW3 summary paper

 Smith and Sohi, “The Microarchitecture of Superscalar


Processors,” Proceedings of the IEEE, 1995
 More advanced pipelining
 Interrupt and exception handling
 Out-of-order and superscalar execution concepts


Review: How to Handle Control Dependences Review of Last Few Lectures


 Critical to keep the pipeline full with correct sequence of  Control dependence handling in pipelined machines
dynamic instructions.  Delayed branching
 Fine-grained multithreading
 Potential solutions if the instruction is a control-flow  Branch prediction
instruction:  Compile time (static)
 Always NT, Always T, Backward T Forward NT, Profile based
 Run time (dynamic)
 Stall the pipeline until we know the next fetch address  Last time predictor
 Guess the next fetch address (branch prediction)  Hysteresis: 2BC predictor
Global branch correlation  Two-level global predictor
 Employ delayed branching (branch delay slot) 

 Local branch correlation  Two-level local predictor


 Do something else (fine-grained multithreading)  Hybrid branch predictors
 Eliminate control-flow instructions (predicated execution)  Predicated execution
 Fetch from both possible paths (if you know the addresses  Multipath execution
of both possible paths) (multipath execution)  Return address stack & Indirect branch prediction

Multi-Cycle Execution
 Not all instructions take the same amount of time for
“execution”
Pipelining and Precise Exceptions:
Preserving Sequential Semantics  Idea: Have multiple different functional units that take
different number of cycles
 Can be pipelined or not pipelined
 Can let independent instructions start execution on a different
functional unit before a previous long-latency instruction
finishes execution


Issues in Pipelining: Multi-Cycle Execute Exceptions vs. Interrupts


 Cause
 Instructions can take different number of cycles in
EXECUTE stage  Exceptions: internal to the running thread
 Integer ADD versus FP MULtiply  Interrupts: external to the running thread

FMUL R4  R1, R2 F D E E E E E E E E W
ADD R3  R1, R2
 When to Handle
F D E W
F D E W
 Exceptions: when detected (and known to be non-speculative)
F D E W  Interrupts: when convenient
F D E E E E E E E E W  Except for very high priority ones
FMUL R2  R5, R6
 Power failure
ADD R7  R5, R6 F D E W
 Machine check (error)
F D E W
 What is wrong with this picture?
 Sequential semantics of the ISA NOT preserved!
 Priority: process (exception), depends (interrupt)
 What if FMUL incurs an exception?
 Handling Context: process (exception), system (interrupt)

Precise Exceptions/Interrupts Why Do We Want Precise Exceptions?


 The architectural state should be consistent when the  Semantics of the von Neumann model ISA specifies it
exception/interrupt is ready to be handled  Remember von Neumann vs. Dataflow

1. All previous instructions should be completely retired.  Aids software debugging

2. No later instruction should be retired.  Enables (easy) recovery from exceptions, e.g. page faults

Retire = commit = finish execution and update arch. state  Enables (easily) restartable processes

 Enables traps into software (e.g., software implemented


opcodes)


Ensuring Precise Exceptions in Pipelining Solutions


 Idea: Make each operation take the same amount of time  Reorder buffer

FMUL R3  R1, R2 F D E E E E E E E E W
ADD R4  R1, R2 F D E E E E E E E E W  History buffer
F D E E E E E E E E W
F D E E E E E E E E W
 Future register file
F D E E E E E E E E W
F D E E E E E E E E W
F D E E E E E E E E W  Checkpointing

 Downside  Readings
 Worst-case instruction latency determines all instructions’ latency  Smith and Plezskun, “Implementing Precise Interrupts in Pipelined
 What about memory operations? Processors,” IEEE Trans on Computers 1988 and ISCA 1985.
 Hwu and Patt, “Checkpoint Repair for Out-of-order Execution
 Each functional unit takes worst-case number of cycles?
Machines,” ISCA 1987.

Solution I: Reorder Buffer (ROB)

 Idea: Complete instructions out-of-order, but reorder them before making results visible to architectural state
 When instruction is decoded it reserves an entry in the ROB
 When instruction completes, it writes result into ROB entry
 When instruction is oldest in ROB and it has completed without exceptions, its result is moved to reg. file or memory

[Figure: instruction cache and register file feed several functional units; results are collected in the reorder buffer before updating architectural state.]

What's in a ROB Entry?

V | DestRegID | DestRegVal | StoreAddr | StoreData | PC | Valid bits for reg/data + control bits | Exc?

 Need valid bits to keep track of readiness of the result(s)

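A minimal behavioral sketch of the allocate / complete / retire flow described above, with an entry format loosely following the "What's in a ROB Entry?" fields. The size, the dictionary-based entries, and the flush-on-exception policy are assumptions for illustration only.

```python
from collections import deque

class ReorderBuffer:
    """Simplified ROB: allocate in program order at decode, complete
    out of order, commit in order from the head (oldest entry)."""

    def __init__(self, size=64):
        self.size = size
        self.entries = deque()                 # head = oldest instruction

    def allocate(self, pc, dest_reg):
        if len(self.entries) == self.size:
            return None                        # ROB full: stall decode
        entry = {"pc": pc, "dest": dest_reg, "value": None,
                 "done": False, "exc": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry, value, exc=False):
        entry["value"], entry["done"], entry["exc"] = value, True, exc

    def retire(self, arch_regfile):
        # Move results of completed, exception-free oldest instructions
        # into the architectural register file, in program order.
        while self.entries and self.entries[0]["done"]:
            head = self.entries[0]
            if head["exc"]:
                self.entries.clear()           # flush younger instructions
                return ("exception", head["pc"])
            if head["dest"] is not None:
                arch_regfile[head["dest"]] = head["value"]
            self.entries.popleft()
        return ("ok", None)
```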

Reorder Buffer: Independent Operations Reorder Buffer: How to Access?


 Results first written to ROB, then to register file at commit  A register value can be in the register file, reorder buffer,
time (or bypass/forwarding paths)
F D E E E E E E E E R W
F D E R W
Instruction Register
F D E R W Cache File
F D E R W Func Unit
F D E E E E E E E E R W
F D E R W Func Unit
F D E R W
Content Reorder Func Unit
Addressable Buffer
 What if a later operation needs a value in the reorder Memory
(searched with bypass path
buffer?
register ID)
 Read reorder buffer in parallel with the register file. How?


Simplifying Reorder Buffer Access Reorder Buffer in Intel Pentium III


 Idea: Use indirection

 Access register file first


 If register not valid, register file stores the ID of the reorder
buffer entry that contains (or will contain) the value of the
Boggs et al., “The
register
Microarchitecture of the
 Mapping of the register to a ROB entry: Register file maps the Pentium 4 Processor,” Intel
register to a reorder buffer entry if there is an in-flight Technology Journal, 2001.
instruction writing to the register

 Access reorder buffer next

 Now, reorder buffer does not need to be content addressable



Important: Register Renaming with a Reorder Buffer

 Output and anti dependencies are not true dependencies
 WHY? The same register refers to values that have nothing to do with each other
 They exist due to lack of register ID’s (i.e. names) in the ISA
 The register ID is renamed to the reorder buffer entry that will hold the register’s value
 Register ID  ROB entry ID
 Architectural register ID  Physical register ID
 After renaming, ROB entry ID used to refer to the register
 This eliminates anti- and output- dependencies
 Gives the illusion that there are a large number of registers

Renaming Example

 Assume
 Register file has pointers to reorder buffer if the register is not valid
 Reorder buffer works as described before
 Where is the latest definition of R3 for each instruction below in sequential order?

LD R0(0)  R3
LD R3, R1  R10
MUL R1, R2  R3
MUL R3, R4  R11
ADD R5, R6  R3
ADD R7, R8  R12

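The question above can be answered mechanically with a tiny rename-table walk-through. The ROB entry names ROB0..ROB5 are hypothetical and simply follow program order.

```python
# Each tuple is (op, source registers, destination register), matching the
# "sources -> destination" notation used in the example above.
instructions = [
    ("LD",  ["R0"],       "R3"),   # allocated ROB0, defines R3
    ("LD",  ["R3", "R1"], "R10"),  # ROB1, reads R3 -> renamed to ROB0
    ("MUL", ["R1", "R2"], "R3"),   # ROB2, redefines R3
    ("MUL", ["R3", "R4"], "R11"),  # ROB3, reads R3 -> renamed to ROB2
    ("ADD", ["R5", "R6"], "R3"),   # ROB4, redefines R3
    ("ADD", ["R7", "R8"], "R12"),  # ROB5, never reads R3
]

rename_table = {}   # architectural register -> ROB entry of its latest writer
for rob_id, (op, srcs, dest) in enumerate(instructions):
    renamed = [rename_table.get(s, s) for s in srcs]   # read before write
    print(f"ROB{rob_id}: {op} {renamed} -> ROB{rob_id} (arch {dest})")
    rename_table[dest] = f"ROB{rob_id}"
```

Running this shows the second LD getting R3 from ROB0, the second MUL getting it from ROB2, and the final ADD never reading R3 at all; at retirement the value written by ROB4 is the one that ends up in the architectural R3.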

Reorder Buffer Storage Cost In-Order Pipeline with Reorder Buffer


 Idea: Reduce reorder buffer entry storage by specializing  Decode (D): Access regfile/ROB, allocate entry in ROB, check if
for instruction types instruction can execute, if so dispatch instruction
 Execute (E): Instructions can complete out-of-order
DestRegID DestRegVal StoreAddr StoreData
Control/val
Exc?
 Completion (R): Write result to reorder buffer
V PC/IP
id bits
 Retirement/Commit (W): Check for exceptions; if none, write result to
architectural register file or memory; else, flush pipeline and start from
 Do all instructions need all fields? exception handler
 Can you reuse some fields between instructions?  In-order dispatch/execution, out-of-order completion, in-order retirement
Integer add
 Can you implement separate buffers per instruction type? E
Integer mul
 LD, ST, BR, ALU E E E E
FP mul
R W
F D
E E E E E E E E
R
E E E E E E E E ...
Load/store


Reorder Buffer Tradeoffs

 Advantages
 Conceptually simple for supporting precise exceptions
 Can eliminate false dependencies
 Disadvantages
 Reorder buffer needs to be accessed to get the results that are yet to be written to the register file
 CAM or indirection  increased latency and complexity
 Other solutions aim to eliminate the disadvantages
 History buffer
 Future file
 Checkpointing

Solution II: History Buffer (HB)

 Idea: Update the register file when instruction completes, but UNDO UPDATES when an exception occurs
 When instruction is decoded, it reserves an HB entry
 When the instruction completes, it stores the old value of its destination in the HB
 When instruction is oldest and no exceptions/interrupts, the HB entry is discarded
 When instruction is oldest and an exception needs to be handled, old values in the HB are written back into the architectural state from tail to head

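A minimal sketch of the history buffer mechanism just described: the register file is updated immediately at completion, the old destination value is logged, and an exception unwinds the log from tail to head. The entry format and interfaces are invented for the illustration.

```python
from collections import deque

class HistoryBuffer:
    """Optimistic update with an undo log: the HB stores old register
    values so speculative updates can be rolled back on an exception."""

    def __init__(self, regfile):
        self.regfile = regfile
        self.hb = deque()                      # head = oldest instruction

    def allocate(self, pc, dest_reg):          # at decode
        entry = {"pc": pc, "dest": dest_reg, "old": None,
                 "done": False, "exc": False}
        self.hb.append(entry)
        return entry

    def complete(self, entry, value, exc=False):
        # Log the old architectural value, then update the register file.
        entry["old"] = self.regfile[entry["dest"]]
        self.regfile[entry["dest"]] = value
        entry["done"], entry["exc"] = True, exc

    def retire_oldest(self):
        if not self.hb:
            return "empty"
        head = self.hb[0]
        if not head["done"]:
            return "wait"
        if not head["exc"]:
            self.hb.popleft()                  # log entry no longer needed
            return "retired"
        # Exception: undo updates from tail (youngest) back to head.
        while self.hb:
            e = self.hb.pop()
            if e["done"]:
                self.regfile[e["dest"]] = e["old"]
        return "exception"
```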

History Buffer

[Figure: instruction cache and register file feed the functional units; completed results update the register file immediately, while the old destination values are logged in the history buffer, which is used only on exceptions.]

 Advantage:
 Register file contains up-to-date values for incoming instructions
 History buffer access not on critical path
 Disadvantage:
 Need to read the old value of the destination register
 Need to unwind the history buffer upon an exception  increased exception/interrupt handling latency

Comparison of Two Approaches

 Reorder buffer
 Pessimistic register file update
 Update only with non-speculative values (in program order)
 Leads to complexity/delay in accessing the new values
 History buffer
 Optimistic register file update
 Update immediately, but log the old value for recovery
 Leads to complexity/delay in logging old values
 Can we get the best of both worlds?
 Principle: Heterogeneity
 Idea: Have both types of register files

Solution III: Future File (FF) + ROB

 Idea: Keep two register files (speculative and architectural)
 Arch reg file: Updated in program order for precise exceptions
 Use a reorder buffer to ensure in-order updates
 Future reg file: Updated as soon as an instruction completes (if the instruction is the youngest one to write to a register)
 Future file is used for fast access to latest register values (speculative state)
 Frontend register file
 Architectural file is used for state recovery on exceptions (architectural state)
 Backend register file

Future File

[Figure: the instruction cache feeds the functional units; the future file is updated with data and a tag/valid bit at completion, the ROB reorders results, and the architectural file is updated at retirement and used only on exceptions.]

 Advantage
 No need to read the new values from the ROB (no CAM or indirection) or the old value of destination register
 Disadvantage
 Multiple register files
 Need to copy arch. reg. file to future file on an exception

In-Order Pipeline with Future File and Reorder Buffer Can We Reduce the Overhead of Two Register Files?
 Decode (D): Access future file, allocate entry in ROB, check if instruction  Idea: Use indirection, i.e., pointers to data in frontend and
can execute, if so dispatch instruction retirement
 Execute (E): Instructions can complete out-of-order
 Have a single storage that stores register data values
 Completion (R): Write result to reorder buffer and future file
 Keep two register maps (speculative and architectural); also
 Retirement/Commit (W): Check for exceptions; if none, write result to
called register alias tables (RATs)
architectural register file or memory; else, flush pipeline, copy
architectural file to future file, and start from exception handler
 In-order dispatch/execution, out-of-order completion, in-order retirement  Future map used for fast access to latest register values
E
Integer add (speculative state)
Integer mul  Frontend register map
E E E E
FP mul
R W
F D
E E E E E E E E  Architectural map is used for state recovery on exceptions
E E E E E E E E ... (architectural state)
Load/store  Backend register map

Future Map in Intel Pentium 4 Reorder Buffer vs. Future Map Comparison

Boggs et al., “The


Microarchitecture of
the Pentium 4
Processor,” Intel
Technology Journal,
2001.

Many modern
processors
are similar:
- MIPS R10K
- Alpha 21264


Before We Get to Checkpointing … Checking for and Handling Exceptions in Pipelining


 Let’s cover what happens on exceptions
 And branch mispredictions  When the oldest instruction ready-to-be-retired is detected
to have caused an exception, the control logic
 Recovers architectural state (register file, IP, and memory)
 Flushes all younger instructions in the pipeline
 Saves IP and registers (as specified by the ISA)
 Redirects the fetch engine to the exception handling routine
 Vectored exceptions


Pipelining Issues: Branch Mispredictions How Fast Is State Recovery?


 A branch misprediction resembles an “exception”  Latency of state recovery affects
 Except it is not visible to software (i.e., it is microarchitectural)  Exception service latency
 Interrupt service latency
 What about branch misprediction recovery?  Latency to supply the correct data to instructions fetched after
 Similar to exception handling except can be initiated before a branch misprediction
the branch is the oldest instruction (not architectural)
 All three state recovery methods can be used  Which ones above need to be fast?

 Difference between exceptions and branch mispredictions?  How do the three state maintenance methods fare in terms
 Branch mispredictions are much more common of recovery latency?
 need fast state recovery to minimize performance impact of  Reorder buffer
mispredictions  History buffer
 Future file

Branch State Recovery Actions and Latency Can We Do Better?


 Reorder Buffer  Goal: Restore the frontend state (future file) such that the
 Flush instructions in pipeline younger than the branch correct next instruction after the branch can execute right
 Finish all instructions in the reorder buffer away after the branch misprediction is resolved

 History buffer  Idea: Checkpoint the frontend register state/map at the


 Flush instructions in pipeline younger than the branch time a branch is decoded and keep the checkpointed state
 Undo all instructions after the branch by rewinding from the updated with results of instructions older than the branch
tail of the history buffer until the branch & restoring old values  Upon branch misprediction, restore the checkpoint associated
one by one into the register file with the branch

 Future file  Hwu and Patt, “Checkpoint Repair for Out-of-order


 Wait until branch is the oldest instruction in the machine Execution Machines,” ISCA 1987.
 Copy arch. reg. file to future file
 Flush entire pipeline

Checkpointing

 When a branch is decoded
 Make a copy of the future file/map and associate it with the branch
 …
 When an instruction produces a register value
 All future file/map checkpoints that are younger than the instruction are updated with the value
 When a branch misprediction is detected
 Restore the checkpointed future file/map for the mispredicted branch when the branch misprediction is resolved
 …
 Flush instructions in pipeline younger than the branch
 Deallocate checkpoints younger than the branch

Checkpointing

 Advantages
 Correct frontend register state available right after checkpoint restoration
 Low state recovery latency
 Disadvantages
 Storage overhead
 Complexity in managing checkpoints

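A sketch of the checkpointing mechanism above using a value-based future file; sequence numbers stand in for real age tracking, and all names and sizes are made up for illustration.

```python
class CheckpointedFutureFile:
    """Future file plus per-branch checkpoints that are kept consistent
    with results of instructions older than the checkpointed branch."""

    def __init__(self, num_regs=16):
        self.future = [0] * num_regs
        self.checkpoints = {}                  # branch_seq -> file snapshot

    def on_branch_decoded(self, branch_seq):
        self.checkpoints[branch_seq] = list(self.future)

    def on_result(self, producer_seq, reg, value):
        self.future[reg] = value
        # Update every checkpoint younger than the producing instruction.
        for branch_seq, snapshot in self.checkpoints.items():
            if branch_seq > producer_seq:
                snapshot[reg] = value

    def on_misprediction(self, branch_seq):
        # Restore the mispredicted branch's checkpoint and deallocate it
        # along with all younger checkpoints (younger insts are flushed).
        self.future = list(self.checkpoints[branch_seq])
        self.checkpoints = {b: s for b, s in self.checkpoints.items()
                            if b < branch_seq}
```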

Many Modern Processors Use Checkpointing Summary: Maintaining Precise State


 MIPS R10000  Reorder buffer
 Alpha 21264
 Pentium 4  History buffer

 Yeager, “The MIPS R10000 Superscalar Microprocessor,”  Future register file


IEEE Micro, April 1996
 Checkpointing
 Kessler, “The Alpha 21264 Microprocessor,” IEEE Micro,
March-April 1999.  Readings
 Smith and Pleszkun, “Implementing Precise Interrupts in Pipelined
 Boggs et al., “The Microarchitecture of the Pentium 4 Processors,” IEEE Trans on Computers 1988 and ISCA 1985.
Processor,” Intel Technology Journal, 2001.  Hwu and Patt, “Checkpoint Repair for Out-of-order Execution
Machines,” ISCA 1987.

Registers versus Memory


So far, we considered mainly registers as part of state

18-447
 What about memory? Computer Architecture
Lecture 12: Out-of-Order Execution
What are the fundamental differences between registers

and memory?
(Dynamic Instruction Scheduling)
 Register dependences known statically – memory
dependences determined dynamically
 Register state is small – memory state is large
Prof. Onur Mutlu
 Register state is not visible to other threads/processors –
memory state is shared between threads/processors (in a Carnegie Mellon University
shared memory multiprocessor) Spring 2015, 2/13/2015


Agenda for Today & Next Few Lectures Reminder: Announcements


 Single-cycle Microarchitectures  Lab 3 due next Friday (Feb 20)
 Pipelined MIPS
 Multi-cycle and Microprogrammed Microarchitectures
 Competition for high performance
You can optimize both cycle time and CPI
 Pipelining 

 Document and clearly describe what you do during check-off

 Issues in Pipelining: Control & Data Dependence Handling,


State Maintenance and Recovery, …  Homework 3 due Feb 25
 A lot of questions that enable you to learn the concepts via
 Out-of-Order Execution hands-on exercise
 Remember this is all for your benefit (to learn and prepare for
 Issues in OoO Execution: Load-Store Handling, … exams)
 HWs have very little contribution to overall grade
 Alternative Approaches to Instruction Level Parallelism  Solutions to almost all questions are online anyway


Readings Specifically for Today Recap of Last Lecture


 Smith and Sohi, “The Microarchitecture of Superscalar  Issues with Multi-Cycle Execution
Processors,” Proceedings of the IEEE, 1995  Exceptions vs. Interrupts
 More advanced pipelining  Precise Exceptions/Interrupts
Why Do We Want Precise Exceptions?
 Interrupt and exception handling 

 How Do We Ensure Precise Exceptions?


 Out-of-order and superscalar execution concepts
 Reorder buffer
 History buffer
 Kessler, “The Alpha 21264 Microprocessor,” IEEE Micro  Future register file (best of both worlds)
1999.  Checkpointing
 Register renaming with a reorder buffer
 How to Handle Exceptions
 How to Handle Branch Mispredictions
 Speed of State Recovery: Recovery and Interrupt Latency
 Checkpointing
 Registers vs. Memory

Important: Register Renaming with a Reorder Buffer Review: Register Renaming Examples
 Output and anti dependencies are not true dependencies
 WHY? The same register refers to values that have nothing to
do with each other
 They exist due to lack of register ID’s (i.e. names) in
the ISA
 The register ID is renamed to the reorder buffer entry that
will hold the register’s value
 Register ID  ROB entry ID
 Architectural register ID  Physical register ID
 After renaming, ROB entry ID used to refer to the register

 This eliminates anti- and output- dependencies


 Gives the illusion that there are a large number of registers
Boggs et al., “The Microarchitecture of the Pentium 4 Processor,” Intel Technology Journal, 2001.

Review: Checkpointing Idea Review: Checkpointing


 Goal: Restore the frontend state (future file) such that the  When a branch is decoded
correct next instruction after the branch can execute right  Make a copy of the future file/map and associate it with the
away after the branch misprediction is resolved branch

 Idea: Checkpoint the frontend register state/map at the  When an instruction produces a register value
time a branch is decoded and keep the checkpointed state  All future file/map checkpoints that are younger than the
updated with results of instructions older than the branch instruction are updated with the value
 Upon branch misprediction, restore the checkpoint associated
with the branch  When a branch misprediction is detected
 Restore the checkpointed future file/map for the mispredicted
 Hwu and Patt, “Checkpoint Repair for Out-of-order branch when the branch misprediction is resolved
Execution Machines,” ISCA 1987.  Flush instructions in pipeline younger than the branch
 Deallocate checkpoints younger than the branch


Review: Registers versus Memory Maintaining Speculative Memory State: Stores


 So far, we considered mainly registers as part of state  Handling out-of-order completion of memory operations
 UNDOing a memory write more difficult than UNDOing a
 What about memory? register write. Why?
 One idea: Keep store address/data in reorder buffer
 How does a load instruction find its data?
 What are the fundamental differences between registers
 Store/write buffer: Similar to reorder buffer, but used only for
and memory? store instructions
 Register dependences known statically – memory  Program-order list of un-committed store operations
dependences determined dynamically  When store is decoded: Allocate a store buffer entry
 Register state is small – memory state is large  When store address and data become available: Record in store
 Register state is not visible to other threads/processors – buffer entry
memory state is shared between threads/processors (in a  When the store is the oldest instruction in the pipeline: Update
shared memory multiprocessor) the memory address (i.e. cache) with store data

 We will get back to this after today!
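A behavioral sketch of the store/write buffer just described (decode allocates, execute fills in address and data, the oldest store updates the cache at commit); the entry format and the load-forwarding helper are illustrative assumptions.

```python
from collections import deque

class StoreBuffer:
    """Program-order list of un-committed stores."""

    def __init__(self):
        self.stores = deque()                  # head = oldest store

    def on_store_decoded(self, seq):
        entry = {"seq": seq, "addr": None, "data": None}
        self.stores.append(entry)
        return entry

    def on_store_executed(self, entry, addr, data):
        entry["addr"], entry["data"] = addr, data

    def commit_oldest(self, memory):
        # Called when the store is the oldest instruction in the pipeline.
        st = self.stores.popleft()
        memory[st["addr"]] = st["data"]

    def forward_to_load(self, load_seq, load_addr):
        # Youngest older store to the same address supplies the data.
        # Stores with still-unknown addresses are skipped here, which
        # corresponds to an aggressive (speculative) load policy.
        for st in reversed(self.stores):
            if st["seq"] < load_seq and st["addr"] == load_addr:
                return st["data"]
        return None                            # read from the cache instead
```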



Remember: Questions to Ponder


 What is the role of the hardware vs. the software in the
order in which instructions are executed in the pipeline?
Remember:  Software based instruction scheduling  static scheduling
Hardware based instruction scheduling  dynamic scheduling
Static vs. Dynamic Scheduling

 What information does the compiler not know that makes


static scheduling difficult?
 Answer: Anything that is determined at run time
 Variable-length operation latency, memory addr, branch direction


Dynamic Instruction Scheduling


 Hardware has knowledge of dynamic events on a per-
instruction basis (i.e., at a very fine granularity)
 Cache misses Out-of-Order Execution
Branch mispredictions

 Load/store addresses
(Dynamic Instruction Scheduling)

 Wouldn’t it be nice if hardware did the scheduling of


instructions?


An In-order Pipeline Can We Do Better?


 What do the following two pieces of code have in common
Integer add
(with respect to execution in the previous design)?
E IMUL R3  R1, R2 LD R3  R1 (0)
Integer mul
ADD R3  R3, R1 ADD R3  R3, R1
E E E E
R W ADD R1  R6, R7 ADD R1  R6, R7
F D FP mul
IMUL R5  R6, R8 IMUL R5  R6, R8
E E E E E E E E
ADD R7  R9, R9 ADD R7  R9, R9
E E E E E E E E ...
Cache miss  Answer: First ADD stalls the whole pipeline!
 ADD cannot dispatch because its source registers unavailable
 Problem: A true data dependency stalls dispatch of younger  Later independent instructions cannot get executed
instructions into functional (execution) units
 Dispatch: Act of sending an instruction to a functional unit  How are the above code portions different?
 Answer: Load latency is variable (unknown until runtime)
 What does this affect? Think compiler vs. microarchitecture

Preventing Dispatch Stalls Out-of-order Execution (Dynamic Scheduling)


 Multiple ways of doing it  Idea: Move the dependent instructions out of the way of
 You have already seen at least THREE: independent ones (s.t. independent ones can execute)
 1. Fine-grained multithreading  Rest areas for dependent instructions: Reservation stations
 2. Value prediction
 3. Compile-time instruction scheduling/reordering  Monitor the source “values” of each instruction in the
 What are the disadvantages of the above three? resting area
 When all source “values” of an instruction are available,
“fire” (i.e. dispatch) the instruction
 Any other way to prevent dispatch stalls?
 Instructions dispatched in dataflow (not control-flow) order
 Actually, you have briefly seen the basic idea before
 Dataflow: fetch and “fire” an instruction when its inputs are
ready  Benefit:
 Problem: in-order dispatch (scheduling, or execution)  Latency tolerance: Allows independent instructions to execute
 Solution: out-of-order dispatch (scheduling, or execution) and complete in the presence of a long latency operation

In-order vs. Out-of-order Dispatch Enabling OoO Execution


 In order dispatch + precise exceptions: 1. Need to link the consumer of a value to the producer
IMUL R3  R1, R2
F D E E E E R W  Register renaming: Associate a “tag” with each data value
ADD R3  R3, R1
F D STALL E R W ADD R1  R6, R7 2. Need to buffer instructions until they are ready to execute
F STALL D E R W IMUL R5  R6, R8
ADD R7  R3, R5
 Insert instruction into reservation stations after renaming
F D E E E E E R W
3. Instructions need to keep track of readiness of source values
F D STALL E R W
 Broadcast the “tag” when the value is produced
 Out-of-order dispatch + precise exceptions:  Instructions compare their “source tags” to the broadcast tag
 if match, source value becomes ready
F D E E E E R W
F D WAIT E R W 4. When all source values of an instruction are ready, need to
This slide is actually correct
F D E R W dispatch the instruction to its functional unit (FU)
F D E E E E R W  Instruction wakes up if all sources are ready
F D WAIT E R W  If multiple instructions are awake, need to select one per FU

 16 vs. 12 cycles

Tomasulo’s Algorithm Two Humps in a Modern Pipeline


 OoO with register renaming invented by Robert Tomasulo TAG and VALUE Broadcast Bus

 Used in IBM 360/91 Floating Point Units


 Read: Tomasulo, “An Efficient Algorithm for Exploiting Multiple S
Arithmetic Units,” IBM Journal of R&D, Jan. 1967. Integer add R
C E
E
H Integer mul
O
E E E E E
 What is the major difference today? F D
D FP mul
R W
D
 Precise exceptions: IBM 360/91 did NOT have this U E E E E E E E E
E
L
 Patt, Hwu, Shebanow, “HPS, a new microarchitecture: rationale and
E E E E E E E E E ... R
introduction,” MICRO 1985. Load/store
 Patt et al., “Critical issues regarding HPS, a high performance
in order out of order in order
microarchitecture,” MICRO 1985.
 Hump 1: Reservation stations (scheduling window)
 Variants are used in most high-performance processors  Hump 2: Reordering (reorder buffer, aka instruction window
 Initially in Intel Pentium Pro, AMD K5 or active window)
 Alpha 21264, MIPS R10000, IBM POWER5, IBM z196, Oracle UltraSPARC T4, ARM Cortex A15

General Organization of an OOO Processor Tomasulo’s Machine: IBM 360/91

FP registers
from memory from instruction unit

load
buffers store buffers

operation bus

reservation
stations to memory
FP FU FP FU

Common data bus


 Smith and Sohi, “The Microarchitecture of Superscalar Processors,” Proc. IEEE, Dec.
1995.

Register Renaming Tomasulo’s Algorithm: Renaming


 Output and anti dependencies are not true dependencies  Register rename table (register alias table)
 WHY? The same register refers to values that have nothing to
do with each other tag value valid?

 They exist because not enough register ID’s (i.e. R0 1


names) in the ISA R1 1
 The register ID is renamed to the reservation station entry R2 1
that will hold the register’s value R3 1
 Register ID  RS entry ID R4 1
 Architectural register ID  Physical register ID R5 1

 After renaming, RS entry ID used to refer to the register R6 1


R7 1
R8 1
 This eliminates anti- and output- dependencies
R9 1
 Approximates the performance effect of a large number of
registers even though ISA has a small number

Tomasulo’s Algorithm An Exercise


 If reservation station available before renaming MUL R3  R1, R2
 Instruction + renamed operands (source value/tag) inserted into the ADD R5  R3, R4
reservation station ADD R7  R2, R6 F D E W
 Only rename if reservation station is available ADD R10  R8, R9
 Else stall MUL R11  R7, R10
 While in reservation station, each instruction: ADD R5  R5, R11
 Watches common data bus (CDB) for tag of its sources
 When tag seen, grab value for the source and keep it in the reservation station
 When both operands available, instruction ready to be dispatched  Assume ADD (4 cycle execute), MUL (6 cycle execute)
 Dispatch instruction to the Functional Unit when instruction is ready
 After instruction finishes in the Functional Unit  Assume one adder and one multiplier
 Arbitrate for CDB  How many cycles
 Put tagged value onto CDB (tag broadcast)
 Register file is connected to the CDB  in a non-pipelined machine
 Register contains a tag indicating the latest writer to the register  in an in-order-dispatch pipelined machine with imprecise
 If the tag in the register file matches the broadcast tag, write broadcast value
into register (and set valid bit) exceptions (no forwarding and full forwarding)
 Reclaim rename tag  in an out-of-order dispatch pipelined machine imprecise
 no valid copy of tag in system!
exceptions (full forwarding)
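To tie the pieces of Tomasulo's algorithm above together, here is a deliberately small Python sketch of renaming through tags, reservation stations, and a single common data bus. It ignores reservation-station capacity, the separate adder/multiplier structural constraint of the exercise, and precise exceptions; all class and method names are invented for the illustration.

```python
class RSEntry:
    def __init__(self, tag, op, latency):
        self.tag, self.op, self.latency = tag, op, latency
        self.src = []            # per source: [ready?, value or producer tag]
        self.remaining = None    # cycles left once dispatched

def alu(op, vals):
    # Placeholder functional unit for the sketch.
    return vals[0] + vals[1] if op == "ADD" else vals[0] * vals[1]

class TomasuloCore:
    def __init__(self, num_regs=16):
        self.reg_value = [0] * num_regs
        self.reg_tag = [None] * num_regs     # None => register value valid
        self.rs = []
        self.next_tag = 0

    def issue(self, op, srcs, dest, latency):
        e = RSEntry(f"T{self.next_tag}", op, latency)
        self.next_tag += 1
        for r in srcs:                       # read sources: value or tag
            if self.reg_tag[r] is None:
                e.src.append([True, self.reg_value[r]])
            else:
                e.src.append([False, self.reg_tag[r]])
        self.reg_tag[dest] = e.tag           # rename the destination
        self.rs.append(e)

    def cycle(self):
        # Wakeup/select: dispatch entries whose sources are all ready.
        for e in self.rs:
            if e.remaining is None and all(ok for ok, _ in e.src):
                e.remaining = e.latency
        # Execute; at most one finished instruction wins the CDB per cycle.
        winner = None
        for e in self.rs:
            if e.remaining is not None and e.remaining > 0:
                e.remaining -= 1
            if e.remaining == 0 and winner is None:
                winner = e
        if winner is None:
            return
        result = alu(winner.op, [v for _, v in winner.src])
        # CDB broadcast: waiting sources and the register file capture it.
        for e in self.rs:
            for s in e.src:
                if not s[0] and s[1] == winner.tag:
                    s[0], s[1] = True, result
        for r, tag in enumerate(self.reg_tag):
            if tag == winner.tag:
                self.reg_value[r], self.reg_tag[r] = result, None
        self.rs.remove(winner)

core = TomasuloCore()
core.issue("MUL", [1, 2], 3, latency=6)      # MUL R3 <- R1, R2
core.issue("ADD", [3, 4], 5, latency=4)      # ADD R5 <- R3, R4 (waits on MUL)
core.issue("ADD", [2, 6], 7, latency=4)      # ADD R7 <- R2, R6 (independent)
for _ in range(20):
    core.cycle()
```

Note how the second ADD, being independent, dispatches and broadcasts on the CDB before the dependent ADD even wakes up, which is exactly the out-of-order dispatch behavior the exercise is meant to expose.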

Exercise Continued Exercise Continued


Exercise Continued How It Works


MUL R3  R1, R2
ADD R5  R3, R4
ADD R7  R2, R6
ADD R10  R8, R9
MUL R11  R7, R10
ADD R5  R5, R11


Cycle 0 Cycle 2


Cycle 4

Cycle 3


Cycle 7 Cycle 8


Some Questions An Exercise, with Precise Exceptions


 What is needed in hardware to perform tag broadcast and MUL R3  R1, R2
ADD R5  R3, R4
value capture? ADD R7  R2, R6 F D E R W
 make a value valid ADD R10  R8, R9
MUL R11  R7, R10
 wake up an instruction ADD R5  R5, R11

 Does the tag have to be the ID of the Reservation Station  Assume ADD (4 cycle execute), MUL (6 cycle execute)
Entry?
 Assume one adder and one multiplier
 How many cycles
 What can potentially become the critical path?
 in a non-pipelined machine
 Tag broadcast  value capture  instruction wake up
 in an in-order-dispatch pipelined machine with reorder buffer
(no forwarding and full forwarding)
 How can you reduce the potential critical paths?  in an out-of-order dispatch pipelined machine with reorder
buffer (full forwarding)

Out-of-Order Execution with Precise Exceptions Out-of-Order Execution with Precise Exceptions
 Idea: Use a reorder buffer to reorder instructions before TAG and VALUE Broadcast Bus

committing them to architectural state


S R
Integer add
An instruction updates the register alias table (essentially a C E
 E
H Integer mul
future file) when it completes execution E E E E E
O
F D R W
 An instruction updates the architectural register file when it is D
E E E E E E E E
FP mul
D
U
the oldest in the machine and has completed execution L
E
E E E E E E E E E ... R
Load/store

in order out of order in order

 Hump 1: Reservation stations (scheduling window)


 Hump 2: Reordering (reorder buffer, aka instruction window
or active window)

Modern OoO Execution w/ Precise Exceptions An Example from Modern Processors


 Most modern processors use
 Reorder buffer to support in-order retirement of instructions
 A single register file to store registers (speculative and
architectural) – INT and FP are still separate
 Future register map  used for renaming
 Architectural register map  used for state recovery

Boggs et al., “The Microarchitecture of the Pentium 4 Processor,” Intel Technology Journal, 2001.

Enabling OoO Execution, Revisited Summary of OOO Execution Concepts


1. Link the consumer of a value to the producer  Register renaming eliminates false dependencies, enables
 Register renaming: Associate a “tag” with each data value linking of producer to consumers

2. Buffer instructions until they are ready


 Buffering enables the pipeline to move for independent ops
 Insert instruction into reservation stations after renaming

3. Keep track of readiness of source values of an instruction  Tag broadcast enables communication (of readiness of
 Broadcast the “tag” when the value is produced produced value) between instructions
 Instructions compare their “source tags” to the broadcast tag
 if match, source value becomes ready  Wakeup and select enables out-of-order dispatch
4. When all source values of an instruction are ready, dispatch
the instruction to functional unit (FU)
 Wakeup and select/schedule the instruction


OOO Execution: Restricted Dataflow Dataflow Graph for Our Example


 An out-of-order engine dynamically builds the dataflow
graph of a piece of the program
 which piece? MUL R3  R1, R2
ADD R5  R3, R4
ADD R7  R2, R6
 The dataflow graph is limited to the instruction window ADD R10  R8, R9
 Instruction window: all decoded but not yet retired MUL R11  R7, R10
instructions ADD R5  R5, R11

 Can we do it for the whole program?


 Why would we like to?
 In other words, how can we have a large instruction
window?
 Can we do it efficiently with Tomasulo’s algorithm?

State of RAT and RS in Cycle 7 Dataflow Graph


In-Class Exercise on Tomasulo In-Class Exercise on Tomasulo


In-Class Exercise on Tomasulo In-Class Exercise on Tomasulo


In-Class Exercise on Tomasulo In-Class Exercise on Tomasulo


In-Class Exercise on Tomasulo In-Class Exercise on Tomasulo


In-Class Exercise on Tomasulo In-Class Exercise on Tomasulo


In-Class Exercise on Tomasulo In-Class Exercise on Tomasulo


Tomasulo Template
18-447
Computer Architecture
Lecture 13: Out-of-Order Execution
and Data Flow

Prof. Onur Mutlu


Carnegie Mellon University
Spring 2015, 2/16/2015


Agenda for Today & Next Few Lectures Readings Specifically for Today
 Single-cycle Microarchitectures  Smith and Sohi, “The Microarchitecture of Superscalar
Processors,” Proceedings of the IEEE, 1995
 Multi-cycle and Microprogrammed Microarchitectures  More advanced pipelining
 Interrupt and exception handling
 Pipelining  Out-of-order and superscalar execution concepts

 Issues in Pipelining: Control & Data Dependence Handling,


State Maintenance and Recovery, …  Kessler, “The Alpha 21264 Microprocessor,” IEEE Micro
1999.
 Out-of-Order Execution

 Issues in OoO Execution: Load-Store Handling, …

 Alternative Approaches to Instruction Level Parallelism



Readings for Next Lecture Recap of Last Lecture


 SIMD Processing  Maintaining Speculative Memory State (Ld/St Ordering)
 Basic GPU Architecture  Out of Order Execution (Dynamic Scheduling)
 Link Dependent Instructions: Renaming
 Other execution models: VLIW, DAE, Systolic Arrays  Buffer Instructions: Reservation Stations
 Track Readiness of Source Values: Tag (and Value) Broadcast
 Schedule/Dispatch: Wakeup and Select
 Tomasulo’s Algorithm
 OoO Execution Exercise with Code Example: Cycle by Cycle
 Lindholm et al., "NVIDIA Tesla: A Unified Graphics and  OoO Execution with Precise Exceptions
Computing Architecture," IEEE Micro 2008.  Questions on OoO Implementation
 Fatahalian and Houston, “A Closer Look at GPUs,” CACM  Where data is stored? Single physical register file vs. reservation stations
2008.  Critical path, renaming IDs, …
 OoO Execution as Restricted Data Flow
 Reverse Engineering the Data Flow Graph


Review: In-order vs. Out-of-order Dispatch Review: Out-of-Order Execution with Precise Exceptions
 In order dispatch + precise exceptions: TAG and VALUE Broadcast Bus

IMUL R3  R1, R2
F D E E E E R W
ADD R3  R3, R1
F D STALL E R W ADD R1  R6, R7 S
Integer add R
F STALL D E R W IMUL R5  R6, R8 C E
E
ADD R7  R3, R5 H Integer mul
F D E E E E E R W O
E E E E E
F D R W
F D STALL E R W D FP mul
E E E E E E E E D
U E
 Out-of-order dispatch + precise exceptions: L
... R
E E E E E E E E E
F D E E E E R W Load/store

F D WAIT E R W This slide is actually correct in order out of order in order


F D E R W
F D E E E E R W
 Hump 1: Reservation stations (scheduling window)
F D WAIT E R W  Hump 2: Reordering (reorder buffer, aka instruction window
or active window)
 16 vs. 12 cycles

Review: Enabling OoO Execution, Revisited Review: Summary of OOO Execution Concepts
1. Link the consumer of a value to the producer  Register renaming eliminates false dependencies, enables
 Register renaming: Associate a “tag” with each data value linking of producer to consumers

2. Buffer instructions until they are ready


 Buffering enables the pipeline to move for independent ops
 Insert instruction into reservation stations after renaming

3. Keep track of readiness of source values of an instruction  Tag broadcast enables communication (of readiness of
 Broadcast the “tag” when the value is produced produced value) between instructions
 Instructions compare their “source tags” to the broadcast tag
 if match, source value becomes ready  Wakeup and select enables out-of-order dispatch
4. When all source values of an instruction are ready, dispatch
the instruction to functional unit (FU)
 Wakeup and select/schedule the instruction


Review: Our Example Review: State of RAT and RS in Cycle 7

MUL R3  R1, R2
ADD R5  R3, R4
ADD R7  R2, R6
ADD R10  R8, R9
MUL R11  R7, R10
ADD R5  R5, R11

All our in-class drawings are at:


http://www.ece.cmu.edu/~ece447/s15/lib/exe/fetch.php?media=447_tomasulo.pdf

Review: Corresponding Dataflow Graph Restricted Data Flow


 An out-of-order machine is a “restricted data flow” machine
 Dataflow-based execution is restricted to the microarchitecture
level
 ISA is still based on von Neumann model (sequential
execution)

 Remember the data flow model (at the ISA level):


 Dataflow model: An instruction is fetched and executed in
data flow order
 i.e., when its operands are ready
 i.e., there is no instruction pointer
 Instruction ordering specified by data flow dependence
 Each instruction specifies “who” should receive the result
 An instruction can “fire” whenever all operands are received

Review: OOO Execution: Restricted Questions to Ponder


Dataflow
 An out-of-order engine dynamically builds the dataflow  Why is OoO execution beneficial?
graph of a piece of the program  What if all operations take single cycle?
 which piece?  Latency tolerance: OoO execution tolerates the latency of
multi-cycle operations by executing independent operations
 The dataflow graph is limited to the instruction window concurrently
 Instruction window: all decoded but not yet retired
instructions  What if an instruction takes 500 cycles?
 How large of an instruction window do we need to continue
 Can we do it for the whole program? decoding?
 Why would we like to?  How many cycles of latency can OoO tolerate?
 In other words, how can we have a large instruction  What limits the latency tolerance scalability of Tomasulo’s
algorithm?
window?
 Active/instruction window size: determined by both scheduling
 Can we do it efficiently with Tomasulo’s algorithm? window and reorder buffer size

Registers versus Memory, Revisited Memory Dependence Handling (I)


 So far, we considered register based value communication  Need to obey memory dependences in an out-of-order
between instructions machine
 and need to do so while providing high performance
 What about memory?
 Observation and Problem: Memory address is not known until
 What are the fundamental differences between registers a load/store executes
and memory?
 Register dependences known statically – memory  Corollary 1: Renaming memory addresses is difficult
dependences determined dynamically  Corollary 2: Determining dependence or independence of
 Register state is small – memory state is large loads/stores need to be handled after their (partial) execution
 Register state is not visible to other threads/processors –  Corollary 3: When a load/store has its address ready, there
memory state is shared between threads/processors (in a may be younger/older loads/stores with undetermined
shared memory multiprocessor)
addresses in the machine

Memory Dependence Handling (II) Handling of Store-Load Dependences


 When do you schedule a load instruction in an OOO engine?  A load’s dependence status is not known until all previous store
addresses are available.
 Problem: A younger load can have its address ready before an
older store’s address is known
 How does the OOO engine detect dependence of a load instruction on a
 Known as the memory disambiguation problem or the unknown
previous store?
address problem
 Option 1: Wait until all previous stores committed (no need to check

for address match)


 Approaches  Option 2: Keep a list of pending stores in a store buffer and check

 Conservative: Stall the load until all previous stores have whether load address matches a previous store address
computed their addresses (or even retired from the machine)
 Aggressive: Assume load is independent of unknown-address  How does the OOO engine treat the scheduling of a load instruction wrt
previous stores?
stores and schedule the load right away
 Option 1: Assume load dependent on all previous stores
 Intelligent: Predict (with a more sophisticated predictor) if the
 Option 2: Assume load independent of all previous stores
load is dependent on the/any unknown address store
 Option 3: Predict the dependence of a load on an outstanding store


Memory Disambiguation (I) Memory Disambiguation (II)


 Option 1: Assume load dependent on all previous stores  Chrysos and Emer, “Memory Dependence Prediction Using Store
+ No need for recovery Sets,” ISCA 1998.
-- Too conservative: delays independent loads unnecessarily

 Option 2: Assume load independent of all previous stores


+ Simple and can be common case: no delay for independent loads
-- Requires recovery and re-execution of load and dependents on misprediction

 Option 3: Predict the dependence of a load on an


outstanding store
+ More accurate. Load store dependencies persist over time
-- Still requires recovery/re-execution on misprediction
 Predicting store-load dependencies important for performance
 Alpha 21264 : Initially assume load independent, delay loads found to be dependent
 Moshovos et al., “Dynamic speculation and synchronization of data dependences,”  Simple predictors (based on past history) can achieve most of
ISCA 1997. the potential performance
 Chrysos and Emer, “Memory Dependence Prediction Using Store Sets,” ISCA 1998.

Data Forwarding Between Stores and Loads

 We cannot update memory out of program order
 Need to buffer all store and load instructions in instruction window
 Even if we know all addresses of past stores when we generate the address of a load, two questions still remain:
 1. How do we check whether or not it is dependent on a store
 2. How do we forward data to the load if it is dependent on a store
 Modern processors use a LQ (load queue) and an SQ (store queue) for this
 Can be combined or separate between loads and stores
 A load searches the SQ after it computes its address. Why?
 A store searches the LQ after it computes its address. Why?

Food for Thought for You

 Many other design choices
 Should reservation stations be centralized or distributed across functional units?
 What are the tradeoffs?
 Should reservation stations and ROB store data values or should there be a centralized physical register file where all data values are stored?
 What are the tradeoffs?
 Exactly when does an instruction broadcast its tag?
 …

More Food for Thought for You General Organization of an OOO Processor
 How can you implement branch prediction in an out-of-
order execution machine?
 Think about branch history register and PHT updates
 Think about recovery from mispredictions
 How to do this fast?

 How can you combine superscalar execution with out-of-


order execution?
 These are different concepts
 Concurrent renaming of instructions
 Concurrent broadcast of tags

 How can you combine superscalar + out-of-order + branch  Smith and Sohi, “The Microarchitecture of Superscalar Processors,” Proc. IEEE, Dec.
prediction? 1995.

A Modern OoO Design: Intel Pentium 4 Intel Pentium 4 Simplified


Mutlu+, “Runahead Execution,”
HPCA 2003.

Boggs et al., “The Microarchitecture of the Pentium 4 Processor,” Intel Technology Journal, 2001.

Alpha 21264 MIPS R10000

Kessler, “The Alpha 21264 Microprocessor,” IEEE Micro, March-April 1999. Yeager, “The MIPS R10000 Superscalar Microprocessor,” IEEE Micro, April 1996.

IBM POWER4 IBM POWER4


 Tendler et al.,  2 cores, out-of-order execution
“POWER4 system  100-entry instruction window in each core
microarchitecture,”  8-wide instruction fetch, issue, execute
IBM J R&D, 2002.
 Large, local+global hybrid branch predictor
 1.5MB, 8-way L2 cache
 Aggressive stream based prefetching


IBM POWER5 Recommended Readings


 Kalla et al., “IBM Power5 Chip: A Dual-Core Multithreaded Processor,” IEEE  Out-of-order execution processor designs
Micro 2004.

 Kessler, “The Alpha 21264 Microprocessor,” IEEE Micro,


March-April 1999.

 Boggs et al., “The Microarchitecture of the Pentium 4


Processor,” Intel Technology Journal, 2001.

 Yeager, “The MIPS R10000 Superscalar Microprocessor,”


IEEE Micro, April 1996

 Tendler et al., “POWER4 system microarchitecture,” IBM


Journal of Research and Development, January 2002.

And More Readings…


 Stark et al., “On Pipelining Dynamic Scheduling Logic,”
MICRO 2000.
Other Approaches to Concurrency
 Brown et al., “Select-free Instruction Scheduling Logic,”
MICRO 2001.
(or Instruction Level Parallelism)
 Palacharla et al., “Complexity-effective Superscalar
Processors,” ISCA 1997.


Approaches to (Instruction-Level) Concurrency


 Pipelining
Out-of-order execution

 Dataflow (at the ISA level)


Data Flow:
 SIMD Processing (Vector and array processors, GPUs) Exploiting Irregular Parallelism
 VLIW
 Decoupled Access Execute
 Systolic Arrays


Remember: State of RAT and RS in Cycle 7 Remember: Dataflow Graph


Review: More on Data Flow Data Flow Nodes


 In a data flow machine, a program consists of data flow
nodes
 A data flow node fires (fetched and executed) when all it
inputs are ready
 i.e. when all inputs have tokens

 Data flow node and its ISA representation


Dataflow Nodes (II) Dataflow Graphs

 A small set of dataflow operators can be used to {x = a + b;


define a general programming language y=b*7 a b
in
(x-y) * (x+y)}
1 + 2 *7
Fork Primitive Ops Switch Merge
 Values in dataflow graphs are
represented as tokens x
T T
T F
y
+ T F token < ip , p , v >
3 - 4 +
 instruction ptr port data

 An operator executes when all its


input tokens are present; copies of
the result token are distributed to 5 *
T T
T F
+ T F the destination operators

no separate control flow

Example Data Flow Program Control Flow vs. Data Flow

OUT


Data Flow Characteristics

 Data-driven execution of instruction-level graphical code
 Nodes are operators
 Arcs are data (I/O)
 As opposed to control-driven execution
 Only real dependencies constrain processing
 No sequential instruction stream
 No program counter
 Execution triggered by the presence/readiness of data
 Operations execute asynchronously

What About Loops and Function Calls?

 Problem: Multiple dynamic instances can be active for the same instruction (i.e., due to loop iteration or invocation of a function from different locations)
 IP is not enough to distinguish between these different dynamic instances of the same static instruction
 token < ip , p , v >  (instruction ptr, port, data)
 Solution: Distinguish between different instances by creating new tags/frames (at the beginning of a new iteration or call)
 a tagged token <fp, ip, port, data>
 fp: frame pointer (tag or context ID), ip: instruction pointer

An Example Frame and Execution


Monsoon Dataflow Processor [ISCA 1990]
1 + 1 3L, 4L Program a b
2 * 2 3R, 4R
op r d1,d2 Instruction
3 - 3 5L 1 2 ip
+ *7 Fetch
4 + 4 5R Code
5 * 5 out x
<fp, ip, p , v> y fp+r Operand
Token
Fetch
3 - 4 + Frames Queue
1
2 ALU

3 L 7 5 *
4 Form
Frame Token
5
Need to provide storage for only one operand/operator
Network Network

A Dataflow Processor MIT Tagged Token Data Flow Architecture


 Wait−Match Unit: try to match the incoming token (and its context id) with a waiting token that has the same instruction address
 Success: Both tokens forwarded, fetch instruction
 Fail: Incoming token stored in Waiting Token Memory, bubble inserted
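A toy version of the wait-match step for two-input dataflow operators, keying tokens by (frame/context id, instruction address). The token format follows the tagged-token <fp, ip, port, data> idea from the earlier slide; the function name and dictionary layout are made up for illustration.

```python
def wait_match(incoming, waiting_token_memory):
    """Return ('fire', left, right) when both operands of an instruction
    have arrived, else ('bubble',) after storing the incoming token."""
    key = (incoming["fp"], incoming["ip"])        # frame/context + inst addr
    partner = waiting_token_memory.pop(key, None)
    if partner is not None:
        return ("fire", incoming, partner)        # fetch and fire the node
    waiting_token_memory[key] = incoming          # wait for the other operand
    return ("bubble",)


wtm = {}
print(wait_match({"fp": 0, "ip": 3, "port": "L", "data": 7}, wtm))   # bubble
print(wait_match({"fp": 0, "ip": 3, "port": "R", "data": 5}, wtm))   # fire
```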


TTDA Data Flow Example TTDA Data Flow Example


TTDA Data Flow Example Manchester Data Flow Machine

 Matching Store: Pairs


together tokens
destined for the same
instruction
 Large data set 
overflow in overflow
unit
 Paired tokens fetch the
appropriate instruction
from the node store


Data Flow Advantages/Disadvantages Combining Data Flow and Control Flow


 Advantages
 Can we get the best of both worlds?
 Very good at exploiting irregular parallelism
 Only real dependencies constrain processing
 Two possibilities
 Disadvantages
 Debugging difficult (no precise state)  Model 1: Keep control flow at the ISA level, do dataflow
 Interrupt/exception handling is difficult (what is precise state underneath, preserving sequential semantics
semantics?)
 Implementing dynamic data structures difficult in pure data  Model 2: Keep dataflow model, but incorporate some control
flow models flow at the ISA level to improve efficiency, exploit locality, and
 Too much parallelism? (Parallelism control needed) ease resource management
 High bookkeeping overhead (tag matching, data storage)  Incorporate threads into dataflow: statically ordered instructions;
when the first instruction is fired, the remaining instructions
 Instruction cycle is inefficient (delay between dependent execute without interruption
instructions), memory locality is not exploited


Data Flow Summary Further Reading on Data Flow


 Availability of data determines order of execution  ISA level dataflow
 A data flow node fires when its sources are ready  Gurd et al., “The Manchester prototype dataflow computer,”
 Programs represented as data flow graphs (of nodes) CACM 1985.

 Data Flow at the ISA level has not been (as) successful  Microarchitecture-level dataflow:
 Patt, Hwu, Shebanow, “HPS, a new microarchitecture: rationale
and introduction,” MICRO 1985.
 Data Flow implementations under the hood (while  Patt et al., “Critical issues regarding HPS, a high performance
preserving sequential ISA semantics) have been very microarchitecture,” MICRO 1985.
successful  Hwu and Patt, “HPSm, a high performance restricted data
 Out of order execution flow architecture having minimal functionality,” ISCA 1986.
 Hwu and Patt, “HPSm, a high performance restricted data flow
architecture having minimal functionality,” ISCA 1986.


Agenda for Today & Next Few Lectures


18-447  Single-cycle Microarchitectures

Computer Architecture  Multi-cycle and Microprogrammed Microarchitectures


Lecture 14: SIMD Processing  Pipelining
(Vector and Array Processors)
 Issues in Pipelining: Control & Data Dependence Handling,
State Maintenance and Recovery, …

 Out-of-Order Execution
Prof. Onur Mutlu
Carnegie Mellon University  Issues in OoO Execution: Load-Store Handling, …
Spring 2015, 2/18/2015
 Alternative Approaches to Instruction Level Parallelism

Approaches to (Instruction-Level) Concurrency Readings for Today


 Pipelining  Lindholm et al., "NVIDIA Tesla: A Unified Graphics and
 Out-of-order execution Computing Architecture," IEEE Micro 2008.
 Dataflow (at the ISA level)
 SIMD Processing (Vector and array processors, GPUs)  Fatahalian and Houston, “A Closer Look at GPUs,” CACM
 VLIW 2008.
 Decoupled Access Execute
 Systolic Arrays


Recap of Last Lecture Reminder: Intel Pentium 4 Simplified


Mutlu+, “Runahead Execution,”
 OoO Execution as Restricted Data Flow HPCA 2003.
 Memory Disambiguation or Unknown Address Problem
 Memory Dependence Handling
 Conservative, Aggressive, Intelligent Approaches
 Load Store Queues
 Design Choices in an OoO Processor
 Combining OoO+Superscalar+Branch Prediction
 Example OoO Processor Designs

 Data Flow (at the ISA level) Approach to Concurrency


 Characteristics
 Supporting dynamic instances of a node: Tagging, Context IDs, Frames
 Example Operation
 Advantages and Disadvantages
 Combining Data Flow and Control Flow: Getting the Best of Both Worlds


Reminder: Alpha 21264

Review: Data Flow:


Exploiting Irregular Parallelism

Kessler, “The Alpha 21264 Microprocessor,” IEEE Micro, March-April 1999.

Review: Pure Data Flow Pros and Cons Review: Combining Data Flow and Control Flow
 Advantages
 Can we get the best of both worlds?
 Very good at exploiting irregular parallelism
 Only real dependencies constrain processing
 Two possibilities
 Disadvantages
 Debugging difficult (no precise state)  Model 1: Keep control flow at the ISA level, do dataflow
 Interrupt/exception handling is difficult (what is precise state underneath, preserving sequential semantics
semantics?)
 Implementing dynamic data structures difficult in pure data  Model 2: Keep dataflow model, but incorporate some control
flow models flow at the ISA level to improve efficiency, exploit locality, and
 Too much parallelism? (Parallelism control needed) ease resource management
 High bookkeeping overhead (tag matching, data storage)  Incorporate threads into dataflow: statically ordered instructions;
when the first instruction is fired, the remaining instructions
 Instruction cycle is inefficient (delay between dependent execute without interruption in control flow order (e.g., one can
instructions), memory locality is not exploited pipeline them)

783 784

Review: Data Flow Summary Approaches to (Instruction-Level) Concurrency


 Data Flow at the ISA level has not been (as) successful  Pipelining
 Out-of-order execution
 Data Flow implementations under the hood (while  Dataflow (at the ISA level)
preserving sequential ISA semantics) have been very  SIMD Processing (Vector and array processors, GPUs)
successful  VLIW
 Out of order execution
 Decoupled Access Execute
 Systolic Arrays

785 786

Flynn’s Taxonomy of Computers


 Mike Flynn, “Very High-Speed Computing Systems,” Proc.
of IEEE, 1966
SIMD Processing:
Exploiting Regular (Data) Parallelism  SISD: Single instruction operates on single data element
 SIMD: Single instruction operates on multiple data elements
 Array processor
 Vector processor
 MISD: Multiple instructions operate on single data element
 Closest form: systolic array processor, streaming processor
 MIMD: Multiple instructions operate on multiple data
elements (multiple instruction streams)
 Multiprocessor
 Multithreaded processor
788

Data Parallelism SIMD Processing


 Concurrency arises from performing the same operations  Single instruction operates on multiple data elements
on different pieces of data  In time or in space
 Single instruction multiple data (SIMD)  Multiple processing elements
 E.g., dot product of two vectors

 Contrast with data flow  Time-space duality


 Concurrency arises from executing different operations in parallel (in
a data driven manner)
 Array processor: Instruction operates on multiple data
elements at the same time using different spaces
 Contrast with thread (“control”) parallelism
 Concurrency arises from executing different threads of control in
parallel  Vector processor: Instruction operates on multiple data
elements in consecutive time steps using the same space
 SIMD exploits instruction-level parallelism
 Multiple “instructions” (more appropriately, operations) are
concurrent: instructions happen to be the same
789 790

Array vs. Vector Processors SIMD Array Processing vs. VLIW


 VLIW: Multiple independent operations packed together by the compiler
ARRAY PROCESSOR VECTOR PROCESSOR

Instruction Stream Same op @ same time


Different ops @ time
LD VR  A[3:0] LD0 LD1 LD2 LD3 LD0
ADD VR  VR, 1 AD0 AD1 AD2 AD3 LD1 AD0
MUL VR  VR, 2
ST A[3:0]  VR MU0 MU1 MU2 MU3 LD2 AD1 MU0
ST0 ST1 ST2 ST3 LD3 AD2 MU1 ST0
Different ops @ same space AD3 MU2 ST1
MU3 ST2
Time Same op @ space ST3

Space Space

791 792

SIMD Array Processing vs. VLIW Vector Processors


 Array processor: Single operation on multiple (different) data elements  A vector is a one-dimensional array of numbers
 Many scientific/commercial programs use vectors
for (i = 0; i<=49; i++)
C[i] = (A[i] + B[i]) / 2

 A vector processor is one whose instructions operate on


vectors rather than scalar (single data) values
 Basic requirements
 Need to load/store vectors  vector registers (contain vectors)
 Need to operate on vectors of different lengths  vector length
register (VLEN)
 Elements of a vector might be stored apart from each other in
memory  vector stride register (VSTR)
 Stride: distance between two elements of a vector
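To make the VLEN/VSTR semantics concrete, here is a minimal C sketch of what a strided vector load does element by element (vector_load, vlen, and vstr are illustrative names, not part of any particular ISA):

#include <stddef.h>

/* VLD-style load: element i of the vector register comes from base[i * vstr];
   vstr = 1 loads consecutive elements, larger strides skip through memory. */
void vector_load(double *vreg, const double *base, size_t vlen, size_t vstr)
{
    for (size_t i = 0; i < vlen; i++)
        vreg[i] = base[i * vstr];
}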

793 794

Vector Processors (II) Vector Processor Advantages


 A vector instruction performs an operation on each element + No dependencies within a vector
in consecutive cycles  Pipelining, parallelization work well
 Vector functional units are pipelined  Can have very deep pipelines, no dependencies!
 Each pipeline stage operates on a different data element
+ Each instruction generates a lot of work
 Vector instructions allow deeper pipelines  Reduces instruction fetch bandwidth requirements
 No intra-vector dependencies  no hardware interlocking
within a vector + Highly regular memory access pattern
 No control flow within a vector  Can interleave vector data elements across multiple memory banks for
 Known stride allows prefetching of vectors into higher memory bandwidth (to tolerate memory bank access latency)
registers/cache/memory  Prefetching a vector is relatively easy

+ No need to explicitly code loops


 Fewer branches in the instruction sequence
795 796

Vector Processor Disadvantages Vector Processor Limitations


-- Works (only) if parallelism is regular (data/SIMD parallelism) -- Memory (bandwidth) can easily become a bottleneck,
++ Vector operations especially if
-- Very inefficient if parallelism is irregular 1. compute/memory operation balance is not maintained
-- How about searching for a key in a linked list? 2. data is not mapped appropriately to memory banks

Fisher, “Very Long Instruction Word architectures and the ELI-512,” ISCA 1983. 797 798

Vector Registers
 Each vector data register holds N M-bit values
 Vector control registers: VLEN, VSTR, VMASK
Vector Processing in More Depth  Maximum VLEN can be N
 Maximum number of elements stored in a vector register
 Vector Mask Register (VMASK)
 Indicates which elements of vector to operate on

 Set by vector test instructions

 e.g., VMASK[i] = (Vk[i] == 0)


M-bit wide M-bit wide
V0,0 V1,0
V0,1 V1,1

V0,N-1 V1,N-1

800

Vector Functional Units Vector Machine Organization (CRAY-1)


 Use deep pipeline to execute  CRAY-1
element operations  Russell, “The CRAY-1
V V V
 fast clock cycle computer system,”
1 2 3
CACM 1978.
 Control of deep pipeline is
simple because elements in  Scalar and vector modes
vector are independent  8 64-element vector
registers
 64 bits per element
Six stage multiply pipeline
 16 memory banks
 8 64-bit scalar registers
 8 24-bit address registers
V1 * V2  V3

Slide credit: Krste Asanovic 801 802

Loading/Storing Vectors from/to Memory Memory Banking


 Requires loading/storing multiple elements  Memory is divided into banks that can be accessed independently;
banks share address and data buses (to minimize pin cost)
 Elements separated from each other by a constant distance  Can start and complete one bank access per cycle
(stride)  Can sustain N parallel accesses if all N go to different banks
 Assume stride = 1 for now Bank Bank Bank Bank
0 1 2 15
 Elements can be loaded in consecutive cycles if we can
start the load of one element per cycle MDR MAR MDR MAR MDR MAR MDR MAR
 Can sustain a throughput of one element per cycle
Data bus
 Question: How do we achieve this with a memory that
takes more than 1 cycle to access? Address bus
 Answer: Bank the memory; interleave the elements across
banks CPU
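A small C sketch of the word-interleaved mapping assumed here (NUM_BANKS and the modulo mapping are the usual illustration, not a specific machine): consecutive elements land in consecutive banks, so a new bank can be started every cycle while earlier banks are still busy, which is what sustains one element per cycle once the number of banks is at least the bank access latency.

#include <stdio.h>

#define NUM_BANKS 16              /* >= 11-cycle bank latency in the running example */

static int  bank_of(long i)      { return (int)(i % NUM_BANKS); }
static long row_in_bank(long i)  { return i / NUM_BANKS; }

int main(void)
{
    /* Stride-1 accesses touch banks 0, 1, 2, ..., 15, 0, 1, ... in order. */
    for (long i = 0; i < 20; i++)
        printf("element %2ld -> bank %2d, row %ld\n", i, bank_of(i), row_in_bank(i));
    return 0;
}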
803 Picture credit: Derek Chiou 804

Vector Memory System Scalar Code Example


 Next address = Previous address + Stride  For I = 0 to 49
 If stride = 1 & consecutive elements interleaved across  C[i] = (A[i] + B[i]) / 2
banks & number of banks >= bank latency, then can
sustain 1 element/cycle throughput  Scalar code (instruction and its latency)
(Figure: Vector Memory System — Base and Stride registers drive an Address Generator that indexes Memory Banks 0 through F into the Vector Registers)

MOVI R0 = 50            1
MOVA R1 = A             1
MOVA R2 = B             1
MOVA R3 = C             1
X: LD R4 = MEM[R1++]    11   ;autoincrement addressing
LD R5 = MEM[R2++]       11
ADD R6 = R4 + R5        4
SHFR R7 = R6 >> 1       1
ST MEM[R3++] = R7       11
DECBNZ --R0, X          2    ;decrement and branch if NZ
304 dynamic instructions
Picture credit: Krste Asanovic 805 806

Scalar Code Execution Time (In Order) Vectorizable Loops


 Scalar execution time on an in-order processor with 1 bank  A loop is vectorizable if each iteration is independent of any
 First two loads in the loop cannot be pipelined: 2*11 cycles other
 4 + 50*40 = 2004 cycles

 For I = 0 to 49
 Scalar execution time on an in-order processor with 16  C[i] = (A[i] + B[i]) / 2
banks (word-interleaved: consecutive words are stored in
 Vectorized loop (each instruction and its latency):
consecutive banks)
MOVI VLEN = 50 1
 First two loads in the loop can be pipelined 7 dynamic instructions
MOVI VSTR = 1 1
 4 + 50*30 = 1504 cycles
VLD V0 = A 11 + VLN - 1
VLD V1 = B 11 + VLN – 1
 Why 16 banks?
VADD V2 = V0 + V1 4 + VLN - 1
 11 cycle memory access latency
VSHFR V3 = V2 >> 1 1 + VLN - 1
 Having 16 (>11) banks ensures there are enough banks to
VST C = V3 11 + VLN – 1
overlap enough memory operations to cover memory latency
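One way to see where the two scalar totals come from, using the per-instruction latencies listed above (a back-of-the-envelope reading, assuming no other overlap): with 1 bank, each iteration serializes as 11 + 11 (the two loads) + 4 (ADD) + 1 (SHFR) + 11 (ST) + 2 (DECBNZ) = 40 cycles, and the four initial moves cost 4 cycles, giving 4 + 50*40 = 2004. With 16 banks, the second load can start one cycle after the first, so the pair costs about 12 cycles and the iteration becomes roughly 12 + 4 + 1 + 11 + 2 = 30 cycles, giving 4 + 50*30 = 1504.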
807 808

Basic Vector Code Performance Vector Chaining


 Assume no chaining (no vector data forwarding)  Vector chaining: Data forwarding from one vector
 i.e., output of a vector functional unit cannot be used as the functional unit to another
direct input of another
 The entire vector register needs to be ready before any
element of it can be used as part of another operation V V V V V
LV v1 1 2 3 4 5
 One memory port (one address generator) MULV v3,v1,v2
 16 memory banks (word-interleaved) ADDV v5, v3, v4

Chain Chain

Load
Unit
Mult. Add

Memory
 285 cycles
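The 285-cycle figure follows directly from the latencies listed for the vectorized loop, since without chaining each vector instruction must complete before its dependent successor starts: 1 + 1 (the two MOVIs) + 60 + 60 (the two VLDs, 11 + 49 each) + 53 (VADD) + 50 (VSHFR) + 60 (VST) = 285 cycles.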
809 Slide credit: Krste Asanovic 810

Vector Code Performance - Chaining Vector Code Performance – Multiple Memory Ports
 Vector chaining: Data forwarding from one vector  Chaining and 2 load ports, 1 store port in each bank
functional unit to another
1 1 11 49 11 49

Strict assumption:
Each memory bank
4 49 has a single port
(memory bandwidth
bottleneck)
These two VLDs cannot be 1 49
pipelined. WHY?

11 49

 79 cycles
VLD and VST cannot be
 182 cycles pipelined. WHY?  19X perf. improvement!
811 812

Questions (I) Gather/Scatter Operations


 What if # data elements > # elements in a vector register?
 Idea: Break loops so that each iteration operates on # Want to vectorize loops with indirect accesses:
elements in a vector register
for (i=0; i<N; i++)
 E.g., 527 data elements, 64-element VREGs
A[i] = B[i] + C[D[i]]
 8 iterations where VLEN = 64
 1 iteration where VLEN = 15 (need to change value of VLEN)
Indexed load instruction (Gather)
 Called vector stripmining LV vD, rD # Load indices in D vector
LVI vC, rC, vD # Load indirect from rC base
 What if vector data is not stored in a strided fashion in LV vB, rB # Load B vector
memory? (irregular memory access to a vector) ADDV.D vA,vB,vC # Do add
 Idea: Use indirection to combine/pack elements into vector SV vA, rA # Store result
registers
 Called scatter/gather operations
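In C terms, gather and scatter are just indirection through an index vector; a minimal sketch (function names are illustrative):

#include <stddef.h>

/* Gather (indexed load, as in LVI):  vC[i] = base[index[i]]  */
void gather(double *vC, const double *base, const long *index, size_t vlen)
{
    for (size_t i = 0; i < vlen; i++)
        vC[i] = base[index[i]];
}

/* Scatter (indexed store):  base[index[i]] = vA[i]  */
void scatter(double *base, const long *index, const double *vA, size_t vlen)
{
    for (size_t i = 0; i < vlen; i++)
        base[index[i]] = vA[i];
}

With these, the A[i] = B[i] + C[D[i]] loop above becomes: load the index vector D, gather C through it, vector-add with B, and store the result to A.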

813 814

Gather/Scatter Operations Conditional Operations in a Loop


 Gather/scatter operations often implemented in hardware  What if some operations should not be executed on a vector
to handle sparse matrices (based on a dynamically-determined condition)?
 Vector loads and stores use an index vector which is added loop: if (a[i] != 0) then b[i]=a[i]*b[i]
to the base register to generate the addresses goto loop

Index Vector Data Vector (to Store) Stored Vector (in Memory)
 Idea: Masked operations
0 3.14 Base+0 3.14  VMASK register is a bit mask determining which data element
2 6.5 Base+1 X should not be acted upon
6 71.2 Base+2 6.5
VLD V0 = A
7 2.71 Base+3 X
Base+4 X VLD V1 = B
Base+5 X VMASK = (V0 != 0)
Base+6 71.2 VMUL V1 = V0 * V1
Base+7 2.71
VST B = V1
 Does this look familiar? This is essentially predicated execution.
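The element-wise semantics of the masked multiply can be written out in plain C (vmask below stands in for one bit of the VMASK register; only elements whose mask bit is set are written back):

#include <stddef.h>

/* VMASK = (V0 != 0); VMUL V1 = V0 * V1 under the mask. */
void masked_multiply(double *v1, const double *v0, size_t vlen)
{
    for (size_t i = 0; i < vlen; i++) {
        int vmask = (v0[i] != 0.0);      /* vector test sets the mask bit  */
        if (vmask)
            v1[i] = v0[i] * v1[i];       /* masked-off elements unchanged  */
    }
}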
815 816

Another Example with Masking Masked Vector Instructions


for (i = 0; i < 64; ++i) Simple Implementation Density-Time Implementation
if (a[i] >= b[i]) Steps to execute the loop in SIMD code – execute all N operations, turn off – scan mask vector and only execute
result writeback according to mask elements with non-zero masks
c[i] = a[i]
1. Compare A, B to get
else M[7]=1 A[7] B[7] M[7]=1
VMASK
c[i] = b[i] M[6]=0 A[6] B[6] M[6]=0 A[7] B[7]
M[5]=1 A[5] B[5] M[5]=1
2. Masked store of A into C
M[4]=1 A[4] B[4] M[4]=1
M[3]=0 A[3] B[3] M[3]=0 C[5]
A B VMASK 3. Complement VMASK
1 2 0 M[2]=0 C[4]
2 2 1 4. Masked store of B into C M[1]=1
M[2]=0 C[2]
3 2 1 M[0]=0
4 10 0 M[1]=1 C[1] C[1]
-5 -4 0 Write data port
0 -3 1
6 5 1 M[0]=0 C[0]
-7 -8 1
Which one is better?
Write Enable Write data port
Tradeoffs?
817 Slide credit: Krste Asanovic 818

Some Issues
 Stride and banking
 As long as they are relatively prime to each other and there
are enough banks to cover bank access latency, we can
sustain 1 element/cycle throughput

 Storage of a matrix
 Row major: Consecutive elements in a row are laid out
consecutively in memory
 Column major: Consecutive elements in a column are laid out
consecutively in memory
 You need to change the stride when accessing a row versus
column
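A small C illustration of why the stride changes (row-major storage; N is the number of columns): element (i, j) lives at offset i*N + j, so walking a row advances the address by 1 while walking a column advances it by N.

#define N 8                               /* columns of a row-major matrix */

double row_sum(const double *m, int r)    /* stride 1 through memory       */
{
    double s = 0.0;
    for (int j = 0; j < N; j++)
        s += m[r * N + j];
    return s;
}

double col_sum(const double *m, int c)    /* stride N through memory       */
{
    double s = 0.0;
    for (int i = 0; i < N; i++)
        s += m[i * N + c];
    return s;
}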

819 820

Minimizing Bank Conflicts Array vs. Vector Processors, Revisited


 More banks  Array vs. vector processor distinction is a “purist’s”
distinction
 Better data layout to match the access pattern
 Is this always possible?  Most “modern” SIMD processors are a combination of both
 They exploit data parallelism in both time and space
 Better mapping of address to bank  GPUs are a prime example we will cover in a bit more detail
 E.g., randomized mapping
 Rau, “Pseudo-randomly interleaved memory,” ISCA 1991.

821 822

Remember: Array vs. Vector Processors Vector Instruction Execution


VADD A,B  C
ARRAY PROCESSOR VECTOR PROCESSOR

Execution using Execution using


one pipelined four pipelined
functional unit functional units
Instruction Stream Same op @ same time
Different ops @ time
LD VR  A[3:0] LD0 LD1 LD2 LD3 LD0
A[6] B[6] A[24] B[24] A[25] B[25] A[26] B[26] A[27] B[27]
ADD VR  VR, 1 AD0 AD1 AD2 AD3 LD1 AD0 A[5] B[5] A[20] B[20] A[21] B[21] A[22] B[22] A[23] B[23]
MUL VR  VR, 2
ST A[3:0]  VR MU0 MU1 MU2 MU3 LD2 AD1 MU0 A[4] B[4] A[16] B[16] A[17] B[17] A[18] B[18] A[19] B[19]

ST0 ST1 ST2 ST3 LD3 AD2 MU1 ST0 A[3] B[3] A[12] B[12] A[13] B[13] A[14] B[14] A[15] B[15]

Different ops @ same space AD3 MU2 ST1


MU3 ST2 C[2] C[8] C[9] C[10] C[11]
Time Same op @ space ST3 C[1] C[4] C[5] C[6] C[7]

Space Space
C[0] C[0] C[1] C[2] C[3]

823 Slide credit: Krste Asanovic 824



Vector Unit Structure Vector Instruction Level Parallelism


Functional Unit
Can overlap execution of multiple vector instructions
 Example machine has 32 elements per vector register and 8 lanes
 Completes 24 operations/cycle while issuing 1 vector instruction/cycle

Partitioned Load Unit Multiply Unit Add Unit


Vector load
Registers mul
Elements 0, Elements 1, Elements 2, Elements 3,
4, 8, … 5, 9, … 6, 10, … 7, 11, … add
time
load
mul
add

Lane
Instruction
issue
Memory Subsystem

Slide credit: Krste Asanovic 825 Slide credit: Krste Asanovic 826

Automatic Code Vectorization Vector/SIMD Processing Summary


for (i=0; i < N; i++)
C[i] = A[i] + B[i];  Vector/SIMD machines are good at exploiting regular data-
Scalar Sequential Code Vectorized Code level parallelism
load load  Same operation performed on many data elements
load
 Improve performance, simplify design (no intra-vector
Iter. 1 load load load dependencies)
Time

add add add


 Performance improvement limited by vectorizability of code
store store store  Scalar operations limit vector machine performance
 Remember Amdahl’s Law
load
Iter. Iter.  CRAY-1 was the fastest SCALAR machine at its time!
Iter. 2 load 1 2 Vector Instruction

add
Vectorization is a compile-time reordering of  Many existing ISAs include (vector-like) SIMD operations
operation sequencing
 requires extensive loop dependence analysis
 Intel MMX/SSEn/AVX, PowerPC AltiVec, ARM Advanced SIMD
store
Slide credit: Krste Asanovic 827 828

Intel Pentium MMX Operations


 Idea: One instruction operates on multiple data elements
simultaneously
SIMD Operations in Modern ISAs  Ala array processing (yet much more limited)
 Designed with multimedia (graphics) operations in mind
No VLEN register
Opcode determines data type:
8 8-bit bytes
4 16-bit words
2 32-bit doublewords
1 64-bit quadword

Stride always equal to 1.

Peleg and Weiser, “MMX Technology


Extension to the Intel Architecture,”
IEEE Micro, 1996.

830

MMX Example: Image Overlaying (I) MMX Example: Image Overlaying (II)
 Goal: Overlay the human in image 1 on top of the background in image 2

831 832

Agenda for Today & Next Few Lectures


 Single-cycle Microarchitectures
18-447
Multi-cycle and Microprogrammed Microarchitectures
Computer Architecture 

Lecture 15: GPUs, VLIW, DAE  Pipelining

 Issues in Pipelining: Control & Data Dependence Handling,


State Maintenance and Recovery, …

 Out-of-Order Execution
Prof. Onur Mutlu
Carnegie Mellon University  Issues in OoO Execution: Load-Store Handling, …
Spring 2015, 2/20/2015
 Alternative Approaches to Instruction Level Parallelism
834

Approaches to (Instruction-Level) Concurrency Readings for Today


 Pipelining  Lindholm et al., "NVIDIA Tesla: A Unified Graphics and
 Out-of-order execution Computing Architecture," IEEE Micro 2008.
 Dataflow (at the ISA level)
 SIMD Processing (Vector and array processors, GPUs)  Fatahalian and Houston, “A Closer Look at GPUs,” CACM
 VLIW 2008.
 Decoupled Access Execute
 Systolic Arrays

835 836

Recap of Last Lecture Review: Code Parallelization/Vectorization


for (i=0; i < N; i++)
 SIMD Processing C[i] = A[i] + B[i];
 Flynn’s taxonomy: SISD, SIMD, MISD, MIMD Scalar Sequential Code Vectorized Code
 VLIW vs. SIMD
load load load
 Array vs. Vector Processors
 Vector Processors in Depth Iter. 1 load load load

Time
 Vector Registers, Stride, Masks, Length
 Memory Banking add add add
 Vectorizable Code
 Scalar vs. Vector Code Execution store store store
 Vector Chaining
Vector Stripmining load
 Iter. Iter.
 Gather/Scatter Operations Iter. 2 load 1 2 Vector Instruction

 Minimizing Bank Conflicts


 Automatic Code Vectorization add
Vectorization is a compile-time reordering of
 SIMD Operations in Modern ISAs: Example from MMX operation sequencing
 requires extensive loop dependence analysis
store
837 Slide credit: Krste Asanovic 838

Recap: Vector/SIMD Processing Summary GPUs are SIMD Engines Underneath


 Vector/SIMD machines are good at exploiting regular data-  The instruction pipeline operates like a SIMD pipeline (e.g.,
level parallelism an array processor)
 Same operation performed on many data elements
 Improve performance, simplify design (no intra-vector  However, the programming is done using threads, NOT
dependencies) SIMD instructions

 Performance improvement limited by vectorizability of code  To understand this, let’s go back to our parallelizable code
 Scalar operations limit vector machine performance example
 Remember Amdahl’s Law
 CRAY-1 was the fastest SCALAR machine at its time!
 But, before that, let’s distinguish between
 Programming Model (Software)
 Many existing ISAs include SIMD operations vs.
 Intel MMX/SSEn/AVX, PowerPC AltiVec, ARM Advanced SIMD
 Execution Model (Hardware)
839 840

Programming Model vs. Hardware Execution Model How Can You Exploit Parallelism Here?
for (i=0; i < N; i++)
 Programming Model refers to how the programmer expresses C[i] = A[i] + B[i];
the code Scalar Sequential Code
 E.g., Sequential (von Neumann), Data Parallel (SIMD), Dataflow,
load
Multi-threaded (MIMD, SPMD), …
load
Let’s examine three programming
Iter. 1
options to exploit instruction-level
 Execution Model refers to how the hardware executes the add parallelism present in this sequential
code underneath code:
 E.g., Out-of-order execution, Vector processor, Array processor, store
Dataflow processor, Multiprocessor, Multithreaded processor, …
load 1. Sequential (SISD)
load
 Execution Model can be very different from the Programming Iter. 2
Model 2. Data-Parallel (SIMD)
add
 E.g., von Neumann model implemented by an OoO processor
 E.g., SPMD model implemented by a SIMD processor (a GPU) store
3. Multithreaded (MIMD/SPMD)
841 842

for (i=0; i < N; i++)


 Prog. Model 1: Sequential (SISD) C[i] = A[i] + B[i]; Prog. Model 2: Data Parallel (SIMD) for (i=0; i < N; i++)
C[i] = A[i] + B[i];

Vector Instruction Vectorized Code


Scalar Sequential Code  Can be executed on a: Scalar Sequential Code

load load VLD A  V1


load
load
 Pipelined processor Iter. 1 load
load load
Iter. 1 load VLD B  V2
 Out-of-order execution processor
Independent instructions executed add
add add VADD V1 + V2  V3
add 

when ready
store store VST V3  C
store  Different iterations are present in the
instruction window and can execute in load
load Iter. Iter.
parallel in multiple functional units Realization: Each iteration is independent
load Iter. 2 1 load 2
Iter. 2  In other words, the loop is dynamically
unrolled by the hardware Idea: Programmer or compiler generates a SIMD
add
add instruction to execute the same instruction from
 Superscalar or VLIW processor all iterations across different data
store  Can fetch and execute multiple store
instructions per cycle Best executed by a SIMD processor (vector, array)
843 844

for (i=0; i < N; i++) for (i=0; i < N; i++)


Prog. Model 3: Multithreaded C[i] = A[i] + B[i]; Prog. Model 3: Multithreaded C[i] = A[i] + B[i];

Scalar Sequential Code

load
load load load load

Iter. 1 load
load load load load

add
add add add add

store store store store

load
Iter. Iter. Iter. Iter.
1 load 2 Realization: Each iteration is independent 1 2 Realization: Each iteration is independent
Iter. 2
Idea: Programmer or compiler generates a thread
to execute each iteration. Each thread does the
same thing (but on different data)
Can be executed on a MIMD machine

This particular model is also called:
SPMD: Single Program Multiple Data
Can be executed on a MIMD machine
Can be executed on a SIMD machine
Can be executed on a SIMT machine (Single Instruction Multiple Thread)
machine
845 Single Instruction Multiple Thread 846

for (i=0; i < N; i++)


A GPU is a SIMD (SIMT) Machine SPMD on SIMT Machine C[i] = A[i] + B[i];

 Except it is not programmed using SIMD instructions


load load Warp 0 at PC X

 It is programmed using threads (SPMD programming model) load load Warp 0 at PC X+1

 Each thread executes the same code but operates a different


add add Warp 0 at PC X+2
piece of data
 Each thread has its own context (i.e., can be store store Warp 0 at PC X+3
treated/restarted/executed independently)
Iter. Iter.
1 2 Warp: A set of threads that execute
 A set of threads executing the same instruction are Realization: Each iteration is independent
the same instruction (i.e., at the same PC)
dynamically grouped into a warp (wavefront) by the
hardware Idea:This
Programmer
particularormodel
compiler generates
is also a thread
called:
to execute each iteration. Each thread does the
 A warp is essentially a SIMD operation formed by hardware! SPMD:
same thing Single
(but on Program Multiple Data
different data)
A GPU
CanCan executes
be onitausing
executed
be executed the
on a SIMD
MIMD SIMT model:
machine
machine
847 Single Instruction Multiple Thread 848

SIMD vs. SIMT Execution Model


 SIMD: A single sequential instruction stream of SIMD
instructions  each instruction specifies multiple data inputs
Graphics Processing Units  [VLD, VLD, VADD, VST], VLEN

SIMD not Exposed to Programmer (SIMT)  SIMT: Multiple instruction streams of scalar instructions 
threads grouped dynamically into warps
 [LD, LD, ADD, ST], NumThreads

 Two Major SIMT Advantages:


 Can treat each thread separately  i.e., can execute each thread
independently (on any type of scalar pipeline)  MIMD processing
 Can group threads into warps flexibly  i.e., can group threads
that are supposed to truly execute the same instruction 
dynamically obtain and maximize benefits of SIMD processing
850

for (i=0; i < N; i++)


Multithreading of Warps C[i] = A[i] + B[i]; Warps and Warp-Level FGMT
 Assume a warp consists of 32 threads  Warp: A set of threads that execute the same instruction
 If you have 32K iterations  1K warps (on different data elements)  SIMT (Nvidia-speak)
 Warps can be interleaved on the same pipeline  Fine grained  All threads run the same code
multithreading of warps  Warp: The threads that run lengthwise in a woven fabric …

load load 0 at PC X
Warp 1
Thread Warp 3
Thread Warp 8
load load Thread Warp Common PC
Scalar Scalar Scalar Scalar Thread Warp 7
add add Warp 20 at PC X+2
ThreadThread Thread Thread
W X Y Z
store store SIMD Pipeline

Iter.
Iter. Iter.
Iter.
1
33
20*32 + 1 2
34
20*32 +2
851 852

High-Level View of a GPU Latency Hiding via Warp-Level FGMT


 Warp: A set of threads that
execute the same instruction
Warps available
(on different data elements) Thread Warp 3
Thread Warp 8 for scheduling

Thread Warp 7
SIMD Pipeline
 Fine-grained multithreading
I-Fetch
 One instruction per thread in
Decode
pipeline at a time (No
interlocking)

RF
RF

RF
 Interleave warp execution to Warps accessing

ALU

ALU

ALU
memory hierarchy
hide latencies Miss?
 Register values of all threads stay D-Cache Thread Warp 1
in register file All Hit? Data Thread Warp 2

 FGMT enables long latency Thread Warp 6


Writeback
tolerance
 Millions of pixels
853 Slide credit: Tor Aamodt 854

Warp Execution (Recall the Slide) SIMD Execution Unit Structure


32-thread warp executing ADD A[tid],B[tid]  C[tid] Functional Unit

Execution using Execution using


one pipelined four pipelined
functional unit functional units
Registers
for each
Thread Registers for Registers for Registers for Registers for
A[6] B[6] A[24] B[24] A[25] B[25] A[26] B[26] A[27] B[27] thread IDs thread IDs thread IDs thread IDs
A[5] B[5] A[20] B[20] A[21] B[21] A[22] B[22] A[23] B[23] 0, 4, 8, … 1, 5, 9, … 2, 6, 10, … 3, 7, 11, …
A[4] B[4] A[16] B[16] A[17] B[17] A[18] B[18] A[19] B[19]
A[3] B[3] A[12] B[12] A[13] B[13] A[14] B[14] A[15] B[15]

C[2] C[8] C[9] C[10] C[11]

C[1] C[4] C[5] C[6] C[7]


Lane

C[0] C[0] C[1] C[2] C[3] Memory Subsystem

Slide credit: Krste Asanovic 855 Slide credit: Krste Asanovic 856

Warp Instruction Level Parallelism SIMT Memory Access


Can overlap execution of multiple instructions  Same instruction in different threads uses thread id to
Example machine has 32 threads per warp and 8 lanes

index and access different data elements
 Completes 24 operations/cycle while issuing 1 warp/cycle

Load Unit Multiply Unit Add Unit Let’s assume N=16, 4 threads per warp  4 warps
W0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Threads
W1
+ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
W2 Data elements
time
W3
W4
W5
+ + + +

Warp 0 Warp 1 Warp 2 Warp 3

Warp issue

Slide credit: Krste Asanovic 857 Slide credit: Hyesoon Kim

Sample GPU SIMT Code (Simplified) Sample GPU Program (Less Simplified)

CPU code
for (ii = 0; ii < 100000; ++ii) {
C[ii] = A[ii] + B[ii];
}

CUDA code
// there are 100000 threads
__global__ void KernelFunction(…) {
int tid = blockDim.x * blockIdx.x + threadIdx.x;
int varA = aa[tid];
int varB = bb[tid];
C[tid] = varA + varB;
}
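For completeness, the host side of such a kernel is usually launched along these lines (a hedged sketch: the block size, grid-size calculation, and argument list are illustrative, and allocation/copy/error handling are omitted):

int threadsPerBlock = 256;
int numBlocks = (100000 + threadsPerBlock - 1) / threadsPerBlock;   // enough blocks to cover all threads
KernelFunction<<<numBlocks, threadsPerBlock>>>( /* device pointers for aa, bb, C */ );
cudaDeviceSynchronize();   // wait until every warp of every block has finished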

Slide credit: Hyesoon Kim Slide credit: Hyesoon Kim 860



Warp-based SIMD vs. Traditional SIMD SPMD


 Traditional SIMD contains a single thread  Single procedure/program, multiple data
Lock step: a vector instruction needs to finish before another can start

 This is a programming model rather than computer organization
 Programming model is SIMD (no extra threads)  SW needs to know
vector length
 ISA contains vector/SIMD instructions  Each processing element executes the same procedure, except on
different data elements
 Warp-based SIMD consists of multiple scalar threads executing in  Procedures can synchronize at certain points in program, e.g. barriers
a SIMD manner (i.e., same instruction executed by all threads)
 Does not have to be lock step  Essentially, multiple instruction streams execute the same
 Each thread can be treated individually (i.e., placed in a different program
warp)  programming model not SIMD
 Each program/procedure 1) works on different data, 2) can execute a
 SW does not need to know vector length different control-flow path, at run-time
 Enables multithreading and flexible dynamic grouping of threads  Many scientific applications are programmed this way and run on MIMD
 ISA is scalar  vector instructions can be formed dynamically hardware (multiprocessors)
 Essentially, it is SPMD programming model implemented on SIMD  Modern GPUs programmed in a similar way on a SIMD hardware
hardware
861 862

SIMD vs. SIMT Execution Model Threads Can Take Different Paths in Warp-based SIMD

 SIMD: A single sequential instruction stream of SIMD  Each thread can have conditional control flow instructions
instructions  each instruction specifies multiple data inputs  Threads can execute different control flow paths
 [VLD, VLD, VADD, VST], VLEN

 SIMT: Multiple instruction streams of scalar instructions 


A
threads grouped dynamically into warps
 [LD, LD, ADD, ST], NumThreads B
Thread Warp Common PC

Thread Thread Thread Thread


C D F 1 2 3 4
 Two Major SIMT Advantages:
 Can treat each thread separately  i.e., can execute each thread E
independently on any type of scalar pipeline  MIMD processing
G
 Can group threads into warps flexibly  i.e., can group threads
that are supposed to truly execute the same instruction 
dynamically obtain and maximize benefits of SIMD processing
863 Slide credit: Tor Aamodt 864

Control Flow Problem in GPUs/SIMT Branch Divergence Handling (I)


 A GPU uses a SIMD  Idea: Dynamic predicated (conditional) execution
pipeline to save area
on control logic. A/1111
A Reconv. PC
Stack
Next PC Active Mask
 Groups scalar threads TOS - G
A
B
E 1111
Branch TOS E D 0110
into warps B/1111
B TOS E C
E 1001

Path A C/1001
C D/0110
D F
Thread Warp Common PC
 Branch divergence
E/1111
E Thread Thread Thread Thread
occurs when threads Path B
1 2 3 4
inside warps branch to G/1111
G

different execution A B C D E G A
paths.
This is the same as conditional execution.
Recall the Vector Mask and Masked Vector Operations?
Time
Slide credit: Tor Aamodt 865 Slide credit: Tor Aamodt 866

Branch Divergence Handling (II) Remember: Each Thread Is Independent


A;  Two Major SIMT Advantages:
if (some condition) {
B; One per warp  Can treat each thread separately  i.e., can execute each thread
} else { independently on any type of scalar pipeline  MIMD processing
C; Control Flow Stack  Can group threads into warps flexibly  i.e., can group threads
}
Next PC Recv PC Active Mask that are supposed to truly execute the same instruction 
D;
TOS D
A -- 1111 dynamically obtain and maximize benefits of SIMD processing
B D 1110
A C
D D 0001

Execution Sequence  If we have many threads


B C A C B D  We can find individual threads that are at the same PC
1 0 1 1
1 0 1 1  And, group them together into a single warp dynamically
1 0 1 1  This reduces “divergence”  improves SIMD utilization
1 1 0 1
D Time  SIMD utilization: fraction of SIMD lanes executing a useful
operation (i.e., executing an active thread)
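A concrete way to compute that fraction from per-cycle active masks (a C sketch; the 4-wide warp and the masks, loosely modeled on the divergent if/else example above, are illustrative):

#include <stdio.h>

#define WARP_WIDTH 4

/* Utilization = active lane-slots / total lane-slots issued. */
double simd_utilization(unsigned char masks[][WARP_WIDTH], int cycles)
{
    int active = 0;
    for (int c = 0; c < cycles; c++)
        for (int l = 0; l < WARP_WIDTH; l++)
            if (masks[c][l]) active++;
    return (double)active / (cycles * WARP_WIDTH);
}

int main(void)
{
    unsigned char masks[][WARP_WIDTH] = {
        {1,1,1,1},    /* A: all lanes active          */
        {1,1,1,0},    /* B: taken side of the branch  */
        {0,0,0,1},    /* C: not-taken side            */
        {1,1,1,1},    /* D: reconvergence point       */
    };
    printf("SIMD utilization = %.2f\n", simd_utilization(masks, 4));  /* 0.75 */
    return 0;
}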
Slide credit: Tor Aamodt 867 868

Dynamic Warp Formation/Merging Dynamic Warp Formation/Merging


 Idea: Dynamically merge threads executing the same  Idea: Dynamically merge threads executing the same
instruction (after branch divergence) instruction (after branch divergence)
 Form new warps from warps that are waiting
 Enough threads branching to each path enables the creation
of full new warps
Branch

Warp X Warp Z Path A


Warp Y
Path B

 Fung et al., “Dynamic Warp Formation and Scheduling for


Efficient GPU Control Flow,” MICRO 2007.
869 870

Dynamic Warp Formation Example Hardware Constraints Limit Flexibility of Warp Grouping
x/1111
Functional Unit
A y/1111
Legend
x/1110 A A
B y/0011 Execution of Warp x Execution of Warp y
at Basic Block A at Basic Block A
x/1000 x/0110 x/0001
C y/0010 D y/0001 F y/1100
D Registers
x/1110
A new warp created from scalar for each
E y/0011
threads of both Warp x and y
Thread Registers for Registers for Registers for Registers for
executing at Basic Block D
thread IDs thread IDs thread IDs thread IDs
x/1111 0, 4, 8, … 1, 5, 9, … 2, 6, 10, … 3, 7, 11, …
G y/1111
A A B B C C D D E E F F G G A A

Baseline
Time
Can you move any thread
Dynamic
Warp
A A B B C D E E F G G A A
Lane
flexibly to any lane?
Formation
Time
Memory Subsystem

Slide credit: Tor Aamodt 871 Slide credit: Krste Asanovic 872

When You Group Threads Dynamically … What About Memory Divergence?


 What happens to memory accesses?  Modern GPUs have caches
 To minimize accesses to main memory (save bandwidth)
 Simple, strided (predictable) memory access patterns within  Ideally: Want all threads in the warp to hit (without
a warp can become complex, randomized (unpredictable) conflicting with each other)
with dynamic regrouping of threads  Problem: Some threads in the warp may hit others may miss
 Can reduce locality in memory  Problem: One thread in a warp can stall the entire warp if it
 Can lead to inefficient bandwidth utilization misses in the cache.

 Need techniques to
 Tolerate memory divergence
 Integrate solutions to branch and memory divergence

873 874

NVIDIA GeForce GTX 285


 NVIDIA-speak:
 240 stream processors

An Example GPU  “SIMT execution”

 Generic speak:
 30 cores

 8 SIMD functional units per core

876
Slide credit: Kayvon Fatahalian

NVIDIA GeForce GTX 285 “core” NVIDIA GeForce GTX 285 “core”

64 KB of storage 64 KB of storage
… for thread contexts
(registers)
… for thread contexts
(registers)

 Groups of 32 threads share instruction stream (each group is


= SIMD functional unit, control = instruction stream decode
shared across 8 units a Warp)
= multiply-add = execution context storage
 Up to 32 warps are simultaneously interleaved
= multiply  Up to 1024 thread contexts can be stored
877 878
Slide credit: Kayvon Fatahalian Slide credit: Kayvon Fatahalian

NVIDIA GeForce GTX 285 GPU Readings


 Required
Tex Tex  Lindholm et al., "NVIDIA Tesla: A Unified Graphics and
… … … … … … Computing Architecture," IEEE Micro 2008.
 Fatahalian and Houston, “A Closer Look at GPUs,” CACM 2008.
Tex Tex
… … … … … …

 Recommended
Tex Tex
… … … … … …  Narasiman et al., “Improving GPU Performance via Large
Warps and Two-Level Warp Scheduling,” MICRO 2011.

Tex Tex  Fung et al., “Dynamic Warp Formation and Scheduling for
… … … … …
Efficient GPU Control Flow,” MICRO 2007.
Tex Tex
 Jog et al., “Orchestrated Scheduling and Prefetching for
… … … … … … GPGPUs,” ISCA 2013.

30 cores on the GTX 285: 30,720 threads


879 880
Slide credit: Kayvon Fatahalian

Remember: SIMD/MIMD Classification of Computers

 Mike Flynn, “Very High Speed Computing Systems,” Proc.


of the IEEE, 1966
VLIW and DAE
 SISD: Single instruction operates on single data element
 SIMD: Single instruction operates on multiple data elements
 Array processor
 Vector processor
 MISD? Multiple instructions operate on single data element
 Closest form: systolic array processor?
 MIMD: Multiple instructions operate on multiple data
elements (multiple instruction streams)
 Multiprocessor
 Multithreaded processor
882

SISD Parallelism Extraction Techniques


 We have already seen
 Superscalar execution
 Out-of-order execution VLIW
 Are there simpler ways of extracting SISD parallelism?
 VLIW (Very Long Instruction Word)
 Decoupled Access/Execute

883

VLIW (Very Long Instruction Word) VLIW Concept


 A very long instruction word consists of multiple
independent instructions packed together by the compiler
 Packed instructions can be logically unrelated (contrast with
SIMD)

 Idea: Compiler finds independent instructions and statically


schedules (i.e. packs/bundles) them into a single VLIW
instruction

 Traditional Characteristics
 Multiple functional units
 Each instruction in a bundle executed in lock step  Fisher, “Very Long Instruction Word architectures and the
 Instructions in a bundle statically aligned to be directly fed ELI-512,” ISCA 1983.
into the functional units  ELI: Enormously longword instructions (512 bits)
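One way to picture what the compiler emits: each VLIW instruction is a fixed set of slots, one per functional unit, and any slot the compiler cannot fill with an independent operation becomes a NOP. A hedged C sketch of such a bundle (the slot count and encoding are invented for illustration):

enum opcode { NOP, ADD, MUL, LOAD, STORE, BRANCH };

struct operation {
    enum opcode op;
    int dst, src1, src2;              /* register numbers (illustrative)        */
};

#define SLOTS 4                        /* e.g., 2 integer + 1 memory + 1 branch  */

struct vliw_bundle {
    struct operation slot[SLOTS];      /* all SLOTS operations issue together,
                                          in lock step, one per functional unit */
};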
885 886

SIMD Array Processing vs. VLIW VLIW Philosophy


 Array processor  Philosophy similar to RISC (simple instructions and hardware)
 Except multiple instructions in parallel

 RISC (John Cocke, 1970s, IBM 801 minicomputer)


 Compiler does the hard work to translate high-level language
code to simple instructions (John Cocke: control signals)
 And, to reorder simple instructions for high performance
 Hardware does little translation/decoding  very simple

 VLIW (Fisher, ISCA 1983)


 Compiler does the hard work to find instruction level parallelism
 Hardware stays as simple and streamlined as possible
 Executes each instruction in a bundle in lock step
 Simple  higher frequency, easier to design
887 888

VLIW Philosophy and Properties Commercial VLIW Machines


 Multiflow TRACE, Josh Fisher (7-wide, 28-wide)
 Cydrome Cydra 5, Bob Rau
 Transmeta Crusoe: x86 binary-translated into internal VLIW
 TI C6000, Trimedia, STMicro (DSP & embedded processors)
 Most successful commercially

 Intel IA-64
 Not fully VLIW, but based on VLIW principles
 EPIC (Explicitly Parallel Instruction Computing)
 Instruction bundles can have dependent instructions
 A few bits in the instruction format specify explicitly which
instructions in the bundle are dependent on which other ones

Fisher, “Very Long Instruction Word architectures and the ELI-512,” ISCA 1983. 889 890

VLIW Tradeoffs VLIW Summary


 Advantages  VLIW simplifies hardware, but requires complex compiler
+ No need for dynamic scheduling hardware  simple hardware techniques
+ No need for dependency checking within a VLIW instruction   Solely-compiler approach of VLIW has several downsides
simple hardware for multiple instruction issue + no renaming that reduce performance
+ No need for instruction alignment/distribution after fetch to -- Too many NOPs (not enough parallelism discovered)
different functional units  simple hardware
-- Static schedule intimately tied to microarchitecture
-- Code optimized for one generation performs poorly for next
 Disadvantages
-- No tolerance for variable or long-latency operations (lock step)
-- Compiler needs to find N independent operations per cycle
-- If it cannot, inserts NOPs in a VLIW instruction
-- Parallelism loss AND code size increase ++ Most compiler optimizations developed for VLIW employed
-- Recompilation required when execution width (N), instruction in optimizing compilers (for superscalar compilation)
latencies, functional units change (Unlike superscalar processing)  Enable code optimizations
-- Lockstep execution causes independent operations to stall ++ VLIW successful when parallelism is easier to find by the
-- No instruction can progress until the longest-latency instruction completes compiler (traditionally embedded markets, DSPs)
891 892

Decoupled Access/Execute (DAE)


 Motivation: Tomasulo’s algorithm too complex to
implement
Decoupled Access/Execute (DAE)  1980s before Pentium Pro

 Idea: Decouple operand


access and execution via
two separate instruction
streams that communicate
via ISA-visible queues.

 Smith, “Decoupled Access/Execute


Computer Architectures,” ISCA 1982,
ACM TOCS 1984.

894

Decoupled Access/Execute (II) Decoupled Access/Execute (III)


 Compiler generates two instruction streams (A and E)  Advantages:
 Synchronizes the two upon control flow instructions (using branch queues) + Execute stream can run ahead of the access stream and vice
versa
+ If A takes a cache miss, E can perform useful work
+ If A hits in cache, it supplies data to lagging E
+ Queues reduce the number of required registers
+ Limited out-of-order execution without wakeup/select complexity

 Disadvantages:
-- Compiler support to partition the program and manage queues
-- Determines the amount of decoupling
-- Branch instructions require synchronization between A and E
-- Multiple instruction streams (can be done with a single one,
though)
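A software analogue of the decoupling, with the access stream running ahead and feeding a FIFO that the execute stream drains (only a sketch under simplifying assumptions; in the real design these are two architectural instruction streams and hardware queues, not two loops in one thread):

#include <stdio.h>

#define QSIZE 8
static double q[QSIZE];
static int head = 0, tail = 0;
static int    q_empty(void)    { return head == tail; }
static int    q_full(void)     { return (tail + 1) % QSIZE == head; }
static void   q_push(double v) { q[tail] = v; tail = (tail + 1) % QSIZE; }
static double q_pop(void)      { double v = q[head]; head = (head + 1) % QSIZE; return v; }

int main(void)
{
    double a[16], sum = 0.0;
    for (int i = 0; i < 16; i++) a[i] = i;

    int fetched = 0, consumed = 0;
    while (consumed < 16) {
        /* Access (A) stream: run ahead, issuing loads while the queue has room. */
        while (fetched < 16 && !q_full())
            q_push(a[fetched++]);
        /* Execute (E) stream: consume operands as they arrive from the queue.   */
        while (!q_empty())
            sum += 2.0 * q_pop();
        consumed = fetched;
    }
    printf("sum = %f\n", sum);
    return 0;
}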
895 896

Astronautics ZS-1 Astronautics ZS-1 Instruction Scheduling


 Single stream  Dynamic scheduling
steered into A and
X pipelines  A and X streams are issued/executed independently
 Each pipeline in-  Loads can bypass stores in the memory unit (if no conflict)
order
 Branches executed early in the pipeline
 To reduce synchronization penalty of A/X streams
 Smith et al., “The  Works only if the register a branch sources is available
ZS-1 central
processor,”
ASPLOS 1987.  Static scheduling
 Move compare instructions as early as possible before a branch
 Smith, “Dynamic
Instruction  So that branch source register is available when branch is decoded
Scheduling and  Reorder code to expose parallelism in each stream
the Astronautics
ZS-1,” IEEE  Loop unrolling:
Computer 1989.  Reduces branch count + exposes code reordering opportunities

897 898

Loop Unrolling

18-447
Computer Architecture
Lecture 16: Systolic Arrays & Static Scheduling

 Idea: Replicate loop body multiple times within an iteration


+ Reduces loop maintenance overhead
 Induction variable increment or loop condition test Prof. Onur Mutlu
+ Enlarges basic block (and analysis scope) Carnegie Mellon University
Enables code optimization and scheduling opportunities
Spring 2015, 2/23/2015

-- What if iteration count not a multiple of unroll factor? (need extra code to detect
this)
-- Increases code size
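As a concrete C illustration on the running example, unrolled by 4, including the clean-up loop needed when the trip count is not a multiple of the unroll factor (the factor of 4 is arbitrary):

void avg_unrolled(int *C, const int *A, const int *B, int n)
{
    int i = 0;
    /* One loop-condition test and one induction-variable update per 4 iterations. */
    for (; i + 3 < n; i += 4) {
        C[i]     = (A[i]     + B[i])     / 2;
        C[i + 1] = (A[i + 1] + B[i + 1]) / 2;
        C[i + 2] = (A[i + 2] + B[i + 2]) / 2;
        C[i + 3] = (A[i + 3] + B[i + 3]) / 2;
    }
    /* Clean-up loop: the extra code for the leftover 0-3 iterations. */
    for (; i < n; i++)
        C[i] = (A[i] + B[i]) / 2;
}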
899

Agenda for Today & Next Few Lectures Approaches to (Instruction-Level) Concurrency
 Single-cycle Microarchitectures  Pipelining
 Out-of-order execution
 Multi-cycle and Microprogrammed Microarchitectures  Dataflow (at the ISA level)
 Pipelining  SIMD Processing (Vector and array processors, GPUs)
 VLIW
 Issues in Pipelining: Control & Data Dependence Handling,  Decoupled Access Execute
State Maintenance and Recovery, …  Systolic Arrays

 Out-of-Order Execution
 Static Instruction Scheduling
 Issues in OoO Execution: Load-Store Handling, …

 Alternative Approaches to Instruction Level Parallelism


901 902

Isolating Programs from One Another Recap of Last Lecture


 Remember matlab vs. gcc?  GPUs
Programming Model vs. Execution Model Separation
We will get back to this again


 GPUs: SPMD programming on SIMD/SIMT hardware
 SIMT Advantages vs. Traditional SIMD
 In the meantime, if you are curious, take a look at:  Warps, Fine-grained Multithreading of Warps
SIMT Memory Access
 Subramanian et al., “MISE: Providing Performance 

Branch Divergence Problem in SIMT


Predictability and Improving Fairness in Shared Main Memory 

Dynamic Warp Formation/Merging


Systems,” HPCA 2013. 

 Moscibroda and Mutlu, “Memory Performance Attacks: Denial


 VLIW
of Memory Service in Multi-Core Systems,” USENIX Security
Philosophy: RISC and VLIW
2007.

 VLIW vs. SIMD vs. Superscalar


 Tradeoffs and Advantages

 DAE (Decoupled Access/Execute)


 Dynamic and Static Scheduling
903 904

Systolic Arrays: Motivation


 Goal: design an accelerator that has
 Simple, regular design (keep # unique parts small and regular)
Systolic Arrays  High concurrency  high performance
 Balanced computation and I/O (memory) bandwidth

 Idea: Replace a single processing element (PE) with a regular


array of PEs and carefully orchestrate flow of data between
the PEs
 such that they collectively transform a piece of input data before
outputting it to memory

 Benefit: Maximizes computation done on a single piece of


data element brought from memory

905 906

Systolic Arrays Why Systolic Architectures?


 Idea: Data flows from the computer memory in a rhythmic
fashion, passing through many processing elements before it
returns to memory

 Similar to an assembly line of processing elements


Memory: heart
PEs: cells  Different people work on the same car
 Many cars are assembled simultaneously
Memory pulses
data through  Can be two-dimensional
cells

 Why? Special purpose accelerators/architectures need


 Simple, regular design (keep # unique parts small and regular)
 High concurrency  high performance
 H. T. Kung, “Why Systolic Architectures?,” IEEE Computer 1982.  Balanced computation and I/O (memory) bandwidth
907 908

Systolic Architectures Systolic Computation Example


 Basic principle: Replace a single PE with a regular array of  Convolution
PEs and carefully orchestrate flow of data between the PEs  Used in filtering, pattern matching, correlation, polynomial
 Balance computation and memory bandwidth evaluation, etc …
 Many image processing tasks

 Differences from pipelining:


 These are individual PEs
 Array structure can be non-linear and multi-dimensional
 PE connections can be multidirectional (and different speed)
 PEs can have local memory and execute kernels (rather than a
piece of the instruction)
909 910

Systolic Computation Example: Convolution Systolic Computation Example: Convolution


 y1 = w1x1 + w2x2 + w3x3
 y2 = w1x2 + w2x3 + w3x4
 y3 = w1x3 + w2x4 + w3x5
 Worthwhile to implement adder and multiplier separately
to allow overlapping of add/mul executions
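Functionally, the y values above are a 3-tap convolution; a plain C reference for what the array computes (the systolic version produces the same outputs, but by streaming the x values past three weight-holding PEs):

/* y[i] = w1*x[i] + w2*x[i+1] + w3*x[i+2]; x must have num_outputs + 2 elements. */
void convolve3(double *y, const double *x, const double w[3], int num_outputs)
{
    for (int i = 0; i < num_outputs; i++)
        y[i] = w[0] * x[i] + w[1] * x[i + 1] + w[2] * x[i + 2];
}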
911 912

Systolic Computation Example: Convolution Systolic Arrays: Pros and Cons


 One needs to carefully orchestrate when data elements are  Advantage:
input to the array  Specialized (computation needs to fit PE organization/functions)
 And when output is buffered  improved efficiency, simple design, high concurrency/
performance
 This gets more involved when  good to do more with less memory bandwidth requirement
 Array dimensionality increases
 PEs are less predictable in terms of latency  Downside:
 Specialized
 not generally applicable because computation needs to fit
the PE functions/organization

913 914

More Programmability Pipeline Parallelism


 Each PE in a systolic array
 Can store multiple “weights”
 Weights can be selected on the fly
 Eases implementation of, e.g., adaptive filtering

 Taken further
 Each PE can have its own data and instruction memory
 Data memory  to store partial/temporary results, constants
 Leads to stream processing, pipeline parallelism
 More generally, staged execution

915 916

Stages of Pipelined Programs Pipelined File Compression Example


 Loop iterations are divided into code segments called stages
 Threads execute stages on different cores

A B C

loop {
Compute1 A

Compute2 B

Compute3 C
}

917 918

Systolic Array Example Systolic Array: The WARP Computer


 Advantages  HT Kung, CMU, 1984-1988
 Makes multiple uses of each data item  reduced need for
fetching/refetching  Linear array of 10 cells, each cell a 10 Mflop programmable
 High concurrency processor
 Regular design (both data and control flow)  Attached to a general purpose host machine
 HLL and optimizing compiler to program the systolic array
 Disadvantages
 Used extensively to accelerate vision and robotics tasks
 Not good at exploiting irregular parallelism
 Relatively special purpose  need software, programmer
support to be a general purpose model  Annaratone et al., “Warp Architecture and
Implementation,” ISCA 1986.
 Annaratone et al., “The Warp Computer: Architecture,
Implementation, and Performance,” IEEE TC 1987.

919 920

The WARP Computer The WARP Cell

921 922

Systolic Arrays vs. SIMD Agenda Status


 Food for thought…  Single-cycle Microarchitectures

 Multi-cycle and Microprogrammed Microarchitectures

 Pipelining

 Issues in Pipelining: Control & Data Dependence Handling,


State Maintenance and Recovery, …

 Out-of-Order Execution

 Issues in OoO Execution: Load-Store Handling, …

 Alternative Approaches to Instruction Level Parallelism


923 924

Approaches to (Instruction-Level) Concurrency Some More Recommended Readings


 Pipelining  Fisher, “Very Long Instruction Word architectures and the ELI-
 Out-of-order execution 512,” ISCA 1983.
 Smith, “Decoupled Access/Execute Compute Architectures,”
 Dataflow (at the ISA level)
ISCA 1982, ACM TOCS 1984.
 SIMD Processing (Vector and array processors, GPUs)  H. T. Kung, “Why Systolic Architectures?,” IEEE Computer 1982.
 VLIW
 Decoupled Access Execute  Huck et al., “Introducing the IA-64 Architecture,” IEEE Micro
 Systolic Arrays 2000.

 Static Instruction Scheduling  Rau and Fisher, “Instruction-level parallel processing: history,
overview, and perspective,” Journal of Supercomputing, 1993.
 Faraboschi et al., “Instruction Scheduling for Instruction Level
Parallel Processors,” Proc. IEEE, Nov. 2001.

925 926

Agenda
 Static Scheduling
 Key Questions and Fundamentals
Static Instruction Scheduling
(with a Slight Focus on VLIW)  Enabler of Better Static Scheduling: Block Enlargement
 Predicated Execution
 Loop Unrolling
 Trace
 Superblock
 Hyperblock
 Block-structured ISA

928

Key Questions How Do We Enable Straight-Line Code?


Q1. How do we find independent instructions to fetch/execute?  Get rid of control flow
 Predicated Execution
Q2. How do we enable more compiler optimizations?  Loop Unrolling
e.g., common subexpression elimination, constant  …
propagation, dead code elimination, redundancy elimination, …
 Optimize frequently executed control flow paths
Q3. How do we increase the instruction fetch rate?  Trace
i.e., have the ability to fetch more instructions per cycle  Superblock
 Hyperblock
 Block-structured ISA
 …
A: Enabling the compiler to optimize across a larger number of
instructions that will be executed straight line (without branches
getting in the way) eases all of the above
929 930

Review: Predication (Predicated Execution) Review: Loop Unrolling


 Idea: Compiler converts control dependence into data
dependence  branch is eliminated
 Each instruction has a predicate bit set based on the predicate computation
 Only instructions with TRUE predicates are committed (others turned into NOPs)

(normal branch code) (predicated code)


A
T N A
if (cond) {
b = 0; C B B
} C  Idea: Replicate loop body multiple times within an iteration
else { D D + Reduces loop maintenance overhead
b = 1; A  Induction variable increment or loop condition test
p1 = (cond)
} branch p1, TARGET
A
p1 = (cond) + Enlarges basic block (and analysis scope)
B
mov b, 1 B
 Enables code optimization and scheduling opportunities
jmp JOIN (!p1) mov b, 1 -- What if iteration count not a multiple of unroll factor? (need extra code to detect
C
TARGET: C this)
mov b, 0
(p1) mov b, 0
D D -- Increases code size
add x, b, 1 add x, b, 1 931 932
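The same if-conversion is easy to see at the source level: the branch-free form computes the selection through a data dependence, which is what the predicated code does with p1 (a compiler typically turns this into a conditional move or predicated instructions):

int with_branch(int cond)
{
    int b;
    if (cond) b = 0; else b = 1;   /* control dependence: needs a branch       */
    return b + 1;                  /* the add x, b, 1 from the example         */
}

int if_converted(int cond)
{
    int b = cond ? 0 : 1;          /* selection through data, no branch needed */
    return b + 1;
}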

Some Terminology: Basic vs. Atomic Block VLIW: Finding Independent Operations
 Basic block: A sequence (block) of instructions with a single  Within a basic block, there is limited instruction-level
control flow entry point and a single control flow exit point parallelism (if the basic block is small)
 A basic block executes uninterrupted (if no  To find multiple instructions to be executed in parallel, the
exceptions/interrupts) compiler needs to consider multiple basic blocks

 Atomic block: A block of instructions where either all  Problem: Moving an instruction above a branch is unsafe
instructions complete or none complete because instruction is not guaranteed to be executed
 In most modern ISAs, the atomic unit of execution is at the
granularity of an instruction
 Idea: Enlarge blocks at compile time by finding the
 A basic block can be considered atomic (if there are no
exceptions/interrupts and side effects observable in the middle frequently-executed paths
of execution)  Trace scheduling
 One can reorder instructions freely within an atomic block,  Superblock scheduling
subject only to true data dependences  Hyperblock scheduling
933 934

Safety and Legality in Code Motion Code Movement Constraints


 Two characteristics of speculative code motion:  Downward
 Safety: whether or not spurious exceptions may occur  When moving an operation from a BB to one of its dest BB’s,
 Legality: whether or not result will be always correct  all the other dest basic blocks should still be able to use the result
of the operation
 Four possible types of code motion:
 the other source BB’s of the dest BB should not be disturbed

 Upward
r1 = ... r1 = r2 & r3 r4 = r1 ... r1 = r2 & r3
 When moving an operation from a BB to its source BB’s
(a) safe and legal (b) illegal
 register values required by the other dest BB’s must not be
destroyed
 the movement must not cause new exceptions

r1 = ... r1 = load A r4 = r1 ... r1 = load A


(c) unsafe (d) unsafe and illegal
935 936

Trace Scheduling Trace Scheduling (II)


 Trace: A frequently executed path in the control-flow graph  There may be conditional branches from the middle of the
(has multiple side entrances and multiple side exits) trace (side exits) and transitions from other traces into the
middle of the trace (side entrances).
 Idea: Find independent operations within a trace to pack
into VLIW instructions.  These control-flow transitions are ignored during trace
 Traces determined via profiling scheduling.
 Compiler adds fix-up code for correctness (if a side entrance
or side exit of a trace is exercised at runtime, corresponding  After scheduling, fix-up/bookkeeping code is inserted to
fix-up code is executed) ensure the correct execution of off-trace code.

 Fisher, “Trace scheduling: A technique for global microcode


compaction,” IEEE TC 1981.

937 938

Trace Scheduling Idea Trace Scheduling (III)

Instr 1 Instr 2
Instr 2 Instr 3
Instr 3 Instr 4
Instr 4 Instr 1
Instr 5 Instr 5

What bookkeeping is required when Instr 1


is moved below the side entrance in the trace?

939 940

Trace Scheduling (IV) Trace Scheduling (V)

Instr 3
Instr 1 Instr 2 Instr 1 Instr 1
Instr 4
Instr 2 Instr 3 Instr 2 Instr 5
Instr 3 Instr 4 Instr 3 Instr 2
Instr 4 Instr 1 Instr 4 Instr 3
Instr 5 Instr 5 Instr 5 Instr 4

What bookkeeping is required when Instr 5


moves above the side entrance in the trace?

941 942

Trace Scheduling (VI) Trace Scheduling Fixup Code Issues


 Sometimes need to copy instructions more than once to
ensure correctness on all paths (see C below)
A D A’ B’ C’ Y
Instr 5 B X B
Instr 1 Instr 1 Original C Scheduled
Instr 2 Instr 5 trace trace E
Instr 3 Instr 2 D Y A C’’’
Instr 4 Instr 3
Instr 5 Instr 4
E C E’’ D’’ B’’ X

B X
Correctness C
D Y
943 944

Trace Scheduling Overview Data Precedence Graph


 Trace Selection
 select seed block (the highest frequency basic block) i1 i2 i5 i6 i7 i10 i11 i12
 extend trace (along the highest frequency edges)
2 2 2 2 2 2
forward (successor of the last block of the trace) 2 2
i3 i8 i13
backward (predecessor of the first block of the trace)
 don’t cross loop back edge 2 2 2
 bound max_trace_length heuristically i4 i9 i14

 Trace Scheduling
 build data precedence graph for a whole trace 4 4

 perform list scheduling and allocate registers i15


 add compensation code to maintain semantic correctness 2
 Speculative Code Motion (upward)
 move an instruction above a branch if safe i16

945 946

List Scheduling Instruction Prioritization Heuristics


 Assign priority to each instruction  Number of descendants in precedence graph
 Initialize ready list that holds all ready instructions  Maximum latency from root node of precedence graph
 Ready = data ready and can be scheduled  Length of operation latency
 Choose one ready instruction I from ready list with the  Ranking of paths based on importance
highest priority  Combination of above
 Possibly using tie-breaking heuristics
 Insert I into schedule
 Making sure resource constraints are satisfied
 Add those instructions whose precedence constraints are
now satisfied into the ready list
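A minimal C sketch of this loop for a single-issue machine with one instruction per cycle (the dependence matrix and priorities are a made-up example; a real list scheduler also models functional-unit types, latencies, and register constraints):

#include <stdio.h>

#define N 6   /* instructions in the scheduling region */

/* dep[p][i] = 1 if instruction p must execute before instruction i. */
static const int dep[N][N] = {
    [0][2] = 1, [1][2] = 1, [2][3] = 1, [2][4] = 1, [3][5] = 1, [4][5] = 1,
};
static const int priority[N] = { 3, 3, 4, 2, 2, 1 };   /* e.g., descendants / path length */

int main(void)
{
    int done[N] = { 0 };
    for (int cycle = 0; cycle < N; cycle++) {
        int pick = -1;
        for (int i = 0; i < N; i++) {              /* scan the ready list            */
            if (done[i]) continue;
            int ready = 1;
            for (int p = 0; p < N; p++)            /* all predecessors scheduled?    */
                if (dep[p][i] && !done[p]) ready = 0;
            if (ready && (pick < 0 || priority[i] > priority[pick]))
                pick = i;                          /* highest-priority ready op wins */
        }
        done[pick] = 1;
        printf("cycle %d: issue instruction %d\n", cycle, pick);
    }
    return 0;
}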

947 948

VLIW List Scheduling Trace Scheduling Example (I)


 Assign Priorities
 Compute Data Ready List (DRL) - all operations whose predecessors
fdiv f1, f2, f3 B1 fdiv f1, f2, f3
have been scheduled. fadd f4, f1, f5 9 stalls
fadd f4, f1, f5
 Select from DRL in priority order while checking resource constraints beq r1, $0 beq r1, $0
 Add newly ready operations to DRL and repeat for next instruction 990 10 r2 and f2
not live
B2 B3 out
ld r2, 0(r3) ld r2, 4(r3) ld r2, 0(r3)
5
1 stall B3
1 990 10
4-wide VLIW Data Ready List add r2, r2, 4 B4 add r2, r2, 4
2 3 3 3 4 beq r2, $0 beq r2, $0
2 3 4 5 6 1 {1}
800 200
f2 not
2 2 3 6 3 4 5 {2,3,4,5,6} live out
B5 fsub f2, f2, f6
fsub f2, f3, f7
B6 fsub f2, f2, f6
7 8 9
9 2 7 8 {2,7,8,9} st.d f2, 0(r8) st.d f2, 0(r8)
200
1 stall B6
1 1 2 800
12 10 11 {10,11,12}
10 11 12 B7
add r3, r3, 4 add r3, r3, 4
1 13 {13} add r8, r8, 4 add r8, r8, 4
13
949 950

Trace Scheduling Example (II) Trace Scheduling Example (III)

fdiv f1, f2, f3


fdiv f1, f2, f3 fdiv f1, f2, f3 beq r1, $0
beq r1, $0 beq r1, $0
fadd f4, f1, f5
fadd f4, f1, f5 ld r2, 0(r3)
ld r2, 0(r3) ld r2, 0(r3) fsub f2, f2, f6
fsub f2, f2, f6 fsub f2, f2, f6 add r2, r2, 4 Split
add r2, r2, 4 add r2, r2, 4 Split beq r2, $0 comp. code
0 stall beq r2, $0 beq r2, $0 comp. code
0 stall fadd f4, f1, f5
fadd f4, f1, f5 st.d f2, 0(r8)
st.d f2, 0(r8) st.d f2, 0(r8)
1 stall
add r3, r3, 4
add r3, r3, 4 add r3, r3, 4 add r8, r8, 4 B3 B6
add r8, r8, 4 B3 add r8, r8, 4 B3 fadd f4, f1, f5
fadd f4, f1, f5 fadd f4, f1, f5 add r3, r3, 4
B6 B6 add r8, r8, 4

Join comp. code


951 952

Trace Scheduling Example (IV) Trace Scheduling Example (V)

fdiv f1, f2, f3 fdiv f1, f2, f3


beq r1, $0 beq r1, $0
B3
fadd f4, f1, f5
ld r2, 0(r3) ld r2, 0(r3) fadd f4, f1, f5 B3
fsub f2, f2, f6 Split fsub f2, f2, f6
add r2, r2, 4 ld r2, 4(r3)
comp. code add r2, r2, 4
beq r2, $0 add r2, r2, 4 beq r2, $0 add r2, r2, 4
beq r2, $0 beq r2, $0
fadd f4, f1, f5 fsub f2, f2, f6
st.d f2, 0(r8) st.d f2, 0(r8)
add r3, r3, 4 st.d f2, 0(r8) fadd f4, f1, f5 fsub f2, f2, f6
add r8, r8, 4 add r3, r3, 4 st.d f2, 0(r8)
add r3, r3, 4 add r8, r8, 4
Copied fadd f4, f1, f5 add r3, r3, 4
add r8, r8, 4 B6 add r8, r8, 4
split fsub f2, f3, f7 B6
fadd f4, f1, f5 instructions add r3, r3, 4
add r3, r3, 4 add r8, r8, 4
add r8, r8, 4

Join comp. code

953 954

Trace Scheduling Tradeoffs Superblock Scheduling


 Advantages  Trace: multiple entry, multiple exit block
+ Enables the finding of more independent instructions  fewer  Superblock: single-entry, multiple exit block
NOPs in a VLIW instruction  A trace whose side entrances are eliminated
 Infrequent paths do not interfere with the frequent path
 Disadvantages + More optimization/scheduling opportunity than traces
-- Profile dependent + Eliminates “difficult” bookkeeping due to side entrances
-- What if dynamic path deviates from trace?
-- Code bloat and additional fix-up code executed
-- Due to side entrances and side exits
-- Infrequent paths interfere with the frequent path
-- Effectiveness depends on the bias of branches
-- Unbiased branches  smaller traces  less opportunity for
finding independent instructions

955 956
Hwu+, “The Superblock: An Effective Technique for VLIW and superscalar compilation,” J of SC 1991.

Superblock Example Superblock Scheduling Shortcomings


Could you have done this with a trace? -- Still profile-dependent
opA: mul r1,r2,3 opA: mul r1,r2,3
1 1 -- No single frequently executed path if there is an unbiased
99 opB: add r2,r2,1 99 opB: add r2,r2,1 branch
1 opC’: mul r3,r2,3 -- Reduces the size of superblocks
opC: mul r3,r2,3 opC: mul r3,r2,3
Original Code Code After Superblock Formation
(using Tail Duplication) -- Code bloat and additional fix-up code executed
opA: mul r1,r2,3
1
-- Due to side exits
99 opB: add r2,r2,1
opC’: mul r3,r2,3
opC: mov r3,r1
Code After Common
Subexpression Elimination
957 958

Hyperblock Scheduling Hyperblock Formation (I)


Hyperblock formation 10
 Idea: Use predication support to eliminate unbiased 

branches and increase the size of superblocks 1. Block selection


BB1
2. Tail duplication
 Hyperblock: A single-entry, multiple-exit block with internal 90 80 20
3. If-conversion
control flow eliminated using predication (if-conversion) BB2 BB3

 Block selection 80 20
 Advantages  Select subset of BBs for inclusion in HB BB4
Difficult problem 10
+ Reduces the effect of unbiased branches on scheduling block 

Weighted cost/benefit function


size 
BB5 90
 Height overhead
 Resource overhead 10
Dependency overhead
 Disadvantages  BB6
 Branch elimination benefit
-- Requires predicated execution support  Weighted by frequency
10
-- All disadvantages of predicated execution
 Mahlke et al., “Effective Compiler Support for Predicated Execution Using the
Hyperblock,” MICRO 1992.
959 960

Hyperblock Formation (II) Hyperblock Formation (III)


Tail duplication same as with Superblock formation
10
If-convert (predicate) intra-hyperblock branches
10 10
BB1 10
BB1 BB1
80 20 80 20 80 20 BB1
BB2 BB3
BB2 BB3 BB2 BB3 p1,p2 = CMPP
80 20
80 20 80 20 BB2 if p1
BB4
BB4 BB4
10 BB3 if p2
10 10
BB5 90 BB4
90 BB5 90 BB5
10 BB6 BB5
10 10
BB6 81 10
BB6 BB6’ BB6 BB6’ 9
90 81 81
9 9 BB6’
10 9 9
1 1
1

961 962

Aside: Test of Time Can We Do Better?


 Mahlke et al., “Effective Compiler Support for Predicated  Hyperblock still has disadvantages
Execution Using the Hyperblock,” MICRO 1992.  Profile dependent (Optimizes for a single path)
 Requires fix-up code
 MICRO Test of Time Award  And, it requires predication support
 http://www.cs.cmu.edu/~yixinluo/new_home/2014-Micro-
ToT.html  Can we do even better?

 Solution: Single-entry, single-exit enlarged blocks


 Block-structured ISA: atomic enlarged blocks

963 964

Block Structured ISA Block Structured ISA (II)


 Blocks (> instructions) are atomic (all-or-none) operations  Advantages
 Either all of the block is committed or none of it + Large atomic blocks
 Aggressive compiler optimizations (e.g. reordering) can be enabled
 Compiler enlarges blocks by combining basic blocks with within atomic blocks (no side entries or exits)
their control flow successors  Larger units can be fetched from I-cache  wide fetch
 Branches within the enlarged block converted to “fault” + Can dynamically predict which optimized atomic block is
operations  if the fault operation evaluates to true, the block executed using a “branch predictor”
is discarded and the target of fault is fetched  can optimize multiple “hot” paths
+ No compensation (fix-up) code

 Disadvantages
-- "Fault operations" can lead to wasted work (atomicity)
-- Code bloat (multiple copies of the same basic block exists in
the binary and possibly in I-cache)
-- Need to predict which enlarged block comes next
Melvin and Patt, “Enhancing Instruction Scheduling with a Block-Structured ISA,” IJPP 1995. 965 966

Block Structured ISA (III) Superblock vs. BS-ISA


 Hao et al., “Increasing the instruction fetch rate via block-  Superblock
structured instruction set architectures,” MICRO 1996.  Single-entry, multiple exit code block
 Not atomic
 Compiler inserts fix-up code on superblock side exit
 Only one path optimized (hardware has no choice to pick
dynamically)

 BS-ISA blocks
 Single-entry, single exit
 Atomic
 Need to roll back to the beginning of the block on fault
 Multiple paths optimized (hardware has a choice to pick)
967 968

Superblock vs. BS-ISA Summary and Questions


 Superblock  Trace, superblock, hyperblock, block-structured ISA
+ No ISA support needed
-- Optimizes for only 1 frequently executed path  How many entries, how many exits does each of them have?
-- Not good if dynamic path deviates from profiled path  missed
 What are the corresponding benefits and downsides?
opportunity to optimize another path

 Block Structured ISA  What are the common benefits?


+ Enables optimization of multiple paths and their dynamic selection.  Enable and enlarge the scope of code optimizations
+ Dynamic prediction to choose the next enlarged block. Can  Reduce fetch breaks; increase fetch rate
dynamically adapt to changes in frequently executed paths at run-
time
+ Atomicity can enable more aggressive code optimization  What are the common downsides?
-- Code bloat becomes severe as more blocks are combined  Code bloat (code size increase)
-- Requires “next enlarged block” prediction, ISA+HW support  Wasted work if control flow deviates from enlarged block’s path
-- More wasted work on “fault” due to atomicity requirement
969 970

18-447
Computer Architecture IA-64: A “Complicated” VLIW ISA
Lecture 17: Memory Hierarchy and Caches

Recommended reading:
Huck et al., “Introducing the IA-64 Architecture,” IEEE Micro 2000.

Prof. Onur Mutlu


Carnegie Mellon University
Spring 2015, 2/25/2015

EPIC – Intel IA-64 Architecture IA-64 Instructions


 Gets rid of lock-step execution of instructions within a VLIW  IA-64 “Bundle” (~EPIC Instruction)
instruction  Total of 128 bits
 Idea: More ISA support for static scheduling and parallelization  Contains three IA-64 instructions
 Specify dependencies within and between VLIW instructions
 Template bits in each bundle specify dependencies within a
(explicitly parallel) bundle

+ No lock-step execution
+ Static reordering of stores and loads + dynamic checking
-- Hardware needs to perform dependency checking (albeit aided by
software)  IA-64 Instruction
-- Other disadvantages of VLIW still exist  Fixed-length 41 bits long
 Contains three 7-bit register specifiers
 Huck et al., “Introducing the IA-64 Architecture,” IEEE Micro, Sep/Oct
 Contains a 6-bit field for specifying one of the 64 one-bit
2000. predicate registers
973 974

IA-64 Instruction Bundles and Groups Template Bits


 Groups of instructions can be  Specify two things
executed safely in parallel  Stop information: Boundary of independent instructions
 Marked by “stop bits”  Functional unit information: Where should each instruction be routed

 Bundles are for packaging


 Groups can span multiple bundles
 Alleviates recompilation need
somewhat

975 976

Three Things That Hinder Static Scheduling Non-Faulting Loads and Exception Propagation in IA-64
 Dynamic events (static unknowns)  Idea: Support unsafe code motion
ld.s r1=[a]
inst 1 inst 1
 Branch direction inst 2 unsafe inst 2
 Load hit miss status …. code ….
motion br
 Memory address br

 Let’s see how IA-64 ISA has support to aid scheduling in …. ld r1=[a] …. chk.s r1 ld r1=[a]
the presence of statically-unknown load-store addresses use=r1 use=r1

 ld.s (speculative load) fetches speculatively from memory


i.e. any exception due to ld.s is suppressed
 If ld.s r1 did not cause an exception then chk.s r1 is a NOP, else a
branch is taken (to execute some compensation code)
977 978

Non-Faulting Loads and Exception Propagation in IA-64 Aggressive ST-LD Reordering in IA-64
 Idea: Support unsafe code motion  Idea: Reorder LD/STs in the presence of unknown address
 Load and its use
ld.s r1=[a]
inst 1 inst 1 ld.a r1=[x]
inst 1 inst 2 potential
inst 2 unsafe use=r1 inst 2 aliasing inst 1
…. code …. …. inst 2
br motion
br br st [?]
st[?] ….
…. st [?]
ld r1=[x] ….
…. ld r1=[a] …. chk.s use ld r1=[a]
use=r1 use=r1 use=r1 ld.c r1=[x]
use=r1
 Load data can be speculatively consumed (use) prior to check
 “speculation” status is propagated with speculated data  ld.a (advanced load) starts the monitoring of any store to the same
 Any instruction that uses a speculative result also becomes speculative address as the advanced load
itself (i.e. suppressed exceptions)  If no aliasing has occurred since ld.a, ld.c is a NOP
 chk.s checks the entire dataflow sequence for exceptions  If aliasing has occurred, ld.c re-loads from memory
979 980

Aggressive ST-LD Reordering in IA-64 What We Covered So Far in 447


 Idea: Reorder LD/STs in the presence of unknown address  ISA  Single-cycle Microarchitectures
 Load and its use
 Multi-cycle and Microprogrammed Microarchitectures
inst 1 potential ld.a r1=[x]
inst 2 inst 1  Pipelining
aliasing
…. inst 2
st [?] use=r1  Issues in Pipelining: Control & Data Dependence Handling,
st[?] State Maintenance and Recovery, …
…. ….
ld r1=[x] st [?]
use=r1 ….  Out-of-Order Execution
chk.a X ld r1=[a]
…. use=r1  Issues in OoO Execution: Load-Store Handling, …

 Alternative Approaches to Instruction Level Parallelism


981 982

Approaches to (Instruction-Level) Concurrency Agenda for the Rest of 447


 Pipelining  The memory hierarchy
 Out-of-order execution  Caches, caches, more caches (high locality, high bandwidth)
 Dataflow (at the ISA level)  Virtualizing the memory hierarchy
 SIMD Processing (Vector and array processors, GPUs)  Main memory: DRAM
 VLIW  Main memory control, scheduling
 Decoupled Access Execute  Memory latency tolerance techniques
 Systolic Arrays  Non-volatile memory

 Static Instruction Scheduling  Multiprocessors


 Coherence and consistency
 Interconnection networks
 Multi-core issues
983 984

Readings for Today and Next Lecture Memory (Programmer’s View)


 Memory Hierarchy and Caches

 Cache chapters from P&H: 5.1-5.3


 Memory/cache chapters from Hamacher+: 8.1-8.7
 An early cache paper by Maurice Wilkes
 Wilkes, “Slave Memories and Dynamic Storage Allocation,”
IEEE Trans. On Electronic Computers, 1965.

985 986

Abstraction: Virtual vs. Physical Memory (Physical) Memory System


 Programmer sees virtual memory  You need a larger level of storage to manage a small
 Can assume the memory is “infinite” amount of physical memory automatically
 Reality: Physical memory size is much smaller than what  Physical memory has a backing store: disk
the programmer assumes
 The system (system software + hardware, cooperatively)  We will first start with the physical memory system
maps virtual memory addresses to physical memory addresses
 The system automatically manages the physical memory  For now, ignore the virtual-to-physical indirection
 For now, ignore the virtualphysical indirection
space transparently to the programmer
As you have been doing in labs
+ Programmer does not need to know the physical size of memory
nor manage it  A small physical memory can appear as a huge
 We will get back to it when the needs of virtual memory
one to the programmer  Life is easier for the programmer
start complicating the design of physical memory…
-- More complex system software and architecture

A classic example of the programmer/(micro)architect tradeoff


987 988

Idealism

Instruction
Supply
Pipeline
(Instruction
Data
Supply
The Memory Hierarchy
execution)

- Zero-cycle latency - No pipeline stalls - Zero-cycle latency

- Infinite capacity -Perfect data flow - Infinite capacity


(reg/memory dependencies)
- Zero cost - Infinite bandwidth
- Zero-cycle interconnect
- Perfect control flow (operand communication) - Zero cost
- Enough functional units

- Zero latency compute


989

Memory in a Modern System Ideal Memory


 Zero access time (latency)
 Infinite capacity
L2 CACHE 1
L2 CACHE 0

Zero cost
SHARED L3 CACHE


DRAM INTERFACE

CORE 0 CORE 1
DRAM BANKS

 Infinite bandwidth (to support multiple accesses in parallel)

DRAM MEMORY
CONTROLLER
L2 CACHE 2

L2 CACHE 3

CORE 2 CORE 3

991 992

The Problem Memory Technology: DRAM


 Ideal memory’s requirements oppose each other  Dynamic random access memory
 Capacitor charge state indicates stored value
 Bigger is slower  Whether the capacitor is charged or discharged indicates
 Bigger  Takes longer to determine the location storage of 1 or 0
 1 capacitor
 1 access transistor
 Faster is more expensive
 Memory technology: SRAM vs. DRAM vs. Disk vs. Tape row enable
 Capacitor leaks through the RC path
 DRAM cell loses charge over time
Higher bandwidth is more expensive

_bitline

 DRAM cell needs to be refreshed
 Need more banks, more ports, higher frequency, or faster
technology

993 994

Memory Technology: SRAM Memory Bank Organization and Operation


 Static random access memory  Read access sequence:

 Two cross coupled inverters store a single bit 1. Decode row address
& drive word-lines
 Feedback path enables the stored value to persist in the “cell”
 4 transistors for storage 2. Selected bits drive
 2 transistors for access bit-lines
• Entire row read

3. Amplify row data

row select 4. Decode column


address & select subset
of row
_bitline
bitline

• Send to output

5. Precharge bit-lines
• For next access

995 996

SRAM (Static Random Access Memory) DRAM (Dynamic Random Access Memory)
row enable Bits stored as charges on node
Read Sequence
row select capacitance (non-restorative)
1. address decode
- bit cell loses charge when read

_bitline
2. drive row select
- bit cell loses charge over time
3. selected bit-cells drive bitlines

_bitline
bitline

(entire row is read together)


Read Sequence
4. differential sensing and column select
1~3 same as SRAM
(data is ready)
4. a “flip-flopping” sense amp
RAS amplifies and regenerates the
bit-cell array
5. precharge all bitlines bitline, data bit is mux’ed out
(for next read or write) n 2n
2n row x 2m -col 5. precharge all bitlines
bit-cell array
2n (nm to minimize
n+m n 2n row x 2m -col Access latency dominated by steps 2 and 3 overall latency) Destructive reads
Cycling time dominated by steps 2, 3 and 5 Charge loss over time
(nm to minimize m
step 2 proportional to 2m 2m
overall latency) -
Refresh: A DRAM controller must
sense amp and mux
- step 3 and 5 proportional to 2n 1 periodically read each row within
m 2m diff pairs
A DRAM die comprises the allowed refresh time (10s of
sense amp and mux
1 CAS of multiple such arrays ms) such that charge is restored
997 998

DRAM vs. SRAM The Problem


 DRAM  Bigger is slower
 Slower access (capacitor)  SRAM, 512 Bytes, sub-nanosec
 Higher density (1T 1C cell)  SRAM, KByte~MByte, ~nanosec
 Lower cost  DRAM, Gigabyte, ~50 nanosec
 Requires refresh (power, performance, circuitry)  Hard Disk, Terabyte, ~10 millisec
 Manufacturing requires putting capacitor and logic together
 Faster is more expensive (dollars and chip area)
 SRAM, < 10$ per Megabyte
 SRAM
 DRAM, < 1$ per Megabyte
 Faster access (no capacitor)
 Hard Disk < 1$ per Gigabyte
 Lower density (6T cell)
 These sample values scale with time
 Higher cost
 No need for refresh  Other technologies have their place as well
 Manufacturing compatible with logic process (no capacitor)  Flash memory, PC-RAM, MRAM, RRAM (not mature yet)
999 1000

Why Memory Hierarchy? The Memory Hierarchy


 We want both fast and large
move what you use here fast
 But we cannot achieve both with a single level of memory small

 Idea: Have multiple levels of storage (progressively bigger With good locality of
and slower as the levels are farther from the processor) reference, memory

cheaper per byte


and ensure most of the data the processor needs is kept in appears as fast as
the fast(er) level(s)
and as large as

faster per byte


backup
everything big but slow
here
1001 1002

Memory Hierarchy Locality


 Fundamental tradeoff  One’s recent past is a very good predictor of his/her near
 Fast memory: small future.
 Large memory: slow
 Idea: Memory hierarchy  Temporal Locality: If you just did something, it is very
likely that you will do the same thing again soon
Hard Disk
 since you are here today, there is a good chance you will be
Main
here again and again regularly
CPU Cache Memory
RF (DRAM)
 Spatial Locality: If you did something, it is very likely you
will do something similar/related (in space)
 Latency, cost, size,  every time I find you in this room, you are probably sitting
close to the same people
bandwidth

1003 1004

Memory Locality Caching Basics: Exploit Temporal Locality


 A “typical” program has a lot of locality in memory  Idea: Store recently accessed data in automatically
references managed fast memory (called cache)
 typical programs are composed of “loops”  Anticipation: the data will be accessed again soon

 Temporal: A program tends to reference the same memory  Temporal locality principle
location many times and all within a small window of time  Recently accessed data will be again accessed in the near
future
 Spatial: A program tends to reference a cluster of memory  This is what Maurice Wilkes had in mind:
locations at a time  Wilkes, “Slave Memories and Dynamic Storage Allocation,” IEEE
Trans. On Electronic Computers, 1965.
 most notable examples:
 “The use is discussed of a fast core memory of, say 32000 words
 1. instruction memory references
as a slave to a slower core memory of, say, one million words in
 2. array/data structure references such a way that in practical cases the effective access time is
nearer that of the fast memory than that of the slow memory.”

1005 1006

Caching Basics: Exploit Spatial Locality The Bookshelf Analogy


 Idea: Store addresses adjacent to the recently accessed  Book in your hand
one in automatically managed fast memory  Desk
 Logically divide memory into equal size blocks  Bookshelf
 Fetch to cache the accessed block in its entirety  Boxes at home
 Anticipation: nearby data will be accessed soon  Boxes in storage

 Spatial locality principle  Recently-used books tend to stay on desk


 Nearby data in memory will be accessed in the near future
 Comp Arch books, books for classes you are currently taking
 E.g., sequential instruction access, array traversal
 Until the desk gets full
 This is what IBM 360/85 implemented
 Adjacent books in the shelf needed around the same time
 16 Kbyte cache with 64 byte blocks
 Liptay, “Structural aspects of the System/360 Model 85 II: the  If I have organized/categorized my books well in the shelf
cache,” IBM Systems Journal, 1968.

1007 1008

Caching in a Pipelined Design A Note on Manual vs. Automatic Management


 The cache needs to be tightly integrated into the pipeline  Manual: Programmer manages data movement across levels
 Ideally, access in 1-cycle so that dependent operations do not -- too painful for programmers on substantial programs
stall “core” vs “drum” memory in the 50’s

 High frequency pipeline  Cannot make the cache large still done in some embedded processors (on-chip scratch pad

 But, we want a large cache AND a pipelined design SRAM in lieu of a cache)
 Idea: Cache hierarchy
 Automatic: Hardware manages data movement across levels,
transparently to the programmer
Main ++ programmer’s life is easier
Level 2 Memory the average programmer doesn’t need to know about it
CPU Level1 Cache (DRAM)
RF Cache  You don’t need to know how big the cache is and how it works to
write a “correct” program! (What if you want a “fast” program?)

1009 1010

Automatic Management in Memory Hierarchy A Modern Memory Hierarchy


Register File
 Wilkes, “Slave Memories and Dynamic Storage Allocation,” 32 words, sub-nsec
IEEE Trans. On Electronic Computers, 1965. manual/compiler
Memory register spilling
L1 cache
Abstraction ~32 KB, ~nsec

L2 cache
512 KB ~ 1MB, many nsec Automatic
HW cache
L3 cache, management
.....
 “By a slave memory I mean one which automatically
accumulates to itself words that come from a slower main Main memory (DRAM),
GB, ~100 nsec
memory, and keeps them available for subsequent use automatic
without it being necessary for the penalty of main memory demand
Swap Disk
access to be incurred again.” 100 GB, ~10 msec paging
1011 1012

Hierarchical Latency Analysis Hierarchy Design Considerations


 For a given memory hierarchy level i it has a technology-intrinsic  Recursive latency equation
access time of ti. The perceived access time Ti is longer than ti
 Except for the outer-most hierarchy, when looking for a given
 The goal: achieve desired T1 within allowed cost
address there is
 a chance (hit-rate hi) you “hit” and access time is ti
 Ti  ti is desirable
 a chance (miss-rate mi) you “miss” and access time ti +Ti+1

 hi + mi = 1  Keep mi low
 Thus  increasing capacity Ci lowers mi, but beware of increasing ti
lower mi by smarter management (replacement::anticipate what you
Ti = hi·ti + mi·(ti + Ti+1) 
don’t need, prefetching::anticipate what you will need)
Ti = ti + mi ·Ti+1
 Keep Ti+1 low
hi and mi are defined to be the hit-rate  faster lower hierarchies, but beware of increasing cost
and miss-rate of just the references that missed at Li-1  introduce intermediate hierarchies as a compromise
1013 1014

Intel Pentium 4 Example


 90nm P4, 3.6 GHz
 L1 D-cache if m1=0.1, m2=0.1
T1=7.6, T2=36
 C1 = 16K Cache Basics and Operation
 t1 = 4 cyc int / 9 cycle fp if m1=0.01, m2=0.01
 L2 D-cache T1=4.2, T2=19.8
 C2 =1024 KB if m1=0.05, m2=0.01
 t2 = 18 cyc int / 18 cyc fp T1=5.00, T2=19.8
 Main memory if m1=0.01, m2=0.50
 t3 = ~ 50ns or 180 cyc T1=5.08, T2=108
 Notice
 best case latency is not 1
 worst case access latencies are into 500+ cycles
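The T1 and T2 values in the Pentium 4 example above follow directly from the recursive latency equation Ti = ti + mi·Ti+1. A minimal C sketch that reproduces those numbers (the function name and array layout are our own choices):

    // Recursive perceived-latency sketch: Ti = ti + mi * Ti+1.
    // Values are taken from the Pentium 4 example on this slide.
    #include <stdio.h>

    double perceived_latency(const double *t, const double *m, int levels, int i) {
        if (i == levels - 1)
            return t[i];                        // outermost level always services the access
        return t[i] + m[i] * perceived_latency(t, m, levels, i + 1);
    }

    int main(void) {
        double t[] = {4.0, 18.0, 180.0};        // L1, L2, main memory (cycles)
        double m[] = {0.1, 0.1, 0.0};           // per-level local miss rates
        printf("T1 = %.1f cycles\n", perceived_latency(t, m, 3, 0));   // 7.6
        printf("T2 = %.1f cycles\n", perceived_latency(t, m, 3, 1));   // 36.0
        return 0;
    }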

Cache Caching Basics


 Generically, any structure that “memoizes” frequently used  Block (line): Unit of storage in the cache
results to avoid repeating the long-latency operations  Memory is logically divided into cache blocks that map to
required to reproduce the results from scratch, e.g. a web locations in the cache
cache
 When data referenced
 HIT: If in cache, use cached data instead of accessing memory
 Most commonly in the on-die context: an automatically-  MISS: If not in cache, bring block into cache
managed memory hierarchy based on SRAM  Maybe have to kick something else out to do it
 memoize in SRAM the most frequently accessed DRAM
memory locations to avoid repeatedly paying for the DRAM  Some important cache design decisions
access latency  Placement: where and how to place/find a block in cache?
 Replacement: what data to remove to make room in cache?
 Granularity of management: large, small, uniform blocks?
 Write policy: what do we do about writes?
 Instructions/data: Do we treat them separately?
1017 1018

Cache Abstraction and Metrics


18-447
Address
Tag Store Data Store Computer Architecture
(is the address
in the cache? Lecture 18: Caches, Caches, Caches
+ bookkeeping)

Hit/miss? Data
Prof. Onur Mutlu
 Cache hit rate = (# hits) / (# hits + # misses) = (# hits) / (# accesses) Carnegie Mellon University
 Average memory access time (AMAT) Spring 2015, 2/27/2015
= ( hit-rate * hit-latency ) + ( miss-rate * miss-latency )
 Aside: Can reducing AMAT reduce performance?
1019

Agenda for the Rest of 447 Readings for Today and Next Lecture
 The memory hierarchy  Memory Hierarchy and Caches
 Caches, caches, more caches (high locality, high bandwidth)
 Virtualizing the memory hierarchy Required
 Main memory: DRAM  Cache chapters from P&H: 5.1-5.3

 Main memory control, scheduling  Memory/cache chapters from Hamacher+: 8.1-8.7

 Memory latency tolerance techniques


 Non-volatile memory Required + Review:
 Wilkes, “Slave Memories and Dynamic Storage Allocation,”
IEEE Trans. On Electronic Computers, 1965.
 Multiprocessors
 Qureshi et al., “A Case for MLP-Aware Cache Replacement,“
 Coherence and consistency ISCA 2006.
 Interconnection networks
 Multi-core issues
1021 1022

Review: Caching Basics Review: Caching Basics


 Caches are structures that exploit locality of reference in  Block (line): Unit of storage in the cache
memory  Memory is logically divided into cache blocks that map to
 Temporal locality locations in the cache
 Spatial locality  When data referenced
 HIT: If in cache, use cached data instead of accessing memory
 They can be constructed in many ways  MISS: If not in cache, bring block into cache
 Can exploit either temporal or spatial locality or both  Maybe have to kick something else out to do it

 Some important cache design decisions


 Placement: where and how to place/find a block in cache?
 Replacement: what data to remove to make room in cache?
 Granularity of management: large, small, uniform blocks?
 Write policy: what do we do about writes?
 Instructions/data: Do we treat them separately?
1023 1024

Cache Abstraction and Metrics A Basic Hardware Cache Design


 We will start with a basic hardware cache design
Address
Tag Store Data Store
 Then, we will examine a multitude of ideas to make it
(is the address (stores better
in the cache? memory
+ bookkeeping) blocks)

Hit/miss? Data

 Cache hit rate = (# hits) / (# hits + # misses) = (# hits) / (# accesses)


 Average memory access time (AMAT)
= ( hit-rate * hit-latency ) + ( miss-rate * miss-latency )
 Aside: Can reducing AMAT reduce performance?
1025 1026

Blocks and Addressing the Cache Direct-Mapped Cache: Placement and Access
 Memory is logically divided into fixed-size blocks  Assume byte-addressable memory:
256 bytes, 8-byte blocks  32 blocks
 Each block maps to a location in the cache, determined by  Assume cache: 64 bytes, 8 blocks
the index bits in the address tag index byte in block  Direct-mapped: A block can go to only one location
 used to index into the tag and data stores 2b 3 bits 3 bits tag index byte in block

8-bit address 2b 3 bits 3 bits Tag store Data store


Address
 Cache access:
1) index into the tag and data stores with index bits in address
2) check valid bit in tag store V tag

3) compare tag bits in address with the stored tag in tag store
byte in block
=? MUX

 If a block is in the cache (cache hit), the stored tag should be Hit? Data
valid and match the tag of the block  Addresses with same index contend for the same location
 Cause conflict misses
1027 1028
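A minimal C sketch of the direct-mapped lookup described above, using the slide's parameters (8-bit addresses, 8-byte blocks, 8 cache blocks: 2-bit tag, 3-bit index, 3-bit byte-in-block). Structure names and the miss handling are illustrative assumptions.

    #include <stdint.h>
    #include <stdbool.h>

    typedef struct {
        bool    valid;
        uint8_t tag;          // 2-bit tag
        uint8_t data[8];      // one 8-byte block
    } Line;

    static Line cache[8];     // 8 direct-mapped lines (64 bytes of data)

    // Returns true on a hit and writes the requested byte to *out.
    bool cache_lookup(uint8_t addr, uint8_t *out) {
        uint8_t offset = addr & 0x7;           // bits [2:0]: byte in block
        uint8_t index  = (addr >> 3) & 0x7;    // bits [5:3]: index into tag/data stores
        uint8_t tag    = (addr >> 6) & 0x3;    // bits [7:6]: tag
        if (cache[index].valid && cache[index].tag == tag) {
            *out = cache[index].data[offset];  // cache hit
            return true;
        }
        return false;                          // miss: fetch the block and fill the line
    }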

Direct-Mapped Caches Set Associativity


 Direct-mapped cache: Two blocks in memory that map to  Addresses 0 and 8 always conflict in direct mapped cache
the same index in the cache cannot be present in the cache  Instead of having one column of 8, have 2 columns of 4 blocks
at the same time
Tag store Data store
 One index  one entry
SET

 Can lead to 0% hit rate if more than one block accessed in V tag V tag

an interleaved manner map to the same index


 Assume addresses A and B have the same index bits but =? =? MUX
different tag bits
Logic byte in block
 A, B, A, B, A, B, A, B, …  conflict in the cache index MUX

 All accesses are conflict misses Address Hit?


tag index byte in block
Key idea: Associative memory within the set
3b 2 bits 3 bits
+ Accommodates conflicts better (fewer conflict misses)
-- More complex, slower access, larger tag store
1029 1030

Higher Associativity Full Associativity


 4-way Tag store  Fully associative cache
 A block can be placed in any cache location

=? =? =? =? Tag store

Logic Hit? =? =? =? =? =? =? =? =?
Data store Logic

Hit?

MUX
Data store
byte in block
MUX
MUX
byte in block
MUX
+ Likelihood of conflict misses even lower
-- More tag comparators and wider data mux; larger tags
1031 1032

Associativity (and Tradeoffs) Issues in Set-Associative Caches


 Degree of associativity: How many blocks can map to the  Think of each block in a set having a “priority”
same index (or set)?  Indicating how important it is to keep the block in the cache
 Key issue: How do you determine/adjust block priorities?
 Higher associativity
 There are three key decisions in a set:
++ Higher hit rate
 Insertion, promotion, eviction (replacement)
-- Slower cache access time (hit latency and data access latency)
-- More expensive hardware (more comparators)
 Where to insert the incoming block, whether or not to insert the block
 Diminishing returns from higher  Promotion: What happens to priorities on a cache hit?
associativity  Whether and how to change block priority
 Eviction/replacement: What happens to priorities on a cache
miss?
 Which block to evict and how to adjust priorities
1033 1034

Eviction/Replacement Policy Implementing LRU


 Which block in the set to replace on a cache miss?  Idea: Evict the least recently accessed block
 Any invalid block first  Problem: Need to keep track of access ordering of blocks
 If all are valid, consult the replacement policy
Random

 Question: 2-way set associative cache:
 FIFO
 What do you need to implement LRU perfectly?
 Least recently used (how to implement?)
 Not most recently used
 Least frequently used?  Question: 4-way set associative cache:
 Least costly to re-fetch?  What do you need to implement LRU perfectly?
 Why would memory accesses have different cost?  How many different orderings possible for the 4 blocks in the
 Hybrid replacement policies set?
 Optimal replacement policy?  How many bits needed to encode the LRU order of a block?
 What is the logic needed to determine the LRU victim?

1035 1036
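As one possible answer to the questions above: true LRU for a 4-way set must distinguish 4! = 24 access orderings, so at least 5 bits per set are needed if the order is fully encoded; a simpler (and common) alternative keeps a 2-bit age per block. A sketch of the age-counter variant, with names of our choosing:

    #define WAYS 4

    typedef struct {
        unsigned age[WAYS];      // a permutation of {0,1,2,3}; 0 = MRU, 3 = LRU
    } LruSet;

    // On a hit to (or fill of) way w: ways more recent than w age by one,
    // and w becomes the MRU way.
    void lru_promote(LruSet *s, int w) {
        for (int i = 0; i < WAYS; i++)
            if (s->age[i] < s->age[w])
                s->age[i]++;
        s->age[w] = 0;
    }

    // On a miss: the victim is the way with the largest age (the LRU way).
    int lru_victim(const LruSet *s) {
        int victim = 0;
        for (int i = 1; i < WAYS; i++)
            if (s->age[i] > s->age[victim])
                victim = i;
        return victim;
    }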

Approximations of LRU Hierarchical LRU (not MRU)


 Most modern processors do not implement “true LRU” (also  Divide a set into multiple groups
called “perfect LRU”) in highly-associative caches  Keep track of only the MRU group
 Keep track of only the MRU block in each group
 Why?
 True LRU is complex  On replacement, select victim as:
 LRU is an approximation to predict locality anyway (i.e., not  A not-MRU block in one of the not-MRU groups (randomly pick
the best possible cache management policy) one of such blocks/groups)

 Examples:
 Not MRU (not most recently used)
 Hierarchical LRU: divide the 4-way set into 2-way “groups”,
track the MRU group and the MRU way in each group
 Victim-NextVictim Replacement: Only keep track of the victim
and the next victim
1037 1038

Hierarchical LRU (not MRU) Example Hierarchical LRU (not MRU) Example

1039 1040

Hierarchical LRU (not MRU): Questions Victim/Next-Victim Policy


 16-way cache  Only 2 blocks’ status tracked in each set:
 2 8-way groups  victim (V), next victim (NV)
 all other blocks denoted as (O) – Ordinary block
 What is an access pattern that performs worse than true
LRU?  On a cache miss
 Replace V
 What is an access pattern that performs better than true  Demote NV to V
LRU?  Randomly pick an O block as NV

 On a cache hit to V
 Demote NV to V
 Randomly pick an O block as NV
 Turn V to O
1041 1042

Victim/Next-Victim Policy (II) Victim/Next-Victim Example


 On a cache hit to NV
 Randomly pick an O block as NV
 Turn NV to O

 On a cache hit to O
 Do nothing

1043 1044
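A sketch of the Victim/Next-Victim bookkeeping described above, for one 8-way set. The encoding, the helper rand_ordinary (assumed to return the index of a randomly chosen Ordinary block), and the Ordinary status given to the newly inserted block are our assumptions.

    enum Status { O, V, NV };                 /* Ordinary, Victim, Next Victim */

    typedef struct {
        enum Status status[8];                /* one status per way (8-way set) */
    } VnvSet;

    /* Placeholder helper: index of a randomly chosen O block in the set. */
    extern int rand_ordinary(const VnvSet *s);

    void vnv_on_hit(VnvSet *s, int way) {
        if (s->status[way] == V) {                        /* hit to V */
            for (int i = 0; i < 8; i++)                   /* demote NV to V */
                if (s->status[i] == NV) s->status[i] = V;
            s->status[rand_ordinary(s)] = NV;             /* random O becomes NV */
            s->status[way] = O;                           /* turn V to O */
        } else if (s->status[way] == NV) {                /* hit to NV */
            int new_nv = rand_ordinary(s);                /* random O becomes NV */
            s->status[way] = O;                           /* turn NV to O */
            s->status[new_nv] = NV;
        }                                                 /* hit to O: do nothing */
    }

    int vnv_on_miss(VnvSet *s) {
        int victim = 0;
        for (int i = 0; i < 8; i++)
            if (s->status[i] == V) victim = i;            /* replace V */
        for (int i = 0; i < 8; i++)
            if (s->status[i] == NV) s->status[i] = V;     /* demote NV to V */
        s->status[victim] = O;                            /* incoming block: assumed O */
        s->status[rand_ordinary(s)] = NV;                 /* random O becomes NV */
        return victim;                                    /* way to fill */
    }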

Cache Replacement Policy: LRU or Random What Is the Optimal Replacement Policy?
 LRU vs. Random: Which one is better?  Belady’s OPT
 Example: 4-way cache, cyclic references to A, B, C, D, E  Replace the block that is going to be referenced furthest in the
 0% hit rate with LRU policy future by the program
 Set thrashing: When the “program working set” in a set is  Belady, “A study of replacement algorithms for a virtual-
larger than set associativity storage computer,” IBM Systems Journal, 1966.
 Random replacement policy is better when thrashing occurs  How do we implement this? Simulate?
 In practice:
 Depends on workload  Is this optimal for minimizing miss rate?
 Average hit rate of LRU and Random are similar  Is this optimal for minimizing execution time?
 No. Cache miss latency/cost varies from block to block!
 Best of both Worlds: Hybrid of LRU and Random  Two reasons: Remote vs. local caches and miss overlapping
 How to choose between the two? Set sampling  Qureshi et al. “A Case for MLP-Aware Cache Replacement,“
 See Qureshi et al., “A Case for MLP-Aware Cache Replacement,“ ISCA 2006.
ISCA 2006.
1045 1046

Aside: Cache versus Page Replacement What’s In A Tag Store Entry?


 Physical memory (DRAM) is a cache for disk  Valid bit
 Usually managed by system software via the virtual memory  Tag
subsystem  Replacement policy bits

 Page replacement is similar to cache replacement  Dirty bit?


 Page table is the “tag store” for physical memory data store  Write back vs. write through caches

 What is the difference?


 Required speed of access to cache vs. physical memory
 Number of blocks in a cache vs. physical memory
 “Tolerable” amount of time to find a replacement candidate
(disk versus memory access latency)
 Role of hardware versus software
1047 1048

Handling Writes (I) Handling Writes (II)


 When do we write the modified data in a cache to the next level?  Do we allocate a cache block on a write miss?
 Write through: At the time the write happens  Allocate on write miss: Yes
 Write back: When the block is evicted  No-allocate on write miss: No

 Write-back  Allocate on write miss


+ Can consolidate multiple writes to the same block before eviction
+ Can consolidate writes instead of writing each of them
 Potentially saves bandwidth between cache levels + saves energy
individually to next level
-- Need a bit in the tag store indicating the block is “dirty/modified”
+ Simpler because write misses can be treated the same way as
read misses
 Write-through
-- Requires (?) transfer of the whole cache block
+ Simpler
+ All levels are up to date. Consistency: Simpler cache coherence because
no need to check lower-level caches  No-allocate
-- More bandwidth intensive; no coalescing of writes + Conserves cache space if locality of writes is low (potentially
better cache hit rate)
1049 1050

Handling Writes (III) Sectored Caches


 What if the processor writes to an entire block over a small  Idea: Divide a block into subblocks (or sectors)
amount of time?  Have separate valid and dirty bits for each sector
 When is this useful? (Think writes…)
 Is there any need to bring the block into the cache from
memory in the first place? ++ No need to transfer the entire cache block into the cache
(A write simply validates and updates a subblock)
 Ditto for a portion of the block, i.e., subblock ++ More freedom in transferring subblocks into the cache (a
 E.g., 4 bytes out of 64 bytes cache block does not need to be in the cache fully)
(How many subblocks do you transfer on a read?)

-- More complex design


-- May not exploit spatial locality fully when used for reads
v d subblock v d subblock v d subblock tag
1051 1052
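A sketch of a tag-store entry for a sectored (subblocked) cache, following the "v d subblock ... tag" layout sketched above. The subblock count and field widths are illustrative assumptions.

    #include <stdint.h>

    #define SUBBLOCKS 4

    typedef struct {
        uint32_t tag;
        uint8_t  valid[SUBBLOCKS];   // per-subblock valid bits
        uint8_t  dirty[SUBBLOCKS];   // per-subblock dirty bits
    } SectoredTagEntry;

    // A write that covers subblock s entirely can simply validate it,
    // without fetching the rest of the block from memory.
    static inline void write_fill_subblock(SectoredTagEntry *e, int s) {
        e->valid[s] = 1;
        e->dirty[s] = 1;
    }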

Instruction vs. Data Caches Multi-level Caching in a Pipelined Design


 Separate or Unified?  First-level caches (instruction and data)
 Decisions very much affected by cycle time
 Unified:  Small, lower associativity
+ Dynamic sharing of cache space: no overprovisioning that  Tag store and data store accessed in parallel
might happen with static partitioning (i.e., split I and D  Second-level caches
caches)  Decisions need to balance hit rate and access latency
-- Instructions and data can thrash each other (i.e., no  Usually large and highly associative; latency not as important
guaranteed space for either)
 Tag store and data store accessed serially
-- I and D are accessed in different places in the pipeline. Where
do we place the unified cache for fast access?
 Serial vs. Parallel access of levels
 Serial: Second level cache accessed only if first-level misses
 First level caches are almost always split
 Second level does not see the same accesses as the first
 Mainly for the last reason above
 First level acts as a filter (filters some temporal and spatial locality)
 Second and higher levels are almost always unified  Management policies are therefore different
1053 1054

Cache Parameters vs. Miss/Hit Rate


 Cache size

Cache Performance  Block size

 Associativity

 Replacement policy
 Insertion/Placement policy

1056

Cache Size Block Size


 Cache size: total data (not including tag) capacity  Block size is the data that is associated with an address tag
 bigger can exploit temporal locality better  not necessarily the unit of transfer between hierarchies
Sub-blocking: A block divided into multiple pieces (each with V bit)
 not ALWAYS better 

 Can improve “write” performance


 Too large a cache adversely affects hit and miss latency
smaller is faster => bigger is slower
Too small blocks
 hit rate

 access time may degrade critical path
hit rate  don’t exploit spatial locality well
 Too small a cache  have larger tag overhead
 doesn’t exploit temporal locality well
 useful data replaced often “working set”
 Too large blocks
size
 too few total # of blocks  less
 Working set: the whole set of data temporal locality exploitation
the executing application references  waste of cache space and bandwidth/energy
block
cache size size
 Within a time interval if spatial locality is not high
1057 1058

Large Blocks: Critical-Word and Subblocking Associativity


 Large cache blocks can take a long time to fill into the cache  How many blocks can map to the same index (or set)?
 fill cache line critical word first
 restart cache access before complete fill  Larger associativity
 lower miss rate, less variation among programs
 Large cache blocks can waste bus bandwidth  diminishing returns, higher hit latency
 divide a block into subblocks hit rate
 associate separate valid bits for each subblock  Smaller associativity
 When is this useful?  lower cost
 lower hit latency
 Especially important for L1 caches

v d subblock v d subblock v d subblock tag


 Power of 2 associativity required? associativity

1059 1060

Classification of Cache Misses How to Reduce Each Miss Type


 Compulsory miss  Compulsory
 first reference to an address (block) always results in a miss  Caching cannot help
 subsequent references should hit unless the cache block is  Prefetching
displaced for the reasons below  Conflict
 More associativity
 Capacity miss  Other ways to get more associativity without making the
 cache is too small to hold everything needed cache associative
 defined as the misses that would occur even in a fully-  Victim cache
associative cache (with optimal replacement) of the same  Hashing
capacity  Software hints?
 Conflict miss  Capacity
 defined as any miss that is neither a compulsory nor a capacity  Utilize cache space better: keep blocks that will be referenced
miss  Software management: divide working set such that each
“phase” fits in cache
1061 1062

Improving Cache “Performance” Improving Basic Cache Performance


 Reducing miss rate
 Remember
 More associativity
 Average memory access time (AMAT)
= ( hit-rate * hit-latency ) + ( miss-rate * miss-latency )
 Alternatives/enhancements to associativity
 Victim caches, hashing, pseudo-associativity, skewed associativity
 Better replacement/insertion policies
 Reducing miss rate
 Software approaches
 Caveat: reducing miss rate can reduce performance if more
costly-to-refetch blocks are evicted  Reducing miss latency/cost
 Multi-level caches
 Reducing miss latency/cost  Critical word first
 Subblocking/sectoring
 Better replacement/insertion policies
 Reducing hit latency/cost
 Non-blocking caches (multiple cache misses in parallel)
 Multiple accesses per cycle
 Software approaches
1063 1064

Agenda for the Rest of 447


 The memory hierarchy
18-447  Caches, caches, more caches
Computer Architecture  Virtualizing the memory hierarchy
Main memory: DRAM
Lecture 19: High-Performance Caches 

 Main memory control, scheduling


 Memory latency tolerance techniques
 Non-volatile memory

Prof. Onur Mutlu  Multiprocessors


Carnegie Mellon University  Coherence and consistency
Spring 2015, 3/2/2015  Interconnection networks
 Multi-core issues
1066

Readings for Today and Next Lecture How to Improve Cache Performance
 Memory Hierarchy and Caches  Three fundamental goals

Required  Reducing miss rate


 Cache chapters from P&H: 5.1-5.3  Caveat: reducing miss rate can reduce performance if more
 Memory/cache chapters from Hamacher+: 8.1-8.7
costly-to-refetch blocks are evicted

Required + Review:  Reducing miss latency or miss cost


 Wilkes, “Slave Memories and Dynamic Storage Allocation,” IEEE
Trans. On Electronic Computers, 1965.
 Qureshi et al., “A Case for MLP-Aware Cache Replacement,“  Reducing hit latency or hit cost
ISCA 2006.

1067 1068

Improving Basic Cache Performance Cheap Ways of Reducing Conflict Misses


 Reducing miss rate
 Instead of building highly-associative caches:
 More associativity
 Alternatives/enhancements to associativity
 Victim caches, hashing, pseudo-associativity, skewed associativity
 Victim Caches
 Better replacement/insertion policies  Hashed/randomized Index Functions
 Software approaches  Pseudo Associativity
 Reducing miss latency/cost  Skewed Associative Caches
 Multi-level caches  …
 Critical word first
 Subblocking/sectoring
 Better replacement/insertion policies
 Non-blocking caches (multiple cache misses in parallel)
 Multiple accesses per cycle
 Software approaches
1069 1070

Victim Cache: Reducing Conflict Misses Hashing and Pseudo-Associativity


 Hashing: Use better “randomizing” index functions
Victim
cache
+ can reduce conflict misses
Direct
Mapped Next Level  by distributing the accessed memory blocks more evenly to sets
Cache
Cache  Example of conflicting accesses: strided access pattern where
stride value equals number of sets in cache
-- More complex to implement: can lengthen critical path

 Jouppi, “Improving Direct-Mapped Cache Performance by the Addition of a Small


Fully-Associative Cache and Prefetch Buffers,” ISCA 1990.  Pseudo-associativity (Poor Man’s associative cache)
 Idea: Use a small fully associative buffer (victim cache) to  Serial lookup: On a miss, use a different index function and
store evicted blocks access cache again
+ Can avoid ping ponging of cache blocks mapped to the same set (if two  Given a direct-mapped array with K cache blocks
cache blocks continuously accessed in nearby time conflict with each  Implement K/N sets
other)
 Given address Addr, sequentially look up: {0,Addr[lg(K/N)-1: 0]},
-- Increases miss latency if accessed serially with L2; adds complexity {1,Addr[lg(K/N)-1: 0]}, … , {N-1,Addr[lg(K/N)-1: 0]}
1071 1072
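The kind of "randomizing" index function meant above can be illustrated with XOR-based hashing, which is only one possible choice, not the specific function from the lecture; the block and index widths below are assumptions.

    #include <stdint.h>

    #define BLOCK_BITS 6    // 64-byte blocks (assumed)
    #define INDEX_BITS 7    // 128 sets (assumed)

    static inline uint32_t conventional_index(uint32_t addr) {
        return (addr >> BLOCK_BITS) & ((1u << INDEX_BITS) - 1);
    }

    static inline uint32_t hashed_index(uint32_t addr) {
        uint32_t index = conventional_index(addr);
        uint32_t tag   = addr >> (BLOCK_BITS + INDEX_BITS);
        // XOR-fold low tag bits into the index so that a strided access
        // pattern whose stride equals the number of sets no longer maps
        // every reference to a single set.
        return index ^ (tag & ((1u << INDEX_BITS) - 1));
    }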

Skewed Associative Caches Skewed Associative Caches (I)


 Idea: Reduce conflict misses by using different index  Basic 2-way associative cache structure
functions for each cache way
Way 0 Way 1
Same index function
 Seznec, “A Case for Two-Way Skewed-Associative Caches,” for each way
ISCA 1993.

=? =?

Tag Index Byte in Block

1073 1074

Skewed Associative Caches (II) Skewed Associative Caches (III)


 Skewed associative caches  Idea: Reduce conflict misses by using different index
 Each bank has a different index function functions for each cache way
same index
redistributed to same index
Way 0 different sets same set Way 1  Benefit: indices are more randomized (memory blocks are
better distributed across sets)
f0  Less likely two blocks have same index
 Reduced conflict misses

 Cost: additional latency of hash function

=? tag index byte in block =?  Seznec, “A Case for Two-Way Skewed-Associative Caches,” ISCA 1993.

1075 1076

Software Approaches for Higher Hit Rate Restructuring Data Access Patterns (I)
 Restructuring data access patterns  Idea: Restructure data layout or data access patterns
 Restructuring data layout  Example: If column-major
 x[i+1,j] follows x[i,j] in memory
 Loop interchange  x[i,j+1] is far away from x[i,j]
 Data structure separation/merging
Poor code Better code
 Blocking
for i = 1, rows for j = 1, columns
 … for j = 1, columns for i = 1, rows
sum = sum + x[i,j] sum = sum + x[i,j]

 This is called loop interchange


 Other optimizations can also increase hit rate
 Loop fusion, array merging, …
 What if multiple arrays? Unknown array size at compile time?
1077 1078
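The same loop interchange written in C. Since C is row-major (unlike the column-major example above), x[i][j+1] is adjacent to x[i][j], so the inner loop should walk j. Array sizes are illustrative.

    #define ROWS 1024
    #define COLS 1024
    double x[ROWS][COLS];

    double sum_poor(void) {               // strided accesses: poor spatial locality
        double sum = 0.0;
        for (int j = 0; j < COLS; j++)
            for (int i = 0; i < ROWS; i++)
                sum += x[i][j];
        return sum;
    }

    double sum_better(void) {             // sequential accesses: good spatial locality
        double sum = 0.0;
        for (int i = 0; i < ROWS; i++)
            for (int j = 0; j < COLS; j++)
                sum += x[i][j];
        return sum;
    }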

Restructuring Data Access Patterns (II) Restructuring Data Layout (I)


 Blocking
 Divide loops operating on arrays into computation chunks so
that each chunk can hold its data in the cache
 Avoids cache conflicts between different chunks of computation
 Essentially: Divide the working set so that each piece fits in the cache

 But, there are still self-conflicts in a block
1. there can be conflicts among different arrays
2. array sizes may be unknown at compile/programming time

 Pointer based traversal (e.g., of a linked list)
 Assume a huge linked list (1M nodes) and unique keys
 Why does the code below have poor cache hit rate?
 "Other fields" occupy most of the cache line even though rarely accessed!

struct Node {
  struct Node* next;
  int key;
  char name[256];
  char school[256];
};

while (node) {
  if (node->key == input_key) {
    // access other fields of node
  }
  node = node->next;
}

1079 1080
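As one standard illustration of blocking (not taken from the slides), matrix multiply can be tiled so that each B x B chunk of the matrices is reused while it still fits in the cache. N and B are assumed values, and the result matrix is assumed to be zero-initialized.

    #define N 512
    #define B 32     // tile size, chosen so the working tiles fit in the cache

    void matmul_blocked(double C[N][N], double A[N][N], double Bm[N][N]) {
        // Loop over B x B tiles; each inner triple works on one chunk,
        // so the same tile data is reused before it is evicted.
        for (int ii = 0; ii < N; ii += B)
            for (int jj = 0; jj < N; jj += B)
                for (int kk = 0; kk < N; kk += B)
                    for (int i = ii; i < ii + B; i++)
                        for (int j = jj; j < jj + B; j++) {
                            double sum = C[i][j];
                            for (int k = kk; k < kk + B; k++)
                                sum += A[i][k] * Bm[k][j];
                            C[i][j] = sum;
                        }
    }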

Restructuring Data Layout (II) Improving Basic Cache Performance


 Idea: separate frequently-used fields of a data structure and
pack them into a separate data structure

struct Node {
  struct Node* next;
  int key;
  struct Node_data* node_data;
};

struct Node_data {
  char name[256];
  char school[256];
};

while (node) {
  if (node->key == input_key) {
    // access node->node_data
  }
  node = node->next;
}

 Who should do this?
 Programmer
 Compiler
 Profiling vs. dynamic
 Hardware?
 Who can determine what is frequently used?

 Reducing miss rate
 More associativity
 Alternatives/enhancements to associativity
 Victim caches, hashing, pseudo-associativity, skewed associativity
 Better replacement/insertion policies
 Software approaches
 Reducing miss latency/cost
 Multi-level caches
 Critical word first
 Subblocking/sectoring
 Better replacement/insertion policies
 Non-blocking caches (multiple cache misses in parallel)
 Multiple accesses per cycle
 Software approaches
1081 1082

Miss Latency/Cost Memory Level Parallelism (MLP)


 What is miss latency or miss cost affected by?
 Where does the miss get serviced from? isolated miss parallel miss
 Local vs. remote memory B
A
 What level of cache in the hierarchy? C
 Row hit versus row miss time
 Queueing delays in the memory controller and the interconnect
 …
 How much does the miss stall the processor?
 Memory Level Parallelism (MLP) means generating and
 Is it overlapped with other latencies?
servicing multiple memory accesses in parallel [Glew’98]
 Is the data immediately needed?  Several techniques to improve MLP (e.g., out-of-order execution)
 …
 MLP varies. Some misses are isolated and some parallel
How does this affect cache replacement?

1083 1084

Traditional Cache Replacement Policies An Example


 Traditional cache replacement policies try to reduce miss
count

P4 P3 P2 P1 P1 P2 P3 P4 S1 S2 S3
 Implicit assumption: Reducing miss count reduces memory-
related stall time
Misses to blocks P1, P2, P3, P4 can be parallel
Misses to blocks S1, S2, and S3 are isolated
 Misses with varying cost/MLP breaks this assumption!
Two replacement algorithms:
1. Minimizes miss count (Belady’s OPT)
 Eliminating an isolated miss helps performance more than 2. Reduces isolated miss (MLP-Aware)
eliminating a parallel miss
 Eliminating a higher-latency miss could help performance For a fully associative cache containing 4 blocks

more than eliminating a lower-latency miss

1085 1086

Fewest Misses = Best Performance MLP-Aware Cache Replacement


 How do we incorporate MLP into replacement decisions?
P4 P3
S1Cache
P2
S2 S3 P1
P4 P3
S1 P2
S2 P1
S3 P4P4P3S1P2
P4S2P1
P3S3P4
P2 P3
S1 P2P4S2P3 P2 S3

S2 S3
 Qureshi et al., “A Case for MLP-Aware Cache Replacement,”
P4 P3 P2 P1 P1 P2 P3 P4 S1
ISCA 2006.
 Required reading for this week
Hit/Miss H H H M H H H H M M M
Misses=4
Time stall
Stalls=4
Belady’s OPT replacement

Hit/Miss H M M M H M M M H H H
Time Saved
stall Misses=6
cycles
Stalls=2
MLP-Aware replacement

1087 1088

Handling Multiple Outstanding Accesses


 Question: If the processor can generate multiple cache
accesses, can the later accesses be handled while a
Enabling Multiple Outstanding Misses previous miss is outstanding?

 Goal: Enable cache access when there is a pending miss

 Goal: Enable multiple misses in parallel


 Memory-level parallelism (MLP)

 Solution: Non-blocking or lockup-free caches


 Kroft, “Lockup-Free Instruction Fetch/Prefetch Cache
Organization," ISCA 1981.

1090

Handling Multiple Outstanding Accesses Miss Status Handling Register


 Idea: Keep track of the status/data of misses that are being  Also called “miss buffer”
handled in Miss Status Handling Registers (MSHRs)  Keeps track of
 Outstanding cache misses
 A cache access checks MSHRs to see if a miss to the same  Pending load/store accesses that refer to the missing cache
block is already pending. block
If pending, a new request is not generated

 Fields of a single MSHR entry
 If pending and the needed data available, data forwarded to later
load
 Valid bit
 Cache block address (to match incoming accesses)
 Requires buffering of outstanding miss requests  Control/status bits (prefetch, issued to memory, which
subblocks have arrived, etc)
 Data for each subblock
 For each pending load/store
 Valid, type, data size, byte in block, destination register or store
buffer entry address
1091 1092

Miss Status Handling Register Entry MSHR Operation


 On a cache miss:
 Search MSHRs for a pending access to the same block
 Found: Allocate a load/store entry in the same MSHR entry
 Not found: Allocate a new MSHR
 No free entry: stall

 When a subblock returns from the next level in memory


 Check which loads/stores waiting for it
 Forward data to the load/store unit
 Deallocate load/store entry in the MSHR entry
 Write subblock in cache or MSHR
 If last subblock, deallocate MSHR (after writing the block in
cache)

1093 1094
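A sketch of the MSHR fields and the lookup/allocate step described on the preceding slides. The entry counts, field widths, and function name are illustrative assumptions, not from any particular design.

    #include <stdint.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define NUM_MSHRS 8
    #define MAX_LDST  4
    #define SUBBLOCKS 4

    typedef struct {
        bool    valid;
        uint8_t type;             // load or store
        uint8_t size, offset;     // data size, byte in block
        uint8_t dest;             // destination register / store buffer entry
    } PendingLdSt;

    typedef struct {
        bool        valid;
        uint64_t    block_addr;                  // to match incoming accesses
        bool        issued_to_memory;
        bool        subblock_arrived[SUBBLOCKS];
        PendingLdSt ldst[MAX_LDST];              // pending loads/stores to this block
    } MSHR;

    static MSHR mshrs[NUM_MSHRS];

    // On a cache miss: return the matching MSHR if the block is already
    // pending, otherwise allocate a new one; NULL means no free entry (stall).
    MSHR *mshr_lookup_or_alloc(uint64_t block_addr) {
        MSHR *free_entry = NULL;
        for (int i = 0; i < NUM_MSHRS; i++) {
            if (mshrs[i].valid && mshrs[i].block_addr == block_addr)
                return &mshrs[i];                // miss already pending: merge into it
            if (!mshrs[i].valid && !free_entry)
                free_entry = &mshrs[i];
        }
        if (free_entry) {                        // allocate a new MSHR
            free_entry->valid = true;
            free_entry->block_addr = block_addr;
            free_entry->issued_to_memory = false;
        }
        return free_entry;
    }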

Non-Blocking Cache Implementation


 When to access the MSHRs?
 In parallel with the cache?
 After cache access is complete? Enabling High Bandwidth Memories
 MSHRs need not be on the critical path of hit requests
 Which one below is the common case?
 Cache miss, MSHR hit
 Cache hit

1095

Multiple Instructions per Cycle Handling Multiple Accesses per Cycle (I)
 Can generate multiple cache/memory accesses per cycle  True multiporting
 How do we ensure the cache/memory can handle multiple  Each memory cell has multiple read or write ports

accesses in the same clock cycle? + Truly concurrent accesses (no conflicts on read accesses)
-- Expensive in terms of latency, power, area
 Solutions:  What about read and write to the same location at the same

time?
 true multi-porting
 Peripheral logic needs to handle this
 virtual multi-porting (time sharing a port)

 multiple cache copies

 banking (interleaving)

1097 1098

Peripheral Logic for True Multiporting Peripheral Logic for True Multiporting

1099 1100

Handling Multiple Accesses per Cycle (II) Handling Multiple Accesses per Cycle (III)
 Virtual multiporting  Multiple cache copies
 Time-share a single port  Stores update both caches
 Each access needs to be (significantly) shorter than clock cycle  Loads proceed in parallel
 Used in Alpha 21264 Port 1
 Is this scalable?  Used in Alpha 21164 Load Port 1
Cache
Copy 1 Data
 Scalability?
 Store operations form a
Store
bottleneck
 Area proportional to “ports” Port 2
Cache
Port 2 Copy 2 Data

Load

1101 1102

Handling Multiple Accesses per Cycle (III) General Principle: Interleaving


 Banking (Interleaving)  Interleaving (banking)
 Bits in address determines which bank an address maps to  Problem: a single monolithic memory array takes long to
 Address space partitioned into separate banks access and does not enable multiple accesses in parallel
 Which bits to use for “bank address”?
+ No increase in data store area  Goal: Reduce the latency of memory array access and enable
-- Cannot satisfy multiple accesses Bank 0: multiple accesses in parallel
to the same bank Even
addresses
-- Crossbar interconnect in input/output  Idea: Divide the array into multiple banks that can be
accessed independently (in the same cycle or in consecutive
cycles)
 Bank conflicts
 Each bank is smaller than the entire memory storage
 Two accesses are to the same bank Bank 1:
 Accesses to different banks can be overlapped
Odd
 How can these be reduced? addresses
 Hardware? Software?  A Key Issue: How do you map data to different banks? (i.e.,
how do you interleave data across banks?)
1103 1104
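A minimal sketch of bank interleaving: low-order block-address bits select the bank, so consecutive blocks go to different banks and can be accessed in parallel. The bank count and block size below are assumptions.

    #include <stdint.h>

    #define BLOCK_BITS 6      // 64-byte blocks (assumed)
    #define NUM_BANKS  8      // assumed to be a power of two

    static inline unsigned bank_of(uint64_t addr) {
        return (addr >> BLOCK_BITS) & (NUM_BANKS - 1);
    }

    // Two accesses conflict if they map to the same bank and thus
    // cannot be serviced in the same cycle.
    static inline int bank_conflict(uint64_t a, uint64_t b) {
        return bank_of(a) == bank_of(b);
    }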

Further Readings on Caching and MLP


 Required: Qureshi et al., “A Case for MLP-Aware Cache
Replacement,” ISCA 2006.
Multi-Core Issues in Caching
 Glew, “MLP Yes! ILP No!,” ASPLOS Wild and Crazy Ideas
Session, 1998.

 Mutlu et al., “Runahead Execution: An Effective Alternative


to Large Instruction Windows,” IEEE Micro 2003.

1105

Caches in Multi-Core Systems Private vs. Shared Caches


 Cache efficiency becomes even more important in a multi-  Private cache: Cache belongs to one core (a shared block can be in
core/multi-threaded system multiple caches)
 Memory bandwidth is at premium  Shared cache: Cache is shared by multiple cores
 Cache space is a limited resource

 How do we design the caches in a multi-core system? CORE 0 CORE 1 CORE 2 CORE 3 CORE 0 CORE 1 CORE 2 CORE 3

 Many decisions L2 L2 L2 L2
L2
 Shared vs. private caches CACHE CACHE CACHE CACHE
CACHE

 How to maximize performance of the entire system?


 How to provide QoS to different threads in a shared cache? DRAM MEMORY CONTROLLER DRAM MEMORY CONTROLLER
 Should cache management algorithms be aware of threads?
 How should space be allocated to threads in a shared cache?
1107 1108

Resource Sharing Concept and Advantages Resource Sharing Disadvantages


 Idea: Instead of dedicating a hardware resource to a  Resource sharing results in contention for resources
hardware context, allow multiple contexts to use it  When the resource is not idle, another thread cannot use it
 Example resources: functional units, pipeline, caches, buses,  If space is occupied by one thread, another thread needs to re-
memory occupy it
 Why?
- Sometimes reduces each or some thread’s performance
+ Resource sharing improves utilization/efficiency  throughput - Thread performance can be worse than when it is run alone
 When a resource is left idle by one thread, another thread can - Eliminates performance isolation  inconsistent performance
use it; no need to replicate shared data across runs
+ Reduces communication latency - Thread performance depends on co-executing threads
 For example, shared data kept in the same cache in - Uncontrolled (free-for-all) sharing degrades QoS
multithreaded processors - Causes unfairness, starvation
+ Compatible with the shared memory model
Need to efficiently and fairly utilize shared resources
1109 1110

Private vs. Shared Caches Shared Caches Between Cores


 Private cache: Cache belongs to one core (a shared block can be in  Advantages:
multiple caches)  High effective capacity
 Shared cache: Cache is shared by multiple cores  Dynamic partitioning of available cache space
 No fragmentation due to static partitioning
 Easier to maintain coherence (a cache block is in a single location)
 Shared data and locks do not ping pong between caches
CORE 0 CORE 1 CORE 2 CORE 3 CORE 0 CORE 1 CORE 2 CORE 3

 Disadvantages
 Slower access
L2 L2 L2 L2
CACHE CACHE CACHE CACHE L2  Cores incur conflict misses due to other cores’ accesses
CACHE
 Misses due to inter-core interference
 Some cores can destroy the hit rate of other cores
DRAM MEMORY CONTROLLER
 Guaranteeing a minimum level of service (or fairness) to each core is harder
DRAM MEMORY CONTROLLER
(how much space, how much bandwidth?)

1111 1112

Shared Caches: How to Share? Example: Utility Based Shared Cache Partitioning
 Free-for-all sharing  Goal: Maximize system throughput
 Placement/replacement policies are the same as a single core  Observation: Not all threads/applications benefit equally from
system (usually LRU or pseudo-LRU) caching  simple LRU replacement not good for system
throughput
 Not thread/application aware
 Idea: Allocate more cache space to applications that obtain the
 An incoming block evicts a block regardless of which threads
most benefit from more space
the blocks belong to

 The high-level idea can be applied to other shared resources as


 Problems well.
 Inefficient utilization of cache: LRU is not the best policy
 A cache-unfriendly application can destroy the performance of
 Qureshi and Patt, “Utility-Based Cache Partitioning: A Low-
a cache friendly application
Overhead, High-Performance, Runtime Mechanism to Partition
 Not all applications benefit equally from the same amount of Shared Caches,” MICRO 2006.
cache: free-for-all might prioritize those that do not benefit
 Suh et al., “A New Memory Monitoring Scheme for Memory-
 Reduced performance, reduced fairness Aware Scheduling and Partitioning,” HPCA 2002.
1113 1114

The Multi-Core System: A Shared Resource View Need for QoS and Shared Resource Mgmt.
 Why is unpredictable performance (or lack of QoS) bad?

 Makes programmer’s life difficult


 An optimized program can get low performance (and
performance varies widely depending on co-runners)

 Causes discomfort to user


 An important program can starve
Shared
Storage  Examples from shared software resources

 Makes system management difficult


 How do we enforce a Service Level Agreement when
sharing of hardware resources is uncontrollable?
1115 1116

Resource Sharing vs. Partitioning Shared Hardware Resources


 Memory subsystem (in both multithreaded and multi-core
 Sharing improves throughput
systems)
 Better utilization of space
 Non-private caches
 Interconnects
 Partitioning provides performance isolation (predictable  Memory controllers, buses, banks
performance)
 Dedicated space
 I/O subsystem (in both multithreaded and multi-core
systems)
 Can we get the benefits of both?  I/O, DMA controllers
 Ethernet controllers
 Idea: Design shared resources such that they are efficiently
utilized, controllable and partitionable
 Processor (in multithreaded systems)
 No wasted resource + QoS mechanisms for threads
 Pipeline resources
 L1 caches
1117 1118

Agenda for the Rest of 447


 The memory hierarchy
18-447  Caches, caches, more caches
Virtualizing the memory hierarchy: Virtual Memory
Computer Architecture 

 Main memory: DRAM


Lecture 20: Virtual Memory  Main memory control, scheduling
 Memory latency tolerance techniques
 Non-volatile memory

Prof. Onur Mutlu  Multiprocessors


Carnegie Mellon University  Coherence and consistency
Spring 2015, 3/4/2015  Interconnection networks
 Multi-core issues
1120

Readings Memory (Programmer’s View)


 Section 5.4 in P&H
 Optional: Section 8.8 in Hamacher et al.
 Your 213 textbook for brush-up

1121 1122

Ideal Memory Abstraction: Virtual vs. Physical Memory


 Zero access time (latency)  Programmer sees virtual memory
 Infinite capacity  Can assume the memory is “infinite”
 Zero cost  Reality: Physical memory size is much smaller than what
the programmer assumes
 Infinite bandwidth (to support multiple accesses in parallel)
 The system (system software + hardware, cooperatively)
maps virtual memory addresses to physical memory addresses
 The system automatically manages the physical memory
space transparently to the programmer

+ Programmer does not need to know the physical size of memory


nor manage it  A small physical memory can appear as a huge
one to the programmer  Life is easier for the programmer
-- More complex system software and architecture

A classic example of the programmer/(micro)architect tradeoff


1123 1124

Benefits of Automatic Management of Memory A System with Physical Memory Only

 Programmer does not deal with physical addresses
 Each process has its own mapping from virtual to physical addresses
 Enables
 Code and data to be located anywhere in physical memory (relocation)
 Isolation/separation of code and data of different processes in physical memory (protection and isolation)
 Code and data sharing between multiple processes (sharing)

 Examples:
 most Cray machines
 early PCs
 nearly all embedded systems
[Figure: the CPU’s load or store addresses (0 to N-1) are used directly to access physical memory]

1125 1126

The Problem Difficulties of Direct Physical Addressing


 Physical memory is of limited size (cost)  Programmer needs to manage physical memory space
 What if you need more?  Inconvenient & hard
 Should the programmer be concerned about the size of  Harder when you have multiple processes
code/data blocks fitting physical memory?
 Should the programmer manage data movement from disk to  Difficult to support code and data relocation
physical memory?
 Should the programmer ensure two processes do not use the
same physical memory?  Difficult to support multiple processes
 Protection and isolation between multiple processes
 Sharing of physical memory space
 Also, ISA can have an address space greater than the
physical memory size
 E.g., a 64-bit address space with byte addressability  Difficult to support data/code sharing across processes
 What if you do not have enough physical memory?

1127 1128

Virtual Memory Basic Mechanism


 Idea: Give the programmer the illusion of a large address  Indirection (in addressing)
space while having a small physical memory
 So that the programmer does not worry about managing  Address generated by each instruction in a program is a
physical memory “virtual address”
 i.e., it is not the physical address used to address main
 Programmer can assume he/she has “infinite” amount of memory
physical memory  called “linear address” in x86

 Hardware and software cooperatively and automatically  An “address translation” mechanism maps this address to a
manage the physical memory space to provide the illusion “physical address”
 Illusion is maintained for each independent process  called “real address” in x86
 Address translation mechanism can be implemented in
hardware and software together

1129 1130

A System with Virtual Memory (Page based) Virtual Pages, Physical Frames
Memory
 Virtual address space divided into pages
 Physical address space divided into frames
0:
Page Table 1:
Virtual Physical
Addresses 0: Addresses  A virtual page is mapped to
1:  A physical frame, if the page is in physical memory
CPU  A location in disk, otherwise

 If an accessed virtual page is not in memory, but on disk


P-1:
N-1:  Virtual memory system brings the page into a physical frame
and adjusts the mapping  this is called demand paging
Disk

 Address Translation: The hardware converts virtual addresses into  Page table is the table that stores the mapping of virtual
physical addresses via an OS-managed lookup table (page table) pages to physical frames
1131 1132

Physical Memory as a Cache Supporting Virtual Memory


 In other words…  Virtual memory requires both HW+SW support
 Page Table is in memory
Can be cached in special hardware structures called Translation
 Physical memory is a cache for pages stored on disk 

Lookaside Buffers (TLBs)


 In fact, it is a fully associative cache in modern systems (a
virtual page can be mapped to any physical frame)
 The hardware component is called the MMU (memory
management unit)
 Similar caching issues exist as we have covered earlier:
 Includes Page Table Base Register(s), TLBs, page walkers
 Placement: where and how to place/find a page in cache?
 Replacement: what page to remove to make room in cache?
 It is the job of the software to leverage the MMU to
 Granularity of management: large, small, uniform pages?
 Populate page tables, decide what to replace in physical memory
 Write policy: what do we do about writes? Write back?
 Change the Page Table Register on context switch (to use the
running thread’s page table)
 Handle page faults and ensure correct mapping
1133 1134

Some System Software Jobs for VM Page Fault (“A Miss in Physical Memory”)
 Keeping track of which physical frames are free  If a page is not in physical memory but disk
 Page table entry indicates virtual page not in memory
 Allocating free physical frames to virtual pages  Access to such a page triggers a page fault exception
 OS trap handler invoked to move data from disk into memory
 Other processes can continue executing
 Page replacement policy
 OS has full control over placement
 When no physical frame is free, what should be swapped out?
Before fault After fault
Memory
Memory
 Sharing pages between processes Page Table
Virtual Page Table
Physical
Addresses Addresses Virtual Physical
Addresses Addresses
 Copy-on-write optimization CPU
CPU

 Page-flip optimization
Disk
Disk
1135

Servicing a Page Fault Page Table is Per Process


 Each process has its own virtual address space
 (1) Processor signals controller (1) Initiate Block Read
 Full address space for each program
 Read block of length P starting Processor  Simplifies memory allocation, sharing, linking and loading.
at disk address X and store Reg (3) Read
starting at memory address Y Done 0
Virtual 0 Physical Address
Cache Address Space (DRAM)
Address VP 1 PP 2
 (2) Read occurs Space for VP 2 Translation
...
 Direct Memory Access (DMA) Memory-I/O bus
Process 1: N-1
(e.g., read/only
 Under control of I/O controller (2) DMA PP 7 library code)
Transfer I/O Virtual 0
controller Address VP 1
Memory PP 10
 (3) Controller signals completion Space for
VP 2
...
 Interrupt processor Process 2: N-1 M-1

OS resumes suspended process Disk Disk


1137 1138

Address Translation Address Translation (II)


 How to obtain the physical address from a virtual address?

 Page size specified by the ISA


 VAX: 512 bytes
 Today: 4KB, 8KB, 2GB, … (small and large pages mixed
together)
 Trade-offs? (remember cache lectures)

 Page Table contains an entry for each virtual page


 Called Page Table Entry (PTE)
 What is in a PTE?

1139 1140

Address Translation (III) Address Translation (IV)


 Parameters  Separate (set of) page table(s) per process
 P= 2p = page size (bytes).  VPN forms index into page table (points to a page table entry)
 Page Table Entry (PTE) provides information about page
 N = 2n = Virtual-address limit
 M = 2m = Physical-address limit virtual address 0
page table
n–1 p p–1
n–1 p p–1 0 base register
virtual address (per process) virtual page number (VPN) page offset
virtual page number page offset
valid access physical frame number (PFN)

VPN acts as
address translation
table index

m–1 p p–1 0 if valid=0


physical frame number page offset physical address then page m–1 p p–1 0
not in memory physical frame number (PFN) page offset
(page fault)
Page offset bits don’t change as a result of translation physical address
1141 1142
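A minimal sketch of the translation arithmetic on the slides above, assuming 4 KB pages (p = 12) and a flat single-level page table; the PTE layout and the page-fault helper are illustrative assumptions:

    #include <stdint.h>

    #define PAGE_SHIFT 12u                      /* p: 4 KB pages, P = 2^p */
    #define PAGE_SIZE  (1u << PAGE_SHIFT)

    typedef struct { int valid; uint64_t pfn; } pte_t;   /* simplified PTE */

    extern uint64_t handle_page_fault(uint64_t va);      /* assumed OS handler */

    uint64_t translate(const pte_t *page_table, uint64_t va)
    {
        uint64_t vpn    = va >> PAGE_SHIFT;         /* VPN indexes the page table */
        uint64_t offset = va & (PAGE_SIZE - 1);     /* page offset is unchanged  */
        pte_t pte = page_table[vpn];

        if (!pte.valid)                             /* page not in physical memory */
            return handle_page_fault(va);

        return (pte.pfn << PAGE_SHIFT) | offset;    /* PFN concatenated with offset */
    }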

Address Translation: Page Hit Address Translation: Page Fault

1143 1144

What Is in a Page Table Entry (PTE)? Remember: Cache versus Page Replacement
 Page table is the “tag store” for the physical memory data store  Physical memory (DRAM) is a cache for disk
 A mapping table between virtual memory and physical memory  Usually managed by system software via the virtual memory
 PTE is the “tag store entry” for a virtual page in memory subsystem
 Need a valid bit  to indicate validity/presence in physical memory
 Need tag bits (PFN)  to support translation  Page replacement is similar to cache replacement
 Need bits to support replacement  Page table is the “tag store” for physical memory data store
 Need a dirty bit to support “write back caching”
 Need protection bits to enable access control and protection  What is the difference?
 Required speed of access to cache vs. physical memory
 Number of blocks in a cache vs. physical memory
 “Tolerable” amount of time to find a replacement candidate
(disk versus memory access latency)
 Role of hardware versus software
1145 1146

Page Replacement Algorithms CLOCK Page Replacement Algorithm


 If physical memory is full (i.e., list of free physical pages is  Keep a circular list of physical frames in memory
empty), which physical frame to replace on a page fault?  Keep a pointer (hand) to the last-examined frame in the list
 When a page is accessed, set the R bit in the PTE
 Is True LRU feasible?  When a frame needs to be replaced, replace the first frame
 4GB memory, 4KB pages, how many possibilities of ordering? that has the reference (R) bit not set, traversing the
circular list starting from the pointer (hand) clockwise
 Modern systems use approximations of LRU  During traversal, clear the R bits of examined frames
 E.g., the CLOCK algorithm  Set the hand pointer to the next frame in the list
 And, more sophisticated algorithms to take into account
“frequency” of use
 E.g., the ARC algorithm
 Megiddo and Modha, “ARC: A Self-Tuning, Low Overhead
Replacement Cache,” FAST 2003.
1147 1148
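A minimal sketch of the CLOCK traversal just described; the frame count and the array mirroring the R bits are assumptions for illustration:

    #define NUM_FRAMES 1024

    static int r_bit[NUM_FRAMES];    /* reference (R) bit per frame; set on access */
    static int hand = 0;             /* clock hand: position after the last-examined frame */

    int clock_pick_victim(void)
    {
        for (;;) {
            if (!r_bit[hand]) {                      /* R bit clear: replace this frame */
                int victim = hand;
                hand = (hand + 1) % NUM_FRAMES;      /* advance hand past the victim */
                return victim;
            }
            r_bit[hand] = 0;                         /* clear R bit: second chance */
            hand = (hand + 1) % NUM_FRAMES;
        }
    }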

Aside: Page Size Trade Offs


 What is the granularity of management of physical memory?
 Large vs. small pages
 Tradeoffs have analogies to large vs. small cache blocks Access Protection/Control
via Virtual Memory
 Many different tradeoffs with advantages and disadvantages
 Size of the Page Table (tag store)
 Reach of the Translation Lookaside Buffer (we will see this later)
 Transfer size from disk to memory (waste of bandwidth?)
 Waste of space within a page (internal fragmentation)
 Waste of space within the entire physical memory (external
fragmentation)
 Granularity of access protection
 …
1149

Page-Level Access Control (Protection) Two Functions of Virtual Memory


 Not every process is allowed to access every page
 E.g., may need supervisor level privilege to access system
pages

 Idea: Store access control information on a page basis in


the process’s page table

 Enforce access control at the same time as translation

 Virtual memory system serves two functions today


Address translation (for illusion of large physical memory)
Access control (protection)

1151 1152

VM as a Tool for Memory Access Protection Access Control Logic


 Extend Page Table Entries (PTEs) with permission bits
 Check bits on each access and during a page fault
 If violated, generate exception (Access Protection exception)

Page Tables Memory


Read? Write? Physical Addr PP 0
VP 0: Yes No PP 6
PP 2
Process i: VP 1: Yes Yes PP 4
VP 2: No No XXXXXXX PP 4
• • •
• • • PP 6
• • •
Read? Write? Physical Addr PP 8
VP 0: Yes Yes PP 6
PP 10
Process j: VP 1: Yes No PP 9
PP 12
VP 2: No No XXXXXXX
• • • •
• • • •
• • • •
1153 1154

Privilege Levels in x86 Page Level Protection in x86

1155 1156

Three Major Issues


 How large is the page table and how do we store and
access it?

Some Issues in Virtual Memory  How can we speed up translation & access control check?

 When do we do the translation in relation to cache access?

 There are many other issues we will not cover in detail


 What happens on a context switch?
 How can you handle multiple page sizes?
 …

1158

Virtual Memory Issue I Issue: Page Table Size


 How large is the page table? 64-bit

VPN PO
 Where do we store it?
 In hardware? 52-bit 12-bit
 In physical memory? (Where is the PTBR?)
 In virtual memory? (Where is the PTBR?) page concat PA
table 28-bit 40-bit
 How can we store it efficiently without requiring physical
memory that can store all page tables?
 Suppose 64-bit VA and 40-bit PA, how large is the page table?
 Idea: multi-level page tables
2^52 entries x ~4 bytes ≈ 16 x 10^15 Bytes
 Only the first-level page table has to be in physical memory
and that is for just one process!
 Remaining levels are in virtual memory (but get cached in
physical memory when accessed) and the process may not be using the entire
VM space!
1159 1160

Solution: Multi-Level Page Tables Page Table Access


Example from x86 architecture  How do we access the Page Table?

 Page Table Base Register (CR3 in x86)


 Page Table Limit Register

 If VPN is out of the bounds (exceeds PTLR) then the


process did not allocate the virtual page  access control
exception

 Page Table Base Register is part of a process’s context


 Just like PC, status registers, general purpose registers
 Needs to be loaded when the process is context-switched in

1161 1162
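To make the multi-level idea concrete, here is a sketch of a page walk with two levels (x86 uses up to four, as the following slides show); the field names and the helpers that read a physical frame or raise a fault are assumptions for illustration:

    #include <stdint.h>

    #define PAGE_SHIFT 12u
    #define LVL_BITS   10u                      /* 10 + 10 + 12 bits: 32-bit VA example */

    typedef struct { int present; uint64_t pfn; } entry_t;

    extern entry_t *phys_to_ptr(uint64_t pa);    /* assumed: access a physical frame */
    extern uint64_t page_fault(uint32_t va);     /* assumed: trap into the OS        */

    uint64_t walk(entry_t *top_level /* from the PTBR / CR3 */, uint32_t va)
    {
        uint32_t idx1   = (va >> (PAGE_SHIFT + LVL_BITS)) & ((1u << LVL_BITS) - 1);
        uint32_t idx2   = (va >> PAGE_SHIFT)              & ((1u << LVL_BITS) - 1);
        uint32_t offset =  va & ((1u << PAGE_SHIFT) - 1);

        entry_t first = top_level[idx1];                 /* only this table must be resident */
        if (!first.present) return page_fault(va);       /* second-level table not allocated */

        entry_t *second = phys_to_ptr(first.pfn << PAGE_SHIFT);
        entry_t pte = second[idx2];
        if (!pte.present) return page_fault(va);         /* page itself not in memory */

        return (pte.pfn << PAGE_SHIFT) | offset;         /* physical address */
    }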

More on x86 Page Tables (I): Small Pages More on x86 Page Tables (II): Large Pages

1163 1164

x86 Page Table Entries x86 PTE (4KB page)

1165 1166

x86 Page Directory Entry (PDE) Four-level Paging in x86

1167 1168

Four-level Paging and Extended Physical Address Space in x86 Virtual Memory Issue II
 How fast is the address translation?
 How can we make it fast?

 Idea: Use a hardware structure that caches PTEs 


Translation lookaside buffer

 What should be done on a TLB miss?


 What TLB entry to replace?
 Who handles the TLB miss? HW vs. SW?

 What should be done on a page fault?


 What virtual page to replace from physical memory?
 Who handles the page fault? HW vs. SW?
1169 1170

Speeding up Translation with a TLB Handling TLB Misses


 Essentially a cache of recent address translations  The TLB is small; it cannot hold all PTEs
 Avoids going to the page table on every reference  Some translations will inevitably miss in the TLB
 Must access memory to find the appropriate PTE
 Index = lower bits of VPN  Called walking the page directory/table
(virtual page #)  Large performance penalty
 Tag = unused bits of VPN +
process ID  Who handles TLB misses? Hardware or software?
 Data = a page-table entry
 Status = valid, dirty

The usual cache design choices


(placement, replacement policy,
multi-level, etc.) apply here too.
1171
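As a concrete sketch of the TLB organization described above (index from the lower VPN bits, tag from the remaining VPN bits plus a process ID, data holding the translation); the sizes and structure layout are illustrative assumptions:

    #include <stdbool.h>
    #include <stdint.h>

    #define PAGE_SHIFT 12u
    #define TLB_SETS   64u

    typedef struct {
        bool     valid;
        uint16_t asid;      /* process / address-space ID */
        uint64_t tag;       /* upper VPN bits */
        uint64_t pfn;       /* cached translation (the PTE's frame number) */
    } tlb_entry_t;

    static tlb_entry_t tlb[TLB_SETS];

    bool tlb_lookup(uint64_t va, uint16_t asid, uint64_t *pa)
    {
        uint64_t vpn  = va >> PAGE_SHIFT;
        unsigned set  = (unsigned)(vpn % TLB_SETS);    /* index = lower VPN bits */
        uint64_t tag  = vpn / TLB_SETS;                /* tag   = remaining bits */
        tlb_entry_t e = tlb[set];

        if (e.valid && e.asid == asid && e.tag == tag) {
            *pa = (e.pfn << PAGE_SHIFT) | (va & ((1u << PAGE_SHIFT) - 1));
            return true;                               /* TLB hit */
        }
        return false;                                  /* miss: walk the page table */
    }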

Handling TLB Misses (II) Handling TLB Misses (III)

 Approach #1. Hardware-Managed (e.g., x86)
 The hardware does the page walk
 The hardware fetches the PTE and inserts it into the TLB
 If the TLB is full, the entry replaces another entry
 Done transparently to system software

 Approach #2. Software-Managed (e.g., MIPS)
 The hardware raises an exception
 The operating system does the page walk
 The operating system fetches the PTE
 The operating system inserts/evicts entries in the TLB

 Hardware-Managed TLB
 Pro: No exception on TLB miss. Instruction just stalls
 Pro: Independent instructions may continue
 Pro: No extra instructions/data brought into caches
 Con: Page directory/table organization is etched into the system: OS has little flexibility in deciding these

 Software-Managed TLB
 Pro: The OS can define the page table organization
 Pro: More sophisticated TLB replacement policies are possible
 Con: Need to generate an exception; performance overhead due to pipeline flush, exception handler execution, extra instructions brought to caches

Virtual Memory Issue III


 When do we do the address translation?
 Before or after accessing the L1 cache?
Virtual Memory and Cache Interaction

1175

Address Translation and Caching Homonyms and Synonyms


 When do we do the address translation?  Homonym: Same VA can map to two different PAs
 Before or after accessing the L1 cache?  Why?
 VA is in different processes
 In other words, is the cache virtually addressed or
physically addressed?  Synonym: Different VAs can map to the same PA
 Virtual versus physical cache  Why?
 Different pages can share the same physical frame within or
across processes
 What are the issues with a virtually addressed cache?
 Reasons: shared libraries, shared data, copy-on-write pages
within the same process, …
 Synonym problem:
 Two different virtual addresses can map to the same physical  Do homonyms and synonyms create problems when we
address  same physical address can be present in multiple have a cache?
locations in the cache  can lead to inconsistency in data  Is the cache virtually or physically addressed?
1177 1178

Cache-VM Interaction Physical Cache

[Figure: three organizations, each sitting above the lower levels of the hierarchy: a physical cache (the CPU's VA goes through the TLB, and the PA accesses the cache), a virtual (L1) cache (the cache is accessed with the VA, translation happens afterwards), and a virtual-physical cache (cache and TLB accessed together)] 1179 1180



Virtual Cache Virtual-Physical Cache

1181 1182

Virtually-Indexed Physically-Tagged Virtually-Indexed Physically-Tagged


 If C≤(page_size  associativity), the cache index bits come only  If C>(page_size  associativity), the cache index bits include VPN
from page offset (same in VA and PA)  Synonyms can cause problems
 If both cache and TLB are on chip  The same physical address can exist in two locations
 index both arrays concurrently using VA bits  Solutions?
 check cache tag (physical) against TLB output at the end
VPN Page Offset
Index BiB

VPN Page Offset a


TLB physical
Index BiB cache

TLB physical
cache PPN = tag data

PPN = tag data TLB hit? cache hit? 1184

TLB hit? cache hit? 1183
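A small sketch of the C <= page_size x associativity condition above; the concrete numbers in the comment are an illustrative example, not a claim about any particular processor:

    #include <stdbool.h>

    /* True if the index + block-offset bits of the cache fit entirely within
       the page offset, so the virtual and physical indices are identical. */
    bool vipt_index_is_synonym_free(unsigned cache_bytes,
                                    unsigned page_bytes,
                                    unsigned associativity)
    {
        return cache_bytes <= page_bytes * associativity;
    }

    /* Example: with 4 KB pages, an 8-way 32 KB L1 satisfies the condition
       (32K <= 4K * 8), while an 8-way 64 KB L1 does not. */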



Some Solutions to the Synonym Problem An Exercise


 Limit cache size to (page size times associativity)  Problem 5 from
 get index from page offset  Past midterm exam Problem 5, Spring 2009
 http://www.ece.cmu.edu/~ece740/f11/lib/exe/fetch.php?medi
 On a write to a block, search all possible indices that can a=wiki:midterm:midterm_s09.pdf
contain the same physical block, and update/invalidate
 Used in Alpha 21264, MIPS R10K

 Restrict page placement in OS


 make sure index(VA) = index(PA)
 Called page coloring
 Used in many SPARC processors

1185 1186

An Exercise (I) An Exercise (II)

1187 1188

An Exercise (Concluded)

1189 1190

Solutions to the Exercise


 http://www.ece.cmu.edu/~ece740/f11/lib/exe/fetch.php?m
edia=wiki:midterm:midterm_s09_solution.pdf
We did not cover the following slides.
They are for your benefit.  And, more exercises are in past exams and in your
homeworks…

1192

Review: Solutions to the Synonym Problem Some Questions to Ponder


 Limit cache size to (page size times associativity)  At what cache level should we worry about the synonym
 get index from page offset and homonym problems?

 On a write to a block, search all possible indices that can  What levels of the memory hierarchy does the system
contain the same physical block, and update/invalidate software’s page mapping algorithms influence?
 Used in Alpha 21264, MIPS R10K
 What are the potential benefits and downsides of page
 Restrict page placement in OS coloring?
 make sure index(VA) = index(PA)
 Called page coloring
 Used in many SPARC processors

1193 1194

Fast Forward: Virtual Memory – DRAM Interaction


 Operating System influences where an address maps to in
DRAM
Virtual Page number (52 bits) Page offset (12 bits) VA Protection and Translation
Physical Frame number (19 bits) Page offset (12 bits) PA
without Virtual Memory
Row (14 bits) Bank (3 bits) Column (11 bits) Byte in bus (3 bits) PA

 Operating system can control which bank/channel/rank a


virtual page is mapped to.

 It can perform page coloring to minimize bank conflicts


 Or to minimize inter-application interference

1195

Aside: Protection w/o Virtual Memory Very Quick Overview: Base and Bound
 Question: Do we need virtual memory for protection? In a multi-tasking system
Each process is given a non-overlapping, contiguous physical memory region, everything
belonging to a process must fit in that region
 Answer: No When a process is swapped in, OS sets base to the start of the process’s memory region
and bound to the end of the region
HW translation and protection check (on each memory reference)
 Other ways of providing memory protection
PA = EA + base, provided (PA < bound), else violations
 Base and bound registers
 Each process sees a private and uniform address space (0 .. max)
 Segmentation

Base active process’s


 None of these are as elegant as page-based access control region
Bound
 They run into complexities as we need more protection
capabilites privileged control another process’s
registers region Bound can also be
physical mem. formulated as a range

1197 1198
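A sketch of the per-reference base-and-bound check above, with bound treated as an absolute limit as on the slide; the trap helper is an assumed OS entry point:

    #include <stdint.h>

    extern void raise_protection_violation(uint64_t ea);   /* assumed trap into the OS */

    uint64_t base_bound_translate(uint64_t ea, uint64_t base, uint64_t bound)
    {
        uint64_t pa = ea + base;            /* PA = EA + base */
        if (pa >= bound)                    /* outside the process's region */
            raise_protection_violation(ea);
        return pa;
    }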

Very Quick Overview: Base and Bound (II) Segmented Address Space
segment == a base and bound pair
 Limitations of the base and bound scheme 

 segmented addressing gives each process multiple segments


 large contiguous space is hard to come by after the system
 initially, separate code and data segments
runs for a while---free space may be fragmented
- 2 sets of base-and-bound reg’s for inst and data fetch
 how do two processes share some memory regions but not
- allowed sharing code segments
others?
 became more and more elaborate: code, data, stack, etc.

SEG # EA

segment tables
must be 1. PA
privileged data segment +,<
structures and 2. base &
table
private/unique to & okay?
each process bound
1199 1200

Segmented Address Translation Segmentation as a Way to Extend Address Space


 EA: segment number (SN) and a segment offset (SO)  How to extend an old ISA to support larger addresses for
 SN may be specified explicitly or implied (code vs. data) new applications while remaining compatible with old
 segment size limited by the range of SO
applications?
 segments can have different sizes, not all SOs are meaningful
 Segment translation and protection table
 maps SN to corresponding base and bound SN SO small EA
 separate mapping for each process
 must be a privileged structure
large
“large” base EA
SN SO
segment
table
PA,
base bound +,<
okay?
1201 1202

Issues with Segmentation Page-based Address Space


 Segmented addressing creates fragmentation problems:  In a Paged Memory System:
 a system may have plenty of unallocated memory locations  PA space is divided into fixed size “segments” (e.g., 4kbyte),
 they are useless if they do not form a contiguous region of a more commonly known as “page frames”
sufficient size  VA is interpreted as page number and page offset

 Page-based virtual memory solves these issues


 By ensuring the address space is divided into fixed size Page No. Page Offset
“pages”
page tables
 And virtual address space of each process is contiguous must be 1.
 The key is the use of indirection to give each process the privileged data
structures and 2. page PA
illusion of a contiguous address space Frame no
+
private/unique to table
each process &
okay?

1203 1204

Upcoming Seminar on Flash Memory (March 25)


 March 25, Wednesday, CIC Panther Hollow Room, 4-5pm
18-447  Yixin Luo, PhD Student, CMU
Computer Architecture  Data Retention in MLC NAND Flash Memory:
Characterization, Optimization and Recovery
Lecture 21: Main Memory
 Yu Cai, Yixin Luo, Erich F. Haratsch, Ken Mai, and Onur Mutlu,
"Data Retention in MLC NAND Flash Memory:
Characterization, Optimization and Recovery"
Proceedings of the 21st International Symposium on High-
Prof. Onur Mutlu Performance Computer Architecture (HPCA), Bay Area, CA,
Carnegie Mellon University February 2015.
[Slides (pptx) (pdf)]
Spring 2015, 3/23/2015 Best paper session.

1206

Computer Architecture Seminars Where We Are in Lecture Schedule


 Seminars relevant to many topics covered in 447  The memory hierarchy
 Caching  Caches, caches, more caches
 DRAM  Virtualizing the memory hierarchy: Virtual Memory
 Multi-core systems  Main memory: DRAM
 …  Main memory control, scheduling
 Memory latency tolerance techniques
 List of past and upcoming seminars are here:  Non-volatile memory
 https://www.ece.cmu.edu/~calcm/doku.php?id=seminars:sem
inars
 You can subscribe to receive Computer Architecture related  Multiprocessors
event announcements here:  Coherence and consistency
 https://sos.ece.cmu.edu/mailman/listinfo/calcm-list  Interconnection networks
 Multi-core issues
1207 1208

Required Reading (for the Next Few Lectures)


 Onur Mutlu, Justin Meza, and Lavanya Subramanian,
"The Main Memory System: Challenges and
Opportunities"
Main Memory Invited Article in Communications of the Korean Institute of
Information Scientists and Engineers (KIISE), 2015.

1210

Required Readings on DRAM


 DRAM Organization and Operation Basics
 Sections 1 and 2 of: Lee et al., “Tiered-Latency DRAM: A Low
Latency and Low Cost DRAM Architecture,” HPCA 2013. Why Is Memory So Important?
http://users.ece.cmu.edu/~omutlu/pub/tldram_hpca13.pdf
(Especially Today)
 Sections 1 and 2 of Kim et al., “A Case for Subarray-Level
Parallelism (SALP) in DRAM,” ISCA 2012.
http://users.ece.cmu.edu/~omutlu/pub/salp-dram_isca12.pdf

 DRAM Refresh Basics


 Sections 1 and 2 of Liu et al., “RAIDR: Retention-Aware
Intelligent DRAM Refresh,” ISCA 2012.
http://users.ece.cmu.edu/~omutlu/pub/raidr-dram-
refresh_isca12.pdf
1211

The Main Memory System Memory System: A Shared Resource View

Processor Main Memory Storage (SSD/HDD)


and caches

 Main memory is a critical component of all computing


Storage
systems: server, mobile, embedded, desktop, sensor

 Main memory system must scale (in size, technology,


efficiency, cost, and management algorithms) to maintain
performance growth and technology scaling benefits
1213 1214

State of the Main Memory System Major Trends Affecting Main Memory (I)
 Recent technology, architecture, and application trends  Need for main memory capacity, bandwidth, QoS increasing
 lead to new requirements
 exacerbate old requirements

 DRAM and memory controllers, as we know them today,


are (will be) unlikely to satisfy all requirements  Main memory energy/power is a key system design concern

 Some emerging non-volatile memory technologies (e.g.,


PCM) enable new opportunities: memory+storage merging
 DRAM technology scaling is ending
 We need to rethink/reinvent the main memory system
 to fix DRAM issues and enable emerging technologies
 to satisfy all requirements
1215 1216

Demand for Memory Capacity Example: The Memory Capacity Gap


 More cores  More concurrency  Larger working set Core count doubling ~ every 2 years
DRAM DIMM capacity doubling ~ every 3 years

AMD Barcelona: 4 cores IBM Power7: 8 cores Intel SCC: 48 cores

 Modern applications are (increasingly) data-intensive

 Many applications/virtual machines (will) share main memory


 Cloud computing/servers: Consolidation to improve efficiency
 GP-GPUs: Many threads from multiple parallel applications

 Mobile: Interactive + non-interactive consolidation  Memory capacity per core expected to drop by 30% every two years
 …  Trends worse for memory bandwidth per core!
1217 1218

Major Trends Affecting Main Memory (II) Major Trends Affecting Main Memory (III)
 Need for main memory capacity, bandwidth, QoS increasing  Need for main memory capacity, bandwidth, QoS increasing
 Multi-core: increasing number of cores/agents
 Data-intensive applications: increasing demand/hunger for data
 Consolidation: Cloud computing, GPUs, mobile, heterogeneity
 Main memory energy/power is a key system design concern
 IBM servers: ~50% energy spent in off-chip memory hierarchy
 Main memory energy/power is a key system design concern [Lefurgy, IEEE Computer 2003]
 DRAM consumes power when idle and needs periodic refresh

 DRAM technology scaling is ending


 DRAM technology scaling is ending

1219 1220

Major Trends Affecting Main Memory (IV) The DRAM Scaling Problem
 Need for main memory capacity, bandwidth, QoS increasing  DRAM stores charge in a capacitor (charge-based memory)
 Capacitor must be large enough for reliable sensing
 Access transistor should be large enough for low leakage and high
retention time
 Scaling beyond 40-35nm (2013) is challenging [ITRS, 2009]

 Main memory energy/power is a key system design concern

 DRAM technology scaling is ending


 ITRS projects DRAM will not scale easily below X nm
 Scaling has provided many benefits:
 higher capacity, higher density, lower cost, lower energy  DRAM capacity, cost, and energy/power hard to scale
1221 1222

Evidence of the DRAM Scaling Problem Most DRAM Modules Are At Risk

[Figure: repeatedly driving an aggressor row's wordline between VHIGH and LOW (opening and closing the row) disturbs the adjacent victim rows]
Repeatedly opening and closing a row enough times within a refresh interval induces disturbance errors in adjacent rows in most real DRAM chips you can buy today.

Modules at risk: A company (37/43), B company (45/54), C company (28/32), with up to ~10^7, ~10^6, and ~10^5 errors per module, respectively.

Kim+, “Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors,” ISCA 2014. 1223 1224

x86 CPU DRAM Module

[Figure sequence: the loop below repeatedly reads and flushes two addresses X and Y that map to different rows of the DRAM module, so their rows keep getting opened and closed]

  loop:
    mov (X), %eax
    mov (Y), %ebx
    clflush (X)
    clflush (Y)
    mfence
    jmp loop

Observed Errors in Real Systems Errors vs. Vintage

CPU Architecture           Errors   Access Rate
Intel Haswell (2013)       22.9K    12.3M/sec
Intel Ivy Bridge (2012)    20.7K    11.7M/sec
Intel Sandy Bridge (2011)  16.1K    11.6M/sec
AMD Piledriver (2012)      59       6.1M/sec

• A real reliability & security issue
• In a more controlled environment, we can induce as many as ten million disturbance errors

[Chart: errors vs. module vintage (first appearance date)] All modules from 2012–2013 are vulnerable

Kim+, “Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors,” ISCA 2014. 1229 1230

Security Implications (I) Security Implications (II)


 “Rowhammer” is a problem with some recent DRAM devices in
which repeatedly accessing a row of memory can cause bit flips
in adjacent rows.
 We tested a selection of laptops and found that a subset of them
exhibited the problem.
 We built two working privilege escalation exploits that use this
http://users.ece.cmu.edu/~omutlu/pub/dra effect.
m-row-hammer_isca14.pdf
 One exploit uses rowhammer-induced bit flips to gain kernel
privileges on x86-64 Linux when run as an unprivileged userland
process.
 When run on a machine vulnerable to the rowhammer problem,
the process was able to induce bit flips in page table entries
http://googleprojectzero.blogspot.com/2015
/03/exploiting-dram-rowhammer-bug-to-
(PTEs).
gain.html  It was able to use this to gain write access to its own page table,
and hence gain read-write access to all of physical memory.
1231 1232

Recap: The DRAM Scaling Problem An Orthogonal Issue: Memory Interference

Core Core
Main
Memory
Core Core

Cores interfere with each other when accessing shared main memory
Uncontrolled interference leads to many problems (QoS, performance)
1233 1234

Major Trends Affecting Main Memory


 Need for main memory capacity, bandwidth, QoS increasing

How Can We Fix the Memory Problem &


Design (Memory) Systems of the Future?
 Main memory energy/power is a key system design concern

 DRAM technology scaling is ending

1235

Look Backward to Look Forward


 We first need to understand the principles of:
 Memory and DRAM
Memory controllers

 Techniques for reducing and tolerating memory latency


Main Memory
 Potential memory technologies that can compete with DRAM

 This is what we will cover in the next few lectures

1237

Main Memory in the System The Memory Chip/System Abstraction

[Figure: chip floorplan with CORE 0 through CORE 3 and their L2 caches, a SHARED L3 CACHE, and the on-chip DRAM MEMORY CONTROLLER and DRAM INTERFACE connecting to off-chip DRAM BANKS]
1239 1240

Review: Memory Bank Organization Review: SRAM (Static Random Access Memory)

 Read access sequence (memory bank):
 1. Decode row address & drive word-lines
 2. Selected bits drive bit-lines (entire row read)
 3. Amplify row data
 4. Decode column address & select subset of row; send to output
 5. Precharge bit-lines for next access

 SRAM Read Sequence:
 1. address decode
 2. drive row select
 3. selected bit-cells drive bitlines (entire row is read together)
 4. differential sensing and column select (data is ready)
 5. precharge all bitlines (for next read or write)

[Figure: 2^n-row x 2^m-column bit-cell array with sense amp and mux; the n+m address bits split into n row bits and m column bits (n roughly equal to m to minimize overall latency)]
 Access latency dominated by steps 2 and 3
 Cycling time dominated by steps 2, 3 and 5
 step 2 proportional to 2^m
 steps 3 and 5 proportional to 2^n
1241 1242

Review: DRAM (Dynamic Random Access Memory) Review: DRAM vs. SRAM

 Bits stored as charge on node capacitance (non-restorative):
 bit cell loses charge when read
 bit cell loses charge over time
 Read Sequence:
 1~3 same as SRAM
 4. a “flip-flopping” sense amp amplifies and regenerates the bitline; data bit is mux’ed out
 5. precharge all bitlines
 Refresh: A DRAM controller must periodically read all rows within the allowed refresh time (10s of ms) such that charge is restored in cells
 A DRAM die comprises multiple such 2^n-row x 2^m-column arrays (driven by RAS and CAS)

 DRAM
 Slower access (capacitor)
 Higher density (1T 1C cell)
 Lower cost
 Requires refresh (power, performance, circuitry)
 Manufacturing requires putting capacitor and logic together

 SRAM
 Faster access (no capacitor)
 Lower density (6T cell)
 Higher cost
 No need for refresh
 Manufacturing compatible with logic process (no capacitor)
1243 1244

Some Fundamental Concepts (I) Some Fundamental Concepts (II)


 Physical address space  Interleaving (banking)
 Maximum size of main memory: total number of uniquely  Problem: a single monolithic memory array takes long to
identifiable locations access and does not enable multiple accesses in parallel

 Physical addressability  Goal: Reduce the latency of memory array access and enable
 Minimum size of data in memory can be addressed multiple accesses in parallel
 Byte-addressable, word-addressable, 64-bit-addressable
 Microarchitectural addressability depends on the abstraction  Idea: Divide the array into multiple banks that can be
level of the implementation accessed independently (in the same cycle or in consecutive
cycles)
 Alignment  Each bank is smaller than the entire memory storage
 Does the hardware support unaligned access transparently to  Accesses to different banks can be overlapped
software?
 A Key Issue: How do you map data to different banks? (i.e.,
 Interleaving how do you interleave data across banks?)
1245 1246

Interleaving Interleaving Options

1247 1248

Some Questions/Concepts The Bank Abstraction


 Remember CRAY-1 with 16 banks
 11 cycle bank latency
 Consecutive words in memory in consecutive banks (word
interleaving)
 1 access can be started (and finished) per cycle

 Can banks be operated fully in parallel?


 Multiple accesses started per cycle?

 What is the cost of this?


 We have seen it earlier

 Modern superscalar processors have L1 data caches with


multiple, fully-independent banks; DRAM banks share buses
1249 1250

Rank

The DRAM Subsystem

1251

DRAM Subsystem Organization Page Mode DRAM


 A DRAM bank is a 2D array of cells: rows x columns
 Channel  A “DRAM row” is also called a “DRAM page”
 DIMM  “Sense amplifiers” also called “row buffer”
 Rank
 Chip  Each address is a <row,column> pair
 Bank  Access to a “closed row”
 Row/Column  Activate command opens row (placed into row buffer)
 Cell  Read/write command reads/writes column in the row buffer
 Precharge command closes the row and prepares the bank for
next access
 Access to an “open row”
 No need for an activate command

1253 1254
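A sketch of the command sequences implied by the open-row and closed-row cases above, for a single bank; the controller model and names are illustrative assumptions:

    #include <stdbool.h>
    #include <stdio.h>

    static bool open_valid = false;   /* does the row buffer hold a row?  */
    static int  open_row   = -1;      /* which row is currently open      */

    void access_column(int row, int col, bool is_write)
    {
        if (open_valid && open_row != row) {         /* different row: row conflict */
            printf("PRECHARGE\n");
            open_valid = false;
        }
        if (!open_valid) {                           /* closed row: open it first */
            printf("ACTIVATE row %d\n", row);
            open_valid = true;
            open_row   = row;
        }
        /* open row (row hit): read/write the column in the row buffer */
        printf("%s col %d\n", is_write ? "WRITE" : "READ", col);
    }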

The DRAM Bank Structure DRAM Bank Operation


Access Address:
(Row 0, Column 0) Columns
(Row 0, Column 1)
(Row 0, Column 85)

Row decoder
(Row 1, Column 0)

Rows
Row address 0
1

Row
Empty
Row 01 Row Buffer CONFLICT
HIT !

Column address 0
1
85 Column mux

Data

1255 1256
10-11-2023

The DRAM Chip 128M x 8-bit DRAM Chip


 Consists of multiple banks (8 is a common number today)
 Banks share command/address/data buses
 The chip itself has a narrow interface (4-16 bits per read)

 Changing the number of banks, size of the interface (pins),


whether or not command/address/data buses are shared
has significant impact on DRAM system cost

1257 1258

DRAM Rank and Module A 64-bit Wide DIMM (One Rank)


 Rank: Multiple chips operated together to form a wide
interface
 All chips comprising a rank are controlled at the same time
 Respond to a single command
 Share address and command buses, but provide different data

 A DRAM module consists of one or more ranks


 E.g., DIMM (dual inline memory module)
 This is what you plug into your motherboard

 If we have chips with 8-bit interface, to read 8 bytes in a


single access, use 8 chips in a DIMM

1259 1260

A 64-bit Wide DIMM (One Rank) Multiple DIMMs

 A 64-bit wide DIMM (one rank)
 Advantages:
 Acts like a high-capacity DRAM chip with a wide interface
 Flexibility: memory controller does not need to deal with individual chips
 Disadvantages:
 Granularity: Accesses cannot be smaller than the interface width

 Multiple DIMMs
 Advantages:
 Enables even higher capacity
 Disadvantages:
 Interconnect complexity and energy consumption can be high
 Scalability is limited by this
1261 1262

DRAM Channels Generalized Memory Structure

 2 Independent Channels: 2 Memory Controllers (Above)


 2 Dependent/Lockstep Channels: 1 Memory Controller with
wide interface (Not Shown above)
1263 1264

Generalized Memory Structure

The DRAM Subsystem


The Top Down View

1265

DRAM Subsystem Organization The DRAM subsystem

 Channel “Channel” DIMM (Dual in-line memory module)


 DIMM
 Rank
 Chip
 Bank
 Row/Column
 Cell Processor

Memory channel Memory channel

1267

Breaking down a DIMM

[Figure: side view of a DIMM (dual in-line memory module); Rank 0 is the collection of 8 chips on the front of the DIMM, Rank 1 the collection on the back]

Breaking down a Rank

[Figure: Rank 0 = Chip 0 ... Chip 7; each chip supplies 8 bits (<0:7>, <8:15>, ..., <56:63>) of the 64-bit Data <0:63> bus, and all chips share the Addr/Cmd and CS <0:1> signals of the memory channel]

Breaking down a Chip Breaking down a Bank

[Figure: a chip (Chip 0, 8-bit <0:7> interface) contains multiple banks (Bank 0, ...); a bank is a 2D array of 16K rows, each row 2kB wide and made up of 1B columns, with a row-buffer holding the currently open row and data leaving 1B at a time through the chip's <0:7> interface]
DRAM Subsystem Organization Example: Transferring a cache block

 Channel Physical memory space

 DIMM 0xFFFF…F
Channel 0
 Rank
 Chip

...
 Bank
DIMM 0
 Row/Column
 Cell 0x40
Rank 0
64B
cache block

0x00

1275

Example: Transferring a cache block

[Figure sequence: the 64B cache block at physical address 0x40 maps to Row 0 of Rank 0; on each access, every chip (Chip 0 ... Chip 7) supplies 8 bits (<0:7>, <8:15>, ..., <56:63>) of one column (Col 0, Col 1, ...), so the rank returns 8B per transfer over the Data <0:63> bus]
A 64B cache block takes 8 I/O cycles to transfer.

During the process, 8 columns are read sequentially.

Latency Components: Basic DRAM Operation Multiple Banks (Interleaving) and Channels
 CPU → controller transfer time  Multiple banks
 Controller latency  Enable concurrent DRAM accesses
 Queuing & scheduling delay at the controller  Bits in address determine which bank an address resides in
 Access converted to basic commands  Multiple independent channels serve the same purpose
 Controller → DRAM transfer time  But they are even better because they have separate data buses
 DRAM bank latency  Increased bus bandwidth
 Simple CAS (column address strobe) if row is “open” OR
 RAS (row address strobe) + CAS if array precharged OR  Enabling more concurrency requires reducing
 PRE + RAS + CAS (worst case)  Bank conflicts
 DRAM → Controller transfer time  Channel conflicts
 Bus latency (BL)  How to select/randomize bank/channel indices in address?
 Controller to CPU transfer time  Lower order bits have more entropy
 Randomizing hash functions (XOR of different address bits)
1283 1284
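A sketch of the three bank-latency cases listed above, using the conventional DRAM timing-parameter names tCL (CAS latency), tRCD (activate-to-read delay) and tRP (precharge); the cycle counts are placeholders, not values from any datasheet:

    enum row_state { ROW_HIT, ROW_CLOSED, ROW_CONFLICT };

    static const int tCL = 11, tRCD = 11, tRP = 11;   /* placeholder cycle counts */

    int bank_latency_cycles(enum row_state s)
    {
        switch (s) {
        case ROW_HIT:      return tCL;                /* CAS only                    */
        case ROW_CLOSED:   return tRCD + tCL;         /* RAS (activate) + CAS        */
        case ROW_CONFLICT: return tRP + tRCD + tCL;   /* PRE + RAS + CAS, worst case */
        }
        return 0;
    }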

How Multiple Banks/Channels Help Multiple Channels


 Advantages
 Increased bandwidth
 Multiple concurrent accesses (if independent channels)

 Disadvantages
 Higher cost than a single channel
 More board wires
 More pins (if on-chip memory controller)

1285 1286

Address Mapping (Single Channel) Bank Mapping Randomization

 Single-channel system with 8-byte memory bus
 2GB memory, 8 banks, 16K rows & 2K columns per bank

 Row interleaving
 Consecutive rows of memory in consecutive banks
 Row (14 bits) | Bank (3 bits) | Column (11 bits) | Byte in bus (3 bits)
 Accesses to consecutive cache blocks serviced in a pipelined manner

 Cache block interleaving
 Consecutive cache block addresses in consecutive banks
 64 byte cache blocks
 Row (14 bits) | High Column (8 bits) | Bank (3 bits) | Low Col. (3 bits) | Byte in bus (3 bits)
 Accesses to consecutive cache blocks can be serviced in parallel

 Bank Mapping Randomization
 DRAM controller can randomize the address mapping to banks so that bank conflicts are less likely
[Figure: the Bank (3 bits) field is XORed with 3 bits of the Row field to form the bank index (3 bits)]
 See the sketch below.
1287 1288
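A sketch of the two interleavings and the XOR-based randomization above, for the 2GB / 8-bank / 8-byte-bus example on the slide; which row bits feed the XOR is an assumption for illustration:

    #include <stdint.h>

    /* Row interleaving: Row(14) | Bank(3) | Column(11) | Byte-in-bus(3) */
    unsigned bank_row_interleaved(uint32_t pa)
    {
        return (pa >> (11 + 3)) & 0x7;
    }

    /* Cache block interleaving:
       Row(14) | HighCol(8) | Bank(3) | LowCol(3) | Byte-in-bus(3) */
    unsigned bank_block_interleaved(uint32_t pa)
    {
        return (pa >> (3 + 3)) & 0x7;
    }

    /* Randomized mapping: XOR the bank field with low-order row bits so that
       power-of-two strides are less likely to pile onto one bank. */
    unsigned bank_randomized(uint32_t pa)
    {
        unsigned bank     = bank_row_interleaved(pa);
        unsigned row_low3 = (pa >> (3 + 11 + 3)) & 0x7;   /* 3 low-order row bits */
        return bank ^ row_low3;
    }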

Address Mapping (Multiple Channels) Interaction with Virtual-to-Physical Mapping


C Row (14 bits) Bank (3 bits) Column (11 bits) Byte in bus (3 bits)
 Operating System influences where an address maps to in
Row (14 bits) C Bank (3 bits) Column (11 bits) Byte in bus (3 bits) DRAM
Row (14 bits) Bank (3 bits) C Column (11 bits) Byte in bus (3 bits) Virtual Page number (52 bits) Page offset (12 bits) VA

Row (14 bits) Bank (3 bits) Column (11 bits) C Byte in bus (3 bits)
Physical Frame number (19 bits) Page offset (12 bits) PA
 Where are consecutive cache blocks? Row (14 bits) Bank (3 bits) Column (11 bits) Byte in bus (3 bits) PA

C Row (14 bits) High Column Bank (3 bits) Low Col. Byte in bus (3 bits)
8 bits 3 bits  Operating system can influence which bank/channel/rank a
Row (14 bits) C High Column Bank (3 bits) Low Col. Byte in bus (3 bits) virtual page is mapped to.
It can perform page coloring to
8 bits 3 bits

Row (14 bits) High Column C Bank (3 bits) Low Col. Byte in bus (3 bits)
8 bits 3 bits  Minimize bank conflicts
Row (14 bits) High Column Bank (3 bits) C Low Col. Byte in bus (3 bits)  Minimize inter-application interference [Muralidhara+ MICRO’11]
8 bits 3 bits

Row (14 bits) High Column Bank (3 bits) Low Col. C Byte in bus (3 bits)
8 bits 3 bits

1289 1290

More on Reducing Bank Conflicts


 Read Sections 1 through 4 of:
 Kim et al., “A Case for Exploiting Subarray-Level Parallelism in 18-447
DRAM,” ISCA 2012.
Computer Architecture
Lecture 22: Memory Controllers

Prof. Onur Mutlu


Carnegie Mellon University
Spring 2015, 3/25/2015

1291

Flash Memory (SSD) Controllers Where We Are in Lecture Schedule


 Similar to DRAM memory controllers, except:  The memory hierarchy
 They are flash memory specific  Caches, caches, more caches
 They do much more: error correction, garbage collection,
 Virtualizing the memory hierarchy: Virtual Memory
page remapping, …
 Main memory: DRAM
 Main memory control, scheduling
 Memory latency tolerance techniques
 Non-volatile memory

 Multiprocessors
 Coherence and consistency
 Interconnection networks
 Multi-core issues
Cai+, “Flash Correct-and-Refresh: Retention-Aware Error Management for Increased Flash Memory 1293 1294
Lifetime”, ICCD 2012.

Required Reading (for the Next Few Lectures) Required Readings on DRAM
 Onur Mutlu, Justin Meza, and Lavanya Subramanian,  DRAM Organization and Operation Basics
"The Main Memory System: Challenges and  Sections 1 and 2 of: Lee et al., “Tiered-Latency DRAM: A Low
Opportunities" Latency and Low Cost DRAM Architecture,” HPCA 2013.
Invited Article in Communications of the Korean Institute of http://users.ece.cmu.edu/~omutlu/pub/tldram_hpca13.pdf
Information Scientists and Engineers (KIISE), 2015.
 Sections 1 and 2 of Kim et al., “A Case for Subarray-Level
http://users.ece.cmu.edu/~omutlu/pub/main-memory- Parallelism (SALP) in DRAM,” ISCA 2012.
system_kiise15.pdf http://users.ece.cmu.edu/~omutlu/pub/salp-dram_isca12.pdf

 DRAM Refresh Basics


 Sections 1 and 2 of Liu et al., “RAIDR: Retention-Aware
Intelligent DRAM Refresh,” ISCA 2012.
http://users.ece.cmu.edu/~omutlu/pub/raidr-dram-
refresh_isca12.pdf
1295 1296

DRAM versus Other Types of Memories


 Long latency memories have similar characteristics that
need to be controlled.
Memory Controllers
 The following discussion will use DRAM as an example, but
many scheduling and control issues are similar in the
design of controllers for other types of memories
 Flash memory
 Other emerging memory technologies
 Phase Change Memory
 Spin-Transfer Torque Magnetic Memory
 These other technologies can place other demands on the
controller

1298

DRAM Types DRAM Types (II)


 DRAM has different types with different interfaces optimized
for different purposes
 Commodity: DDR, DDR2, DDR3, DDR4, …
 Low power (for mobile): LPDDR1, …, LPDDR5, …
 High bandwidth (for graphics): GDDR2, …, GDDR5, …
 Low latency: eDRAM, RLDRAM, …
 3D stacked: WIO, HBM, HMC, …
 …
 Underlying microarchitecture is fundamentally the same
 A flexible memory controller can support various DRAM types
 This complicates the memory controller
 Difficult to support all types (and upgrades)
Kim et al., “Ramulator: A Fast and Extensible DRAM Simulator,” IEEE Comp Arch Letters 2015.

1299 1300

DRAM Controller: Functions DRAM Controller: Where to Place


 Ensure correct operation of DRAM (refresh and timing)  In chipset
+ More flexibility to plug different DRAM types into the system
 Service DRAM requests while obeying timing constraints of + Less power density in the CPU chip
DRAM chips
 Constraints: resource conflicts (bank, bus, channel), minimum  On CPU chip
write-to-read delays + Reduced latency for main memory access
 Translate requests to DRAM command sequences + Higher bandwidth between cores and controller
 More information can be communicated (e.g. request’s
 Buffer and schedule requests to for high performance + QoS importance in the processing core)
 Reordering, row-buffer, bank, rank, bus management

 Manage power consumption and thermals in DRAM


 Turn on/off DRAM chips, manage power modes
1301 1302

A Modern DRAM Controller (I) A Modern DRAM Controller (II)

1303 1304

DRAM Scheduling Policies (I) DRAM Scheduling Policies (II)


 FCFS (first come first served)  A scheduling policy is a request prioritization order
 Oldest request first
 Prioritization can be based on
 FR-FCFS (first ready, first come first served)  Request age
1. Row-hit first  Row buffer hit/miss status
2. Oldest first  Request type (prefetch, read, write)
Goal: Maximize row buffer hit rate  maximize DRAM throughput  Requestor type (load miss or store miss)
 Request criticality
 Actually, scheduling is done at the command level  Oldest miss in the core?
 Column commands (read/write) prioritized over row commands  How many instructions in core are dependent on it?
(activate/precharge)  Will it stall the processor?
 Within each group, older commands prioritized over younger ones  Interference caused to other cores
 …
1305 1306
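A sketch of FR-FCFS prioritization over a request buffer (row-hit first, oldest first as the tie-breaker); the data structures and the helper that reports the open row are illustrative assumptions, and a real controller applies the same ordering at the DRAM-command level, as noted above:

    #include <stdbool.h>

    typedef struct {
        unsigned long long arrival;   /* for "oldest first" */
        int bank;
        int row;
    } mem_req_t;

    extern int open_row_of_bank(int bank);   /* assumed: -1 if the bank is closed */

    int frfcfs_pick(const mem_req_t *q, int n)
    {
        int  best = -1;
        bool best_hit = false;

        for (int i = 0; i < n; i++) {
            bool hit = (q[i].row == open_row_of_bank(q[i].bank));    /* row-buffer hit? */
            if (best == -1 ||
                (hit && !best_hit) ||                                /* 1. row-hit first */
                (hit == best_hit && q[i].arrival < q[best].arrival)) /* 2. oldest first  */
            {
                best = i;
                best_hit = hit;
            }
        }
        return best;   /* index of the request to schedule next (-1 if queue is empty) */
    }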

Row Buffer Management Policies Open vs. Closed Row Policies

 Open row
 Keep the row open after an access
+ Next access might need the same row: row hit
-- Next access might need a different row: row conflict, wasted energy

 Closed row
 Close the row after an access (if no other requests already in the request buffer need the same row)
+ Next access might need a different row: avoid a row conflict
-- Next access might need the same row: extra activate latency

 Adaptive policies
 Predict whether or not the next access to the bank will be to the same row

Policy       First access   Next access                                          Commands needed for next access
Open row     Row 0          Row 0 (row hit)                                      Read
Open row     Row 0          Row 1 (row conflict)                                 Precharge + Activate Row 1 + Read
Closed row   Row 0          Row 0 – access in request buffer (row hit)           Read
Closed row   Row 0          Row 0 – access not in request buffer (row closed)    Activate Row 0 + Read + Precharge
Closed row   Row 0          Row 1 (row closed)                                   Activate Row 1 + Read + Precharge

1307 1308

Review: A Modern DRAM Controller

Memory Interference and Scheduling


in Multi-Core Systems

1310

Review: DRAM Bank Operation Scheduling Policy for Single-Core Systems


Access Address:  A row-conflict memory access takes significantly longer than a
(Row 0, Column 0) Columns
(Row 0, Column 1) row-hit access
(Row 0, Column 85)  Current controllers take advantage of the row buffer
Row decoder

(Row 1, Column 0)
Rows

Row address 0
1
 FR-FCFS (first ready, first come first served) scheduling policy
1. Row-hit first
2. Oldest first

Row
Empty
Row 01 Row Buffer CONFLICT
HIT ! Goal 1: Maximize row buffer hit rate  maximize DRAM throughput
Goal 2: Prioritize older requests  ensure forward progress
Column address 0
1
85 Column mux

Data  Is this a good policy in a multi-core system?

1311 1312

Trend: Many Cores on Chip Many Cores on Chip


 Simpler and lower power than a single large core  What we want:
 Large scale parallelism on chip  N times the system performance with N times the cores

 What do we get today?

Intel Core i7 IBM Cell BE IBM POWER7


AMD Barcelona 8 cores 8+1 cores 8 cores
4 cores

Nvidia Fermi Intel SCC Tilera TILE Gx


Sun Niagara II 448 “cores” 48 cores, networked 100 cores, networked
8 cores

1313 1314

(Un)expected Slowdowns in Multi-Core Uncontrolled Interference: An Example


High priority

CORE
stream1 random2
CORE Multi-Core
Chip

L2 L2
Low priority CACHE CACHE
unfairness
INTERCONNECT
Shared DRAM
DRAM MEMORY CONTROLLER Memory System

DRAM DRAM DRAM DRAM


(Core 0) (Core 1) Bank 0 Bank 1 Bank 2 Bank 3
Moscibroda and Mutlu, “Memory performance attacks: Denial of memory service
in multi-core systems,” USENIX Security 2007.

A Memory Performance Hog

STREAM (streaming access):
// initialize large arrays A, B
for (j = 0; j < N; j++) {
    index = j * linesize;   // sequential, streaming
    A[index] = B[index];
    ...
}
- Sequential memory access
- Very high row buffer locality (96% hit rate)
- Memory intensive

RANDOM (random access):
// initialize large arrays A, B
for (j = 0; j < N; j++) {
    index = rand();         // random
    A[index] = B[index];
    ...
}
- Random memory access
- Very low row buffer locality (3% hit rate)
- Similarly memory intensive

What Does the Memory Hog Do?
[figure: memory request buffer and row buffer; T0 (STREAM) keeps hitting Row 0 while T1 (RANDOM) requests Rows 5, 111, 16, …; with an 8KB row and 64B cache blocks, 128 (8KB/64B) row-hit requests of T0 are serviced before a single request of T1]

Moscibroda and Mutlu, “Memory Performance Attacks,” USENIX Security 2007.

Effect of the Memory Performance Hog
[bar chart: slowdowns when STREAM runs together with another application (gcc, Virtual PC) on an Intel Pentium D running Windows XP — the co-running application slows down by 2.82X while STREAM, the memory performance hog, slows down by only 1.18X]
(Similar results for Intel Core Duo and AMD Turion, and on Fedora Linux)
Moscibroda and Mutlu, “Memory Performance Attacks,” USENIX Security 2007.

Problems due to Uncontrolled Interference
• Main memory is the only shared resource
• A low-priority memory performance hog can slow down a high-priority application; cores make very slow progress
• Unfair slowdown of different threads
• Low system performance
• Vulnerability to denial of service
• Priority inversion: unable to enforce priorities/SLAs

Problems due to Uncontrolled Interference Recap: Inter-Thread Interference in Memory


 Memory controllers, pins, and memory banks are shared

 Pin bandwidth is not increasing as fast as number of cores


 Bandwidth per core reducing

 Different threads executing on different cores interfere with


each other in the main memory system

 Unfair slowdown of different threads


 Threads delay each other by causing resource contention:
 Low system performance
 Bank, bus, row-buffer conflicts  reduced DRAM throughput
 Vulnerability to denial of service
 Threads can also destroy each other’s DRAM bank
 Priority inversion: unable to enforce priorities/SLAs
parallelism
 Poor performance predictability (no performance isolation)
 Otherwise parallel requests can become serialized
Uncontrollable, unpredictable system

Effects of Inter-Thread Interference in DRAM Problem: QoS-Unaware Memory Control


 Queueing/contention delays  Existing DRAM controllers are unaware of inter-thread
 Bank conflict, bus conflict, channel conflict, … interference in DRAM system

 Additional delays due to DRAM constraints  They simply aim to maximize DRAM throughput
 Called “protocol overhead”  Thread-unaware and thread-unfair
 Examples  No intent to service each thread’s requests in parallel
 Row conflicts  FR-FCFS policy: 1) row-hit first, 2) oldest first
 Unfairly prioritizes threads with high row-buffer locality
 Read-to-write and write-to-read delays
 Unfairly prioritizes threads that are memory intensive (many outstanding
memory accesses)
 Loss of intra-thread parallelism
 A thread’s concurrent requests are serviced serially instead of
in parallel


Solution: QoS-Aware Memory Request Scheduling

Resolves memory contention


by scheduling requests
Core Core Memory Memory Stall-Time Fair Memory Scheduling
Controller
Core Core

 How to schedule requests to provide


 High system performance
 High fairness to applications
Onur Mutlu and Thomas Moscibroda,
 Configurability to system software "Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors"
40th International Symposium on Microarchitecture (MICRO),
pages 146-158, Chicago, IL, December 2007. Slides (ppt)
 Memory controller needs to be aware of threads

1325 STFM Micro 2007 Talk

The Problem: Unfairness How Do We Solve the Problem?


 Stall-time fair memory scheduling [Mutlu+ MICRO’07]

 Goal: Threads sharing main memory should experience


similar slowdowns compared to when they are run alone 
fair scheduling
 Also improves overall system performance by ensuring cores
make “proportional” progress

 Unfair slowdown of different threads  Idea: Memory controller estimates each thread’s slowdown
due to interference and schedules requests in a way to
 Low system performance
balance the slowdowns
 Vulnerability to denial of service
 Priority inversion: unable to enforce priorities/SLAs
 Mutlu and Moscibroda, “Stall-Time Fair Memory Access Scheduling for
 Poor performance predictability (no performance isolation) Chip Multiprocessors,” MICRO 2007.
Uncontrollable, unpredictable system

Stall-Time Fairness in Shared DRAM Systems
• A DRAM system is fair if it equalizes the slowdown of equal-priority threads relative to when each thread is run alone on the same system
• DRAM-related stall-time: The time a thread spends waiting for DRAM memory
• STshared: DRAM-related stall-time when the thread runs with other threads
• STalone: DRAM-related stall-time when the thread runs alone
• Memory-slowdown = STshared / STalone
• Relative increase in stall-time
• Stall-Time Fair Memory scheduler (STFM) aims to equalize Memory-slowdown for interfering threads, without sacrificing performance
• Considers inherent DRAM performance of each thread
• Aims to allow proportional progress of threads

STFM Scheduling Algorithm [MICRO’07]
• For each thread, the DRAM controller
• Tracks STshared
• Estimates STalone
• Each cycle, the DRAM controller
• Computes Slowdown = STshared / STalone for threads with legal requests
• Computes unfairness = MAX Slowdown / MIN Slowdown
• If unfairness < α
• Use DRAM throughput oriented scheduling policy
• If unfairness ≥ α
• Use fairness-oriented scheduling policy (a sketch follows below)
• (1) requests from thread with MAX Slowdown first
• (2) row-hit first, (3) oldest-first
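A minimal sketch of the per-cycle STFM decision in C. The slowdown-estimation machinery (the hard part of the real design) is abstracted into the inputs; the thread_t fields, the function name, and the 1.05 threshold (taken from the example that follows) are illustrative assumptions.

    #define ALPHA 1.05   /* unfairness threshold, as in the example below */

    typedef struct {
        double st_shared;    /* estimated DRAM stall time with other threads   */
        double st_alone;     /* estimated DRAM stall time if run alone (> 0)   */
        int    has_request;  /* thread has a legal (schedulable) request       */
    } thread_t;

    /* Returns the thread whose requests should be prioritized under the
       fairness-oriented policy, or -1 to fall back to FR-FCFS. */
    int stfm_pick_thread(const thread_t *t, int n)
    {
        double max_sd = 0.0, min_sd = 1e30;
        int slowest = -1;
        for (int i = 0; i < n; i++) {
            if (!t[i].has_request) continue;
            double sd = t[i].st_shared / t[i].st_alone;   /* memory slowdown */
            if (sd > max_sd) { max_sd = sd; slowest = i; }
            if (sd < min_sd) min_sd = sd;
        }
        if (slowest < 0) return -1;              /* no schedulable request      */
        if (max_sd / min_sd < ALPHA) return -1;  /* fair enough: use FR-FCFS    */
        return slowest;  /* prioritize the most-slowed-down thread; within it,
                            still apply row-hit first, then oldest first        */
    }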

How Does STFM Prevent Unfairness?
[animation: request buffer with interleaved T0/T1 requests and the bank’s row buffer; each cycle STFM updates the estimated slowdowns of T0 and T1, computes unfairness = MAX / MIN slowdown, and once unfairness exceeds the threshold α = 1.05 it prioritizes the requests of the more-slowed-down thread]

STFM Pros and Cons
• Upsides:
• First algorithm for fair multi-core memory scheduling
• Provides a mechanism to estimate memory slowdown of a thread
• Good at providing fairness
• Being fair can improve performance
• Downsides:
• Does not handle all types of interference
• (Somewhat) complex to implement
• Slowdown estimations can be incorrect

Another Problem due to Interference

 Processors try to tolerate the latency of DRAM requests by


generating multiple outstanding requests
Parallelism-Aware Batch Scheduling  Memory-Level Parallelism (MLP)
 Out-of-order execution, non-blocking caches, runahead execution

 Effective only if the DRAM controller actually services the


multiple requests in parallel in DRAM banks
Onur Mutlu and Thomas Moscibroda,
"Parallelism-Aware Batch Scheduling: Enhancing both  Multiple threads share the DRAM controller
Performance and Fairness of Shared DRAM Systems”
35th International Symposium on Computer Architecture (ISCA),  DRAM controllers are not aware of a thread’s MLP
pages 63-74, Beijing, China, June 2008. Slides (ppt)  Can service each thread’s outstanding requests serially, not in parallel

1334

Bank Parallelism of a Thread
[timeline: a single thread issues 2 DRAM requests, to Bank 0 Row 1 and Bank 1 Row 1; the bank access latencies of the two requests are overlapped, so the thread stalls for ~ONE bank access latency]

Bank Parallelism Interference in DRAM
[timeline, baseline scheduler: Thread A (Bank 0 Row 1, Bank 1 Row 1) and Thread B (Bank 1 Row 99, Bank 0 Row 99) interleave; the bank access latencies of each thread are serialized, so each thread stalls for ~TWO bank access latencies]

Parallelism-Aware Scheduler
[timeline: with the baseline scheduler, each thread stalls for ~2 bank access latencies; a parallelism-aware scheduler services each thread’s two requests back to back in different banks, saving cycles — average stall-time drops to ~1.5 bank access latencies]

Parallelism-Aware Batch Scheduling (PAR-BS)
• Principle 1: Parallelism-awareness
• Schedule requests from a thread (to different banks) back to back
• Preserves each thread’s bank parallelism
• But, this can cause starvation…
• Principle 2: Request Batching
• Group a fixed number of oldest requests from each thread into a “batch”
• Service the batch before all other requests
• Form a new batch when the current one is done
• Eliminates starvation, provides fairness
• Allows parallelism-awareness within a batch
Mutlu and Moscibroda, “Parallelism-Aware Batch Scheduling,” ISCA 2008.

PAR-BS Components Request Batching

 Request batching  Each memory request has a bit (marked) associated with it

 Batch formation:
 Mark up to Marking-Cap oldest requests per bank for each thread
 Marked requests constitute the batch
 Within-batch scheduling  Form a new batch when no marked requests are left
 Parallelism aware
 Marked requests are prioritized over unmarked ones
 No reordering of requests across batches: no starvation, high fairness

 How to prioritize requests within a batch?


Within-Batch Scheduling
• Can use any DRAM scheduling policy
• FR-FCFS (row-hit first, then oldest-first) exploits row-buffer locality
• But, we also want to preserve intra-thread bank parallelism
• Service each thread’s requests back to back
HOW?
• Scheduler computes a ranking of threads when the batch is formed
• Higher-ranked threads are prioritized over lower-ranked ones
• Improves the likelihood that requests from a thread are serviced in parallel by different banks
• Different threads prioritized in the same order across ALL banks

How to Rank Threads within a Batch
• Ranking scheme affects system throughput and fairness
• Maximize system throughput
• Minimize average stall-time of threads within the batch
• Minimize unfairness (equalize the slowdown of threads)
• Service threads with inherently low stall-time early in the batch
• Insight: delaying memory non-intensive threads results in high slowdown
• Shortest stall-time first (shortest job first) ranking (a ranking sketch follows below)
• Provides optimal system throughput [Smith, 1956]*
• Controller estimates each thread’s stall-time within the batch
• Ranks threads with shorter stall-time higher
* W.E. Smith, “Various optimizers for single stage production,” Naval Research Logistics Quarterly, 1956.
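A sketch of the shortest stall-time first ranking in C, using the two estimates detailed on the next slide: rank by max-bank-load (maximum number of marked requests to any bank), breaking ties by total-load (total number of marked requests). The array shapes and names are illustrative assumptions.

    #define NUM_THREADS 4
    #define NUM_BANKS   4

    int marked[NUM_THREADS][NUM_BANKS];  /* marked requests per thread per bank */
    int rank_order[NUM_THREADS];         /* thread ids, highest rank first       */

    static void ranking_key(int t, int *max_load, int *total_load)
    {
        *max_load = 0; *total_load = 0;
        for (int b = 0; b < NUM_BANKS; b++) {
            if (marked[t][b] > *max_load) *max_load = marked[t][b];
            *total_load += marked[t][b];
        }
    }

    void rank_threads(void)
    {
        for (int t = 0; t < NUM_THREADS; t++) rank_order[t] = t;
        /* selection sort by (max-bank-load, total-load); lower = higher rank */
        for (int i = 0; i < NUM_THREADS; i++) {
            for (int j = i + 1; j < NUM_THREADS; j++) {
                int mi, ti, mj, tj;
                ranking_key(rank_order[i], &mi, &ti);
                ranking_key(rank_order[j], &mj, &tj);
                if (mj < mi || (mj == mi && tj < ti)) {
                    int tmp = rank_order[i];
                    rank_order[i] = rank_order[j];
                    rank_order[j] = tmp;
                }
            }
        }
    }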

Shortest Stall-Time First Ranking Example Within-Batch Scheduling Order


 Maximum number of marked requests to any bank (max-bank-load) Baseline Scheduling T3 7 PAR-BS Scheduling T3 7
 Rank thread with lower max-bank-load higher (~ low stall-time) Order (Arrival order) T3 6 Order T3 6
 Total number of marked requests (total-load) T3 T2 T3 T3 5 T3 T3 T3 T3 5

Time

Time
 Breaks ties: rank thread with lower total-load higher T1 T0 T2 T0 4 T3 T2 T2 T3 4
T2 T2 T1 T2 3 T2 T2 T2 T3 3
T3 T3 T1 T0 T3 2 T1 T1 T1 T2 2
max-bank-load total-load T1 T3 T2 T3 1 T1 T0 T0 T0 1
T3
T3 T2 T3 T3 T0 1 3
Bank 0 Bank 1 Bank 2 Bank 3 Bank 0 Bank 1 Bank 2 Bank 3
T1 T0 T2 T0 T1 2 4
T2 T2 T1 T2 T2 2 6
T3 T1 T0 T3 Ranking: T0 > T1 > T2 > T3
T3 5 9
T1 T3 T2 T3
T0 T1 T2 T3 T0 T1 T2 T3
Bank 0 Bank 1 Bank 2 Bank 3 Ranking: Stall times 4 4 5 7 Stall times 1 2 4 7
T0 > T1 > T2 > T3
AVG: 5 bank access latencies AVG: 3.5 bank access latencies


Putting It Together: PAR-BS Scheduling Policy
• PAR-BS Scheduling Policy (a comparator sketch follows below)
(1) Marked requests first   [batching]
(2) Row-hit requests first
(3) Higher-rank thread first (shortest stall-time first)
(4) Oldest first            [(2)–(4): parallelism-aware within-batch scheduling]
• Three properties:
• Exploits row-buffer locality and intra-thread bank parallelism
• Work-conserving: does not waste bandwidth when it can be used
• Services unmarked requests to banks without marked requests
• Marking-Cap is important
• Too small cap: destroys row-buffer locality
• Too large cap: penalizes memory non-intensive threads
• Mutlu and Moscibroda, “Parallelism-Aware Batch Scheduling,” ISCA 2008.

Hardware Cost
• <1.5KB storage cost for 8-core system with 128-entry memory request buffer
• No complex operations (e.g., divisions)
• Not on the critical path
• Scheduler makes a decision only every DRAM cycle
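The four PAR-BS rules map naturally onto a comparator; here is a minimal C sketch (the req_t fields and function name are hypothetical, not the hardware implementation).

    typedef struct {
        int marked;         /* belongs to the current batch                    */
        int row_hit;        /* targets the currently open row in its bank      */
        int thread_rank;    /* smaller = higher rank (shortest stall-time)     */
        unsigned long age;  /* smaller = older                                 */
    } req_t;

    /* Returns nonzero if request a should be scheduled before request b. */
    int parbs_before(const req_t *a, const req_t *b)
    {
        if (a->marked      != b->marked)      return a->marked  > b->marked;       /* (1) */
        if (a->row_hit     != b->row_hit)     return a->row_hit > b->row_hit;      /* (2) */
        if (a->thread_rank != b->thread_rank) return a->thread_rank < b->thread_rank; /* (3) */
        return a->age < b->age;                                                     /* (4) */
    }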

Unfairness on 4-, 8-, 16-core Systems System Performance


Unfairness = MAX Memory Slowdown / MIN Memory Slowdown [MICRO 2007]
1.4
5
1.3
FR-FCFS
4.5 FCFS 1.2
1.1

Normalized Hmean Speedup


NFQ
Unfairness (lower is better)

4
STFM 1
PAR-BS 0.9
3.5
0.8

3 0.7 FR-FCFS
0.6 FCFS
2.5 NFQ
0.5
STFM
0.4
2 PAR-BS
0.3

1.5 0.2
0.1
1 0
4-core 8-core 16-core 4-core 8-core 16-core


PAR-BS Pros and Cons


 Upsides:
 First scheduler to address bank parallelism destruction across
multiple threads TCM:
Simple mechanism (vs. STFM)
Thread Cluster Memory Scheduling

 Batching provides fairness


 Ranking enables parallelism awareness

 Downsides:
 Does not always prioritize the latency-sensitive applications Yoongu Kim, Michael Papamichael, Onur Mutlu, and Mor Harchol-Balter,
"Thread Cluster Memory Scheduling:
Exploiting Differences in Memory Access Behavior"
43rd International Symposium on Microarchitecture (MICRO),
pages 65-76, Atlanta, GA, December 2010. Slides (pptx) (pdf)

1349 TCM Micro 2010 Talk

Throughput vs. Fairness Throughput vs. Fairness


24 cores, 4 memory controllers, 96 workloads
17
Throughput biased approach Fairness biased approach
Prioritize less memory-intensive threads Take turns accessing memory
Maximum Slowdown

15
Better fairness

13
System throughput bias
FCFS
11 Good for throughput Does not starve
9 FRFCF thread A
7 S
5 Fairness bias STFM less memory thread B higher thread C thread A thread B
3 intensive priority
thread C
1 not prioritized 
7 7.5 8 8.5 9 9.5 10
Weighted Speedup starvation  unfairness reduced throughput
Better system throughput
No previous memory scheduling algorithm provides Single policy for all threads is insufficient
both the best fairness and system throughput

Achieving the Best of Both Worlds Thread Cluster Memory Scheduling [Kim+ MICRO’10]
higher For Throughput 1. Group threads into two
priority clusters
Prioritize memory-non-intensive threads 2. Prioritize non-intensive higher
thread
priority
thread cluster
Non-intensive
thread
Different policies forcluster
3.Memory-non-intensive each
thread For Fairness cluster
Throughput
Unfairness caused by memory-intensive
thread
thread
thread
thread
being prioritized over each other thread
Prioritized higher
thread
• Shuffle thread ranking thread
thread
thread
priority

thread
Memory-intensive threads have Threads in the system
thread
different vulnerability to interference Memory-intensive
• Shuffle asymmetrically
Intensive cluster Fairness
1353 1354

Clustering Threads
Step 1: Sort threads by MPKI (misses per kilo-instruction)
Step 2: Memory bandwidth usage αT divides the clusters
• T = total memory bandwidth usage; α = ClusterThreshold (α < 10%)
• The least intensive threads whose combined bandwidth usage stays below αT form the non-intensive cluster; the remaining threads form the intensive cluster
(a clustering sketch follows below)

TCM: Quantum-Based Operation
• Previous quantum / current quantum (~1M cycles each)
• During a quantum: monitor thread behavior
1. Memory intensity
2. Bank-level parallelism
3. Row-buffer locality
• Beginning of a quantum: perform clustering; compute niceness of intensive threads
• Within a quantum: shuffle the intensive-cluster ranking every shuffle interval (~1K cycles)
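A sketch of TCM’s clustering step in C, under the assumptions stated on the slide (the array names, the insertion sort, and the fixed thread count are illustrative, not the hardware mechanism).

    #define NTHREADS 24

    double mpki[NTHREADS];            /* measured over the previous quantum      */
    double bw[NTHREADS];              /* memory bandwidth used by each thread    */
    int    in_nonintensive[NTHREADS];

    void tcm_cluster(double cluster_threshold /* α, e.g., 0.05 .. 0.10 */)
    {
        int order[NTHREADS];
        for (int i = 0; i < NTHREADS; i++) { order[i] = i; in_nonintensive[i] = 0; }

        /* sort thread ids by increasing MPKI (insertion sort for brevity) */
        for (int i = 1; i < NTHREADS; i++) {
            int id = order[i], j = i - 1;
            while (j >= 0 && mpki[order[j]] > mpki[id]) { order[j + 1] = order[j]; j--; }
            order[j + 1] = id;
        }

        double total_bw = 0.0, used = 0.0;
        for (int i = 0; i < NTHREADS; i++) total_bw += bw[i];

        /* least intensive threads first, until αT of bandwidth is accounted for */
        for (int i = 0; i < NTHREADS; i++) {
            if (used + bw[order[i]] > cluster_threshold * total_bw) break;
            in_nonintensive[order[i]] = 1;
            used += bw[order[i]];
        }
    }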

TCM: Scheduling Algorithm TCM: Throughput and Fairness


24 cores, 4 memory controllers, 96 workloads
1. Highest-rank: Requests from higher ranked threads 16
prioritized FRFCFS
14

Better fairness
Maximum Slowdown
 Non-Intensive cluster > Intensive cluster ATLAS
12
 Non-Intensive cluster: lower intensity  higher rank
STFM
 Intensive cluster: rank shuffling 10
PAR-BS
8
TCM
6

2. Row-hit: Row-buffer hit requests are prioritized 4


7.5 8 8.5 9 9.5 10
Weighted Speedup

3. Oldest: Older requests are prioritized Better system throughput


TCM, a heterogeneous scheduling policy,
1357
provides best fairness and system throughput 1358

TCM: Fairness-Throughput Tradeoff TCM Pros and Cons


 Upsides:
When configuration parameter is varied…
12
 Provides both high fairness and high performance
FRFCFS  Caters to the needs for different types of threads (latency vs.
Better fairness
Maximum Slowdown

10 bandwidth sensitive)
STFM ATLAS  (Relatively) simple
8
PAR-BS
6 TCM
 Downsides:
4  Scalability to large buffer sizes?
2 Adjusting  Robustness of clustering and shuffling algorithms?
12 13 14 ClusterThreshold
15 16
Weighted Speedup
Better system throughput

TCM allows robust fairness-throughput tradeoff



Where We Are in Lecture Schedule


 The memory hierarchy
18-447  Caches, caches, more caches
Computer Architecture  Virtualizing the memory hierarchy: Virtual Memory
Main memory: DRAM
Lecture 23: Memory Management 

 Main memory control, scheduling


 Memory latency tolerance techniques
 Non-volatile memory

Prof. Onur Mutlu  Multiprocessors


Carnegie Mellon University  Coherence and consistency
Spring 2015, 3/27/2015  Interconnection networks
 Multi-core issues (e.g., heterogeneous multi-core)
1362

Required Reading (for the Next Few Lectures) Required Readings on DRAM
 Onur Mutlu, Justin Meza, and Lavanya Subramanian,  DRAM Organization and Operation Basics
"The Main Memory System: Challenges and  Sections 1 and 2 of: Lee et al., “Tiered-Latency DRAM: A Low
Opportunities" Latency and Low Cost DRAM Architecture,” HPCA 2013.
Invited Article in Communications of the Korean Institute of http://users.ece.cmu.edu/~omutlu/pub/tldram_hpca13.pdf
Information Scientists and Engineers (KIISE), 2015.
 Sections 1 and 2 of Kim et al., “A Case for Subarray-Level
http://users.ece.cmu.edu/~omutlu/pub/main-memory- Parallelism (SALP) in DRAM,” ISCA 2012.
system_kiise15.pdf http://users.ece.cmu.edu/~omutlu/pub/salp-dram_isca12.pdf

 DRAM Refresh Basics


 Sections 1 and 2 of Liu et al., “RAIDR: Retention-Aware
Intelligent DRAM Refresh,” ISCA 2012.
http://users.ece.cmu.edu/~omutlu/pub/raidr-dram-
refresh_isca12.pdf

Review: A Modern DRAM Controller

Memory Interference and Scheduling


in Multi-Core Systems

1366

(Un)expected Slowdowns in Multi-Core Memory Scheduling Techniques


High priority
 We covered
 FCFS
 FR-FCFS
 STFM (Stall-Time Fair Memory Access Scheduling)
Low priority  PAR-BS (Parallelism-Aware Batch Scheduling)
 ATLAS
 TCM (Thread Cluster Memory Scheduling)

 There are many more …

 See your required reading (Section 7):


(Core 0) (Core 1)
 Mutlu et al., “The Main Memory System: Challenges and
Moscibroda and Mutlu, “Memory performance attacks: Denial of memory service
in multi-core systems,” USENIX Security 2007.
Opportunities,” KIISE 2015.

Fundamental Interference Control Techniques


 Goal: to reduce/control inter-thread memory interference

Other Ways of
Handling Memory Interference 1. Prioritization or request scheduling

2. Data mapping to banks/channels/ranks

3. Core/source throttling

4. Application/thread scheduling

1370

Observation: Modern Systems Have Multiple Channels Data Mapping in Current Systems
Core Core
Page
Red Memory Channel 0 Memory Red Memory Channel 0 Memory
App Controller App Controller

Core Core

Blue Memory Channel 1 Memory Blue Memory Channel 1 Memory


App Controller App Controller

A new degree of freedom Causes interference between applications’ requests


Mapping data across multiple channels
Muralidhara et al., “Memory Channel Partitioning,” MICRO’11.

Partitioning Channels Between Applications Overview: Memory Channel Partitioning (MCP)


Core  Goal
Page  Eliminate harmful interference between applications
Red Memory Channel 0 Memory
App Controller

Core
 Basic Idea
 Map the data of badly-interfering applications to different
Blue Memory Channel 1 Memory channels
App Controller

 Key Principles
 Separate low and high memory-intensity applications
Eliminates interference between applications’ requests  Separate low and high row-buffer locality applications

Muralidhara et al., “Memory Channel Partitioning,” MICRO’11. 1373 Muralidhara et al., “Memory Channel Partitioning,” MICRO’11. 1374

Key Insight 1: Separate by Memory Intensity Key Insight 2: Separate by Row-Buffer Locality
High memory-intensity applications interfere with low HighRequest Buffer
row-buffer
State
locality
Channelapplications
0 Request Buffer
interfere
State
with low
Channel 0
memory-intensity applications in shared memory channels Bank 0
row-buffer locality applications in shared memory channels
R1 Bank 0
Time Units Time Units Bank 1 R0 R0 Bank 1
Channel 0 Channel 0 R0 R3 R2 R0
5 4 3 2 1 5 4 3 2 1
Core Core Bank 0
Bank 0 Red Bank 0 Bank 0
Red R4 R1 R4
App Bank 1 App Bank 1 Bank 1 R3 R2 Bank 1
Time Time
Channel 1 Channel 1
Core Core Saved Cycles Bank 0
units Service Order units Service Order
Bank 0 Blue Channel 0 Channel 0
Blue 6 5 4 3 2 1 6 5 4 3 2 1
App Bank 1 App Saved Cycles Bank 1 R1 Bank 0 Bank 0

Channel 1 Channel 1 R3 R2 R0 R0 Bank 1 R0 R0 Bank 1


Conventional Page Mapping Channel Partitioning
R4 Bank 0 R1 R4 Bank 0

andBank
high1 row-buffer
Saved Bank 1
Map data of low and high memory-intensity applications Map data of low Cycles locality
R3
applications
R2

to different channels toChannel 1


different
Conventional Page Mapping channels
Channel Partitioning
Channel 1


Memory Channel Partitioning (MCP) Mechanism
Hardware:
1. Profile applications
2. Classify applications into groups
3. Partition channels between application groups
4. Assign a preferred channel to each application
System software:
5. Allocate application pages to preferred channel

Interval Based Operation (a classification sketch follows below)
• During the current interval: 1. Profile applications
• At the interval boundary: 2. Classify applications into groups; 3. Partition channels between groups; 4. Assign preferred channel to applications
• During the next interval: 5. Enforce channel preferences
Muralidhara et al., “Memory Channel Partitioning,” MICRO’11.
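A rough sketch of the classification/partitioning step in C. The thresholds, the 4-channel mapping, and all names are illustrative assumptions chosen to mirror the key principles (separate low/high intensity, then separate high/low row-buffer locality); they are not the paper’s actual parameters.

    #define NAPPS     8
    #define NCHANNELS 4

    double mpki[NAPPS];            /* profiled memory intensity           */
    double row_hit_rate[NAPPS];    /* profiled row-buffer locality        */
    int    preferred_channel[NAPPS];

    void mcp_partition(double intensity_threshold, double locality_threshold)
    {
        for (int a = 0; a < NAPPS; a++) {
            int high_intensity = (mpki[a]         > intensity_threshold);
            int high_locality  = (row_hit_rate[a] > locality_threshold);

            if (!high_intensity)
                preferred_channel[a] = 0;   /* low-intensity apps share one channel      */
            else if (high_locality)
                preferred_channel[a] = 1;   /* intensive, high row-buffer locality       */
            else
                preferred_channel[a] = 2 + (a % (NCHANNELS - 2)); /* intensive, low RBL  */
        }
        /* The OS then allocates new pages of app a from preferred_channel[a]. */
    }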

Observations Integrated Memory Partitioning and Scheduling (IMPS)

 Applications with very low memory-intensity rarely  Always prioritize very low memory-intensity
access memory applications in the memory scheduler
 Dedicating channels to them results in precious
memory bandwidth waste

 They have the most potential to keep their cores busy  Use memory channel partitioning to mitigate
 We would really like to prioritize them interference between other applications

 They interfere minimally with other applications


 Prioritizing them does not hurt others

Muralidhara et al., “Memory Channel Partitioning,” MICRO’11.



Hardware Cost Performance of Channel Partitioning


 Memory Channel Partitioning (MCP) Averaged over 240 workloads
1.15
 Only profiling counters in hardware
11%
5%
 No modifications to memory scheduling logic

System Performance
1.1 FRFCFS
 1.5 KB storage cost for a 24-core, 4-channel system
7%
1%

Normalized
1.05 ATLAS

 Integrated Memory Partitioning and Scheduling (IMPS) TCM


1
 A single bit per request MCP
 Scheduler prioritizes based on this single bit 0.95
IMPS
0.9
Better system performance than the best previous scheduler
at lower hardware cost
Muralidhara et al., “Memory Channel Partitioning,” MICRO’11. 1381 1382

Combining Multiple Interference Control Techniques Fundamental Interference Control Techniques


 Combined interference control techniques can mitigate  Goal: to reduce/control inter-thread memory interference
interference much more than a single technique alone can
do
1. Prioritization or request scheduling
 The key challenge is:
 Deciding what technique to apply when
2. Data mapping to banks/channels/ranks
 Partitioning work appropriately between software and
hardware
3. Core/source throttling

4. Application/thread scheduling


Source Throttling: A Fairness Substrate
• Key idea: Manage inter-thread interference at the cores (sources), not at the shared resources
• Dynamically estimate unfairness in the memory system
• Feed back this information into a controller
• Throttle cores’ memory access rates accordingly
• Whom to throttle and by how much depends on the performance target (throughput, fairness, per-thread QoS, etc.)
• E.g., if unfairness > system-software-specified target, then throttle down the core causing unfairness & throttle up the core that was unfairly treated
• Ebrahimi et al., “Fairness via Source Throttling,” ASPLOS’10, TOCS’12.

Fairness via Source Throttling (FST) [ASPLOS’10]
• Each interval: runtime unfairness evaluation → dynamic request throttling (a sketch follows below)
1- Estimate system unfairness
2- Find the app with the highest slowdown (App-slowest)
3- Find the app causing the most interference for App-slowest (App-interfering)
• if (Unfairness Estimate > Target) {
1- Throttle down App-interfering (limit injection rate and parallelism)
2- Throttle up App-slowest
}
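A minimal sketch of the FST end-of-interval check in C. The slowdown and interference estimates (the hard part of the real mechanism) are taken as inputs; the array names and the meaning of throttle_level are illustrative assumptions.

    #define NCORES 4

    double estimated_slowdown[NCORES];            /* STshared/STalone estimates        */
    int    interference_caused_to_slowest[NCORES];/* interference each core caused     */
    int    throttle_level[NCORES];                /* e.g., allowed outstanding requests */

    void fst_interval_end(double unfairness_target)
    {
        int slowest = 0;
        double min_sd = estimated_slowdown[0];
        for (int c = 1; c < NCORES; c++) {
            if (estimated_slowdown[c] > estimated_slowdown[slowest]) slowest = c;
            if (estimated_slowdown[c] < min_sd) min_sd = estimated_slowdown[c];
        }

        double unfairness = estimated_slowdown[slowest] / min_sd;
        if (unfairness > unfairness_target) {
            /* find the core interfering most with the slowest core */
            int interfering = 0;
            for (int c = 1; c < NCORES; c++)
                if (interference_caused_to_slowest[c] >
                    interference_caused_to_slowest[interfering]) interfering = c;
            throttle_level[interfering]--;  /* throttle down: fewer requests/less MLP */
            throttle_level[slowest]++;      /* throttle up the unfairly slowed core   */
        }
    }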

Core (Source) Throttling Fundamental Interference Control Techniques


 Idea: Estimate the slowdown due to (DRAM) interference  Goal: to reduce/control interference
and throttle down threads that slow down others
 Ebrahimi et al., “Fairness via Source Throttling: A Configurable
and High-Performance Fairness Substrate for Multi-Core 1. Prioritization or request scheduling
Memory Systems,” ASPLOS 2010.

 Advantages 2. Data mapping to banks/channels/ranks


+ Core/request throttling is easy to implement: no need to change the
memory scheduling algorithm
3. Core/source throttling
+ Can be a general way of handling shared resource contention
+ Can reduce overall load/contention in the memory system
4. Application/thread scheduling
 Disadvantages Idea: Pick threads that do not badly interfere with each
- Requires interference/slowdown estimations  difficult to estimate other to be scheduled together on cores sharing the memory
- Thresholds can become difficult to optimize  throughput loss system

Interference-Aware Thread Scheduling Virtualized Cluster


 An example from scheduling in clusters (data centers)
 Clusters can be running virtual machines VM VM VM VM

App App App App

How
Distributed to dynamically
Resource Management
schedule VMs onto
Host (DRM) policies
hosts? Host

Core0 Core1 Core0 Core1

LLC LLC

DRAM DRAM
1389 1390

Conventional DRM Policies Microarchitecture-level Interference

BasedVMon operating-system-level metrics


VM Host

e.g., CPU
App utilization,
App memory App
capacityApp  VMs within a host compete for: VM VM
demand  Shared cache capacity App App
 Shared memory bandwidth
Memory Capacity Host Host Core0 Core1

VM LLC
CPU
App
DRAM

Core0 Core1 Core0 Core1

LLC LLC Can operating-system-level metrics capture the


microarchitecture-level resource interference?
DRAM DRAM

Microarchitecture Unawareness Impact on Performance


Operating-system-level Microarchitecture-level metrics 0.6

V metrics IPC 0.4


LLC Hit Ratio Memory Bandwidth (Harmonic
M CPU
Memory Capacity Mean) 0.2
Utilization 2% 2267 MB/s
0.0

App 92% 369 MB 98% 1 MB/s Conventional DRM with Microarchitecture Awareness

93% Host
348 MB Host Host Host
Memory Capacity Memory Capacity
VM VM VM VM VM VM VM VM
VM VM
CPU App App App App CPU App App SWAP App App
App App
Core0 Core1 Core0 Core1 Core0 Core1 Core0 Core1
STREAM LLC LLC STREAM LLC LLC
App gromacs App gromacs
DRAM DRAM DRAM DRAM
1393 1394

Impact on Performance A-DRM: Architecture-aware DRM


0.6
 Goal: Take into account microarchitecture-level
IPC 0.4
49%
shared resource interference
(Harmonic
Mean) 0.2  Shared cache capacity
0.0  Shared memory bandwidth
We need microarchitecture-
Conventional DRM with Microarchitecture Awareness

level interference
Host awareness inHost  Key Idea:
Memory Capacity
DRM! VM VM  Monitor and detect microarchitecture-level shared
VM VM
VM resource interference
CPU App App App App
App  Balance microarchitecture-level resource usage
Core0 Core1 Core0 Core1 across cluster to minimize memory interference
STREAM LLC LLC while maximizing system performance
App gromacs
DRAM DRAM

A-DRM: Architecture-aware DRM More on Architecture-Aware DRM


 Optional Reading

Hosts Controller
 Wang et al., “A-DRM: Architecture-aware Distributed
A-DRM: Global Architecture
–aware Resource Manager
Resource Management of Virtualized Clusters,” VEE 2015.
OS+Hypervisor
 http://users.ece.cmu.edu/~omutlu/pub/architecture-aware-
Profiling Engine
VM VM distributed-resource-management_vee15.pdf
Ap Ap Architecture-aware
•••
p p Interference Detector

Architecture-aware
Distributed Resource
CPU/Memory Architectural Management (Policy)
Capacity Resources
Resource
Profiler Migration Engine

1397 1398

Interference-Aware Thread Scheduling Summary: Fundamental Interference Control Techniques

 Advantages  Goal: to reduce/control interference


+ Can eliminate/minimize interference by scheduling “symbiotic
applications” together (as opposed to just managing the
interference)
1. Prioritization or request scheduling
+ Less intrusive to hardware (no need to modify the hardware
resources)
2. Data mapping to banks/channels/ranks
 Disadvantages and Limitations
-- High overhead to migrate threads between cores and 3. Core/source throttling
machines
-- Does not work (well) if all threads are similar and they 4. Application/thread scheduling
interfere

Best is to combine all. How would you do that?



Multithreaded (Parallel) Applications


 Threads in a multi-threaded application can be inter-
dependent
Handling Memory Interference  As opposed to threads from different applications
In Multithreaded Applications
 Such threads can synchronize with each other
 Locks, barriers, pipeline stages, condition variables,
semaphores, …

 Some threads can be on the critical path of execution due


to synchronization; some threads are not

 Even within a thread, some “code segments” may be on


the critical path of execution; some are not
1402

Critical Sections
• Enforce mutually exclusive access to shared data
• Only one thread can be executing it at a time
• Contended critical sections make threads wait → threads causing serialization can be on the critical path

Each thread:
loop {
    Compute              (non-critical section)
    lock(A)
    Update shared data   (critical section)
    unlock(A)
}

Barriers
• Synchronization point
• Threads have to wait until all threads reach the barrier
• Last thread arriving at the barrier is on the critical path

Each thread:
loop1 {
    Compute
}
barrier
loop2 {
    Compute
}
(a POSIX-threads version of this pseudo-code is sketched below)

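A concrete (minimal) POSIX-threads version of the pseudo-code above; thread creation, error handling, and barrier initialization with the thread count are omitted and assumed to happen elsewhere.

    #include <pthread.h>

    pthread_mutex_t   lock_A  = PTHREAD_MUTEX_INITIALIZER;
    pthread_barrier_t barrier;     /* assumed initialized elsewhere with N threads */
    long shared_data;

    void *worker(void *arg)
    {
        /* critical section: only one thread updates the shared data at a time */
        pthread_mutex_lock(&lock_A);
        shared_data++;
        pthread_mutex_unlock(&lock_A);

        /* barrier: the last thread to arrive is on the critical path */
        pthread_barrier_wait(&barrier);
        return NULL;
    }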

Stages of Pipelined Programs Handling Interference in Parallel Applications


 Loop iterations are statically divided into code segments called stages
 Threads in a multithreaded application are inter-dependent
 Threads execute stages on different cores
 Thread executing the slowest stage is on the critical path  Some threads can be on the critical path of execution due
to synchronization; some threads are not
 How do we schedule requests of inter-dependent threads
A B C
to maximize multithreaded application performance?
loop {
Compute1 A
 Idea: Estimate limiter threads likely to be on the critical path and
B
prioritize their requests; shuffle priorities of non-limiter threads
Compute2
to reduce memory interference among them [Ebrahimi+, MICRO’11]
Compute3 C
}  Hardware/software cooperative limiter thread estimation:
 Thread executing the most contended critical section
 Thread executing the slowest pipeline stage
 Thread that is falling behind the most in reaching a barrier
1405 1406

Prioritizing Requests from Limiter Threads More on Parallel Application Memory Scheduling
 Optional reading
Non-Critical Section Critical Section 1 Barrier
Waiting for Sync
or Lock
Critical Section 2 Critical Path  Ebrahimi et al., “Parallel Application Memory Scheduling,”
Barrier MICRO 2011.
Thread A
 http://users.ece.cmu.edu/~omutlu/pub/parallel-memory-
Thread B
scheduling_micro11.pdf
Thread C
Thread D
Time
Limiter Thread Identification Barrier
Thread A Most Contended
Thread B Critical Section: 1
Saved
Thread C Cycles Limiter Thread: C
A
B
D
Thread D
Time


DRAM Power Management


 DRAM chips have power modes
 Idea: When not accessing a chip power it down
More on DRAM Management and
DRAM Controllers  Power states
 Active (highest power)
 All banks idle
 Power-down
 Self-refresh (lowest power)

 State transitions incur latency during which the chip cannot


be accessed

1410

DRAM Refresh
 DRAM capacitor charge leaks over time

DRAM Refresh  The memory controller needs to refresh each row


periodically to restore charge
 Read and close each row every N ms
 Typical N = 64 ms

 Downsides of refresh
-- Energy consumption: Each refresh consumes energy
-- Performance degradation: DRAM rank/bank unavailable while
refreshed
-- QoS/predictability impact: (Long) pause times during refresh
-- Refresh rate limits DRAM capacity scaling (a rough back-of-the-envelope calculation follows below)
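As a rough illustration using typical DDR3-era parameters (assumed here, not given on the slide): if 8192 auto-refresh commands must be issued per 64 ms window, one refresh is needed about every 64 ms / 8192 ≈ 7.8 µs (tREFI); if each command keeps the rank busy for tRFC ≈ 350 ns, the rank is unavailable for roughly 350 ns / 7.8 µs ≈ 4.5% of the time — and this fraction grows with chip capacity, since tRFC increases with density.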

DRAM Refresh: Performance Distributed Refresh


 Implications of refresh on performance
-- DRAM bank unavailable while refreshed
-- Long pause times: If we refresh all rows in burst, every 64ms
the DRAM will be unavailable until refresh ends

 Burst refresh: All rows refreshed immediately after one


another

 Distributed refresh eliminates long pause times


 Distributed refresh: Each row refreshed at a different time,
at regular intervals  How else can we reduce the effect of refresh on
performance/QoS?
 Does distributed refresh reduce refresh impact on energy?
 Can we reduce the number of refreshes?
1413 1414

Refresh Today: Auto Refresh Refresh Overhead: Performance


Columns

BANK 0
Rows

BANK 1 BANK 2 BANK 3

46%
Row Buffer

DRAM Bus
8%
A batch of rows are DRAM CONTROLLER
periodically refreshed
via the auto-refresh command
Liu et al., “RAIDR: Retention-Aware Intelligent DRAM Refresh,” ISCA 2012.

Refresh Overhead: Energy Problem with Conventional Refresh


 Today: Every row is refreshed at the same rate

47%

15%
 Observation: Most rows can be refreshed much less often
without losing data [Kim+, EDL’09]
 Problem: No support in DRAM for different refresh rates per row

Liu et al., “RAIDR: Retention-Aware Intelligent DRAM Refresh,” ISCA 2012. 1417 1418

Retention Time of DRAM Rows Reducing DRAM Refresh Operations


 Observation: Only very few rows need to be refreshed at the  Idea: Identify the retention time of different rows and
worst-case rate refresh each row at the frequency it needs to be refreshed

 (Cost-conscious) Idea: Bin the rows according to their


minimum retention times and refresh rows in each bin at
the refresh rate specified for the bin
 e.g., a bin for 64-128ms, another for 128-256ms, …

 Observation: Only very few rows need to be refreshed very


frequently [64-128ms]  Have only a few bins  Low HW
overhead to achieve large reductions in refresh operations

 Liu et al., “RAIDR: Retention-Aware Intelligent DRAM Refresh,” ISCA 2012.

 Can we exploit this to reduce refresh operations at low cost?



RAIDR: Mechanism 1. Profiling


1. Profiling: Profile the retention time of all DRAM rows
 can be done at DRAM design time or dynamically

2. Binning: Store rows into bins by retention time


 use Bloom Filters for efficient and scalable storage
1.25KB storage in controller for 32GB DRAM memory

3. Refreshing: Memory controller refreshes rows in different


bins at different rates
 probe Bloom Filters to determine refresh rate of a row
1421 1422

2. Binning
• How to efficiently and scalably store rows into retention time bins?
• Use Hardware Bloom Filters [Bloom, CACM 1970]

Bloom Filter
• [Bloom, CACM 1970]
• Probabilistic data structure that compactly represents set membership (presence or absence of an element in a set)
• Non-approximate set membership: Use 1 bit per element to indicate absence/presence of each element from an element space of N elements
• Approximate set membership: use a much smaller number of bits and indicate each element’s presence/absence with a subset of those bits
• Some elements map to the bits other elements also map to
• Operations: 1) insert, 2) test, 3) remove all elements (a minimal sketch follows below)
Bloom, “Space/Time Trade-offs in Hash Coding with Allowable Errors”, CACM 1970.
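A minimal Bloom filter sketch in C with k = 2 hash functions over an m-bit array. The sizes and the hash functions are simple illustrative choices, not the ones used in RAIDR.

    #include <stdint.h>
    #include <string.h>

    #define M 8192                        /* number of bits in the filter */
    static uint8_t bits[M / 8];

    static unsigned h1(uint32_t x) { return (x * 2654435761u) % M; }
    static unsigned h2(uint32_t x) { return ((x ^ (x >> 16)) * 40503u) % M; }

    static void set_bit(unsigned i) { bits[i / 8] |= (uint8_t)(1u << (i % 8)); }
    static int  get_bit(unsigned i) { return (bits[i / 8] >> (i % 8)) & 1; }

    /* insert: set the k bits selected by the hashes of the element */
    void bloom_insert(uint32_t row) { set_bit(h1(row)); set_bit(h2(row)); }

    /* test: 1 = possibly present (may be a false positive), 0 = definitely not */
    int bloom_test(uint32_t row) { return get_bit(h1(row)) && get_bit(h2(row)); }

    /* remove all elements: clear the whole bit array */
    void bloom_clear(void) { memset(bits, 0, sizeof(bits)); }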

Bloom Filter Operation Example
[animation over several slides: inserting an element sets its k hash-selected bits; testing an element checks whether all of its k bits are set; a test can return a false positive when other insertions happen to have set all of the tested bits]
Bloom, “Space/Time Trade-offs in Hash Coding with Allowable Errors”, CACM 1970.

Bloom Filters
[summary figure]

Bloom Filters: Pros and Cons
• Advantages
+ Enables storage-efficient representation of set membership
+ Insertion and testing for set membership (presence) are fast
+ No false negatives: If the Bloom Filter says an element is not present in the set, the element must not have been inserted
+ Enables tradeoffs between time & storage efficiency & false positive rate (via sizing and hashing)
• Disadvantages
-- False positives: An element may be deemed to be present in the set by the Bloom Filter but it may never have been inserted
→ Not the right data structure when you cannot tolerate false positives
Bloom, “Space/Time Trade-offs in Hash Coding with Allowable Errors”, CACM 1970.

Benefits of Bloom Filters as Refresh Rate Bins
• False positives: a row may be declared present in the Bloom filter even if it was never inserted
• Not a problem: Refresh some rows more frequently than needed
• No false negatives: rows are never refreshed less frequently than needed (no correctness problems)
• Scalable: a Bloom filter never overflows (unlike a fixed-size table)
• Efficient: No need to store info on a per-row basis; simple hardware → 1.25 KB for 2 filters for a 32 GB DRAM system
(a bin-probing sketch follows below)
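A sketch of how a RAIDR-style controller could pick a per-row refresh interval by probing two retention-time bins, each represented as a Bloom filter like the one sketched earlier. The bin names, sizes, hash functions, and the 64/128/256 ms intervals are illustrative assumptions; a false positive only makes a row refresh more often than necessary, never less often.

    #include <stdint.h>

    #define NBITS 8192
    static uint8_t bin_64_128ms[NBITS / 8];    /* rows needing refresh every 64 ms  */
    static uint8_t bin_128_256ms[NBITS / 8];   /* rows needing refresh every 128 ms */

    static unsigned hash1(uint32_t x) { return (x * 2654435761u) % NBITS; }
    static unsigned hash2(uint32_t x) { return ((x ^ (x >> 16)) * 40503u) % NBITS; }

    static int bin_test(const uint8_t *f, uint32_t row)
    {
        unsigned a = hash1(row), b = hash2(row);
        return ((f[a / 8] >> (a % 8)) & 1) && ((f[b / 8] >> (b % 8)) & 1);
    }

    /* Refresh interval chosen for a row, probing the weakest bin first. */
    int refresh_interval_ms(uint32_t row)
    {
        if (bin_test(bin_64_128ms, row))  return 64;
        if (bin_test(bin_128_256ms, row)) return 128;
        return 256;   /* default rate for the vast majority of rows */
    }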

Use of Bloom Filters in Hardware 3. Refreshing (RAIDR Refresh Controller)


 Useful when you can tolerate false positives in set
membership tests

 See the following recent examples for clear descriptions of


how Bloom Filters are used
 Liu et al., “RAIDR: Retention-Aware Intelligent DRAM
Refresh,” ISCA 2012.
 Seshadri et al., “The Evicted-Address Filter: A Unified
Mechanism to Address Both Cache Pollution and Thrashing,”
PACT 2012.

1433 1434

3. Refreshing (RAIDR Refresh Controller) RAIDR: Baseline Design

Refresh control is in DRAM in today’s auto-refresh systems


RAIDR can be implemented in either the controller or DRAM
Liu et al., “RAIDR: Retention-Aware Intelligent DRAM Refresh,” ISCA 2012.


RAIDR in Memory Controller: Option 1 RAIDR in DRAM Chip: Option 2

Overhead of RAIDR in DRAM controller: Overhead of RAIDR in DRAM chip:


1.25 KB Bloom Filters, 3 counters, additional commands Per-chip overhead: 20B Bloom Filters, 1 counter (4 Gbit chip)
issued for per-row refresh (all accounted for in evaluations) Total overhead: 1.25KB Bloom Filters, 64 counters (32 GB DRAM)
1437 1438

RAIDR: Results and Takeaways DRAM Refresh: More Questions


 System: 32GB DRAM, 8-core; SPEC, TPC-C, TPC-H workloads  What else can you do to reduce the impact of refresh?
 RAIDR hardware cost: 1.25 kB (2 Bloom filters)
 Refresh reduction: 74.6%  What else can you do if you know the retention times of
 Dynamic DRAM energy reduction: 16% rows?
 Idle DRAM power reduction: 20%
 Performance improvement: 9%  How can you accurately measure the retention time of
DRAM rows?
 Benefits increase as DRAM scales in density
 Recommended reading:
 Liu et al., “An Experimental Study of Data Retention Behavior
in Modern DRAM Devices: Implications for Retention Time
Profiling Mechanisms,” ISCA 2013.


More Readings on DRAM Refresh


 Liu et al., “An Experimental Study of Data Retention 18-447
Behavior in Modern DRAM Devices: Implications for
Retention Time Profiling Mechanisms,” ISCA 2013. Computer Architecture
http://users.ece.cmu.edu/~omutlu/pub/dram-retention-time-

characterization_isca13.pdf
Lecture 24: Simulation and
Memory Latency Tolerance
 Chang+, “Improving DRAM Performance by Parallelizing
Refreshes with Accesses,” HPCA 2014.
 http://users.ece.cmu.edu/~omutlu/pub/dram-access-refresh-
parallelization_hpca14.pdf Prof. Onur Mutlu
Carnegie Mellon University
Spring 2015, 3/30/2015

1441

Dreaming and Reality


 An architect is in part a dreamer, a creator

Simulation: The Field of Dreams  Simulation is a key tool of the architect

 Simulation enables
 The exploration of many dreams
 A reality check of the dreams
 Deciding which dream is better

 Simulation also enables


 The ability to fool yourself with false dreams


Why High-Level Simulation? Different Goals in Simulation


 Problem: RTL simulation is intractable for design space  Explore the design space quickly and see what you want to
potentially implement in a next-generation platform
exploration  too time consuming to design and evaluate 

 propose as the next big idea to advance the state of the art
 Especially over a large number of workloads
 the goal is mainly to see relative effects of design decisions
 Especially if you want to predict the performance of a good
chunk of a workload on a particular design  Match the behavior of an existing system so that you can
 Especially if you want to consider many design choices  debug and verify it at cycle-level accuracy
 Cache size, associativity, block size, algorithms  propose small tweaks to the design that can make a difference in
 Memory control and scheduling algorithms performance or energy
 In-order vs. out-of-order execution  the goal is very high accuracy
 Reservation station sizes, ld/st queue size, register file size, …
 Other goals in-between:
 …
 Refine the explored design space without going into a full
detailed, cycle-accurate design
 Goal: Explore design choices quickly to see their impact on
 Gain confidence in your design decisions made by higher-level
the workloads we are designing the platform for
design space exploration
1445 1446

Tradeoffs in Simulation Trading Off Speed, Flexibility, Accuracy


 Three metrics to evaluate a simulator  Speed & flexibility affect:
 Speed  How quickly you can make design tradeoffs
 Flexibility
 Accuracy  Accuracy affects:
 How good your design tradeoffs may end up being
 Speed: How fast the simulator runs (xIPS, xCPS)  How fast you can build your simulator (simulator design time)
 Flexibility: How quickly one can modify the simulator to
evaluate different algorithms and design choices?  Flexibility also affects:
 Accuracy: How accurate the performance (energy) numbers  How much human effort you need to spend modifying the
the simulator generates are vs. a real design (Simulation simulator
error)
 You can trade off between the three to achieve design
 The relative importance of these metrics varies depending exploration and decision goals
on where you are in the design process

High-Level Simulation Simulation as Progressive Refinement


 Key Idea: Raise the abstraction level of modeling to give up  High-level models (Abstract, C)
some accuracy to enable speed & flexibility (and quick  …
simulator design)  Medium-level models (Less abstract)
 …
 Advantage
 Low-level models (RTL with eveything modeled)
+ Can still make the right tradeoffs, and can do it quickly
 …
+ All you need is modeling the key high-level factors, you can
omit corner case conditions  Real design
+ All you need is to get the “relative trends” accurately, not
exact performance numbers  As you refine (go down the above list)
 Abstraction level reduces
 Disadvantage  Accuracy (hopefully) increases (not necessarily, if not careful)
-- Opens up the possibility of potentially wrong decisions  Speed and flexibility reduce
-- How do you ensure you get the “relative trends” accurately?  You can loop back and fix higher-level models
1449 1450

This Course Optional Reading on DRAM Simulation


 A good architect is comfortable at all levels of refinement  Kim et al., “Ramulator: A Fast and Extensible DRAM
 Including the extremes Simulator,” IEEE Computer Architecture Letters 2015.

 This course, as a result, gives you a flavor of both:  https://github.com/CMU-SAFARI/ramulator


 High-level, abstract simulation (Labs 6, 7, 8)
 Low-level, RTL simulation (Labs 2, 3, 4, 5)  http://users.ece.cmu.edu/~omutlu/pub/ramulator_dram_si
mulator-ieee-cal15.pdf


Where We Are in Lecture Schedule Upcoming Seminar on DRAM (April 3)


 The memory hierarchy  April 3, Friday, 11am-noon, GHC 8201
 Caches, caches, more caches  Prof. Moinuddin Qureshi, Georgia Tech
 Virtualizing the memory hierarchy: Virtual Memory  Lead author of “MLP-Aware Cache Replacement”
 Main memory: DRAM  Architecting 3D Memory Systems
 Main memory control, scheduling  Die stacked 3D DRAM technology can provide low-energy high-
bandwidth memory module by vertically integrating several dies
 Memory latency tolerance techniques within the same chip. (…) In this talk, I will discuss how memory
 Non-volatile memory systems can efficiently architect 3D DRAM either as a cache or as
main memory. First, I will show that some of the basic design
decisions typically made for conventional caches (such as
 Multiprocessors serialization of tag and data access, large associativity, and update
of replacement state) are detrimental to the performance of DRAM
 Coherence and consistency caches, as they exacerbate hit latency. (…) Finally, I will present a
 Interconnection networks memory organization that allows 3D DRAM to be a part of the OS-
 Multi-core issues (e.g., heterogeneous multi-core) visible memory address space, and yet relieves the OS from data
migration duties. (…)”
1453 1454

Required Reading Required Readings on DRAM


 Onur Mutlu, Justin Meza, and Lavanya Subramanian,  DRAM Organization and Operation Basics
"The Main Memory System: Challenges and  Sections 1 and 2 of: Lee et al., “Tiered-Latency DRAM: A Low
Opportunities" Latency and Low Cost DRAM Architecture,” HPCA 2013.
Invited Article in Communications of the Korean Institute of http://users.ece.cmu.edu/~omutlu/pub/tldram_hpca13.pdf
Information Scientists and Engineers (KIISE), 2015.
 Sections 1 and 2 of Kim et al., “A Case for Subarray-Level
http://users.ece.cmu.edu/~omutlu/pub/main-memory- Parallelism (SALP) in DRAM,” ISCA 2012.
system_kiise15.pdf http://users.ece.cmu.edu/~omutlu/pub/salp-dram_isca12.pdf

 DRAM Refresh Basics


 Sections 1 and 2 of Liu et al., “RAIDR: Retention-Aware
Intelligent DRAM Refresh,” ISCA 2012.
http://users.ece.cmu.edu/~omutlu/pub/raidr-dram-
refresh_isca12.pdf

Readings on Bloom Filters


 Section 3.1 of
 Seshadri et al., “The Evicted-Address Filter: A Unified
Mechanism to Address Both Cache Pollution and Thrashing,”
PACT 2012.
Difficulty of DRAM Control
 Section 3.3 of
 Liu et al., “RAIDR: Retention-Aware Intelligent DRAM
Refresh,” ISCA 2012.

1457

Why are DRAM Controllers Difficult to Design?
• Need to obey DRAM timing constraints for correctness
• There are many (50+) timing constraints in DRAM
• tWTR: Minimum number of cycles to wait before issuing a read command after a write command is issued
• tRC: Minimum number of cycles between the issuing of two consecutive activate commands to the same bank
• …
• Need to keep track of many resources to prevent conflicts
• Channels, banks, ranks, data bus, address bus, row buffers
• Need to handle DRAM refresh
• Need to manage power consumption
• Need to optimize performance & QoS (in the presence of constraints)
• Reordering is not simple
• Fairness and QoS needs complicate the scheduling problem

Many DRAM Timing Constraints
[table of DDR3 timing constraints] From Lee et al., “DRAM-Aware Last-Level Cache Writeback: Reducing Write-Caused Interference in Memory Systems,” HPS Technical Report, April 2010.
(a sketch of how two of these constraints gate command issue follows below)
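To make the constraint-checking concrete, here is a minimal C sketch of how a controller might gate command issue on the two constraints named above. The cycle counts, struct fields, and function names are illustrative assumptions, not real DDR parameters or a real design.

    #define tWTR 6     /* write-to-read turnaround, in controller cycles (assumed) */
    #define tRC  39    /* activate-to-activate delay to the same bank (assumed)    */

    typedef struct {
        long last_write_cycle;     /* last WRITE affecting this rank's data bus */
        long last_activate_cycle;  /* last ACTIVATE issued to this bank         */
    } bank_state_t;

    int can_issue_read(const bank_state_t *b, long now)
    {
        return (now - b->last_write_cycle) >= tWTR;
    }

    int can_issue_activate(const bank_state_t *b, long now)
    {
        return (now - b->last_activate_cycle) >= tRC;
    }
    /* A real controller checks dozens of such constraints, across channels,
       ranks, banks, and buses, every cycle before picking a command. */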

More on DRAM Operation DRAM Controller Design Is Becoming More Difficult


 Kim et al., “A Case for Exploiting Subarray-Level Parallelism
(SALP) in DRAM,” ISCA 2012. CPU CPU CPU CPU
GPU
 Lee et al., “Tiered-Latency DRAM: A Low Latency and Low
Cost DRAM Architecture,” HPCA 2013. HWA HWA
Shared Cache

DRAM and Hybrid Memory Controllers

DRAM and Hybrid Memories

 Heterogeneous agents: CPUs, GPUs, and HWAs


 Main memory interference between CPUs, GPUs, HWAs
 Many timing constraints for various memory types
 Many goals at the same time: performance, fairness, QoS,
energy efficiency, …
1461 1462

Reality and Dream Self-Optimizing DRAM Controllers


 Reality: It difficult to optimize all these different constraints  Problem: DRAM controllers difficult to design  It is difficult for
while maximizing performance, QoS, energy-efficiency, … human designers to design a policy that can adapt itself very well
to different workloads and different system conditions

 Dream: Wouldn’t it be nice if the DRAM controller


 Idea: Design a memory controller that adapts its scheduling
automatically found a good scheduling policy on its own?
policy decisions to workload behavior and system conditions
using machine learning.

 Observation: Reinforcement learning maps nicely to memory


control.

 Design: Memory controller is a reinforcement learning agent that


dynamically and continuously learns and employs the best
scheduling policy.

Ipek+, “Self Optimizing Memory Controllers: A Reinforcement Learning Approach,” ISCA 2008.

Self-Optimizing DRAM Controllers Self-Optimizing DRAM Controllers


 Engin Ipek, Onur Mutlu, José F. Martínez, and Rich  Dynamically adapt the memory scheduling policy via
Caruana, interaction with the system at runtime
"Self Optimizing Memory Controllers: A  Associate system states and actions (commands) with long term
Reinforcement Learning Approach" reward values: each action at a given state leads to a learned reward
Proceedings of the 35th International Symposium on  Schedule command with highest estimated long-term reward value in
Computer Architecture (ISCA), pages 39-50, Beijing, each state
China, June 2008. • Continuously update reward values for <state, action> pairs based on feedback from the system
Goal: Learn to choose actions to maximize r0 + γ·r1 + γ²·r2 + …   (0 ≤ γ < 1)

1465 1466

Self-Optimizing DRAM Controllers States, Actions, Rewards


 Engin Ipek, Onur Mutlu, José F. Martínez, and Rich Caruana,
"Self Optimizing Memory Controllers: A Reinforcement Learning ❖ Reward function ❖ State attributes ❖ Actions
Approach"
Proceedings of the 35th International Symposium on Computer Architecture • +1 for scheduling • Number of reads, • Activate
(ISCA), pages 39-50, Beijing, China, June 2008. Read and Write writes, and load
commands misses in
• Write

• 0 at all other
transaction queue • Read - load miss
times • Number of pending • Read - store miss
writes and ROB
Goal is to maximize
heads waiting for • Precharge - pending
data bus
utilization
referenced row • Precharge - preemptive
• Request’s relative • NOP
ROB order

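To illustrate the reinforcement-learning flavor of such a controller, here is a plain tabular, SARSA-style value update in C: keep a value estimate per <state, action> pair, pick the legal action with the highest estimate, and update it from the observed reward. This is only a sketch; the actual design uses function approximation (CMAC-style) rather than a full table, and the state/action encodings and constants here are assumptions.

    #define NSTATES  256
    #define NACTIONS 6   /* activate, write, read-load, read-store, precharge, nop */

    double Q[NSTATES][NACTIONS];
    double alpha = 0.1, gamma_ = 0.95;   /* learning rate and discount (assumed) */

    /* Pick the legal action with the highest estimated long-term reward. */
    int choose_action(int s, const int legal[NACTIONS])
    {
        int best = -1;
        for (int a = 0; a < NACTIONS; a++)
            if (legal[a] && (best < 0 || Q[s][a] > Q[s][best])) best = a;
        return best;   /* -1 if no legal action this cycle */
    }

    /* Update the value of the previous <state, action> pair toward
       reward + discounted value of the next pair actually chosen. */
    void update(int s, int a, double reward, int s_next, int a_next)
    {
        Q[s][a] += alpha * (reward + gamma_ * Q[s_next][a_next] - Q[s][a]);
    }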

Performance Results Self Optimizing DRAM Controllers


 Advantages
+ Adapts the scheduling policy dynamically to changing workload
behavior and to maximize a long-term target
+ Reduces the designer’s burden in finding a good scheduling
policy. Designer specifies:
1) What system variables might be useful
2) What target to optimize, but not how to optimize it

 Disadvantages and Limitations


-- Black box: designer much less likely to implement what she
cannot easily reason about
-- How to specify different reward functions that can achieve
different objectives? (e.g., fairness, QoS)
-- Hardware complexity?
1469 1470

Readings on Memory Latency Tolerance


 Required
 Mutlu et al., “Runahead Execution: An Alternative to Very
Large Instruction Windows for Out-of-order Processors,” HPCA
Memory Latency Tolerance 2003.
 Srinath et al., “Feedback directed prefetching”, HPCA 2007.

 Optional
 Mutlu et al., “Efficient Runahead Execution: Power-Efficient
Memory Latency Tolerance,” ISCA 2005, IEEE Micro Top Picks
2006.
 Mutlu et al., “Address-Value Delta (AVD) Prediction,” MICRO
2005.
 Armstrong et al., “Wrong Path Events,” MICRO 2004.


Remember: Latency Tolerance Stalls due to Long-Latency Instructions


 An out-of-order execution processor tolerates latency of  When a long-latency instruction is not complete,
multi-cycle operations by executing independent it blocks instruction retirement.
instructions concurrently  Because we need to maintain precise exceptions
 It does so by buffering instructions in reservation stations and
reorder buffer
 Incoming instructions fill the instruction window (reorder
 Instruction window: Hardware resources needed to buffer all buffer, reservation stations).
decoded but not yet retired/committed instructions

 Once the window is full, processor cannot place new


 What if an instruction takes 500 cycles?
instructions into the window.
 How large of an instruction window do we need to continue
 This is called a full-window stall.
decoding?
 How many cycles of latency can OoO tolerate?
 A full-window stall prevents the processor from making
progress in the execution of the program.
1473 1474

Full-window Stall Example Cache Misses Responsible for Many Stalls


8-entry instruction window: 100
95 Non-stall (compute) time
Oldest LOAD R1  mem[R5] L2 Miss! Takes 100s of cycles. 90
85 Full-window stall time

Normalized Execution Time


BEQ R1, R0, target 80
75
ADD R2  R2, 8 70
65
LOAD R3  mem[R2] 60
Independent of the L2 miss, 55
MUL R4  R4, R3 executed out of program order, 50
ADD R4  R4, R5 but cannot be retired. 45
40
STOR mem[R2]  R4 35
30
ADD R2  R2, 64 25 L2 Misses
Younger instructions cannot be executed 20
15
LOAD R3  mem[R2] because there is no space in the instruction window. 10
5
The processor stalls until the L2 Miss is serviced. 0
128-entry window
 Long-latency cache misses are responsible for 512KB L2 cache, 500-cycle DRAM latency, aggressive stream-based prefetcher
Data averaged over 147 memory-intensive benchmarks on a high-end x86 processor model
most full-window stalls.

The Memory Latency Problem How Do We Tolerate Stalls Due to Memory?


 Problem: Memory latency is long  Two major approaches
 Reduce/eliminate stalls
 And, it is not easy to reduce it…  Tolerate the effect of a stall when it happens
 We will look at methods for reducing DRAM latency in a later
lecture  Four fundamental techniques to achieve these
 Lee et al. “Tiered-Latency DRAM,” HPCA 2013.  Caching
 Lee et al., “Adaptive-Latency DRAM,” HPCA 2014.  Prefetching
 Multithreading
 And, even if we reduce memory latency, it is still long  Out-of-order execution
 Remember the fundamental capacity-latency tradeoff
 Many techniques have been developed to make these four
 Contention for memory increases latencies
fundamental techniques more effective in tolerating
memory latency

1477 1478

Memory Latency Tolerance Techniques
- Caching [initially by Wilkes, 1965]
  - Widely used, simple, effective, but inefficient, passive
  - Not all applications/phases exhibit temporal or spatial locality
- Prefetching [initially in IBM 360/91, 1967]
  - Works well for regular memory access patterns
  - Prefetching irregular access patterns is difficult, inaccurate, and hardware-intensive
- Multithreading [initially in CDC 6600, 1964]
  - Works well if there are multiple threads
  - Improving single thread performance using multithreading hardware is an ongoing research effort
- Out-of-order execution [initially by Tomasulo, 1967]
  - Tolerates irregular cache misses that cannot be prefetched
  - Requires extensive hardware resources for tolerating long latencies
  - Runahead execution alleviates this problem (as we will see today)

Runahead Execution

Small Windows: Full-window Stalls
(Same 8-entry instruction window example as before: the oldest LOAD R1 ← mem[R5] is an L2 miss that takes 100s of cycles; independent younger instructions execute out of program order but cannot be retired, newer instructions cannot enter the full window, and the processor stalls until the L2 miss is serviced.)
- Long-latency cache misses are responsible for most full-window stalls.

Impact of Long-Latency Cache Misses
[Chart: normalized execution time for a 128-entry window, split into non-stall (compute) time and full-window stall time due to L2 misses]
- 512KB L2 cache, 500-cycle DRAM latency, aggressive stream-based prefetcher
- Data averaged over 147 memory-intensive benchmarks on a high-end x86 processor model

Impact of Long-Latency Cache Misses
[Chart: normalized execution time with a 128-entry window vs. a 2048-entry window, split into non-stall (compute) time and full-window stall time due to L2 misses]
- 500-cycle DRAM latency, aggressive stream-based prefetcher
- Data averaged over 147 memory-intensive benchmarks on a high-end x86 processor model

The Problem
- Out-of-order execution requires large instruction windows to tolerate today's main memory latencies.
- As main memory latency increases, instruction window size should also increase to fully tolerate the memory latency.
- Building a large instruction window is a challenging task if we would like to achieve
  - Low power/energy consumption (tag matching logic, ld/st buffers)
  - Short cycle time (access, wakeup/select latencies)
  - Low design and verification complexity

Efficient Scaling of Instruction Window Size
- One of the major research issues in out-of-order execution
- How to achieve the benefits of a large window with a small one (or in a simpler way)?
- How do we efficiently tolerate memory latency with the machinery of out-of-order execution (and a small instruction window)?

Memory Level Parallelism (MLP)
- Idea: Find and service multiple cache misses in parallel so that the processor stalls only once for all misses
[Timeline: an isolated miss (A) is serviced alone, while parallel misses (B and C) are overlapped in time]
- Enables latency tolerance: overlaps latency of different misses
- How to generate multiple misses?
  - Out-of-order execution, multithreading, prefetching, runahead

Runahead Execution (I)
- A technique to obtain the memory-level parallelism benefits of a large instruction window
- When the oldest instruction is a long-latency cache miss:
  - Checkpoint architectural state and enter runahead mode
- In runahead mode:
  - Speculatively pre-execute instructions
  - The purpose of pre-execution is to generate prefetches
  - L2-miss dependent instructions are marked INV and dropped
- Runahead mode ends when the original miss returns
  - Checkpoint is restored and normal execution resumes
- Mutlu et al., "Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors," HPCA 2003.

Runahead Example
- Perfect caches: Load 1 hits and Load 2 hits, so compute proceeds without stalling
- Small window: compute, then stall on Load 1 Miss (Miss 1); compute, then stall again on Load 2 Miss (Miss 2)
- Runahead: on Load 1 Miss, runahead pre-execution discovers Load 2 Miss, so Miss 1 and Miss 2 overlap; when normal execution resumes from the checkpoint, both loads hit → saved cycles

Benefits of Runahead Execution
- Instead of stalling during an L2 cache miss:
  - Pre-executed loads and stores independent of L2-miss instructions generate very accurate data prefetches:
    - For both regular and irregular access patterns
  - Instructions on the predicted program path are prefetched into the instruction/trace cache and L2.
  - Hardware prefetcher and branch predictor tables are trained using future access information.

Runahead Execution Mechanism
- Entry into runahead mode
  - Checkpoint architectural register state
- Instruction processing in runahead mode
- Exit from runahead mode
  - Restore architectural register state from checkpoint
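To make the entry/exit sequence concrete, here is a minimal C-style sketch of the control flow described above. Every identifier (checkpoint_regs, miss_serviced, at_window_head, and so on) is a hypothetical stand-in for the hardware structures on these slides, not a real simulator API.

    /* Sketch of runahead entry/exit around the oldest load in the window. */
    typedef struct { int pc; int is_load; int l2_miss; int src_inv; } uop_t;

    extern void checkpoint_regs(void);   /* save architectural register state   */
    extern void restore_regs(void);      /* restore state saved at checkpoint   */
    extern void flush_pipeline(void);    /* discard all runahead-mode results   */
    extern int  miss_serviced(int pc);   /* has the blocking L2 miss returned?  */

    static int in_runahead = 0;
    static int blocking_pc = -1;

    /* Called when a uop reaches the head of the instruction window. */
    void at_window_head(uop_t *u)
    {
        if (!in_runahead && u->is_load && u->l2_miss) {
            checkpoint_regs();           /* enter runahead on the stalling load  */
            blocking_pc = u->pc;
            in_runahead = 1;
        } else if (in_runahead && miss_serviced(blocking_pc)) {
            flush_pipeline();            /* runahead results are never committed */
            restore_regs();              /* resume normal execution at checkpoint */
            in_runahead = 0;
        }
    }

    /* In runahead mode, miss-dependent uops produce INV results and are dropped;
     * everything else executes normally and can generate useful prefetches. */
    int runahead_result_is_inv(const uop_t *u)
    {
        return u->src_inv || (u->is_load && u->l2_miss);
    }

The essential property the sketch preserves is that nothing executed in runahead mode is ever committed: its only lasting effect is the prefetches (and predictor training) it generates.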

Instruction Processing in Runahead Mode L2-Miss Dependent Instructions


Load 1 Miss Load 1 Miss

Compute Runahead Compute Runahead


Miss 1 Miss 1

Runahead mode processing is the same as


normal instruction processing, EXCEPT:  Two types of results produced: INV and VALID

 It is purely speculative: Architectural (software-visible)  INV = Dependent on an L2 miss


register/memory state is NOT updated in runahead mode.
 INV results are marked using INV bits in the register file and
 L2-miss dependent instructions are identified and treated store buffer.
specially.
 They are quickly removed from the instruction window.  INV values are not used for prefetching/branch resolution.
 Their results are not trusted.
10-11-2023

Removal of Instructions from Window
- Oldest instruction is examined for pseudo-retirement
  - An INV instruction is removed from the window immediately.
  - A VALID instruction is removed when it completes execution.
- Pseudo-retired instructions free their allocated resources.
  - This allows the processing of later instructions.
- Pseudo-retired stores communicate their data to dependent loads.

Store/Load Handling in Runahead Mode
- A pseudo-retired store writes its data and INV status to a dedicated memory, called the runahead cache.
  - Purpose: data communication through memory in runahead mode.
- A dependent load reads its data from the runahead cache.
- Does not need to be always correct → the runahead cache can be very small.
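A small sketch of how such a runahead cache could forward store data and INV status to later runahead loads is shown below. The direct-mapped organization, the entry count, and the field names are illustrative assumptions; the slide only specifies that the structure is small and need not always be correct.

    #include <stdint.h>

    #define RA_CACHE_ENTRIES 256   /* assumed size; the point is "very small" */

    typedef struct { uint64_t addr; uint64_t data; int inv; int valid; } ra_line_t;
    static ra_line_t ra_cache[RA_CACHE_ENTRIES];

    static unsigned ra_index(uint64_t addr) { return (addr >> 3) & (RA_CACHE_ENTRIES - 1); }

    /* Pseudo-retired store in runahead mode: record data and INV status. */
    void ra_store(uint64_t addr, uint64_t data, int inv)
    {
        ra_line_t *l = &ra_cache[ra_index(addr)];
        l->addr = addr; l->data = data; l->inv = inv; l->valid = 1;
    }

    /* Runahead load: forward from the runahead cache if the address matches.
     * A miss here simply falls back to the regular memory hierarchy; exact
     * correctness is not needed, since runahead results are never committed. */
    int ra_load(uint64_t addr, uint64_t *data, int *inv)
    {
        ra_line_t *l = &ra_cache[ra_index(addr)];
        if (l->valid && l->addr == addr) { *data = l->data; *inv = l->inv; return 1; }
        return 0;
    }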
Branch Handling in Runahead Mode A Runahead Processor Diagram


Load 1 Miss Mutlu+, “Runahead Execution,”
HPCA 2003.

Compute Runahead
Miss 1

 INV branches cannot be resolved.


 A mispredicted INV branch causes the processor to stay on the wrong
program path until the end of runahead execution.

 VALID branches are resolved and initiate recovery if mispredicted.


Runahead Execution Pros and Cons
- Advantages:
  + Very accurate prefetches for data/instructions (all cache levels)
  + Follows the program path
  + Simple to implement, most of the hardware is already built in
  + Versus other pre-execution based prefetching mechanisms (as we will see):
    + Uses the same thread context as the main thread, no waste of context
    + No need to construct a pre-execution thread
- Disadvantages/Limitations:
  -- Extra executed instructions
  -- Limited by branch prediction accuracy
  -- Cannot prefetch dependent cache misses
  -- Effectiveness limited by available "memory-level parallelism" (MLP)
  -- Prefetch distance limited by memory latency
- Implemented in IBM POWER6, Sun "Rock"

Performance of Runahead Execution
[Chart: micro-operations per cycle for S95, FP00, INT00, WEB, MM, PROD, SERV, WS, and the average, comparing no prefetcher/no runahead, prefetcher only (baseline), runahead only, and prefetcher + runahead; runahead on top of the prefetcher baseline gives double-digit improvements, with annotated gains ranging from roughly 12% to 52%]

Runahead Execution vs. Large Windows
[Chart: micro-operations per cycle for S95, FP00, INT00, WEB, MM, PROD, SERV, WS, and the average, comparing a 128-entry window (baseline), a 128-entry window with runahead, and 256-, 384-, and 512-entry windows]

Runahead vs. A (Real) Large Window
- When is one beneficial, when is the other?
  - Pros and cons of each
- Which can tolerate FP operation latencies better?
- Which leads to less wasted execution?

Runahead on In-order vs. Out-of-order
[Chart: micro-operations per cycle for S95, FP00, INT00, WEB, MM, PROD, SERV, WS, and the average, comparing in-order baseline, in-order + runahead, out-of-order baseline, and out-of-order + runahead; runahead provides double-digit improvements on both in-order and out-of-order baselines, with annotated gains ranging from roughly 10% to 73%]

Effect of Runahead in Sun ROCK
- Shailender Chaudhry talk, Aug 2008.

Where We Are in Lecture Schedule


18-447  The memory hierarchy
 Caches, caches, more caches
Computer Architecture  Virtualizing the memory hierarchy: Virtual Memory
Lecture 25: Memory Latency Tolerance II:  Main memory: DRAM
Prefetching  Main memory control, scheduling
 Memory latency tolerance techniques
 Non-volatile memory

Prof. Onur Mutlu  Multiprocessors


Carnegie Mellon University  Coherence and consistency
Spring 2015, 4/1/2015  Interconnection networks
 Multi-core issues (e.g., heterogeneous multi-core)

Upcoming Seminar on DRAM (April 3) Required Reading


 April 3, Friday, 11am-noon, GHC 8201  Onur Mutlu, Justin Meza, and Lavanya Subramanian,
 Prof. Moinuddin Qureshi, Georgia Tech "The Main Memory System: Challenges and
 Lead author of “MLP-Aware Cache Replacement” Opportunities"
Invited Article in Communications of the Korean Institute of
 Architecting 3D Memory Systems
Information Scientists and Engineers (KIISE), 2015.
 Die stacked 3D DRAM technology can provide low-energy high-
bandwidth memory module by vertically integrating several dies
within the same chip. (…) In this talk, I will discuss how memory http://users.ece.cmu.edu/~omutlu/pub/main-memory-
systems can efficiently architect 3D DRAM either as a cache or as
system_kiise15.pdf
main memory. First, I will show that some of the basic design
decisions typically made for conventional caches (such as
serialization of tag and data access, large associativity, and update
of replacement state) are detrimental to the performance of DRAM
caches, as they exacerbate hit latency. (…) Finally, I will present a
memory organization that allows 3D DRAM to be a part of the OS-
visible memory address space, and yet relieves the OS from data
migration duties. (…)”

Cache Misses Responsible for Many Stalls
[Chart: normalized execution time for a 128-entry window, split into non-stall (compute) time and full-window stall time due to L2 misses]
- 512KB L2 cache, 500-cycle DRAM latency, aggressive stream-based prefetcher
- Data averaged over 147 memory-intensive benchmarks on a high-end x86 processor model

Tolerating Memory Latency

Memory Latency Tolerance Techniques Runahead Execution (I)


 Caching [initially by Wilkes, 1965]  A technique to obtain the memory-level parallelism benefits
 Widely used, simple, effective, but inefficient, passive of a large instruction window
 Not all applications/phases exhibit temporal or spatial locality

 Prefetching [initially in IBM 360/91, 1967]  When the oldest instruction is a long-latency cache miss:
 Works well for regular memory access patterns  Checkpoint architectural state and enter runahead mode
 Prefetching irregular access patterns is difficult, inaccurate, and hardware-
intensive  In runahead mode:
 Multithreading [initially in CDC 6600, 1964]
 Speculatively pre-execute instructions
 Works well if there are multiple threads  The purpose of pre-execution is to generate prefetches
Improving single thread performance using multithreading hardware is an

ongoing research effort  L2-miss dependent instructions are marked INV and dropped
 Runahead mode ends when the original miss returns
 Out-of-order execution [initially by Tomasulo, 1967]
 Tolerates irregular cache misses that cannot be prefetched  Checkpoint is restored and normal execution resumes
 Requires extensive hardware resources for tolerating long latencies
 Runahead execution alleviates this problem (as we will see today)  Mutlu et al., “Runahead Execution: An Alternative to Very Large
Instruction Windows for Out-of-order Processors,” HPCA 2003.

Runahead Example
Perfect Caches:
Load 1 Hit Load 2 Hit

Compute Compute

Small Window: Runahead Enhancements


Load 1 Miss Load 2 Miss

Compute Stall Compute Stall


Miss 1 Miss 2

Runahead:
Load 1 Miss Load 2 Miss Load 1 Hit Load 2 Hit

Compute Runahead Compute


Saved Cycles
Miss 1

Miss 2

Readings
- Required
  - Mutlu et al., "Runahead Execution," HPCA 2003, Top Picks 2003.
- Recommended
  - Mutlu et al., "Efficient Runahead Execution: Power-Efficient Memory Latency Tolerance," ISCA 2005, IEEE Micro Top Picks 2006.
  - Mutlu et al., "Address-Value Delta (AVD) Prediction," MICRO 2005.
  - Armstrong et al., "Wrong Path Events," MICRO 2004.

Limitations of the Baseline Runahead Mechanism
- Energy inefficiency
  - A large number of instructions are speculatively executed
  - Efficient Runahead Execution [ISCA'05, IEEE Micro Top Picks'06]
- Ineffectiveness for pointer-intensive applications
  - Runahead cannot parallelize dependent L2 cache misses
  - Address-Value Delta (AVD) Prediction [MICRO'05]
- Irresolvable branch mispredictions in runahead mode
  - Cannot recover from a mispredicted L2-miss dependent branch
  - Wrong Path Events [MICRO'04]

The Problem: Dependent Cache Misses Parallelizing Dependent Cache Misses


Runahead: Load 2 is dependent on Load 1  Idea: Enable the parallelization of dependent L2 cache
Cannot Compute Its Address! misses in runahead mode with a low-cost mechanism

Load 1 Miss Load 2 INV Load 1 Hit Load 2 Miss

 How: Predict the values of L2-miss address (pointer)


Compute Runahead
loads
Miss 1 Miss 2
 Address load: loads an address into its destination register,
 Runahead execution cannot parallelize dependent misses which is later used to calculate the address of another load
 wasted opportunity to improve performance  as opposed to data load
 wasted energy (useless pre-execution)
 Read:
 Runahead performance would improve by 25% if this  Mutlu et al., “Address-Value Delta (AVD) Prediction,” MICRO
limitation were ideally overcome 2005.
10-11-2023

Parallelizing Dependent Cache Misses
- Without value prediction: Load 2 is INV in runahead mode because its address cannot be computed, so Miss 1 and Miss 2 are serialized.
- With the value of Load 1 predicted, Load 2's address can be computed in runahead mode, so Miss 2 overlaps with Miss 1 → saved cycles and saved speculative instructions.

AVD Prediction [MICRO'05]
- Address-value delta (AVD) of a load instruction is defined as:
  AVD = Effective Address of Load – Data Value of Load
- For some address loads, the AVD is stable
- An AVD predictor keeps track of the AVDs of address loads
- When a load is an L2 miss in runahead mode, the AVD predictor is consulted
- If the predictor returns a stable (confident) AVD for that load, the value of the load is predicted:
  Predicted Value = Effective Address – Predicted AVD

Why Do Stable AVDs Occur?
- Regularity in the way data structures are
  - allocated in memory AND
  - traversed
- Two types of loads can have stable AVDs
  - Traversal address loads
    - Produce addresses consumed by address loads
  - Leaf address loads
    - Produce addresses consumed by data loads

Traversal Address Loads
- Regularly-allocated linked list: a traversal address load loads the pointer to the next node: node = node→next
- Example (nodes allocated k bytes apart at A, A+k, A+2k, A+3k, …), AVD = Effective Addr – Data Value:

  Effective Addr   Data Value   AVD
  A                A+k          -k
  A+k              A+2k         -k
  A+2k             A+3k         -k

- The data value strides, but the AVD is stable.

Leaf Address Loads
- Sorted dictionary in parser: the dictionary is looked up for an input word.
- Nodes point to strings (words); string and node are allocated consecutively.
- A leaf address load loads the pointer to the string of each node:

  lookup (node, input) { // ...
      ptr_str = node→string;
      m = check_match(ptr_str, input);
      // …
  }

- Example (binary tree of nodes A–G, each with its string allocated k bytes after the node), AVD = Effective Addr – Data Value:

  Effective Addr   Data Value   AVD
  A+k              A            k
  C+k              C            k
  F+k              F            k

- No stride in the addresses, but the AVD is stable.

Identifying Address Loads in Hardware
- Insight: if the AVD is too large, the value that is loaded is likely not an address
- Only keep track of loads that satisfy: -MaxAVD ≤ AVD ≤ +MaxAVD
- This identification mechanism eliminates many loads from consideration for prediction
  - No need to value-predict the loads that will not generate addresses
  - Enables the predictor to be small
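Putting the last few slides together, below is a hedged C sketch of an AVD predictor that applies the MaxAVD filter and a confidence-based prediction rule. The table size, confidence threshold, and MaxAVD value are assumed parameters for illustration, not the values used in the MICRO 2005 paper.

    #include <stdint.h>

    #define AVD_ENTRIES 16         /* small table indexed by load PC (assumed)   */
    #define MAX_AVD     (64*1024)  /* only small deltas can be pointer-like      */
    #define CONF_THRESH 2          /* predict only when the AVD has repeated     */

    typedef struct { uint64_t pc; int64_t avd; int conf; int valid; } avd_entry_t;
    static avd_entry_t avd_table[AVD_ENTRIES];

    static unsigned avd_index(uint64_t pc) { return pc % AVD_ENTRIES; }

    /* Training: called when an address load completes with (addr, value). */
    void avd_update(uint64_t pc, uint64_t addr, uint64_t value)
    {
        int64_t avd = (int64_t)(addr - value);
        int64_t mag = (avd < 0) ? -avd : avd;
        if (mag > MAX_AVD) return;                 /* too large: not an address load */
        avd_entry_t *e = &avd_table[avd_index(pc)];
        if (e->valid && e->pc == pc && e->avd == avd) {
            if (e->conf < CONF_THRESH) e->conf++;  /* AVD repeated: gain confidence  */
        } else {
            e->pc = pc; e->avd = avd; e->conf = 0; e->valid = 1;
        }
    }

    /* Prediction: called for an L2-miss load in runahead mode.  Returns 1 and a
     * predicted value if a confident, stable AVD is known for this load PC. */
    int avd_predict(uint64_t pc, uint64_t addr, uint64_t *pred_value)
    {
        avd_entry_t *e = &avd_table[avd_index(pc)];
        if (e->valid && e->pc == pc && e->conf >= CONF_THRESH) {
            *pred_value = addr - (uint64_t)e->avd;  /* value = addr - AVD */
            return 1;
        }
        return 0;
    }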

Performance of AVD Prediction
[Chart: normalized execution time and normalized executed instructions for a runahead baseline vs. runahead + AVD prediction on pointer-intensive benchmarks; the annotated average reductions are 14.3% and 15.5%]

Prefetching

Outline of Prefetching Lecture(s) Prefetching


 Why prefetch? Why could/does it work?  Idea: Fetch the data before it is needed (i.e. pre-fetch) by
 The four questions the program
 What (to prefetch), when, where, how
 Software prefetching  Why?
 Hardware prefetching algorithms  Memory latency is high. If we can prefetch accurately and
early enough we can reduce/eliminate that latency.
 Execution-based prefetching
 Can eliminate compulsory cache misses
 Prefetching performance
 Can it eliminate all cache misses? Capacity, conflict?
 Coverage, accuracy, timeliness
 Bandwidth consumption, cache pollution
 Involves predicting which address will be needed in the
 Prefetcher throttling
future
 Issues in multi-core (if we get to it)
 Works if programs have predictable miss address patterns


Prefetching and Correctness Basics


 Does a misprediction in prefetching affect correctness?  In modern systems, prefetching is usually done in cache
block granularity
 No, prefetched data at a “mispredicted” address is simply
not used  Prefetching is a technique that can reduce both
 There is no need for state recovery  Miss rate
 In contrast to branch misprediction or value misprediction  Miss latency

 Prefetching can be done by


 hardware
 compiler
 programmer


How a HW Prefetcher Fits in the Memory System Prefetching: The Four Questions
 What
 What addresses to prefetch

 When
 When to initiate a prefetch request

 Where
 Where to place the prefetched data

 How
 Software, hardware, execution-based, cooperative


Challenges in Prefetching: What Challenges in Prefetching: When


 What addresses to prefetch  When to initiate a prefetch request
 Prefetching useless data wastes resources  Prefetching too early
 Memory bandwidth  Prefetched data might not be used before it is evicted from
 Cache or prefetch buffer space storage
 Energy consumption  Prefetching too late
 These could all be utilized by demand requests or more accurate  Might not hide the whole memory latency
prefetch requests
 Accurate prediction of addresses to prefetch is important  When a data item is prefetched affects the timeliness of the
 Prefetch accuracy = used prefetches / sent prefetches prefetcher
 How do we know what to prefetch  Prefetcher can be made more timely by
 Predict based on past access patterns  Making it more aggressive: try to stay far ahead of the
 Use the compiler’s knowledge of data structures processor’s access stream (hardware)
 Moving the prefetch instructions earlier in the code (software)
 Prefetching algorithm determines what to prefetch

Challenges in Prefetching: Where (I) Challenges in Prefetching: Where (II)


 Where to place the prefetched data
 Which level of cache to prefetch into?
 In cache
 Memory to L2, memory to L1. Advantages/disadvantages?
+ Simple design, no need for separate buffers
 L2 to L1? (a separate prefetcher between levels)
-- Can evict useful demand data  cache pollution
 In a separate prefetch buffer
+ Demand data protected from prefetches  no cache pollution  Where to place the prefetched data in the cache?
-- More complex memory system design  Do we treat prefetched blocks the same as demand-fetched
- Where to place the prefetch buffer blocks?
- When to access the prefetch buffer (parallel vs. serial with cache)  Prefetched blocks are not known to be needed
- When to move the data from the prefetch buffer to cache  With LRU, a demand block is placed into the MRU position
- How to size the prefetch buffer
- Keeping the prefetch buffer coherent  Do we skew the replacement policy such that it favors the
demand-fetched blocks?
 Many modern systems place prefetched data into the cache
 E.g., place all prefetches into the LRU position in a way?
 Intel Pentium 4, Core2’s, AMD systems, IBM POWER4,5,6, …

Challenges in Prefetching: Where (III) Challenges in Prefetching: How


 Where to place the hardware prefetcher in the memory  Software prefetching
hierarchy?  ISA provides prefetch instructions
 In other words, what access patterns does the prefetcher see?  Programmer or compiler inserts prefetch instructions (effort)
 L1 hits and misses  Usually works well only for “regular access patterns”
 L1 misses only
 L2 misses only  Hardware prefetching
 Hardware monitors processor accesses
 Seeing a more complete access pattern:  Memorizes or finds patterns/strides
+ Potentially better accuracy and coverage in prefetching  Generates prefetch addresses automatically
-- Prefetcher needs to examine more requests (bandwidth
intensive, more ports into the prefetcher?)  Execution-based prefetchers
 A “thread” is executed to prefetch data for the main program
 Can be generated by either software/programmer or hardware

Software Prefetching (I)
- Idea: Compiler/programmer places prefetch instructions into appropriate places in code
- Mowry et al., "Design and Evaluation of a Compiler Algorithm for Prefetching," ASPLOS 1992.
- Prefetch instructions prefetch data into caches
- Compiler or programmer can insert such instructions into the program

X86 PREFETCH Instruction
[Instruction reference excerpt: the prefetch hint is a microarchitecture-dependent specification; different instructions (hints) prefetch into different cache levels]
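For reference, a compiler or programmer targeting x86 can emit these prefetches through the _mm_prefetch intrinsic (declared in xmmintrin.h), choosing the cache level via the _MM_HINT_* hints. The sketch below is only illustrative; the prefetch distance of 16 elements is an assumed tuning knob that, as the following slides discuss, depends on memory latency and on the time per loop iteration.

    #include <xmmintrin.h>   /* _mm_prefetch and the _MM_HINT_* locality hints */
    #include <stddef.h>

    /* Dot product with software prefetching a fixed distance ahead. */
    double dot(const double *a, const double *b, size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + 16 < n) {
                _mm_prefetch((const char *)&a[i + 16], _MM_HINT_T0);  /* toward L1 */
                _mm_prefetch((const char *)&b[i + 16], _MM_HINT_T0);
            }
            sum += a[i] * b[i];
        }
        return sum;
    }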

Software Prefetching (II) Software Prefetching (III)


for (i=0; i<N; i++) { while (p) { while (p) {  Where should a compiler insert prefetches?
__prefetch(a[i+8]); __prefetch(pnext); __prefetch(pnextnextnext);
__prefetch(b[i+8]); work(pdata); work(pdata);
sum += a[i]*b[i]; p = pnext; p = pnext;  Prefetch for every load access?
} } }  Too bandwidth intensive (both memory and execution bandwidth)
Which one is better?
 Can work for very regular array-based access patterns. Issues:
 Profile the code and determine loads that are likely to miss
-- Prefetch instructions take up processing/execution bandwidth
 What if profile input set is not representative?
 How early to prefetch? Determining this is difficult
-- Prefetch distance depends on hardware implementation (memory latency,
cache size, time between loop iterations)  portability?  How far ahead before the miss should the prefetch be inserted?
-- Going too far back in code reduces accuracy (branches in between)  Profile and determine probability of use for various prefetch
 Need “special” prefetch instructions in ISA? distances from the miss
 Alpha load into register 31 treated as prefetch (r31==0)  What if profile input set is not representative?
 PowerPC dcbt (data cache block touch) instruction  Usually need to insert a prefetch far in advance to cover 100s of cycles
of main memory latency  reduced accuracy
-- Not easy to do for pointer-based data structures
1539 1540
10-11-2023

Hardware Prefetching (I) Next-Line Prefetchers


 Idea: Specialized hardware observes load/store access  Simplest form of hardware prefetching: always prefetch next
patterns and prefetches data based on past access behavior N cache lines after a demand access (or a demand miss)
 Next-line prefetcher (or next sequential prefetcher)
 Tradeoffs:  Tradeoffs:
+ Can be tuned to system implementation + Simple to implement. No need for sophisticated pattern detection
+ Works well for sequential/streaming access patterns (instructions?)
+ Does not waste instruction execution bandwidth
-- Can waste bandwidth with irregular patterns
-- More hardware complexity to detect patterns
-- And, even regular patterns:
- Software can be more efficient in some cases
- What is the prefetch accuracy if access stride = 2 and N = 1?
- What if the program is traversing memory from higher to lower
addresses?
- Also prefetch “previous” N cache lines?


Stride Prefetchers
- Two kinds
  - Instruction program counter (PC) based
  - Cache block address based
- Instruction based:
  - Baer and Chen, "An effective on-chip preloading scheme to reduce data access penalty," SC 1991.
  - Idea:
    - Record the distance between the memory addresses referenced by a load instruction (i.e., the stride of the load) as well as the last address referenced by the load
    - Next time the same load instruction is fetched, prefetch last address + stride

Instruction Based Stride Prefetching
- Table indexed by load instruction PC; each entry holds: Load Inst PC (tag) | Last Address Referenced | Last Stride | Confidence
- What is the problem with this?
  - How far can the prefetcher get ahead of the demand access stream?
  - Initiating the prefetch when the load is fetched the next time can be too late
    - The load will access the data cache soon after it is fetched!
- Solutions:
  - Use a lookahead PC to index the prefetcher table (decouple frontend of the processor from backend)
  - Prefetch ahead (last address + N*stride)
  - Generate multiple prefetches
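A minimal C sketch of the PC-indexed table and update rule described above is given below; the table size, the confidence threshold, and the prefetch-ahead degree N are assumptions, and issue_prefetch stands in for the memory system interface.

    #include <stdint.h>

    #define STRIDE_ENTRIES 64
    #define PREFETCH_AHEAD 4   /* prefetch last_addr + N*stride to stay ahead */

    typedef struct { uint64_t tag; uint64_t last_addr; int64_t stride; int conf; } sp_entry_t;
    static sp_entry_t sp_table[STRIDE_ENTRIES];

    extern void issue_prefetch(uint64_t addr);  /* stand-in for the memory system */

    /* Called on every load with its PC and effective address. */
    void stride_prefetcher_access(uint64_t pc, uint64_t addr)
    {
        sp_entry_t *e = &sp_table[pc % STRIDE_ENTRIES];
        if (e->tag == pc) {
            int64_t stride = (int64_t)(addr - e->last_addr);
            if (stride == e->stride && stride != 0) {
                if (e->conf < 3) e->conf++;
                if (e->conf >= 2)               /* confident: prefetch ahead */
                    issue_prefetch(addr + (uint64_t)(PREFETCH_AHEAD * stride));
            } else {
                e->stride = stride;             /* stride changed: retrain    */
                e->conf = 0;
            }
            e->last_addr = addr;
        } else {
            e->tag = pc; e->last_addr = addr; e->stride = 0; e->conf = 0;
        }
    }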

Cache-Block Address Based Stride Prefetching
- Table indexed by cache block address; each entry holds: Address tag | Stride | Control/Confidence
- Can detect
  - A, A+N, A+2N, A+3N, …
- Stream buffers are a special case of cache-block-address-based stride prefetching where N = 1

Stream Buffers (Jouppi, ISCA 1990)
- Each stream buffer holds one stream of sequentially prefetched cache lines (a FIFO between the data cache and the memory interface)
- On a load miss, check the head of all stream buffers for an address match
  - If hit, pop the entry from the FIFO and update the cache with the data
  - If not, allocate a new stream buffer to the new miss address (may have to recycle a stream buffer following an LRU policy)
- Stream buffer FIFOs are continuously topped off with subsequent cache lines whenever there is room and the bus is not busy
- Jouppi, "Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers," ISCA 1990.

Stream Buffer Design
[Figure slides: stream buffer organization]

Prefetcher Performance (I)
- Accuracy (used prefetches / sent prefetches)
- Coverage (prefetched misses / all misses)
- Timeliness (on-time prefetches / used prefetches)
- Bandwidth consumption
  - Memory bandwidth consumed with prefetcher / without prefetcher
  - Good news: can utilize idle bus bandwidth (if available)
- Cache pollution
  - Extra demand misses due to prefetch placement in cache
  - More difficult to quantify, but affects performance

Prefetcher Performance (II)
- Prefetcher aggressiveness affects all performance metrics
- Aggressiveness is dependent on prefetcher type
- For most hardware prefetchers:
  - Prefetch distance: how far ahead of the demand stream
  - Prefetch degree: how many prefetches per demand access
[Figure: relative to the demand access stream (X, X+1, …), the predicted stream starts "prefetch distance" ahead and issues "prefetch degree" prefetches; a very conservative prefetcher stays close to the access stream, a middle-of-the-road one farther ahead, and a very aggressive one predicts far ahead (up to Pmax)]
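The three ratio metrics above can be computed from simple hardware (or simulator) counters, as in the sketch below; the counter names are illustrative.

    typedef struct {
        unsigned long sent_prefetches;    /* prefetch requests issued            */
        unsigned long used_prefetches;    /* prefetched blocks later demanded    */
        unsigned long late_prefetches;    /* used, but arrived after the demand  */
        unsigned long demand_misses;      /* misses of the no-prefetch baseline  */
        unsigned long covered_misses;     /* misses eliminated by prefetching    */
    } pf_counters_t;

    static double accuracy(const pf_counters_t *c)
    { return c->sent_prefetches ? (double)c->used_prefetches / c->sent_prefetches : 0.0; }

    static double coverage(const pf_counters_t *c)
    { return c->demand_misses ? (double)c->covered_misses / c->demand_misses : 0.0; }

    static double timeliness(const pf_counters_t *c)   /* on-time / used */
    { return c->used_prefetches
          ? (double)(c->used_prefetches - c->late_prefetches) / c->used_prefetches : 0.0; }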

Prefetcher Performance (III)
- How do these metrics interact?
- Very aggressive prefetcher (large prefetch distance & degree)
  - Well ahead of the load access stream
  - Hides memory access latency better
  - More speculative
  + Higher coverage, better timeliness
  -- Likely lower accuracy, higher bandwidth and pollution
- Very conservative prefetcher (small prefetch distance & degree)
  - Closer to the load access stream
  - Might not hide memory access latency completely
  - Reduces potential for cache pollution and bandwidth contention
  + Likely higher accuracy, lower bandwidth, less polluting
  -- Likely lower coverage and less timely

Prefetcher Performance (IV)
[Scatter plot: percentage IPC change over no prefetching (roughly -100% to +400%) plotted against prefetcher accuracy (0 to 1) for many prefetcher configurations]

Prefetcher Performance (V)
[Chart: instructions per cycle per benchmark and on average for no prefetching, very conservative, middle-of-the-road, and very aggressive stream prefetching (annotated values: 48% and 29%)]
- Srinath et al., "Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers," HPCA 2007.

Feedback-Directed Prefetcher Throttling (I)
- Idea:
  - Dynamically monitor prefetcher performance metrics
  - Throttle the prefetcher aggressiveness up/down based on past performance
  - Change the location prefetches are inserted in the cache based on past performance
[Decision diagram: sampled accuracy (high/medium/low), lateness (late/not-late), and pollution (polluting/not-polluting) determine whether aggressiveness is increased, decreased, or left unchanged]
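A hedged sketch of the throttling loop is shown below: each sampling interval, the measured accuracy, lateness, and pollution move the aggressiveness (prefetch distance/degree) up or down. The thresholds, the five-level aggressiveness scale, and the exact decision rules here are placeholders, not the precise configuration from Srinath et al., HPCA 2007.

    typedef struct { double accuracy, lateness, pollution; } pf_feedback_t;

    static int aggressiveness = 3;   /* 1 = very conservative ... 5 = very aggressive */

    /* Called once per sampling interval with the metrics measured in hardware. */
    void fdp_update(const pf_feedback_t *f)
    {
        int high_acc = (f->accuracy > 0.75);
        int low_acc  = (f->accuracy < 0.40);
        int late     = (f->lateness  > 0.10);
        int polluting= (f->pollution > 0.05);

        if (high_acc && late && !polluting)
            aggressiveness++;        /* accurate but late: reach further ahead   */
        else if (polluting || (low_acc && late))
            aggressiveness--;        /* hurting the cache or wasting bandwidth   */
        /* otherwise: leave the aggressiveness unchanged this interval           */

        if (aggressiveness < 1) aggressiveness = 1;
        if (aggressiveness > 5) aggressiveness = 5;
    }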

Feedback-Directed Prefetcher Throttling (II)
[Results chart (annotated values: 11% and 13%)]
- Srinath et al., "Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers," HPCA 2007.

How to Prefetch More Irregular Access Patterns?
- Regular patterns: stride, stream prefetchers do well
- More irregular access patterns
  - Indirect array accesses
  - Linked data structures
  - Multiple regular strides (1,2,3,1,2,3,1,2,3,…)
  - Random patterns?
  - Generalized prefetcher for all patterns?
- Correlation based prefetchers
- Content-directed prefetchers
- Precomputation or execution-based prefetchers

Where We Are in Lecture Schedule


18-447  The memory hierarchy
 Caches, caches, more caches
Computer Architecture  Virtualizing the memory hierarchy: Virtual Memory
Lecture 26: Prefetching &  Main memory: DRAM
Emerging Memory Technologies  Main memory control, scheduling
 Memory latency tolerance techniques
 Non-volatile memory

Prof. Onur Mutlu  Multiprocessors


Carnegie Mellon University  Coherence and consistency
Spring 2015, 4/3/2015  Interconnection networks
 Multi-core issues (e.g., heterogeneous multi-core)

Required Reading
 Onur Mutlu, Justin Meza, and Lavanya Subramanian,
"The Main Memory System: Challenges and
Opportunities"
Invited Article in Communications of the Korean Institute of Prefetching
Information Scientists and Engineers (KIISE), 2015.

http://users.ece.cmu.edu/~omutlu/pub/main-memory-system_kiise15.pdf


Review: Outline of Prefetching Lecture(s) Review: How to Prefetch More Irregular Access Patterns?

 Why prefetch? Why could/does it work?  Regular patterns: Stride, stream prefetchers do well
 The four questions  More irregular access patterns
 What (to prefetch), when, where, how  Indirect array accesses
 Software prefetching  Linked data structures
 Hardware prefetching algorithms  Multiple regular strides (1,2,3,1,2,3,1,2,3,…)
 Execution-based prefetching  Random patterns?
Generalized prefetcher for all patterns?
 Prefetching performance 

 Coverage, accuracy, timeliness


 Bandwidth consumption, cache pollution  Correlation based prefetchers
 Prefetcher throttling  Content-directed prefetchers
 Issues in multi-core (if we get to it)  Precomputation or execution-based prefetchers


Address Correlation Based Prefetching (I)
- Consider the following history of cache block addresses:
  A, B, C, D, C, E, A, C, F, F, E, A, A, B, C, D, E, A, B, C, D, C
- After referencing a particular address (say A or E), are some addresses more likely to be referenced next?
[Markov model over the miss addresses: each node is an address (A–F) and each edge carries the observed probability that it is the next miss]

Address Correlation Based Prefetching (II)
- Correlation table indexed by cache block address; each entry holds: Cache Block Addr (tag) | Prefetch Candidate 1, Confidence | … | Prefetch Candidate N, Confidence
- Idea: Record the likely-next addresses (B, C, D) after seeing an address A
  - A is said to be correlated with B, C, D
- Next time A is accessed, prefetch B, C, D
- Prefetch up to N next addresses to increase coverage
- Prefetch accuracy can be improved by using multiple addresses as the key for the next address: (A, B) → (C)
  - (A, B) correlated with C
- Joseph and Grunwald, "Prefetching using Markov Predictors," ISCA 1997.
  - Also called "Markov prefetchers"
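Below is a small C sketch of such a correlation (Markov) table: each miss trains the entry of the previous miss address and triggers prefetches for the recorded successors of the current one. The table geometry, indexing, and rotating-slot replacement are assumptions for illustration.

    #include <stdint.h>

    #define CORR_ENTRIES 1024
    #define SUCCESSORS   2        /* prefetch up to N likely-next addresses */

    typedef struct {
        uint64_t miss_addr;
        uint64_t next[SUCCESSORS];
        int count;                /* how many successor slots are filled    */
        int cursor;               /* next slot to overwrite (rotating)      */
        int valid;
    } corr_entry_t;

    static corr_entry_t corr[CORR_ENTRIES];
    static uint64_t last_miss;
    static int have_last;

    extern void issue_prefetch(uint64_t addr);

    static corr_entry_t *lookup(uint64_t addr) { return &corr[(addr >> 6) % CORR_ENTRIES]; }

    void markov_on_miss(uint64_t addr)
    {
        /* 1. Train: record this miss as a successor of the previous miss. */
        if (have_last) {
            corr_entry_t *p = lookup(last_miss);
            if (!p->valid || p->miss_addr != last_miss) {
                p->miss_addr = last_miss; p->count = 0; p->cursor = 0; p->valid = 1;
            }
            p->next[p->cursor] = addr;
            p->cursor = (p->cursor + 1) % SUCCESSORS;
            if (p->count < SUCCESSORS) p->count++;
        }
        last_miss = addr;
        have_last = 1;

        /* 2. Predict: prefetch the recorded successors of this miss address. */
        corr_entry_t *e = lookup(addr);
        if (e->valid && e->miss_addr == addr)
            for (int i = 0; i < e->count; i++)
                issue_prefetch(e->next[i]);
    }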
10-11-2023

Address Correlation Based Prefetching (III) Content Directed Prefetching (I)


 Advantages:  A specialized prefetcher for pointer values
 Can cover arbitrary access patterns  Idea: Identify pointers among all values in a fetched cache
 Linked data structures block and issue prefetch requests for them.
 Streaming patterns (though not so efficiently!)  Cooksey et al., “A stateless, content-directed data prefetching
mechanism,” ASPLOS 2002.
 Disadvantages:
 Correlation table needs to be very large for high coverage + No need to memorize/record past addresses!
 Recording every miss address and its subsequent miss addresses + Can eliminate compulsory misses (never-seen pointers)
is infeasible
-- Indiscriminately prefetches all pointers in a cache block
 Can have low timeliness: Lookahead is limited since a prefetch
for the next access/miss is initiated right after previous
 Can consume a lot of memory bandwidth  How to identify pointer addresses:
 Especially when Markov model probabilities (correlations) are low  Compare address sized values within cache block with cache
 Cannot reduce compulsory misses block’s address  if most-significant few bits match, pointer

Content Directed Prefetching (II)
[Figure: each address-sized value in a fetched cache block is compared against the block's virtual address; if bits [31:20] match, the value is treated as a pointer by the virtual address predictor and a prefetch is generated for it (into L2 / from DRAM)]

Execution-based Prefetchers (I)
- Idea: Pre-execute a piece of the (pruned) program solely for prefetching data
  - Only need to distill the pieces that lead to cache misses
- Speculative thread: the pre-executed program piece can be considered a "thread"
- The speculative thread can be executed
  - On a separate processor/core
  - On a separate hardware thread context (think fine-grained multithreading)
  - On the same thread context in idle cycles (during cache misses)
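A sketch of the pointer test is shown below: every aligned address-sized value in a fetched block is compared against the block's own virtual address, and a prefetch is issued when the upper bits match (here bits [31:20], as in the figure). The block size and the number of compared bits are design parameters, and issue_prefetch is a stand-in for the memory system.

    #include <stdint.h>
    #include <string.h>

    #define BLOCK_BYTES   64
    #define COMPARE_SHIFT 20   /* compare address bits [31:20] */

    extern void issue_prefetch(uint64_t addr);

    void cdp_scan_block(const uint8_t block[BLOCK_BYTES], uint64_t block_vaddr)
    {
        uint32_t region = (uint32_t)(block_vaddr >> COMPARE_SHIFT) & 0xFFF;

        for (int off = 0; off + 8 <= BLOCK_BYTES; off += 8) {
            uint64_t candidate;
            memcpy(&candidate, block + off, sizeof(candidate));
            uint32_t cand_region = (uint32_t)(candidate >> COMPARE_SHIFT) & 0xFFF;
            if (candidate != 0 && cand_region == region)
                issue_prefetch(candidate);   /* likely a pointer into the same region */
        }
    }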

Execution-based Prefetchers (II) Thread-Based Pre-Execution


 How to construct the speculative thread:  Dubois and Song, “Assisted
 Software based pruning and “spawn” instructions Execution,” USC Tech
Report 1998.
 Hardware based pruning and “spawn” instructions
 Use the original program (no construction), but
 Execute it faster without stalling and correctness constraints
 Chappell et al.,
“Simultaneous Subordinate
Microthreading (SSMT),”
 Speculative thread ISCA 1999.
 Needs to discover misses before the main program
 Avoid waiting/stalling and/or compute less  Zilles and Sohi, “Execution-
 To get ahead, uses based Prediction Using
 Perform only address generation computation, branch prediction, Speculative Slices”, ISCA
value prediction (to predict “unknown” values) 2001.
 Purely speculative so there is no need for recovery of main
program if the speculative thread is incorrect

Thread-Based Pre-Execution Issues Thread-Based Pre-Execution Issues


 Where to execute the precomputation thread?  What, when, where, how
1. Separate core (least contention with main thread)  Luk, “Tolerating Memory Latency through Software-Controlled
2. Separate thread context on the same core (more contention) Pre-Execution in Simultaneous Multithreading Processors,”
3. Same core, same context ISCA 2001.
 When the main thread is stalled  Many issues in software-based pre-execution discussed
 When to spawn the precomputation thread?
1. Insert spawn instructions well before the “problem” load
 How far ahead?

 Too early: prefetch might not be needed


 Too late: prefetch might not be timely
2. When the main thread is stalled
 When to terminate the precomputation thread?
1. With pre-inserted CANCEL instructions
2. Based on effectiveness/contention feedback (recall throttling)

An Example Example ISA Extensions


Results on a Multithreaded Processor Problem Instructions


 Zilles and Sohi, “Execution-based Prediction Using Speculative Slices”, ISCA
2001.
 Zilles and Sohi, ”Understanding the backward slices of performance degrading
instructions,” ISCA 2000.


Fork Point for Prefetching Thread Pre-execution Thread Construction


Review: Runahead Execution Review: Runahead Execution (Mutlu et al., HPCA 2003)

 A simple pre-execution method for prefetching purposes Small Window:


Load 1 Miss Load 2 Miss
 When the oldest instruction is a long-latency cache miss:
Compute Stall Compute Stall
 Checkpoint architectural state and enter runahead mode
Miss 1 Miss 2
 In runahead mode:
 Speculatively pre-execute instructions
 The purpose of pre-execution is to generate prefetches Runahead:
 L2-miss dependent instructions are marked INV and dropped Load 1 Miss Load 2 Miss Load 1 Hit Load 2 Hit

 Runahead mode ends when the original miss returns


Compute Runahead Compute
 Checkpoint is restored and normal execution resumes Miss 1
Saved Cycles

Miss 2

 Mutlu et al., “Runahead Execution: An Alternative to Very Large


Instruction Windows for Out-of-order Processors,” HPCA 2003.

Runahead as an Execution-based Prefetcher Taking Advantage of Pure Speculation


 Idea of an Execution-Based Prefetcher: Pre-execute a piece  Runahead mode is purely speculative
of the (pruned) program solely for prefetching data
 The goal is to find and generate cache misses that would
 Idea of Runahead: Pre-execute the main program solely for otherwise stall execution later on
prefetching data
 How do we achieve this goal most efficiently and with the
 Advantages and disadvantages of runahead vs. other highest benefit?
execution-based prefetchers?
 Idea: Find and execute only those instructions that will lead
 Can you make runahead even better by pruning the to cache misses (that cannot already be captured by the
program portion executed in runahead mode? instruction window)

 How?

Where We Are in Lecture Schedule


 The memory hierarchy
 Caches, caches, more caches
 Virtualizing the memory hierarchy: Virtual Memory Emerging Memory Technologies
 Main memory: DRAM
 Main memory control, scheduling
 Memory latency tolerance techniques
 Non-volatile memory

 Multiprocessors
 Coherence and consistency
 Interconnection networks
 Multi-core issues (e.g., heterogeneous multi-core)

The Main Memory System Major Trends Affecting Main Memory (I)
 Need for main memory capacity, bandwidth, QoS increasing

Processor Main Memory Storage (SSD/HDD)


and caches  Main memory energy/power is a key system design concern

 Main memory is a critical component of all computing


systems: server, mobile, embedded, desktop, sensor

 DRAM technology scaling is ending


 Main memory system must scale (in size, technology,
efficiency, cost, and management algorithms) to maintain
performance growth and technology scaling benefits

The DRAM Scaling Problem An Example of The Scaling Problem


 DRAM stores charge in a capacitor (charge-based memory)
 Capacitor must be large enough for reliable sensing
 Access transistor should be large enough for low leakage and high Row of Cells Wordline
retention time
 Scaling beyond 40-35nm (2013) is challenging [ITRS, 2009] Row Row
Victim
Row
HammeredOpened
Closed
Row VHIGH
LOW
Row Row
Victim
Row

Repeatedly opening and closing a row induces


 DRAM capacity, cost, and energy/power hard to scale disturbance errors in adjacent rows in most real
1587
DRAM chips [Kim+ ISCA 2014] 158
8
10-11-2023

Most DRAM Modules Are At Risk Errors vs. Vintage


A company B company C company

First
(37/43) (45/54) (28/32) Appearance

Up to Up to Up to
7 6 5
errors errors errors
Kim+, “Flipping Bits in Memory Without Accessing Them: An Experimental Study of 158
All modules from 2012–2013 are vulnerable
159
DRAM Disturbance Errors,” ISCA 2014.
9 0

Security Implications
- http://users.ece.cmu.edu/~omutlu/pub/dram-row-hammer_isca14.pdf
- http://googleprojectzero.blogspot.com/2015/03/exploiting-dram-rowhammer-bug-to-gain.html
10-11-2023

How Do We Solve The Problem?


 Tolerate it: Make DRAM and controllers more intelligent
 New interfaces, functions, architectures: system-DRAM codesign
How Can We Fix the Memory Problem &
Design (Memory) Systems of the Future?  Eliminate or minimize it: Replace or (more likely) augment
DRAM with a different technology
 New technologies and system-wide rethinking of memory &
storage

 Embrace it: Design heterogeneous-reliability memories that


map error-tolerant data to less reliable portions
 New usage and execution models


Solutions (to memory scaling) require

software/hardware/device cooperation

Trends: Problems with DRAM as Main Memory Solutions to the DRAM Scaling Problem
 Need for main memory capacity increasing  Two potential solutions
 DRAM capacity hard to scale  Tolerate DRAM (by taking a fresh look at it)
 Enable emerging memory technologies to eliminate/minimize
DRAM

 Main memory energy/power is a key system design concern


 Do both
 DRAM consumes high power due to leakage and refresh
 Hybrid memory systems

 DRAM technology scaling is ending


 DRAM capacity, cost, and energy/power hard to scale


Solution 1: Tolerate DRAM Solution 1: Tolerate DRAM


 Liu+, “RAIDR: Retention-Aware Intelligent DRAM Refresh,” ISCA 2012.

 Overcome DRAM shortcomings with 


Kim+, “A Case for Exploiting Subarray-Level Parallelism in DRAM,” ISCA 2012.
Lee+, “Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture,” HPCA 2013.
 System-DRAM co-design  Liu+, “An Experimental Study of Data Retention Behavior in Modern DRAM Devices,” ISCA 2013.
 Seshadri+, “RowClone: Fast and Efficient In-DRAM Copy and Initialization of Bulk Data,” MICRO 2013.
 Novel DRAM architectures, interface, functions  Pekhimenko+, “Linearly Compressed Pages: A Main Memory Compression Framework,” MICRO 2013.

 Better waste management (efficient utilization)  Chang+, “Improving DRAM Performance by Parallelizing Refreshes with Accesses,” HPCA 2014.
Khan+, “The Efficacy of Error Mitigation Techniques for DRAM Retention Failures: A Comparative Experimental

Study,” SIGMETRICS 2014.
 Luo+, “Characterizing Application Memory Error Vulnerability to Optimize Data Center Cost,” DSN 2014.
Kim+, “Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors,”
Key issues to tackle

 ISCA 2014.
Lee+, “Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case,” HPCA 2015.
 Reduce energy 

 Qureshi+, “AVATAR: A Variable-Retention-Time (VRT) Aware Refresh for DRAM Systems,” DSN 2015.
 Enable reliability at low cost  Meza+, “Revisiting Memory Errors in Large-Scale Production Data Centers: Analysis and Modeling of New
Trends from the Field,” DSN 2015.
 Improve bandwidth and latency  Kim+, “Ramulator: A Fast and Extensible DRAM Simulator,” IEEE CAL 2015.

 Reduce waste  Avoid DRAM:


 Enable computation close to data 


Seshadri+, “The Evicted-Address Filter: A Unified Mechanism to Address Both Cache Pollution and Thrashing,” PACT 2012.
Pekhimenko+, “Base-Delta-Immediate Compression: Practical Data Compression for On-Chip Caches,” PACT 2012.
 Seshadri+, “The Dirty-Block Index,” ISCA 2014.
 Pekhimenko+, “Exploiting Compressed Block Size as an Indicator of Future Reuse,” HPCA 2015.


Solution 2: Emerging Memory Technologies Hybrid Memory Systems


 Some emerging resistive memory technologies seem more
scalable than DRAM (and they are non-volatile)
 Example: Phase Change Memory CPU
DRA PCM
 Expected to scale to 9nm (2022 [ITRS]) MCtrl Ctrl
DRAM Phase Change Memory (or Tech. X)
 Expected to be denser than DRAM: can store multiple bits/cell
Fast, durable
 But, emerging technologies have shortcomings as well Small, Large, non-volatile, low-cost
 Can they be enabled to replace/augment/surpass DRAM? leaky, volatile, Slow, wears out, high active energy
high-cost
 Lee+, “Architecting Phase Change Memory as a Scalable DRAM Alternative,” ISCA’09, CACM’10, Micro’10.
 Meza+, “Enabling Efficient and Scalable Hybrid Memories,” IEEE Comp. Arch. Letters 2012.
 Yoon, Meza+, “Row Buffer Locality Aware Caching Policies for Hybrid Memories,” ICCD 2012.
Hardware/software manage data allocation and movement
 Kultursay+, “Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative,” ISPASS 2013.
 Meza+, “A Case for Efficient Hardware-Software Cooperative Management of Storage and Memory,” WEED to achieve the best of multiple technologies
2013.
 Lu+, “Loose Ordering Consistency for Persistent Memory,” ICCD 2014.
Meza+, “Enabling Efficient and Scalable Hybrid Memories,” IEEE Comp. Arch. Letters, 2012.
 Zhao+, “FIRM: Fair and High-Performance Memory Control for Persistent Memory Systems,” MICRO 2014.
 Yoon, Meza+, “Efficient Data Mapping and Buffering Techniques for Multi-Level Cell Phase-Change Yoon, Meza et al., “Row Buffer Locality Aware Caching Policies for Hybrid Memories,” ICCD
Memories,” ACM TACO 2014. 2012 Best Paper Award.

Requirements from an Ideal Memory System The Promise of Emerging Technologies


 Likely need to replace/augment DRAM with a technology that is
 Traditional
 Technology scalable
 Higher capacity  And at least similarly efficient, high performance, and fault-tolerant
 Continuous low cost  or can be architected to be so
 High system performance (higher bandwidth, low latency)

 New  Some emerging resistive memory technologies appear promising


 Phase Change Memory (PCM)?
 Technology scalability: lower cost, higher capacity, lower energy
 Spin Torque Transfer Magnetic Memory (STT-MRAM)?
 Energy (and power) efficiency
 Memristors?
 QoS support and configurability (for consolidation)  And, maybe there are other ones
 Can they be enabled to replace/augment/surpass DRAM?
Emerging, resistive memory technologies (NVM) can help

Charge vs. Resistive Memories Limits of Charge Memory


 Difficult charge placement and control
 Charge Memory (e.g., DRAM, Flash)  Flash: floating gate charge
 Write data by capturing charge Q  DRAM: capacitor charge, transistor leakage
 Read data by detecting voltage V
 Reliable sensing becomes difficult as charge storage unit
size reduces
 Resistive Memory (e.g., PCM, STT-MRAM, memristors)
 Write data by pulsing current dQ/dt
 Read data by detecting resistance R


Emerging Resistive Memory Technologies What is Phase Change Memory?


 PCM  Phase change material (chalcogenide glass) exists in two states:
 Inject current to change material phase  Amorphous: Low optical reflexivity and high electrical resistivity
 Crystalline: High optical reflexivity and low electrical resistivity
 Resistance determined by phase

 STT-MRAM
 Inject current to change magnet polarity
 Resistance determined by polarity

 Memristors/RRAM/ReRAM
 Inject current to change atomic structure
 Resistance determined by atom distance
PCM is resistive memory: High resistance (0), Low resistance (1)
PCM cell can be switched between states reliably and quickly

How Does PCM Work?
- Write: change phase via current injection
  - SET: sustained current to heat cell above Tcryst
  - RESET: cell heated above Tmelt and quenched
- Read: detect phase via material resistance
  - Amorphous vs. crystalline
[Figure: a small sustained current SETs the memory element into the crystalline, low-resistance state (10^3–10^4 Ω); a large current RESETs it into the amorphous, high-resistance state (10^6–10^7 Ω); an access device selects the cell. Photo courtesy: Bipin Rajendran, IBM. Slide courtesy: Moinuddin Qureshi, IBM.]

Opportunity: PCM Advantages
- Scales better than DRAM, Flash
  - Requires current pulses, which scale linearly with feature size
  - Expected to scale to 9nm (2022 [ITRS])
  - Prototyped at 20nm (Raoux+, IBM JRD 2008)
- Can be denser than DRAM
  - Can store multiple bits per cell due to large resistance range
  - Prototypes with 2 bits/cell in ISSCC'08, 4 bits/cell by 2012
- Non-volatile
  - Retains data for >10 years at 85C
  - No refresh needed, low idle power
10-11-2023

Phase Change Memory Properties

 Surveyed prototypes from 2003-2008 (ITRS, IEDM, VLSI,


ISSCC)
 Derived PCM parameters for F=90nm

 Lee, Ipek, Mutlu, Burger, “Architecting Phase Change


Memory as a Scalable DRAM Alternative,” ISCA 2009.


Phase Change Memory Properties: Latency Phase Change Memory Properties


 Latency comparable to, but slower than DRAM  Dynamic Energy
 40 uA Rd, 150 uA Wr
 2-43x DRAM, 1x NAND Flash

 Endurance
 Writes induce phase change at 650C
 Contacts degrade from thermal expansion/contraction
 Read Latency  108 writes per cell
 50ns: 4x DRAM, 10-3x NAND Flash
 10-8x DRAM, 103x NAND Flash
 Write Latency
 150ns: 12x DRAM  Cell Size
 Write Bandwidth  9-12F2 using BJT, single-level cells
 5-10 MB/s: 0.1x DRAM, 1x NAND Flash  1.5x DRAM, 2-3x NAND (will scale with feature size)

Phase Change Memory: Pros and Cons PCM-based Main Memory: Some Questions
 Pros over DRAM  Where to place PCM in the memory hierarchy?
 Better technology scaling  Hybrid OS controlled PCM-DRAM
 Non volatility  Hybrid OS controlled PCM and hardware-controlled DRAM
 Low idle power (no refresh)  Pure PCM main memory
 Cons
 Higher latencies: ~4-15x DRAM (especially write)  How to mitigate shortcomings of PCM?
 Higher active energy: ~2-50x DRAM (especially write)
 Lower endurance (a cell dies after ~108 writes)
 How to take advantage of (byte-addressable and fast) non-
 Reliability issues (resistance drift)
volatile main memory?
 Challenges in enabling PCM as DRAM replacement/helper:
 Mitigate PCM shortcomings
 Find the right way to place PCM in the system
 Ensure secure and fault-tolerant PCM operation

PCM-based Main Memory (I) Hybrid Memory Systems: Challenges


 How should PCM-based (main) memory be organized?  Partitioning
 Should DRAM be a cache or main memory, or configurable?
 What fraction? How many controllers?

 Data allocation/movement (energy, performance, lifetime)


 Who manages allocation/movement?
 What are good control algorithms?
 How do we prevent degradation of service due to wearout?

 Hybrid PCM+DRAM [Qureshi+ ISCA’09, Dhiman+ DAC’09, Meza+  Design of cache hierarchy, memory controllers, OS
IEEE CAL’12]:  Mitigate PCM shortcomings, exploit PCM advantages
How to partition/migrate data between PCM and DRAM

 Design of PCM/DRAM chips and modules
 Rethink the design of PCM/DRAM with new requirements


PCM-based Main Memory (II) Aside: STT-RAM Basics


 How should PCM-based (main) memory be organized?  Magnetic Tunnel Junction (MTJ) Logical 0
 Reference layer: Fixed Reference Layer
 Free layer: Parallel or anti-parallel Barrier

 Cell Free Layer

 Access transistor, bit/sense lines Logical 1


Reference Layer
 Read and Write Barrier
 Read: Apply a small voltage across Free Layer
bitline and senseline; read the current.
 Write: Push large current through MTJ.
 Pure PCM main memory [Lee et al., ISCA’09, Top Picks’10]: Direction of current determines new Word Line
orientation of the free layer. MTJ
 How to redesign entire hierarchy (and cores) to overcome
Access
PCM shortcomings Transistor
 Kultursay et al., “Evaluating STT-RAM as an
Sense Line
Energy-Efficient Main Memory Alternative,” ISPASS Bit Line
2013


Aside: STT MRAM: Pros and Cons An Initial Study: Replace DRAM with PCM
 Pros over DRAM  Lee, Ipek, Mutlu, Burger, “Architecting Phase Change
 Better technology scaling Memory as a Scalable DRAM Alternative,” ISCA 2009.
 Non volatility  Surveyed prototypes from 2003-2008 (e.g. IEDM, VLSI, ISSCC)
 Low idle power (no refresh)  Derived “average” PCM parameters for F=90nm

 Cons
 Higher write latency
 Higher write energy
 Reliability?

 Another level of freedom


 Can trade off non-volatility for lower write latency/energy (by
reducing the size of the MTJ)


Results: Naïve Replacement of DRAM with PCM Architecting PCM to Mitigate Shortcomings
 Replace DRAM with PCM in a 4-core, 4MB L2 system  Idea 1: Use multiple narrow row buffers in each PCM chip
 PCM organized the same as DRAM: row buffers, banks, peripherals  Reduces array reads/writes  better endurance, latency, energy
 1.6x delay, 2.2x energy, 500-hour average lifetime
 Idea 2: Write into array at
cache block or word
granularity DRAM PCM
 Reduces unnecessary wear

 Lee, Ipek, Mutlu, Burger, “Architecting Phase Change Memory as a


Scalable DRAM Alternative,” ISCA 2009.

Results: Architected PCM as Main Memory Hybrid Memory Systems


 1.2x delay, 1.0x energy, 5.6-year average lifetime
 Scaling improves energy, endurance, density
CPU
DRA PCM
DRAM MCtrl Ctrl Phase Change Memory (or Tech. X)
Fast, durable
Small, Large, non-volatile, low-cost
leaky, volatile, Slow, wears out, high active energy
high-cost

Hardware/software manage data allocation and movement


to achieve the best of multiple technologies
 Caveat 1: Worst-case lifetime is much shorter (no guarantees) Meza+, “Enabling Efficient and Scalable Hybrid Memories,” IEEE Comp. Arch. Letters, 2012.
 Caveat 2: Intensive applications see large performance and energy hits Yoon, Meza et al., “Row Buffer Locality Aware Caching Policies for Hybrid Memories,” ICCD
2012 Best Paper Award.
 Caveat 3: Optimistic PCM parameters?

One Option: DRAM as a Cache for PCM
- PCM is main memory; DRAM caches memory rows/blocks
  - Benefits: reduced latency on DRAM cache hit; write filtering
- Memory controller hardware manages the DRAM cache
  - Benefit: eliminates system software overhead
- Three issues:
  - What data should be placed in DRAM versus kept in PCM?
  - What is the granularity of data movement?
  - How to design a huge (DRAM) cache at low cost?
- Two solutions:
  - Locality-aware data placement [Yoon+, ICCD 2012]
  - Cheap tag stores and dynamic granularity [Meza+, IEEE CAL 2012]

DRAM vs. PCM: An Observation
- Row buffers are the same in DRAM and PCM
  - Row buffer hit latency is the same in DRAM and PCM
  - Row buffer miss latency is small in DRAM, large in PCM
[Figure: CPU with a DRAM controller and a PCM controller; both the DRAM cache and PCM main memory have banks with row buffers; a row hit takes N ns in both, but a row miss is fast in DRAM and slow in PCM]
- Accessing the row buffer in PCM is fast
- What incurs high latency is the PCM array access → avoid this
1625 1626

Row-Locality-Aware Data Placement
- Idea: Cache in DRAM only those rows that
  - Frequently cause row buffer conflicts → because row-conflict latency is smaller in DRAM
  - Are reused many times → to reduce cache pollution and bandwidth waste
- Simplified rule of thumb:
  - Streaming accesses: better to place in PCM
  - Other accesses (with some reuse): better to place in DRAM
- Yoon et al., "Row Buffer Locality-Aware Data Placement in Hybrid Memories," ICCD 2012 Best Paper Award.

Row-Locality-Aware Data Placement: Results
[Chart: normalized weighted speedup of the FREQ, FREQ-Dyn, RBLA, and RBLA-Dyn policies on server and cloud workloads and their average; RBLA-Dyn performs best, with annotated gains of 10%, 14%, and 17%]
- Memory energy-efficiency and fairness also improve correspondingly
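As a rough illustration of this policy, the sketch below counts row-buffer misses and reuse per PCM row and migrates a row to DRAM once both exceed a threshold. The per-row counters, the thresholds, and the migrate_row_to_dram interface are assumptions for illustration; the actual mechanism and parameters are those of Yoon et al., ICCD 2012.

    #include <stdint.h>

    #define RBLA_THRESHOLD  2   /* row-buffer misses before a row is a candidate */
    #define REUSE_THRESHOLD 2   /* accesses before the row is considered reused  */

    typedef struct { uint32_t rb_misses; uint32_t accesses; } row_stats_t;

    extern void migrate_row_to_dram(uint64_t row_id);  /* stand-in for migration machinery */

    /* Called by the memory controller on each access to a PCM row. */
    void rbla_on_pcm_access(uint64_t row_id, row_stats_t *s, int row_buffer_hit)
    {
        s->accesses++;
        if (!row_buffer_hit) s->rb_misses++;

        /* Rows with frequent row-buffer misses and reuse benefit most from DRAM,
         * because only row-buffer misses are slower in PCM than in DRAM.
         * (A real design would also reset counters and check DRAM capacity.) */
        if (s->rb_misses >= RBLA_THRESHOLD && s->accesses >= REUSE_THRESHOLD)
            migrate_row_to_dram(row_id);
    }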
10-11-2023

Hybrid vs. All-PCM/DRAM
[Chart: normalized weighted speedup, normalized maximum slowdown, and performance per watt for a 16GB all-PCM system, the 16GB RBLA-Dyn hybrid, and a 16GB all-DRAM system]
- 31% better performance than all PCM, within 29% of all DRAM performance

Other Opportunities with Emerging Technologies
- Merging of memory and storage
  - e.g., a single interface to manage all data
- New applications
  - e.g., ultra-fast checkpoint and restore
- More robust system design
  - e.g., reducing data loss
- Processing tightly-coupled with memory
  - e.g., enabling efficient search and filtering

Coordinated Memory and Storage with NVM (I)
 The traditional two-level storage model is a bottleneck with NVM
  Volatile data in memory → a load/store interface
  Persistent data in storage → a file system interface
  Problem: Operating system (OS) and file system (FS) code to locate, translate, and buffer data become performance and energy bottlenecks with fast NVM stores
 [Diagram: Two-Level Store — the processor and caches issue load/store operations to main memory through virtual memory / address translation, and fopen, fread, fwrite, … calls to storage (SSD/HDD) through the operating system and file system]

Coordinated Memory and Storage with NVM (II)
 Goal: Unify memory and storage management in a single unit to eliminate wasted work to locate, transfer, and translate data
  Improves both energy and performance
  Simplifies programming model as well
 [Diagram: Unified Memory/Storage — the processor and caches issue load/store operations through a Persistent Memory Manager (with feedback) to persistent (e.g., phase-change) memory]
 Meza+, “A Case for Efficient Hardware-Software Cooperative Management of Storage and Memory,” WEED 2013.

The Persistent Memory Manager (PMM)
 Exposes a load/store interface to access persistent data
  Applications can directly access persistent memory → no conversion, translation, location overhead for persistent data
 Manages data placement, location, persistence, security
  To get the best of multiple forms of storage
 Manages metadata storage and retrieval
  This can lead to overheads that need to be managed
 Exposes hooks and interfaces for system software
  To enable better data placement and management decisions
 Meza+, “A Case for Efficient Hardware-Software Cooperative Management of Storage and Memory,” WEED 2013.
 [Figure: applications operate on persistent objects; the PMM uses access and hint information to allocate, locate, migrate and access data in the heterogeneous array of devices]

Performance Benefits of a Single-Level Store
 [Plot: Results for PostMark — roughly 24x and 16x performance improvement]

Energy Benefits of a Single-Level Store
 [Plot: Results for PostMark — roughly 5x energy reduction]



Enabling and Exploiting NVM: Issues
 Many issues and ideas from technology layer to algorithms layer
  [Figure: the transformation hierarchy — Problems, Algorithms, Programs/User, Runtime System (VM, OS, MM), ISA, Microarchitecture, Logic, Devices]
 Enabling NVM and hybrid memory
  How to tolerate errors?
  How to enable secure operation?
  How to tolerate performance and power shortcomings?
  How to minimize cost?
 Exploiting emerging technologies
  How to exploit non-volatility?
  How to minimize energy consumption?
  How to exploit NVM on chip?

Three Principles for (Memory) Scaling
 Better cooperation between devices and the system
  Expose more information about devices to upper layers
  More flexible interfaces
 Better-than-worst-case design
  Do not optimize for the worst case
  Worst case should not determine the common case
 Heterogeneity in design (specialization, asymmetry)
  Enables a more efficient design (No one size fits all)
 These principles are related and sometimes coupled

Security Challenges of Emerging Technologies
 1. Limited endurance → Wearout attacks
 2. Non-volatility → Data persists in memory after powerdown → Easy retrieval of privileged or private information
 3. Multiple bits per cell → Information leakage (via side channel)

Securing Emerging Memory Technologies
 1. Limited endurance → Wearout attacks
  Better architecting of memory chips to absorb writes
  Hybrid memory system management
  Online wearout attack detection
 2. Non-volatility → Data persists in memory after powerdown → Easy retrieval of privileged or private information
  Efficient encryption/decryption of whole main memory
  Hybrid memory system management
 3. Multiple bits per cell → Information leakage (via side channel)
  System design to hide side channel information

Summary of Emerging Memory Technologies


 Key trends affecting main memory
 End of DRAM scaling (cost, capacity, efficiency) 18-447
 Need for high capacity
 Need for energy efficiency Computer Architecture
 Emerging NVM technologies can help
Lecture 27: Multiprocessors
 PCM or STT-MRAM more scalable than DRAM and non-volatile
 But, they have shortcomings: latency, active energy, endurance

 We need to enable promising NVM technologies by Prof. Onur Mutlu


overcoming their shortcomings
Carnegie Mellon University
Spring 2015, 4/6/2015
 Many exciting opportunities to reinvent main memory at all
layers of computing stack

Where We Are in Lecture Schedule


 The memory hierarchy
 Caches, caches, more caches
 Virtualizing the memory hierarchy: Virtual Memory Multiprocessors and
 Main memory: DRAM Issues in Multiprocessing
 Main memory control, scheduling
 Memory latency tolerance techniques
 Non-volatile memory

 Multiprocessors
 Coherence and consistency
 Interconnection networks
 Multi-core issues (e.g., heterogeneous multi-core)

Readings: Multiprocessing Memory Consistency


 Required  Required
 Amdahl, “Validity of the single processor approach to achieving large  Lamport, “How to Make a Multiprocessor Computer That Correctly
scale computing capabilities,” AFIPS 1967. Executes Multiprocess Programs,” IEEE Transactions on Computers,
1979

 Recommended
 Mike Flynn, “Very High-Speed Computing Systems,” Proc. of IEEE,
1966
 Hill, Jouppi, Sohi, “Multiprocessors and Multicomputers,” pp. 551-
560 in Readings in Computer Architecture.
 Hill, Jouppi, Sohi, “Dataflow and Multithreading,” pp. 309-314 in
Readings in Computer Architecture.


Readings: Cache Coherence Remember: Flynn’s Taxonomy of Computers


 Required  Mike Flynn, “Very High-Speed Computing Systems,” Proc.
 Culler and Singh, Parallel Computer Architecture of IEEE, 1966
 Chapter 5.1 (pp 269 – 283), Chapter 5.3 (pp 291 – 305)
 P&H, Computer Organization and Design  SISD: Single instruction operates on single data element
Chapter 5.8 (pp 534 – 538 in 4th and 4th revised eds.)

 SIMD: Single instruction operates on multiple data elements
 Array processor
 Recommended:  Vector processor
 Papamarcos and Patel, “A low-overhead coherence solution
 MISD: Multiple instructions operate on single data element
for multiprocessors with private cache memories,” ISCA 1984.
 Closest form: systolic array processor, streaming processor
 MIMD: Multiple instructions operate on multiple data
elements (multiple instruction streams)
 Multiprocessor
 Multithreaded processor

Why Parallel Computers? Types of Parallelism and How to Exploit


 Parallelism: Doing multiple things at a time Them
 Instruction Level Parallelism
 Things: instructions, operations, tasks  Different instructions within a stream can be executed in parallel
 Pipelining, out-of-order execution, speculative execution, VLIW
 Main (or Original) Goal  Dataflow

 Improve performance (Execution time or task throughput)


 Execution time of a program governed by Amdahl’s Law  Data Parallelism
 Different pieces of data can be operated on in parallel
 Other Goals  SIMD: Vector processing, array processing
Systolic arrays, streaming processors
 Reduce power consumption 

 (4N units at freq F/4) consume less power than (N units at freq F)
 Why?  Task Level Parallelism
 Improve cost efficiency and scalability, reduce complexity  Different “tasks/threads” can be executed in parallel
 Harder to design a single unit that performs as well as N simpler units  Multithreading
 Improve dependability: Redundant execution in space  Multiprocessing (multi-core)

Task-Level Parallelism: Creating Tasks


 Partition a single problem into multiple related tasks
(threads)
 Explicitly: Parallel programming Multiprocessing Fundamentals
 Easy when tasks are natural in the problem
 Web/database queries
 Difficult when natural task boundaries are unclear

 Transparently/implicitly: Thread level speculation


 Partition a single thread speculatively

 Run many independent tasks (processes) together


 Easy when there are many processes
 Batch simulations, different users, cloud computing workloads
 Does not improve the performance of a single task

Multiprocessor Types Main Design Issues in Tightly-Coupled MP


 Loosely coupled multiprocessors  Shared memory synchronization
 No shared global memory address space  How to handle locks, atomic operations
 Multicomputer network
 Network-based multiprocessors  Cache coherence
 Usually programmed via message passing  How to ensure correct operation in the presence of private
 Explicit calls (send, receive) for communication caches

 Tightly coupled multiprocessors  Memory consistency: Ordering of memory operations


 Shared global memory address space  What should the programmer expect the hardware to provide?
 Traditional multiprocessing: symmetric multiprocessing (SMP)
 Existing multi-core processors, multithreaded processors  Shared resource management
 Programming model similar to uniprocessors (i.e., multitasking
uniprocessor) except
 Operations on shared data require synchronization  Communication: Interconnects

Main Programming Issues in Tightly-Coupled MP Aside: Hardware-based Multithreading


 Load imbalance  Coarse grained
 How to partition a single task into multiple tasks  Quantum based
 Event based (switch-on-event multithreading), e.g., switch on L3 miss

 Synchronization
 How to synchronize (efficiently) between tasks  Fine grained
 How to communicate between tasks  Cycle by cycle
Thornton, “CDC 6600: Design of a Computer,” 1970.
Locks, barriers, pipeline stages, condition variables,

semaphores, atomic operations, …  Burton Smith, “A pipelined, shared resource MIMD computer,” ICPP
1978.

 Ensuring correct operation while optimizing for performance


 Simultaneous
 Can dispatch instructions from multiple threads at the same time
 Good for improving execution unit utilization


Parallel Speedup Example
 Evaluate a4*x^4 + a3*x^3 + a2*x^2 + a1*x + a0
 Assume each operation takes 1 cycle, no communication cost, and each op can be executed in a different processor
 How fast is this with a single processor?
  Assume no pipelining or concurrent execution of instructions
 How fast is this with 3 processors?

Limits of Parallel Speedup


Speedup with 3 Processors Revisiting the Single-Processor Algorithm

Horner, “A new method of solving numerical equations of all orders, by continuous


approximation,” Philosophical Transactions of the Royal Society, 1819.

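To make the single-processor comparison concrete, here is a minimal C sketch (not from the slides) of the two evaluation strategies for the example polynomial: the naive form exposes the independent multiplications that the 3-processor schedule exploits, while Horner's method uses fewer operations but is inherently sequential.

    #include <stdio.h>

    /* Naive evaluation: a4*x^4 + a3*x^3 + a2*x^2 + a1*x + a0.
     * The term computations are independent of each other, which is
     * the parallelism exploited in the 3-processor example above. */
    double poly_naive(const double a[5], double x)
    {
        double x2 = x * x;          /* could run on one processor        */
        double x3 = x2 * x;         /* depends on x2                     */
        double x4 = x2 * x2;        /* can run in parallel with x3       */
        return a[4] * x4 + a[3] * x3 + a[2] * x2 + a[1] * x + a[0];
    }

    /* Horner's method: only 4 multiplies and 4 adds, but every
     * operation depends on the previous one, so it is sequential. */
    double poly_horner(const double a[5], double x)
    {
        return (((a[4] * x + a[3]) * x + a[2]) * x + a[1]) * x + a[0];
    }

    int main(void)
    {
        double a[5] = {5, 4, 3, 2, 1};   /* a0..a4, arbitrary example values */
        printf("%f %f\n", poly_naive(a, 2.0), poly_horner(a, 2.0));
        return 0;
    }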

Superlinear Speedup
 Can speedup be greater than P with P processing
elements?

 Unfair comparisons
Compare best parallel
algorithm to wimpy serial
algorithm  unfair

 Cache/memory effects
More processors 
more cache or memory 
fewer misses in cache/mem


Utilization, Redundancy, Efficiency Utilization of a Multiprocessor


 Traditional metrics
 Assume all P processors are tied up for parallel computation

 Utilization: How much processing capability is used


 U = (# Operations in parallel version) / (processors x Time)

 Redundancy: how much extra work is done with parallel


processing
 R = (# of operations in parallel version) / (# operations in best
single processor algorithm version)

 Efficiency
 E = (Time with 1 processor) / (processors x Time with P processors)
 E = U/R
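A small C sketch applying the three metrics defined above; the operation counts and times are made-up numbers for illustration (they assume one operation per cycle, which is why E = U/R holds here).

    #include <stdio.h>

    int main(void)
    {
        double ops_parallel    = 10.0;   /* operations in the parallel version      */
        double ops_best_serial = 8.0;    /* operations in best single-proc. version */
        double P               = 3.0;    /* number of processors                    */
        double time_parallel   = 5.0;    /* cycles with P processors                */
        double time_serial     = 8.0;    /* cycles with 1 processor                 */

        double U = ops_parallel / (P * time_parallel);   /* utilization */
        double R = ops_parallel / ops_best_serial;       /* redundancy  */
        double E = time_serial / (P * time_parallel);    /* efficiency  */

        printf("U = %.2f, R = %.2f, E = %.2f (U/R = %.2f)\n", U, R, E, U / R);
        return 0;
    }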

Amdahl’s Law and


Caveats of Parallelism


Caveats of Parallelism (I) Amdahl’s Law

Amdahl, “Validity of the single processor approach to


achieving large scale computing capabilities,” AFIPS 1967.

Amdahl’s Law Implication 1 Amdahl’s Law Implication 2


Caveats of Parallelism (II)
 Amdahl’s Law
  f: Parallelizable fraction of a program
  N: Number of processors

      Speedup = 1 / ( (1 - f) + f/N )

  Amdahl, “Validity of the single processor approach to achieving large scale computing capabilities,” AFIPS 1967.
 Maximum speedup limited by serial portion: Serial bottleneck
 Parallel portion is usually not perfectly parallel
  Synchronization overhead (e.g., updates to shared data)
  Load imbalance overhead (imperfect parallelization)
  Resource sharing overhead (contention among N processors)

Sequential Bottleneck
 [Plot: Speedup vs. f (parallel fraction) for N = 10, 100, and 1000; speedup remains small until f gets very close to 1, where it rises sharply toward N]
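A minimal C sketch of the speedup equation above, useful for reproducing the trend shown in the plot (the f and N values chosen below are arbitrary examples):

    #include <stdio.h>

    /* Amdahl's Law: Speedup = 1 / ((1 - f) + f/N) */
    static double amdahl_speedup(double f, double n)
    {
        return 1.0 / ((1.0 - f) + f / n);
    }

    int main(void)
    {
        double fractions[] = {0.5, 0.9, 0.99};
        double procs[]     = {10, 100, 1000};

        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 3; j++)
                printf("f=%.2f N=%4.0f -> speedup %.1f\n",
                       fractions[i], procs[j],
                       amdahl_speedup(fractions[i], procs[j]));
        return 0;
    }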

Why the Sequential Bottleneck? Another Example of Sequential Bottleneck


 Parallel machines have the
sequential bottleneck

 Main cause: Non-parallelizable


operations on data (e.g. non-
parallelizable loops)
for (i = 1; i < N; i++)
    A[i] = (A[i] + A[i-1]) / 2;   /* loop-carried dependence: iteration i needs A[i-1] computed by iteration i-1 */

 There are other causes as well:


 Single thread prepares data and
spawns parallel tasks (usually
sequential)


Bottlenecks in Parallel Portion Bottlenecks in Parallel Portion: Another View


 Synchronization: Operations manipulating shared data  Threads in a multi-threaded application can be inter-
cannot be parallelized dependent
 Locks, mutual exclusion, barrier synchronization  As opposed to threads from different applications
 Communication: Tasks may need values from each other
- Causes thread serialization when shared data is contended
 Such threads can synchronize with each other
 Locks, barriers, pipeline stages, condition variables,
 Load Imbalance: Parallel tasks may have different lengths semaphores, …
 Due to imperfect parallelization or microarchitectural effects
- Reduces speedup in parallel portion
 Some threads can be on the critical path of execution due
to synchronization; some threads are not
 Resource Contention: Parallel tasks can share hardware
resources, delaying each other
 Replicating all resources (e.g., memory) expensive
 Even within a thread, some “code segments” may be on
- Additional latency not present when each task runs alone
the critical path of execution; some are not

Remember: Critical Sections
 Enforce mutually exclusive access to shared data
 Only one thread can be executing it at a time
 Contended critical sections make threads wait → threads causing serialization can be on the critical path

    Each thread:
    loop {
        Compute              (N: non-critical work)
        lock(A)
        Update shared data   (C: critical section)
        unlock(A)
    }

Remember: Barriers
 Synchronization point
 Threads have to wait until all threads reach the barrier
 Last thread arriving to the barrier is on the critical path

    Each thread:
    loop1 {
        Compute
    }
    barrier
    loop2 {
        Compute
    }
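A minimal POSIX-threads sketch (not from the slides) of the critical-section pattern shown above: each thread computes independently, then serializes only for the shared update.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static long shared_sum = 0;

    static void *worker(void *arg)
    {
        (void)arg;
        long local = 0;
        for (int i = 0; i < 1000000; i++)   /* parallel (non-critical) compute */
            local += i;

        pthread_mutex_lock(&lock);          /* serialized critical section */
        shared_sum += local;
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        printf("sum = %ld\n", shared_sum);
        return 0;
    }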

Remember: Stages of Pipelined Programs Difficulty in Parallel Programming


 Loop iterations are statically divided into code segments called stages  Little difficulty if parallelism is natural
 Threads execute stages on different cores  “Embarrassingly parallel” applications
 Thread executing the slowest stage is on the critical path  Multimedia, physical simulation, graphics
 Large web servers, databases?

A B C
 Difficulty is in
loop {  Getting parallel programs to work correctly
Compute1 A
 Optimizing performance in the presence of bottlenecks
Compute2 B

Compute3 C  Much of parallel computer architecture is about


}
 Designing machines that overcome the sequential and parallel
bottlenecks to achieve higher performance and efficiency
 Making programmer’s job easier in writing correct and high-
performance parallel programs

Where We Are in Lecture Schedule


18-447  The memory hierarchy
 Caches, caches, more caches
Computer Architecture  Virtualizing the memory hierarchy: Virtual Memory
Lecture 28: Memory Consistency and  Main memory: DRAM
Main memory control, scheduling
Cache Coherence 

 Memory latency tolerance techniques


 Non-volatile memory

Prof. Onur Mutlu  Multiprocessors


Carnegie Mellon University  Coherence and consistency
Spring 2015, 4/8/2015  Interconnection networks
 Multi-core issues (e.g., heterogeneous multi-core)

Readings: Memory Consistency


 Required
Lamport, “How to Make a Multiprocessor Computer That
Memory Ordering in

Correctly Executes Multiprocess Programs,” IEEE Transactions


on Computers, 1979
Multiprocessors
 Recommended
 Gharachorloo et al., “Memory Consistency and Event Ordering
in Scalable Shared-Memory Multiprocessors,” ISCA 1990.
 Charachorloo et al., “Two Techniques to Enhance the
Performance of Memory Consistency Models,” ICPP 1991.
 Ceze et al., “BulkSC: bulk enforcement of sequential
consistency,” ISCA 2007.


Memory Consistency vs. Cache Coherence Difficulties of Multiprocessing


 Much of parallel computer architecture is about
 Consistency is about ordering of all memory operations
from different processors (i.e., to different memory
Designing machines that overcome the sequential and parallel
locations) 

bottlenecks to achieve higher performance and efficiency


 Global ordering of accesses to all memory locations

 Making programmer’s job easier in writing correct and high-


 Coherence is about ordering of operations from different performance parallel programs
processors to the same memory location
 Local ordering of accesses to each cache block


Ordering of Operations Memory Ordering in a Single Processor


 Operations: A, B, C, D  Specified by the von Neumann model
 In what order should the hardware execute (and report the  Sequential order
results of) these operations?  Hardware executes the load and store operations in the order
specified by the sequential program
 A contract between programmer and microarchitect
 Specified by the ISA  Out-of-order execution does not change the semantics
 Hardware retires (reports to software the results of) the load
 Preserving an “expected” (more accurately, “agreed upon”) and store operations in the order specified by the sequential
order simplifies programmer’s life program
 Ease of debugging; ease of state recovery, exception handling
 Advantages: 1) Architectural state is precise within an execution. 2)
 Preserving an “expected” order usually makes the hardware Architectural state is consistent across different runs of the program 
Easier to debug programs
designer’s life difficult
 Especially if the goal is to design a high performance processor: Recall load-  Disadvantage: Preserving order adds overhead, reduces
store queues in out of order execution and their complexity performance, increases complexity, reduces scalability

Memory Ordering in a Dataflow Processor Memory Ordering in a MIMD Processor


 A memory operation executes when its operands are ready  Each processor’s memory operations are in sequential order
with respect to the “thread” running on that processor
 Ordering specified only by data dependencies (assume each processor obeys the von Neumann model)

 Two operations can be executed and retired in any order if  Multiple processors execute memory operations
they have no dependency concurrently

 Advantage: Lots of parallelism  high performance  How does the memory see the order of operations from all
processors?
 Disadvantage: Order can change across runs of the same
 In other words, what is the ordering of operations across
program  Very hard to debug
different processors?


Why Does This Even Matter? When Could Order Affect Correctness?
 Ease of debugging  When protecting shared data
 It is nice to have the same execution done at different times
to have the same order of execution  Repeatability

 Correctness
 Can we have incorrect execution if the order of memory
operations is different from the point of view of different
processors?

 Performance and overhead


 Enforcing a strict “sequential ordering” can make life harder
for the hardware designer in implementing performance
enhancement techniques (e.g., OoO execution, caches)


Protecting Shared Data Supporting Mutual Exclusion


 Threads are not allowed to update shared data concurrently  Programmer needs to make sure mutual exclusion
 For correctness purposes (synchronization) is correctly implemented
 We will assume this
 But, correct parallel programming is an important topic
 Accesses to shared data are encapsulated inside  Reading: Dijkstra, “Cooperating Sequential Processes,” 1965.
critical sections or protected via synchronization constructs  http://www.cs.utexas.edu/users/EWD/transcriptions/EWD01xx/EWD
(locks, semaphores, condition variables) 123.html
 See Dekker’s algorithm for mutual exclusion

 Only one thread can execute a critical section at


a given time  Programmer relies on hardware primitives to support correct
 Mutual exclusion principle synchronization
 If hardware primitives are not correct (or unpredictable),
 A multiprocessor should provide the correct execution of programmer’s life is tough
synchronization primitives to enable the programmer to  If hardware primitives are correct but not easy to reason about
protect shared data or use, programmer’s life is still tough

Protecting Shared Data A Question


 Can the two processors be in the critical section at the
same time given that they both obey the von Neumann
model?
 Answer: yes

Assume P1 is in critical section.


Intuitively, it must have executed A,
which means F1 must be 1 (as A happens before B),
which means P2 should not enter the critical section.
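A sketch of the two-flag entry protocol the question refers to, written with C11 atomics; the labels A, B, X, Y and the flags F1, F2 follow the slide, everything else is an illustrative assumption rather than the exact example code.

    #include <stdatomic.h>

    /* With memory_order_seq_cst (sequential consistency) at most one
     * processor can pass its check; with memory_order_relaxed the
     * store and the following load may be reordered, and both can end
     * up in the critical section, as discussed below. (The naive
     * protocol can also deadlock if both set their flags; Dekker's
     * algorithm adds a turn variable to resolve that.) */
    atomic_int F1 = 0, F2 = 0;

    void p1_enter(void)
    {
        atomic_store_explicit(&F1, 1, memory_order_seq_cst);        /* A */
        while (atomic_load_explicit(&F2, memory_order_seq_cst)) ;   /* B */
        /* critical section */
        atomic_store_explicit(&F1, 0, memory_order_seq_cst);
    }

    void p2_enter(void)
    {
        atomic_store_explicit(&F2, 1, memory_order_seq_cst);        /* X */
        while (atomic_load_explicit(&F1, memory_order_seq_cst)) ;   /* Y */
        /* critical section */
        atomic_store_explicit(&F2, 0, memory_order_seq_cst);
    }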

Both Processors in Critical Section


The Problem
 The two processors did NOT see the same order of
operations to memory

 The “happened before” relationship between multiple


updates to memory was inconsistent between the two
processors’ points of view
 To P1, A appeared to happen before X; to P2, X appeared to happen before A
 As a result, each processor thought the other was not in the critical section


How Can We Solve The Problem? Sequential Consistency


 Idea: Sequential consistency  Lamport, “How to Make a Multiprocessor Computer That
Correctly Executes Multiprocess Programs,” IEEE Transactions on
Computers, 1979
 All processors see the same order of operations to memory
 i.e., all memory operations happen in an order (called the
 A multiprocessor system is sequentially consistent if:
global total order) that is consistent across all processors
 the result of any execution is the same as if the operations of all
the processors were executed in some sequential order
 Assumption: within this global order, each processor’s AND
operations appear in sequential order with respect to its  the operations of each individual processor appear in this
own operations. sequence in the order specified by its program

 This is a memory ordering model, or memory model


 Specified by the ISA


Programmer’s Abstraction Sequentially Consistent Operation Orders


 Memory is a switch that services one load or store at a time  Potential correct global orders (all are correct):
from any processor
 All processors see the currently serviced load or store at the  ABXY
same time  AXBY
 Each processor’s operations are serviced in program order  AXYB
 XABY
P1 P2 P3 Pn
 XAYB
 XYAB

 Which order (interleaving) is observed depends on


MEMORY
implementation and dynamic latencies


Consequences of Sequential Consistency
 Corollaries
 1. Within the same execution, all processors see the same global order of operations to memory
   No correctness issue
   Satisfies the “happened before” intuition
 2. Across different executions, different global orders can be observed (each of which is sequentially consistent)
   Debugging is still difficult (as order changes across runs)

Issues with Sequential Consistency?
 Nice abstraction for programming, but two issues:
  Too conservative ordering requirements
  Limits the aggressiveness of performance enhancement techniques
 Is the total global order requirement too strong?
  Do we need a global order across all operations and all processors?
  How about a global order only across all stores?
   Total store order memory model; unique store order model
  How about enforcing a global order only at the boundaries of synchronization?
   Relaxed memory models
   Acquire-release consistency model

Issues with Sequential Consistency?
 Performance enhancement techniques that could make SC implementation difficult
 Out-of-order execution
  Loads happen out-of-order with respect to each other and with respect to independent stores → makes it difficult for all processors to see the same global order of all memory operations
 Caching
  A memory location is now present in multiple places
  Prevents the effect of a store to be seen by other processors → makes it difficult for all processors to see the same global order of all memory operations

Weaker Memory Consistency
 The ordering of operations is important when the order affects operations on shared data → i.e., when processors need to synchronize to execute a “program region”
 Weak consistency
  Idea: Programmer specifies regions in which memory operations do not need to be ordered
  “Memory fence” instructions delineate those regions
   All memory operations before a fence must complete before the fence is executed
   All memory operations after the fence must wait for the fence to complete
   Fences complete in program order
 All synchronization operations act like a fence
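A minimal C11 sketch of the fence idea described above (an illustrative assumption; a real ISA exposes explicit memory-barrier instructions that such library calls compile down to). The producer's fence orders the data write before the flag write, and the consumer's fence orders the flag read before the data read.

    #include <stdatomic.h>

    int data = 0;
    atomic_int ready = 0;

    void producer(void)
    {
        data = 42;                                     /* ordinary write            */
        atomic_thread_fence(memory_order_release);     /* fence: data before flag   */
        atomic_store_explicit(&ready, 1, memory_order_relaxed);
    }

    int consumer(void)
    {
        while (!atomic_load_explicit(&ready, memory_order_relaxed))
            ;                                          /* wait for the flag         */
        atomic_thread_fence(memory_order_acquire);     /* fence: flag before data   */
        return data;                                   /* guaranteed to observe 42  */
    }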

Tradeoffs: Weaker Consistency Related Questions


 Advantage  Question 4 in
 No need to guarantee a very strict order of memory  http://www.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?medi
operations a=final.pdf
 Enables the hardware implementation of performance
enhancement techniques to be simpler
 Can be higher performance than stricter ordering

 Disadvantage
 More burden on the programmer or software (need to get the
“fences” correct)

 Another example of the programmer-microarchitect tradeoff


Caching in Multiprocessors
 Caching not only complicates ordering of all operations…
 A memory location can be present in multiple caches
 Prevents the effect of a store or load to be seen by other Cache Coherence
processors  makes it difficult for all processors to see the
same global order of (all) memory operations

 … but it also complicates ordering of operations on a single


memory location
 A memory location can be present in multiple caches
 Makes it difficult for processors that have cached the same
location to have the correct value of that location (in the
presence of updates to that location)


Readings: Cache Coherence Shared Memory Model


 Required  Many parallel programs communicate through shared memory
 Culler and Singh, Parallel Computer Architecture  Proc 0 writes to an address, followed by Proc 1 reading
 Chapter 5.1 (pp 269 – 283), Chapter 5.3 (pp 291 – 305)  This implies communication between the two
 P&H, Computer Organization and Design
 Chapter 5.8 (pp 534 – 538 in 4th and 4th revised eds.)
 Papamarcos and Patel, “A low-overhead coherence solution for multiprocessors with
private cache memories,” ISCA 1984. Proc 0 Proc 1
Mem[A] = 1 …
 Recommended
Print Mem[A]
 Censier and Feautrier, “A new solution to coherence problems in multicache systems,”
IEEE Trans. Computers, 1978.
 Goodman, “Using cache memory to reduce processor-memory traffic,” ISCA 1983.
 Laudon and Lenoski, “The SGI Origin: a ccNUMA highly scalable server,” ISCA 1997.  Each read should receive the value last written by anyone
 Martin et al, “Token coherence: decoupling performance and correctness,” ISCA 2003.  This requires synchronization (what does last written mean?)
 Baer and Wang, “On the inclusion properties for multi-level cache hierarchies,” ISCA  What if Mem[A] is cached (at either end)?
1988.


Cache Coherence
 Basic question: If multiple processors cache the same block, how do they ensure they all see a consistent state?
 [Figure: P1 and P2, each with a private cache, connected to main memory through an interconnection network; memory location x holds 1000]

The Cache Coherence Problem
 [Figure: P2 executes ld r2, x and caches the value 1000; memory location x still holds 1000]

The Cache Coherence Problem (continued)
 [Figure, left: both P1 and P2 execute ld r2, x; each cache now holds 1000, and memory still holds 1000]
 [Figure, right: P1 then executes add r1, r2, r4 and st x, r1, updating its cached copy to 2000; P2’s cache still holds the stale value 1000, and memory still holds 1000]

The Cache Coherence Problem (continued)
 [Figure: after P1’s store, P2 executes ld r5, x — it should NOT load the stale value 1000 from its cache]

Cache Coherence: Whose Responsibility?
 Software
  Can the programmer ensure coherence if caches are invisible to software?
  What if the ISA provided a cache flush instruction?
   FLUSH-LOCAL A: Flushes/invalidates the cache block containing address A from a processor’s local cache.
   FLUSH-GLOBAL A: Flushes/invalidates the cache block containing address A from all other processors’ caches.
   FLUSH-CACHE X: Flushes/invalidates all blocks in cache X.
 Hardware
  Simplifies software’s job
  One idea: Invalidate all other copies of block A when a processor writes to it

A Very Simple Coherence Scheme (VI)
 Caches “snoop” (observe) each other’s write/read operations. If a processor writes to a block, all others invalidate the block.
 A simple protocol for a write-through, no-write-allocate cache
 [State diagram: two states, Valid and Invalid
   Valid:   PrRd/--;  PrWr/BusWr;  observed BusWr → Invalid
   Invalid: PrRd/BusRd → Valid;  PrWr/BusWr]
 Actions of the local processor on the cache block: PrRd, PrWr
 Actions that are broadcast on the bus for the block: BusRd, BusWr

(Non-)Solutions to Cache Coherence
 No hardware based coherence
  Keeping caches coherent is software’s responsibility
  + Makes microarchitect’s life easier
  -- Makes average programmer’s life much harder
   Need to worry about hardware caches to maintain program correctness
  -- Overhead in ensuring coherence in software (e.g., page protection and page-based software coherence)
 All caches are shared between all processors
  + No need for coherence
  -- Shared cache becomes the bandwidth bottleneck
  -- Very hard to design a scalable system with low-latency cache access this way

Maintaining Coherence Hardware Cache Coherence


 Need to guarantee that all processors see a consistent  Basic idea:
value (i.e., consistent updates) for the same memory  A processor/cache broadcasts its write/update to a memory
location location to all other processors
 Another cache that has the location either updates or
 Writes to location A by P0 should be seen by P1 invalidates its local copy
(eventually), and all writes to A should appear in some
order

 Coherence needs to provide:


 Write propagation: guarantee that updates will propagate
 Write serialization: provide a consistent global order seen
by all processors

 Need a global point of serialization for this store ordering



Coherence: Update vs. Invalidate Coherence: Update vs. Invalidate (II)


 How can we safely update replicated data?  On a Write:
 Option 1 (Update protocol): push an update to all copies  Read block into cache as before

 Option 2 (Invalidate protocol): ensure there is only one Update Protocol:


copy (local), update it  Write to block, and simultaneously broadcast written
data and address to sharers
 On a Read:  (Other nodes update the data in their caches if block is
 If local copy is Invalid, put out request
present)
Invalidate Protocol:
 (If another node has a copy, it returns it, otherwise
memory does)  Write to block, and simultaneously broadcast invalidation
of address to sharers
 (Other nodes invalidate block in their caches if block is
present)


Update vs. Invalidate Tradeoffs Two Cache Coherence Methods


 Which do we want?  How do we ensure that the proper caches are updated?
 Write frequency and sharing behavior are critical
 Update  Snoopy Bus [Goodman ISCA 1983, Papamarcos+ ISCA 1984]
+ If sharer set is constant and updates are infrequent, avoids  Bus-based, single point of serialization for all memory requests

the cost of invalidate-reacquire (broadcast update pattern)  Processors observe other processors’ actions
E.g.: P1 makes “read-exclusive” request for A on bus, P0 sees this
- If data is rewritten without intervening reads by other cores, 
and invalidates its own copy of A
updates were useless
- Write-through cache policy  bus becomes bottleneck
 Directory [Censier and Feautrier, IEEE ToC 1978]
 Invalidate  Single point of serialization per block, distributed among nodes
+ After invalidation broadcast, core has exclusive access rights  Processors make explicit requests for blocks
+ Only cores that keep reading after each write retain a copy  Directory tracks which caches have each block
- If write contention is high, leads to ping-ponging (rapid  Directory coordinates invalidation and updates
mutual invalidation-reacquire)  E.g.: P1 asks directory for exclusive copy, directory asks P0 to
invalidate, waits for ACK, then responds to P1

Directory Based Coherence


 Idea: A logically-central directory keeps track of where the
copies of each cache block reside. Caches consult this
Directory Based directory to ensure coherence.

Cache Coherence  An example mechanism:


 For each cache block in memory, store P+1 bits in directory
 One bit for each cache, indicating whether the block is in cache
 Exclusive bit: indicates that a cache has the only copy of the block
and can update it without notifying others
 On a read: set the cache’s bit and arrange the supply of data
 On a write: invalidate all caches that have the block and reset
their bits
 Have an “exclusive bit” associated with each block in each cache
(so that the cache can update the exclusive block silently)

Directory Based Coherence Example (I) Directory Based Coherence Example (I)


A Note on 740 Next Semester


 If you like 447, 740 is the next course in sequence
18-447  Tentative Time: Lect. MW 7:30-9:20pm, Rect. T 7:30pm
Content:
Computer Architecture 

 Lectures: More advanced, with a different perspective


Lecture 29: Cache Coherence  Recitations: Delving deeper into papers, advanced topics
 Readings: Many fundamental and research readings; will do
many reviews
 Project: More open ended research project. Proposal 
milestones  final poster and presentation
Prof. Onur Mutlu  Exams: lighter and fewer
Carnegie Mellon University  Homeworks: None
Spring 2015, 4/10/2015


Where We Are in Lecture Schedule


 The memory hierarchy
 Caches, caches, more caches
 Virtualizing the memory hierarchy: Virtual Memory Cache Coherence
 Main memory: DRAM
 Main memory control, scheduling
 Memory latency tolerance techniques
 Non-volatile memory

 Multiprocessors
 Coherence and consistency
 Interconnection networks
 Multi-core issues (e.g., heterogeneous multi-core)

Readings: Cache Coherence Review: Two Cache Coherence Methods


 Required  How do we ensure that the proper caches are updated?
 Culler and Singh, Parallel Computer Architecture
 Chapter 5.1 (pp 269 – 283), Chapter 5.3 (pp 291 – 305)
 P&H, Computer Organization and Design
 Snoopy Bus [Goodman ISCA 1983, Papamarcos+ ISCA 1984]
 Chapter 5.8 (pp 534 – 538 in 4th and 4th revised eds.)  Bus-based, single point of serialization for all memory requests

 Papamarcos and Patel, “A low-overhead coherence solution for multiprocessors with  Processors observe other processors’ actions
private cache memories,” ISCA 1984.  E.g.: P1 makes “read-exclusive” request for A on bus, P0 sees this
and invalidates its own copy of A
 Recommended
 Censier and Feautrier, “A new solution to coherence problems in multicache systems,”  Directory [Censier and Feautrier, IEEE ToC 1978]
IEEE Trans. Computers, 1978.
 Single point of serialization per block, distributed among nodes
 Goodman, “Using cache memory to reduce processor-memory traffic,” ISCA 1983.
 Laudon and Lenoski, “The SGI Origin: a ccNUMA highly scalable server,” ISCA 1997.  Processors make explicit requests for blocks
 Martin et al, “Token coherence: decoupling performance and correctness,” ISCA 2003.  Directory tracks which caches have each block
 Baer and Wang, “On the inclusion properties for multi-level cache hierarchies,” ISCA  Directory coordinates invalidation and updates
1988.
 E.g.: P1 asks directory for exclusive copy, directory asks P0 to
invalidate, waits for ACK, then responds to P1

Review: Directory Based Coherence


 Idea: A logically-central directory keeps track of where the
copies of each cache block reside. Caches consult this
Directory Based directory to ensure coherence.

Cache Coherence  An example mechanism:


 For each cache block in memory, store P+1 bits in directory
 One bit for each cache, indicating whether the block is in cache
 Exclusive bit: indicates that a cache has the only copy of the block
and can update it without notifying others
 On a read: set the cache’s bit and arrange the supply of data
 On a write: invalidate all caches that have the block and reset
their bits
 Have an “exclusive bit” associated with each block in each cache
(so that the cache can update the exclusive block silently)

Directory Based Coherence Example (I) Directory Based Coherence Example (I)


Snoopy Cache Coherence


 Idea:
 All caches “snoop” all other caches’ read/write requests and
Snoopy Cache Coherence keep the cache block coherent
 Each cache block has “coherence metadata” associated with it
in the tag store of each cache

 Easy to implement if all caches share a common bus


 Each cache broadcasts its read/write operations on the bus
 Good for small-scale multiprocessors
 What if you would like to have a 1000-node multiprocessor?


A Simple Snoopy Cache Coherence Protocol
 Caches “snoop” (observe) each other’s write/read operations
 A simple protocol (VI protocol): write-through, no-write-allocate cache
 [State diagram: two states, Valid and Invalid
   Valid:   PrRd/--;  PrWr/BusWr;  observed BusWr → Invalid
   Invalid: PrRd/BusRd → Valid;  PrWr/BusWr]
 Actions of the local processor on the cache block: PrRd, PrWr
 Actions that are broadcast on the bus for the block: BusRd, BusWr

Extending the Protocol A More Sophisticated Protocol: MSI


 What if you want write-back caches?  Extend metadata per block to encode three states:
 We want a “modified” state  M(odified): cache line is the only cached copy and is dirty
 S(hared): cache line is potentially one of several cached
copies
 I(nvalid): cache line is not present in this cache

 Read miss makes a Read request on bus, transitions to S


 Write miss makes a ReadEx request, transitions to M state
 When a processor snoops ReadEx from another writer, it
must invalidate its own copy (if any)
 SM upgrade can be made without re-reading data from
memory (via Invalidations)
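A toy C sketch of the MSI transitions just described, for a single block as seen by one cache controller (an illustrative assumption, not any real controller's or the lab's code):

    #include <stdio.h>

    typedef enum { INVALID, SHARED, MODIFIED } msi_state_t;
    typedef enum { PR_RD, PR_WR, BUS_RD, BUS_RDX } msi_event_t;

    /* Returns the next state; sets *issue_busrdx / *flush when the cache
     * must put a ReadEx request or the dirty data on the bus. */
    msi_state_t msi_next(msi_state_t s, msi_event_t e,
                         int *issue_busrdx, int *flush)
    {
        *issue_busrdx = 0;
        *flush = 0;
        switch (s) {
        case INVALID:
            if (e == PR_RD)   return SHARED;                          /* Read miss -> BusRd   */
            if (e == PR_WR)   { *issue_busrdx = 1; return MODIFIED; } /* Write miss -> BusRdX */
            return INVALID;
        case SHARED:
            if (e == PR_WR)   { *issue_busrdx = 1; return MODIFIED; } /* S->M upgrade         */
            if (e == BUS_RDX) return INVALID;                         /* another writer seen  */
            return SHARED;                                            /* PrRd, BusRd: no-op   */
        case MODIFIED:
            if (e == BUS_RD)  { *flush = 1; return SHARED;  }         /* supply dirty data    */
            if (e == BUS_RDX) { *flush = 1; return INVALID; }
            return MODIFIED;                                          /* PrRd, PrWr: silent   */
        }
        return s;
    }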

MSI State Machine
 [State diagram (ObservedEvent/Action, after Culler/Singh96):
   M: PrRd/--, PrWr/--;  BusRd/Flush → S;  BusRdX/Flush → I
   S: PrRd/--, BusRd/--;  PrWr/BusRdX → M;  BusRdX/-- → I
   I: PrRd/BusRd → S;  PrWr/BusRdX → M]

The Problem with MSI
 A block is in no cache to begin with
 Problem: On a read, the block immediately goes to “Shared” state although it may be the only copy to be cached (i.e., no other processor will cache it)
 Why is this a problem?
  Suppose the cache that read the block wants to write to it at some point
  It needs to broadcast “invalidate” even though it has the only cached copy!
  If the cache knew it had the only cached copy in the system, it could have written to the block without notifying any other cache → saves unnecessary broadcasts of invalidations

The Solution: MESI


 Idea: Add another state indicating that this is the only
cached copy and it is clean.
 Exclusive state

 Block is placed into the exclusive state if, during BusRd, no


other cache had it
 Wired-OR “shared” signal on bus can determine this:
snooping caches assert the signal if they also have a copy

 Silent transition ExclusiveModified is possible on write!

 MESI is also called the Illinois protocol


 Papamarcos and Patel, “A low-overhead coherence solution for
multiprocessors with private cache memories,” ISCA 1984.

MESI State Machine
 [State diagram (after Culler/Singh96) with states M, E, S, I; labeled transitions include PrWr/--, PrWr/BusRdX, BusRd/Flush, BusRd/$ Transfer, PrRd (S’)/BusRd, PrRd (S)/BusRd, and BusRdX/Flush on all incoming snoops]

MESI State Machine from Lab 8 MESI State Machine from Lab 8

A transition from a single-owner state (Exclusive or Modified) to Shared is called a


downgrade, because the transition takes away the owner's right to modify the data

A transition from Shared to a single-owner state (Exclusive or Modified) is called an


upgrade, because the transition grants the ability to the owner (the cache which contains
the respective block) to write to the block.

Intel Pentium Pro Snoopy Invalidation Tradeoffs


 Should a downgrade from M go to S or I?
 S: if data is likely to be reused (before it is written to by another
processor)
 I: if data is likely to be not reused (before it is written to by another)
 Cache-to-cache transfer
 On a BusRd, should data come from another cache or memory?
 Another cache
 May be faster, if memory is slow or highly contended
 Memory
 Simpler: no need to wait to see if another cache has the data first
 Less contention at the other caches
 Requires writeback on M downgrade
 Writeback on Modified->Shared: necessary?
 One possibility: Owner (O) state (MOESI protocol)

 One cache owns the latest data (memory is not updated)


 Memory writeback happens when all caches evict copies

Slide credit: Yale Patt



The Problem with MESI Improving on MESI


 Observation: Shared state requires the data to be clean
 i.e., all caches that have the block have the up-to-date copy  Idea 1: Do not transition from MS on a BusRd. Invalidate
and so does the memory the copy and supply the modified block to the requesting
 Problem: Need to write the block to memory when BusRd processor directly without updating memory
happens when the block is in Modified state
 Idea 2: Transition from MS, but designate one cache as
 Why is this a problem? the owner (O), who will write the block back when it is
 Memory can be updated unnecessarily  some other evicted
processor may want to write to the block again  Now “Shared” means “Shared and potentially dirty”
 This is a version of the MOESI protocol


Tradeoffs in Sophisticated Cache Coherence Protocols Revisiting Two Cache Coherence Methods
 The protocol can be optimized with more states and  How do we ensure that the proper caches are updated?
prediction mechanisms to
+ Reduce unnecessary invalidates and transfers of blocks  Snoopy Bus [Goodman ISCA 1983, Papamarcos+ ISCA 1984]
 Bus-based, single point of serialization for all memory requests

Processors observe other processors’ actions


 However, more states and optimizations 
 E.g.: P1 makes “read-exclusive” request for A on bus, P0 sees this
-- Are more difficult to design and verify (lead to more cases to and invalidates its own copy of A
take care of, race conditions)
-- Provide diminishing returns  Directory [Censier and Feautrier, IEEE ToC 1978]
 Single point of serialization per block, distributed among nodes

 Processors make explicit requests for blocks


 Directory tracks which caches have each block
 Directory coordinates invalidation and updates
 E.g.: P1 asks directory for exclusive copy, directory asks P0 to
invalidate, waits for ACK, then responds to P1

Snoopy Cache vs. Directory Coherence


 Snoopy Cache
+ Miss latency (critical path) is short: request  bus transaction to mem.
+ Global serialization is easy: bus provides this already (arbitration)
+ Simple: can adapt bus-based uniprocessors easily
Revisiting Directory-Based
- Relies on broadcast messages to be seen by all caches (in same order):
 single point of serialization (bus): not scalable
Cache Coherence
 need a virtual bus (or a totally-ordered interconnect)

 Directory
- Adds indirection to miss latency (critical path): request  dir.  mem.
- Requires extra storage space to track sharer sets
 Can be approximate (false positives are OK for correctness)
- Protocols and race conditions are more complex (for high-performance)
+ Does not require broadcast to all caches
+ Exactly as scalable as interconnect and directory storage
(much more scalable than bus)

Remember: Directory Based Coherence Remember: Directory Based Coherence


 Idea: A logically-central directory keeps track of where the Example
copies of each cache block reside. Caches consult this
directory to ensure coherence.

 An example mechanism:
 For each cache block in memory, store P+1 bits in directory
 One bit for each cache, indicating whether the block is in cache
 Exclusive bit: indicates that the cache that has the only copy of
the block and can update it without notifying others
 On a read: set the cache’s bit and arrange the supply of data
 On a write: invalidate all caches that have the block and reset
their bits
 Have an “exclusive bit” associated with each block in each
cache

Directory-Based Protocols
 Required when scaling past the capacity of a single bus
 Distributed, but:
  Coherence still requires a single point of serialization (for write serialization)
  Serialization location can be different for every block (striped across nodes)
 We can reason about the protocol for a single block: one server (directory node), many clients (private caches)
 Directory receives Read and ReadEx requests, and sends Invl requests: invalidation is explicit (as opposed to snoopy buses)

Directory: Data Structures
 Example directory contents (per block address):
   0x00  Shared: {P0, P1, P2}
   0x04  ---
   0x08  Exclusive: P2
   0x0C  ---
   …     ---
 Required to support invalidation and cache block requests
 Key operation to support is set inclusion test
  False positives are OK: want to know which caches may contain a copy of a block, and spurious invalidations are ignored
  False positive rate determines performance
 Most accurate (and expensive): full bit-vector
 Compressed representation, linked list, Bloom filters are all possible
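A C sketch of the full bit-vector directory entry described above (P+1 bits per block: one sharer bit per cache plus an exclusive bit); the structure and helper names are illustrative assumptions rather than a specific machine's format.

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_CACHES 8

    typedef struct {
        uint8_t sharers;     /* bit i set => cache i may hold the block      */
        bool    exclusive;   /* one cache holds the only (writable) copy     */
    } dir_entry_t;

    /* Set inclusion test: may cache c hold this block? */
    static bool dir_may_have_copy(const dir_entry_t *e, int c)
    {
        return (e->sharers >> c) & 1;
    }

    /* Read by cache c: record the new sharer (data supplied to c). */
    static void dir_on_read(dir_entry_t *e, int c)
    {
        e->sharers |= (uint8_t)(1u << c);
        e->exclusive = false;            /* more than one copy may now exist */
    }

    /* Write by cache c: invalidate every other sharer (a real directory
     * would send Invl messages for each set bit here), then grant
     * exclusive ownership to c. */
    static void dir_on_write(dir_entry_t *e, int c)
    {
        e->sharers = (uint8_t)(1u << c);
        e->exclusive = true;
    }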

Directory: Basic Operations MESI Directory Transaction: Read


 Follow semantics of snoop-based system
 but with explicit request, reply messages

P0 acquires an address for reading:


 Directory: 1. Read
 Receives Read, ReadEx, Upgrade requests from nodes
 Sends Inval/Downgrade messages to sharers if needed
P0 Home
 Forwards request to memory if needed
 Replies to requestor and updates sharing state
2. DatEx (DatShr)

 Protocol design is flexible


 Exact forwarding paths depend on implementation P1
 For example, do cache-to-cache transfer?

Culler/Singh Fig. 8.16



RdEx with Former Owner
 [Figure: P0 sends RdEx (1) to Home; Home sends Invl (2) to the former Owner; the Owner revokes its copy (3a. Rev) and the data is returned exclusively to P0 (3b. DatEx)]

Contention Resolution (for Write)
 [Figure: P0 and P1 both send RdEx (1a, 1b) to Home; Home grants the data exclusively to P0 (2a. DatEx) and NACKs P1 (2b); P1 retries its RdEx (3); Home sends Invl to P0 (4), P0 revokes its copy (5a. Rev), and the data is sent exclusively to P1 (5b. DatEx)]

Issues with Contention Resolution Scaling the Directory: Some Questions


 Need to escape race conditions by:  How large is the directory?
 NACKing requests to busy (pending invalidate) entries
 Original requestor retries
 OR, queuing requests and granting in sequence  How can we reduce the access latency to the directory?
 (Or some combination thereof)

 Fairness
 How can we scale the system to thousands of nodes?
 Which requestor should be preferred in a conflict?
 Interconnect delivery order, and distance, both matter

 Ping-ponging is a higher-level issue  Can we get the best of snooping and directory protocols?
 Heterogeneity
 With solutions like combining trees (for locks/barriers) and
better shared-data-structure design  E.g., token coherence [Martin+, ISCA 2003]


Motivation: Three Desirable Attributes

Low-latency cache-to-cache misses


Advancing Coherence

No bus-like interconnect Bandwidth efficient

Dictated by workload and technology trends


Workload Trends Workload Trends


1 Low-latency cache-to-cache misses
 Commercial workloads
 Many cache-to-cache misses P P P M
 Clusters of small multiprocessors
2
• Goals:
– Direct cache-to-cache misses 1
(2 hops, not 3 hops) Directory
– Moderate scalability Protocol P P P M No bus-like interconnect Bandwidth efficient

3 2
Workload trends  snooping protocols

Workload Trends Snooping Protocols Technology Trends


 High-speed point-to-point links
Low-latency cache-to-cache misses  No (multi-drop) busses
(Yes: direct
request/response)
• Increasing design integration
– “Glueless” multiprocessors
– Improve cost & latency

• Desire: low-latency interconnect


No bus-like interconnect Bandwidth efficient
– Avoid “virtual bus” ordering
(No: requires a “virtual bus”) (No: broadcast always)
– Enabled by directory protocols

Technology trends  unordered interconnects



Technology Trends Technology Trends  Directory Protocols

Low-latency cache-to-cache misses Low-latency cache-to-cache misses


(No: indirection
through directory)

No bus-like interconnect Bandwidth efficient No bus-like interconnect Bandwidth efficient


(Yes: no ordering required) (Yes: avoids broadcast)




Goal: All Three Attributes Token Coherence: Key Insight


 Goal of invalidation-based coherence
Low-latency cache-to-cache misses  Invariant: many readers -or- single writer
 Enforced by globally coordinated actions

Key insight
Step#1  Enforce this invariant directly using tokens
 Fixed number of tokens per block
 One token to read, all tokens to write
Step#2

 Guarantees safety in all cases


No bus-like interconnect Bandwidth efficient  Global invariant enforced with only local rules
 Independent of races, request ordering, etc.

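A toy C sketch of the token-counting invariant stated above (one token to read, all tokens to write). This is only the safety substrate; Token Coherence [Martin+, ISCA 2003] layers persistent requests, timeouts, and performance policies on top, which are omitted here as assumptions outside the slide's scope.

    #include <stdbool.h>

    #define TOTAL_TOKENS 16   /* fixed number of tokens per block */

    typedef struct {
        int tokens;           /* tokens currently held by this cache for the block */
    } token_copy_t;

    /* Safety rules enforced purely locally. */
    static bool can_read(const token_copy_t *c)  { return c->tokens >= 1; }
    static bool can_write(const token_copy_t *c) { return c->tokens == TOTAL_TOKENS; }

    /* Moving tokens between holders preserves the global invariant:
     * the sum of tokens over all caches and memory stays TOTAL_TOKENS,
     * so "many readers or a single writer" holds regardless of races
     * or request ordering. */
    static void transfer_tokens(token_copy_t *from, token_copy_t *to, int n)
    {
        if (n > from->tokens) n = from->tokens;
        from->tokens -= n;
        to->tokens   += n;
    }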

A Case for 18-447


Computer Architecture
Asymmetry Everywhere
Lecture 30: In-memory Processing

Onur Mutlu,
"Asymmetry Everywhere (with Automatic Resource Management)" Vivek Seshadri
CRA Workshop on Advancing Computer Architecture Research: Popular
Carnegie Mellon University
Parallel Programming, San Diego, CA, February 2010.
Position paper Spring 2015, 4/13/2015


Goals for This Lecture DRAM Module and Chip


 Understand DRAM technology
 How it is built?
 How it operates?
 What are the trade-offs?

 Can we use DRAM for more than just storage?


 In-DRAM copying
 In-DRAM bitwise operations


Goals of DRAM Design DRAM Chip


 Cost
 Latency
 Bandwidth
 Parallelism
 Power
 Energy
Bank
 Reliability


DRAM Cell – Capacitor
 [Figure: a capacitor with top and bottom plates; the empty state encodes logical “0”, the fully charged state encodes logical “1”]
 1. Small – cannot drive circuits
 2. Reading destroys the state

Sense Amplifier
 [Figure: a pair of cross-coupled inverters with an enable signal]

Sense Amplifier – Two Stable States
 [Figure: with the amplifier enabled, one terminal at VDD and the other at 0 is a stable state; VDD/0 encodes logical “1”, 0/VDD encodes logical “0”]

Sense Amplifier Operation
 [Figure: the amplifier is enabled with a small voltage difference between its terminals (VT > VB); it amplifies the difference, driving the higher terminal to VDD and the lower terminal to 0]

Capacitor to Sense Amplifier
 [Figure: the cell capacitor is connected to an amplifier terminal held at ½VDD; charge sharing perturbs the terminal to ½VDD + δ (or − δ), which the enabled amplifier then drives to full VDD or 0]

DRAM Cell Operation
 [Figure: on access, the cell loses its charge onto the bitline; once the sense amplifier drives the bitline to full VDD or 0, the cell regains its charge]

Amortizing Cost – DRAM Tile
 [Figure: a DRAM tile — a cell array with a row decoder/row driver on one side and an array of sense amplifiers below]

DRAM Subarray
 [Figure: several tiles side by side form a subarray; the row decoder/driver spans the tiles, and the subarray shares an array of sense amplifiers (8Kb)]

DRAM Chip
 [Figure: many subarrays with their row decoders and sense amplifiers form a bank; each bank has Bank I/O (64b) connected to a shared internal bus, which feeds the memory channel (8 bits per chip)]

DRAM Bank
 [Figure: a bank receives a row address (decoded by the row decoders) and a column address (selecting data through the Bank I/O); data moves between the cell arrays and the I/O via the array of sense amplifiers]

DRAM Operation
 [Figure: the access sequence on a bank —
   1. ACTIVATE row address (bring the row into the sense amplifiers)
   2. READ/WRITE column address (move data through the Bank I/O)
   3. PRECHARGE (prepare the bank for the next activation)]

Goals for This Lecture
 Understand DRAM technology
  How it is built?
  How it operates?
  What are the trade-offs?
 Can we use DRAM for more than just storage?
  In-DRAM copying
  In-DRAM bitwise operations

Trade-offs in DRAM Design
 Cost
 Latency
 Bandwidth
 Parallelism
 Power
 Energy
 Reliability
 Design knobs include rows per subarray, data width, chips per DIMM, and banks per chip

Goals for This Lecture


 Understand DRAM technology


How it is built?
How it operates? RowClone
 What are the trade-offs?

 Can we use DRAM for more than just storage? Fast and Energy-Efficient In-DRAM Bulk
 In-DRAM copying Data Copy and Initialization
In-DRAM bitwise operations

Vivek Seshadri
Y. Kim, C. Fallin, D. Lee, R. Ausavarungnirun,
G. Pekhimenko, Y. Luo, O. Mutlu,
P. B. Gibbons, M. A. Kozuch, T. C. Mowry


Memory Channel – Bottleneck
 [Figure: cores and their caches connect to memory through the memory controller (MC) and a narrow channel — limited bandwidth, high energy]

Goal: Reduce Memory Bandwidth Demand
 Reduce unnecessary data movement over the memory channel

Bulk Data Copy and Initialization
 [Figure: bulk data copy moves a source region (src) to a destination region (dst); bulk data initialization writes a value (val) across a destination region (dst)]

Bulk Copy and Initialization – Applications
 Zero initialization (e.g., security)
 Forking
 Checkpointing
 VM Cloning / Deduplication
 Page Migration
 Many more

Shortcomings of Existing Approach
 [Figure: copying 4KB from src to dst today moves all the data over the memory channel and through the caches]
 High energy (3600nJ to copy 4KB)
 High latency (1046ns to copy 4KB)
 Interference with other requests on the channel

Our Approach: In-DRAM Copy with Low Cost
 [Figure: perform the src → dst copy inside DRAM, without moving data over the memory channel — avoiding the high energy, high latency, and interference above]

RowClone: In-DRAM Copy

Bulk Copy in DRAM – RowClone
 [Figure: activating the source row lets the sense amplifiers latch its data (the bitlines move from ½VDD to full VDD/0); activating the destination row in the same subarray next makes the sense amplifiers drive that data into the destination cells — the data gets copied entirely inside DRAM]

Fast Parallel Mode – Benefits
 Bulk data copy (4KB across a module)
  Latency: 11X reduction (1046ns to 90ns)
  Energy: 74X reduction (3600nJ to 40nJ)
 No bandwidth consumption
 Very little change to the DRAM chip

Fast Parallel Mode – Constraints
 Location constraint
  Source and destination in same subarray
 Size constraint
  Entire row gets copied (no partial copy)
 1. Can still accelerate many existing primitives (copy-on-write, bulk zeroing)
 2. Alternate mechanism to copy data across banks (pipelined serial mode – lower benefits than Fast Parallel)

End-to-end System Design
 Software interface
  memcpy and meminit instructions
 Managing cache coherence
  Use existing DMA support!
 Maximizing use of Fast Parallel Mode
  Smart OS page allocation

Applications Summary
 [Plot: fraction of memory traffic due to zero, copy, write, and read operations for bootup, compile, forkbench, mcached, mysql, and shell]

Results Summary
 [Plot: IPC improvement and memory energy reduction compared to the baseline for the same workloads]

Goals for This Lecture
 Understand DRAM technology
  How it is built?
  How it operates?
  What are the trade-offs?
 Can we use DRAM for more than just storage?
  In-DRAM copying
  In-DRAM bitwise operations

Triple Row Activation
 [Figure: simultaneously activating three rows A, B, C connected to the same sense amplifier; charge sharing pulls the bitline toward the majority value, and the final state written back to all three cells is the majority function]
 Final state = AB + BC + AC = C(A + B) + ~C(AB)
  → with C = 1 the result is A OR B; with C = 0 the result is A AND B

In-DRAM Bitwise AND/OR

 Required operation: perform a bitwise AND of two rows A and B and store the result in C
 R0 – reserved zero row, R1 – reserved one row
 D1, D2, D3 – designated rows for triple activation

 1. RowClone A into D1
 2. RowClone B into D2
 3. RowClone R0 into D3
 4. ACTIVATE D1,D2,D3
 5. RowClone Result into C

Throughput Results

[Figure: AND/OR throughput (GB/s) vs. size of the vectors involved (8KB to 32MB), comparing Intel AVX on one core against the aggressive proposal using one bank; the AVX throughput drops as the vectors spill out of the L1, L2, and L3 caches into memory]

1825 1826
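To see why triple row activation yields AND and OR, here is a minimal Python sketch that emulates the logic (not the DRAM circuit). The 8-bit "row" width is an arbitrary assumption for illustration; copying the reserved zero row into the third designated row gives AND, and copying the reserved one row gives OR.

```python
def maj(a: int, b: int, c: int) -> int:
    """Bit-wise majority of three equal-width integers (rows): AB + BC + AC."""
    return (a & b) | (b & c) | (a & c)

def in_dram_and(row_a: int, row_b: int) -> int:
    # Step 3 copies the reserved zero row (R0), so the majority reduces to A AND B.
    return maj(row_a, row_b, 0)

def in_dram_or(row_a: int, row_b: int) -> int:
    # Copying the reserved one row (R1, all ones) instead gives A OR B.
    all_ones = (1 << 8) - 1   # assume 8-bit rows for illustration
    return maj(row_a, row_b, all_ones)

assert in_dram_and(0b1100, 0b1010) == 0b1000
assert in_dram_or(0b1100, 0b1010) == 0b1110
```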

Bitmap Index

 Alternative to B-tree and its variants
 Efficient for performing range queries and joins
 Example: one bitmap per age bin (age < 18, 18 < age < 25, 25 < age < 60, age > 60); each bit indicates whether a record falls into that bin
 Range queries and joins then become bulk bitwise AND/OR operations over the bitmaps (see the sketch below)

Performance Evaluation

[Figure: performance relative to the baseline vs. number of OR bins (3 to 128) for Conservative and Aggressive in-DRAM designs with 1 or 4 banks]

1827 1828
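A minimal sketch of the bitmap-index idea, with the age bins from the slide; the records and helper names are made up for illustration. A query such as "age < 25" is an OR of the matching bin bitmaps, which is exactly the kind of row-wide bitwise operation that can be pushed into DRAM.

```python
records = [("alice", 16), ("bob", 21), ("carol", 40), ("dave", 70)]

def build_bitmap(pred):
    bits = 0
    for i, (_, age) in enumerate(records):
        if pred(age):
            bits |= 1 << i
    return bits

bm_lt18  = build_bitmap(lambda a: a < 18)
bm_18_25 = build_bitmap(lambda a: 18 <= a < 25)
bm_25_60 = build_bitmap(lambda a: 25 <= a < 60)
bm_gt60  = build_bitmap(lambda a: a >= 60)

# "age < 25" = OR of the first two bins; bitwise ops replace per-record scans.
age_lt_25 = bm_lt18 | bm_18_25
matches = [records[i][0] for i in range(len(records)) if (age_lt_25 >> i) & 1]
assert matches == ["alice", "bob"]
```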

Goals for This Lecture

 Understand DRAM technology
 How it is built?
 How it operates?
 What are the trade-offs?

 Can we use DRAM for more than just storage?
 In-DRAM copying
 In-DRAM bitwise operations

1829

18-447
Computer Architecture
Lecture 31: Predictable Performance

Lavanya Subramanian
Carnegie Mellon University
Spring 2015, 4/15/2015

Shared Resource Interference

[Figure: 16 cores sharing a cache and main memory]

High and Unpredictable Application Slowdowns

[Figure: slowdowns of leslie3d (core 0) running with gcc (core 1) vs. running with mcf (core 1)]

 1. High application slowdowns due to shared resource interference
 2. An application's performance depends on which application it is running with

1831 1832

Need for Predictable Performance

 There is a need for predictable performance
 When multiple applications share resources
 Especially if some applications require performance guarantees

 Example 1: In virtualized systems
 Different users' jobs consolidated onto the same server
 Need to meet performance requirements of critical jobs

 Example 2: In mobile systems
 Interactive applications run with non-interactive applications
 Need to guarantee performance for interactive applications

 Our Goal: Predictable performance in the presence of shared resources

Tackling Different Parts of the Shared Memory Hierarchy

[Figure: 16 cores sharing a cache and main memory – both the shared cache and the memory bandwidth are sources of interference]

1833 1834

Predictability in the Presence of Memory Bandwidth Interference (HPCA 2013)

 1. Estimate Slowdown
 Key Observations
 Implementation
 MISE Model: Putting it All Together
 Evaluating the Model
 2. Control Slowdown
 Providing Soft Slowdown Guarantees

[Figure: 16 cores sharing a cache and main memory]

1835 1836

Predictability in the Presence of Memory Bandwidth Interference

 1. Estimate Slowdown
 Key Observations
 Implementation
 MISE Model: Putting it All Together
 Evaluating the Model
 2. Control Slowdown
 Providing Soft Slowdown Guarantees

Slowdown: Definition

 Slowdown = Performance Alone / Performance Shared

1837 1838

Key Observation 1

 For a memory-bound application, Performance ∝ Memory request service rate
[Figure: normalized performance vs. normalized request service rate for omnetpp, mcf, and astar on an Intel Core i7 (4 cores) – the relationship is close to linear]
 Difficult: Slowdown = Performance Alone / Performance Shared
 Easy: Slowdown = Request Service Rate Alone / Request Service Rate Shared

Key Observation 2

 Request Service Rate Alone (RSRAlone) of an application can be estimated by giving the application highest priority at the memory controller
 Highest priority → little interference (almost as if the application were run alone)

1839 1840

Key Observation 2

[Figure: request buffer state and service order in three cases – 1. run alone; 2. run with another application (requests interleave and take longer); 3. run with another application but at highest priority (service order is almost the same as running alone)]

 Memory Interference-induced Slowdown Estimation (MISE) model for memory-bound applications:
 Slowdown = Request Service Rate Alone (RSRAlone) / Request Service Rate Shared (RSRShared)

1841 1842

Key Observation 3

 Memory-bound application: alternating compute phases and memory phases; the memory phase slowdown dominates the overall slowdown
 Non-memory-bound application: a fraction α of time is spent in the memory phase; only the memory fraction (α) slows down with interference

 Memory Interference-induced Slowdown Estimation (MISE) model for non-memory-bound applications:
 Slowdown = (1 - α) + α · (RSRAlone / RSRShared)
 (a numeric sketch of the model follows below)

1843 1844
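A minimal numeric sketch of the MISE formula above; the alpha and RSR values are made-up toy numbers, not measured data.

```python
def mise_slowdown(alpha: float, rsr_alone: float, rsr_shared: float) -> float:
    """Slowdown = (1 - alpha) + alpha * RSRAlone / RSRShared."""
    return (1.0 - alpha) + alpha * (rsr_alone / rsr_shared)

# A memory-bound application (alpha ~ 1) slows down by the full RSR ratio;
# a compute-bound one (alpha ~ 0.2) is largely insulated from interference.
print(mise_slowdown(alpha=1.0, rsr_alone=4.0, rsr_shared=2.0))  # 2.0
print(mise_slowdown(alpha=0.2, rsr_alone=4.0, rsr_shared=2.0))  # 1.2
```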

Predictability in the Presence of Memory Bandwidth Interference

 1. Estimate Slowdown
 Key Observations
 Implementation
 MISE Model: Putting it All Together
 Evaluating the Model
 2. Control Slowdown
 Providing Soft Slowdown Guarantees

Interval Based Operation

 Time is divided into intervals
 During each interval: measure RSRShared and α
 At the end of each interval: estimate RSRAlone, then estimate slowdown

1845 1846

Measuring RSRShared and α

 Request Service Rate Shared (RSRShared)
 Per-core counter to track number of requests serviced
 At the end of each interval, measure
 RSRShared = Number of Requests Served / Interval Length
 Memory Phase Fraction (α)
 Count number of stall cycles at the core
 Compute fraction of cycles stalled for memory

Estimating Request Service Rate Alone (RSRAlone)

 Goal: Estimate RSRAlone
 How: Periodically give each application highest priority in accessing memory
 Divide each interval into shorter epochs
 At the beginning of each epoch, randomly pick an application as the highest priority application
 At the end of an interval, for each application, estimate
 RSRAlone = Number of Requests During High Priority Epochs / Number of Cycles Application Given High Priority

Inaccuracy in Estimating RSRAlone

 When an application has highest priority, it still experiences some interference
[Figure: even with highest priority, requests from other applications that were issued earlier delay the high-priority application's requests]

Accounting for Interference in RSRAlone Estimation

 Solution: Determine and remove interference cycles from the ARSR (RSRAlone) calculation
 ARSR = Number of Requests During High Priority Epochs / (Number of Cycles Application Given High Priority - Interference Cycles)
 A cycle is an interference cycle if
 a request from the highest priority application is waiting in the request buffer and
 another application's request was issued previously
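As a rough illustration, this sketch shows how the two estimates could be computed from the counters described above; the counter names and numbers are assumptions for the example, not the paper's exact implementation.

```python
def estimate_rsr_alone(hp_requests: int, hp_cycles: int, interference_cycles: int) -> float:
    # Remove cycles in which the high-priority application was still
    # delayed by previously issued requests from other applications.
    return hp_requests / (hp_cycles - interference_cycles)

def estimate_rsr_shared(requests_served: int, interval_length: int) -> float:
    return requests_served / interval_length

# Example with made-up counter values for one interval:
rsr_alone = estimate_rsr_alone(hp_requests=120, hp_cycles=1000, interference_cycles=200)
rsr_shared = estimate_rsr_shared(requests_served=900, interval_length=10000)
print(rsr_alone, rsr_shared, rsr_alone / rsr_shared)   # the last value is the slowdown for alpha = 1
```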

Predictability in the Presence of Memory Bandwidth Interference

 1. Estimate Slowdown
 Key Observations
 Implementation
 MISE Model: Putting it All Together
 Evaluating the Model
 2. Control Slowdown
 Providing Soft Slowdown Guarantees

MISE Operation: Putting it All Together

 Time is divided into intervals
 During each interval: measure RSRShared and α
 At the end of each interval: estimate RSRAlone, then estimate slowdown

1851 1852

Predictability in the Presence of Memory Bandwidth Interference

 1. Estimate Slowdown
 Key Observations
 Implementation
 MISE Model: Putting it All Together
 Evaluating the Model
 2. Control Slowdown
 Providing Soft Slowdown Guarantees

Previous Work on Slowdown Estimation

 Previous work on slowdown estimation
 STFM (Stall Time Fair Memory) Scheduling [Mutlu et al., MICRO '07]
 FST (Fairness via Source Throttling) [Ebrahimi et al., ASPLOS '10]
 Per-thread Cycle Accounting [Du Bois et al., HiPEAC '13]
 Basic Idea:
 Slowdown = Stall Time Alone / Stall Time Shared
 Stall Time Alone is difficult to estimate: count the number of cycles the application receives interference
 Stall Time Shared is easy to measure

1853 1854

Two Major Advantages of MISE Over STFM

 Advantage 1:
 STFM estimates alone performance while an application is receiving interference → Difficult
 MISE estimates alone performance while giving an application the highest priority → Easier
 Advantage 2:
 STFM does not take into account the compute phase for non-memory-bound applications
 MISE accounts for the compute phase → Better accuracy

Methodology

 Configuration of our simulated system
 4 cores
 1 channel, 8 banks/channel
 DDR3 1066 DRAM
 512 KB private cache/core
 Workloads
 SPEC CPU2006
 300 multiprogrammed workloads

1855 1856

Quantitative Comparison

[Figure: actual slowdown vs. slowdown estimated by STFM and MISE over time (million cycles) for the SPEC CPU2006 application leslie3d]

Comparison to STFM

[Figure: actual and estimated slowdowns over time for cactusADM, GemsFDTD, soplex, wrf, calculix, and povray]

 Average error of MISE: 8.2%
 Average error of STFM: 29.4%
 (across 300 workloads)

1857 1858

Predictability in the Presence of Memory Bandwidth Interference

 1. Estimate Slowdown
 Key Observations
 Implementation
 MISE Model: Putting it All Together
 Evaluating the Model
 2. Control Slowdown
 Providing Soft Slowdown Guarantees

Possible Use Cases

 Bounding application slowdowns [HPCA '14]
 VM migration and admission control schemes [VEE '15]
 Fair billing schemes in a commodity cloud

1859 1860

Predictability in the Presence of Memory Bandwidth Interference

 1. Estimate Slowdown
 Key Observations
 Implementation
 MISE Model: Putting it All Together
 Evaluating the Model
 2. Control Slowdown
 Providing Soft Slowdown Guarantees

MISE-QoS: Providing "Soft" Slowdown Guarantees

 Goal
 1. Ensure QoS-critical applications meet a prescribed slowdown bound
 2. Maximize system performance for other applications
 Basic Idea
 Allocate just enough bandwidth to the QoS-critical application
 Assign remaining bandwidth to other applications

1861 1862
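One plausible control loop consistent with the "just enough bandwidth" idea above (a hedged sketch, not the exact MISE-QoS mechanism): each interval, compare the QoS-critical application's estimated slowdown against the bound and nudge its high-priority bandwidth fraction accordingly; the step size and estimates below are assumptions.

```python
def adjust_bandwidth(fraction, est_slowdown, bound, step=0.05):
    if est_slowdown > bound:            # bound violated: give more bandwidth
        fraction = min(1.0, fraction + step)
    elif est_slowdown < 0.9 * bound:    # comfortably within bound: give some back
        fraction = max(0.0, fraction - step)
    return fraction                     # the remaining (1 - fraction) goes to the other applications

frac = 0.5
for est in (3.4, 3.1, 2.6, 2.4):        # made-up slowdown estimates per interval
    frac = adjust_bandwidth(frac, est, bound=3.0)
    print(round(frac, 2))
```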

Methodology

 Each application (25 applications in total) considered as the QoS-critical application
 Run with 12 sets of co-runners of different memory intensities
 Total of 300 multiprogrammed workloads
 Each workload run with 10 slowdown bound values
 Baseline memory scheduling mechanism
 Always prioritize the QoS-critical application [Iyer et al., SIGMETRICS 2007]
 Other applications' requests scheduled in FR-FCFS order [Zuravleff and Robinson, US Patent 1997; Rixner+, ISCA 2000]

A Look at One Workload

[Figure: slowdowns of leslie3d (QoS-critical) and hmmer, lbm, omnetpp (non-QoS-critical) under AlwaysPrioritize and MISE-QoS-10/1 through MISE-QoS-10/9, for slowdown bounds of 10, 3.33, and 2]

 MISE is effective in
 1. meeting the slowdown bound for the QoS-critical application
 2. improving performance of non-QoS-critical applications

1863 1864

Effectiveness of MISE in Enforcing QoS

 Across 3000 data points:
 QoS Bound Met, Predicted Met: 78.8%
 QoS Bound Met, Predicted Not Met: 2.1%
 QoS Bound Not Met, Predicted Met: 2.2%
 QoS Bound Not Met, Predicted Not Met: 16.9%
 MISE-QoS meets the bound for 80.9% of workloads
 MISE-QoS correctly predicts whether or not the bound is met for 95.7% of workloads
 AlwaysPrioritize meets the bound for 83% of workloads

Performance of Non-QoS-Critical Applications

[Figure: system performance of non-QoS-critical applications under AlwaysPrioritize and MISE-QoS-10/1 through MISE-QoS-10/9]

 When the slowdown bound is 10/3, MISE-QoS improves system performance by 10%
 Higher performance when the bound is loose

1865 1866

Summary: Predictability in the Presence of Memory Bandwidth Interference

 Uncontrolled memory interference slows down applications unpredictably
 Goal: Estimate and control slowdowns
 Key contribution
 MISE: An accurate slowdown estimation model
 Average error of MISE: 8.2%
 Key Idea
 Request Service Rate is a proxy for performance
 Leverage slowdown estimates to control slowdowns; many more applications exist

Taking Into Account Shared Cache Interference

[Figure: 16 cores sharing a cache and main memory – interference occurs at the shared cache as well as at main memory]

1867 1868

Revisiting Request Service Rates / Estimating Cache and Memory Slowdowns Through Cache Access Rates

 The memory request service rate and the shared cache access rate are tightly coupled
 This motivates using the cache access rate as the performance proxy, capturing both shared cache and memory bandwidth interference
[Figure: cores access the shared cache at a cache access rate; cache misses are serviced by main memory at a request service rate]

1869 1870

The Application Slowdown Model

 Slowdown = Cache Access Rate Alone / Cache Access Rate Shared

Real System Studies: Cache Access Rate vs. Slowdown

[Figure: slowdown vs. cache access rate ratio on a real system for astar, lbm, and bzip2 – the two are strongly correlated]

1871 1872

Challenge

 How to estimate the alone cache access rate?
 Even with highest priority, an application's blocks may have been evicted by other applications' past accesses (contention misses)

Auxiliary Tag Store

 A per-core auxiliary tag store tracks such contention misses: blocks that miss in the shared cache but are still present in the auxiliary tag store would have hit had the application run alone

1873 1874

Revisiting Request Service Rate Alone

 Alone Request Service Rate of an Application =
 (# Requests During High Priority Epochs) / (# High Priority Cycles - # Interference Cycles)

Cache Access Rate Alone

 Alone Cache Access Rate of an Application =
 (# Requests During High Priority Epochs) / (# High Priority Cycles - # Interference Cycles - # Cache Contention Cycles)
 Cycles serving contention misses are not high priority cycles
 Cache Contention Cycles: cycles spent serving contention misses
 Cache Contention Cycles = # Contention Misses x Average Memory Service Time
 # Contention Misses measured from the auxiliary tag store; Average Memory Service Time measured when given high priority

1875 1876

Application Slowdown Model (ASM)

 Slowdown = Cache Access Rate Alone / Cache Access Rate Shared
[Figure: 16 cores sharing a cache and main memory; each core's cache access rate is the input to the model]

Previous Work on Slowdown Estimation

 STFM (Stall Time Fair Memory) Scheduling [Mutlu et al., MICRO '07]
 FST (Fairness via Source Throttling) [Ebrahimi et al., ASPLOS '10]
 Per-thread Cycle Accounting [Du Bois et al., HiPEAC '13]
 Basic Idea:
 Slowdown = Execution Time Alone / Execution Time Shared
 Execution Time Alone is difficult to estimate: count the number of cycles the application receives interference
 Execution Time Shared is easy to measure

1877 1878

Model Accuracy Results

[Figure: slowdown estimation error (%) of FST, PTCA, and ASM across selected SPEC CPU2006 and NPB applications]
 Average error of ASM's slowdown estimates: 10%

Leveraging Slowdown Estimates for Performance Optimization

 How do we leverage slowdown estimates from our model?
 To achieve high performance
 Slowdown-aware cache allocation
 Slowdown-aware bandwidth allocation
 To achieve performance predictability?

1879 1880

Cache Capacity Partitioning

 Goal: Partition the shared cache among applications to mitigate contention
[Figure: the cache is organized as sets and ways; ways can be allocated to different applications]
 Previous way partitioning schemes optimize for miss count
 Problem: Not aware of performance and slowdowns

1881 1882

ASM-Cache: Slowdown-aware Cache Way Partitioning

 Key Requirement: Slowdown estimates for all possible way partitions
 Extend ASM to estimate slowdown for all possible cache way allocations
 Key Idea: Allocate each way to the application whose slowdown reduces the most (a greedy sketch follows below)

Performance and Fairness Results

[Figure: fairness (lower is better) and performance of NoPart, UCP, and ASM-Cache for 4, 8, and 16 cores]
 Significant fairness benefits across different systems

1883 1884
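A minimal sketch of the greedy way-allocation idea above (my own simplification, not the paper's exact algorithm). slowdown_est(app, ways) stands in for ASM's estimate of an application's slowdown if given that many ways; the toy model at the bottom is an assumption for illustration.

```python
def asm_cache_partition(apps, total_ways, slowdown_est):
    alloc = {app: 0 for app in apps}
    for _ in range(total_ways):
        # Give the next way to the application whose estimated slowdown
        # drops the most when it receives one more way.
        def benefit(app):
            return slowdown_est(app, alloc[app]) - slowdown_est(app, alloc[app] + 1)
        winner = max(apps, key=benefit)
        alloc[winner] += 1
    return alloc

# Toy model: each application's slowdown shrinks with more ways, at different rates.
est = lambda app, w: {"A": 3.0, "B": 2.0}[app] / (1 + 0.5 * w) + 1
print(asm_cache_partition(["A", "B"], total_ways=8, slowdown_est=est))
```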

Memory Bandwidth Partitioning

 Goal: Partition the main memory bandwidth among applications to mitigate contention

ASM-Mem: Slowdown-aware Memory Bandwidth Partitioning

 Key Idea: Allocate high priority proportional to an application's slowdown
 High Priority Fraction_i = Slowdown_i / Σ_j Slowdown_j
 Application i's requests are given highest priority at the memory controller for its fraction of time

1885 1886
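A minimal sketch of the ASM-Mem allocation rule above: each application's share of high-priority time is proportional to its estimated slowdown. The slowdown values are made up for illustration.

```python
def high_priority_fractions(slowdowns: dict) -> dict:
    total = sum(slowdowns.values())
    return {app: s / total for app, s in slowdowns.items()}

print(high_priority_fractions({"mcf": 3.0, "gcc": 1.5, "lbm": 2.5, "astar": 1.0}))
# mcf, the most slowed-down application, receives the largest fraction of high-priority time.
```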

ASM-Mem: Fairness and Performance Results

[Figure: fairness (lower is better) and performance of FRFCFS, TCM, PARBS, and ASM-Mem for 4, 8, and 16 cores]
 Significant fairness benefits across different systems

Coordinated Resource Allocation Schemes

 Cache capacity-aware bandwidth allocation
 1. Employ ASM-Cache to partition cache capacity
 2. Drive ASM-Mem with slowdowns from ASM-Cache

1887 1888

Fairness and Performance Results

[Figure: fairness (lower is better) and performance of FRFCFS-NoPart, FRFCFS+UCP, and TCM+UCP on a 16-core system with 1 and 2 memory channels]
 Significant fairness benefits across different channel counts

Other Possible Applications

 VM migration and admission control schemes [VEE '15]
 Fair billing schemes in a commodity cloud
 Bounding application slowdowns

1889 1890

Summary: Predictability in the Presence of Shared Cache Interference

 Key Ideas:
 Cache access rate is a proxy for performance
 Auxiliary tag stores and high priority can be combined to estimate slowdowns
 Key Result: Slowdown estimation error of ~10%
 Some Applications:
 Slowdown-aware cache partitioning
 Slowdown-aware memory bandwidth partitioning
 Many more possible

Future Work: Coordinated Resource Management for Predictable Performance

 Goal: Cache capacity and memory bandwidth allocation for an application to meet a bound
 Challenges:
 Large search space of potential cache capacity and memory bandwidth allocations
 Multiple possible combinations of cache/memory allocations for each application

1891 1892

18-447 Computer Architecture
Lecture 31: Predictable Performance

Lavanya Subramanian
Carnegie Mellon University
Spring 2015, 4/15/2015

18-447 Computer Architecture
Lecture 32: Heterogeneous Systems

Prof. Onur Mutlu
Carnegie Mellon University
Spring 2015, 4/20/2015

Where We Are in Lecture Schedule

 The memory hierarchy
 Caches, caches, more caches
 Virtualizing the memory hierarchy: Virtual Memory
 Main memory: DRAM
 Main memory control, scheduling
 Memory latency tolerance techniques
 Non-volatile memory
 Multiprocessors
 Coherence and consistency
 In-memory computation and predictable performance
 Multi-core issues (e.g., heterogeneous multi-core)
 Interconnection networks

Lab 8: Multi-Core Cache Coherence

 Due May 3; last submission accepted on May 10, 11:59pm
 Cycle-level modeling of the MESI cache coherence protocol
 Since this is the last lab
 An automatic extension of 7 days granted for everyone
 No other late days accepted

1895 1896

We Have Another Course for Collaboration

 740 is the next course in sequence
 Tentative Time: Lect. MW 7:30-9:20pm, (Rect. T 7:30pm)
 Content:
 Lectures: More advanced, with a different perspective
 Recitations: Delving deeper into papers, advanced topics
 Readings: Many fundamental and research readings; will do many reviews
 Project: More open ended research project. Proposal → milestones → final poster and presentation
 Done in groups of 1-3
 Focus of the course is the project (and papers)
 Exams: lighter and fewer
 Homeworks: None

A Note on Testing Your Own Code

 We provide the reference simulator to aid you
 Do not expect it to be given, and do not rely on it much
 In real life, there are no reference simulators
 The architect designs the reference simulator
 The architect verifies it
 The architect tests it
 The architect fixes it
 The architect makes sure there are no bugs
 The architect ensures the simulator matches the specification

1897 1898

Where We Are in Lecture Schedule

 The memory hierarchy
 Caches, caches, more caches
 Virtualizing the memory hierarchy: Virtual Memory
 Main memory: DRAM
 Main memory control, scheduling
 Memory latency tolerance techniques
 Non-volatile memory
 Multiprocessors
 Coherence and consistency
 In-memory computation and predictable performance
 Multi-core issues (e.g., heterogeneous multi-core)
 Interconnection networks

Today

 Heterogeneity (asymmetry) in system design
 Evolution of multi-core systems
 Handling serial and parallel bottlenecks better
 Heterogeneous multi-core systems

1899 1900

Heterogeneity (Asymmetry)

 Heterogeneity (Asymmetry) → Specialization
 Heterogeneity and asymmetry have the same meaning
 Contrast with homogeneity and symmetry
 Heterogeneity is a very general system design concept (and life concept, as well)
 Idea: Instead of having multiple instances of the same "resource" be the same (i.e., homogeneous or symmetric), design some instances to be different (i.e., heterogeneous or asymmetric)
 Different instances can be optimized to be more efficient in executing different types of workloads or satisfying different requirements/goals
 Heterogeneity enables specialization/customization

1901 1902

Why Asymmetry in Design? (I)

 Different workloads executing in a system can have different behavior
 Different applications can have different behavior
 Different execution phases of an application can have different behavior
 The same application executing at different times can have different behavior (due to input set changes and dynamic events)
 E.g., locality, predictability of branches, instruction-level parallelism, data dependencies, serial fraction, bottlenecks in parallel portion, interference characteristics, …
 Systems are designed to satisfy different metrics at the same time
 There is almost never a single goal in design; it depends on the design point
 E.g., performance, energy efficiency, fairness, predictability, reliability, availability, cost, memory capacity, latency, bandwidth, …

Why Asymmetry in Design? (II)

 Problem: Symmetric design is one-size-fits-all
 It tries to fit a single-size design to all workloads and metrics
 It is very difficult to come up with a single design
 that satisfies all workloads even for a single metric
 that satisfies all design metrics at the same time
 This holds true for different system components, or resources
 Cores, caches, memory, controllers, interconnect, disks, servers, …
 Algorithms, policies, …

1903 1904

Asymmetry Enables Customization

[Figure: a symmetric chip with identical cores vs. an asymmetric chip with cores of different sizes]
 Symmetric: One size fits all
 Energy and performance suboptimal for different "workload" behaviors
 Asymmetric: Enables customization and adaptation
 Processing requirements vary across workloads (applications and phases)
 Execute code on best-fit resources (minimal energy, adequate perf.)

We Have Already Seen Examples Before (in 447)

 CRAY-1 design: scalar + vector pipelines
 Modern processors: scalar instructions + SIMD extensions
 Decoupled Access Execute: access + execute processors
 Thread Cluster Memory Scheduling: different memory scheduling policies for different thread clusters
 RAIDR: Heterogeneous refresh rates
 Hybrid memory systems
 DRAM + Phase Change Memory
 Fast, Costly DRAM + Slow, Cheap DRAM
 Reliable, Costly DRAM + Unreliable, Cheap DRAM
 …

1905 1906

An Example Asymmetric Design: CRAY-1

 CRAY-1
 Russell, "The CRAY-1 computer system," CACM 1978.
 Scalar and vector modes
 8 64-element vector registers
 64 bits per element
 16 memory banks
 8 64-bit scalar registers
 8 24-bit address registers

Remember: Hybrid Memory Systems

 DRAM: fast, durable; small, leaky, volatile, high-cost
 Phase Change Memory (or Tech. X): large, non-volatile, low-cost; slow, wears out, high active energy
 Hardware/software manage data allocation and movement to achieve the best of multiple technologies
 Meza+, "Enabling Efficient and Scalable Hybrid Memories," IEEE Comp. Arch. Letters, 2012.
 Yoon, Meza et al., "Row Buffer Locality Aware Caching Policies for Hybrid Memories," ICCD 2012 Best Paper Award.

1907

Remember: Throughput vs. Fairness

 Throughput biased approach: prioritize less memory-intensive threads
 Good for throughput, but the memory-intensive thread is never prioritized → starvation → unfairness
 Fairness biased approach: take turns accessing memory
 Does not starve any thread, but memory-non-intensive threads suffer → reduced throughput
 Single policy for all threads is insufficient

Remember: Achieving the Best of Both Worlds

 For Throughput: prioritize memory-non-intensive threads
 For Fairness: unfairness is caused by memory-intensive threads being prioritized over each other
 Shuffle thread ranking
 Memory-intensive threads have different vulnerability to interference → shuffle asymmetrically

1909 1910

Remember: Heterogeneous Retention Times in DRAM

[Figure: DRAM rows have heterogeneous retention times, so different rows can tolerate different refresh rates (RAIDR)]

Aside: Examples from Life

 Heterogeneity is abundant in life
 both in nature and human-made components
 Humans are heterogeneous
 Cells are heterogeneous → specialized for different tasks
 Organs are heterogeneous
 Cars are heterogeneous
 Buildings are heterogeneous
 Rooms are heterogeneous
 …

1911 1912

General-Purpose vs. Special-Purpose

 Asymmetry is a way of enabling specialization
 It bridges the gap between purely general purpose and purely special purpose
 Purely general purpose: Single design for every workload or metric
 Purely special purpose: Single design per workload or metric
 Asymmetric: Multiple sub-designs optimized for sets of workloads/metrics and glued together
 The goal of a good asymmetric design is to get the best of both general purpose and special purpose

Asymmetry Advantages and Disadvantages

 Advantages over Symmetric Design
+ Can enable optimization of multiple metrics
+ Can enable better adaptation to workload behavior
+ Can provide special-purpose benefits with general-purpose usability/flexibility
 Disadvantages over Symmetric Design
- Higher overhead and more complexity in design, verification
- Higher overhead in management: scheduling onto asymmetric components
- Overhead in switching between multiple components can lead to degradation

1913 1914

Yet Another Example

 Modern processors integrate general purpose cores and GPUs
 CPU-GPU systems
 Heterogeneity in execution models

Three Key Problems in Future Systems

 Memory system
 Applications are increasingly data intensive
 Data storage and movement limit performance & efficiency
 Efficiency (performance and energy) → scalability
 Enables scalable systems → new applications
 Enables better user experience → new usage models
 Predictability and robustness
 Resource sharing and unreliable hardware cause QoS issues
 Predictable performance and QoS are first class constraints

1915 1916

Multi-Core Design

Many Cores on Chip

 Simpler and lower power than a single large core
 Large scale parallelism on chip
 Examples:
 AMD Barcelona (4 cores), Intel Core i7 (8 cores), IBM Cell BE (8+1 cores), IBM POWER7 (8 cores)
 Sun Niagara II (8 cores), Nvidia Fermi (448 "cores"), Intel SCC (48 cores, networked), Tilera TILE Gx (100 cores, networked)

1917 1918

With Many Cores on Chip

 What we want:
 N times the performance with N times the cores when we parallelize an application on N cores
 What we get:
 Amdahl's Law (serial bottleneck)
 Bottlenecks in the parallel portion

Caveats of Parallelism

 Amdahl's Law
 f: Parallelizable fraction of a program
 N: Number of processors
 Speedup = 1 / ((1 - f) + f/N)
 Amdahl, "Validity of the single processor approach to achieving large scale computing capabilities," AFIPS 1967.
 Maximum speedup limited by serial portion: Serial bottleneck
 Parallel portion is usually not perfectly parallel
 Synchronization overhead (e.g., updates to shared data)
 Load imbalance overhead (imperfect parallelization)
 Resource sharing overhead (contention among N processors)
 (a numeric sketch of Amdahl's Law follows below)

1919 1920
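A minimal numeric sketch of Amdahl's Law with the notation on the slide; the parallel fraction 0.95 is an arbitrary example value.

```python
def amdahl_speedup(f: float, n: int) -> float:
    """Speedup = 1 / ((1 - f) + f/N)."""
    return 1.0 / ((1.0 - f) + f / n)

# Even a small serial fraction caps the speedup: with f = 0.95,
# 16 cores give ~9.1x and infinitely many cores only approach 20x.
for n in (4, 16, 64, 1024):
    print(n, round(amdahl_speedup(0.95, n), 2))
```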

The Problem: Serialized Code Sections

 Many parallel programs cannot be parallelized completely
 Causes of serialized code sections
 Sequential portions (Amdahl's "serial part")
 Critical sections
 Barriers
 Limiter stages in pipelined programs
 Serialized code sections
 Reduce performance
 Limit scalability
 Waste energy

Example from MySQL

 Critical section: access the Open Tables Cache (open database tables)
 Parallel part: perform the operations …
[Figure: speedup of MySQL vs. chip area (cores) – today's design saturates beyond ~16 cores, while an asymmetric design keeps scaling]

1921 1922

Demands in Different Code Sections

 What we want:
 In a serialized code section → one powerful "large" core
 In a parallel code section → many wimpy "small" cores
 These two conflict with each other:
 If you have a single powerful core, you cannot have many cores
 A small core is much more energy and area efficient than a large core

"Large" vs. "Small" Cores

 Large Core
 Out-of-order
 Wide fetch, e.g. 4-wide
 Deeper pipeline
 Aggressive branch predictor (e.g. hybrid)
 Multiple functional units
 Trace cache
 Memory dependence speculation
 Small Core
 In-order
 Narrow fetch, e.g. 2-wide
 Shallow pipeline
 Simple branch predictor (e.g. Gshare)
 Few functional units

 Large cores are power inefficient: e.g., 2x performance for 4x area (power)

1923 1924

Large vs. Small Cores

 Grochowski et al., "Best of both Latency and Throughput," ICCD 2004.

Meet Large: IBM POWER4

 Tendler et al., "POWER4 system microarchitecture," IBM J R&D, 2002.
 A symmetric multi-core chip…
 Two powerful cores

1925 1926

IBM POWER4

 2 cores, out-of-order execution
 100-entry instruction window in each core
 8-wide instruction fetch, issue, execute
 Large, local+global hybrid branch predictor
 1.5MB, 8-way L2 cache
 Aggressive stream based prefetching

IBM POWER5

 Kalla et al., "IBM Power5 Chip: A Dual-Core Multithreaded Processor," IEEE Micro 2004.

1927 1928

Meet Small: Sun Niagara (UltraSPARC T1)

 Kongetira et al., "Niagara: A 32-Way Multithreaded SPARC Processor," IEEE Micro 2005.

Niagara Core

 4-way fine-grain multithreaded, 6-stage, dual-issue in-order
 Round robin thread selection (unless cache miss)
 Shared FP unit among cores

1929 1930

Remember the Demands

 What we want:
 In a serialized code section → one powerful "large" core
 In a parallel code section → many wimpy "small" cores
 These two conflict with each other:
 If you have a single powerful core, you cannot have many cores
 A small core is much more energy and area efficient than a large core
 Can we get the best of both worlds?

Performance vs. Parallelism

 Assumptions:
 1. A small core takes an area budget of 1 and has performance of 1
 2. A large core takes an area budget of 4 and has performance of 2

1931 1932

Tile-Large Approach

 Tile a few large cores
 IBM Power 5, AMD Barcelona, Intel Core2Quad, Intel Nehalem
+ High performance on single thread, serial code sections (2 units)
- Low throughput on parallel program portions (8 units)

Tile-Small Approach

 Tile many small cores
 Sun Niagara, Intel Larrabee, Tilera TILE (tile ultra-small)
+ High throughput on the parallel part (16 units)
- Low performance on the serial part, single thread (1 unit)

1933 1934

Can we get the best of both worlds?

 Tile Large
+ High performance on single thread, serial code sections (2 units)
- Low throughput on parallel program portions (8 units)
 Tile Small
+ High throughput on the parallel part (16 units)
- Low performance on the serial part, single thread (1 unit); reduced single-thread performance compared to existing single thread processors
 Idea: Have both large and small on the same chip → Performance asymmetry

Asymmetric Multi-Core

1935 1936

Asymmetric Chip Multiprocessor (ACMP)

[Figure: "Tile-Large" (4 large cores), "Tile-Small" (16 small cores), and the ACMP (1 large core + 12 small cores) within the same area budget]
 Provide one large core and many small cores
+ Accelerate serial part using the large core (2 units)
+ Execute parallel part on small cores and large core for high throughput (12+2 units)

Accelerating Serial Bottlenecks

 Single thread → run on the large core
[Figure: in the ACMP approach, the serial portion executes on the large core while the small cores are idle]

1937 1938

Performance vs. Parallelism

 Assumptions:
 1. A small core takes an area budget of 1 and has performance of 1
 2. A large core takes an area budget of 4 and has performance of 2

ACMP Performance vs. Parallelism

 Area budget = 16 small cores

                      "Tile-Large"   "Tile-Small"   ACMP
 Large Cores          4              0              1
 Small Cores          0              16             12
 Serial Performance   2              1              2
 Parallel Throughput  2 x 4 = 8      1 x 16 = 16    1x2 + 1x12 = 14

1939 1940

Amdahl's Law Modified

 Simplified Amdahl's Law for an Asymmetric Multiprocessor
 Assumptions:
 Serial portion executed on the large core
 Parallel portion executed on both small cores and large cores
 f: Parallelizable fraction of a program
 L: Number of large processors
 S: Number of small processors
 X: Speedup of a large processor over a small one

 Speedup = 1 / ( (1 - f)/X + f/(S + X*L) )
 (a numeric sketch follows below)

Caveats of Parallelism, Revisited

 Amdahl's Law
 f: Parallelizable fraction of a program
 N: Number of processors
 Speedup = 1 / ((1 - f) + f/N)
 Amdahl, "Validity of the single processor approach to achieving large scale computing capabilities," AFIPS 1967.
 Maximum speedup limited by serial portion: Serial bottleneck
 Parallel portion is usually not perfectly parallel
 Synchronization overhead (e.g., updates to shared data)
 Load imbalance overhead (imperfect parallelization)
 Resource sharing overhead (contention among N processors)

1941 1942
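A minimal numeric sketch comparing the three designs from the earlier table using the formulas above; the parallel fraction f = 0.9 is an assumed example value (area budget of 16 small-core equivalents, large core = 4 area units and 2x performance).

```python
def acmp_speedup(f: float, s: int, l: int, x: float) -> float:
    """Speedup = 1 / ((1-f)/X + f/(S + X*L)); serial part runs on a large core."""
    return 1.0 / ((1.0 - f) / x + f / (s + x * l))

def symmetric_speedup(f: float, n: int) -> float:
    """Classic Amdahl's Law; serial part runs on one of the identical cores."""
    return 1.0 / ((1.0 - f) + f / n)

f = 0.9
print(round(acmp_speedup(f, s=12, l=1, x=2), 2))  # ACMP: 1 large + 12 small -> ~8.75
print(round(acmp_speedup(f, s=0,  l=4, x=2), 2))  # Tile-Large: 4 large cores -> ~6.15
print(round(symmetric_speedup(f, n=16), 2))       # Tile-Small: 16 small cores -> ~6.4
```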

Accelerating Parallel Bottlenecks

 Serialized or imbalanced execution in the parallel portion can also benefit from a large core
 Examples:
 Critical sections that are contended
 Parallel stages that take longer than others to execute
 Idea: Dynamically identify these code portions that cause serialization and execute them on a large core

Accelerated Critical Sections

 M. Aater Suleman, Onur Mutlu, Moinuddin K. Qureshi, and Yale N. Patt,
 "Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures"
 Proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2009

1943 1944

Contention for Critical Sections

[Figure: execution timeline of 12 iterations with 33% of instructions inside the critical section, for P = 1, 2, 3, 4 threads – as P grows, threads increasingly sit idle waiting for the contended critical section]
 Accelerating the critical section (e.g., executing it 2x faster) increases performance and scalability

1945 1946

Impact of Critical Sections on Scalability

 Contention for critical sections leads to serial execution (serialization) of threads in the parallel program portion
 Contention for critical sections increases with the number of threads and limits scalability
[Figure: speedup of MySQL (oltp-1) vs. chip area (cores) – today's design saturates, while an asymmetric design keeps scaling]

A Case for Asymmetry

 Execution time of sequential kernels, critical sections, and limiter stages must be short
 It is difficult for the programmer to shorten these serialized sections
 Insufficient domain-specific knowledge
 Variation in hardware platforms
 Limited resources
 Goal: A mechanism to shorten serial bottlenecks without requiring programmer effort
 Idea: Accelerate serialized code sections by shipping them to powerful cores in an asymmetric multi-core (ACMP)

1947 1948

An Example: Accelerated Critical Sections

 Idea: HW/SW ships critical sections to a large, powerful core in an asymmetric multi-core architecture
 Benefit:
 Reduces serialization due to contended locks
 Reduces the performance impact of hard-to-parallelize sections
 Programmer does not need to (heavily) optimize parallel code → fewer bugs, improved productivity
 Suleman et al., "Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures," ASPLOS 2009, IEEE Micro Top Picks 2010.
 Suleman et al., "Data Marshaling for Multi-Core Architectures," ISCA 2010, IEEE Micro Top Picks 2011.

Accelerated Critical Sections

 1. P2 encounters a critical section (CSCALL), e.g., EnterCS(); PriorityQ.insert(…); LeaveCS()
 2. P2 sends a CSCALL request to the Critical Section Request Buffer (CSRB)
 3. P1 (the core executing critical sections) executes the critical section
 4. P1 sends a CSDONE signal
[Figure: small cores P2, P3, P4 and the large core P1 connected by the on-chip interconnect; the CSRB at the large core buffers incoming CSCALL requests]

1949 1950

Accelerated Critical Sections (ACS)

[Figure: code transformation – the small core pushes the critical section's input (A) onto the stack and issues CSCALL X with the target PC; the large core acquires lock X, executes result = CS(A), releases X, and returns via CSDONE; the small core pops the result and continues. See the software sketch below.]
 Suleman et al., "Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures," ASPLOS 2009.

False Serialization

 ACS can serialize independent critical sections
 Selective Acceleration of Critical Sections (SEL)
 Saturating counters to track false serialization
[Figure: CSCALLs for independent critical sections A and B from the small cores queue up in the same Critical Section Request Buffer (CSRB) at the large core]

1951 1952
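A software analogy of the ACS idea (a hedged sketch, not the hardware CSCALL/CSDONE mechanism): worker "small cores" ship critical-section work to a single dedicated "large core" thread instead of acquiring the lock themselves, so shared data stays with one executor.

```python
import threading, queue

cs_requests = queue.Queue()          # plays the role of the CSRB
shared_counter = 0

def large_core_server():
    global shared_counter
    while True:
        work, done = cs_requests.get()
        if work is None:                            # shutdown signal
            break
        shared_counter = work(shared_counter)       # critical section body
        done.set()                                  # plays the role of CSDONE

def small_core_worker(n_iters):
    for _ in range(n_iters):
        done = threading.Event()
        cs_requests.put((lambda c: c + 1, done))    # ship the critical section (CSCALL)
        done.wait()                                 # wait for CSDONE, then continue

server = threading.Thread(target=large_core_server)
server.start()
workers = [threading.Thread(target=small_core_worker, args=(100,)) for _ in range(4)]
for w in workers: w.start()
for w in workers: w.join()
cs_requests.put((None, None))
server.join()
assert shared_counter == 400
```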

ACS Performance Tradeoffs

 Pluses
+ Faster critical section execution
+ Shared locks stay in one place: better lock locality
+ Shared data stays in the large core's (large) caches: better shared data locality, less ping-ponging
 Minuses
- Large core dedicated for critical sections: reduced parallel throughput
- CSCALL and CSDONE control transfer overhead
- Thread-private data needs to be transferred to the large core: worse private data locality

ACS Performance Tradeoffs

 Fewer parallel threads vs. accelerated critical sections
 Accelerating critical sections offsets loss in throughput
 As the number of cores (threads) on chip increases:
 Fractional loss in parallel performance decreases
 Increased contention for critical sections makes acceleration more beneficial
 Overhead of CSCALL/CSDONE vs. better lock locality
 ACS avoids "ping-ponging" of locks among caches by keeping them at the large core
 More cache misses for private data vs. fewer misses for shared data

1953 1954

Cache Misses for Private Data

 Example: PriorityHeap.insert(NewSubProblems) in the Puzzle benchmark
 Private Data: NewSubProblems
 Shared Data: The priority heap
 This problem can be solved

ACS Performance Tradeoffs

 Fewer parallel threads vs. accelerated critical sections
 Accelerating critical sections offsets loss in throughput
 As the number of cores (threads) on chip increases:
 Fractional loss in parallel performance decreases
 Increased contention for critical sections makes acceleration more beneficial
 Overhead of CSCALL/CSDONE vs. better lock locality
 ACS avoids "ping-ponging" of locks among caches by keeping them at the large core
 More cache misses for private data vs. fewer misses for shared data
 Cache misses reduce if shared data > private data

1955 1956

ACS Comparison Points

 SCMP: all small cores; conventional locking
 ACMP: 1 large core + small cores; conventional locking; large core executes Amdahl's serial part
 ACS: 1 large core + small cores; large core executes Amdahl's serial part and critical sections

Accelerated Critical Sections: Methodology

 Workloads: 12 critical section intensive applications
 Data mining kernels, sorting, database, web, networking
 Multi-core x86 simulator
 1 large and 28 small cores
 Aggressive stream prefetcher employed at each core
 Details:
 Large core: 2GHz, out-of-order, 128-entry ROB, 4-wide, 12-stage
 Small core: 2GHz, in-order, 2-wide, 5-stage
 Private 32 KB L1, private 256KB L2, 8MB shared L3
 On-chip interconnect: Bi-directional ring, 5-cycle hop latency

1957 1958

ACS Performance

 Chip Area = 32 small cores; SCMP = 32 small cores; ACMP = 1 large and 28 small cores
 Number of threads = best number of threads for each configuration
[Figure: speedup of ACS over SCMP across the 12 workloads, separating the benefit of accelerating sequential kernels from the benefit of accelerating critical sections]

Equal-Area Comparisons

 Number of threads = number of cores
[Figure: speedup over one small core vs. chip area (small cores) for SCMP, ACMP, and ACS on ep, is, pagemine, puzzle, qsort, tsp (coarse-grain locks) and sqlite, iplookup, oltp-1, oltp-2, specjbb, webcache (fine-grain locks)]

1959 1960

ACS Summary

 Critical sections reduce performance and limit scalability
 Accelerate critical sections by executing them on a powerful core
 ACS reduces average execution time by:
 34% compared to an equal-area SCMP
 23% compared to an equal-area ACMP
 ACS improves scalability of 7 of the 12 workloads
 Generalizing the idea: Accelerate all bottlenecks ("critical paths") by executing them on a powerful core

1961

18-447 Computer Architecture
Lecture 33: Interconnection Networks

Prof. Onur Mutlu
Carnegie Mellon University
Spring 2015, 4/27/2015

Extra Credit Lab 8: Multi-Core Cache Coherence

 Completely extra credit (all get 5% for free; can get 5% more)
 Last submission accepted on May 10, 11:59pm; no late submissions
 Cycle-level modeling of the MESI cache coherence protocol

More on 740

 740 is the next course in sequence
 Time: Lect. MW 7:30-9:20pm, Rect. T 7:30-9:20pm
 Content:
 Lectures: More advanced, with a different perspective
 Recitations: Delving deeper into papers, advanced topics
 Readings: Many fundamental and research readings; will do many reviews
 Project: More open ended research project. Proposal → milestones → final poster and presentation
 Done in groups of 1-3
 Focus of the course is the project and critical reviews of readings
 Exams: lighter and fewer
 Homeworks: None

1963 1964

Where We Are in Lecture Schedule

 The memory hierarchy
 Caches, caches, more caches
 Virtualizing the memory hierarchy: Virtual Memory
 Main memory: DRAM
 Main memory control, scheduling
 Memory latency tolerance techniques
 Non-volatile memory
 Multiprocessors
 Coherence and consistency
 In-memory computation and predictable performance
 Multi-core issues (e.g., heterogeneous multi-core)
 Interconnection networks

Interconnection Network Basics

1965 1966

Where Is Interconnect Used?

 To connect components
 Many examples
 Processors and processors
 Processors and memories (banks)
 Processors and caches (banks)
 Caches and caches
 I/O devices

Why Is It Important?

 Affects the scalability of the system
 How large of a system can you build?
 How easily can you add more processors?
 Affects performance and energy efficiency
 How fast can processors, caches, and memory communicate?
 How long are the latencies to memory?
 How much energy is spent on communication?

1967 1968

Interconnection Network Basics

 Topology
 Specifies the way switches are wired
 Affects routing, reliability, throughput, latency, building ease
 Routing (algorithm)
 How does a message get from source to destination
 Static or adaptive
 Buffering and Flow Control
 What do we store within the network?
 Entire packets, parts of packets, etc?
 How do we throttle during oversubscription?
 Tightly coupled with routing strategy

Topology

 Bus (simplest)
 Point-to-point connections (ideal and most costly)
 Crossbar (less costly)
 Ring
 Tree
 Omega
 Hypercube
 Mesh
 Torus
 Butterfly
 …

1969 1970

Metrics to Evaluate Interconnect Topology

 Cost
 Latency (in hops, in nanoseconds)
 Contention
 Many others exist you should think about
 Energy
 Bandwidth
 Overall system performance

Bus

 All nodes connected to a single link
+ Simple
+ Cost effective for a small number of nodes
+ Easy to implement coherence (snooping and serialization)
- Not scalable to large number of nodes (limited bandwidth, electrical loading → reduced frequency)
- High contention → fast saturation
[Figure: processors with caches and memories all attached to one shared bus]

1971 1972

Point-to-Point

 Every node connected to every other with direct/isolated links
+ Lowest contention
+ Potentially lowest latency
+ Ideal, if cost is no issue
-- Highest cost: O(N) connections/ports per node, O(N^2) links
-- Not scalable
-- How to lay out on chip?

Crossbar

 Every node connected to every other with a shared link for each destination
 Enables concurrent transfers to non-conflicting destinations
 Could be cost-effective for a small number of nodes
+ Low latency and high throughput
- Expensive
- Not scalable → O(N^2) cost
- Difficult to arbitrate as N increases
 Used in core-to-cache-bank networks in
 IBM POWER5
 Sun Niagara I/II

1973 1974

Another Crossbar Design

[Figure: an alternative 8x8 crossbar organization]

Sun UltraSPARC T2 Core-to-Cache Crossbar

 High bandwidth interface between 8 cores and 8 L2 banks & NCU
 4-stage pipeline: request, arbitration, selection, transmission
 2-deep queue for each src/dest pair to hold data transfer requests

1975 1976

Bufferless and Buffered Crossbars

 Buffered crossbar (vs. bufferless):
+ Simpler arbitration/scheduling
+ Efficient support for variable-size packets
- Requires N^2 buffers
[Figure: network interfaces with flow control feeding a crossbar; the buffered design has a buffer at each crosspoint, the bufferless design relies only on the output arbiters]

Can We Get Lower Cost than A Crossbar?

 Yet still have low contention compared to a bus?
 Idea: Multistage networks

1977 1978

Multistage Logarithmic Networks

 Idea: Indirect networks with multiple layers of switches between terminals/nodes
 Cost: O(NlogN), Latency: O(logN)
 Many variations (Omega, Butterfly, Benes, Banyan, …)
 Omega Network:
[Figure: an 8-node Omega network built from 2-by-2 crossbars; some source-destination pairs conflict on internal links]

Multistage Networks (Circuit Switched)

 A multistage network has more restrictions on feasible concurrent Tx-Rx pairs vs. a crossbar
 But more scalable than a crossbar in cost, e.g., O(N logN) for Butterfly
[Figure: an 8-node circuit-switched multistage network built from 2-by-2 crossbars]

1979 1980

Multistage Networks (Packet Switched)

[Figure: an 8-node packet-switched multistage network built from 2-by-2 routers]
 Packets "hop" from router to router, pending availability of the next-required switch and buffer

Aside: Circuit vs. Packet Switching

 Circuit switching sets up the full path before transmission
 Establish route, then send data
 No one else can use those links while the "circuit" is set
+ faster arbitration
-- setting up and bringing down the "path" takes time
 Packet switching routes per packet in each router
 Route each packet individually (possibly via different paths)
 If a link is free, any packet can use it
-- potentially slower --- must dynamically switch
+ no setup/bring-down time
+ more flexible, does not underutilize links

1981 1982

Switching vs. Topology

 Circuit/packet switching choice is independent of topology
 It is a higher-level protocol on how a message gets sent to a destination
 However, some topologies are more amenable to circuit vs. packet switching

Another Example: Delta Network

 Single path from source to destination
 Each stage has different routers
 Proposed to replace costly crossbars as processor-memory interconnect
 Janak H. Patel, "Processor-Memory Interconnections for Multiprocessors," ISCA 1979.
[Figure: 8x8 Delta network]

1983 1984

Another Example: Omega Network

 Single path from source to destination
 All stages are the same
 Used in the NYU Ultracomputer
 Gottlieb et al., "The NYU Ultracomputer - Designing an MIMD Shared Memory Parallel Computer," IEEE Trans. On Comp., 1983.

Ring

 Each node connected to exactly two other nodes. Nodes form a continuous pathway such that packets can reach any node.
+ Cheap: O(N) cost
- High latency: O(N)
- Not easy to scale
- Bisection bandwidth remains constant
 Used in Intel Haswell, Intel Larrabee, IBM Cell, and many commercial systems today
[Figure: processor, memory, and cache nodes attached to ring stops S around a ring]

1985 1986

Unidirectional Ring

[Figure: N nodes 0 … N-1 connected by 2x2 routers R in a single-direction ring]
 Single directional pathway
 Simple topology and implementation
 Reasonable performance if N and the performance needs (bandwidth & latency) are still moderately low
 O(N) cost
 N/2 average hops; latency depends on utilization

Bidirectional Rings

 Multi-directional pathways, or multiple rings
+ Reduces latency
+ Improves scalability
- Slightly more complex injection policy (need to select which ring to inject a packet into)

1987 1988

Hierarchical Rings

[Figure: local rings connected through a global ring via bridge routers]
+ More scalable
+ Lower latency
- More complex

More on Hierarchical Rings

 Rachata+, "Design and Evaluation of Hierarchical Rings with Deflection Routing," SBAC-PAD 2014.
 http://users.ece.cmu.edu/~omutlu/pub/hierarchical-rings-with-deflection_sbacpad14.pdf
 Discusses the design and implementation of a mostly-bufferless hierarchical ring

1989 1990

Mesh

 Each node connected to 4 neighbors (N, E, S, W)
 O(N) cost
 Average latency: O(sqrt(N))
 Easy to lay out on-chip: regular and equal-length links
 Path diversity: many ways to get from one node to another
 Used in Tilera 100-core
 And many on-chip network prototypes

Torus

 Mesh is not symmetric on edges: performance very sensitive to placement of task on edge vs. middle
 Torus avoids this problem
+ Higher path diversity (and bisection bandwidth) than mesh
- Higher cost
- Harder to lay out on-chip
- Unequal link lengths

1991 1992

Torus, continued

 Weave nodes to make inter-node latencies ~constant
[Figure: folded torus layout with nodes interleaved so that all links have similar length]

Trees

 Planar, hierarchical topology
 Latency: O(logN)
 Good for local traffic
+ Cheap: O(N) cost
+ Easy to layout
- Root can become a bottleneck
 Fat trees avoid this problem (CM-5)
[Figure: H-Tree and Fat Tree layouts]

1993 1994

CM-5 Fat Tree

 Fat tree based on 4x2 switches
 Randomized routing on the way up
 Combining, multicast, reduction operators supported in hardware
 Thinking Machines Corp., "The Connection Machine CM-5 Technical Summary," Jan. 1992.

Hypercube

 "N-dimensional cube" or "N-cube"
 Latency: O(logN)
 Radix: O(logN)
 #links: O(NlogN)
+ Low latency
- Hard to lay out in 2D/3D
[Figure: 0-D through 4-D hypercubes; connected node labels differ in exactly one bit]

1995 1996

Caltech Cosmic Cube

 64-node message passing machine
 Seitz, "The Cosmic Cube," CACM 1985.

Interconnection Network Basics

 Topology
 Specifies the way switches are wired
 Affects routing, reliability, throughput, latency, building ease
 Routing (algorithm)
 How does a message get from source to destination
 Static or adaptive
 Buffering and Flow Control
 What do we store within the network?
 Entire packets, parts of packets, etc?
 How do we throttle during oversubscription?
 Tightly coupled with routing strategy

1997 1998

Handling Contention

 Two packets trying to use the same link at the same time
 What do you do?
 Buffer one
 Drop one
 Misroute one (deflection)
 Tradeoffs?

Bufferless Deflection Routing

 Key idea: Packets are never buffered in the network. When two packets contend for the same link, one is deflected.¹
 New traffic can be injected whenever there is a free output link.
 ¹Baran, "On Distributed Communication Networks." RAND Tech. Report, 1962 / IEEE Trans. Comm., 1964.

1999 2000

Bufferless Deflection Routing

 Input buffers are eliminated: packets are buffered in pipeline latches and on network links
[Figure: a router with North/South/East/West/Local ports; the input buffers are removed and replaced by deflection routing logic]

Routing Algorithm

 Three Types
 Deterministic: always chooses the same path for a communicating source-destination pair
 Oblivious: chooses different paths, without considering network state
 Adaptive: can choose different paths, adapting to the state of the network
 How to adapt
 Local/global feedback
 Minimal or non-minimal paths

2001 2002

Deterministic Routing

 All packets between the same (source, dest) pair take the same path
 Dimension-order routing
 First traverse dimension X, then traverse dimension Y
 E.g., XY routing (used in Cray T3D, and many on-chip networks); a sketch follows below
+ Simple
+ Deadlock freedom (no cycles in resource allocation)
- Could lead to high contention
- Does not exploit path diversity

Deadlock

 No forward progress
 Caused by circular dependencies on resources
 Each packet waits for a buffer occupied by another packet downstream

2003 2004
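A minimal sketch of XY dimension-order routing on a 2D mesh: route all the way in X first, then in Y. The function returns the output port to take at the current router; the port names are illustrative.

```python
def xy_route(cur, dst):
    (cx, cy), (dx, dy) = cur, dst
    if cx != dx:
        return "EAST" if dx > cx else "WEST"
    if cy != dy:
        return "NORTH" if dy > cy else "SOUTH"
    return "LOCAL"   # arrived: eject to the local node

# Walk a packet from (0, 0) to (2, 3); X hops come strictly before Y hops.
cur, dst = (0, 0), (2, 3)
path = []
step = {"EAST": (1, 0), "WEST": (-1, 0), "NORTH": (0, 1), "SOUTH": (0, -1)}
while (port := xy_route(cur, dst)) != "LOCAL":
    path.append(port)
    cur = (cur[0] + step[port][0], cur[1] + step[port][1])
print(path)   # ['EAST', 'EAST', 'NORTH', 'NORTH', 'NORTH']
```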

Handling Deadlock

 Avoid cycles in routing
 Dimension order routing
 Cannot build a circular dependency
 Restrict the "turns" each packet can take
 Avoid deadlock by adding more buffering (escape paths)
 Detect and break deadlock
 Preemption of buffers

Turn Model to Avoid Deadlock

 Idea
 Analyze directions in which packets can turn in the network
 Determine the cycles that such turns can form
 Prohibit just enough turns to break possible cycles
 Glass and Ni, "The Turn Model for Adaptive Routing," ISCA 1992.

2005 2006

Oblivious Routing: Valiant's Algorithm

 An example of an oblivious algorithm
 Goal: Balance network load
 Idea: Randomly choose an intermediate destination, route to it first, then route from there to the destination
 Between source-intermediate and intermediate-dest, can use dimension order routing
+ Randomizes/balances network load
- Non-minimal (packet latency can increase)
 Optimizations:
 Do this only under high load
 Restrict the intermediate node to be close (in the same quadrant)
 (a sketch follows below)

Adaptive Routing

 Minimal adaptive
 Router uses network state (e.g., downstream buffer occupancy) to pick which "productive" output port to send a packet to
 Productive output port: port that gets the packet closer to its destination
+ Aware of local congestion
- Minimality restricts achievable link utilization (load balance)
 Non-minimal (fully) adaptive
 "Misroute" packets to non-productive output ports based on network state
+ Can achieve better network utilization and load balance
- Need to guarantee livelock freedom

2007 2008
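A minimal sketch of Valiant's algorithm on a 2D mesh, reusing the XY-routing idea above: pick a random intermediate router, route to it with dimension-order routing, then route from it to the real destination. The mesh size is an assumed parameter.

```python
import random

def valiant_path(src, dst, mesh_dim=8):
    intermediate = (random.randrange(mesh_dim), random.randrange(mesh_dim))
    return [("phase 1", src, intermediate), ("phase 2", intermediate, dst)]

# Each phase is itself routed deterministically (e.g., XY), so the only
# randomness is the choice of intermediate node, which spreads the load.
print(valiant_path((0, 0), (7, 7)))
```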

On-Chip Networks

 Connect cores, caches, memory controllers, etc
 Buses and crossbars are not scalable
 Packet switched
 2D mesh: Most commonly used topology
 Primarily serve cache misses and memory requests
[Figure: processing elements PE (cores, L2 banks, memory controllers, etc) each attached to a router R in a 2D mesh]

2009
