Lecture 11. Memory Hierarchy
Muhammad Abid
DCIS, PIEAS
Memory Subsystem
The performance of a computer system depends not only on the microprocessor but also on its memory subsystem. Why?
instructions and data are stored in memory.
A good architect tries to improve the performance of both the microprocessor and the memory subsystem.
Why Memory Hierarchy?
We want both fast and large
[Diagram: CPU with register file (RF) → Cache → Main Memory (DRAM) → Hard Disk]
Why Memory Hierarchy?
We want both fast and large
Solution:
Have multiple levels of storage (progressively bigger
and slower as the levels are farther from the
processor)
Exploit spatial and temporal locality in programs to ensure most of the data the processor needs is kept in the faster levels
The Memory Hierarchy
Caching Basics: Exploit Temporal Locality
Idea: Store recently accessed data in automatically
managed fast memory (called cache)
Anticipation: the data will be accessed again soon
Caching Basics: Exploit Spatial Locality
Idea: Store addresses adjacent to the recently
accessed one in automatically managed fast
memory
Logically divide memory into equal-size blocks
Fetch to cache the accessed block in its entirety
Anticipation: nearby data will be accessed soon
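As an illustration (not from the original slides), a minimal C sketch of a stride-1 array walk shows why fetching whole blocks pays off: after the miss that brings a block in, the next several elements hit in the cache.

#include <stdint.h>

#define N 1024   /* hypothetical array size */

/* Sequential (stride-1) traversal: consecutive elements share cache
   blocks, so one miss per block services several later accesses. */
uint32_t sum_sequential(const uint32_t a[N]) {
    uint32_t sum = 0;
    for (int i = 0; i < N; i++)
        sum += a[i];   /* adjacent addresses: spatial locality */
    return sum;
}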
A Note on Manual vs. Automatic Management
Manual: Programmer manages data movement across levels
-- too painful for programmers on substantial programs
still done in some embedded processors (on-chip scratchpad memory)
Automatic: Hardware manages data movement across levels, transparently to the programmer
You don’t need to know how big the cache is and how it works to write a “correct” program! (What if you want a “fast” program?)
A Modern Memory Hierarchy
Register File: 32 words, sub-nsec; manual/compiler management (register spilling)
L1 cache: ~32 KB, ~nsec
L2 cache: 512 KB ~ 1MB, many nsec
L3 cache, .....: automatic HW cache management
(together these levels form the memory abstraction)
Cache Basics and Operation
Cache
Most commonly in the on-die context: an
automatically-managed memory hierarchy based on
SRAM
stores in SRAM the most frequently accessed DRAM
memory locations to avoid repeatedly paying for the
DRAM access latency
Caching Basics
Block (line): Unit of storage in the cache
Memory is logically divided into cache blocks that map
to locations in the cache
When data is referenced:
HIT: If in cache, use cached data instead of accessing
memory
MISS: If not in cache, bring block into cache
Maybe have to kick something else out to do it
[Diagram: the address is split into tag, index, and byte-in-block offset. The tag store holds a valid (V) bit and a tag per block; a comparator (=?) checks the stored tag against the address tag to produce hit/miss. The data store holds the blocks; a MUX selects the byte in the block on a hit.]
Direct-Mapped Caches: Problem
Direct-mapped cache: Two blocks in memory that
map to the same index in the cache cannot be
present in the cache at the same time
One index → one entry
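A minimal C sketch of a direct-mapped lookup (the geometry is hypothetical: 256 blocks of 64 bytes; the helper name is illustrative, not from the slides):

#include <stdbool.h>
#include <stdint.h>

#define NUM_BLOCKS 256   /* hypothetical geometry */
#define BLOCK_SIZE 64

typedef struct { bool valid; uint32_t tag; uint8_t data[BLOCK_SIZE]; } Line;
Line cache[NUM_BLOCKS];

/* Split the address into offset, index, tag; hit iff the indexed
   entry is valid and its stored tag matches. */
bool lookup(uint32_t addr, uint8_t *out) {
    uint32_t offset = addr % BLOCK_SIZE;
    uint32_t index  = (addr / BLOCK_SIZE) % NUM_BLOCKS;
    uint32_t tag    = addr / ((uint32_t)BLOCK_SIZE * NUM_BLOCKS);
    if (cache[index].valid && cache[index].tag == tag) {
        *out = cache[index].data[offset];   /* HIT */
        return true;
    }
    return false;   /* MISS: fetch the block, possibly evicting another */
}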
Set Associativity
Block Addresses 0 and 8 always conflict in a direct-mapped cache
Instead of having one column of 8, have 2 columns of 4
blocks
[Diagram: tag store and data store for a 2-way set-associative cache. Each SET has two (V, tag) entries; two comparators (=?) drive the hit logic, a MUX picks the matching way in the data store, and a byte-in-block MUX selects the byte. Versions with 4 and 8 comparators show the 4-way and 8-way organizations.]
Associativity (and Tradeoffs)
How many blocks can map to the same index (or
set)?
Higher associativity
++ Higher hit rate
-- Slower cache access time (hit latency and data access
latency)
-- More expensive hardware (more comparators)
[Plot: hit rate vs. associativity]
Which block to replace on a cache miss?
Which block in the set to replace on a cache miss?
Any invalid block first
If all are valid, consult the replacement policy
FIFO
Least recently used (how to implement?)
Random
Implementing LRU
Idea: Evict the least recently accessed block
Problem: Need to keep track of the access ordering of blocks, i.e., timestamps
Approximations of LRU
Most modern processors do not implement “true
LRU” in highly-associative caches
Why?
True LRU is complex
Pseudo-LRU:
associate a bit with each block; set it when the block is accessed
if all bits for a set become set, reset all bits for that set except the most-recently set bit
Victim: a block whose bit is off; choose randomly if more than one bit is off
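A minimal C sketch of this bit-per-block approximation (an 8-way set is assumed for illustration):

#include <stdint.h>

#define WAYS 8   /* hypothetical: one 8-way set */

typedef struct { uint8_t mru[WAYS]; } SetBits;   /* one bit per block */

/* On an access: set the block's bit; if all bits become 1,
   clear every bit except the most-recently set one. */
void touch(SetBits *s, int way) {
    s->mru[way] = 1;
    int all_set = 1;
    for (int w = 0; w < WAYS; w++)
        if (!s->mru[w]) all_set = 0;
    if (all_set) {
        for (int w = 0; w < WAYS; w++) s->mru[w] = 0;
        s->mru[way] = 1;
    }
}

/* Victim: any block whose bit is off (first one found here; a real
   design may randomize among them). */
int pick_victim(const SetBits *s) {
    for (int w = 0; w < WAYS; w++)
        if (!s->mru[w]) return w;
    return 0;   /* unreachable: touch() always leaves a 0 bit */
}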
What’s In A Tag Store Entry?
Valid bit
Tag
Replacement policy bits
Dirty bit?
Write back vs. write through caches
Handling Writes (Stores)
When do we write the modified data in a cache to the
next level?
Write back: When the block is evicted
Write through: At the time the write happens
Write-back
+ Can consolidate multiple writes to the same block before
eviction
Potentially saves bandwidth between cache levels + saves
energy
-- Need a bit in the tag store indicating the block is
“modified”
Write-through
+ Simpler
+ All levels are up to date. Consistency: simpler cache coherence
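A minimal sketch of the write-back bookkeeping described above (the structure and helper name are hypothetical): a write hit only marks the line dirty; the next level is updated when the line is evicted.

#include <stdbool.h>
#include <stdint.h>

typedef struct { bool valid, dirty; uint32_t tag; } Line;

/* Write hit in a write-back cache: just set the "modified" bit. */
void write_hit(Line *l) {
    l->dirty = true;   /* defer the write to the next level */
}

/* Eviction: only a dirty line costs a write to the next level. */
void evict(Line *l) {
    if (l->valid && l->dirty) {
        /* write_block_to_next_level(l);   hypothetical helper */
    }
    l->valid = false;
    l->dirty = false;
}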
Handling Writes (Stores)
Do we allocate a cache block on a write miss?
Allocate on write miss: Yes
No-allocate on write miss: No
Allocate on write miss
+ Can consolidate writes instead of writing each of them
individually to next level
+ Simpler because write misses can be treated the same
way as read misses
-- Requires transfer of the whole cache block
No-allocate
+ Conserves cache space if locality of writes is low
(potentially better cache hit rate)
Handling Writes (Stores)
Combining write-through/write-back with allocate/no-allocate on a write miss:
typically, write-back caches allocate the block on a write miss
typically, write-through caches don’t allocate the block on a write miss
Multicore processors can implement a write-through policy for L1 caches to keep the L2 cache consistent; the L2 cache can then implement a write-back policy to reduce bandwidth utilization.
Write Buffer & Write-through Caches
In write-through caches the processor waits for the data to be written to the lower level, e.g. memory
Write Buffer:
with a write buffer, the processor just writes the data there and can continue as soon as the data is written
writes to the same block are merged
overlaps processor execution with memory updating
Victim Buffer & Write-back Caches
Dirty replaced blocks are written to a victim buffer rather than the lower level, e.g. memory
Miss Causes: 3 C’s
Compulsory or cold miss
First-ever reference to a given block of memory
occurs in all types of caches
Measure: number of misses in an infinite cache model
Capacity miss
the working set exceeds the cache capacity
useful blocks (with future references) are replaced and later retrieved from a lower level, e.g. memory
Measure: additional misses in a fully-associative cache of the same size
Miss Causes: 3 C’s
Conflict miss
occurs in direct-mapped or set-associative caches
too many blocks map to the same index, so a block is replaced and later retrieved
Measure: misses that occur because of conflicts
Reducing 3 C’s
Compulsory misses:
increase the block size
Capacity misses:
increase cache size
Conflict misses:
increase associativity
increase cache size (spreads references over more indices)
Hierarchical Latency Analysis
For a given memory hierarchy level i it has a technology-intrinsic
access time of ti, The perceived access time Ti is longer than ti
Except for the outer-most hierarchy, when looking for a given
address there is
a chance (hit-rate hi) you “hit” and access time is ti
a chance (miss-rate mi) you “miss” and access time ti +Ti+1
hi + mi = 1
Thus
Ti = hi·ti + mi·(ti + Ti+1)
Ti = ti + mi ·Ti+1
37
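For example, with hypothetical values t1 = 1, m1 = 0.1, t2 = 10, m2 = 0.05, T3 = 100 (all in cycles):
T2 = t2 + m2·T3 = 10 + 0.05 × 100 = 15
T1 = t1 + m1·T2 = 1 + 0.1 × 15 = 2.5 cycles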
Hierarchy Design Considerations
Recursive latency equation
Ti = ti + mi ·Ti+1
The goal: achieve desired T1 within allowed cost
Ti ≈ ti is desirable
Keep mi low
increasing capacity Ci lowers mi, but beware of increasing ti
lower mi by smarter management (replacement: anticipate what you don’t need; prefetching: anticipate what you will need)
Reducing Memory Access Time
AMAT = Hit time + Miss rate × Miss penalty
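For example, with the figures used later in this lecture (hit time = 1 cycle, miss rate = 2%, miss penalty = 200 cycles):
AMAT = 1 + 0.02 × 200 = 5 cycles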
Virtual Memory
Physical Addressing
The CPU generates physical addresses (PAs) to access memory
no VAs
Where used?
early PCs
digital signal processors, embedded microcontrollers, and Cray supercomputers still use it
Virtual addressing
The CPU generates virtual addresses (VAs)
Where used?
most laptops, servers, and modern PCs
Memory: Programmers’ View
Virtual Memory
Virtual memory is imaginary memory
it gives you the illusion of memory that’s not physically there
it provides the illusion of a large address space
This illusion is provided separately for each program
Using Physical Memory Efficiently
Key idea: only the active portion of the program is loaded into memory
achieved using demand paging
the operating system copies a disk page into physical memory only if an attempt is made to access it
the rest of the virtual address space is stored on disk
Virtual memory gets the most out of physical memory
keep only active areas of the virtual address space in fast memory
transfer data back and forth as needed
Using Physical Memory Efficiently
Problem: a page is on disk and not in memory, i.e. a page fault
Solution: an OS routine is called to load the data from disk to memory; the current process suspends execution; the OS has full control over placement
Using Physical Memory Efficiently
Demand paging: [figure omitted]
Using Physical Memory Simply
Key idea: Each process has its own virtual address
space defined by its page table
simplifies memory allocation
a virtual page can be mapped to any physical page
simplifies sharing code and data among processes
the OS can map virtual pages of different processes to the same shared physical page
simplifies loading the program anywhere in memory
just change the mapping in the page table
physical memory need not be contiguous
simplifies allocating memory to applications
malloc() in C
Using Physical Memory Simply
Using Physical Memory Safely
Key idea: protection
Processes cannot interfere with each other
because they operate in different address spaces
User processes can read but cannot modify privileged information, e.g. I/O
Different sections of the address space have different permissions, e.g. read-only, read/write, execute
permission bits in the page table entries
Using Physical Memory Safely
Virtual Memory Implementation
HW/SW Support for Virtual Memory
Processor must
support Virtual address & Physical address
support two modes of execution
user & supervisor
protect privileged info
user process can only read user/supervisor mode bit,
page table pointer, TLB
support switching between user mode and supervisor mode, and vice versa
user process wants to perform an I/O operation
user process requests to allocate memory, e.g. malloc()
check protection bits while making a reference to memory
HW/SW Support for Virtual Memory
Operating System
creates separate page tables for processes so they
don’t interfere
all page tables reside in the OS address space and so
user process cannot update/modify them
assists processes to share data by making changes in
the page tables
assigns protection bits in the page tables, e.g. read-only
Paging
One of the techniques to implement virtual memory
[Diagram: Program 1 and Program 2 each see a 4GB virtual address space, both mapped onto a 16MB physical memory]
Paging
1. Assigns a virtual address space to each process
Each “program” gets its own separate virtual address space
2. Divides this address space into fixed-size chunks called virtual pages
Paging
3. Maps virtual pages to physical pages
before a virtual page can be accessed, it must be mapped to a physical page
Paging-4Qs
Where can a page be placed in main memory?
page miss penalty is very high → want the lowest possible miss rate → fully-associative placement
trade-off between placement algorithm complexity and miss rate
How is a page found if it’s in main memory?
using a page table, indexed by the virtual page number
Paging-4Qs
Which page should be replaced if main memory is full?
the OS uses an LRU replacement policy
it replaces a page that was referenced/touched a long time ago
a reference (use) bit for each page in the page table
What happens when a write is performed on a page?
write-back policy, because the access time of a magnetic disk is very high, i.e. millions of cycles
a dirty bit for each page in the page table
Paging in Intel 80386
Intel 80386 (Mid 80s)
32-bit processor
4KB virtual/physical pages
Q: What is the size of a virtual address space?
A: 2^32 = 4GB
Q: How many virtual pages per virtual address
space?
A: 4GB/4KB = 1M
Q: What is the size of the physical address space?
A: assume 2^32 = 4GB
Q: How many physical pages in the physical address
space?
A: 4GB/4KB = 1M
Intel 80386: Virtual Pages
32-bit Virtual Address: bits 31–12 = virtual page number (VPN), bits 11–0 = offset (X…X)
VPN …000000000 selects Virtual Page 0 (0KB–4KB of the 4GB space)
[Diagram: Virtual Pages 0, 1, 2, …, 1M-1 stacked from 0KB up to 4GB]
Intel 80386: Virtual Pages
32-bit Virtual Address with VPN …000000001 selects Virtual Page 1 (4KB–8KB)
Intel 80386: Virtual Pages
32-bit Virtual Address with VPN …111111111 selects Virtual Page 1M-1 (the topmost page, just below 4GB)
Intel 80386: Translation
Assume: Virtual Page 7 is mapped to Physical Page
32
For an access to Virtual Page 7 …
Virtual Address: VPN (bits 31–12) = …0000000111, Offset (bits 11–0) = …011001
translated ↓
Physical Address: PPN (bits 31–12) = …0000100000, same Offset = …011001
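A minimal C sketch of this translation, using the flat PAGE_TABLE introduced on the next slide (valid/permission checks omitted):

#include <stdint.h>

extern uint32_t PAGE_TABLE[1 << 20];   /* VPN -> PPN, as on the next slide */

uint32_t translate(uint32_t va) {
    uint32_t vpn    = va >> 12;        /* bits 31-12 */
    uint32_t offset = va & 0xFFF;      /* bits 11-0 */
    uint32_t ppn    = PAGE_TABLE[vpn]; /* e.g., VPN 7 -> PPN 32 */
    return (ppn << 12) | offset;       /* offset passes through unchanged */
}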
Intel 80386: Flat Page Table
VA = 32 bits; page offset = 12 bits (4KB pages); PTE = 4 bytes
Page Table Size = 2^20 × 4B = 4MB
Intel 80386: Flat Page Table
uint32 PAGE_TABLE[1<<20];
PAGE_TABLE[7]=2;
[Diagram: the VPN indexes the flat table. PTE7 holds PPN 000000010; the other PTEs (PTE0, PTE1, …) are NULL. The physical address is the PPN (bits 31–12) concatenated with the page offset XXX (bits 11–0).]
Intel 80386: Page Table
Two problems with page tables
Each page table entry (PTE) maps one virtual page to a physical page:
uint32 PAGE_TABLE[1<<20];
PAGE_TABLE[65]=981;
PAGE_TABLE[3161]=1629;
PAGE_TABLE[9327]=524;
...
Problem #1: Page table is too large
Typically, the vast majority of PTEs are empty
PAGE_TABLE[0]=141;
...
PAGE_TABLE[532]=1190;
PAGE_TABLE[534]=NULL;
...                      // empty region
PAGE_TABLE[1048401]=NULL;
PAGE_TABLE[1048402]=845;
...
PAGE_TABLE[1048575]=742; // 1048575=(1<<20)-1;
Q: Why? A: The virtual address space is extremely large, and most of it goes unused.
Problem #1: Page table is too large
Solution 1: “unallocate” the empty PTEs to save space
PAGE_TABLE[0]=141;
...
PAGE_TABLE[532]=1190;
PAGE_TABLE[534]=NULL;
...
PAGE_TABLE[1048401]=NULL;
PAGE_TABLE[1048402]=845;
...
PAGE_TABLE[1048575]=742; // 1048575=(1<<20)-1;
Problem #1: Page table is too large
To allow PTEs to be “unallocated” …
the page table must be restructured
Before restructuring: flat
uint32 PAGE_TABLE[1024*1024];
PAGE_TABLE[7]=2;
PAGE_TABLE[1023]=381;
Solution 2: after restructuring: hierarchical
uint32 *PAGE_DIRECTORY[1024];
PAGE_DIRECTORY[0]=malloc(sizeof(uint32)*1024);
PAGE_DIRECTORY[0][7]=2;
PAGE_DIRECTORY[0][1023]=381;
PAGE_DIRECTORY[1]=NULL; // 1024 PTEs unallocated
PAGE_DIRECTORY[2]=NULL; // 1024 PTEs unallocated
Problem #1: Page table is too large
PAGE_DIRECTORY[0][7]=2;
[Diagram: VPN[19:0] = 0000000000_0000000111 splits into a directory index (VPN[19:10] = 0) and a table index (VPN[9:0] = 7). PDE0 in PAGE_DIR points to PAGE_TABLE0, whose PTE7 holds PPN 000000010; all other PDEs and PTEs shown are NULL.]
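A minimal C sketch of the corresponding two-level walk (NULL marks an unallocated page table here; error handling is simplified):

#include <stddef.h>
#include <stdint.h>

extern uint32_t *PAGE_DIRECTORY[1024];   /* from the code above */

/* VPN[19:10] indexes the directory, VPN[9:0] the page table. */
uint32_t walk(uint32_t va) {
    uint32_t vpn = va >> 12;
    uint32_t *pt = PAGE_DIRECTORY[vpn >> 10];
    if (pt == NULL)
        return (uint32_t)-1;             /* 1024 PTEs unallocated */
    uint32_t ppn = pt[vpn & 0x3FF];
    return (ppn << 12) | (va & 0xFFF);
}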
Intel 80386: Page Table
Two problems with page tables
Problem #2: Page table is stored in memory
Solution: Translation Lookaside Buffer (TLB)
Intel 80386: Page Table
Two problems with page tables
Problem #1: Page table is too large
Page table has 1M entries
Each entry is 4B
Page table = 4MB (!!)
very expensive in the 80s
Solution: Hierarchical page table
Problem #2: Page table is in memory
Before every memory access, always fetch the PTE from slow memory? Large performance penalty
Solution: Translation Lookaside Buffer (TLB)
Translation Lookaside Buffer (TLB)
Problem: Context Switch
Assume that Process X is running
Process X’s VPN 5 is mapped to PPN 100
The TLB caches this mapping
[TLB entry: VPN 5 → PPN 100]
Handling TLB Misses
The TLB is small; it cannot hold all PTEs
Some translations will inevitably miss in the TLB
Must access memory to find the appropriate PTE
Called walking the page directory/table
Large performance penalty
Handling TLB Misses
Approach #1. Hardware-Managed (e.g., x86)
The hardware does the page walk
The hardware fetches the PTE and inserts it into the
TLB
If the TLB is full, the entry replaces another
entry
All of this is done transparently
Handling TLB Misses
Hardware-Managed TLB
Pro: No exceptions. Instruction just stalls
Pro: Independent instructions may continue
Pro: Small footprint (no extra instructions/data)
Con: Page directory/table organization is etched in
stone
Software-Managed TLB
Pro: The OS can design the page directory/table
Pro: More advanced TLB replacement policy
Con: Flushes pipeline
Con: Performance overhead
Page Fault
If a virtual page is not mapped to a physical page, it does not have a valid PTE, and accessing it raises a page fault.
Servicing a Page Fault
Processor communicates with the disk controller
Read a block of length P starting at disk address X and store it starting at memory address Y
The read occurs via Direct Memory Access (DMA), handled by the I/O controller
The controller signals completion with an interrupt
The OS resumes the suspended process
Why Mapped to Disk?
Why would a virtual page ever be mapped to disk?
Two possible reasons
1. Demand Paging
When a large file on disk needs to be read, not all of it is loaded into memory at once
Instead, page-sized chunks are loaded on demand
If most of the file is never actually read …
Saves time (remember, disk is extremely slow)
Saves memory space
Q: When can demand paging be bad?
Why Mapped to Disk? (cont’d)
2. Swapping
Assume that physical memory is exhausted
You are running many programs that require lots of
memory
What happens if you try to run another program?
Some physical pages are “swapped out” to disk
I.e., the data in some physical pages are migrated to
disk
This frees up those physical pages
As a result, their PTEs become invalid
When you access a page that has been swapped out, only then is it brought back into physical memory
This may cause another physical page to be swapped
out
If this “ping-ponging” occurs frequently, it is called thrashing.
Virtual Memory and Cache Interaction
Address Translation and Caching
When do we do the address translation?
Before or after accessing the L1 cache?
Synonym problem:
Two different virtual addresses can map to the same physical address → the same physical address can be present in multiple locations in the cache → can lead to inconsistency in data
Homonyms and Synonyms
Homonym: Same VA can map to two different PAs
Why?
VA is in different processes
[Diagram: three organizations for translation and caching:
(1) physical cache: VA → TLB → PA → cache
(2) virtual cache: VA → cache, with the TLB consulted only on a miss → PA
(3) virtual-physical cache: the cache is indexed with the VA while the TLB translates in parallel; the PA is used for the tag check]
Virtual Cache
Problem: homonyms/synonyms [figure omitted]
Virtual Cache
Why don’t we use virtual caches?
page-level protection
homonyms
synonyms
I/O devices
Virtual-Physical Cache
no homonyms
synonyms can occur, depending on the cache index size
Virtually-Indexed Physically-Tagged
If C ≤ (page_size × associativity), the cache index bits come only from the page offset (same in VA and PA)
If both cache and TLB are on chip
index both arrays concurrently using VA bits
check cache tag (physical) against TLB output at the end
[Diagram: the TLB and the physically-tagged cache are indexed in parallel with VA bits]
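As a worked example (page size and associativity hypothetical): with 4KB pages, C ≤ page_size × associativity means a 4-way cache can be virtually indexed up to 4KB × 4 = 16KB; the index bits then fall entirely within the 12-bit page offset, which is identical in the VA and PA.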
Instruction vs. Data Caches
Unified:
+ Dynamic sharing of cache space: no overprovisioning
that might happen with static partitioning (i.e., split I
and D caches)
-- Instructions and data can thrash each other (i.e., no
guaranteed space for either)
-- I and D are accessed in different places in the pipeline.
Where do we place the unified cache for fast access?
Multi-level Caching in a Pipelined Design
First-level caches (instruction and data)
Decisions very much affected by cycle time
Small, lower associativity
Tag store and data store accessed in parallel
Second-level caches
Decisions need to balance hit rate and access latency
Usually large and highly associative; latency not as
important
Tag store and data store accessed serially
CPU cycles = Instruction Count (IC) * Average Cycles Per Instruction (CPI)
Miss Rate
Two ways to specify miss rate
Miss Rate = misses / total accesses
Misses/Inst = Miss Rate × Memory Accesses Per Inst
dependent on the ISA, e.g. x86 vs. MIPS
Effect of Cache on Program Execution
Sol:
Miss Penalty = 200 cycles
CPI = 1 cycle
Miss Rate = 2%
Memory Accesses Per Inst = 1.5
Misses per 1000 instructions = 30
Prog Execution Time (perfect cache) = ?
Prog Execution Time (with cache) = ?
Prog Execution Time (no cache) = ?
Effect of Cache on Program Execution
Prog Execution Time = IC × (CPI + Memory Accesses Per Inst × Miss Rate × Miss Penalty) × cycle time
For the perfect cache, put Miss Rate = 0%
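Plugging in the numbers from the previous slide:
Prog Execution Time (perfect cache) = IC × 1 × cycle time
Prog Execution Time (with cache) = IC × (1 + 1.5 × 2% × 200) × cycle time = 7 × IC × cycle time
Prog Execution Time (no cache, assuming every access pays the full 200-cycle penalty) = IC × (1 + 1.5 × 200) × cycle time = 301 × IC × cycle time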
Choosing the Right Cache
Sol:
CPI = 1.6
cycle time = 0.35ns
memory accesses per inst = 1.4
cycle time = 0.35 × 1.35ns for 2-way
miss penalty = 65ns
hit time = 1 cycle
miss rate = 2.1% (direct-mapped); 1.9% (2-way)
average access time = ?
performance = ?
Choosing the Right Cache
Direct-mapped:
access time = hit time + miss rate × miss penalty
= 0.35 + 2.1% × 65 = 1.72ns
Prog. execution time = IC × [(CPI × cycle time) + (miss rate × memory accesses per inst × miss penalty)]
= IC × [(1.6 × 0.35) + (2.1% × 1.4 × 65)] = 2.47 × IC ns
2-way:
access time = 0.35 × 1.35 + 1.9% × 65 = 1.71ns
Prog. execution time = IC × [(1.6 × 0.35 × 1.35) + (1.9% × 1.4 × 65)] = 2.49 × IC ns
Although the 2-way cache has a slightly lower average access time, its longer cycle time (0.35 × 1.35ns) penalizes every instruction, so the direct-mapped design wins on overall execution time.