Lecture 11. Memory Hierarchy

Computer Architecture

Memory Hierarchy and Caches

Muhammad Abid
DCIS, PIEAS
Memory Subsystem
 The performance of a computer system depends not
only on the microprocessor but also on its memory
subsystem. Why?
 instructions and data are stored in memory.
 A good architect tries to improve the performance of
both the microprocessor and the memory subsystem.

2
Why Memory Hierarchy?
 We want both fast and large

[Figure: CPU (with register file, RF) → Cache → Main Memory (DRAM) → Hard Disk]

3
Why Memory Hierarchy?
 We want both fast and large

 But we cannot achieve both with a single level of
memory

 Solution:
 Have multiple levels of storage (progressively bigger
and slower as the levels are farther from the
processor)
 Exploit spatial and temporal locality in programs to
ensure most of the data the processor needs is kept in
the faster levels

4
The Memory Hierarchy

[Pyramid figure: the level closest to the processor is fast and small ("move
what you use here"); the level farthest away is big but slow and cheaper per
byte ("back up everything here"); levels get faster per byte toward the top.
With good locality of reference, memory appears as fast as the top level and
as large as the bottom level.]
5
Locality
 One’s recent past is a very good predictor of his/her
near future.

 Temporal Locality: If you just did something, it is


very likely that you will do the same thing again
soon
 since you are here today, there is a good chance you
will be here again and again regularly

 Spatial Locality: If you did something, it is very


likely you will do something similar/related (in
space)
 every time I find you in this room, you are probably
sitting close to the same people
6
Memory Locality
 A “typical” program has a lot of locality in memory
references
 typical programs are composed of “loops”

 Temporal: A program tends to reference the same


memory location many times and all within a small
window of time

 Spatial: A program tends to reference a cluster of


memory locations at a time
 most notable examples:
 1. instruction memory references
 2. array/data structure references

7
Caching Basics: Exploit Temporal Locality
 Idea: Store recently accessed data in automatically
managed fast memory (called cache)
 Anticipation: the data will be accessed again soon

 Temporal locality principle


 Recently accessed data will be again accessed in the
near future

8
Caching Basics: Exploit Spatial Locality
 Idea: Store addresses adjacent to the recently
accessed one in automatically managed fast
memory
 Logically divide memory into equal size blocks
 Fetch to cache the accessed block in its entirety
 Anticipation: nearby data will be accessed soon

 Spatial locality principle


 Nearby data in memory will be accessed in the near
future
 E.g., sequential instruction access, array traversal
 This is what IBM 360/85 implemented
 16 Kbyte cache with 64 byte blocks

9
A Note on Manual vs. Automatic
Management
 Manual: Programmer manages data movement

across levels
-- too painful for programmers on substantial programs
still done in some embedded processors (on-chip scratch

pad SRAM in lieu of a cache)

 Automatic: Hardware manages data movement


across levels, transparently to the programmer
++ programmer’s life is easier
simple heuristic: keep most recently used items in cache

the average programmer doesn’t need to know about it

 You don’t need to know how big the cache is and how it
works to write a “correct” program! (What if you want a
“fast” program?)
10
A Modern Memory Hierarchy
 Register file: 32 words, sub-nsec (manual/compiler register spilling)
 L1 cache: ~32 KB, ~nsec
 L2 cache: 512 KB ~ 1 MB, many nsec
 L3 cache: ..... (the caches are under automatic HW management)
 Main memory (DRAM): GB, ~100 nsec
 Swap disk: 100 GB, ~10 msec (automatic demand paging)
11
Review of previous lecture
 we need a memory hierarchy to make the memory
subsystem appear fast and large.
 Program features: spatial and temporal locality.
 caches exploit spatial and temporal locality by
storing blocks of data. Without caches, spatial and
temporal locality still exist, but program execution is
slower since the processor must read data from
memory.
 The processor moves blocks of data b/w caches and
memory to make execution fast, assuming spatial
and temporal locality exist.

12
Cache Basics and Operation
Cache
 Most commonly in the on-die context: an
automatically-managed memory hierarchy based on
SRAM
 stores in SRAM the most frequently accessed DRAM
memory locations to avoid repeatedly paying for the
DRAM access latency

14
Caching Basics
 Block (line): Unit of storage in the cache
 Memory is logically divided into cache blocks that map
to locations in the cache
 When data referenced
 HIT: If in cache, use cached data instead of accessing
memory
 MISS: If not in cache, bring block into cache
 Maybe have to kick something else out to do it

 Some important cache design decisions


 Placement: where and how to place/find a block in
cache?
 Replacement: what data to remove to make room in
cache?
 Granularity of management: large, small, uniform
blocks?
15
Cache Abstraction and Metrics

[Figure: the address goes to the Tag Store (is the address in the cache? +
bookkeeping) and to the Data Store; the Tag Store answers Hit/miss?, the
Data Store returns the Data.]

 Cache hit rate = (# hits) / (# accesses) = (# hits) / (# hits + # misses)
 Cache miss rate = (# misses) / (# accesses) = (# misses) / (# hits + # misses)
 cache hit rate + cache miss rate = 1
16
Blocks and Addressing the
Cache
Memory is logically divided into cache blocks

 Each block maps to a location in the cache,
determined by the index bits in the address
[Address fields (8-bit address): tag (2 bits) | index (3 bits) | byte in block (3 bits)]
 2^(index bits) = cache size / (block size * associativity)
 the index bits are used to index into the tag and data stores

 Cache access: index into the tag and data stores


with index bits in address, check valid bit in tag
store, compare tag bits in address with the stored
tag in tag store
 If a block is in the cache (cache hit), the tag store
should have the tag of the block stored in the index
of the block
17
Direct-Mapped Cache: Placement
 Assume byte-addressable memory:
256 bytes, 8-byte blocks → 32 blocks
 Assume cache: 64 bytes, 8 blocks
 Direct-mapped: A block can go to only one
location
[Figure: the index selects one (valid bit, tag) entry in the tag store and one
block in the data store; the stored tag is compared (=?) with the address tag
and a MUX picks the byte in block; outputs are Hit? and Data.]

 Addresses with the same index contend for the same
location
18

Direct-Mapped Cache: Access
 byte address: 31, 192?

[Address fields: tag (2 bits) | index (3 bits) | byte in block (3 bits).
Figure: the index bits of the address select a tag-store entry (valid bit + tag)
and a data-store block; the stored tag is compared (=?) with the address tag,
and a MUX picks the byte in block; outputs are Hit? and Data.]
19
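A worked illustration of the two accesses above, as a small C sketch (not part of the original slides): it splits 8-bit addresses into the tag, index, and byte-in-block fields of this 64-byte, 8-block direct-mapped cache.

#include <stdio.h>
#include <stdint.h>

#define OFFSET_BITS 3   /* 8-byte blocks  -> 3 byte-in-block bits */
#define INDEX_BITS  3   /* 8 cache blocks -> 3 index bits         */

int main(void) {
    uint8_t addrs[] = {31, 192};
    for (int i = 0; i < 2; i++) {
        uint8_t a      = addrs[i];
        uint8_t offset = a & ((1u << OFFSET_BITS) - 1);
        uint8_t index  = (a >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        uint8_t tag    = a >> (OFFSET_BITS + INDEX_BITS);
        printf("addr %3u -> tag %u, index %u, byte in block %u\n",
               a, tag, index, offset);
    }
    return 0;
}

Address 31 falls in index 3 with tag 0 and address 192 in index 0 with tag 3, so these two accesses do not contend for the same cache location.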
Direct-Mapped Caches Problem
 Direct-mapped cache: Two blocks in memory that
map to the same index in the cache cannot be
present in the cache at the same time
 One index  one entry

 Can lead to 0% hit rate if more than one block


accessed in an interleaved manner map to the same
index
 Assume addresses A and B have the same index bits
but different tag bits
 A, B, A, B, A, B, A, B, …  conflict in the cache index
 All accesses are conflict misses

20
Set Associativity
 Block Addresses 0 and 8 always conflict in direct mapped
cache
 Instead of having one column of 8, have 2 columns of 4
blocks
[Figure: each set holds two (valid bit, tag) entries in the tag store and two
blocks in the data store; both tags are compared (=?) in parallel, logic
produces Hit?, and MUXes select the block and the byte in block.
Address fields: tag (3 bits) | index (2 bits) | byte in block (3 bits).]
 Associative memory within the set
-- More complex, slower access, larger tag store
+ Accommodates conflicts better (fewer conflict misses)
21
Higher Associativity
 4-way
[Figure: the tag store holds four (valid bit, tag) entries per set, all compared
(=?) in parallel; logic produces Hit?; MUXes select the block from the data
store and then the byte in block.]

-- More tag comparators and wider data mux; larger
tags
+ Likelihood of conflict misses even lower
22
Full Associativity
 Fully associative cache
 A block can be placed in any cache location

[Figure: the tag store holds one tag per block; all eight stored tags are
compared (=?) in parallel, logic produces Hit?, and MUXes select the block
from the data store and then the byte in block.]
23
Associativity (and Tradeoffs)
 How many blocks can map to the same index (or
set)?

 Higher associativity
++ Higher hit rate
-- Slower cache access time (hit latency and data access
latency)
-- More expensive hardware (more comparators)

 Diminishing returns from higher associativity
[Plot: hit rate vs. associativity; the curve rises steeply at first and then
flattens out.]
24
Which block to replace on a cache miss?
 Which block in the set to replace on a cache miss?
 Any invalid block first
 If all are valid, consult the replacement policy
 FIFO
 Least recently used (how to implement?)
 Random

25
Implementing LRU
 Idea: Evict the least recently accessed block
 Problem: Need to keep track of access ordering of
blocks, i.e. timestamps

 Question: 2-way set associative cache:


 What do you need to implement LRU?

 Question: 4-way set associative cache:


 How many different access orderings possible for the 4
blocks in the set?
 How many bits needed to encode the LRU order of a
block?
 What is the logic needed to determine the LRU victim?

26
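One common (though not the only) way to answer the questions above is to keep an explicit recency rank per way; the C sketch below is a hedged illustration of that idea for one set. For a 2-way set this collapses to a single LRU bit per set, and for a 4-way set there are 4! = 24 possible access orderings to encode.

#define WAYS 4

typedef struct {
    unsigned char rank[WAYS];   /* 0 = most recent, WAYS-1 = least recent; */
} lru_set_t;                    /* initialize with rank[i] = i             */

/* Called on every access to way w: w becomes rank 0, younger ways age by 1. */
void lru_touch(lru_set_t *s, int w) {
    unsigned char old = s->rank[w];
    for (int i = 0; i < WAYS; i++)
        if (s->rank[i] < old)
            s->rank[i]++;
    s->rank[w] = 0;
}

/* Victim on a miss: the way holding the oldest rank. */
int lru_victim(const lru_set_t *s) {
    for (int i = 0; i < WAYS; i++)
        if (s->rank[i] == WAYS - 1)
            return i;
    return 0;   /* not reached if ranks stay a permutation of 0..WAYS-1 */
}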
Approximations of LRU
 Most modern processors do not implement “true
LRU” in highly-associative caches

 Why?
 True LRU is complex
 Pseudo-LRU:
 associate a bit to each block; set it when block is
accessed
 If all bits for a set are set then reset bits for that set
with the exception of most-recently set bit.
 Victim: block whose bit is off; randomly choose if more
than one bit is off.

27
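The bit-per-block approximation described above, sketched in C for one set of an assumed 4-way cache (names are illustrative, not a specific processor's implementation):

#include <stdlib.h>

#define WAYS 4

typedef struct {
    unsigned char used[WAYS];   /* one "recently used" bit per block */
} plru_set_t;

/* Called on every access (hit or fill) to way w. */
void plru_touch(plru_set_t *s, int w) {
    s->used[w] = 1;
    int all_set = 1;
    for (int i = 0; i < WAYS; i++)
        if (!s->used[i]) all_set = 0;
    if (all_set)                        /* reset all bits for the set ...   */
        for (int i = 0; i < WAYS; i++)  /* ... except the most recently set */
            s->used[i] = (i == w);
}

/* Victim: any block whose bit is off, chosen randomly if more than one. */
int plru_victim(const plru_set_t *s) {
    int cand[WAYS], n = 0;
    for (int i = 0; i < WAYS; i++)
        if (!s->used[i]) cand[n++] = i;
    return cand[rand() % n];            /* after any touch, some bit is off */
}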
What’s In A Tag Store Entry?
 Valid bit
 Tag
 Replacement policy bits

 Dirty bit?
 Write back vs. write through caches

28
Handling Writes (Stores)
 When do we write the modified data in a cache to the
next level?
 Write back: When the block is evicted
 Write through: At the time the write happens

 Write-back
+ Can consolidate multiple writes to the same block before
eviction
 Potentially saves bandwidth between cache levels + saves
energy
 -- Need a bit in the tag store indicating the block is
“modified”
 Write-through
+ Simpler
+ All levels are up to date. Consistency: Simpler cache
coherence
29
Handling Writes (Stores)
 Do we allocate a cache block on a write miss?
 Allocate on write miss: Yes
 No-allocate on write miss: No
 Allocate on write miss
+ Can consolidate writes instead of writing each of them
individually to next level
+ Simpler because write misses can be treated the same
way as read misses
-- Requires transfer of the whole cache block

 No-allocate
+ Conserves cache space if locality of writes is low
(potentially better cache hit rate)

30
Handling Writes (Stores)
 write-through/write-back caches and allocate/no-
allocate on a write miss:
 typically write-back caches allocate block on a write
miss
 typically write-through caches don’t allocate block on a
write miss
 Multicore processors can implement write-through
policy for L1 caches to keep L2 cache consistent.
However, L2 cache can implement write-back policy
to reduce bandwidth utilization.

31
Write Buffer & Write-through Caches
 In write-through caches the processor waits for data to
be written to the lower level, e.g. memory
 Write Buffer:
 Using write buffer, processor just needs to write data
there and can continue as soon as data is written.
 writes to same block are merged
 overlapping processor execution with memory
updating

32
Victim Buffer & Write-back Caches
 Dirty replaced blocks are written to a victim buffer
rather than to the lower level, e.g. memory

33
Miss Causes: 3 C’s
 Compulsory or cold miss
 First-ever reference to a given block of memory
 occurs in all types of caches
 Measure: number of misses in an infinite cache model
 Capacity miss
 working set exceeds cache capacity
 useful blocks (with future references) replaced & later
retrieved from lower level, e.g. memory
 Measure: additional misses in a fully-associative cache

34
Miss Causes: 3 C’s
 Conflict miss
 occurs in direct-mapped or set-associative cache
 too many blocks mapped to an index and so a block is
replaced and later retrieved
 Measure: misses that occur because of conflicts

35
Reducing 3 C’s
 Compulsory misses:
 increase the block size
 Capacity misses:
 increase cache size
 Conflict misses:
 increase associativity
 increase cache size
 spreads out references to more indices

36
Hierarchical Latency Analysis
 A given memory hierarchy level i has a technology-intrinsic
access time of t_i; the perceived access time T_i is longer than t_i
 Except for the outer-most hierarchy level, when looking for a given
address there is
 a chance (hit-rate h_i) you “hit” and the access time is t_i
 a chance (miss-rate m_i) you “miss” and the access time is t_i + T_(i+1)
 h_i + m_i = 1
 Thus
T_i = h_i·t_i + m_i·(t_i + T_(i+1))
T_i = t_i + m_i·T_(i+1)

37
Hierarchy Design Considerations
 Recursive latency equation
T_i = t_i + m_i·T_(i+1)
 The goal: achieve desired T_1 within allowed cost
 T_i ≈ t_i is desirable

 Keep m_i low
 increasing capacity C_i lowers m_i, but beware of increasing t_i
 lower m_i by smarter management (replacement::anticipate what you
don’t need, prefetching::anticipate what you will need)

 Keep T_(i+1) low
 faster lower hierarchies, but beware of increasing cost
 introduce intermediate hierarchies as a compromise
38
Intel Pentium 4 Example
 90nm P4, 3.6 GHz
 L1 cache: C1 = 16 KB, t1 = 4 cycles
 L2 cache: C2 = 1024 KB, t2 = 18 cycles
 Main memory: t3 = ~50 ns, or 180 cycles

 if m1 = 0.1,  m2 = 0.1:  T1 = 7.6,  T2 = 36 cycles
 if m1 = 0.01, m2 = 0.01: T1 = 4.2,  T2 = 19.8 cycles
 if m1 = 0.05, m2 = 0.01: T1 = 5.00, T2 = 19.8 cycles
 if m1 = 0.01, m2 = 0.50: T1 = 5.08, T2 = 108 cycles
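A small C sketch (not from the slides) that reproduces one row of the table above with the recursive latency equation; changing m1 and m2 checks the other rows.

#include <stdio.h>

int main(void) {
    double t1 = 4, t2 = 18, t3 = 180;   /* intrinsic access times in cycles */
    double m1 = 0.01, m2 = 0.01;        /* per-level miss rates             */
    double T3 = t3;                     /* outermost level: T3 = t3         */
    double T2 = t2 + m2 * T3;           /* 19.8 cycles                      */
    double T1 = t1 + m1 * T2;           /* 4.198, i.e. ~4.2 cycles          */
    printf("T1 = %.2f cycles, T2 = %.2f cycles\n", T1, T2);
    return 0;
}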
Reducing Memory Access Time
 AMAT = access time + Miss rate * Miss penalty
 Reducing miss rate:
 larger block size
 larger cache size
 higher associativity
 smarter management (replacement::anticipate what you
don’t need, prefetching::anticipate what you will need)

40
Reducing Memory Access Time
 AMAT = access time + Miss rate * Miss penalty

 Reducing miss penalty:


 multilevel caches
 prioritizing reads
 early restart/ critical word first for read
 Reducing access time :
 avoiding VA to PA translation
 use simple/small cache i.e. direct-mapped
 access tag and data store in parallel for read/write

41
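The AMAT formula above as a one-line helper; a hedged C sketch using the direct-mapped numbers from the "Choosing the Right Cache" example at the end of this lecture (0.35 ns hit time, 2.1% miss rate, 65 ns miss penalty):

#include <stdio.h>

double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}

int main(void) {
    /* 0.35 + 0.021 * 65 = 1.715 ns, the ~1.72 ns quoted later */
    printf("AMAT = %.3f ns\n", amat(0.35, 0.021, 65.0));
    return 0;
}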
Virtual Memory
Physical Addressing
 CPU generates PA to access memory
 no VA
 Where used?
 early PCs
 digital signal processors, embedded microcontrollers
and Cray supercomputers still use it

43
Virtual addressing
 CPU generates VA
 Where used?
 Most laptops, servers, and modern PCs

44
Memory: Programmers’ View

45
Virtual Memory
 Virtual memory is imaginary memory
 it gives you illusion of memory that’s not physically
there
 it provides the illusion of a large address space
 This illusion is provided separately for each program

 Why virtual memory?


 Using physical memory efficiently
 Using physical memory simply
 Using physical memory safely

46
Using Physical Memory
Efficiently
key idea: only active portion of the program is
loaded into memory
 achieved using demand paging
 operating system copies a disk page into physical
memory only if an attempt is made to access it
 The rest of the virtual address space is stored on disk
 Virtual memory gets the most out of physical
memory
 Keep only active areas of virtual address space in
fast memory
 Transfer data back and forth as needed

47
Using Physical Memory
Efficiently
Prob: A page is on disk and not in memory, i.e. page
fault
 Sol: An OS routine is called to load data from disk to
memory; Current process suspends execution; OS
has full control over placement

48
Using Physical Memory
Efficiently
Demand paging:

49
Using Physical Memory Simply
 Key idea: Each process has its own virtual address
space defined by its page table
 simplifies memory allocation
 a virtual page can be mapped to any physical page
 simplifies sharing code and data among processes
 the OS can map virtual pages to same shared physical
page
 simplifies loading the program anywhere in memory
 just change mapping in the page table
 physical memory need not be contiguous
 simplifies allocating memory to applications
 malloc() in C

50
Using Physical Memory Simply

51
Using Physical Memory Safely
 Key idea: protection
 Processes cannot interfere with each other
 Because they operate in different address space
 User processes can read but cannot modify
privileged information, e.g. I/O
 Different sections of address space have different
permissions, e.g. read-only, read/write, execute
 permission bits in the page table entries

52
Using Physical Memory Safely

53
Virtual Memory
Implementation
HW/SW Support for Virtual
Memory
Processor must
 support Virtual address & Physical address
 support two modes of execution
 user & supervisor
 protect privileged info
 user process can only read user/supervisor mode bit,
page table pointer, TLB
 support switching b/w user mode to supervisor mode
and vice versa
 user process wants to perform I/O operation
 user process requests to allocate memory, e.g. malloc()
 check protection bits while making a reference to
memory

55
HW/SW Support for Virtual
Memory
Operating System
 creates separate page tables for processes so they
don’t interfere
 all page tables reside in the OS address space and so
user process cannot update/modify them
 assists processes to share data by making changes in
the page tables
 assigns protection bits in the page tables, e.g. Read-
only

56
Paging
 One of the techniques to implement virtual memory

[Figure: Program 1 and Program 2 each see a 4 GB virtual address space;
both are mapped onto the same 16 MB physical memory.]

57
Paging
1. Assigns virtual address space to each
process
 Each “program” gets its own separate virtual address
space
 Divides this address space into fixed-sized chunks
called virtual pages

2. Divides the physical address spaces into fixed-


sized chunks called Physical pages
 Also called a frame

 Size of virtual page == Size of physical page

58
Paging
3. Maps virtual pages to physical pages
 they must be mapped to physical pages

 This mapping is stored in page tables

59
Paging-4Qs
 Where can a page be placed in main memory?
 the page miss penalty is very high, so a lower miss rate
matters → fully-associative placement
 trade-off b/w placement algorithm vs miss rate
 How is a page found if it’s in main memory?
 using page table indexed by virtual page number

60
Paging-4Qs
 which page should be replaced if main memory is
full?
 OS uses LRU replacement policy
 it replaces page that was referenced/touched long
time ago
 reference or use bit for each page in the page table
 what happens when write is performed on a page?
 write-back policy because access time of magnetic
disk is very high, i.e. millions of cycles
 dirty bit for each page in the page table

61
Paging in Intel 80386
 Intel 80386 (Mid 80s)
 32-bit processor
 4KB virtual/physical pages
 Q: What is the size of a virtual address space?
 A: 2^32 = 4GB
 Q: How many virtual pages per virtual address
space?
 A: 4GB/4KB = 1M
 Q: What is the size of the physical address space?
 A: assume 2^32 = 4GB
 Q: How many physical pages in the physical address
space?
 A: 4GB/4KB = 1M

62
Intel 80386: Virtual Pages
[Figure: 32-bit virtual address; bits 31:12 are the VPN, bits 11:0 the offset.
VPN …000000000 selects Virtual Page 0, the lowest of the 1M virtual pages
that make up the 4 GB virtual address space.]

63
Intel 80386: Virtual Pages
[Figure: same layout; VPN …000000001 selects Virtual Page 1, the page
starting at 4 KB.]

64
Intel 80386: Virtual Pages
[Figure: same layout; VPN …111111111 selects Virtual Page 1M-1, the page
at the top of the 4 GB virtual address space.]

65
Intel 80386: Translation
 Assume: Virtual Page 7 is mapped to Physical Page
32
 For an access to Virtual Page 7 …

Virtual Address:  VPN (bits 31:12) = 0000000111 (7),  Offset (bits 11:0) = …011001
↓ translated
Physical Address: PPN (bits 31:12) = 0000100000 (32), Offset (bits 11:0) = …011001
66
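A minimal C sketch (not from the slides) of this translation: the low 12 offset bits pass through unchanged, and the 20-bit VPN is replaced by the PPN it maps to (VPN 7 → PPN 32 here; the offset value is illustrative).

#include <stdio.h>
#include <stdint.h>

#define PAGE_SHIFT 12   /* 4 KB pages */

int main(void) {
    uint32_t va  = (7u << PAGE_SHIFT) | 0x19;    /* VPN 7, some offset     */
    uint32_t vpn = va >> PAGE_SHIFT;
    uint32_t off = va & ((1u << PAGE_SHIFT) - 1);
    uint32_t ppn = 32;                           /* assume VPN 7 -> PPN 32 */
    uint32_t pa  = (ppn << PAGE_SHIFT) | off;
    printf("VA 0x%08x (VPN %u) -> PA 0x%08x (PPN %u)\n", va, vpn, pa, ppn);
    return 0;
}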
Intel 80386: Flat Page Table
 VA = 32 bits; page offset = 12 bits (4 KB pages); PTE = 4 bytes
 Page Table Size = 2^20 * 4B = 4MB

67
Intel 80386: Flat Page Table
uint32 PAGE_TABLE[1<<20];
PAGE_TABLE[7]=2;

[Figure: the VPN field of the virtual address (000000111 = 7) indexes the flat
PAGE_TABLE; PTE7 holds 000000010 = 2, which becomes the PPN of the
physical address, while the page offset is copied through unchanged. Most
other PTEs are NULL.]
Intel 80386: Page Table
 Two problems with page tables

 Problem #1: Page table is too


large
 Page table has 1M entries
 Each entry is 4B (because 4B = 20-bit PPN +
protection bits)
 Page table = 4MB (!!)
 very expensive in the 80s

 Problem #2: Page table is stored in memory


 Before every memory access, always fetch the PTE
from the slow memory? → Large performance
penalty
69
Problem #1: Page table is too large
 Page Table: A “lookup table” for the mappings
 Can be thought of as an array
 Each element in the array is called a page table

entry (PTE)
uint32 PAGE_TABLE[1<<20];
PAGE_TABLE[65]=981;
PAGE_TABLE[3161]=1629;
PAGE_TABLE[9327]=524; ...

70
Problem #1: Page table is too large
 Typically, the vast majority of PTEs are empty
PAGE_TABLE[0]=141;
...
PAGE_TABLE[532]=1190;
PAGE_TABLE[534]=NULL;
...
empty
PAGE_TABLE[1048401]=NULL;
PAGE_TABLE[1048402]=845;
...
PAGE_TABLE[1048575]=742; // 1048575=(1<<20)-1;
 Q: Why? − A: Virtual address space is extremely
large

71
Problem #1: Page table is too large
 Solution 1: “Unallocate” the empty PTEs to save
space
PAGE_TABLE[0]=141;
...
PAGE_TABLE[532]=1190;
PAGE_TABLE[534]=NULL;
...
PAGE_TABLE[1048401]=NULL;
PAGE_TABLE[1048402]=845;
...
PAGE_TABLE[1048575]=742; // 1048575=(1<<20)-1;

 Unallocating every single empty PTE is tedious


 Instead, unallocate only long stretches of empty PTEs

72
Problem #1: Page table is too large
 To allow PTEs to be “unallocated” …
 the page table must be restructured
 Before restructuring: flat
uint32 PAGE_TABLE[1024*1024];
PAGE_TABLE[7]=2;
PAGE_TABLE[1023]=381;
 Solution 2: After restructuring: hierarchical
uint32 *PAGE_DIRECTORY[1024];
PAGE_DIRECTORY[0]=malloc(sizeof(uint32)*1024);
PAGE_DIRECTORY[0][7]=2;
PAGE_DIRECTORY[0][1023]=381;
PAGE_DIRECTORY[1]=NULL; // 1024 PTEs unallocated
PAGE_DIRECTORY[2]=NULL; // 1024 PTEs unallocated

73
Problem #1: Page table is
too large
PAGE_DIRECTORY[0][7]=2;

[Figure: VPN[19:0] = 0000000000_0000000111. The upper 10 bits (directory
index = 0) select PDE0 in PAGE_DIR, which points to PAGE_TABLE0; the lower
10 bits (table index = 7) select PTE7, which holds 000000010 (PPN 2). The
other PDEs and PTEs are NULL.]
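A hedged C sketch of the two-level lookup drawn above. It assumes the PAGE_DIRECTORY declared two slides earlier, uses the standard uint32_t type, and (as in the slides) stores the PPN directly in each PTE.

#include <stdint.h>

extern uint32_t *PAGE_DIRECTORY[1024];   /* as on the previous slides */

/* Returns the PPN for a 20-bit VPN, or -1 if the page table is unallocated. */
int32_t page_walk(uint32_t vpn) {
    uint32_t dir_idx = vpn >> 10;        /* directory index: VPN[19:10] */
    uint32_t tab_idx = vpn & 0x3FF;      /* table index:     VPN[9:0]   */
    uint32_t *pt = PAGE_DIRECTORY[dir_idx];
    if (pt == 0)                         /* e.g. PAGE_DIRECTORY[1] == NULL */
        return -1;                       /* -> page fault                  */
    return (int32_t)pt[tab_idx];         /* e.g. page_walk(7) returns 2    */
}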
Intel 80386: Page Table
 Two problems with page tables

 Problem #1: Page table is too


large
 Page table has 1M entries
 Each entry is 4B (because 4B = 20-bit PPN +
protection bits)
 Page table = 4MB (!!)
 very expensive in the 80s

 Problem #2: Page table is stored in memory


 Before every memory access, always fetch the PTE
from the slow memory? → Large performance
penalty
75
Problem #2: Page table is stored in
memory
 every time the processor accesses memory, it needs to
access the process’ page table to get the VA translated
into a PA
 so memory is accessed two times → large performance
penalty
 Sol: cache PTEs

76
Problem #2: Page table is stored in
memory
 Solution: Translation Lookaside Buffer (TLB)

 caches Page Table Entries (PTEs) that are read from


process’ page tables stored in memory
 it makes successive translations from VA to PA fast
 typically small fully-associative cache with few entries
 e.g. AMD Opteron processor: 40 entries data TLB
 TLB hit or miss

77
Intel 80386: Page Table
 Two problems with page tables
 Problem #1: Page table is too large
 Page table has 1M entries
 Each entry is 4B
 Page table = 4MB (!!)
 very expensive in the 80s
 Solution: Hierarchical page table
 Problem #2: Page table is in memory
 Before every memory access, always fetch the PTE
from the slow memory? → Large performance
penalty
 Solution: Translation Lookaside Buffer

78
Translation Lookaside Buffer
(TLB)
Problem: Context Switch
 Assume that Process X is running
 Process X’s VPN 5 is mapped to PPN 100
 The TLB caches this mapping
 VPN 5  PPN 100

 Now assume a context switch to Process Y


 Process Y’s VPN 5 is mapped to PPN 200
 When Process Y tries to access VPN 5, it searches the
TLB
 Process Y finds an entry whose tag is 5
 Hurray! It’s a TLB hit!
 The PPN must be 100!
 … Are you sure?
79
Translation Lookaside Buffer
(TLB)
Problem: Context Switch
 Sol #1. Flush the TLB
 Whenever there is a context switch, flush the TLB
 All TLB entries are invalidated
 Example: 80386

 Sol # 2. Associate TLB entries with PID


 All TLB entries have an extra field in the tag ...
 That identifies the process to which it belongs
 also use PID while comparing tags
 Example: Modern x86, MIPS

80
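A hedged C sketch of Solution #2: the TLB tag is extended with a process identifier, so entries from different processes with the same VPN never match. Structure and field names are illustrative; the 40-entry size echoes the AMD Opteron data TLB mentioned earlier.

#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 40

typedef struct {
    bool     valid;
    uint16_t pid;   /* process / address-space identifier */
    uint32_t vpn;
    uint32_t ppn;
} tlb_entry_t;

static tlb_entry_t tlb[TLB_ENTRIES];

/* Fully associative lookup: a hit requires both the VPN and the PID to match. */
bool tlb_lookup(uint16_t pid, uint32_t vpn, uint32_t *ppn) {
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].pid == pid && tlb[i].vpn == vpn) {
            *ppn = tlb[i].ppn;
            return true;    /* TLB hit */
        }
    }
    return false;           /* TLB miss: walk the page table */
}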
Handling TLB Misses
 The TLB is small; it cannot hold all PTEs
 Some translations will inevitably miss in the TLB
 Must access memory to find the appropriate PTE
 Called walking the page directory/table
 Large performance penalty

 Who handles TLB misses?


1. Hardware-Managed TLB
2. Software-Managed TLB

81
Handling TLB Misses
 Approach #1. Hardware-Managed (e.g., x86)
 The hardware does the page walk
 The hardware fetches the PTE and inserts it into the
TLB
 If the TLB is full, the entry replaces another
entry
 All of this is done transparently

 Approach #2. Software-Managed (e.g., MIPS)


 The hardware raises an exception
 The operating system does the page walk
 The operating system fetches the PTE
 The operating system inserts/evicts entries in the TLB

82
Handling TLB Misses
 Hardware-Managed TLB
 Pro: No exceptions. Instruction just stalls
 Pro: Independent instructions may continue
 Pro: Small footprint (no extra instructions/data)
 Con: Page directory/table organization is etched in
stone

 Software-Managed TLB
 Pro: The OS can design the page directory/table
 Pro: More advanced TLB replacement policy
 Con: Flushes pipeline
 Con: Performance overhead

83
Page Fault
 If a virtual page is not mapped to a physical page …
 The virtual page does not have a valid PTE

 What would happen if you accessed that page?


 A hardware exception: page fault
 The operating system needs to handle it
 Page fault handler
 Reads the data from disk into a physical page in
memory
 Maps the virtual page to the physical page
 Creates the appropriate PDE/PTE
 Resumes the program that caused the page fault

84
Servicing a Page Fault
 Processor communicates with
controller
 Read block of length P starting
at disk address X and store
starting at memory address Y
 Read occurs
 Direct Memory Access (DMA)
 Done by I/O controller
 Controller signals completion
using Interrupt
 OS resumes suspended
process

85
Why Mapped to Disk?
 Why would a virtual page ever be mapped to disk?
 Two possible reasons

1. Demand Paging
 When a large file in disk needs to be read, not all of it
is loaded into memory at once
 Instead, page-sized chunks are loaded on-demand
 If most of the file is never actually read …
 Saves time (remember, disk is extremely slow)
 Saves memory space
 Q: When can demand paging be bad?
Why Mapped to Disk? (cont’d)
2. Swapping
 Assume that physical memory is exhausted
 You are running many programs that require lots of
memory
 What happens if you try to run another program?
 Some physical pages are “swapped out” to disk
 I.e., the data in some physical pages are migrated to
disk
 This frees up those physical pages
 As a result, their PTEs become invalid
 When you access a physical page that has been
swapped out, only then is it brought back into physical
memory
 This may cause another physical page to be swapped
out
 If this “ping-ponging” occurs frequently, it is called
thrashing
Virtual Memory and Cache
Interaction
Address Translation and
Caching
When do we do the address translation?
 Before or after accessing the L1 cache?

 In other words, is the cache virtually addressed or


physically addressed?
 Virtual versus physical cache

 What are the issues with a virtually addressed


cache?

 Synonym problem:
 Two different virtual addresses can map to the same
physical address → the same physical address can be
present in multiple locations in the cache → can lead to
inconsistency in data
89
Homonyms and Synonyms
 Homonym: Same VA can map to two different PAs
 Why?
 VA is in different processes

 Synonym: Different VAs can map to the same PA


 Why?
 Different pages can share the same physical frame within
or across processes
 Reasons: shared libraries, shared data, copy-on-write
pages within the same process, …

 Do homonyms and synonyms create problems when


we have a cache?
 Is the cache virtually or physically addressed?
90
Cache-VM Interaction

[Figure: three organizations. Physical cache: the VA is translated by the TLB
first and the cache is accessed with the PA. Virtual (L1) cache: the cache is
accessed directly with the VA, and translation happens afterwards, before the
lower hierarchy. Virtual-physical cache: the cache is indexed with the VA
while the TLB translates in parallel, and the PA is used for the lower
hierarchy.]
91


Physical Cache
No homonym/
synonym
problem

92
Virtual Cache
Problem:
homonym/
synonym

93
Virtual Cache
 Why don’t we use virtual cache?
 Page level protection
 homonym
 synonym
 I/O devices

94
Virtual-Physical Cache
- No homonym problem
- Synonyms can occur depending on the cache index size

95
Virtually-Indexed Physically-Tagged
 If C ≤ (page_size × associativity), the cache index bits come only
from the page offset (same in VA and PA)
 If both cache and TLB are on chip
 index both arrays concurrently using VA bits
 check cache tag (physical) against TLB output at the end

[Figure: the VPN goes to the TLB while the page-offset bits (index, byte in
block) index the physically-tagged cache in parallel; the PPN from the TLB is
compared (=) with the cache tag, producing TLB hit? and cache hit?]
96
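A quick numeric check of the condition above (the 4 KB page size matches the 80386 example earlier in the lecture; cache size and associativity are illustrative):

#include <stdio.h>

int main(void) {
    unsigned page_size = 4096;        /* 4 KB pages            */
    unsigned assoc     = 8;           /* 8-way set associative */
    unsigned cache_sz  = 32 * 1024;   /* 32 KB cache           */
    /* The index comes only from the page offset when each way is no larger
       than a page, i.e. cache_size <= page_size * associativity.           */
    if (cache_sz <= page_size * assoc)
        printf("index bits fit in the page offset: no synonym problem\n");
    else
        printf("index uses VPN bits: synonyms are possible\n");
    return 0;
}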


Virtually-Indexed Physically-Tagged
 If C > (page_size × associativity), the cache index bits include VPN bits
 Synonyms can cause problems
 The same physical address can exist in two locations
 Solutions?

[Figure: same organization as before, but some index bits (labeled "a") now
come from the VPN, so the index is no longer determined by the page offset
alone.]
97


Some Solutions to the Synonym Problem
 Limit cache size to (page size times associativity)
 get index from page offset

 On a write to a block, search all possible indices that


can contain the same physical block, and
update/invalidate
 Used in Alpha 21264, MIPS R10K

 Restrict page placement in OS


 make sure index(VA) = index(PA)
 Called page coloring
 Used in many SPARC processors

98
Instruction vs. Data Caches
 Unified:
+ Dynamic sharing of cache space: no overprovisioning
that might happen with static partitioning (i.e., split I
and D caches)
-- Instructions and data can thrash each other (i.e., no
guaranteed space for either)
-- I and D are accessed in different places in the pipeline.
Where do we place the unified cache for fast access?

 First level caches are almost always split


 Mainly to avoid structural hazard
 Second and higher levels are almost always unified

99
Multi-level Caching in a Pipelined Design
 First-level caches (instruction and data)
 Decisions very much affected by cycle time
 Small, lower associativity
 Tag store and data store accessed in parallel
 Second-level caches
 Decisions need to balance hit rate and access latency
 Usually large and highly associative; latency not as
important
 Tag store and data store accessed serially

 Serial vs. Parallel access of levels


 Serial: Second level cache accessed only if first-level
misses
 Second level does not see the same accesses as the
first
100
Performance
Program Execution Time
• Prog Execution Time = (CPU cycles + Memory stall cycles) * cycle time

 CPU cycles = Instruction Count (IC) * Average Cycles Per Instruction (CPI)

 CPI = Total Prog Execution Cycles / IC

 Memory stall cycles = Total Misses * Miss Penalty
                     = IC * Misses/Inst * Miss Penalty
                     = IC * Memory Accesses Per Inst * Miss Rate * Miss Penalty

 Misses/Inst = (Miss Rate * Total Memory Accesses) / IC
             = Miss Rate * Memory Accesses Per Inst

• Prog Execution Time = IC * (CPI + Misses/Inst * Miss Penalty) * cycle time
                      = IC * (CPI + Memory Accesses Per Inst * Miss Rate * Miss Penalty) * cycle time

102
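A hedged C sketch of the final formula above, evaluated with the numbers used in the next example (CPI = 1, 1.5 memory accesses per instruction, 2% miss rate, 200-cycle miss penalty; cycle time taken as 1 for simplicity):

#include <stdio.h>

double exec_time(double IC, double CPI, double accesses_per_inst,
                 double miss_rate, double miss_penalty, double cycle_time) {
    return IC * (CPI + accesses_per_inst * miss_rate * miss_penalty) * cycle_time;
}

int main(void) {
    double with_cache = exec_time(1, 1, 1.5, 0.02, 200, 1);   /* 7 per inst   */
    double no_cache   = exec_time(1, 1, 1.5, 1.00, 200, 1);   /* 301 per inst */
    printf("with cache: %.0f, no cache: %.0f (in units of IC * cycle time)\n",
           with_cache, no_cache);
    return 0;
}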
Miss Rate
 Two ways to specify miss rate
 Miss Rate = misses / total accesses
 Misses/Inst = Miss Rate * Memory Accesses Per Inst
 dependent on ISA, e.g. X86 vs MIPS

103
Effect of Cache on Program
Execution

 Sol:
 Miss Penalty = 200 cycles
 CPI = 1 cycle
 Miss Rate = 2%
 Memory Accesses Per Inst = 1.5
 Misses/1000 = 30
 Prog Execution Time(Perfect cache) = ?
 Prog Execution Time(with cache) = ?
 Prog Execution Time(no cache) = ?

104
Effect of Cache on Program
Execution
Prog Execution Time(perfect cache) = IC * (CPI + Memory Accesses Per Inst * Miss Rate * Miss Penalty) * cycle time
with Miss Rate = 0%:
= IC * (1 + 0) * cycle time = IC * cycle time

Prog Execution Time(with cache)
= IC * (1 + 1.5 * 2% * 200) * cycle time = IC * 7 * cycle time

Prog Execution Time(no cache), taking Miss Rate = 100%:
= IC * (1 + 1.5 * 100% * 200) * cycle time = IC * 301 * cycle time

105
Choosing the Right Cache

 Sol:
 CPI = 1.6 cycle
 cycle time = 0.35ns
 memory Accesses Per Inst = 1.4
 cycle time = 0.35*1.35ns for 2-way
 miss penalty = 65ns
 hit time = 1 cycle
 miss rate = 2.1% (direct-mapped); miss rate = 1.9% (2-way)
 average access time=?
 performance=?

106
Choosing the Right Cache
 Direct-mapped:
access time = hit time + miss rate * miss penalty
= 0.35 + 2.1% * 65 = 1.72ns
Prog. execution time = IC * [(CPI * cycle time) + (miss rate * memory accesses
                       per inst * miss penalty)]
                     = IC * [(1.6 * 0.35) + (2.1% * 1.4 * 65)]
                     = 2.47 * IC ns
(the 65 ns miss penalty is already in time units, so it is not multiplied by the cycle time)
 2-way:
access time = 0.35*1.35 + 1.9% * 65 = 1.71ns
Prog. execution time = IC[(1.6*0.35*1.35)+(1.9%*1.4*65)]
= 2.49 * IC ns

Relative performance = 2.49 / 2.47 = 1.01

107
