02b_Cache
Caches
Introduction
Memory Hierarchy
[Figure: internal DRAM organization with row and column decoders, sense amps & I/O; the square root of the number of memory bits is supplied in each RAS and CAS address]
The Quest for DRAM Performance
1. Fast Page mode
– Add timing signals that allow repeated accesses to the row buffer without another row access time
– Such a buffer comes naturally, since each array buffers 1024 to 2048 bits for each access
2. Synchronous DRAM (SDRAM)
– Add a clock signal to the DRAM interface, so that repeated transfers do not suffer the time overhead of synchronizing with the DRAM controller
3. Double Data Rate (DDR SDRAM)
– Transfer data on both the rising edge and falling edge of the DRAM clock signal, doubling the peak data rate
– DDR2 lowers power by dropping the voltage from 2.5 to 1.8 volts and offers higher clock rates: up to 400 MHz
– DDR3 drops to 1.5 volts and higher clock rates: up to 800 MHz
• Improved bandwidth, not latency
DRAM name is based on peak chip transfers per second; DIMM name is based on peak DIMM MBytes per second.
[Table: DRAM/DIMM standards with columns Standard, Clock Rate (MHz), M transfers per second, DRAM Name, MBytes/s per DIMM, and DIMM Name; fastest for sale 4/06 ($125/GB)]
• Solution: Caches
Why More on Memory Hierarchy?
[Figure: processor vs. memory performance, 1980-2010, log scale; the processor-memory performance gap keeps growing]
Storage Hierarchy and Locality
[Figure: storage hierarchy, fastest/smallest to slowest/largest: Register File, SRAM Cache (L1), L2 Cache, L3 Cache, Main Memory (DRAM row buffer), Disk; capacity grows and speed drops going down the hierarchy]
• CyclesMemoryStall = CacheMisses × (MissLatencyTotal – MissLatencyOverlapped)
– To reduce stall cycles, increase the overlapped miss latency
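A quick numeric instance of this formula; the miss count and latencies below are assumptions chosen only for illustration:

#include <stdio.h>

int main(void) {
    /* Hypothetical numbers, only to illustrate the formula. */
    long cache_misses       = 1000000;  /* misses during the program   */
    long latency_total      = 200;      /* cycles per miss, total      */
    long latency_overlapped = 150;      /* cycles hidden by other work */
    long stall = cache_misses * (latency_total - latency_overlapped);
    printf("Memory stall cycles: %ld\n", stall);   /* 50000000 */
    return 0;
}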
Reducing Cache Miss Penalty (1)
• Multilevel caches
– Very Fast, small Level 1 (L1) cache
– Fast, not so small Level 2 (L2) cache
– May also have slower, large L3 cache, etc.
• Why does this help?
– Miss in L1 cache can hit in L2 cache, etc.
AMAT = HitTimeL1 + MissRateL1 × MissPenaltyL1
MissPenaltyL1 = HitTimeL2 + MissRateL2 × MissPenaltyL2
MissPenaltyL2 = HitTimeL3 + MissRateL3 × MissPenaltyL3
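To see how the levels chain together, here is a small calculation using the recursive AMAT formula above; all hit times and miss rates are assumed values for illustration, not measurements:

#include <stdio.h>

/* Recursive AMAT: each level's miss penalty is the AMAT of the next level.
   All numbers are hypothetical, chosen only to illustrate the formula. */
int main(void) {
    double hit_l1 = 1, hit_l2 = 10, hit_l3 = 30, mem_latency = 200; /* cycles */
    double mr_l1 = 0.05, mr_l2 = 0.20, mr_l3 = 0.50;  /* local miss rates */

    double miss_penalty_l3 = mem_latency;
    double miss_penalty_l2 = hit_l3 + mr_l3 * miss_penalty_l3;
    double miss_penalty_l1 = hit_l2 + mr_l2 * miss_penalty_l2;
    double amat = hit_l1 + mr_l1 * miss_penalty_l1;

    printf("AMAT = %.2f cycles\n", amat);   /* 2.80 with these numbers */
    return 0;
}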
Reducing Cache Miss Penalty (2)
• Early Restart & Critical Word First
– Block transfer takes time (bus too narrow)
– Give data to loads before the entire block arrives
• Early restart
– When needed word arrives, let processor use it
– Then continue block transfer to fill cache line
• Critical Word First
– Transfer loaded word first, then the rest of block
(with wrap-around to get the entire block)
– Use with early restart to let processor go ASAP
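A sketch of the wrap-around transfer order used by critical word first, assuming an 8-word block delivered one word per bus beat (the block size is an assumption for the example):

#include <stdio.h>

#define WORDS_PER_BLOCK 8

/* Print the order in which words of a block are transferred when the
   processor missed on word `critical`: that word first, then wrap around
   until the whole block has been filled. */
void critical_word_first(int critical) {
    for (int i = 0; i < WORDS_PER_BLOCK; i++) {
        int word = (critical + i) % WORDS_PER_BLOCK;
        printf("%d ", word);   /* e.g. critical=5 -> 5 6 7 0 1 2 3 4 */
    }
    printf("\n");
}

int main(void) {
    critical_word_first(5);
    return 0;
}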
Reducing Cache Miss Penalty (3)
• Increase Load Miss Priority
– Loads can have dependent instructions
– If a load misses and a store needs to go
to memory, let the load miss go first
– Need a write buffer to remember stores
• Merging Write Buffer
– If multiple write misses to the same
block, combine them in the write buffer
– Use a block write instead of many small writes (see the sketch below)
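A minimal sketch of a merging write buffer, assuming 64-byte blocks and a 4-entry buffer (sizes and field names are assumptions, not from the slides): a store that targets the block of an existing entry is merged into it instead of taking a new entry.

#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define BLOCK_SIZE 64          /* bytes per cache block (assumed) */
#define WB_ENTRIES 4           /* write-buffer entries (assumed)  */

struct wb_entry {
    bool     valid;
    uint64_t block_addr;        /* which block this entry covers */
    uint8_t  data[BLOCK_SIZE];  /* merged store data             */
    uint64_t byte_mask;         /* which bytes are valid         */
};

static struct wb_entry wb[WB_ENTRIES];

/* Insert a store (assumed not to cross a block boundary);
   returns false if the buffer is full and the store must stall. */
bool wb_write(uint64_t addr, const void *src, unsigned len) {
    uint64_t block = addr / BLOCK_SIZE;
    unsigned off   = addr % BLOCK_SIZE;
    struct wb_entry *e = NULL;

    for (int i = 0; i < WB_ENTRIES; i++)             /* try to merge  */
        if (wb[i].valid && wb[i].block_addr == block) { e = &wb[i]; break; }
    if (!e)
        for (int i = 0; i < WB_ENTRIES; i++)         /* else allocate */
            if (!wb[i].valid) {
                e = &wb[i]; e->valid = true;
                e->block_addr = block; e->byte_mask = 0; break;
            }
    if (!e) return false;                            /* buffer full   */

    memcpy(&e->data[off], src, len);                 /* merge bytes   */
    for (unsigned b = 0; b < len; b++)
        e->byte_mask |= 1ULL << (off + b);
    return true;
}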
Reducing Cache Miss Penalty (4)
• Victim Caches
– Recently kicked-out blocks kept in
small cache
– If we miss on those blocks, can get
them fast
– Why does it work: conflict misses
• Misses that we have in our N-way set-
assoc cache, but would not have if the
cache was fully associative
– Example: direct-mapped L1 cache and
a 16-line fully associative victim cache
• Victim cache prevents thrashing when
several “popular” blocks want to go to the
same entry
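A toy simulation of a direct-mapped L1 backed by a small fully associative victim cache (the sizes and FIFO replacement are assumptions for the sketch); it shows how two conflicting blocks stop thrashing once the evicted one sits in the victim cache.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define L1_SETS      64   /* direct-mapped L1, in blocks (assumed size) */
#define VICTIM_WAYS  16   /* fully associative victim cache             */

static uint64_t l1[L1_SETS];          static bool l1_valid[L1_SETS];
static uint64_t victim[VICTIM_WAYS];  static bool v_valid[VICTIM_WAYS];
static int v_next;                    /* FIFO replacement in victim $   */

/* Returns 1 on an L1 hit, 2 on a victim-cache hit, 0 on a real miss. */
int access_block(uint64_t block) {
    int set = block % L1_SETS;
    if (l1_valid[set] && l1[set] == block)
        return 1;                                   /* L1 hit            */

    for (int i = 0; i < VICTIM_WAYS; i++)
        if (v_valid[i] && victim[i] == block) {     /* victim-cache hit: */
            victim[i] = l1[set];                    /* swap the blocks   */
            v_valid[i] = l1_valid[set];
            l1[set] = block; l1_valid[set] = true;
            return 2;
        }

    if (l1_valid[set]) {                            /* real miss: the    */
        victim[v_next] = l1[set];                   /* evicted L1 block  */
        v_valid[v_next] = true;                     /* goes to victim $  */
        v_next = (v_next + 1) % VICTIM_WAYS;
    }
    l1[set] = block; l1_valid[set] = true;          /* fill from L2/mem  */
    return 0;
}

int main(void) {
    /* Two blocks that conflict in the direct-mapped L1 (same set). */
    printf("%d %d %d %d\n",
           access_block(0), access_block(L1_SETS),
           access_block(0), access_block(L1_SETS));
    /* Prints: 0 0 2 2 -- the victim cache turns the thrashing into hits. */
    return 0;
}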
Kinds of Cache Misses
• The “3 Cs”
– Compulsory: have to have these
• Miss the first time each block is accessed
– Capacity: due to limited cache capacity
• Would not have them if cache size was
infinite
– Conflict: due to limited associativity
• Would not have them if cache was fully
associative
Reducing Cache Miss Rate (1)
• Larger blocks
– Helps if there is more spatial locality
Reducing Cache Miss Rate (2)
• Larger caches
– Fewer capacity misses, but longer hit
latency!
• Higher Associativity
– Fewer conflict misses, but longer hit latency
• Way Prediction
– Speeds up set-associative caches
– Predict which of the N ways has our data, then access as fast as a direct-mapped cache
– If mispredicted, access again as a set-assoc cache
Reducing Cache Miss Rate (3)
• Pseudo Associative Caches
– Similar to way prediction
– Start with direct mapped cache
– If miss on “primary” entry, try another
entry
• Compiler optimizations (sketched below)
– Loop interchange
– Blocking
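A sketch of the two optimizations on simple matrix code in C; the matrix size N and blocking factor B are arbitrary illustration values:

#define N  512
#define B  32          /* blocking (tile) factor, chosen to fit in cache */

double x[N][N], y[N][N], z[N][N];

/* Loop interchange: visit x[][] in row-major order so consecutive
   iterations touch consecutive memory, improving spatial locality. */
void interchange(void) {
    /* before: for (j...) for (i...) x[i][j] = 2 * x[i][j];  (strided) */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            x[i][j] = 2 * x[i][j];
}

/* Blocking (tiling): operate on B x B sub-matrices so the working set
   of the inner loops stays cache-resident across reuse. */
void blocked_matmul(void) {
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + B; j++) {
                    double r = 0;
                    for (int k = kk; k < kk + B; k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;
                }
}

int main(void) { interchange(); blocked_matmul(); return 0; }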
Reducing Hit Time (1)
• Small & Simple Caches are faster
Reducing Hit Time (2)
• Avoid address translation on cache hits
• Software uses virtual addresses,
memory accessed using physical
addresses
– Details of this later (virtual memory)
• HW must translate virtual to physical
– Normally the first thing we do
– Caches accessed using physical address
– Wait for translation before cache lookup
• Idea: index cache using virtual address
Reducing Hit Time (3)
• Pipelined Caches
– Improves bandwidth, but not latency
– Essential for L1 caches at high frequency
• Even small caches have 2-3 cycle latency at N
GHz
– Also used in many L2 caches
• Trace Caches
– For instruction caches
Hiding Miss Latency
• Idea: overlap miss latency with useful work
– Also called “latency hiding”
• Non-blocking caches
– A blocking cache services one access at a time
• While miss serviced, other accesses blocked (wait)
– Non-blocking caches remove this limitation
• While miss serviced, can process other requests
• Prefetching
– Predict what will be needed and get it ahead of
time
Non-Blocking Caches
• Hit Under Miss
– Allow cache hits while one miss in progress
– But another miss has to wait
• Miss Under Miss, Hit Under Multiple Misses
– Allow hits and misses when other misses in progress
– Memory system must allow multiple pending requests
Prefetching
• Predict future misses and get data into
cache
– If access does happen, we have a hit now
(or a partial miss, if data is on the way)
– If access does not happen, cache pollution
(replaced other data with junk we don’t need)
• To avoid pollution, prefetch buffers
– Pollution a big problem for small caches
– Have a small separate buffer for prefetches
• When we do access it, put data in cache
• If we don’t access it, cache not polluted
Simple Sequential Prefetch
• On a cache miss, fetch two sequential memory
blocks
– Exploits spatial locality in both instructions & data
– Exploits high bandwidth for sequential accesses
– Called “Adjacent Cache Line Prefetch” or “Spatial Prefetch”
by Intel
• Extend to fetching N sequential memory blocks
– Pick N large enough to hide the memory latency
• Stream prefetching is a continuous version of
prefetching
– Stream buffer can fit N cache lines
– On a miss, start fetching N sequential cache lines
– On a stream buffer hit: Move cache line to cache, start
fetching line (N+1)
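A minimal sketch of sequential prefetching with degree N; `fetch_block` is a hypothetical stand-in for issuing a memory request, and the degree is an assumed tuning value:

#include <stdint.h>
#include <stdio.h>

#define PREFETCH_DEGREE 4   /* N sequential blocks; pick N large enough
                               to hide memory latency (assumed value)   */

/* Stand-in for issuing a memory request for one cache block. */
static void fetch_block(uint64_t block) {
    printf("fetch block %llu\n", (unsigned long long)block);
}

/* On a demand miss to `block`, fetch it and the next N sequential blocks. */
void on_cache_miss(uint64_t block) {
    fetch_block(block);                       /* the demand miss itself  */
    for (int i = 1; i <= PREFETCH_DEGREE; i++)
        fetch_block(block + i);               /* sequential prefetches   */
}

int main(void) {
    on_cache_miss(100);   /* fetches blocks 100..104 */
    return 0;
}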
Strided Prefetch
• Idea: detect and prefetch strided accesses
– for (i=0; i<N; i++) A[i*1024]++;
• Stride detected using a PC-based table
– For each PC, remember the stride
– Stride detection
• Remember the last address used for this PC
• Compare to currently used address for this PC
– Track confidence using a two-bit saturating counter
• Increment when the stride is correct, decrement when incorrect
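A sketch of the PC-indexed stride table with a two-bit confidence counter described above; the table size, indexing, and `prefetch` stub are assumptions for illustration:

#include <stdint.h>
#include <stdio.h>

#define TABLE_SIZE 256   /* entries, indexed by low PC bits (assumed) */

struct stride_entry {
    uint64_t last_addr;  /* last address used by this PC             */
    int64_t  stride;     /* last observed stride                     */
    int      conf;       /* two-bit saturating confidence, 0..3      */
};

static struct stride_entry table[TABLE_SIZE];

/* Stand-in for issuing a prefetch request for one address. */
static void prefetch(uint64_t addr) {
    printf("prefetch 0x%llx\n", (unsigned long long)addr);
}

/* Called on every load with the PC of the load and the address it used. */
void train_and_prefetch(uint64_t pc, uint64_t addr) {
    struct stride_entry *e = &table[pc % TABLE_SIZE];
    int64_t stride = (int64_t)(addr - e->last_addr);

    if (stride == e->stride) {
        if (e->conf < 3) e->conf++;          /* stride correct: increment */
    } else {
        if (e->conf > 0) e->conf--;          /* stride wrong: decrement   */
        e->stride = stride;
    }
    e->last_addr = addr;

    if (e->conf >= 2)                        /* confident: prefetch ahead */
        prefetch(addr + e->stride);
}

int main(void) {
    /* e.g. for (i = 0; i < N; i++) A[i*1024]++;  -> constant 4 KB stride */
    for (uint64_t a = 0x10000; a < 0x10000 + 8 * 4096; a += 4096)
        train_and_prefetch(0x400123, a);
    return 0;
}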
Sandy Bridge Prefetching (Intel Core i7-2600K)
• “Intel 64 and IA-32 Architectures
Optimization Reference Manual, Jan
2011”, pg 2-24
Software Prefetching
• Two flavors: register prefetch and cache
prefetch
• Each flavor can be faulting or non-faulting
– If address bad, does it create exceptions?
• Faulting register prefetch is binding
– It is a normal load, address must be OK, uses register
• Non-faulting cache prefetch is non-binding
– If the address is bad, the prefetch becomes a NOP
– Does not affect register state
– Has more overhead (the load is still there),
needs an ISA change (a prefetch instruction),
and complicates the cache (prefetches and loads are handled differently)
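As a concrete non-binding, non-faulting cache prefetch, GCC and Clang provide the `__builtin_prefetch` built-in; the loop below prefetches a fixed distance ahead while summing an array (the distance is an assumed tuning value):

#include <stddef.h>

#define PREFETCH_DISTANCE 16   /* elements ahead; assumed tuning value */

/* Sum an array, prefetching elements a fixed distance ahead.
   __builtin_prefetch is non-binding: it does not change register state,
   and a useless or bad address is simply ignored rather than faulting. */
double sum(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0 /* read */, 3);
        s += a[i];
    }
    return s;
}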