Ch2-2 How to Improve the Performance of the Memory Hierarchy
2
Memory Technology and Optimizations
Performance metrics
Latency is the concern of caches
Bandwidth is the concern of multiprocessors and I/O
Access time
 Time between read request and when desired word arrives
Cycle time
 Minimum time between unrelated requests to memory
SRAM memory has low latency, use for cache
Organize DRAM chips into many banks for high
bandwidth, use for main memory
3
Memory Technology
SRAM
Requires low power to retain bit
Requires 6 transistors/bit
DRAM
Must be re-written after being read
Must also be periodically refreshed
 Every ~8 ms (roughly 5% of the time)
 All bits in a row are refreshed simultaneously
One transistor/bit
Address lines are multiplexed:
 Upper half of address: row access strobe (RAS)
 Lower half of address: column access strobe (CAS)
4
DRAM logical organization
(64 Mbit)
Square root of bits per RAS/CAS on each side of the array
(Figure: a 14-bit address A0…A13 is latched in an address buffer and fed to the row decoder and column decoder of a 16,384×16,384 memory array; sense amps & I/O connect the array to the data-in (D) and data-out (Q) pins; each storage cell sits at the intersection of a word line and a bit line.)
5
DRAM Read Timing
Every DRAM access begins with the assertion of RAS_L
Two ways to read: OE_L asserted early or late relative to CAS_L
(Figure: read-timing waveforms for a 256K x 8 DRAM with 9 address pins and 8 data pins — RAS_L, CAS_L, WE_L, OE_L, the multiplexed row/column address, and the data bus D; the read access time, output enable delay, and DRAM read cycle time are marked. In the early read cycle OE_L is asserted before CAS_L; in the late read cycle OE_L is asserted after CAS_L. D stays high-Z until data out.)
6
Internal Organization of DRAM
7
Times of fast and slow DRAMs with each generation.
8
Memory Optimizations
9
Memory Technology
Amdahl:
 Memory capacity should grow linearly with processor speed
 Unfortunately, memory capacity and speed have not kept pace
with processors
Some optimizations:
 Multiple accesses to same row
 Synchronous DRAM
 Added clock to DRAM interface
 Burst mode with critical word first
 Wider interfaces
 Double data rate (DDR)
 Multiple banks on each DRAM device
10
1st
Improving DRAM Performance
Fast Page Mode DRAM (FPM)
Timing signals that allow repeated accesses to the row
buffer (page) without another row access time
 Such a buffer comes naturally, as each array will buffer
1024 to 2048 bits for each access.
Page: All bits on the same ROW (Spatial Locality)
Don’t need to wait for wordline to recharge
Toggle CAS with new column address
11
2nd
Improving DRAM Performance
Synchronous DRAM
Conventional DRAMs have an asynchronous interface to
the memory controller, and hence every transfer
involves overhead to synchronize with the controller.
The solution was to add a clock signal to the DRAM
interface, so that the repeated transfers would not bear
that overhead.
 Data output is in bursts w/ each element clocked
12
3rd
Improving DRAM Performance
DDR--Double data rate
A DRAM innovation to increase bandwidth is to transfer data on both
the rising edge and the falling edge of the DRAM clock signal,
 thereby doubling the peak data rate.
(Supply voltage has dropped with each generation: 2.5 V, 1.8 V, 1.5 V.)
13
DDR--Double data rate
DDR:
DDR2
 Lower power (2.5 V -> 1.8 V)
 Higher clock rates (266 MHz, 333 MHz, 400 MHz)
DDR3
 1.5 V
 800 MHz
DDR4
 1-1.2 V
 1333 MHz
GDDR5 is graphics memory based on DDR3
14
4th
Improving DRAM Performance
New DRAM Interface: RAMBUS (RDRAM)
A type of synchronous dynamic RAM designed by the
Rambus Corporation.
Each chip has interleaved memory and a high speed interface.
Protocol based RAM w/ narrow (16-bit) bus
High clock rate (400 Mhz), but long latency
Pipelined operation
Multiple arrays w/ data transferred on both edges of clock
The first generation RAMBUS interface dropped RAS/CAS,
replacing it with a bus that allows other accesses over the bus
between the sending of the address and return of the data. It is
typically called RDRAM.
The second-generation RAMBUS interface includes separate row-
and column-command buses instead of the conventional
multiplexing, and a much more sophisticated controller on chip.
Because of the separation of data, row, and column buses, three
transactions can be performed simultaneously. called Direct
RDRAM or DRDRAM
15
RDRAM Timing
16
RAMBUS (RDRAM)
RAMBUS Bank
RDRAM Memory System
17
Comparing RAMBUS and DDR SDRAM
Since most computers use memory in DIMM
packages, which are typically at least 64 bits wide,
the DIMM memory bandwidth is closer to what
RAMBUS provides than you might expect when just
comparing DRAM chips.
Note that cache performance is based in part
on the latency to the first byte and in part on the
bandwidth to deliver the rest of the bytes in the
block.
Although these innovations help with the latter
case, none help with latency.
Amdahl’s Law reminds us of the limits of
accelerating one piece of the problem while ignoring
another part.
18
Memory Optimizations
Reducing power in SDRAMs:
Lower voltage
Low power mode (ignores clock, continues to
refresh)
Graphics memory:
Achieve 2-5 X bandwidth per DRAM vs. DDR3
 Wider interfaces (32 vs. 16 bit)
 Higher clock rate
Possible because they are attached via soldering instead of socketed
DIMM modules
19
Summary
Memory organization
 Wider memory
 Simple interleaved memory
 Independent memory banks
Avoiding Memory Bank Conflicts
Memory chip
Fast Page Mode DRAM
Synchronous DRAM
Double Data Rate
 RDRAM
20
Memory Power Consumption
21
Stacked/Embedded DRAMs
Stacked DRAMs in same package as processor
High Bandwidth Memory (HBM)
22
Flash Memory
Type of EEPROM
Types: NAND (denser) and NOR (faster)
NAND Flash:
Reads are sequential, reads entire page (.5 to 4 KiB)
25 us for first byte, 40 MiB/s for subsequent bytes
SDRAM: 40 ns for first byte, 4.8 GB/s for subsequent
bytes
2 KiB transfer: 75 uS vs 500 ns for SDRAM, 150X
slower
300 to 500X faster than magnetic disk
23
NAND Flash Memory
Must be erased (in blocks) before being
overwritten
Nonvolatile, can use as little as zero power
Limited number of write cycles (~100,000)
$2/GiB, compared to $20-40/GiB for SDRAM
and $0.09/GiB for magnetic disk
Phase-Change / Memristor Memory
Possibly 10X improvement in write performance
and 2X improvement in read performance
24
Memory Dependability
Memory is susceptible to cosmic rays
Soft errors: dynamic errors
Detected and fixed by error correcting codes (ECC)
Hard errors: permanent errors
Use spare rows to replace defective rows
Chipkill: a RAID-like error recovery technique
25
How to Improve Cache Performance?
Reduce hit time(4)
 Small and simple first-level caches, Way prediction
 avoiding address translation, Trace cache
Increase bandwidth(3)
 Pipelined caches, multibanked caches, non-blocking caches
Reduce miss penalty(5)
 multilevel caches, read miss prior to writes,
 Critical word first, merging write buffers, and victim caches
Reduce miss rate(4)
 larger block size, large cache size, higher associativity
 Compiler optimizations
Reduce miss penalty or miss rate via parallelization (1)
 Hardware or compiler prefetching
AMAT = Hit Time + Miss Rate × Miss Penalty
26
1st
Hit Time Reduction Technique:
Small and Simple Caches
Use a small, direct-mapped cache
The less hardware that is necessary to implement a
cache, the shorter the critical path through the
hardware.
Direct-mapped is faster than set associative for both
reads and writes.
Fitting the cache on the chip with the CPU is also very
important for fast access times.
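As a minimal sketch of why this is fast (not from the slides; the 16 KiB / 32-byte parameters are assumptions), a direct-mapped lookup is just the address split below plus one indexed read and one tag compare:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical parameters: 16 KiB direct-mapped cache, 32-byte blocks. */
#define BLOCK_SIZE   32u
#define CACHE_SIZE   (16u * 1024u)
#define NUM_SETS     (CACHE_SIZE / BLOCK_SIZE)   /* 512 sets          */
#define OFFSET_BITS  5u                          /* log2(BLOCK_SIZE)  */
#define INDEX_BITS   9u                          /* log2(NUM_SETS)    */

int main(void) {
    uint32_t addr   = 0x12345678u;
    uint32_t offset = addr & (BLOCK_SIZE - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & (NUM_SETS - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    /* A direct-mapped lookup is a single indexed read plus one tag compare,
       which is why it has the shortest critical path. */
    printf("tag=0x%x index=%u offset=%u\n", tag, index, offset);
    return 0;
}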
27
L1 Size and Associativity
Access time vs. size and associativity
28
L1 Size and Associativity
Energy per read vs. size and associativity
29
2nd
Hit Time Reduction Technique:
Way Prediction
30
2nd
Hit Time Reduction Technique:
Way Prediction
Way Prediction (Pentium 4)
Extra bits are kept in the cache to predict the way, or
block within the set, of the next cache access.
If the predictor is correct, the instruction cache
latency is 1 clock cycle.
If not, it tries the other block, changes the way
predictor, and has a latency of 1 extra clock cycle.
Simulation using SPEC95 suggested set-prediction
accuracy in excess of 85%, so way prediction saves a
pipeline stage in more than 85% of the instruction
fetches.
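A minimal software sketch of the idea (a hypothetical 2-way cache model, not the Pentium 4 hardware): the predicted way is checked first for a fast hit, and a mispredict costs one extra cycle and retrains the predictor bit.

#include <stdbool.h>
#include <stdint.h>

enum { NUM_SETS = 256, NUM_WAYS = 2 };

typedef struct {
    uint32_t tag[NUM_SETS][NUM_WAYS];
    bool     valid[NUM_SETS][NUM_WAYS];
    uint8_t  predicted_way[NUM_SETS];   /* extra prediction bit per set */
} cache_t;

/* Returns the access latency in cycles (1 on a correct prediction,
   2 on a mispredicted hit); *hit reports whether the block was found. */
int way_predicted_lookup(cache_t *c, uint32_t set, uint32_t tag, bool *hit) {
    int w = c->predicted_way[set];
    if (c->valid[set][w] && c->tag[set][w] == tag) {
        *hit = true;
        return 1;                        /* predicted way was correct */
    }
    int other = 1 - w;
    if (c->valid[set][other] && c->tag[set][other] == tag) {
        c->predicted_way[set] = (uint8_t)other;   /* retrain predictor */
        *hit = true;
        return 2;                        /* hit in the other way, 1 extra cycle */
    }
    *hit = false;
    return 2;                            /* miss after checking both ways */
}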
31
3rd
Hit Time Reduction Technique:
Avoiding Address Translation during Indexing of the Cache
Page table is a large data structure in memory
TWO memory accesses for every load, store, or
instruction fetch!!!
Virtually addressed cache?
synonym problem
Cache the address translations?
(Figure: the CPU issues a virtual address (VA); a translation unit produces the physical address (PA) used to look up the cache, with misses going to main memory and hits returning data.)
32
TLBs
A way to speed up translation is to use a special cache of
recently used page table entries -- this has many names,
but the most frequently used is Translation Lookaside
Buffer or TLB
(A TLB entry holds a virtual address, a physical address, and dirty, reference, valid, and access bits.)
Really just a cache on the page table mappings
TLB access time comparable to cache access time
(much less than main memory access time)
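A minimal sketch of a fully associative TLB as a software model (the entry count, page size, and refill policy are assumptions, not from the slides): each entry caches one page-table mapping.

#include <stdbool.h>
#include <stdint.h>

enum { TLB_ENTRIES = 128, PAGE_SHIFT = 13 };   /* 8 KiB pages (assumed) */

typedef struct {
    uint64_t vpn;          /* virtual page number  */
    uint64_t ppn;          /* physical page number */
    bool     valid, dirty, ref;
} tlb_entry_t;

static tlb_entry_t tlb[TLB_ENTRIES];

/* Translate a virtual address; returns true on a TLB hit.  On a miss the
   real hardware/OS would walk the page table and refill an entry. */
bool tlb_translate(uint64_t va, uint64_t *pa) {
    uint64_t vpn = va >> PAGE_SHIFT;
    for (int i = 0; i < TLB_ENTRIES; i++) {         /* associative search */
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            tlb[i].ref = true;                       /* update reference bit */
            *pa = (tlb[i].ppn << PAGE_SHIFT) | (va & ((1u << PAGE_SHIFT) - 1));
            return true;
        }
    }
    return false;                                    /* TLB miss */
}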
33
Translation Look-Aside Buffers
Just like any other cache, the TLB can be organized as fully associative,
set associative, or direct mapped
TLBs are usually small, typically not more than 128 - 256 entries even on
high end machines. This permits fully associative lookup on these
machines. Most mid-range machines use small n-way set associative
organizations.
(Figure: translation with a TLB — the CPU sends the VA to the TLB lookup; on a TLB hit the PA goes to the cache (on the order of 1/2 t to t), on a TLB miss the full translation is performed (on the order of 20 t) before the cache and main memory are accessed.)
34
Fast hits by Avoiding Address Translation
(Figure: three organizations —
Conventional organization: CPU → TB → physically addressed cache → memory; translation happens before every cache access.
Virtually addressed cache: CPU → cache (indexed and tagged with the VA) → TB → memory; translation happens only on a miss, but this creates the synonym problem.
Virtually indexed, physically tagged: the cache access overlaps with the VA translation, which requires the cache index to remain invariant across translation; the L2 cache is physically addressed.)
35
Virtual Addressed Cache
 Send virtual address to cache? Called Virtually
Addressed Cache or
just Virtual Cache (vs. Physical Cache)
Every time the process is switched, the cache must logically be flushed;
otherwise we get false hits
 Cost is the time to flush + “compulsory” misses from an empty cache
 Alternative: add a process-identifier tag that identifies the process as well as the address
within the process, so there can be no hit for the wrong process
Any Problems ?
36
Virtual cache
(Figure: the virtual address — virtual page number + page offset — directly supplies the cache index, and the stored tag is compared against virtual-address bits.)
37
Dealing with aliases
Dealing with aliases (synonyms); Two different virtual
addresses map to same physical address
NO aliasing! What are the implications?
HW antialiasing: guarantees every cache block has unique
address
 verify on miss (rather than on every hit)
 cache set size <= page size ?
 what if it gets larger?
How can SW simplify the problem? (called page coloring)
I/O must interact with cache, so need virtual address
38
Aliases problem with Virtual cache
(Figure: two alias addresses A1 and A2, each split into tag | index | offset.)
If the index and offset bits of two aliases are forced to be the same,
then the alias addresses will map to the same block in the cache.
39
Overlap address translation and cache access
(Virtual indexed, physically tagged)
(Figure: the virtual page number is translated by the TLB into the physical page number while the page-offset bits supply the cache index; the physical page number is then compared against the stored tag.)
Any limitation ?
40
What’s the limitation?
(Same figure as the previous slide.)
If it is a direct-mapped cache, then
Cache size = 2^index × 2^block offset <= 2^page offset
How to solve this problem?
Use higher associativity. Say it is 4-way: then the cache size can
reach 4 × 2^page offset without changing the index, tag, or offset fields.
41
Example: virtually indexed, physically tagged cache
Virtual address width = 43 bits, physical address width = 38 bits, page size = 8 KB,
cache capacity = 16 KB, and the cache is a 2-way set-associative write-back cache with 32-byte blocks.
(Figure: the 43-bit VA splits into a 30-bit VPN, an 8-bit index, and a 5-bit block offset;
the TLB maps the 30-bit VPN to a 25-bit PPN, which is compared against the 25-bit tags stored
in the two 256-entry cache banks (bank0 and bank1), each entry holding V, D, a 25-bit tag, and 256 bits of data.)
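A small sketch that re-derives the field widths used in this example (the helper and print format are mine, not the slide's):

#include <stdio.h>

/* Re-derive the field widths for: 43-bit VA, 38-bit PA, 8 KB pages,
   16 KB 2-way set-associative cache with 32-byte blocks. */
static int log2i(unsigned long x) { int n = 0; while (x > 1) { x >>= 1; n++; } return n; }

int main(void) {
    int va_bits = 43, pa_bits = 38;
    int page_offset  = log2i(8 * 1024);               /* 13 bits          */
    int block_offset = log2i(32);                     /*  5 bits          */
    int sets         = (16 * 1024) / (2 * 32);        /* 256 sets (2-way) */
    int index_bits   = log2i(sets);                   /*  8 bits          */
    printf("VPN = %d bits, PPN (= tag) = %d bits\n",
           va_bits - page_offset, pa_bits - page_offset);   /* 30 and 25  */
    printf("index = %d bits, block offset = %d bits\n",
           index_bits, block_offset);
    /* index + block offset = 13 bits = page offset, so the index is
       untranslated and the cache can be accessed in parallel with the TLB. */
    return 0;
}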
42
4th
Hit Time Reduction Technique:
Trace caches
Find a dynamic sequence of instructions including
taken branches to load into a cache block.
The block is determined by the CPU instead of by the memory
layout.
Complicated address mapping mechanism
43
Why Trace Cache ?
Bring N instructions per cycle
No I-cache misses
No prediction miss
No packet breaks !
Because there is a branch roughly every 5 instructions, a conventional
I-cache can only provide one fetch packet per cycle.
44
What’s Trace ?
Trace: dynamic instruction sequence
When instructions ( operations ) retire from the
pipeline, pack the instruction segments into
TRACE, and store them in the TRACE cache,
including the branch instructions.
Though a branch instruction may go to a different
target, most of the time the next dynamic sequence
is the same as the last one. (locality)
45
Who proposed it?
Peleg & Weiser (1994), Intel Corporation
Patel / Patt ( 1996)
Rotenberg / J. Smith (1996)
Paper: ISCA’98
46
Trace in CPU
(Figure: the trace cache sits alongside the conventional I-cache; a fill unit packs retired operations into traces that feed the pipeline.)
47
Instruction segment
(Figure: basic blocks A, B, C, D linked by taken (T) / not-taken (F) branches; the fill unit packs the executed path A, B, C together with its branch outcomes (T, F) and prediction info into a single trace starting at A.)
48
Pentium 4: trace cache, 12 instr. per cycle
(Figure: the PC and branch predictor steer fetch to either the trace cache or the conventional instruction cache + decode path; a fill unit builds trace segments (A, B, C) from decoded instructions; rename, the execution core, and the data cache follow, backed on chip by the L2 cache and off chip by memory and disk.)
49
How to Improve Cache Performance?
Reduce hit time(4)
 Small and simple first-level caches, Way prediction
 avoiding address translation, Trace cache
Increase bandwidth(3)
 Pipelined caches, multibanked caches, non-blocking caches
Reduce miss penalty(5)
 multilevel caches, read miss prior to writes,
 Critical word first, merging write buffers, and victim caches
Reduce miss rate(4)
 larger block size, large cache size, higher associativity
 Compiler optimizations
Reduce miss penalty or miss rate via parallelization (1)
 Hardware or compiler prefetching
AMAT = Hit Time + Miss Rate × Miss Penalty
50
Pipelined Caches
51
1st
Increasing cache bandwidth:
Pipelined Caches
A hit takes multiple clock cycles,
giving a fast clock cycle time
52
2nd
Increasing cache bandwidth:
Multibanked Caches
Organize cache as independent banks to support
simultaneous access
 ARM Cortex-A8 supports 1-4 banks for L2
 Intel i7 supports 4 banks for L1 and 8 banks for L2
Banking works best when accesses naturally spread
themselves across banks  mapping of addresses to
banks affects behavior of memory system
Interleave banks according to block address; a simple
mapping that works well is “sequential interleaving” (see the sketch below)
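A minimal sketch of sequential interleaving (the bank count and block size are assumptions): consecutive block addresses are spread round-robin across the banks.

#include <stdint.h>

enum { NUM_BANKS = 4, BLOCK_SIZE = 64 };

static inline unsigned bank_of(uint64_t addr) {
    uint64_t block_addr = addr / BLOCK_SIZE;     /* which cache block        */
    return (unsigned)(block_addr % NUM_BANKS);   /* round-robin across banks */
}
/* Example: with 64-byte blocks, addresses 0, 64, 128, 192 map to banks
   0, 1, 2, 3, so four sequential blocks can be accessed simultaneously. */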
53
(Figure: a single-banked memory compared with two-bank organizations using consecutive addressing, interleaving, and group interleaving.)
54
3rd
Increasing cache bandwidth:
Nonblocking Caches
Allow hits before previous misses complete
 “Hit under miss” , “ Hit under multiple miss”
L2 must support this
In general, processors can hide L1 miss penalty but not L2 miss
penalty
Nonblocking, in conjunction with out-of-order execution, can allow
the CPU to continue executing instructions after a data cache miss.
55
How to Improve Cache Performance?
Reduce hit time(4)
 Small and simple first-level caches, Way prediction
 avoiding address translation, Trace cache
Increase bandwidth(3)
 Pipelined caches, multibanked caches, non-blocking caches
Reduce miss penalty(5)
 multilevel caches, read miss prior to writes,
 Critical word first, merging write buffers, and victim caches
Reduce miss rate(4)
 larger block size, large cache size, higher associativity
 Compiler optimizations
Reduce miss penalty or miss rate via parallelization (1)
 Hardware or compiler prefetching
AMAT = Hit Time + Miss Rate × Miss Penalty
56
1st
Miss Penalty Reduction Technique:
Multilevel Caches
 This method focuses on the interface between the
cache and main memory.
 Add a second-level cache between main memory and a
small, fast first-level cache, to make the cache hierarchy both fast
and large.
 The smaller first-level cache is fast enough to match
the clock cycle time of the fast CPU and to fit on the
chip with the CPU, thereby lessening the hit time.
 The second-level cache can be large enough to capture
many memory accesses that would go to main memory,
thereby lessening the effective miss penalty.
57
Parameter about Multilevel cache
 The performance of a two-level cache is calculated in a similar way to
the performance for a single level cache.
 L2 Equations
AMAT = Hit TimeL1 + Miss RateL1 x Miss PenaltyL1
Miss PenaltyL1 = Hit TimeL2 + Miss RateL2 x Miss PenaltyL2
AMAT = Hit TimeL1 +
Miss RateL1 x (Hit TimeL2 + Miss RateL2 * Miss PenaltyL2)
Miss rateL1 = MissesL1 / Memory accesses
Miss rateL2 = MissesL2 / MissesL1
So the miss penalty for level 1 is calculated using the hit
time, miss rate, and miss penalty for the level 2 cache.
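A sketch of the two-level AMAT equations above in code (the parameter values in main are made-up numbers for illustration only):

#include <stdio.h>

static double amat_two_level(double hit_l1, double mr_l1,
                             double hit_l2, double mr_l2,
                             double penalty_l2) {
    double penalty_l1 = hit_l2 + mr_l2 * penalty_l2;  /* Miss PenaltyL1 */
    return hit_l1 + mr_l1 * penalty_l1;               /* AMAT           */
}

int main(void) {
    /* Assumed: 1-cycle L1 hit, 4% L1 miss rate, 10-cycle L2 hit,
       25% local L2 miss rate, 100-cycle memory access. */
    double amat = amat_two_level(1.0, 0.04, 10.0, 0.25, 100.0);
    printf("AMAT = %.2f cycles\n", amat);   /* 1 + 0.04*(10 + 0.25*100) = 2.40 */
    return 0;
}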
58
Two conceptions for two-level cache
Definitions:
Local miss rate— misses in this cache divided by the
total number of memory accesses to this cache
(Miss rateL2)
Global miss rate—misses in this cache divided by the
total number of memory accesses generated by the
CPU
Using the terms above, the global miss rate for the first-level cache is
still just Miss rateL1, but for the second-level cache it is:
Global miss rateL2 = MissesL2 / Memory accesses
                   = (MissesL2 / MissesL1) × (MissesL1 / Memory accesses)
                   = Miss rateL2 × Miss rateL1
59
2nd
Miss Penalty Reduction Technique:
Giving Priority to Read Misses over Writes
If a system has a write buffer, writes can be delayed
to come after reads.
The system must, however, be careful to check the
write buffer to see if the value being read is about to
be written.
60
Write buffer
• Write-back want buffer to hold displaced blocks
– Read miss replacing dirty block
– Normal: Write dirty block to memory, and then do the read
– Instead copy the dirty block to a write buffer, then do the
read, and then do the write
– The CPU stalls less since it restarts as soon as the read is done
• Write-through caches want write buffers => RAW conflicts
with main memory reads on cache misses
– If we simply wait for the write buffer to empty, we might
increase the read miss penalty (by 50% on the old MIPS 1000)
– Check the write buffer contents before the read;
if there are no conflicts, let the memory access continue
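A minimal sketch of that check (entry count, block size, and the forwarding policy are assumptions): before servicing a read miss from memory, scan the write buffer for a pending write to the same block.

#include <stdbool.h>
#include <stdint.h>

enum { WB_ENTRIES = 4, BLOCK_SIZE = 64 };

typedef struct {
    bool     valid;
    uint64_t block_addr;                /* aligned block address */
    uint8_t  data[BLOCK_SIZE];
} wb_entry_t;

static wb_entry_t write_buffer[WB_ENTRIES];

/* Returns true (and forwards the buffered block) if the read conflicts with
   a pending write; otherwise the read may go straight to memory. */
bool read_hits_write_buffer(uint64_t addr, uint8_t out[BLOCK_SIZE]) {
    uint64_t block = addr / BLOCK_SIZE * BLOCK_SIZE;
    for (int i = 0; i < WB_ENTRIES; i++) {
        if (write_buffer[i].valid && write_buffer[i].block_addr == block) {
            for (int b = 0; b < BLOCK_SIZE; b++)
                out[b] = write_buffer[i].data[b];   /* forward pending data */
            return true;
        }
    }
    return false;
}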
61
3rd
Miss Penalty Reduction Technique:
Critical Word First & Early Restart
 Don’t wait for full block to be loaded before
restarting CPU
 Critical Word First—Request the missed word first
from memory and send it to the CPU as soon as it
arrives; let the CPU continue execution while filling
the rest of the words in the block. Also called
wrapped fetch and requested word first
 Early restart—As soon as the requested word of
the block arrives, send it to the CPU and let the
CPU continue execution
 Generally useful only in large blocks,
 Spatial locality => tend to want next sequential word,
so not clear if benefit by early restart
62
Example: Critical Word First
Assume:
cache block = 64-byte
L2: take 11 CLK to get the critical 8 bytes,(AMD Athlon)
and then 2 CLK per 8 byte to fetch the rest of the
block
There will be no other accesses to rest of the block
Calculate the average miss penalty for critical word first.
Then assuming the following instructions read data
sequentially 8 bytes at a time from the rest of the block
Compare the times with and without critical word first.
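A possible working under the stated assumptions (the answer is not given on the slide):
With critical word first, the demanding load sees the critical 8 bytes after 11 clocks, so its miss penalty is 11 clocks; the remaining 56 bytes follow in 7 × 2 = 14 more clocks, so the whole block is in after 11 + 14 = 25 clocks.
Without critical word first, the load must wait for the full 64-byte block: 11 + 7 × 2 = 25 clocks.
If the following instructions then stream through the rest of the block 8 bytes at a time, both schemes finish around clock 25, so critical word first mainly helps the very first access.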
63
4th
Miss Penalty Reduction Technique:
Merging write Buffer
One-word writes are replaced with multiword writes, which
improves the buffer's efficiency.
In a write-through cache, when a write misses, if the buffer
contains other modified blocks, the addresses can be checked
to see if the address of the new data matches the address of a
valid write buffer entry. If so, the new data are combined with
that entry.
The optimization also reduces stalls due to the write
buffer being full.
64
Merging Write Buffer
When storing to a block that is already pending in the
write buffer, update write buffer
Reduces stalls due to full write buffer
Do not apply to I/O addresses
(Figure: write-buffer contents without write merging vs. with write merging.)
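A sketch of write merging in code (entry count, block size, and the byte-mask bookkeeping are assumptions): a store to a block that already has a valid write-buffer entry is merged into that entry instead of taking a new slot.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

enum { WB_ENTRIES = 4, BLOCK_SIZE = 32 };

typedef struct {
    bool     valid;
    uint64_t block_addr;
    uint8_t  data[BLOCK_SIZE];
    uint32_t byte_mask;                  /* which bytes are pending */
} merge_entry_t;

static merge_entry_t wb[WB_ENTRIES];

/* Returns false only if the buffer is full and nothing can be merged,
   i.e. the case that would stall the CPU. Assumes aligned word stores. */
bool write_buffer_store(uint64_t addr, uint32_t word) {
    uint64_t block  = addr & ~(uint64_t)(BLOCK_SIZE - 1);
    unsigned offset = (unsigned)(addr & (BLOCK_SIZE - 1));
    int free_slot = -1;
    for (int i = 0; i < WB_ENTRIES; i++) {
        if (wb[i].valid && wb[i].block_addr == block) {       /* merge */
            memcpy(&wb[i].data[offset], &word, sizeof word);
            wb[i].byte_mask |= 0xFu << offset;
            return true;
        }
        if (!wb[i].valid && free_slot < 0) free_slot = i;
    }
    if (free_slot < 0) return false;                          /* buffer full */
    wb[free_slot].valid = true;                               /* new entry  */
    wb[free_slot].block_addr = block;
    memcpy(&wb[free_slot].data[offset], &word, sizeof word);
    wb[free_slot].byte_mask = 0xFu << offset;
    return true;
}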
65
5th
Miss Penalty Reduction Technique:
Victim Caches
A victim cache is a small (usually, but not
necessarily) fully-associative cache that holds a
few of the most recently replaced blocks or
victims from the main cache.
This cache is checked on a miss, before going to the next
lower-level memory (main memory), to see if it holds the desired block.
If found, the victim block and the cache block are
swapped.
The AMD Athlon has a victim cache (a write buffer for
written-back blocks) with 8 entries.
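A sketch of a victim-cache probe (the 4-entry size follows the Jouppi study cited on the next slide; the data handling is simplified): on a main-cache miss, search the small fully associative victim cache and swap on a hit.

#include <stdbool.h>
#include <stdint.h>

enum { VICTIM_ENTRIES = 4 };

typedef struct { bool valid; uint64_t tag; /* ... block data ... */ } line_t;

static line_t victim[VICTIM_ENTRIES];

/* main_line is the line the missing address maps to in the main cache.
   Returns true if the request was satisfied from the victim cache. */
bool victim_probe(uint64_t tag, line_t *main_line) {
    for (int i = 0; i < VICTIM_ENTRIES; i++) {
        if (victim[i].valid && victim[i].tag == tag) {
            line_t tmp = *main_line;     /* swap victim and main-cache blocks */
            *main_line = victim[i];
            victim[i]  = tmp;
            return true;                 /* hit in the victim cache */
        }
    }
    return false;                        /* go to the next lower level */
}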
66
The Victim Cache
67
How to combine victim Cache ?
How to combine fast hit
time of direct mapped
yet still avoid conflict
misses?
Add buffer to place data
discarded from cache
Jouppi [1990]: 4-entry
victim cache removed 20%
to 95% of conflicts for a 4
KB direct mapped data
cache
Used in Alpha and HP machines
(Figure: a small fully associative victim cache — four entries, each with a tag + comparator and one cache line of data — sitting between the main cache and the next lower level in the hierarchy.)
68
How to Improve Cache Performance?
Reduce hit time(4)
 Small and simple first-level caches, Way prediction
 avoiding address translation, Trace cache
Increase bandwidth(3)
 Pipelined caches, multibanked caches, non-blocking caches
Reduce miss penalty(5)
 multilevel caches, read miss prior to writes,
 Critical word first, merging write buffers, and victim caches
Reduce miss rate(4)
 larger block size, large cache size, higher associativity
 Compiler optimizations
Reduce miss penalty or miss rate via parallelization (1)
 Hardware or compiler prefetching
AMAT = Hit Time + Miss Rate × Miss Penalty
69
Where misses come from?
 Classifying Misses: 3 Cs
Compulsory—The first access to a block is not in the cache, so the block
must be brought into the cache. Also called cold start misses or first
reference misses.
(Misses in even an Infinite Cache)
Capacity—If the cache cannot contain all the blocks needed during
execution of a program, capacity misses will occur due to blocks
being discarded and later retrieved.
(Misses in Fully Associative Size X Cache)
Conflict—If block-placement strategy is set associative or direct
mapped, conflict misses (in addition to compulsory & capacity
misses) will occur because a block can be discarded and later
retrieved if too many blocks map to its set. Also called collision
misses or interference misses.
(Misses in N-way Associative, Size X Cache)
 4th “C”:
Coherence - Misses caused by cache coherence.
70
3Cs Absolute Miss Rate (SPEC92)
(Figure: absolute miss rate (0.00 to 0.10) vs. cache size (4 KB to 512 KB), broken into compulsory, capacity, and conflict components for 1-way, 2-way, 4-way, and 8-way associativity.)
71
3Cs Relative Miss Rate
(Figure: the same data shown as the relative share (0% to 100%) of compulsory, capacity, and conflict misses for cache sizes 4 KB to 512 KB and 1- to 8-way associativity.)
Flaws: for fixed block size
Good: insight => invention
72
Reducing Cache Miss Rate
To reduce cache miss rate, we have to eliminate
some of the misses due to the three C's.
We cannot reduce capacity misses much except
by making the cache larger.
We can, however, reduce the conflict misses and
compulsory misses in several ways:
73
Cache Organization?
 Assume total cache size not changed:
 What happens if:
1) Change Block Size:
2) Change Associativity:
3) Change Compiler:
Which of 3Cs is obviously affected?
74
1st Miss Rate Reduction Technique:
Larger Block Size (fixed size&assoc)
Larger blocks decrease the compulsory miss
rate by taking advantage of spatial locality.
Drawback--curve is U-shaped
 However, they may increase the miss penalty by requiring more
data to be fetched per miss.
 In addition, they will almost certainly increase conflict misses
since fewer blocks can be stored in the cache.
 And maybe even capacity misses in small caches
Trade-off
 Trying to minimize both the miss rate and the miss penalty.
 The selection of block size depends on both the latency and bandwidth
of lower-level memory
75
Miss rate vs. block size and cache size
Block size   4K       16K      64K      256K
16           8.57%    3.94%    2.04%    1.09%
32           7.24%    2.87%    1.35%    0.70%
64           7.00%    2.64%    1.06%    0.51%
128          7.78%    2.77%    1.02%    0.49%
256          9.51%    3.29%    1.15%    0.49%
76
Performance curve is U-shaped
(Figure: miss rate (0% to 25%) vs. block size (16 to 256 bytes) for cache sizes 1K to 256K — larger blocks first reduce compulsory misses, then drive up conflict misses, giving a U-shaped curve.)
What else drives up block size?
77
Example: Larger Block Size (C-26)
Assume: memory takes 80 clock cycles of overhead
and then delivers 16 bytes every 2 cycles.
1 clock cycle hit time independent of block size.
Which block size has the smallest AMAT for each size
in Fig.5.17 ?
Answer:
AMAT(16-byte block, 4 KB)    = 1 + (8.57% × 82)  = 8.027 clock cycles
AMAT(256-byte block, 256 KB) = 1 + (0.49% × 112) = 1.549 clock cycles
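A sketch that reproduces this calculation for all block sizes (the 4 KB miss rates are the SPEC92 numbers from the table two slides back; the loop and print format are mine):

#include <stdio.h>

int main(void) {
    int    block_sizes[]  = { 16, 32, 64, 128, 256 };
    double miss_rate_4k[] = { 0.0857, 0.0724, 0.0700, 0.0778, 0.0951 };
    for (int i = 0; i < 5; i++) {
        int    penalty = 80 + 2 * (block_sizes[i] / 16);   /* 80 clocks + 2 per 16 B */
        double amat    = 1.0 + miss_rate_4k[i] * penalty;  /* 1-clock hit time       */
        printf("block %3d B: penalty %3d, AMAT %.3f\n",
               block_sizes[i], penalty, amat);
    }
    return 0;
}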
78
2nd
Miss Rate Reduction Technique:
Larger Caches
rule of thumb: 2 x size => 25% cut in miss rate
What does it reduce ?
(Figure: miss rate per type (0 to 0.14) vs. cache size (1 KB to 128 KB) for 1- to 8-way associativity — capacity misses shrink as the cache grows while compulsory misses stay roughly constant.)
79
Pros vs. cons for large caches
Pros
 Reduce capacity misses
Cons
 Longer hit time, higher cost; the AMAT curve is U-shaped
Popular in off-chip caches
AMAT (clock cycles) vs. block size and cache size
Block size   Miss penalty   4K       16K      64K      256K
16           82             8.027    4.231    2.673    1.894
32           84             7.082    3.411    2.134    1.588
64           88             7.160    3.323    1.933    1.449
128          96             8.469    3.659    1.979    1.470
256          112            11.651   4.685    2.288    1.549
80
3rd
Miss Rate Reduction Technique:
Higher Associativity
Conflict misses can be a problem for caches
with low associativity (especially direct-
mapped).
Higher associativity decreases conflict misses and thus
improves the miss rate
Cache rules of thumb
2:1 rule of thumb: a direct-mapped cache of size N has about
the same miss rate as a 2-way set-associative cache of
size N/2.
Eight-way set associativity is for practical purposes as
effective in reducing misses for caches of these sizes as
full associativity.
81
Associativity
(Figure: the 3Cs miss-rate breakdown (0.00 to 0.10) vs. cache size (4 KB to 512 KB), highlighting the conflict component for 1- to 8-way associativity and illustrating the 2:1 rule of thumb.)
82
Associativity vs Cycle Time
Beware: execution time is the only final measure!
Why is cycle time tied to hit time?
Will Clock Cycle time increase?
 Hill [1988] suggested hit time for 2-way vs. 1-way
external cache +10%,
internal + 2%
 suggested big and dumb caches
83
AMAT vs. Miss Rate (P430)
Example: assume CCT = 1.36 for 2-way, 1.44 for 4-way,
1.52 for 8-way vs. CCT direct mapped
Cache Size Associativity
(KB) 1-way 2-way 4-way 8-way
4 3.44 3.25 3.22 3.28
8 2.69 2.58 2.55 2.62
16 2.33 2.40 2.46 2.53
32 2.06 2.30 2.37 2.45
64 1.92 2.24 2.18 2.25
128 1.52 1.84 1.92 2.00
256 1.32 1.66 1.74 1.82
512 1.20 1.55 1.59 1.66
(Red means A.M.A.T. not improved by more associativity)
84
4th Miss Rate Reduction Technique:
Compiler Optimizations
These techniques reduce miss rates without any hardware
changes, by having the compiler reorder instructions and data.
Instructions
 Reorder procedures in memory so as to reduce conflict misses
 Profiling to look at conflicts (using tools they developed)
Data
Merging Arrays: improve spatial locality by single array of
compound elements vs. 2 arrays
Loop Interchange: change nesting of loops to access data in
order stored in memory
Loop Fusion: Combine 2 independent loops that have same
looping and some variables overlap
Blocking: Improve temporal locality by accessing “blocks” of
data repeatedly vs. going down whole columns or rows
85
a. Merging Arrays
Combining independent arrays into a single compound
array.
Improving spatial locality
Example
/* before */
int val[SIZE];
int key[SIZE];
/* after */
struct merge {
  int val;
  int key;
};
struct merge merged_array[SIZE];
86
b. Loop Interchange
/* Before */
for (k = 0; k < 100; k = k+1)
for (j = 0; j < 100; j = j+1)
for (i = 0; i < 5000; i = i+1)
x[i][j] = 2 * x[i][j];
/* After */
for (k = 0; k < 100; k = k+1)
for (i = 0; i < 5000; i = i+1)
for (j = 0; j < 100; j = j+1)
x[i][j] = 2 * x[i][j];
Sequential accesses instead of striding through
memory every 100 words;
By switching the order in which loops execute, misses can
be reduced due to improvements in spatial locality.
87
c. Loop fusion
By fusing the code into a single loop, the data that are
fetched into the cache can be used repeatedly before
being swapped out.
Improving temporal locality
Example:
/* before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    d[i][j] = a[i][j] * c[i][j];
/* after */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
  {
    a[i][j] = 1/b[i][j] * c[i][j];
    d[i][j] = a[i][j] * c[i][j];
  }
88
d. Unoptimized Matrix Multiplication
/* Before */
for (i = 0; i < N; i = i+1)
for (j = 0; j < N; j = j+1)
{r = 0;
for (k = 0; k < N; k = k+1)
r = r + y[i][k]*z[k][j];
x[i][j] = r;
};
Two inner loops:
 Write N elements of 1 row of X[ ]
 Read N elements of 1 row of Y[ ] repeatedly
 Read all N×N elements of Z[ ]
Capacity misses are a function of N and the cache size:
 ((N + N)·N + N)·N = 2N^3 + N^2 words accessed for N^3 operations
 (assuming no conflicts; otherwise more)
Idea: compute on a B×B submatrix that fits in the cache
89
Blocking optimized Matrix Multiplication
Matrix multiplication is
performed by multiplying the
submatrices first.
/* After */
for (jj = 0; jj < N; jj = jj+B)
for (kk = 0; kk < N; kk = kk+B)
for (i = 0; i < N; i = i+1)
for (j = jj; j < min(jj+B-1,N); j = j+1)
{r = 0;
for (k = kk; k < min(kk+B-1,N); k = k+1)
r = r + y[i][k]*z[k][j];
x[i][j] = x[i][j] + r;
};
Y benefits from spatial locality, Z benefits from temporal locality
Capacity misses drop from 2N^3 + N^2 to 2N^3/B + N^2:
 (B·N + B·N + B^2) × (N/B)^2 = 2N^3/B + N^2 words accessed for N^3 operations
B is called the blocking factor
90
Reducing Conflict Misses by Blocking
Conflict misses in caches that are not fully associative vs. blocking size
Lam et al. [1991]: a blocking factor of 24 had one-fifth the
misses of a blocking factor of 48, despite both fitting in the cache
(Figure: miss rate (0 to 0.1) vs. blocking factor (0 to 150) for a fully associative cache and a direct-mapped cache.)
91
Summary of Compiler Optimizations to Reduce Cache
Misses (by hand)
(Figure: performance improvement (1× to 3×) from merging arrays, loop interchange, loop fusion, and blocking, applied by hand to compress, cholesky (nasa7), spice, mxm (nasa7), btrix (nasa7), tomcatv, gmty (nasa7), and vpenta (nasa7).)
92
5th Miss Rate Reduction Technique:
Way Prediction and Pseudo-Associative Cache
Using these two techniques reduces conflict misses and yet
maintains the hit speed of a direct-mapped cache
Predictive bit - pseudo-associative
Way Prediction (Alpha 21264)
 Extra bits are kept in the cache to predict the way, or block within
the set, of the next cache access.
 If the predictor is correct, the instruction cache latency is 1 clock
cycle.
 If not, it tries the other block, changes the way predictor, and has a
latency of 3 clock cycles.
 Simulation using SPEC95 suggested set-prediction accuracy in excess
of 85%, so way prediction saves a pipeline stage in more than 85% of
the instruction fetches.
93
Pseudo-Associative Cache
(column associative)
How to combine fast hit time of Direct Mapped and have
the lower conflict misses of 2-way SA cache?
Divide cache: on a miss, check other half of cache to see if
there, if so have a pseudo-hit (slow hit)
Drawback: CPU pipeline is hard if hit takes 1 or 2 cycles
Better for caches not tied directly to processor (L2)
Used in MIPS R1000 L2 cache, similar in UltraSPARC
(Figure: access-time line — hit time < pseudo-hit time < miss penalty.)
94
Pseudo-Associative Cache
(Figure: pseudo-associative cache datapath — the CPU's address splits into a block-frame address (21-bit tag, 8-bit index) and a 5-bit block offset; each entry holds a valid bit, a 22-bit tag, and 256 bits of data; a 4:1 MUX selects the data out, and a write buffer sits between the cache and main memory.)
95
How to Improve Cache Performance?
Reduce hit time(4)
 Small and simple first-level caches, Way prediction
 avoiding address translation, Trace cache
Increase bandwidth(3)
 Pipelined caches, multibanked caches, non-blocking caches
Reduce miss penalty(5)
 multilevel caches, read miss prior to writes,
 Critical word first, merging write buffers, and victim caches
Reduce miss rate(4)
 larger block size, large cache size, higher associativity
 Compiler optimizations
Reduce miss penalty or miss rate via parallelization (1)
 Hardware or compiler prefetching
AMAT = Hit Time + Miss Rate × Miss Penalty
96
Hardware Prefetching
Fetch two blocks on miss (include next
sequential block)
Pentium 4 Pre-fetching
97
Compiler Prefetching
Insert prefetch instructions before data is needed
Non-faulting: prefetch doesn’t cause exceptions
Register prefetch
Loads data into register
Cache prefetch
Loads data into cache
Combine with loop unrolling and software
pipelining
98
Use HBM to Extend Hierarchy
128 MiB to 1 GiB
Smaller blocks require substantial tag storage
Larger blocks are potentially inefficient
One approach (L-H):
Each SDRAM row is a block index
Each row contains set of tags and 29 data
segments
29-way set associative
Hit requires a CAS
99
Use HBM to Extend Hierarchy
Another approach (Alloy cache):
Mold tag and data together
Use direct mapped
Both schemes require two DRAM accesses
for misses
Two solutions:
 Use map to keep track of blocks
 Predict likely misses
100
Use HBM to Extend Hierarchy
101
Summary
102
1st Miss Penalty/Rate Reduction Technique: Hardware
Prefetching of Inst.and data
The act of getting data from memory before it is actually
needed by the CPU.
This reduces compulsory misses by retrieving the data
before it is requested.
Of course, this may increase other misses by removing
useful blocks from the cache.
Thus, many caches hold prefetched blocks in a special
buffer until they are actually needed.
E.g., Instruction Prefetching
 Alpha 21064 fetches 2 blocks on a miss
 Extra block placed in “stream buffer”
 On miss check stream buffer
Prefetching relies on having extra memory bandwidth that
can be used without penalty
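A minimal sketch of the stream-buffer idea described above (depth, block size, and the replacement policy are assumptions): on a cache miss, fetch the missing block and prefetch the next sequential block into a small buffer that is checked on later misses.

#include <stdbool.h>
#include <stdint.h>

enum { STREAM_DEPTH = 4, BLOCK_SIZE = 64 };

static uint64_t stream_buf[STREAM_DEPTH];   /* block addresses held */
static int      stream_len = 0;

static void prefetch_into_stream(uint64_t block_addr) {
    if (stream_len < STREAM_DEPTH)
        stream_buf[stream_len++] = block_addr;   /* uses spare memory bandwidth */
}

/* Called on a cache miss.  Returns true if the block was found in the
   stream buffer (the caller then moves it into the cache). */
bool handle_miss(uint64_t addr) {
    uint64_t block = addr / BLOCK_SIZE;
    for (int i = 0; i < stream_len; i++) {
        if (stream_buf[i] == block) {
            stream_buf[i] = stream_buf[--stream_len];  /* remove from buffer   */
            prefetch_into_stream(block + 1);           /* keep streaming ahead */
            return true;
        }
    }
    prefetch_into_stream(block + 1);   /* fetch missed block + next sequential */
    return false;
}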
103
2nd
Miss Penalty/Rate Reduction Technique:
Compiler-controlled prefetch
The compiler inserts prefetch instructions to request the
data before they are needed
Data Prefetch Load data into register (HP PA-RISC loads)
 Cache Prefetch: load into cache (MIPS IV, PowerPC, SPARC v. 9)
 Special prefetching instructions cannot cause faults; a form of
speculative execution
Prefetching comes in two flavors:
 Binding prefetch: Requests load directly into register.
 Must be correct address and register!
 Non-Binding prefetch: Load into cache.
 Can be incorrect. Faults?
Issuing Prefetch Instructions takes time
 Is cost of prefetch issues < savings in reduced misses?
 Higher superscalar reduces difficulty of issue bandwidth
104
Example (P307):
Compiler-controlled prefetch
for( i=0; i <3; i = i +1)
for( j=0; j<100; j=j+1)
a[i][j] = b[j][0] * b[j+1][0];
16 B/block, 8 B/element, 2 elements/block
a[i][j]: 3 × 100 accesses; the even values of j miss and the odd
values hit (two 8-byte elements per 16-byte block), for 150 misses in total
b[j][0]: the same 101 elements are accessed on every iteration of i;
 j=0 touches b[0][0] and b[1][0] (2 misses), j=1 touches b[1][0] and b[2][0] (1 new miss), ...
 total 2 + 99 = 101 misses
105
Example cont.:
Compiler-controlled prefetch
for (j = 0; j < 100; j = j+1) {
  prefetch(b[j+7][0]);
  prefetch(a[0][j+7]);
  a[0][j] = b[j][0] * b[j+1][0];    /* 7 misses for b, 4 misses for a[0][j] */
}
for (i = 1; i < 3; i = i+1)
  for (j = 0; j < 100; j = j+1) {
    prefetch(a[i][j+7]);
    a[i][j] = b[j][0] * b[j+1][0];  /* 4 misses each for a[1][j] and a[2][j] */
  }
Total: 19 misses
Saves 232 cache misses at the price of 400 prefetch
instructions.
106
End.
More Related Content

PPTX
Memory technology and optimization in Advance Computer Architechture
PDF
Unit I Memory technology and optimization
PPTX
memorytechnologyandoptimization-140416131506-phpapp02.pptx
PDF
Computer architecture for HNDIT
PPTX
CA UNIT V..pptx
PPTX
UNIT IV Computer architecture Analysis.pptx
PPTX
Memory Organization
PPTX
CAQA5e_ch2.pptx memory hierarchy design storage
Memory technology and optimization in Advance Computer Architechture
Unit I Memory technology and optimization
memorytechnologyandoptimization-140416131506-phpapp02.pptx
Computer architecture for HNDIT
CA UNIT V..pptx
UNIT IV Computer architecture Analysis.pptx
Memory Organization
CAQA5e_ch2.pptx memory hierarchy design storage

Similar to 2021Arch_5_ch2_2.pptx How to improve the performance of Memory hierarchy (20)

PPTX
CPU Memory Hierarchy and Caching Techniques
PPTX
COMPUTER ARCHITECTURE-2.pptx
PPTX
Computer System Architecture Lecture Note 8.1 primary Memory
PPTX
MEMORY & I/O SYSTEMS
PDF
Improving DRAM performance
PPT
Lec9 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part 1
PPTX
Computer Memory Hierarchy Computer Architecture
PPTX
ICT-Lecture_07(Primary Memory and its types).pptx
PPTX
UNIT 4 COA MEMORY.pptx computer organisation
PPTX
onur-comparch-fall2018-lecture3b-memoryhierarchyandcaches-afterlecture.pptx
PPT
Internal Memory FIT NED UNIVERSITY OF EN
PPT
Memory organization including cache and RAM.ppt
PPT
Ct213 memory subsystem
PPT
Cache Memory
PPTX
Computer organizatin.Chapter Seven.pptxs
PPT
cache memory
PPT
04_Cache Memory.ppt
PPT
cache memory.ppt
PPT
cache memory.ppt
PPTX
coa-Unit5-ppt1 (1).pptx
CPU Memory Hierarchy and Caching Techniques
COMPUTER ARCHITECTURE-2.pptx
Computer System Architecture Lecture Note 8.1 primary Memory
MEMORY & I/O SYSTEMS
Improving DRAM performance
Lec9 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part 1
Computer Memory Hierarchy Computer Architecture
ICT-Lecture_07(Primary Memory and its types).pptx
UNIT 4 COA MEMORY.pptx computer organisation
onur-comparch-fall2018-lecture3b-memoryhierarchyandcaches-afterlecture.pptx
Internal Memory FIT NED UNIVERSITY OF EN
Memory organization including cache and RAM.ppt
Ct213 memory subsystem
Cache Memory
Computer organizatin.Chapter Seven.pptxs
cache memory
04_Cache Memory.ppt
cache memory.ppt
cache memory.ppt
coa-Unit5-ppt1 (1).pptx
Ad

More from 542590982 (7)

PPTX
2021Arch_14_Ch5_2_coherence.pptx Cache coherence
PDF
2021Arch_15_Ch5_3_Syncronization.pdf Synchronization in Multiprocessor
PPTX
2021Arch_6_Ch3_0_Extend2SupportingMCoperation.pptx
PPTX
2021Arch_1_intro.pptx Computer Architecture ----A Quantitative Approach
PPTX
2021Arch_2_Ch1_1.pptx Fundamentals of Quantitative Design and Analysis
PPTX
Design compiler1_2012暑期.pptx teach people how to use design complier
PPTX
photograph skills to help people how to shoot in action
2021Arch_14_Ch5_2_coherence.pptx Cache coherence
2021Arch_15_Ch5_3_Syncronization.pdf Synchronization in Multiprocessor
2021Arch_6_Ch3_0_Extend2SupportingMCoperation.pptx
2021Arch_1_intro.pptx Computer Architecture ----A Quantitative Approach
2021Arch_2_Ch1_1.pptx Fundamentals of Quantitative Design and Analysis
Design compiler1_2012暑期.pptx teach people how to use design complier
photograph skills to help people how to shoot in action
Ad

Recently uploaded (20)

PDF
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
PDF
composite construction of structures.pdf
PDF
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
PPTX
web development for engineering and engineering
DOCX
ASol_English-Language-Literature-Set-1-27-02-2023-converted.docx
PPTX
Construction Project Organization Group 2.pptx
PDF
Model Code of Practice - Construction Work - 21102022 .pdf
PDF
TFEC-4-2020-Design-Guide-for-Timber-Roof-Trusses.pdf
PPTX
Welding lecture in detail for understanding
PDF
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
PDF
Automation-in-Manufacturing-Chapter-Introduction.pdf
PDF
Operating System & Kernel Study Guide-1 - converted.pdf
PDF
Well-logging-methods_new................
PDF
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks
PPT
Project quality management in manufacturing
PPTX
UNIT 4 Total Quality Management .pptx
PPTX
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx
PDF
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
PPTX
bas. eng. economics group 4 presentation 1.pptx
PPTX
CH1 Production IntroductoryConcepts.pptx
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
composite construction of structures.pdf
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
web development for engineering and engineering
ASol_English-Language-Literature-Set-1-27-02-2023-converted.docx
Construction Project Organization Group 2.pptx
Model Code of Practice - Construction Work - 21102022 .pdf
TFEC-4-2020-Design-Guide-for-Timber-Roof-Trusses.pdf
Welding lecture in detail for understanding
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
Automation-in-Manufacturing-Chapter-Introduction.pdf
Operating System & Kernel Study Guide-1 - converted.pdf
Well-logging-methods_new................
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks
Project quality management in manufacturing
UNIT 4 Total Quality Management .pptx
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
bas. eng. economics group 4 presentation 1.pptx
CH1 Production IntroductoryConcepts.pptx

2021Arch_5_ch2_2.pptx How to improve the performance of Memory hierarchy

  • 1. Ch2-2 How to improve the performance of Memory hierarchy
  • 2. 2 Memory Technology and Optimizations Performance metrics Latency is concern of cache Bandwidth is concern of multiprocessors and I/O Access time  Time between read request and when desired word arrives Cycle time  Minimum time between unrelated requests to memory SRAM memory has low latency, use for cache Organize DRAM chips into many banks for high bandwidth, use for main memory
  • 3. 3 Memory Technology SRAM Requires low power to retain bit Requires 6 transistors/bit DRAM Must be re-written after being read Must also be periodically refeshed  Every ~ 8 ms (roughly 5% of time)  Each row can be refreshed simultaneously One transistor/bit Address lines are multiplexed:  Upper half of address: row access strobe (RAS)  Lower half of address: column access strobe (CAS)
  • 4. 4 DRAM logical organization (64 Mbit) Square root of bits per RAS/CAS Column Decoder Sense Amps & I/O Memory Array (16,384×16,384) A0…A13 … Address buffer 14 Data in D Q Word Line Storage Cell Data out Row Decoder … Bit Line
  • 5. 5 DRAM Read Timing Every DRAM access begins at:  The assertion of the RAS_L  2 ways to read: early or late v. CAS A D OE_L 256K x 8 DRAM 9 8 WE_L CAS_L RAS_L OE_L A Row Address WE_L Junk Read Access Time Output Enable Delay CAS_L RAS_L Col Address Row Address Junk Col Address D High Z Data Out DRAM Read Cycle Time Early Read Cycle: OE_L asserted before CAS_L Late Read Cycle: OE_L asserted after CAS_L Junk Data Out High Z
  • 7. 7 Times of fast and slow DRAMs with each generation.
  • 9. 9 Memory Technology Amdahl:  Memory capacity should grow linearly with processor speed  Unfortunately, memory capacity and speed has not kept pace with processors Some optimizations:  Multiple accesses to same row  Synchronous DRAM  Added clock to DRAM interface  Burst mode with critical word first  Wider interfaces  Double data rate (DDR)  Multiple banks on each DRAM device
  • 10. 10 1st Improving DRAM Performance Fast Page Mode DRAM (FPM) Timing signals that allow repeated accesses to the row buffer (page) without another row access time  Such a buffer comes naturally, as each array will buffer 1024 to 2048 bits for each access. Page: All bits on the same ROW (Spatial Locality) Don’t need to wait for wordline to recharge Toggle CAS with new column address
  • 11. 11 2nd Improving DRAM Performance Synchronous DRAM conventional DRAMs have an asynchronous interface to the memory controller, and hence every transfer involves overhead to synchronize with the controller. The solution was to add a clock signal to the DRAM interface, so that the repeated transfers would not bear that overhead.  Data output is in bursts w/ each element clocked
  • 12. 12 3rd Improving DRAM Performance DDR--Double data rate On both the rising edge and falling edge of the DRAM clock signal, DRAM innovation to increase bandwidth is to transfer data,  thereby doubling the peak data rate. 2.5V 1.8v 1.5v
  • 13. 13 DDR--Double data rate DDR: DDR2  Lower power (2.5 V -> 1.8 V)  Higher clock rates (266 MHz, 333 MHz, 400 MHz) DDR3  1.5 V  800 MHz DDR4  1-1.2 V  1333 MHz GDDR5 is graphics memory based on DDR3
  • 14. 14 4rd Improving DRAM Performance New DRAM Interface: RAMBUS (RDRAM) a type of synchronous dynamic RAM, designed by the Rambus Corporation. Each chip has interleaved memory and a high speed interface. Protocol based RAM w/ narrow (16-bit) bus High clock rate (400 Mhz), but long latency Pipelined operation Multiple arrays w/ data transferred on both edges of clock The first generation RAMBUS interface dropped RAS/CAS, replacing it with a bus that allows other accesses over the bus between the sending of the address and return of the data. It is typically called RDRAM. The second generation RAMBUS interface include a separate row- and column-command buses instead of the conventional multiplexing; and a much more sophisticated controller on chip. Because of the separation of data, row, and column buses, three transactions can be performed simultaneously. called Direct RDRAM or DRDRAM
  • 17. 17 Comparing RAMBUS and DDR SDRAM Since the most computers use memory in DIMM packages, which are typically at least 64-bits wide, the DIMM memory bandwidth is closer to what RAMBUS provides than you might expect when just comparing DRAM chips. Caution that performance of cache are based in part on latency to the first byte and in part on the bandwidth to deliver the rest of the bytes in the block. Although these innovations help with the latter case, none help with latency. Amdahl’s Law reminds us of the limits of accelerating one piece of the problem while ignoring another part.
  • 18. 18 Memory Optimizations Reducing power in SDRAMs: Lower voltage Low power mode (ignores clock, continues to refresh) Graphics memory: Achieve 2-5 X bandwidth per DRAM vs. DDR3  Wider interfaces (32 vs. 16 bit)  Higher clock rate Possible because they are attached via soldering instead of socketted DIMM modules
  • 19. 19 Summery Memory organization  Wider memory  Simple interleaved memory  Independent memory banks Avoiding Memory Bank Conflicts Memory chip Fast Page Mode DRAM Synchronize DRAM Double Date Rate  RDRAM
  • 21. 21 Stacked/Embedded DRAMs Stacked DRAMs in same package as processor High Bandwidth Memory (HBM)
  • 22. 22 Flash Memory Type of EEPROM Types: NAND (denser) and NOR (faster) NAND Flash: Reads are sequential, reads entire page (.5 to 4 KiB) 25 us for first byte, 40 MiB/s for subsequent bytes SDRAM: 40 ns for first byte, 4.8 GB/s for subsequent bytes 2 KiB transfer: 75 uS vs 500 ns for SDRAM, 150X slower 300 to 500X faster than magnetic disk
  • 23. 23 NAND Flash Memory Must be erased (in blocks) before being overwritten Nonvolatile, can use as little as zero power Limited number of write cycles (~100,000) $2/GiB, compared to $20-40/GiB for SDRAM and $0.09 GiB for magnetic disk Phase-Change (相变) /Memrister Memory Possibly 10X improvement in write performance and 2X improvement in read performance
  • 24. 24 Memory Dependability Memory is susceptible to cosmic rays Soft errors: dynamic errors Detected and fixed by error correcting codes (ECC) Hard errors: permanent errors Use spare rows to replace defective rows Chipkill: a RAID-like error recovery technique
  • 25. 25 How to Improve Cache Performance? Reduce hit time(4)  Small and simple first-level caches, Way prediction  avoiding address translation, Trace cache Increase bandwidth(3)  Pipelined caches, multibanked caches, non-blocking caches Reduce miss penalty(5)  multilevel caches, read miss prior to writes,  Critical word first, merging write buffers, and victim caches Reduce miss rate(4)  larger block size, large cache size, higher associativity  Compiler optimizations Reduce miss penalty or miss rate via parallelization (1)  Hardware or compiler prefetching AMAT = HitTime + MissRateMissPenalty
  • 26. 26 1st Hit Time Reduction Technique: Small and Simple Caches Using small and Direct-mapped cache The less hardware that is necessary to implement a cache, the shorter the critical path through the hardware. Direct-mapped is faster than set associative for both reads and writes. Fitting the cache on the chip with the CPU is also very important for fast access times.
  • 27. 27 L1 Size and Associativity Access time vs. size and associativity
  • 28. 28 L1 Size and Associativity Energy per read vs. size and associativity
  • 29. 29 2nd Hit Time Reduction Technique: Way Prediction
  • 30. 30 2nd Hit Time Reduction Technique: Way Prediction Way Prediction (Pentium 4 ) Extra bits are kept in the cache to predict the way,or block within set of the next cache access. If the predictor is correct, the instruction cache latency is 1 clock clock cycle. If not,it tries the other block, changes the way predictor, and has a latency of 1 extra clock cycles. Simulation using SPEC95 suggested set prediction accuracy is excess of 85%, so way prediction saves pipeline stage in more than 85% of the instruction fetches.
  • 31. 31 3rd Hit Time Reduction Technique: Avoiding Address Translation during Indexing of the Cache Page table is a large data structure in memory TWO memory accesses for every load, store, or instruction fetch!!! Virtually addressed cache? synonym problem Cache the address translations? CPU Trans- lation Cache Main Memory VA PA miss hit data
  • 32. 32 TLBs A way to speed up translation is to use a special cache of recently used page table entries -- this has many names, but the most frequently used is Translation Lookaside Buffer or TLB Virtual Address Physical Address Dirty Ref Valid Access Really just a cache on the page table mappings TLB access time comparable to cache access time (much less than main memory access time)
  • 33. 33 Translation Look-Aside Buffers Just like any other cache, the TLB can be organized as fully associative, set associative, or direct mapped TLBs are usually small, typically not more than 128 - 256 entries even on high end machines. This permits fully associative lookup on these machines. Most mid-range machines use small n-way set associative organizations. CPU TLB Lookup Cache Main Memory VA PA miss hit data Trans- lation hit miss 20 t t 1/2 t Translation with a TLB
  • 34. 34 Fast hits by Avoiding Address Translation CPU TB $ MEM VA PA PA Conventional Organization CPU $ TB MEM VA VA PA Virtually Addressed Cache Translate only on miss Synonym Problem VA Tags CPU $ TB MEM VA PA Tags PA Virtual indexed, Physically tagged Overlap $ access with VA translation: requires $ index to remain invariant across translation L2 $
  • 35. 35 Virtual Addressed Cache  Send virtual address to cache? Called Virtually Addressed Cache or just Virtual Cache (vs. Physical Cache) Every time process is switched logically must flush the cache; otherwise get false hits  Cost is time to flush + “compulsory” misses from empty cache  Add process identifier tag that identifies process as well as address within process: can’t get a hit if wrong process Any Problems ?
  • 37. 37 Dealing with aliases Dealing with aliases (synonyms); Two different virtual addresses map to same physical address NO aliasing! What are the implications? HW antialiasing: guarantees every cache block has unique address  verify on miss (rather than on every hit)  cache set size <= page size ?  what if it gets larger? How can SW simplify the problem? (called page coloring) I/O must interact with cache, so need virtual address
  • 38. 38 Aliases problem with Virtual cache Tag index offset Tag index offset A1 A2 A1 A2 If the index and offset bits of two aliases are forced to be the same, than the aliases address will map to the same block in cache.
  • 39. 39 Overlap address translation and cache access (Virtual indexed, physically tagged) Index PhysicalPageNo page offset VitualPageNo page offset TLB translation Tag = Any limitation ?
  • 40. 40 What’s the limitation? Index PhysicalPageNo page offset VitualPageNo page offset TLB translation Tag = IF it’s direct map cache, then Cache size = 2index * 2blockoffset <= 2 pageoffset How to solve this problem? Use higher association. Say it’s a 4 way, then cache size can reach 4 times 2pageoffset with change nothing in index or tag or offset.
  • 41. 41 VPN: 30bits 8bits 5bits = = VA V D 25 bits 256 bits V D 25 bits 256 bits bank0 bank1 0 0 255 255 V D 30 bits 25 bits = = = = VPN PPN Cache PPN : 25bits index block offset TLB Example: Virtual indexed, physically tagged cache Virtual address wide = 43 bits, Memory physical address wide = 38 bits, Page size = 8KB. Cache capacity =16KB. If a virtually indexed and physically tagged cache is used. And the cache is 2-way associative write back cache with 32 byte block size.
  • 42. 42 4th Hit Time Reduction Technique: Trace caches Find a dynamic sequence of instructions including taken branches to load into a cache block. The block determined by CPU instead of by memory layout. Complicated address mapping mechanism
  • 43. 43 Why Trace Cache ? Bring N instructions per cycle No I-cache misses No prediction miss No packet breaks ! Because branch in each 5 instruction, so cache can only provide a packet in one cycle.
  • 44. 44 What’s Trace ? Trace: dynamic instruction sequence When instructions ( operations ) retire from the pipeline, pack the instruction segments into TRACE, and store them in the TRACE cache, including the branch instructions. Though branch instruction may go a different target, but most times the next operation sequential will just be the same as the last sequential. ( locality )
  • 45. 45 Whose propose ? Peleg Weiser (1994) in Intel corporation Patel / Patt ( 1996) Rotenberg / J. Smith (1996) Paper: ISCA’98
  • 46. 46 Trace in CPU Trace cache I cache LI M Fill unit OP DM
  • 47. 47 Instruction segment Block A Block B Block D Block C T F T F Block A B C T, F Fill unit pack them into : A Predict info.
  • 48. 48 Pentium 4: trace cache, 12 instr./per cycle A B C Trace Cache Instruction Cache Decode PC BR predictor Decode Pentium4 12 instr. Rename Execution cord Data cache Fill unit Trace segment L2 Cache Memory Disk Onchip
  • 49. 49 How to Improve Cache Performance? Reduce hit time(4)  Small and simple first-level caches, Way prediction  avoiding address translation, Trace cache Increase bandwidth(3)  Pipelined caches, multibanked caches, non-blocking caches Reduce miss penalty(5)  multilevel caches, read miss prior to writes,  Critical word first, merging write buffers, and victim caches Reduce miss rate(4)  larger block size, large cache size, higher associativity  Compiler optimizations Reduce miss penalty or miss rate via parallelization (1)  Hardware or compiler prefetching AMAT = HitTime + MissRateMissPenalty
  • 51. 51 1st Increasing cache bandwidth: Pipelined Caches Hit in multiple cycles, giving fast clock cycle time
  • 52. 52 2nd Increasing cache bandwidth: Multibanked Caches Organize cache as independent banks to support simultaneous access  ARM Cortex-A8 supports 1-4 banks for L2  Intel i7 supports 4 banks for L1 and 8 banks for L2 Banking works best when accesses naturally spread themselves across banks  mapping of addresses to banks affects behavior of memory system Interleave banks according to block address,Simple mapping that works well is “sequential interleaving”
  • 53. 53 Single banked two bank two bank two bank consecutive interleaving group interleaving
  • 54. 54 3rd Increasing cache bandwidth: Nonblocking Caches Allow hits before previous misses complete  “Hit under miss” , “ Hit under multiple miss” L2 must support this In general, processors can hide L1 miss penalty but not L2 miss penalty Nonblocking, in conjunction with out-of-order execution, can allow the CPU to continue executing instructions after a data cache miss.
  • 55. 55 How to Improve Cache Performance? Reduce hit time(4)  Small and simple first-level caches, Way prediction  avoiding address translation, Trace cache Increase bandwidth(3)  Pipelined caches, multibanked caches, non-blocking caches Reduce miss penalty(5)  multilevel caches, read miss prior to writes,  Critical word first, merging write buffers, and victim caches Reduce miss rate(4)  larger block size, large cache size, higher associativity  Compiler optimizations Reduce miss penalty or miss rate via parallelization (1)  Hardware or compiler prefetching AMAT = HitTime + MissRateMissPenalty
  • 56. 56 1st Miss Penalty Reduction Technique: Multilevel Caches  This method focuses on the interface between the cache and main memory.  Add an second-level cache between main memory and a small, fast first-level cache, to make the cache fast and large.  The smaller first-level cache is fast enough to match the clock cycle time of the fast CPU and to fit on the chip with the CPU, thereby lessening the hits time.  The second-level cache can be large enough to capture many memory accesses that would go to main memory, thereby lessening the effective miss penalty.
  • 57. 57 Parameter about Multilevel cache  The performance of a two-level cache is calculated in a similar way to the performance for a single level cache.  L2 Equations AMAT = Hit TimeL1 + Miss RateL1 x Miss PenaltyL1 Miss PenaltyL1 = Hit TimeL2 + Miss RateL2 x Miss PenaltyL2 AMAT = Hit TimeL1 + Miss RateL1 x (Hit TimeL2 + Miss RateL2 * Miss PenaltyL2) 1 2 2 2 2 1 1 L L L L L L L rate Miss M Misses M Misses rate Miss M Misses rate Miss     So the miss penalty for level 1 is calculated using the hit time, miss rate, and miss penalty for the level 2 cache.
  • 58. 58 Two miss-rate definitions for a two-level cache Local miss rate: misses in this cache divided by the total number of memory accesses made to this cache (for the second level, this is Miss Rate_L2) Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU Using these terms, the global miss rate of the first-level cache is still just Miss Rate_L1, but for the second-level cache it is: Global Miss Rate_L2 = Misses_L2 / Memory accesses = (Misses_L1 / Memory accesses) × (Misses_L2 / Misses_L1) = Miss Rate_L1 × Miss Rate_L2
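To make the two-level formulas concrete, a small sketch follows (the numeric parameters are invented for illustration; only the formulas come from the slides):

#include <stdio.h>

int main(void)
{
    /* Illustrative parameters (assumed, not from the slides). */
    double hit_time_l1     = 1.0;    /* clock cycles              */
    double miss_rate_l1    = 0.04;   /* misses per CPU access     */
    double hit_time_l2     = 10.0;   /* clock cycles              */
    double local_miss_l2   = 0.25;   /* L2 misses per L1 miss     */
    double miss_penalty_l2 = 100.0;  /* clock cycles to memory    */

    /* Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2 */
    double miss_penalty_l1 = hit_time_l2 + local_miss_l2 * miss_penalty_l2;

    /* AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1 */
    double amat = hit_time_l1 + miss_rate_l1 * miss_penalty_l1;

    /* Global L2 miss rate = Miss Rate_L1 x (local) Miss Rate_L2 */
    double global_miss_l2 = miss_rate_l1 * local_miss_l2;

    printf("L1 miss penalty     = %.2f cycles\n", miss_penalty_l1);
    printf("AMAT                = %.2f cycles\n", amat);
    printf("global L2 miss rate = %.4f\n", global_miss_l2);
    return 0;
}

With these particular numbers the global L2 miss rate is 0.04 × 0.25 = 0.01, i.e. only 1% of all CPU accesses go on to main memory.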
  • 59. 59 2nd Miss Penalty Reduction Technique: Giving Priority to Read Misses over Writes If a system has a write buffer, writes can be delayed to come after reads. The system must, however, be careful to check the write buffer to see if the value being read is about to be written.
  • 60. 60 Write buffer • Write-back caches want a buffer to hold displaced (dirty) blocks – e.g., a read miss that replaces a dirty block – Normal approach: write the dirty block to memory, then do the read – Instead, copy the dirty block to a write buffer, do the read, and then do the write – The CPU stalls less since it can restart as soon as the read is done • Write-through caches want write buffers, which create RAW conflicts with main-memory reads on cache misses – Simply waiting for the write buffer to empty can increase the read miss penalty substantially (by about 50% on the old MIPS 1000) – Check the write-buffer contents before the read; if there is no conflict, let the memory access continue
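A minimal sketch of the check described above (illustrative; the buffer depth and names are assumptions): on a read miss, scan the write buffer for a matching address and forward the buffered value instead of waiting for the whole buffer to drain.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define WB_ENTRIES 4                 /* write-buffer depth (assumed) */

struct wb_entry {
    bool     valid;
    uint64_t addr;
    uint64_t data;
};

static struct wb_entry write_buffer[WB_ENTRIES];

/* Called on a read miss before going to memory. Returns true and forwards
   the buffered data if the address is still sitting in the write buffer
   (a RAW conflict); otherwise the read may go to memory without waiting
   for the whole buffer to drain. */
bool forward_from_write_buffer(uint64_t addr, uint64_t *data_out)
{
    for (int i = 0; i < WB_ENTRIES; i++) {
        if (write_buffer[i].valid && write_buffer[i].addr == addr) {
            *data_out = write_buffer[i].data;
            return true;
        }
    }
    return false;
}

int main(void)
{
    uint64_t v;
    write_buffer[0] = (struct wb_entry){ .valid = true, .addr = 0x1000, .data = 42 };
    if (forward_from_write_buffer(0x1000, &v))
        printf("RAW conflict: forwarded %llu from the write buffer\n",
               (unsigned long long)v);
    return 0;
}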
  • 61. 61 3rd Miss Penalty Reduction Technique: Critical Word First & Early Restart  Don't wait for the full block to be loaded before restarting the CPU  Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while the rest of the block is filled. Also called wrapped fetch and requested word first  Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution  Generally useful only with large blocks  Because of spatial locality, the next access often wants the next sequential word, so it is not clear how much early restart actually helps
  • 62. 62 Example: Critical Word First Assume a 64-byte cache block; the L2 takes 11 clock cycles to return the critical 8 bytes (as in the AMD Athlon) and then 2 clock cycles per 8 bytes to fetch the rest of the block, and there are no other accesses to the rest of the block. Calculate the average miss penalty with critical word first. Then, assuming the following instructions read data sequentially 8 bytes at a time from the rest of the block, compare the times with and without critical word first.
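A sketch of the answer, under the stated assumptions: with critical word first, the miss penalty is the 11 clock cycles needed for the critical 8 bytes. Without it, the processor waits for the whole 64-byte block, 11 + (8 - 1) × 2 = 25 clock cycles. If, however, the following instructions immediately read the remaining 8-byte words in sequence, critical word first gains little, because those reads must still wait roughly the full 25 cycles for the rest of the block to arrive; the benefit depends on the processor having work to do that does not touch the rest of the block.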
  • 63. 63 4th Miss Penalty Reduction Technique: Merging Write Buffer Replace one-word writes with multiword writes, improving the write buffer's efficiency. In a write-through cache, when a write arrives and the buffer already contains other modified blocks, the addresses can be checked to see whether the address of the new data matches the address of a valid write-buffer entry; if so, the new data are merged into that entry. The optimization also reduces stalls due to the write buffer being full.
  • 64. 64 Merging Write Buffer When storing to a block that is already pending in the write buffer, update that write-buffer entry instead of allocating a new one  Reduces stalls due to a full write buffer  Do not apply merging to I/O addresses [Figure: write-buffer contents without merging vs. with merging]
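A minimal sketch of the merging logic (illustrative only; the 4-entry buffer, 4-words-per-entry layout, and names are assumptions): each entry covers one aligned group of words with per-word valid bits, and a store whose address falls inside an existing entry is merged into it rather than taking a new entry.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define WB_ENTRIES      4
#define WORDS_PER_ENTRY 4        /* one entry covers 4 x 8-byte words (assumed) */
#define WORD_BYTES      8

struct wb_entry {
    bool     used;
    uint64_t base;                      /* address of word 0 of the entry */
    bool     valid[WORDS_PER_ENTRY];    /* which words hold data          */
    uint64_t data[WORDS_PER_ENTRY];
};

static struct wb_entry wb[WB_ENTRIES];

/* Returns true if the store was buffered (merged or newly allocated),
   false if the buffer is full and the CPU would have to stall. */
bool write_buffer_store(uint64_t addr, uint64_t value)
{
    uint64_t base = addr & ~(uint64_t)(WORDS_PER_ENTRY * WORD_BYTES - 1);
    int      word = (int)((addr - base) / WORD_BYTES);

    /* First try to merge into an existing entry for the same block. */
    for (int i = 0; i < WB_ENTRIES; i++) {
        if (wb[i].used && wb[i].base == base) {
            wb[i].valid[word] = true;
            wb[i].data[word]  = value;
            return true;                      /* merged: no new entry used */
        }
    }
    /* Otherwise allocate a free entry. */
    for (int i = 0; i < WB_ENTRIES; i++) {
        if (!wb[i].used) {
            wb[i].used        = true;
            wb[i].base        = base;
            wb[i].valid[word] = true;
            wb[i].data[word]  = value;
            return true;
        }
    }
    return false;                             /* buffer full: stall */
}

int main(void)
{
    /* Four sequential one-word stores end up in a single entry. */
    for (uint64_t a = 0x100; a < 0x120; a += WORD_BYTES)
        write_buffer_store(a, a);
    printf("entry 0 used: %d, entry 1 used: %d\n", wb[0].used, wb[1].used);
    return 0;
}

In this small test, the four one-word stores occupy a single entry instead of four, which is exactly the effect the figure illustrates.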
  • 65. 65 5th Miss Penalty Reduction Technique: Victim Caches A victim cache is a small (usually, though not necessarily, fully associative) cache that holds a few of the most recently replaced blocks, or victims, from the main cache. It is checked on a miss, before going to the next lower level of memory (main memory), to see whether it holds the desired block. If so, the victim block and a cache block are swapped. The AMD Athlon has a victim cache (a write buffer for write-back blocks) with 8 entries.
  • 67. 67 How to combine a victim cache? How do we keep the fast hit time of a direct-mapped cache yet still avoid conflict misses? Add a small buffer to hold data discarded from the cache. Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of the conflict misses of a 4 KB direct-mapped data cache. Used in Alpha and HP machines. [Figure: direct-mapped cache (tags and data) backed by a four-entry fully associative victim cache, each entry holding one cache line of data with its own tag and comparator, sitting between the cache and the next lower level of the hierarchy]
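A sketch of the lookup path just described (illustrative; the cache size, FIFO replacement, and names are assumptions): on a miss in the direct-mapped cache the victim cache is searched associatively, a victim hit swaps the two blocks, and a full miss pushes the displaced block into the victim cache.

#include <stdbool.h>
#include <stdint.h>

#define CACHE_SETS     64    /* direct-mapped main cache (assumed size) */
#define VICTIM_ENTRIES 4     /* Jouppi-style 4-entry victim cache       */

struct line { bool valid; uint64_t tag; /* data omitted for brevity */ };

static struct line cache[CACHE_SETS];
static struct line victim[VICTIM_ENTRIES];   /* tags hold full block addresses */
static int victim_next;                      /* simple FIFO replacement        */

/* Returns true on a hit in either the main cache or the victim cache. */
bool access_block(uint64_t block_addr)
{
    unsigned set = (unsigned)(block_addr % CACHE_SETS);
    uint64_t tag = block_addr / CACHE_SETS;

    if (cache[set].valid && cache[set].tag == tag)
        return true;                                 /* main-cache hit */

    for (int i = 0; i < VICTIM_ENTRIES; i++) {
        if (victim[i].valid && victim[i].tag == block_addr) {
            /* Victim hit: swap the victim block with the block in the set. */
            struct line evicted = cache[set];
            cache[set].valid = true;
            cache[set].tag   = tag;
            if (evicted.valid)
                victim[i].tag = evicted.tag * CACHE_SETS + set;
            else
                victim[i].valid = false;
            return true;                             /* slow hit, but no memory access */
        }
    }

    /* Miss everywhere: fetch from the next level and push the old block
       into the victim cache instead of discarding it. */
    if (cache[set].valid) {
        victim[victim_next].valid = true;
        victim[victim_next].tag   = cache[set].tag * CACHE_SETS + set;
        victim_next = (victim_next + 1) % VICTIM_ENTRIES;
    }
    cache[set].valid = true;
    cache[set].tag   = tag;
    return false;
}

int main(void)
{
    /* Two blocks that conflict in the same set: the repeat access to the
       first one misses in the direct-mapped cache but hits in the victim cache. */
    access_block(5); access_block(5 + CACHE_SETS);
    return access_block(5) ? 0 : 1;   /* victim-cache hit -> exit code 0 */
}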
  • 68. 68 How to Improve Cache Performance? Reduce hit time (4)  Small and simple first-level caches, way prediction  Avoiding address translation, trace cache Increase bandwidth (3)  Pipelined caches, multibanked caches, nonblocking caches Reduce miss penalty (5)  Multilevel caches, giving read misses priority over writes  Critical word first, merging write buffers, and victim caches Reduce miss rate (4)  Larger block size, larger cache size, higher associativity  Compiler optimizations Reduce miss penalty or miss rate via parallelism (1)  Hardware or compiler prefetching AMAT = Hit Time + Miss Rate × Miss Penalty
  • 69. 69 Where do misses come from?  Classifying misses: the 3 Cs Compulsory: the very first access to a block cannot be in the cache, so the block must be brought in. Also called cold-start misses or first-reference misses. (Misses that occur even in an infinite cache.) Capacity: if the cache cannot contain all the blocks needed during execution of a program, capacity misses occur because blocks are discarded and later retrieved. (Misses in a fully associative cache of size X.) Conflict: if the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) occur because a block can be discarded and later retrieved when too many blocks map to its set. Also called collision misses or interference misses. (Misses in an N-way associative cache of size X.)  A 4th "C": Coherence, misses caused by cache-coherence invalidations in multiprocessors.
  • 70. 70 3Cs Absolute Miss Rate (SPEC92) [Figure: absolute miss rate (0.00 to 0.10) vs. cache size (4 KB to 512 KB), broken into compulsory, capacity, and conflict components for 1-way, 2-way, 4-way, and 8-way associativity]
  • 71. 71 3Cs Relative Miss Rate [Figure: miss rate normalized to 100% vs. cache size (4 KB to 512 KB), showing the relative share of compulsory, capacity, and conflict misses for 1-way through 8-way associativity] Flaws: assumes a fixed block size. Good: the insight leads to invention.
  • 72. 72 Reducing Cache Miss Rate To reduce cache miss rate, we have to eliminate some of the misses due to the three C's. We cannot reduce capacity misses much except by making the cache larger. We can, however, reduce the conflict misses and compulsory misses in several ways:
  • 73. 73 Cache Organization?  Assume total cache size not changed:  What happens if: 1) Change Block Size: 2) Change Associativity: 3) Change Compiler: Which of 3Cs is obviously affected?
  • 74. 74 1st Miss Rate Reduction Technique: Larger Block Size (fixed size & associativity) Larger blocks decrease the compulsory miss rate by taking advantage of spatial locality. Drawback: the miss-rate curve is U-shaped  Larger blocks may increase the miss penalty by requiring more data to be fetched per miss  They will almost certainly increase conflict misses, since fewer blocks can be stored in the cache  And possibly even capacity misses in small caches Trade-off  Try to minimize both the miss rate and the miss penalty  The best block size depends on both the latency and the bandwidth of the lower-level memory
  • 75. 75 Miss rate vs. block size, by cache size
  Block size |     4K      16K      64K     256K
         16  |   8.57%    3.94%    2.04%    1.09%
         32  |   7.24%    2.87%    1.35%    0.70%
         64  |   7.00%    2.64%    1.06%    0.51%
        128  |   7.78%    2.77%    1.02%    0.49%
        256  |   9.51%    3.29%    1.15%    0.49%
  • 76. 76 The miss-rate curve is U-shaped [Figure: miss rate (0% to 25%) vs. block size (16 to 256 bytes) for 1 KB to 256 KB caches; larger blocks first reduce compulsory misses, then increase conflict misses] What else drives up block size?
  • 77. 77 Example: Larger Block Size (C-26) Assume memory takes 80 clock cycles of overhead and then delivers 16 bytes every 2 clock cycles, and that the hit time is 1 clock cycle independent of block size, so the miss penalty is 80 + 2 × (block size / 16). Which block size has the smallest AMAT for each cache size in the miss-rate table above (Fig. 5.17)? Answer, for the two end points: AMAT(16-byte block, 4 KB) = 1 + (8.57% × 82) = 8.027 cycles; AMAT(256-byte block, 256 KB) = 1 + (0.49% × 112) = 1.549 cycles.
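The full table on the next slide can be reproduced from the miss rates above; here is a small sketch that does so (the miss-rate values are copied from the table, and the miss penalty follows the 80 + 2 × block size / 16 assumption stated in the example):

#include <stdio.h>

int main(void)
{
    int    block_sizes[] = {16, 32, 64, 128, 256};
    /* Miss rates (%) from the table; rows = block size, cols = 4K,16K,64K,256K. */
    double miss_rate[5][4] = {
        {8.57, 3.94, 2.04, 1.09},
        {7.24, 2.87, 1.35, 0.70},
        {7.00, 2.64, 1.06, 0.51},
        {7.78, 2.77, 1.02, 0.49},
        {9.51, 3.29, 1.15, 0.49},
    };

    printf("block  penalty    4K     16K    64K    256K\n");
    for (int b = 0; b < 5; b++) {
        /* 80-cycle overhead, then 2 cycles per 16 bytes delivered. */
        double penalty = 80.0 + 2.0 * block_sizes[b] / 16.0;
        printf("%5d  %7.0f", block_sizes[b], penalty);
        for (int c = 0; c < 4; c++) {
            double amat = 1.0 + (miss_rate[b][c] / 100.0) * penalty; /* hit time = 1 */
            printf("  %6.3f", amat);
        }
        printf("\n");
    }
    return 0;
}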
  • 78. 78 2nd Miss Rate Reduction Technique: Larger Caches Rule of thumb: doubling the cache size cuts the miss rate by about 25%. What does it reduce? [Figure: miss rate per type (0 to 0.14) vs. cache size (1 to 128 KB) for 1-way through 8-way associativity; the capacity component shrinks as the cache grows while the compulsory component stays fixed]
  • 79. 79 Pros vs. cons of large caches Pro:  reduces capacity misses Con:  longer hit time, higher cost; the AMAT curve is U-shaped Popular as off-chip caches. AMAT (clock cycles) by block size and cache size:
  Block size | Miss penalty |     4K      16K      64K     256K
         16  |      82      |   8.027    4.231    2.673    1.894
         32  |      84      |   7.082    3.411    2.134    1.588
         64  |      88      |   7.160    3.323    1.933    1.449
        128  |      96      |   8.469    3.659    1.979    1.470
        256  |     112      |  11.651    4.685    2.288    1.549
  • 80. 80 3rd Miss Rate Reduction Technique: Higher Associativity Conflict misses can be a problem for caches with low associativity (especially direct-mapped). Higher associativity reduces conflict misses and therefore the miss rate. 2:1 cache rule of thumb: a direct-mapped cache of size N has about the same miss rate as a 2-way set-associative cache of size N/2. For practical purposes, eight-way set associative is as effective at reducing misses for these cache sizes as fully associative.
  • 81. 81 Associativity [Figure: miss rate (0.00 to 0.10) vs. cache size (4 KB to 512 KB) for 1-way through 8-way associativity, split into compulsory, capacity, and conflict components, illustrating the 2:1 rule of thumb]
  • 82. 82 Associativity vs. Cycle Time Beware: execution time is the only final measure! Why is cycle time tied to hit time? Will the clock cycle time increase?  Hill [1988] suggested the hit time of a 2-way cache vs. a 1-way cache is about +10% for an external cache and +2% for an internal one  This argues for big and "dumb" (direct-mapped) caches
  • 83. 83 AMAT vs. Miss Rate (P430) Example: assume the clock cycle time (CCT) is 1.36× for 2-way, 1.44× for 4-way, and 1.52× for 8-way, relative to the direct-mapped CCT. AMAT in direct-mapped clock cycles:
  Cache size (KB) | 1-way   2-way   4-way   8-way
         4        |  3.44    3.25    3.22    3.28
         8        |  2.69    2.58    2.55    2.62
        16        |  2.33    2.40    2.46    2.53
        32        |  2.06    2.30    2.37    2.45
        64        |  1.92    2.24    2.18    2.25
       128        |  1.52    1.84    1.92    2.00
       256        |  1.32    1.66    1.74    1.82
       512        |  1.20    1.55    1.59    1.66
  (Entries where A.M.A.T. is not improved by more associativity were shown in red on the original slide.)
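A sketch of how such an entry is computed, in direct-mapped clock cycles: AMAT = Hit Time × CCT factor + Miss Rate × Miss Penalty. The CCT factors are those on the slide; the hit time, miss penalty, and miss rates below are assumptions chosen only to illustrate the calculation.

#include <stdio.h>

int main(void)
{
    /* Clock-cycle-time factors relative to direct mapped (from the slide). */
    double cct[4] = {1.00, 1.36, 1.44, 1.52};           /* 1-, 2-, 4-, 8-way */

    /* Assumed parameters for illustration only. */
    double hit_time     = 1.0;                          /* cycles, direct mapped */
    double miss_penalty = 25.0;                         /* cycles                */
    double miss_rate[4] = {0.098, 0.076, 0.071, 0.070}; /* e.g. a small cache    */

    for (int w = 0; w < 4; w++) {
        double amat = hit_time * cct[w] + miss_rate[w] * miss_penalty;
        printf("%d-way: AMAT = %.2f direct-mapped cycles\n", 1 << w, amat);
    }
    return 0;
}

The pattern in the table comes out of this arithmetic: higher associativity lowers the miss rate, but the larger CCT factor raises the hit-time term, so beyond a certain cache size more associativity no longer improves AMAT.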
  • 84. 84 4th Miss Rate Reduction Technique: Compiler Optimizations These techniques reduce miss rates without any hardware changes, by having the compiler reorder code and data. Instructions  Reorder procedures in memory to reduce conflict misses  Use profiling to find conflicts (with tools developed for the purpose) Data  Merging arrays: improve spatial locality by using a single array of compound elements instead of two separate arrays  Loop interchange: change the nesting of loops to access data in the order it is stored in memory  Loop fusion: combine two independent loops that have the same loop bounds and overlapping variables  Blocking: improve temporal locality by accessing "blocks" of data repeatedly instead of walking down whole rows or columns
  • 85. 85 a. Merging Arrays Combine independent arrays into a single compound array, improving spatial locality. Example:
/* before */
int val[SIZE];
int key[SIZE];
/* after */
struct merge {
    int val;
    int key;
};
struct merge merged_array[SIZE];
  • 86. 86 b. Loop Interchange
/* Before */
for (k = 0; k < 100; k = k+1)
    for (j = 0; j < 100; j = j+1)
        for (i = 0; i < 5000; i = i+1)
            x[i][j] = 2 * x[i][j];
/* After */
for (k = 0; k < 100; k = k+1)
    for (i = 0; i < 5000; i = i+1)
        for (j = 0; j < 100; j = j+1)
            x[i][j] = 2 * x[i][j];
Sequential accesses instead of striding through memory every 100 words. By switching the order in which the loops execute, misses are reduced through improved spatial locality.
  • 87. 87 c. Loop Fusion By fusing the code into a single loop, the data fetched into the cache can be reused before being swapped out, improving temporal locality. Example:
/* before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        d[i][j] = a[i][j] * c[i][j];
/* after */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] * c[i][j];
    }
  • 88. 88 d. Unoptimized Matrix Multiplication
/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        r = 0;
        for (k = 0; k < N; k = k+1)
            r = r + y[i][k] * z[k][j];
        x[i][j] = r;
    }
The two inner loops:  write N elements of one row of x[ ]  read N elements of one row of y[ ] repeatedly  read all N×N elements of z[ ] Capacity misses are a function of N and the cache size: ((N + N)·N + N)·N = 2N³ + N² words accessed for N³ operations (assuming no conflict misses; otherwise more). Idea: compute on a B×B submatrix that fits in the cache.
  • 89. 89 Blocking-optimized Matrix Multiplication Matrix multiplication is performed on submatrices (blocks) first.
/* After */
for (jj = 0; jj < N; jj = jj+B)
    for (kk = 0; kk < N; kk = kk+B)
        for (i = 0; i < N; i = i+1)
            for (j = jj; j < min(jj+B-1,N); j = j+1) {
                r = 0;
                for (k = kk; k < min(kk+B-1,N); k = k+1)
                    r = r + y[i][k] * z[k][j];
                x[i][j] = x[i][j] + r;
            }
y benefits from spatial locality, z benefits from temporal locality. The words accessed drop from 2N³ + N² to about 2N³/B + N²: ((B·N + B·N) + B²) × (N/B)² = 2N³/B + N² accesses for N³ operations. B is called the blocking factor.
  • 90. 90 Reducing Conflict Misses by Blocking Conflict misses in caches that are not fully associative vs. blocking factor. Lam et al. [1991]: a blocking factor of 24 had one-fifth the misses of a factor of 48, even though both fit in the cache. [Figure: miss rate (0 to 0.1) vs. blocking factor (0 to 150) for a direct-mapped cache and a fully associative cache]
  • 91. 91 Summary of Compiler Optimizations to Reduce Cache Misses (applied by hand) [Figure: performance improvement (1× to 3×) from merged arrays, loop interchange, loop fusion, and blocking on compress, cholesky (nasa7), spice, mxm (nasa7), btrix (nasa7), tomcatv, gmty (nasa7), and vpenta (nasa7)]
  • 92. 92 5th Miss Rate Reduction Technique: Way Prediction and Pseudo-Associative Cache These two techniques reduce conflict misses while keeping the hit speed of a direct-mapped cache: way prediction (predictor bits) and the pseudo-associative cache. Way prediction (Alpha 21264)  Extra bits are kept in the cache to predict the way, or block within the set, of the next cache access  If the prediction is correct, the instruction-cache latency is 1 clock cycle  If not, it tries the other block, changes the way predictor, and has a latency of 3 clock cycles  Simulations with SPEC95 suggested prediction accuracy in excess of 85%, so way prediction saves a pipeline stage in more than 85% of instruction fetches. (A lookup sketch follows.)
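A minimal sketch of a way-predicted lookup (illustrative only; the two-way organization, one predictor entry per set, and the 1-cycle/3-cycle latencies follow the slide, everything else is assumed):

#include <stdbool.h>
#include <stdint.h>

#define SETS 256
#define WAYS 2

struct line { bool valid; uint64_t tag; };

static struct line cache[SETS][WAYS];
static uint8_t     way_predictor[SETS];   /* one predicted way per set */

/* Returns the access latency in cycles: 1 on a correctly predicted hit,
   3 on a hit in the other way, and (3 + miss_penalty) on a miss. */
int access_with_way_prediction(uint64_t block_addr, int miss_penalty)
{
    unsigned set       = (unsigned)(block_addr % SETS);
    uint64_t tag       = block_addr / SETS;
    unsigned predicted = way_predictor[set];

    /* First probe only the predicted way (fast, direct-mapped-like hit). */
    if (cache[set][predicted].valid && cache[set][predicted].tag == tag)
        return 1;

    /* Mispredicted: probe the other way and retrain the predictor. */
    unsigned other = 1 - predicted;
    if (cache[set][other].valid && cache[set][other].tag == tag) {
        way_predictor[set] = (uint8_t)other;
        return 3;
    }

    /* Miss: fill the predicted way (a simple policy) and pay the miss penalty. */
    cache[set][predicted].valid = true;
    cache[set][predicted].tag   = tag;
    return 3 + miss_penalty;
}

int main(void)
{
    int first  = access_with_way_prediction(1234, 20);  /* miss: 3 + 20 cycles  */
    int second = access_with_way_prediction(1234, 20);  /* predicted hit: 1 cycle */
    return (first == 23 && second == 1) ? 0 : 1;
}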
  • 93. 93 Pseudo-Associative Cache (column associative) How do we combine the fast hit time of a direct-mapped cache with the lower conflict misses of a 2-way set-associative cache? Divide the cache: on a miss, check the other half of the cache to see whether the block is there; if so, we have a pseudo-hit (a slow hit). Drawback: pipelining the CPU is hard if a hit can take 1 or 2 cycles, so the technique is better for caches not tied directly to the processor (L2). Used in the MIPS R10000 L2 cache; similar in UltraSPARC. [Timeline: hit time < pseudo-hit time < miss penalty]
  • 94. 94 Pseudo-Associative Cache [Figure, labels translated from Chinese: CPU with address and data-in/data-out paths, a write buffer, and main memory; the address is split into a block-frame address (tag <21>, index <8>) and a block offset <5>; each cache entry holds a valid bit <1>, tag <22>, and data <256>; a tag comparison (=?) and a 4:1 MUX select the data, with the access steps labeled 1, 2, 3, 2*]
  • 95. 95 How to Improve Cache Performance? Reduce hit time (4)  Small and simple first-level caches, way prediction  Avoiding address translation, trace cache Increase bandwidth (3)  Pipelined caches, multibanked caches, nonblocking caches Reduce miss penalty (5)  Multilevel caches, giving read misses priority over writes  Critical word first, merging write buffers, and victim caches Reduce miss rate (4)  Larger block size, larger cache size, higher associativity  Compiler optimizations Reduce miss penalty or miss rate via parallelism (1)  Hardware or compiler prefetching AMAT = Hit Time + Miss Rate × Miss Penalty
  • 96. 96 Hardware Prefetching Fetch two blocks on a miss: the requested block plus the next sequential block  Example: Pentium 4 hardware prefetching
  • 97. 97 Compiler Prefetching Insert prefetch instructions before the data are needed  Non-faulting: a prefetch does not cause exceptions  Register prefetch: loads data into a register  Cache prefetch: loads data into the cache  Combine with loop unrolling and software pipelining
  • 98. 98 Use HBM to Extend the Hierarchy A cache of 128 MiB to 1 GiB  Smaller blocks require substantial tag storage  Larger blocks are potentially inefficient One approach (L-H): each SDRAM row is a block index; each row contains a set of tags and 29 data segments, giving a 29-way set-associative organization; a hit requires a CAS
  • 99. 99 Use HBM to Extend the Hierarchy Another approach (Alloy cache): mold the tag and data together and use a direct-mapped organization. Both schemes require two DRAM accesses on a miss. Two solutions:  use a map to keep track of cached blocks  predict likely misses
  • 100. 100 Use HBM to Extend Hierarchy
  • 102. 102 1st Miss Penalty/Rate Reduction Technique: Hardware Prefetching of Instructions and Data The act of getting data from memory before it is actually needed by the CPU. This reduces compulsory misses by retrieving the data before they are requested. Of course, it may increase other misses by displacing useful blocks from the cache, so many caches hold prefetched blocks in a special buffer until they are actually needed. E.g., instruction prefetching:  the Alpha 21064 fetches 2 blocks on a miss  the extra block is placed in a "stream buffer"  on a miss, the stream buffer is checked first (see the sketch below) Prefetching relies on having extra memory bandwidth that can be used without penalty.
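A minimal sketch of the stream-buffer idea (illustrative only; the one-block buffer, tiny direct-mapped cache, and names are assumptions): on a cache miss the stream buffer is checked first, and a stream-buffer hit promotes the block into the cache while the next sequential block is prefetched.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SETS 64

/* A tiny direct-mapped cache plus a one-block stream buffer. */
static struct { bool valid; uint64_t tag;   } cache[SETS];
static struct { bool valid; uint64_t block; } sb;
static int memory_fetches;                  /* count traffic to memory */

static bool cache_hit(uint64_t b) {
    return cache[b % SETS].valid && cache[b % SETS].tag == b / SETS;
}
static void cache_fill(uint64_t b) {
    cache[b % SETS].valid = true;
    cache[b % SETS].tag   = b / SETS;
}
static void memory_fetch(uint64_t b) { (void)b; memory_fetches++; }

/* On a cache miss, check the stream buffer before going to memory; on a
   stream-buffer hit, promote the block and prefetch the next one. */
void access_with_stream_buffer(uint64_t b)
{
    if (cache_hit(b))
        return;

    if (sb.valid && sb.block == b) {
        cache_fill(b);
        sb.block = b + 1;
        memory_fetch(sb.block);             /* prefetch uses spare bandwidth */
        return;
    }

    memory_fetch(b);                        /* demand fetch */
    cache_fill(b);
    memory_fetch(b + 1);                    /* prefetch next sequential block */
    sb.valid = true;
    sb.block = b + 1;
}

int main(void)
{
    for (uint64_t b = 0; b < 8; b++)        /* a sequential stream of blocks */
        access_with_stream_buffer(b);
    printf("memory fetches: %d\n", memory_fetches);
    return 0;
}

Run on a sequential stream of eight blocks, the prefetches turn every access after the first into a cache or stream-buffer hit.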
  • 103. 103 2nd Miss Penalty/Rate Reduction Technique: Compiler-Controlled Prefetch The compiler inserts prefetch instructions to request the data before they are needed  Register prefetch: load data into a register (HP PA-RISC loads)  Cache prefetch: load into the cache (MIPS IV, PowerPC, SPARC v9)  Special prefetching instructions cannot cause faults; this is a form of speculative execution Prefetching comes in two flavors:  Binding prefetch: loads directly into a register; the address and register must be correct!  Non-binding prefetch: loads into the cache; it can be incorrect, and it must not fault Issuing prefetch instructions takes time  Is the cost of issuing prefetches less than the savings from reduced misses?  Wider superscalar issue reduces the difficulty of finding issue bandwidth for prefetches
  • 104. 104 Example (P307): Compiler-controlled prefetch
for (i = 0; i < 3; i = i+1)
    for (j = 0; j < 100; j = j+1)
        a[i][j] = b[j][0] * b[j+1][0];
Assume 16-byte blocks and 8-byte elements, so 2 elements per block. a[i][j]: 3 × 100 accesses; the even values of j miss and the odd values hit, giving 150 misses. b[j][0] and b[j+1][0]: 101 distinct elements, and the same elements are accessed on each iteration of i, so misses occur only for i = 0 (j = 0 misses on b[0][0] and b[1][0]; each later j adds one new miss on b[j+1][0]), for a total of 2 + 99 = 101 misses. Overall: 150 + 101 = 251 misses.
  • 105. 105 Example cont.: Compiler-controlled prefetch
for (j = 0; j < 100; j = j+1) {
    prefetch(b[j+7][0]);     /* b element needed 7 iterations later */
    prefetch(a[0][j+7]);     /* a element needed 7 iterations later */
    a[0][j] = b[j][0] * b[j+1][0];
}
for (i = 1; i < 3; i = i+1)
    for (j = 0; j < 100; j = j+1) {
        prefetch(a[i][j+7]);
        a[i][j] = b[j][0] * b[j+1][0];
    }
The first loop still suffers 7 misses for b and 4 misses for a[0][ ]; the second loop suffers 4 misses each for a[1][ ] and a[2][ ]. Total: 19 misses, saving 232 cache misses at the price of executing 400 prefetch instructions.

Editor's Notes

  • #5: Similar to a DRAM write, a DRAM read can be either an early read or a late read. In the early read cycle, Output Enable is asserted before CAS, so the data lines contain valid data one read access time after the CAS line has gone low. In the late read cycle, Output Enable is asserted after CAS, so the data are not available on the data lines until one read access time after OE is asserted. Once again, notice that the RAS line has to remain asserted during the entire time. The DRAM read cycle time is defined as the time between two RAS pulses; notice that it is much longer than the read access time. Q: RAS and CAS at the same time? Yes, both must be low.
  • #69: Intuitive Model by Mark Hill
  • #73: Ask which of the 3Cs is affected. Block size: 1) compulsory misses; 2) more subtle, it will also change the mapping.