2. 2
Memory Technology and Optimizations
Performance metrics
Latency is the primary concern of caches
Bandwidth is the primary concern of multiprocessors and I/O
Access time
Time between read request and when desired word arrives
Cycle time
Minimum time between unrelated requests to memory
SRAM memory has low latency, use for cache
Organize DRAM chips into many banks for high
bandwidth, use for main memory
3. 3
Memory Technology
SRAM
Requires low power to retain bit
Requires 6 transistors/bit
DRAM
Must be re-written after being read
Must also be periodically refreshed
Every ~8 ms (roughly 5% of the time)
All bits in a row are refreshed simultaneously
One transistor/bit
Address lines are multiplexed:
Upper half of address: row access strobe (RAS)
Lower half of address: column access strobe (CAS)
4. 4
DRAM logical organization
(64 Mbit)
(Figure: the multiplexed address A0…A13 (half the address bits, the square root of the number of bits, go to each of RAS and CAS) feeds a 14-bit address buffer, a row decoder, and a column decoder; the memory array (16,384×16,384) is read through sense amps & I/O; each storage cell sits at the crossing of a word line and a bit line, with data in (D) and data out (Q).)
5. 5
DRAM Read Timing
Every DRAM access begins with the assertion of RAS_L
2 ways to read: early or late relative to CAS
(Figure: read timing for a 256K x 8 DRAM, showing RAS_L, CAS_L, WE_L, OE_L, the multiplexed row/column address, and the data bus. Early read cycle: OE_L is asserted before CAS_L. Late read cycle: OE_L is asserted after CAS_L. The data bus stays at High Z until one read access time (or output-enable delay) has elapsed; the DRAM read cycle time, measured between RAS_L pulses, is much longer than the read access time.)
9. 9
Memory Technology
Amdahl:
Memory capacity should grow linearly with processor speed
Unfortunately, memory capacity and speed have not kept pace
with processors
Some optimizations:
Multiple accesses to same row
Synchronous DRAM
Added clock to DRAM interface
Burst mode with critical word first
Wider interfaces
Double data rate (DDR)
Multiple banks on each DRAM device
10. 10
1st
Improving DRAM Performance
Fast Page Mode DRAM (FPM)
Timing signals that allow repeated accesses to the row
buffer (page) without another row access time
Such a buffer comes naturally, as each array will buffer
1024 to 2048 bits for each access.
Page: All bits on the same ROW (Spatial Locality)
Don’t need to wait for wordline to recharge
Toggle CAS with new column address
11. 11
2nd
Improving DRAM Performance
Synchronous DRAM
conventional DRAMs have an asynchronous interface to
the memory controller, and hence every transfer
involves overhead to synchronize with the controller.
The solution was to add a clock signal to the DRAM
interface, so that the repeated transfers would not bear
that overhead.
Data output is in bursts w/ each element clocked
12. 12
3rd
Improving DRAM Performance
DDR -- Double data rate
The DRAM innovation to increase bandwidth is to transfer data on
both the rising edge and the falling edge of the DRAM clock
signal, thereby doubling the peak data rate.
(Supply voltage by generation: 2.5 V -> 1.8 V -> 1.5 V)
13. 13
DDR--Double data rate
DDR:
DDR2
Lower power (2.5 V -> 1.8 V)
Higher clock rates (266 MHz, 333 MHz, 400 MHz)
DDR3
1.5 V
800 MHz
DDR4
1-1.2 V
1333 MHz
GDDR5 is graphics memory based on DDR3
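A rough sanity check on these rates (my own sketch, not from the slides): peak DIMM bandwidth is about bus clock × 2 transfers per clock × bus width. The clock rates below are illustrative values for a 64-bit DIMM.

#include <stdio.h>

/* Rough peak-bandwidth sketch for a 64-bit (8-byte) DIMM:
   DDR transfers data on both clock edges, so
   peak bytes/s = clock_mhz * 1e6 * 2 * bus_bytes. */
int main(void) {
    const double bus_bytes = 8.0;                 /* 64-bit DIMM */
    const double clocks_mhz[] = {200, 400, 667};  /* example DDR/DDR2/DDR3 bus clocks */
    const char *names[] = {"DDR-400", "DDR2-800", "DDR3-1333"};
    for (int i = 0; i < 3; i++) {
        double gbs = clocks_mhz[i] * 1e6 * 2 * bus_bytes / 1e9;
        printf("%s: ~%.1f GB/s peak\n", names[i], gbs);
    }
    return 0;
}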
14. 14
4th
Improving DRAM Performance
New DRAM Interface: RAMBUS (RDRAM)
a type of synchronous dynamic RAM, designed by the
Rambus Corporation.
Each chip has interleaved memory and a high speed interface.
Protocol based RAM w/ narrow (16-bit) bus
High clock rate (400 MHz), but long latency
Pipelined operation
Multiple arrays w/ data transferred on both edges of clock
The first generation RAMBUS interface dropped RAS/CAS,
replacing it with a bus that allows other accesses over the bus
between the sending of the address and return of the data. It is
typically called RDRAM.
The second-generation RAMBUS interface includes separate row-
and column-command buses instead of the conventional
multiplexing, and a much more sophisticated controller on chip.
Because of the separation of data, row, and column buses, three
transactions can be performed simultaneously. It is called Direct
RDRAM or DRDRAM.
17. 17
Comparing RAMBUS and DDR SDRAM
Since most computers use memory in DIMM
packages, which are typically at least 64 bits wide,
the DIMM memory bandwidth is closer to what
RAMBUS provides than you might expect when just
comparing DRAM chips.
Note that cache performance is based in part
on the latency to the first byte and in part on the
bandwidth to deliver the rest of the bytes in the
block.
Although these innovations help with the latter
case, none help with latency.
Amdahl’s Law reminds us of the limits of
accelerating one piece of the problem while ignoring
another part.
18. 18
Memory Optimizations
Reducing power in SDRAMs:
Lower voltage
Low power mode (ignores clock, continues to
refresh)
Graphics memory:
Achieve 2-5 X bandwidth per DRAM vs. DDR3
Wider interfaces (32 vs. 16 bit)
Higher clock rate
Possible because they are attached via soldering instead of socketed
DIMM modules
19. 19
Summary
Memory organization
Wider memory
Simple interleaved memory
Independent memory banks
Avoiding Memory Bank Conflicts
Memory chip
Fast Page Mode DRAM
Synchronous DRAM
Double Data Rate
RDRAM
22. 22
Flash Memory
Type of EEPROM
Types: NAND (denser) and NOR (faster)
NAND Flash:
Reads are sequential; a read returns an entire page (0.5 to 4 KiB)
25 µs for the first byte, 40 MiB/s for subsequent bytes
SDRAM: 40 ns for the first byte, 4.8 GB/s for subsequent bytes
2 KiB transfer: 75 µs vs. 500 ns for SDRAM, about 150X
slower
300 to 500X faster than magnetic disk
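A quick arithmetic check of the 2 KiB transfer comparison above, simply plugging in the latency and bandwidth figures from this slide (a sketch):

#include <stdio.h>

/* Transfer time ~= first-byte latency + size / streaming bandwidth. */
int main(void) {
    double size = 2048.0;                                  /* 2 KiB */
    double flash = 25e-6 + size / (40.0 * 1024 * 1024);    /* 25 us + 40 MiB/s */
    double sdram = 40e-9 + size / 4.8e9;                   /* 40 ns + 4.8 GB/s */
    printf("NAND flash: %.1f us\n", flash * 1e6);          /* ~75 us */
    printf("SDRAM:      %.0f ns\n", sdram * 1e9);          /* ~500 ns */
    printf("ratio:      ~%.0fx\n", flash / sdram);         /* on the order of 150x */
    return 0;
}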
23. 23
NAND Flash Memory
Must be erased (in blocks) before being
overwritten
Nonvolatile, can use as little as zero power
Limited number of write cycles (~100,000)
$2/GiB, compared to $20-40/GiB for SDRAM
and $0.09 GiB for magnetic disk
Phase-Change / Memristor Memory
Possibly 10X improvement in write performance
and 2X improvement in read performance
24. 24
Memory Dependability
Memory is susceptible to cosmic rays
Soft errors: dynamic errors
Detected and fixed by error correcting codes (ECC)
Hard errors: permanent errors
Use spare rows to replace defective rows
Chipkill: a RAID-like error recovery technique
25. 25
How to Improve Cache Performance?
Reduce hit time(4)
Small and simple first-level caches, Way prediction
avoiding address translation, Trace cache
Increase bandwidth(3)
Pipelined caches, multibanked caches, non-blocking caches
Reduce miss penalty(5)
multilevel caches, read miss prior to writes,
Critical word first, merging write buffers, and victim caches
Reduce miss rate(4)
larger block size, large cache size, higher associativity
Compiler optimizations
Reduce miss penalty or miss rate via parallelization (1)
Hardware or compiler prefetching
AMAT = Hit Time + Miss Rate × Miss Penalty
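The AMAT formula is easy to turn into code; this tiny helper is my own sketch with made-up parameter values, just to exercise the formula:

#include <stdio.h>

/* AMAT = hit time + miss rate * miss penalty (times in clock cycles). */
static double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}

int main(void) {
    /* Illustrative numbers only: 1-cycle hit, 5% miss rate, 100-cycle penalty. */
    printf("AMAT = %.1f cycles\n", amat(1.0, 0.05, 100.0));   /* 6.0 */
    return 0;
}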
26. 26
1st
Hit Time Reduction Technique:
Small and Simple Caches
Use a small, direct-mapped cache
The less hardware that is necessary to implement a
cache, the shorter the critical path through the
hardware.
Direct-mapped is faster than set associative for both
reads and writes.
Fitting the cache on the chip with the CPU is also very
important for fast access times.
27. 27
L1 Size and Associativity
Access time vs. size and associativity
28. 28
L1 Size and Associativity
Energy per read vs. size and associativity
30. 30
2nd
Hit Time Reduction Technique:
Way Prediction
Way Prediction (Pentium 4 )
Extra bits are kept in the cache to predict the way, or
block within the set, of the next cache access.
If the predictor is correct, the instruction cache
latency is 1 clock cycle.
If not, it tries the other block, changes the way
predictor, and has a latency of 1 extra clock cycle.
Simulation using SPEC95 suggested set prediction
accuracy in excess of 85%, so way prediction saves a
pipeline stage in more than 85% of the instruction
fetches.
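A back-of-the-envelope view of what the 85% prediction accuracy buys, assuming (as above) 1 cycle on a correct prediction and 1 extra cycle on a misprediction (a sketch):

#include <stdio.h>

/* Average hit latency with way prediction:
   correct prediction -> 1 cycle, misprediction -> 2 cycles total. */
int main(void) {
    double accuracy = 0.85;
    double avg = accuracy * 1.0 + (1.0 - accuracy) * 2.0;
    printf("average hit time = %.2f cycles\n", avg);   /* 1.15 cycles */
    return 0;
}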
31. 31
3rd
Hit Time Reduction Technique:
Avoiding Address Translation during Indexing of the Cache
Page table is a large data structure in memory
TWO memory accesses for every load, store, or
instruction fetch!!!
Virtually addressed cache?
synonym problem
Cache the address translations?
(Figure: the CPU sends a virtual address (VA) to a translation unit; the resulting physical address (PA) goes to the cache, which returns data on a hit and goes to main memory on a miss.)
32. 32
TLBs
A way to speed up translation is to use a special cache of
recently used page table entries -- this has many names,
but the most frequently used is Translation Lookaside
Buffer or TLB
(TLB entry: Virtual Address | Physical Address | Dirty | Ref | Valid | Access)
Really just a cache on the page table mappings
TLB access time comparable to cache access time
(much less than main memory access time)
33. 33
Translation Look-Aside Buffers
Just like any other cache, the TLB can be organized as fully associative,
set associative, or direct mapped
TLBs are usually small, typically not more than 128 - 256 entries even on
high end machines. This permits fully associative lookup on these
machines. Most mid-range machines use small n-way set associative
organizations.
(Figure: translation with a TLB. The CPU sends the VA to a TLB lookup; on a TLB hit the PA goes to the cache (data on a cache hit, main memory on a miss), and on a TLB miss the full page-table translation is performed. Indicative access times in the figure: TLB lookup ~1/2 t, cache ~t, main memory ~20 t.)
34. 34
Fast hits by Avoiding Address Translation
(Figure: three organizations.
1. Conventional organization: CPU -> TB -> $ -> MEM; the VA is translated to a PA before the cache is accessed.
2. Virtually addressed cache: CPU -> $ -> TB -> MEM; the cache uses VA tags and translation happens only on a miss; suffers from the synonym problem.
3. Virtually indexed, physically tagged: the cache and TB are accessed in parallel, overlapping $ access with VA translation; this requires the $ index to remain invariant across translation, and the tags are physical (PA); an L2 $ sits below.)
35. 35
Virtual Addressed Cache
Send virtual address to cache? Called Virtually
Addressed Cache or
just Virtual Cache (vs. Physical Cache)
Every time process is switched logically must flush the cache;
otherwise get false hits
Cost is time to flush + “compulsory” misses from empty cache
Add process identifier tag that identifies process as well as address
within process: can’t get a hit if wrong process
Any Problems ?
37. 37
Dealing with aliases
Dealing with aliases (synonyms); Two different virtual
addresses map to same physical address
NO aliasing! What are the implications?
HW antialiasing: guarantees every cache block has unique
address
verify on miss (rather than on every hit)
cache set size <= page size ?
what if it gets larger?
How can SW simplify the problem? (called page coloring)
I/O must interact with cache, so need virtual address
38. 38
Aliases problem with Virtual cache
(Figure: two aliases A1 and A2, each divided into tag | index | offset fields.)
If the index and offset bits of two aliases are forced to be
the same, then the aliased addresses will map to the same block in the cache.
39. 39
Overlap address translation and cache access
(Virtual indexed, physically tagged)
(Figure: the virtual page number is translated by the TLB into the physical page number, which supplies the tag for the comparison (=), while the page offset bits, unchanged by translation, supply the cache index and block offset in parallel.)
Any limitation ?
40. 40
What’s the limitation?
(Figure: the same virtually indexed, physically tagged lookup as on the previous slide.)
If it is a direct-mapped cache, then
  Cache size = 2^index × 2^blockoffset <= 2^pageoffset
How to solve this problem?
Use higher associativity. With a 4-way set-associative cache, the cache size can
reach 4 × 2^pageoffset with no change to the index, tag, or offset fields.
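The constraint can be coded directly: the index and block offset must fit within the page offset, so the cache can grow only with associativity. A minimal sketch, assuming 8 KB pages:

#include <stdio.h>

/* With a virtually indexed, physically tagged cache the index and block
   offset must come from the page offset, so:
   cache size <= associativity * page size. */
static long max_vipt_cache(long page_size, int associativity) {
    return (long)associativity * page_size;
}

int main(void) {
    long page = 8 * 1024;                                   /* assume 8 KB pages */
    printf("direct mapped: up to %ld KB\n", max_vipt_cache(page, 1) / 1024);
    printf("4-way:         up to %ld KB\n", max_vipt_cache(page, 4) / 1024);
    return 0;
}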
41. 41
Example: virtually indexed, physically tagged cache
Virtual address width = 43 bits, physical address width = 38 bits, page size = 8 KB.
Cache capacity = 16 KB, organized as a 2-way set-associative, write-back cache with 32-byte blocks.
(Figure: the VA splits into a 30-bit VPN and a 13-bit page offset, which supplies an 8-bit index and a 5-bit block offset. The TLB maps the 30-bit VPN to a 25-bit PPN, which is compared (=) against the 25-bit tags of two 256-entry cache banks (bank0, bank1); each entry holds V, D, a 25-bit tag, and 256 bits of data.)
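Recomputing the field widths in this example (a sketch that just rederives the numbers in the figure): 8 KB pages give a 13-bit page offset; 16 KB / (2 ways × 32-byte blocks) = 256 sets, so 8 index bits and 5 block-offset bits; the tag is 38 - 13 = 25 bits and the VPN is 43 - 13 = 30 bits.

#include <stdio.h>
#include <math.h>

int main(void) {
    int va_bits = 43, pa_bits = 38;
    long page = 8 * 1024, cache = 16 * 1024, block = 32;
    int ways = 2;

    int page_offset  = (int)log2(page);                  /* 13 */
    int block_offset = (int)log2(block);                 /* 5  */
    int index        = (int)log2(cache / (ways * block));/* 8  */
    printf("VPN = %d bits, PPN/tag = %d bits\n",
           va_bits - page_offset, pa_bits - page_offset); /* 30, 25 */
    printf("index = %d bits, block offset = %d bits\n", index, block_offset);
    return 0;
}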
42. 42
4th
Hit Time Reduction Technique:
Trace caches
Find a dynamic sequence of instructions including
taken branches to load into a cache block.
The block is determined by the CPU instead of by memory
layout.
Complicated address mapping mechanism
43. 43
Why Trace Cache ?
Bring N instructions per cycle
No I-cache misses
No prediction miss
No packet breaks !
Because there is a branch roughly every 5 instructions, a conventional
cache can provide only one fetch packet per cycle.
44. 44
What’s Trace ?
Trace: dynamic instruction sequence
When instructions ( operations ) retire from the
pipeline, pack the instruction segments into
TRACE, and store them in the TRACE cache,
including the branch instructions.
Though a branch may go to a different target, most of
the time the next dynamic sequence is the same as it
was the last time through. (locality)
45. 45
Who proposed it?
Peleg & Weiser (1994), Intel Corporation
Patel / Patt ( 1996)
Rotenberg / J. Smith (1996)
Paper: ISCA’98
48. 48
Pentium 4:
trace cache, 12 instr. per cycle
(Figure: Pentium 4 front end. The PC and branch predictor drive instruction fetch; hits are supplied by the on-chip trace cache, while misses go to the instruction cache and decoder. A fill unit packs retired trace segments back into the trace cache. Instructions then pass through rename to the execution core and data cache, backed by the L2 cache, memory, and disk.)
49. 49
How to Improve Cache Performance?
Reduce hit time(4)
Small and simple first-level caches, Way prediction
avoiding address translation, Trace cache
Increase bandwidth(3)
Pipelined caches, multibanked caches, non-blocking caches
Reduce miss penalty(5)
multilevel caches, read miss prior to writes,
Critical word first, merging write buffers, and victim caches
Reduce miss rate(4)
larger block size, large cache size, higher associativity
Compiler optimizations
Reduce miss penalty or miss rate via parallelization (1)
Hardware or compiler prefetching
AMAT = Hit Time + Miss Rate × Miss Penalty
52. 52
2nd
Increasing cache bandwidth:
Multibanked Caches
Organize cache as independent banks to support
simultaneous access
ARM Cortex-A8 supports 1-4 banks for L2
Intel i7 supports 4 banks for L1 and 8 banks for L2
Banking works best when accesses naturally spread
themselves across the banks; the mapping of addresses to
banks affects the behavior of the memory system.
Interleave banks according to block address. A simple
mapping that works well is “sequential interleaving” (see the sketch below).
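Sequential interleaving is just a modulo mapping from block address to bank; a minimal sketch (the bank count of 4 is an illustrative assumption):

#include <stdio.h>

/* Sequential interleaving: consecutive block addresses map to
   consecutive banks, i.e. bank = block address mod number of banks. */
int main(void) {
    unsigned num_banks = 4;                        /* e.g. four L1 banks */
    for (unsigned block_addr = 0; block_addr < 8; block_addr++)
        printf("block %u -> bank %u (index %u within the bank)\n",
               block_addr, block_addr % num_banks, block_addr / num_banks);
    return 0;
}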
53. 53
(Figure: a single-banked cache compared with two-bank organizations using consecutive interleaving and group interleaving.)
54. 54
3rd
Increasing cache bandwidth:
Nonblocking Caches
Allow hits before previous misses complete
“Hit under miss” , “ Hit under multiple miss”
L2 must support this
In general, processors can hide L1 miss penalty but not L2 miss
penalty
Nonblocking, in conjunction with out-of-order execution, can allow
the CPU to continue executing instructions after a data cache miss.
55. 55
How to Improve Cache Performance?
Reduce hit time(4)
Small and simple first-level caches, Way prediction
avoiding address translation, Trace cache
Increase bandwidth(3)
Pipelined caches, multibanked caches, non-blocking caches
Reduce miss penalty(5)
multilevel caches, read miss prior to writes,
Critical word first, merging write buffers, and victim caches
Reduce miss rate(4)
larger block size, large cache size, higher associativity
Compiler optimizations
Reduce miss penalty or miss rate via parallelization (1)
Hardware or compiler prefetching
AMAT = Hit Time + Miss Rate × Miss Penalty
56. 56
1st
Miss Penalty Reduction Technique:
Multilevel Caches
This method focuses on the interface between the
cache and main memory.
Add a second-level cache between main memory and a
small, fast first-level cache, to make the cache hierarchy
both fast and large.
The smaller first-level cache is fast enough to match
the clock cycle time of the fast CPU and to fit on the
chip with the CPU, thereby lessening the hit time.
The second-level cache can be large enough to capture
many memory accesses that would go to main memory,
thereby lessening the effective miss penalty.
57. 57
Parameter about Multilevel cache
The performance of a two-level cache is calculated in a similar way to
the performance for a single level cache.
L2 Equations
AMAT = Hit TimeL1 + Miss RateL1 x Miss PenaltyL1
Miss PenaltyL1 = Hit TimeL2 + Miss RateL2 x Miss PenaltyL2
AMAT = Hit TimeL1 +
Miss RateL1 x (Hit TimeL2 + Miss RateL2 * Miss PenaltyL2)
Miss rateL1 = MissesL1 / Memory accesses
Miss rateL2 = MissesL2 / MissesL1
So the miss penalty for level 1 is calculated using the hit
time, miss rate, and miss penalty for the level 2 cache.
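Plugging hypothetical numbers into the L2 equations above (the parameter values are made up for illustration):

#include <stdio.h>

int main(void) {
    /* Assumed, illustrative parameters (clock cycles / miss ratios). */
    double hit_l1 = 1, miss_rate_l1 = 0.04;
    double hit_l2 = 10, miss_rate_l2 = 0.50, penalty_l2 = 100;

    double penalty_l1 = hit_l2 + miss_rate_l2 * penalty_l2;   /* 60 cycles  */
    double amat = hit_l1 + miss_rate_l1 * penalty_l1;         /* 3.4 cycles */
    printf("Miss penalty L1 = %.1f cycles, AMAT = %.1f cycles\n",
           penalty_l1, amat);
    return 0;
}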
58. 58
Two conceptions for two-level cache
Definitions:
Local miss rate— misses in this cache divided by the
total number of memory accesses to this cache
(Miss rateL2)
Global miss rate—misses in this cache divided by the
total number of memory accesses generated by the
CPU
Using the terms above, the global miss rate for the first-level cache is
still just Miss rateL1, but for the second-level cache it is:
Global miss rateL2 = MissesL2 / Memory accesses
                   = (MissesL1 / Memory accesses) × (MissesL2 / MissesL1)
                   = Miss rateL1 × Miss rateL2
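A small numeric sketch of the difference, with made-up counts: suppose 1000 memory references produce 40 L1 misses and 20 L2 misses; then the local L2 miss rate is 20/40 = 50%, while the global L2 miss rate is 20/1000 = 2% = Miss rateL1 × Miss rateL2.

#include <stdio.h>

int main(void) {
    double refs = 1000, misses_l1 = 40, misses_l2 = 20;   /* assumed counts */
    printf("L1 miss rate        = %.0f%%\n", 100 * misses_l1 / refs);      /* 4%  */
    printf("local  L2 miss rate = %.0f%%\n", 100 * misses_l2 / misses_l1); /* 50% */
    printf("global L2 miss rate = %.0f%%\n", 100 * misses_l2 / refs);      /* 2%  */
    return 0;
}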
59. 59
2nd
Miss Penalty Reduction Technique:
Giving Priority to Read Misses over Writes
If a system has a write buffer, writes can be delayed
to come after reads.
The system must, however, be careful to check the
write buffer to see if the value being read is about to
be written.
60. 60
Write buffer
• Write-back caches want the buffer to hold displaced (dirty) blocks
– Read miss replacing a dirty block
– Normal: write the dirty block to memory, and then do the read
– Instead: copy the dirty block to a write buffer, then do the
read, and then do the write
– The CPU stalls less, since it restarts as soon as the read is done
• Write-through caches want write buffers => RAW conflicts
with main memory reads on cache misses
– If we simply wait for the write buffer to empty, we might
increase the read miss penalty (by 50% on the old MIPS 1000)
– Check write buffer contents before the read;
if there are no conflicts, let the memory access continue
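In rough C, the check described above looks something like this; the structures and names are invented for illustration, not taken from any real design:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define WB_ENTRIES  4
#define BLOCK_BYTES 32

/* One write-buffer entry: a pending block write to memory. */
struct wb_entry {
    bool     valid;
    uint64_t block_addr;
    uint8_t  data[BLOCK_BYTES];
};

static struct wb_entry write_buffer[WB_ENTRIES];

/* On a read miss, check the write buffer before going to memory:
   if the requested block is still waiting to be written, service the
   read from the buffer instead of reading a stale copy from memory. */
bool read_miss_check_wb(uint64_t block_addr, uint8_t *out) {
    for (int i = 0; i < WB_ENTRIES; i++) {
        if (write_buffer[i].valid && write_buffer[i].block_addr == block_addr) {
            memcpy(out, write_buffer[i].data, BLOCK_BYTES);
            return true;            /* serviced from the write buffer */
        }
    }
    return false;                   /* no conflict: let the memory read proceed */
}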
61. 61
3rd
Miss Penalty Reduction Technique:
Critical Word First & Early Restart
Don’t wait for full block to be loaded before
restarting CPU
Critical Word First—Request the missed word first
from memory and send it to the CPU as soon as it
arrives; let the CPU continue execution while filling
the rest of the words in the block. Also called
wrapped fetch and requested word first
Early restart—As soon as the requested word of
the block arrives, send it to the CPU and let the
CPU continue execution
Generally useful only with large blocks
Spatial locality => programs tend to want the next sequential word,
so it is not clear how much benefit early restart gives
62. 62
Example: Critical Word First
Assume:
cache block = 64-byte
L2: takes 11 clock cycles to get the critical 8 bytes (AMD Athlon),
and then 2 clock cycles per 8 bytes to fetch the rest of the
block
There will be no other accesses to rest of the block
Calculate the average miss penalty for critical word first.
Then assuming the following instructions read data
sequentially 8 bytes at a time from the rest of the block
Compare the times with and without critical word first.
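One way to work the example (my own derivation from the parameters above, so treat it as a sketch): the critical 8 bytes arrive after 11 cycles, and the remaining 7 chunks take 2 cycles each, so the whole 64-byte block is resident after 11 + 7 × 2 = 25 cycles in either scheme; critical word first changes only how soon the first, requested word is available.

#include <stdio.h>

int main(void) {
    int block = 64, chunk = 8;
    int first = 11, per_chunk = 2;                  /* cycles, from the slide */
    int chunks = block / chunk;                     /* 8 */

    int cwf_first_word  = first;                             /* 11 cycles */
    int full_block_time = first + (chunks - 1) * per_chunk;  /* 25 cycles */
    printf("critical word available after %d cycles\n", cwf_first_word);
    printf("full block loaded after %d cycles (with or without CWF)\n",
           full_block_time);
    return 0;
}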
63. 63
4th
Miss Penalty Reduction Technique:
Merging Write Buffer
Replacing one-word writes with multiword writes improves the
buffer's efficiency.
In a write-through cache, when a write misses and the buffer
contains other modified blocks, the addresses can be checked
to see if the address of the new data matches the address of a
valid write buffer entry. If so, the new data are combined with
that entry.
The optimization also reduces stalls due to the write
buffer being full.
64. 64
Merging Write Buffer
When storing to a block that is already pending in the
write buffer, update write buffer
Reduces stalls due to full write buffer
Do not apply to I/O addresses
(Figure: write buffer contents without and with write merging.)
65. 65
5th
Miss Penalty Reduction Technique:
Victim Caches
A victim cache is a small (usually, but not
necessarily) fully-associative cache that holds a
few of the most recently replaced blocks or
victims from the main cache.
This cache is checked on a miss, before going to the next
lower-level memory (main memory), to see if it has the desired item.
If found, the victim block and the cache block are swapped.
The AMD Athlon has a victim cache (a write buffer for
written-back blocks) with 8 entries.
67. 67
How to combine victim Cache ?
How to combine fast hit
time of direct mapped
yet still avoid conflict
misses?
Add buffer to place data
discarded from cache
Jouppi [1990]: 4-entry
victim cache removed 20%
to 95% of conflicts for a 4
KB direct mapped data
cache
Used in Alpha, HP machines
(Figure: victim cache organization: a small set of fully associative entries, each with a tag-and-comparator and one cache line of data, placed between the cache and the next lower level in the hierarchy.)
68. 68
How to Improve Cache Performance?
Reduce hit time(4)
Small and simple first-level caches, Way prediction
avoiding address translation, Trace cache
Increase bandwidth(3)
Pipelined caches, multibanked caches, non-blocking caches
Reduce miss penalty(5)
multilevel caches, read miss prior to writes,
Critical word first, merging write buffers, and victim caches
Reduce miss rate(4)
larger block size, large cache size, higher associativity
Compiler optimizations
Reduce miss penalty or miss rate via parallelization (1)
Hardware or compiler prefetching
AMAT = Hit Time + Miss Rate × Miss Penalty
69. 69
Where misses come from?
Classifying Misses: 3 Cs
Compulsory—The first access to a block is not in the cache, so the block
must be brought into the cache. Also called cold start misses or first
reference misses.
(Misses in even an Infinite Cache)
Capacity—If the cache cannot contain all the blocks needed during
execution of a program, capacity misses will occur due to blocks
being discarded and later retrieved.
(Misses in Fully Associative Size X Cache)
Conflict—If block-placement strategy is set associative or direct
mapped, conflict misses (in addition to compulsory & capacity
misses) will occur because a block can be discarded and later
retrieved if too many blocks map to its set. Also called collision
misses or interference misses.
(Misses in N-way Associative, Size X Cache)
4th “C”:
Coherence - Misses caused by cache coherence.
70. 70
3Cs Absolute Miss Rate (SPEC92)
(Figure: absolute miss rate (0.00 to 0.10) vs. cache size (4 KB to 512 KB) for 1-way, 2-way, 4-way, and 8-way associativity, with capacity and compulsory components.)
71. 71
3Cs Relative Miss Rate
(Figure: relative miss rate (0% to 100%) vs. cache size (4 KB to 512 KB) for 1-way, 2-way, 4-way, and 8-way associativity, with capacity and compulsory components.)
Flaws: for fixed block size
Good: insight => invention
72. 72
Reducing Cache Miss Rate
To reduce cache miss rate, we have to eliminate
some of the misses due to the three C's.
We cannot reduce capacity misses much except
by making the cache larger.
We can, however, reduce the conflict misses and
compulsory misses in several ways:
73. 73
Cache Organization?
Assume total cache size not changed:
What happens if:
1) Change Block Size:
2) Change Associativity:
3) Change Compiler:
Which of 3Cs is obviously affected?
74. 74
1st Miss Rate Reduction Technique:
Larger Block Size (fixed size&assoc)
Larger blocks decrease the compulsory miss
rate by taking advantage of spatial locality.
Drawback--curve is U-shaped
However, they may increase the miss penalty by requiring more
data to be fetched per miss.
In addition, they will almost certainly increase conflict misses
since fewer blocks can be stored in the cache.
And maybe even capacity misses in small caches
Trade-off
Trying to minimize both the miss rate and the miss penalty.
The selection of block size depends on both the latency and bandwidth
of lower-level memory
76. 76
Performance curve is U-shaped
(Figure: miss rate (0% to 25%) vs. block size (16 to 256 bytes) for cache sizes of 1K to 256K; larger blocks reduce compulsory misses but eventually increase conflict misses.)
What else drives up block size?
77. 77
Example: Larger Block Size (C-26)
Assume: memory takes 80 clock cycles of overhead
and then delivers 16 bytes every 2 cycles.
1 clock cycle hit time independent of block size.
Which block size has the smallest AMAT for each size
in Fig.5.17 ?
Answer:
AMAT16-byte block, 4KB = 1+(8.57%*82)=8.027
AMAT256-byte block, 256KB= 1+(0.49%*112)=1.549
78. 78
2nd
Miss Rate Reduction Technique:
Larger Caches
rule of thumb: 2 x size => 25% cut in miss rate
What does it reduce ?
(Figure: miss rate per type (0 to 0.14) vs. cache size (1 KB to 128 KB) for 1-way through 8-way associativity, with capacity and compulsory components.)
79. 79
Pros vs. cons for large caches
Pro.
Reduce capacity misses
Con.
Longer hit time, Higher cost, AMAT curve is U-shaped
Popular in off-chip caches
Block size   Miss penalty          AMAT by cache size
(bytes)      (clock cycles)    4K       16K      64K      256K
16           82                8.027    4.231    2.673    1.894
32           84                7.082    3.411    2.134    1.588
64           88                7.160    3.323    1.933    1.449
128          96                8.469    3.659    1.979    1.470
256          112               11.651   4.685    2.288    1.549
80. 80
3rd
Miss Rate Reduction Technique:
Higher Associativity
Conflict misses can be a problem for caches
with low associativity (especially direct-
mapped).
Higher associativity decreases conflict misses and so
improves the miss rate.
Cache rule of thumb:
2:1 rule of thumb: a direct-mapped cache of size N has
about the same miss rate as a 2-way set-associative cache of
size N/2.
Eight-way set associative is, for practical purposes, as
effective in reducing misses for caches of these sizes as
fully associative.
81. 81
Associativity
(Figure: miss rate (0.00 to 0.10) vs. cache size (4 KB to 512 KB) for 1-way, 2-way, 4-way, and 8-way associativity, with capacity and compulsory components; the gaps between the associativity curves are the conflict misses, illustrating the 2:1 rule of thumb.)
82. 82
Associativity vs Cycle Time
Beware: Execution time is only final measure!
Why is cycle time tied to hit time?
Will Clock Cycle time increase?
Hill [1988] suggested hit time for 2-way vs. 1-way
external cache +10%,
internal + 2%
suggested big and dumb caches
83. 83
AMAT vs. Miss Rate (P430)
Example: assume CCT = 1.36 for 2-way, 1.44 for 4-way,
1.52 for 8-way vs. CCT direct mapped
Cache Size Associativity
(KB) 1-way 2-way 4-way 8-way
4 3.44 3.25 3.22 3.28
8 2.69 2.58 2.55 2.62
16 2.33 2.40 2.46 2.53
32 2.06 2.30 2.37 2.45
64 1.92 2.24 2.18 2.25
128 1.52 1.84 1.92 2.00
256 1.32 1.66 1.74 1.82
512 1.20 1.55 1.59 1.66
(Red means A.M.A.T. not improved by more associativity)
84. 84
4th Miss Rate Reduction Technique:
Compiler Optimizations
These techniques reduce miss rates without any hardware
changes, by having the compiler reorder instructions and data.
Instructions
Reorder procedures in memory so as to reduce conflict misses
Profiling to look at conflicts(using tools they developed)
Data
Merging Arrays: improve spatial locality by single array of
compound elements vs. 2 arrays
Loop Interchange: change nesting of loops to access data in
order stored in memory
Loop Fusion: Combine 2 independent loops that have same
looping and some variables overlap
Blocking: Improve temporal locality by accessing “blocks” of
data repeatedly vs. going down whole columns or rows
85. 85
a. Merging Arrays
Combining independent matrices into a single compound
array.
Improving spatial locality
Example
/* Before */
int val[SIZE];
int key[SIZE];
/* After */
struct merge {
  int val;
  int key;
};
struct merge merged_array[SIZE];
86. 86
b. Loop Interchange
/* Before */
for (k = 0; k < 100; k = k+1)
for (j = 0; j < 100; j = j+1)
for (i = 0; i < 5000; i = i+1)
x[i][j] = 2 * x[i][j];
/* After */
for (k = 0; k < 100; k = k+1)
for (i = 0; i < 5000; i = i+1)
for (j = 0; j < 100; j = j+1)
x[i][j] = 2 * x[i][j];
Sequential accesses instead of striding through
memory every 100 words;
By switching the order in which loops execute, misses can
be reduced due to improvements in spatial locality.
87. 87
c. Loop fusion
By fusing the code into a single loop, the data that are
fetched into the cache can be used repeatedly before
being swapped out.
Improving the temporal locality
Example:
/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    d[i][j] = a[i][j] * c[i][j];
/* After */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
  {
    a[i][j] = 1/b[i][j] * c[i][j];
    d[i][j] = a[i][j] * c[i][j];
  }
88. 88
d. Unoptimized Matrix Multiplication
/* Before */
for (i = 0; i < N; i = i+1)
for (j = 0; j < N; j = j+1)
{r = 0;
for (k = 0; k < N; k = k+1)
r = r + y[i][k]*z[k][j];
x[i][j] = r;
};
Two inner loops:
  Write N elements of 1 row of x[ ]
  Read N elements of 1 row of y[ ] repeatedly
  Read all N×N elements of z[ ]
Capacity misses are a function of N and cache size:
  ((N + N) × N + N) × N = 2N^3 + N^2 words accessed for N^3 operations
  (assuming no conflict misses; otherwise more)
Idea: compute on a B×B submatrix that fits in the cache
(Figure: access patterns of x[i][j], y[i][k], and z[k][j] in the unblocked code.)
89. 89
Blocking optimized Matrix Multiplication
Matrix multiplication is
performed by multiplying the
submatrices first.
/* After */
for (jj = 0; jj < N; jj = jj+B)
for (kk = 0; kk < N; kk = kk+B)
for (i = 0; i < N; i = i+1)
for (j = jj; j < min(jj+B-1,N); j = j+1)
{r = 0;
for (k = kk; k < min(kk+B-1,N); k = k+1)
r = r + y[i][k]*z[k][j];
x[i][j] = x[i][j] + r;
};
Y benefits from spatial locality; Z benefits from temporal locality
Capacity misses drop from 2N^3 + N^2 to about 2N^3/B + N^2:
  (BN + BN + B^2) × (N/B)^2 = 2N^3/B + N^2 words accessed for N^3 operations
B is called the blocking factor
(Figure: blocked access pattern: B×N strips of x and y and a B×B submatrix of z.)
90. 90
Reducing Conflict Misses by Blocking
Conflict misses in caches that are not fully associative depend on the blocking factor:
Lam et al. [1991] found that a blocking factor of 24 had one-fifth the
misses of a factor of 48, even though both blocks fit in the cache.
(Figure: miss rate (0 to 0.1) vs. blocking factor (0 to 150) for a fully associative cache and a direct-mapped cache.)
92. 92
5th Miss Rate Reduction Technique:
Way Prediction and Pseudo-Associative Cache
Two techniques reduce conflict misses and yet
maintain the hit speed of a direct-mapped cache:
Predictive bit - Pseudo-Associative
Way Prediction (Alpha 21264)
Extra bits are kept in the cache to predict the way, or block within
the set, of the next cache access.
If the predictor is correct, the instruction cache latency is 1 clock
cycle.
If not, it tries the other block, changes the way predictor, and has a
latency of 3 clock cycles.
Simulation using SPEC95 suggested set prediction accuracy in excess
of 85%, so way prediction saves a pipeline stage in more than 85% of
the instruction fetches.
93. 93
Pseudo-Associative Cache
(column associative)
How to combine fast hit time of Direct Mapped and have
the lower conflict misses of 2-way SA cache?
Divide the cache: on a miss, check the other half of the cache to see if
the block is there; if so, it is a pseudo-hit (slow hit)
Drawback: the CPU pipeline is hard to design if a hit takes 1 or 2 cycles
Better for caches not tied directly to the processor (L2)
Used in the MIPS R10000 L2 cache, similar in UltraSPARC
(Figure: time line showing hit time < pseudo hit time < miss penalty.)
95. 95
How to Improve Cache Performance?
Reduce hit time(4)
Small and simple first-level caches, Way prediction
avoiding address translation, Trace cache
Increase bandwidth(3)
Pipelined caches, multibanked caches, non-blocking caches
Reduce miss penalty(5)
multilevel caches, read miss prior to writes,
Critical word first, merging write buffers, and victim caches
Reduce miss rate(4)
larger block size, large cache size, higher associativity
Compiler optimizations
Reduce miss penalty or miss rate via parallelization (1)
Hardware or compiler prefetching
AMAT = Hit Time + Miss Rate × Miss Penalty
97. 97
Compiler Prefetching
Insert prefetch instructions before data is needed
Non-faulting: prefetch doesn’t cause exceptions
Register prefetch
Loads data into register
Cache prefetch
Loads data into cache
Combine with loop unrolling and software
pipelining
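As a concrete, compiler-specific sketch of a cache prefetch: GCC and Clang expose the non-faulting __builtin_prefetch builtin; the loop and the prefetch distance of 8 iterations below are illustrative assumptions, not from the slides.

#include <stddef.h>

/* Sum an array while prefetching ahead.  __builtin_prefetch is the
   GCC/Clang cache-prefetch builtin; it does not fault.  The prefetch
   distance of 8 iterations is an arbitrary illustrative choice. */
double sum_with_prefetch(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + 8 < n)
            __builtin_prefetch(&a[i + 8]);   /* request data a few blocks early */
        s += a[i];
    }
    return s;
}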
98. 98
Use HBM to Extend Hierarchy
128 MiB to 1 GiB
Smaller blocks require substantial tag storage
Larger blocks are potentially inefficient
One approach (L-H):
Each SDRAM row is a block index
Each row contains set of tags and 29 data
segments
29-way set associative
Hit requires a CAS
99. 99
Use HBM to Extend Hierarchy
Another approach (Alloy cache):
Mold tag and data together
Use direct mapped
Both schemes require two DRAM accesses
for misses
Two solutions:
Use map to keep track of blocks
Predict likely misses
102. 102
1st Miss Penalty/Rate Reduction Technique: Hardware
Prefetching of Instructions and Data
The act of getting data from memory before it is actually
needed by the CPU.
This reduces compulsory misses by retrieving the data
before it is requested.
Of course, this may increase other misses by removing
useful blocks from the cache.
Thus, many caches hold prefetched blocks in a special
buffer until they are actually needed.
E.g., Instruction Prefetching
Alpha 21064 fetches 2 blocks on a miss
Extra block placed in “stream buffer”
On miss check stream buffer
Prefetching relies on having extra memory bandwidth that
can be used without penalty
103. 103
2nd
Miss Penalty/Rate Reduction Technique:
Compiler-controlled prefetch
The compiler inserts prefetch instructions to request the
data before they are needed
Register prefetch: loads data into a register (HP PA-RISC loads)
Cache Prefetch: load into cache (MIPS IV, PowerPC, SPARC v. 9)
Special prefetching instructions cannot cause faults; a form of
speculative execution
Prefetching comes in two flavors:
Binding prefetch: Requests load directly into register.
Must be correct address and register!
Non-Binding prefetch: Load into cache.
Can be incorrect. Faults?
Issuing Prefetch Instructions takes time
Is cost of prefetch issues < savings in reduced misses?
Higher superscalar reduces difficulty of issue bandwidth
104. 104
Example (P307):
Compiler-controlled prefetch
for( i=0; i <3; i = i +1)
for( j=0; j<100; j=j+1)
a[i][j] = b[j][0] * b[j+1][0];
16 B/block, 8 B/element, 2 elements/block
a[i][j]: 3 × 100 accesses; the even values of j miss and the
odd values hit, for a total of 150 misses
b[j][0]: 101 elements; the same elements are accessed for each
iteration of i
  j=0: b[0][0], b[1][0] -> 2 misses
  j=1: b[1][0], b[2][0] -> 1 miss
  total: 2 + 99 = 101 misses
105. 105
Example cont.:
Compiler-controlled prefetch
for (j = 0; j < 100; j = j+1) {
  prefetch(b[j+7][0]);
  prefetch(a[0][j+7]);
  a[0][j] = b[j][0] * b[j+1][0];
};                                 /* 7 misses for b, 4 misses for a[0][j] */
for (i = 1; i < 3; i = i+1)
  for (j = 0; j < 100; j = j+1) {
    prefetch(a[i][j+7]);
    a[i][j] = b[j][0] * b[j+1][0];
  };                               /* 4 misses for a[1][j], 4 misses for a[2][j] */
Total: 19 misses
This saves 232 cache misses at the price of 400 prefetch
instructions.
#5: Similar to a DRAM write, a DRAM read can also be an early read or a late read.
In the early read cycle, Output Enable is asserted before CAS, so the data lines will contain valid data one read access time after the CAS line has gone low.
In the late read cycle, Output Enable is asserted after CAS, so the data will not be available on the data lines until one read access time after OE is asserted.
Once again, notice that the RAS line has to remain asserted during the entire time. The DRAM read cycle time is defined as the time between the two RAS pulses.
Notice that the DRAM read cycle time is much longer than the read access time.
Q: RAS & CAS at same time? Yes, both must be low