
Unit-2

Memory Hierarchy Design


By
Dr. Sarvesh Vishwakarma
Professor – CSE
TCS 704 - Advanced Computer Architecture

Dr. Sarvesh Vishwakarma TCS-704 ACA


Introduction
• Programmers want unlimited amounts of memory with low
latency
• Fast memory technology is more expensive per bit than
slower memory
• Solution: organize memory system into a hierarchy
– Entire addressable memory space available in largest, slowest memory
– Incrementally smaller and faster memories, each containing a subset
of the memory below it, proceed in steps up toward the processor
• Temporal and spatial locality ensure that nearly all references
can be found in smaller memories
– Gives the illusion of a large, fast memory being presented to the
processor

Dr. Sarvesh Vishwakarma TCS-704 ACA


Principle of Locality
• Temporal Locality

• Spatial Locality

Dr. Sarvesh Vishwakarma TCS-704 ACA


Principle of Temporal Locality
❖ A program spends 90% of its execution time in only 10% of its code

❖ It is a property of the program

❖ A recently referenced word (data or instruction) is likely to be needed
again in the near future

Dr. Sarvesh Vishwakarma TCS-704 ACA


Principle of Spatial Locality
❖ Data or instructions whose addresses are near one another tend to
be referenced close together in time

❖ It is a property of the program

❖ There is a high probability that the other data in a fetched block will be
needed soon
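A short C sketch (illustrative, not from the slides) showing both kinds of locality: the running total sum is reused on every iteration (temporal locality), and the elements of a[] are visited at consecutive addresses (spatial locality).

#include <stdio.h>

#define N 1024

int main(void)
{
    int a[N];
    long sum = 0;

    for (int i = 0; i < N; i++)
        a[i] = i;

    /* Temporal locality: sum, i, and the loop code itself are reused on
     * every iteration, so they stay in the fastest levels of the hierarchy.
     * Spatial locality: a[0], a[1], a[2], ... sit at adjacent addresses, so
     * one block fetched on a miss serves many of the following accesses. */
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %ld\n", sum);
    return 0;
}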

Dr. Sarvesh Vishwakarma TCS-704 ACA


Processor-Memory Performance

[Figure: Processor-memory performance gap, 1980-2000 (log scale). Processor ("µProc") performance grows about 60% per year ("Moore's Law", 2x every 1.5 years), while DRAM performance grows only about 9% per year (2x every 10 years), so the processor-DRAM performance gap grows roughly 50% per year.]

Problem:
Improvements in access time are not enough to catch up
Solution:
Increase the bandwidth of main memory (improve throughput)
Dr. Sarvesh Vishwakarma TCS-704 ACA
Memory Hierarchy
(capacity, access time, bandwidth, staging transfer unit, and who manages each level;
upper levels are smaller and faster, lower levels are larger and slower)

Level         Capacity      Access time   Bandwidth               Transfer unit     Managed by
Registers     ~1 KB         < 0.5 ns      50,000-500,000 MB/sec   Instr. operands   Program/compiler
Cache         ~16 MB        0.5-25 ns     5,000-20,000 MB/sec     Blocks            Cache controller
Main Memory   2-8 GB        50-250 ns     2,500-10,000 MB/sec     Pages             Operating system
Disk          GB-TB         ~50,000 ns    50-500 MB/sec           Files             User/operator
Tape          "infinite"    sec-min       -                       -                 -
Dr. Sarvesh Vishwakarma TCS-704 ACA
Introduction
Memory Hierarchy

Dr. Sarvesh Vishwakarma TCS-704 ACA


Introduction
Memory Hierarchy Design
• Memory hierarchy design becomes more crucial with
recent multi-core processors:
– Aggregate peak bandwidth grows with # cores:
• Intel Core i7 can generate two data memory references per core per clock
• Four cores and 3.2 GHz clock
– 25.6 billion 64-bit data references/second +
– 12.8 billion 128-bit instruction references
– = 409.6 GB/s!
• DRAM bandwidth is only 6% of this (25 GB/s)
• Requires:
– Multi-port, pipelined caches
– Two levels of cache per core
– Shared third-level cache on chip

Dr. Sarvesh Vishwakarma TCS-704 ACA


Introduction
Performance and Power
• High-end microprocessors have >10 MB on-chip
cache
– Consumes large amount of area and power budget

Dr. Sarvesh Vishwakarma TCS-704 ACA


Introduction
Memory Hierarchy Basics
• When a word is not found in the cache, a miss
occurs:
– Fetch word from lower level in hierarchy, requiring a
higher latency reference
– Lower level may be another cache or the main memory
– Also fetch the other words contained within the block
• Takes advantage of spatial locality
– Place block into cache in any location within its set,
determined by address
• block address MOD number of sets
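A minimal C sketch of the placement rule above, with assumed parameters (64-byte blocks, 128 sets; neither value comes from the slides), splitting an address into block offset, set index, and tag:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical cache parameters: 64-byte blocks, 128 sets. */
#define BLOCK_SIZE   64u
#define NUM_SETS    128u

int main(void)
{
    uint32_t addr = 0x12345678u;

    uint32_t block_addr = addr / BLOCK_SIZE;      /* strip the block offset            */
    uint32_t set_index  = block_addr % NUM_SETS;  /* block address MOD number of sets  */
    uint32_t tag        = block_addr / NUM_SETS;  /* remaining high-order address bits */

    printf("address 0x%08x -> set %u, tag 0x%x\n", addr, set_index, tag);
    return 0;
}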

Dr. Sarvesh Vishwakarma TCS-704 ACA


Introduction
Memory Hierarchy Basics
• n blocks per set => n-way set associative
– Direct-mapped cache => one block per set
– Fully associative => one set

• Writing to cache: two strategies


– Write-through
• Immediately update lower levels of hierarchy
– Write-back
• Only update lower levels of hierarchy when an updated block is
replaced
– Both strategies use write buffer to make writes
asynchronous
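A minimal C sketch of the two write strategies, under the simplifying assumption of a single cache line and a flat next level; the structure and function names are illustrative only, not from the slides.

#include <stdbool.h>
#include <string.h>

/* Illustrative single "cache line"; a real cache tracks many lines. */
struct line {
    bool valid, dirty;
    unsigned tag;
    unsigned char data[64];
};

/* Write-through: update the line and immediately propagate the write to the
 * next level (typically via a write buffer so the processor need not wait). */
void write_through(struct line *l, unsigned offset, unsigned char byte,
                   unsigned char *next_level)
{
    l->data[offset] = byte;
    next_level[offset] = byte;          /* lower level is always up to date */
}

/* Write-back: update only the line and mark it dirty; the lower level is
 * updated later, when the dirty line is replaced. */
void write_back(struct line *l, unsigned offset, unsigned char byte)
{
    l->data[offset] = byte;
    l->dirty = true;
}

/* On eviction, a dirty block must be written back before reuse. */
void evict(struct line *l, unsigned char *next_level)
{
    if (l->valid && l->dirty)
        memcpy(next_level, l->data, sizeof l->data);
    l->valid = l->dirty = false;
}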

Dr. Sarvesh Vishwakarma TCS-704 ACA


Introduction
Memory Hierarchy Basics
• Miss rate
– Fraction of cache accesses that result in a miss

• Causes of misses
– Compulsory
• First reference to a block
– Capacity
• Blocks discarded and later retrieved
– Conflict
• Program makes repeated references to multiple addresses from
different blocks that map to the same location in the cache

Dr. Sarvesh Vishwakarma TCS-704 ACA


Introduction
Memory Hierarchy Basics

• Note that speculative and multithreaded processors


may execute other instructions during a miss
– Reduces performance impact of misses

Dr. Sarvesh Vishwakarma TCS-704 ACA


Main Memory Background
• Performance of Main Memory:
– Latency: affects cache miss penalty
• Access Time: time between request and word arrives
• Cycle Time: time between requests
– Bandwidth: primary concern for I/O & large block
• Main Memory is DRAM: Dynamic RAM
– Dynamic since needs to be refreshed periodically
– Addresses divided into 2 halves (Row/Column)
• Cache uses SRAM: Static RAM
– No refresh
• 6 transistors/bit vs. 1 transistor/bit, 10X area
– Address not divided: Full address

Dr. Sarvesh Vishwakarma TCS-704 ACA


Memory hierarchy

•Pyramid
•Technical specification of memory elements
•Concepts of cache memory
•Temporal locality
•Spatial locality
•Hit
•Miss
•Hit Rate
•Miss Rate
•Miss penalty

✓Hit rate + Miss Rate = 1

Dr. Sarvesh Vishwakarma TCS-704 ACA


Categorize Cache Miss

1.Compulsory misses
2.Capacity misses
3.Conflict misses

3 C’s Model

Dr. Sarvesh Vishwakarma TCS-704 ACA


CASE STUDY-1
The user program contains 320 instructions
Instruction Count: 320
Size of individual instruction: 1 byte
Program size: 320 bytes

Memory has 320 cells


Capacity of single cell: 1 byte or 1 instruction

Max carrying capacity of cache: 1 byte

Dr. Sarvesh Vishwakarma TCS-704 ACA


[Figure: A processor (ALU, flags, registers B, C, D, E, PC, SP) connected to a cache of capacity one cell and a main memory of cells 0-31.]
Dr. Sarvesh Vishwakarma TCS-704 ACA


CASE STUDY-1
➢Fails to exploit the principle of locality
✓Temporal locality
✓Spatial locality

➢Increase in miss rate (especially capacity misses)

Dr. Sarvesh Vishwakarma TCS-704 ACA


CASE STUDY-2
➢Tries to exploit the principle of locality
✓Temporal locality
▪ Cache size: 80 cells (80 instructions)
▪ Block size of cache (line length): 1 cell
▪ Use a bigger cache

➢Decrease in miss rate (reduction in capacity misses)

➢Drawbacks (of a bigger cache memory):
• Longer hit time
• Higher cost
• More power consumption
Dr. Sarvesh Vishwakarma TCS-704 ACA
[Figure: Temporal locality. The processor's cache now has eight cells (0-7), so it can hold multiple instructions and their data for a longer time, retaining both old and new information (data/instructions). The cache still fetches data from the 32-cell main memory only as the processor demands it.]

Dr. Sarvesh Vishwakarma TCS-704 ACA


CASE STUDY-3
➢Also exploits the second principle of locality
✓Spatial locality
▪ Cache size: 80 cells (80 instructions)
▪ Use a larger block size
▪ Block size of cache (line length): 10 cells

➢Decrease in miss rate (reduction in compulsory misses)

✓ Increases the hit rate by increasing the amount of data
brought into the cache when a cache miss occurs

➢Drawbacks of increasing the length of the cache line:

✓ Line fetch time gets longer
✓ The spatial-locality benefit weakens as the line grows

Dr. Sarvesh Vishwakarma TCS-704 ACA


[Figure: Spatial locality. The cache holds multiple instructions and their data for a longer time, and on a miss it now brings a whole chunk (block) of data from the 32-cell main memory, so the frequency of transactions between the cache and memory is reduced.]

Dr. Sarvesh Vishwakarma TCS-704 ACA


Mapping Function
Associativity
How many locations within the cache may contain a given memory address?

If the cache is partitioned into N cache blocks (lines)...

How many groups (sets) can we form within the cache by distributing those N cache blocks?

Dr. Sarvesh Vishwakarma TCS-704 ACA


Different ways of creating clusters (sets) in the cache area

If N = 8

1. 8 cache blocks can be clubbed together in a single group/set.    1 cluster/set
2. 8 cache blocks can be clubbed in groups of four blocks.          2 clusters/sets
3. 8 cache blocks can be clubbed in groups of two blocks.           4 clusters/sets
4. 8 cache blocks can be clubbed in groups of one block.            8 clusters/sets

If N = 16

1. 16 cache blocks can be clubbed together in a single group/set.   1 cluster/set
2. 16 cache blocks can be clubbed in groups of eight blocks.        2 clusters/sets
3. 16 cache blocks can be clubbed in groups of four blocks.         4 clusters/sets
4. 16 cache blocks can be clubbed in groups of two blocks.          8 clusters/sets
5. 16 cache blocks can be clubbed in groups of one block.          16 clusters/sets
Dr. Sarvesh Vishwakarma TCS-704 ACA
Associativity

Where in the cache may a referenced address be stored?
➢ in any location, or
➢ in only one location, or
➢ in a fixed number of locations?
Dr. Sarvesh Vishwakarma TCS-704 ACA
Fully Associative Caches

Associativity

[Figure: In a fully associative cache, the referenced address may be stored in ANY cache block location.]
Dr. Sarvesh Vishwakarma TCS-704 ACA
Direct-Mapped Caches

Associativity

[Figure: In a direct-mapped cache, the referenced address may be stored in ONLY ONE cache block location.]
Dr. Sarvesh Vishwakarma TCS-704 ACA
Set-Associative Caches
Associativity

[Figure: In a set-associative cache (here SET-1 and SET-2), the referenced address may be stored in a FIXED NUMBER of locations: any block within one particular set.]

There are a fixed number of locations where a given address may be stored.
The number of locations in each set is the associativity of the cache.

Dr. Sarvesh Vishwakarma TCS-704 ACA
No. of cache blocks: 8 (blocks 0-7)
No. of memory blocks: 32 (blocks 0-31)

Where to place the memory block with index field 26?
Dr. Sarvesh Vishwakarma TCS-704 ACA


(Block Address) MOD (Number of sets in cache)

There are eight groups/sets; each group contains a single cache block
(one-way set associative).

 0 MOD 8 = 0     8 MOD 8 = 0    16 MOD 8 = 0    24 MOD 8 = 0
 1 MOD 8 = 1     9 MOD 8 = 1    17 MOD 8 = 1    25 MOD 8 = 1
 2 MOD 8 = 2    10 MOD 8 = 2    18 MOD 8 = 2    26 MOD 8 = 2
 3 MOD 8 = 3    11 MOD 8 = 3    19 MOD 8 = 3    27 MOD 8 = 3
 4 MOD 8 = 4    12 MOD 8 = 4    20 MOD 8 = 4    28 MOD 8 = 4
 5 MOD 8 = 5    13 MOD 8 = 5    21 MOD 8 = 5    29 MOD 8 = 5
 6 MOD 8 = 6    14 MOD 8 = 6    22 MOD 8 = 6    30 MOD 8 = 6
 7 MOD 8 = 7    15 MOD 8 = 7    23 MOD 8 = 7    31 MOD 8 = 7

Where to place the memory block with index field 26? In set 26 MOD 8 = 2.
Dr. Sarvesh Vishwakarma TCS-704 ACA


(Block Address) MOD (Number of sets in cache)

One-way set associative

[Figure: successive frames showing memory block 26 (of memory blocks 0-31) being placed into cache block 26 MOD 8 = 2 of the eight-block, one-way set associative cache.]
Dr. Sarvesh Vishwakarma TCS-704 ACA


(Block Address) MOD (Number of sets in cache)

All eight cache blocks are distributed into groups of two cache blocks each,
forming four groups (four sets): two-way set associative. Memory blocks => 32.

 0 MOD 4 = 0     8 MOD 4 = 0    16 MOD 4 = 0    24 MOD 4 = 0
 1 MOD 4 = 1     9 MOD 4 = 1    17 MOD 4 = 1    25 MOD 4 = 1
 2 MOD 4 = 2    10 MOD 4 = 2    18 MOD 4 = 2    26 MOD 4 = 2
 3 MOD 4 = 3    11 MOD 4 = 3    19 MOD 4 = 3    27 MOD 4 = 3
 4 MOD 4 = 0    12 MOD 4 = 0    20 MOD 4 = 0    28 MOD 4 = 0
 5 MOD 4 = 1    13 MOD 4 = 1    21 MOD 4 = 1    29 MOD 4 = 1
 6 MOD 4 = 2    14 MOD 4 = 2    22 MOD 4 = 2    30 MOD 4 = 2
 7 MOD 4 = 3    15 MOD 4 = 3    23 MOD 4 = 3    31 MOD 4 = 3
Dr. Sarvesh Vishwakarma TCS-704 ACA


(Block Address) MOD (Number of sets in cache)

Two-way set associative - memory blocks that map to set '0':

 0 MOD 4 = 0     4 MOD 4 = 0     8 MOD 4 = 0    12 MOD 4 = 0
16 MOD 4 = 0    20 MOD 4 = 0    24 MOD 4 = 0    28 MOD 4 = 0
Dr. Sarvesh Vishwakarma TCS-704 ACA


(Block Address) MOD (Number of sets in cache)

Two-way set associative - memory blocks that map to set '1':

 1 MOD 4 = 1     5 MOD 4 = 1     9 MOD 4 = 1    13 MOD 4 = 1
17 MOD 4 = 1    21 MOD 4 = 1    25 MOD 4 = 1    29 MOD 4 = 1
Dr. Sarvesh Vishwakarma TCS-704 ACA


(Block Address) MOD (Number of sets in cache)

Two-way set associative - memory blocks that map to set '2':

 2 MOD 4 = 2     6 MOD 4 = 2    10 MOD 4 = 2    14 MOD 4 = 2
18 MOD 4 = 2    22 MOD 4 = 2    26 MOD 4 = 2    30 MOD 4 = 2
Dr. Sarvesh Vishwakarma TCS-704 ACA


(Block Address) MOD (Number of sets in cache)

Two-way set associative - memory blocks that map to set '3':

 3 MOD 4 = 3     7 MOD 4 = 3    11 MOD 4 = 3    15 MOD 4 = 3
19 MOD 4 = 3    23 MOD 4 = 3    27 MOD 4 = 3    31 MOD 4 = 3
Dr. Sarvesh Vishwakarma TCS-704 ACA


(Block Address) MOD (Number of sets in cache)

All eight cache blocks are distributed into groups of four cache blocks each,
forming two sets/groups in the cache: four-way set associative. Memory blocks => 32.

 0 MOD 2 = 0    16 MOD 2 = 0
 1 MOD 2 = 1    17 MOD 2 = 1
 2 MOD 2 = 0    18 MOD 2 = 0
 3 MOD 2 = 1    19 MOD 2 = 1
 4 MOD 2 = 0    20 MOD 2 = 0
 5 MOD 2 = 1    21 MOD 2 = 1
 6 MOD 2 = 0    22 MOD 2 = 0
 7 MOD 2 = 1    23 MOD 2 = 1
 8 MOD 2 = 0    24 MOD 2 = 0
 9 MOD 2 = 1    25 MOD 2 = 1
10 MOD 2 = 0    26 MOD 2 = 0
11 MOD 2 = 1    27 MOD 2 = 1
12 MOD 2 = 0    28 MOD 2 = 0
13 MOD 2 = 1    29 MOD 2 = 1
14 MOD 2 = 0    30 MOD 2 = 0
15 MOD 2 = 1    31 MOD 2 = 1
Dr. Sarvesh Vishwakarma TCS-704 ACA


(Block Address) MOD (Number of sets in cache)

Four-way set associative - memory blocks that map to set '0':

 0 MOD 2 = 0    16 MOD 2 = 0
 2 MOD 2 = 0    18 MOD 2 = 0
 4 MOD 2 = 0    20 MOD 2 = 0
 6 MOD 2 = 0    22 MOD 2 = 0
 8 MOD 2 = 0    24 MOD 2 = 0
10 MOD 2 = 0    26 MOD 2 = 0
12 MOD 2 = 0    28 MOD 2 = 0
14 MOD 2 = 0    30 MOD 2 = 0
Dr. Sarvesh Vishwakarma TCS-704 ACA


(Block Address) MOD (Number of sets in cache)

Four-way set associative - memory blocks that map to set '1':

 1 MOD 2 = 1    17 MOD 2 = 1
 3 MOD 2 = 1    19 MOD 2 = 1
 5 MOD 2 = 1    21 MOD 2 = 1
 7 MOD 2 = 1    23 MOD 2 = 1
 9 MOD 2 = 1    25 MOD 2 = 1
11 MOD 2 = 1    27 MOD 2 = 1
13 MOD 2 = 1    29 MOD 2 = 1
15 MOD 2 = 1    31 MOD 2 = 1
Dr. Sarvesh Vishwakarma TCS-704 ACA


(Block Address) MOD (Number of sets in cache)

Eight-way set associative: all eight cache blocks form one set, so every
memory block maps to set 0 - a fully associative cache.

 0 MOD 1 = 0    16 MOD 1 = 0
 1 MOD 1 = 0    17 MOD 1 = 0
 2 MOD 1 = 0    18 MOD 1 = 0
 3 MOD 1 = 0    19 MOD 1 = 0
 4 MOD 1 = 0    20 MOD 1 = 0
 5 MOD 1 = 0    21 MOD 1 = 0
 6 MOD 1 = 0    22 MOD 1 = 0
 7 MOD 1 = 0    23 MOD 1 = 0
 8 MOD 1 = 0    24 MOD 1 = 0
 9 MOD 1 = 0    25 MOD 1 = 0
10 MOD 1 = 0    26 MOD 1 = 0
11 MOD 1 = 0    27 MOD 1 = 0
12 MOD 1 = 0    28 MOD 1 = 0
13 MOD 1 = 0    29 MOD 1 = 0
14 MOD 1 = 0    30 MOD 1 = 0
15 MOD 1 = 0    31 MOD 1 = 0
Dr. Sarvesh Vishwakarma TCS-704 ACA


Processor

Cache

Memory

storage

Dr. Sarvesh Vishwakarma TCS-704 ACA


[Figure: The processor sends a request for data to the cache; the cache replies with the requested data. Main memory and storage sit below the cache in the hierarchy.]

Dr. Sarvesh Vishwakarma TCS-704 ACA


[Figure: The processor requests data from the cache. On a miss, the cache queries main memory for the missing data; memory replies with the enquired data, and the cache then replies to the processor. Storage sits below memory.]

Dr. Sarvesh Vishwakarma TCS-704 ACA


Avg. Memory Access Time = Hit Time(L1 cache) + Miss Rate(L1 cache) x Miss Penalty(L1 cache)

Miss Penalty(L1 cache) = Hit Time(Memory) + Miss Rate(Memory) x Miss Penalty(Memory)

Miss Penalty(Memory) = 0
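A small worked example of the single-level formula, with illustrative numbers (not from the slides): a 1 ns L1 hit time, a 5% L1 miss rate, and a 100 ns memory hit time give an average memory access time of 6 ns.

#include <stdio.h>

int main(void)
{
    /* Illustrative numbers only. */
    double hit_time_L1   = 1.0;     /* ns */
    double miss_rate_L1  = 0.05;
    double hit_time_mem  = 100.0;   /* ns */
    double miss_rate_mem = 0.0;     /* miss penalty of memory taken as 0 */

    double miss_penalty_L1 = hit_time_mem + miss_rate_mem * 0.0;
    double amat = hit_time_L1 + miss_rate_L1 * miss_penalty_L1;

    printf("AMAT = %.2f ns\n", amat);   /* 1 + 0.05 * 100 = 6.00 ns */
    return 0;
}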

Dr. Sarvesh Vishwakarma TCS-704 ACA


[Figure: The processor requests data from the cache (hit time for the cache). On a cache miss, the cache requests the missing data from main memory (hit time for main memory); once the data is found it is returned to the cache and then to the processor. Storage sits below memory.]

Dr. Sarvesh Vishwakarma TCS-704 ACA


Processor

L1: Cache

L2: Cache

Memory

storage

Dr. Sarvesh Vishwakarma TCS-704 ACA


[Figure: The processor requests data from the L1 cache at time t1; on an L1 hit, the data is returned to the processor at time t2. L2, memory, and storage sit below.]
Dr. Sarvesh Vishwakarma TCS-704 ACA
[Figure: The processor requests data from the L1 cache (t1). On an L1 miss, the request for the missing data goes to the L2 cache (t2); on an L2 hit the data is returned to L1 (t3) and then to the processor (t4).]
Dr. Sarvesh Vishwakarma TCS-704 ACA
[Figure: The processor requests data from the L1 cache (t1). On an L1 miss the request goes to the L2 cache (t2); on an L2 miss it goes to main memory (t3). On a memory hit the data is returned to L2 (t4), then to L1 (t5), and finally to the processor (t6).]
Dr. Sarvesh Vishwakarma TCS-704 ACA
[Figure: The processor requests data from the L1 cache (t1). On an L1 miss the request goes to L2 (t2); on an L2 miss it goes to main memory (t3). If the data is not in memory either, a page fault occurs and the request goes to the hard disk (t4); the virtual page holding the data is loaded into memory (t5) - DATA FOUND - and the data is returned to L2 (t6), to L1 (t7), and to the processor (t8).]
Dr. Sarvesh Vishwakarma TCS-704 ACA
Global Miss Rate for L2 cache = (L2 Cache Misses) / (L1 Cache Requests)

Local Miss Rate for L2 cache = (L2 Cache Misses) / (L2 Cache Requests)
Dr. Sarvesh Vishwakarma TCS-704 ACA
Global Miss Rate for L1 cache = (L1 Cache Misses) / (L1 Cache Requests)

Local Miss Rate for L1 cache = (L1 Cache Misses) / (L1 Cache Requests)

(For L1 the global and local miss rates are identical, because every processor
reference is an L1 request.)
Dr. Sarvesh Vishwakarma TCS-704 ACA
[Figure: Two-level hierarchy timing. The processor requests data from L1 (t1). L2 cache requests = L1 cache misses: on an L1 miss, the request goes to L2 (t2). Memory requests = L2 cache misses: on an L2 miss, the request goes to memory (t3). The data is returned on a memory hit (t4), through L2 (t5) and L1 (t6), to the processor.]
Dr. Sarvesh Vishwakarma TCS-704 ACA
Global Miss Rate for L2 cache = (L2 Cache Misses) / (L1 Cache Requests)

Local Miss Rate for L1 cache = (L1 Cache Misses) / (L1 Cache Requests)

Local Miss Rate for L2 cache = (L2 Cache Misses) / (L1 Cache Misses)

Global Miss Rate for L2 cache = (Local Miss Rate for L1 cache) x (Local Miss Rate for L2 cache)
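A minimal C sketch of these relationships, with made-up event counts (1000 L1 requests, 100 L1 misses, 20 L2 misses):

#include <stdio.h>

int main(void)
{
    /* Illustrative counts only. */
    double l1_requests = 1000.0;
    double l1_misses   = 100.0;          /* these become the L2 requests */
    double l2_requests = l1_misses;
    double l2_misses   = 20.0;

    double local_l1  = l1_misses / l1_requests;   /* 0.10 */
    double local_l2  = l2_misses / l2_requests;   /* 0.20 */
    double global_l2 = l2_misses / l1_requests;   /* 0.02 */

    printf("local L1 = %.2f, local L2 = %.2f, global L2 = %.2f\n",
           local_l1, local_l2, global_l2);
    /* Check: global L2 = local L1 x local L2 = 0.10 * 0.20 = 0.02 */
    return 0;
}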

Dr. Sarvesh Vishwakarma TCS-704 ACA


Avg. Memory Access Time = Hit Time(L1) + Miss Rate(L1) x Miss Penalty(L1)

Miss Penalty(L1) = Hit Time(L2) + Miss Rate(L2) x Miss Penalty(L2)

Miss Penalty(L2) = Hit Time(Memory) + Miss Rate(Memory) x Miss Penalty(Memory)

Miss Penalty(Memory) = 0

Substituting (the L2 miss rate here is the local miss rate of L2):

T = Avg. Memory Access Time
  = Hit Time(L1) + Miss Rate(L1) x (Hit Time(L2) + Miss Rate(L2) x Miss Penalty(L2))
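A worked two-level example with illustrative numbers (not from the slides): L1 hit time 1 ns and miss rate 10%, L2 hit time 10 ns and local miss rate 20%, memory access 100 ns.

#include <stdio.h>

int main(void)
{
    /* Illustrative numbers only. */
    double hit_L1 = 1.0,  miss_rate_L1 = 0.10;    /* ns, local L1 miss rate */
    double hit_L2 = 10.0, miss_rate_L2 = 0.20;    /* ns, local L2 miss rate */
    double miss_penalty_L2 = 100.0;               /* main-memory access, ns */

    double amat = hit_L1 + miss_rate_L1 *
                  (hit_L2 + miss_rate_L2 * miss_penalty_L2);

    printf("AMAT = %.2f ns\n", amat);   /* 1 + 0.10 * (10 + 0.20*100) = 4.00 ns */
    return 0;
}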
Dr. Sarvesh Vishwakarma TCS-704 ACA
Introduction
Memory Hierarchy

Dr. Sarvesh Vishwakarma TCS-704 ACA


Caches
❖Capacity : 32 KB, 64 KB, 2MB, 4 MB etc
❖Line length: 10 byte, 32 byte, 64 byte, Is it aligned or not
❖Associativity:
a) Direct mapped,
b) Fully associative,
c) Set associative
❖Replacement policy:
a) Least-recently Used
b) Random replacement
c) First in, first out
d) Not-most-recently used
❖How to handle storing the information in the lower level of the
memory hierarchy
a) Write Back
b) Write Through
Dr. Sarvesh Vishwakarma TCS-704 ACA
Replacement Algorithms

1. Least Recently Used (LRU)     - tracked with LRU bits

2. Least Frequently Used (LFU)   - tracked with a counter

3. First In, First Out (FIFO)    - tracked with a circular buffer
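A minimal C sketch of LRU bookkeeping for a single set, using a per-block "last used" counter (one possible implementation; real caches more often keep a few LRU bits per block):

#include <stdbool.h>

#define WAYS 4                     /* illustrative 4-way set */

struct block { bool valid; unsigned tag; unsigned last_used; };

static struct block set[WAYS];
static unsigned now;               /* logical time, bumped on every access */

/* Return the way holding 'tag', or replace the least recently used block. */
int lookup_or_replace(unsigned tag)
{
    int victim = 0;
    for (int w = 0; w < WAYS; w++) {
        if (set[w].valid && set[w].tag == tag) {
            set[w].last_used = ++now;          /* hit: mark most recently used */
            return w;
        }
        if (!set[w].valid ||
            set[w].last_used < set[victim].last_used)
            victim = w;                        /* track the LRU (or empty) way */
    }
    set[victim].valid = true;                  /* miss: replace the LRU block */
    set[victim].tag = tag;
    set[victim].last_used = ++now;
    return victim;
}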

Dr. Sarvesh Vishwakarma TCS-704 ACA


Introduction
Memory Hierarchy Basics
• Six basic cache optimizations:
– Larger block size
• Reduces compulsory misses
• Increases capacity and conflict misses, increases miss penalty
– Larger total cache capacity to reduce miss rate
• Increases hit time, increases power consumption
– Higher associativity
• Reduces conflict misses
• Increases hit time, increases power consumption
– Higher number of cache levels
• Reduces overall memory access time
– Giving priority to read misses over writes
• Reduces miss penalty
– Avoiding address translation in cache indexing
• Reduces hit time

Dr. Sarvesh Vishwakarma TCS-704 ACA


Eleven Advanced cache Optimization
Techniques
1. Reducing the hit time:
a) small and simple caches,
b) way prediction, and
c) Trace caches

2. Increasing cache bandwidth:


a) pipelined caches,
b) multibanked caches, and
c) Nonblocking caches

3. Reducing the miss penalty:


a) critical word first and
b) merging write buffers

4. Reducing the miss rate: compiler optimizations


a) Code and Data Rearrangement
b) Loop interchange
c) Blocking

5. Reducing the miss penalty or miss rate via parallelism: hardware
   prefetching and compiler prefetching
Dr. Sarvesh Vishwakarma TCS-704 ACA
Advanced Optimizations
Ten Advanced Optimizations
• Small and simple first level caches
– Critical timing path:
• addressing tag memory, then
• comparing tags, then
• selecting correct set
– Direct-mapped caches can overlap tag compare and
transmission of data
– Lower associativity reduces power because fewer cache
lines are accessed

Dr. Sarvesh Vishwakarma TCS-704 ACA


Advanced Optimizations
L1 Size and Associativity

Access time vs. size and associativity

Dr. Sarvesh Vishwakarma TCS-704 ACA


Advanced Optimizations
L1 Size and Associativity

Energy per read vs. size and associativity

Dr. Sarvesh Vishwakarma TCS-704 ACA


Advanced Optimizations
Way Prediction
• To improve hit time, predict the way to pre-set mux
– Mis-prediction gives longer hit time
– Prediction accuracy
• > 90% for two-way
• > 80% for four-way
• I-cache has better accuracy than D-cache
– First used on MIPS R10000 in mid-90s
– Used on ARM Cortex-A8
• Extend to predict block as well
– “Way selection”
– Increases mis-prediction penalty

Dr. Sarvesh Vishwakarma TCS-704 ACA


Advanced Optimizations
Pipelining Cache
• Pipeline cache access to improve bandwidth
– Examples:
• Pentium: 1 cycle
• Pentium Pro – Pentium III: 2 cycles
• Pentium 4 – Core i7: 4 cycles

• Increases branch mis-prediction penalty


• Makes it easier to increase associativity

Dr. Sarvesh Vishwakarma TCS-704 ACA


Advanced Optimizations
Nonblocking Caches
• Allow hits before
previous misses
complete
– “Hit under miss”
– “Hit under multiple
miss”
• L2 must support this
• In general, processors
can hide L1 miss
penalty but not L2
miss penalty

Dr. Sarvesh Vishwakarma TCS-704 ACA


Advanced Optimizations
Multibanked Caches
• Organize cache as independent banks to support
simultaneous access
– ARM Cortex-A8 supports 1-4 banks for L2
– Intel i7 supports 4 banks for L1 and 8 banks for L2

• Interleave banks according to block address

Dr. Sarvesh Vishwakarma TCS-704 ACA


Advanced Optimizations
Critical Word First, Early Restart
• Critical word first
– Request missed word from memory first
– Send it to the processor as soon as it arrives
• Early restart
– Request words in normal order
– Send the missed word to the processor as soon as it arrives

• Effectiveness of these strategies depends on block size and the
likelihood of another access to the portion of the block that has
not yet been fetched

Dr. Sarvesh Vishwakarma TCS-704 ACA


Advanced Optimizations
Merging Write Buffer
• When storing to a block that is already pending in the
write buffer, update write buffer
• Reduces stalls due to full write buffer
• Does not apply to I/O addresses

[Figure: Write-buffer contents without write merging vs. with write merging: with merging, multiple writes to the same block occupy a single buffer entry.]
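A minimal C sketch of a merging write buffer, with illustrative sizes (4 entries, 8-byte blocks) and field names that are not taken from any real design:

#include <stdbool.h>
#include <string.h>

#define ENTRIES     4
#define BLOCK_BYTES 8            /* illustrative: one 8-byte block per entry */

struct wb_entry {
    bool     valid;
    unsigned block_addr;                   /* address of the block */
    bool     byte_valid[BLOCK_BYTES];      /* which bytes hold pending writes */
    unsigned char data[BLOCK_BYTES];
};

static struct wb_entry buf[ENTRIES];

/* Insert a store; if the block is already pending, merge into that entry
 * instead of allocating a new one (reducing stalls from a full buffer). */
bool write_buffer_put(unsigned block_addr, unsigned offset, unsigned char byte)
{
    for (int i = 0; i < ENTRIES; i++)
        if (buf[i].valid && buf[i].block_addr == block_addr) {
            buf[i].data[offset] = byte;            /* merge with pending write */
            buf[i].byte_valid[offset] = true;
            return true;
        }
    for (int i = 0; i < ENTRIES; i++)
        if (!buf[i].valid) {                       /* allocate a fresh entry */
            memset(&buf[i], 0, sizeof buf[i]);
            buf[i].valid = true;
            buf[i].block_addr = block_addr;
            buf[i].data[offset] = byte;
            buf[i].byte_valid[offset] = true;
            return true;
        }
    return false;                                  /* buffer full: stall */
}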

Dr. Sarvesh Vishwakarma TCS-704 ACA


Advanced Optimizations
Compiler Optimizations
• Code and Data Rearrangement

Dr. Sarvesh Vishwakarma TCS-704 ACA


Compiler Optimizations
• Traditional loop operations
– The i-loop is nested inside the j-loop, so memory is accessed in
nonsequential order:

for (j = 0; j < 4; j = j+1)
    for (i = 0; i < 3; i = i+1)
        X[i][j] = 2*X[i][j];

[Figure: Memory layout of X (stored row-major, X[0][0] ... X[3][3]) labelled with the access order S1, S2, ...: the j-outer loop walks down columns - X[0][0], X[1][0], X[2][0], ... - so successive references jump between distant memory locations.]

Dr. Sarvesh Vishwakarma TCS-704 ACA
Compiler Optimizations
• Loop Interchange
– Swap nested loops to access memory in sequential order:

for (i = 0; i < 3; i = i+1)
    for (j = 0; j < 4; j = j+1)
        X[i][j] = 2*X[i][j];

– Reordering maximizes use of the data in a cache block before it is
discarded.

[Figure: Memory layout of X (row-major, X[0][0] ... X[3][3]) labelled with the access order S1, S2, ...: with the interchanged loops the elements are touched in the order they sit in memory - X[0][0], X[0][1], X[0][2], ...]

Dr. Sarvesh Vishwakarma TCS-704 ACA
Compiler Optimizations
Advanced Optimizations
• Blocking
– Instead of accessing entire rows or columns, subdivide matrices
into blocks
– Requires more memory accesses but improves locality of accesses

Y:  1 2 3 1 2 2     Z:  1 1 0 1 2 1     X[0][0] = 14
    0 0 2 0 1 1         1 0 0 1 3 1
    1 2 3 2 2 1         2 2 0 3 2 0
    2 3 0 2 0 1         3 3 2 1 1 1
    1 2 3 4 1 0         1 0 1 4 0 2
    3 2 1 2 0 1         0 1 2 0 1 2

Unblocked code: one element of X needs a full row of Y and a full column of Z:
X[0][0] = 1*1 + 2*1 + 3*2 + 1*3 + 2*1 + 2*0
        = 1 + 2 + 6 + 3 + 2 + 0
        = 14

for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
    {   r = 0;
        for (k = 0; k < N; k = k+1)
            r = r + Y[i][k]*Z[k][j];
        X[i][j] = r;
    };

Dr. Sarvesh Vishwakarma TCS-704 ACA
Compiler Optimizations
Advanced Optimizations
• Blocking
– Instead of accessing entire rows or columns, subdivide matrices
into blocks
– Requires more memory accesses but improves locality of accesses

Y:  1 2 3 1 2 2     Z:  1 1 0 1 2 1     X[0][0] = 9 (intermediate result)
    0 0 2 0 1 1         1 0 0 1 3 1
    1 2 3 2 2 1         2 2 0 3 2 0
    2 3 0 2 0 1         3 3 2 1 1 1
    1 2 3 4 1 0         1 0 1 4 0 2
    3 2 1 2 0 1         0 1 2 0 1 2

Blocked code: only one block of the row and column is used per pass, so the
first pass accumulates an intermediate result:
X[0][0] = 1*1 + 2*1 + 3*2
        = 1 + 2 + 6
        = 9

for (jj = 0; jj < N; jj = jj+B)
    for (kk = 0; kk < N; kk = kk+B)
        for (i = 0; i < N; i = i+1)
            for (j = jj; j < min(jj+B,N); j = j+1)
            {   r = 0;
                for (k = kk; k < min(kk+B,N); k = k+1)
                    r = r + Y[i][k]*Z[k][j];
                X[i][j] = X[i][j] + r;
            };

Dr. Sarvesh Vishwakarma TCS-704 ACA
Advanced Optimizations
Hardware Prefetching
• Fetch two blocks on miss (include next sequential
block)

Pentium 4 Pre-fetching
Dr. Sarvesh Vishwakarma TCS-704 ACA
Advanced Optimizations
Compiler Prefetching
• Insert prefetch instructions before data is needed
• Non-faulting: prefetch doesn’t cause exceptions

• Register prefetch
– Loads data into register
• Cache prefetch
– Loads data into cache

• Combine with loop unrolling and software pipelining
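A small C sketch of compiler (software) prefetching using GCC/Clang's __builtin_prefetch, a non-faulting cache-prefetch hint; the prefetch distance of 16 elements is an illustrative guess, not a tuned value.

/* Compile with GCC or Clang; __builtin_prefetch is a non-faulting hint. */
#define PREFETCH_DISTANCE 16     /* illustrative, would need tuning */

long sum_with_prefetch(const long *a, int n)
{
    long sum = 0;
    for (int i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            /* arguments: address, 0 = read, 1 = low temporal reuse */
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 1);
        sum += a[i];
    }
    return sum;
}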

Dr. Sarvesh Vishwakarma TCS-704 ACA


Advanced Optimizations
Summary

Dr. Sarvesh Vishwakarma TCS-704 ACA


Virtual Memory
• Three memory problems
– Not enough RAM
– Holes in our address space
– Programs writing over each other
• What is virtual memory?
– Indirection
– How does it solve the problem
– Page Tables and Translation
• Implementing virtual memory
– Where do we store the page tables?
– Making translation fast
Dr. Sarvesh Vishwakarma TCS-704 ACA
Max. addressing capability of the 8085 processor:
2^16 = 2^8 x 2^8 = 256 x 256 = 65,536 bytes = 65,536 / 1,024 KB = 64 KB

[Figure: An 8085 CPU on a system bus with 16 address lines, connected to 8 KB of ROM (0000H-1FFFH), 8 KB of RAM (2000H-3FFFH), and a disk drive. A 16 KB program stored on disk spans addresses 2000H-5FFFH, i.e. two 8 KB pieces (2000H-3FFFH and 4000H-5FFFH), so the whole program cannot be resident in the 8 KB RAM at once.]

Dr. Sarvesh Vishwakarma TCS-704 ACA


What if we don’t have enough memory?

• MIPS gives each program its own 32-bit address space
• Programs can access any byte in their 32-bit address
space
– Indirection
– How does it solve the problem
– Page Tables and Translation
• Implementing virtual memory
– Where do we store the page tables?
– Making translation fast
Dr. Sarvesh Vishwakarma TCS-704 ACA
Overlay
• Customer billing software without overlay

Get data from customer -> Calculate total (1 KB) -> Print bill (1 KB)
(both routines resident in memory at the same time)

• Customer billing software with overlay

Get data from customer -> Calculate total (1 KB) -> Load printing routine -> Print bill (1 KB)
(the two 1 KB routines overlay each other in the same memory region, loaded one at a time)

Dr. Sarvesh Vishwakarma TCS-704 ACA


Virtual Memory
If a program became too large for physical memory, it was the
programmer’s job to make it fit. Programmers divided programs
into pieces, then identified the pieces that were mutually exclusive,
and loaded or unloaded those overlays under user program control
during execution. The Programmer ensured that the program never
tried to access more physical main memory than was in the
computer, and that the proper overlay was loaded at the proper time.
This responsibility eroded programmer productivity.

Virtual memory was invented to relieve programmers of this burden. It
automatically manages the two levels of the memory hierarchy represented by
main memory and secondary storage and allows the programmer to develop the
software assuming there is more memory available in the computer than in
reality.
Dr. Sarvesh Vishwakarma TCS-704 ACA
Virtual Memory
[Figure: Mapping of virtual memory to physical memory for a program with four pages A, B, C, and D. Virtual addresses 0, 4K, 8K, and 12K hold pages A, B, C, and D; in physical main memory, C sits at 4K, A at 16K, and B at 24K, while page D resides on disk.]

Dr. Sarvesh Vishwakarma TCS-704 ACA


Making VM work: translation

• How does a program access memory?


1. Program executes a load with a virtual address (VA)
2. Computer translates the address to the physical
address (PA) in memory
3. If the physical address (PA) is not in memory, the
operating system loads it in from disk
4. The computer then reads the RAM using the physical
address (PA) and returns the data to the program
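A simplified C sketch of steps 1-3, assuming 4 KB pages and a flat, single-level page table; all structure and field names are illustrative, not from the slides.

#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096u           /* illustrative: 4 KB pages */
#define NUM_PAGES 1024u           /* size of the flat page table */

struct pte {                      /* page table entry */
    bool     present;             /* page currently in physical memory? */
    uint32_t frame;               /* physical page frame number */
};

static struct pte page_table[NUM_PAGES];

/* Translate a virtual address; returns false when the page is not in
 * memory, i.e. the OS would have to handle a page fault and load it. */
bool translate(uint32_t va, uint32_t *pa)
{
    uint32_t vpn    = va / PAGE_SIZE;     /* virtual page number */
    uint32_t offset = va % PAGE_SIZE;     /* page offset, passed through unchanged */

    if (vpn >= NUM_PAGES || !page_table[vpn].present)
        return false;                     /* page fault */

    *pa = page_table[vpn].frame * PAGE_SIZE + offset;
    return true;
}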

Dr. Sarvesh Vishwakarma TCS-704 ACA


Making VM work: translation

Dr. Sarvesh Vishwakarma TCS-704 ACA


Making VM work: translation

Dr. Sarvesh Vishwakarma TCS-704 ACA


What happens if a page is not in RAM?
• Page Table Entry says the page is on disk
• CPU generates a page fault exception
• The hardware jumps to the OS page fault handler to clean
up
– The OS chooses a page to evict from RAM and write to disk
– If the page is dirty, it needs to be written back to disk first
– The OS then reads the page from disk and puts it in RAM
– The OS then changes the Page Table to map the new page
• The OS jumps back to the instruction that caused the page
fault.
– This time it won’t cause a page fault since the page has been loaded

Dr. Sarvesh Vishwakarma TCS-704 ACA
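A pseudocode-style C sketch of the OS actions listed above; every helper function is a hypothetical name standing in for real OS machinery, not an actual API.

/* Hypothetical helpers; names are illustrative only. */
extern int  choose_victim_frame(void);
extern int  is_dirty(int frame);
extern void write_frame_to_disk(int frame);
extern void read_page_from_disk(unsigned vpn, int frame);
extern void unmap_old_page(int frame);
extern void map_page(unsigned vpn, int frame);

/* Handle a page fault for virtual page 'vpn', then return so the faulting
 * instruction can be re-executed (it will now find the page mapped). */
void page_fault_handler(unsigned vpn)
{
    int frame = choose_victim_frame();     /* pick a page to evict from RAM */

    if (is_dirty(frame))
        write_frame_to_disk(frame);        /* dirty victim goes back to disk first */

    unmap_old_page(frame);                 /* victim is no longer mapped */
    read_page_from_disk(vpn, frame);       /* bring in the needed page */
    map_page(vpn, frame);                  /* update the page table */
}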
How does the virtual memory system prevent
programs from accessing each other’s data?

❖Each program has its own virtual address space.

❖The virtual memory system is responsible for ensuring that different
programs' virtual addresses translate to different physical addresses, so no
two programs have virtual addresses that map onto the same physical
address at the same time.

❖This means that memory references from one program cannot target the
physical addresses containing another program's data, preventing programs
from accessing each other's data.

Dr. Sarvesh Vishwakarma TCS-704 ACA


Caches Vs Virtual Memory

Parameter          First-level cache                  Virtual memory
Block (page) size  16-128 bytes                       4,096-65,536 bytes
Hit time           1-3 clock cycles                   50-150 clock cycles
Miss penalty       8-150 clock cycles                 1,000,000-10,000,000 clock cycles
  (access time)    6-130 clock cycles                 800,000-8,000,000 clock cycles
  (transfer time)  2-20 clock cycles                  200,000-2,000,000 clock cycles
Miss rate          0.1-10%                            0.00001-0.001%
Address mapping    25-45-bit physical address to      32-64-bit virtual address to
                   14-20-bit cache address            25-45-bit physical address

Dr. Sarvesh Vishwakarma TCS-704 ACA


Caches Vs Virtual Memory

1. Replacement on cache misses is primarily controlled by hardware, while
virtual memory replacement is primarily controlled by the operating system.

2. The size of the processor address determines the size of virtual memory,
but the cache size is independent of the processor address size.

3. In addition to acting as the lower-level backing store for main memory in
the hierarchy, secondary storage is also used for the file system. In fact,
the file system occupies most of secondary storage. It is not normally in the
address space.

Dr. Sarvesh Vishwakarma TCS-704 ACA


Mapping of a virtual address to a physical
address via a page table
[Figure: A virtual address is split into a virtual page number and a page offset. The virtual page number indexes the page table to obtain the physical page frame, which is combined with the page offset to form the physical address used to access main memory.]
Dr. Sarvesh Vishwakarma TCS-704 ACA


Techniques for Fast Address Translation
[Figure: TLB operation during address translation. The virtual address consists of an address space number (ASN) <8>, a virtual page number <35>, and a page offset <13>. Each TLB entry holds an ASN <8>, protection bits <4>, a valid bit <1>, a tag <35>, and a physical page address <31>. The matching entry's physical page number is selected through a 128:1 MUX (steps 1-4 below) and supplies the high-order 31 bits of the 44-bit physical address; the low-order 13 bits come directly from the page offset.]
Dr. Sarvesh Vishwakarma TCS-704 ACA


Techniques for Fast Address Translation
1 Step 1: Translation begins by sending the virtual address to all tags. The
tag must be marked valid to allow a match.

2 Step 2: At the same time, the type of memory access is checked for a violation
against protection information in the TLB.

To reduce TLB misses due to context switches, each entry has an 8-bit address
space number (ASN), which plays the same role as the PID. If the context
switching returns to the process with the same ASN, it can still match the TLB.
The process ASN and the page table entry (PTE) ASN must also match for a
valid tag.

3 Step 3: The matching tag sends the corresponding physical address through
what is effectively a 128:1 multiplexor.

4 Step 4: The page offset is then combined with the physical page frame to form
a full physical address.
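A simplified C sketch of the lookup in steps 1-4, modelled loosely on the figure (fully associative search over all entries, ASN and valid-bit check, then concatenation with the page offset); field widths and names are illustrative, and the protection check is omitted.

#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES      128
#define PAGE_OFFSET_BITS 13              /* 8 KB pages, as in the figure */

struct tlb_entry {
    bool     valid;
    uint8_t  asn;                        /* address space number */
    uint64_t tag;                        /* virtual page number */
    uint64_t pfn;                        /* physical page frame number */
};

static struct tlb_entry tlb[TLB_ENTRIES];

/* Steps 1-2: compare the virtual page number (and ASN) against every valid
 * entry. Steps 3-4: on a match, select that entry's physical frame and
 * concatenate it with the untranslated page offset. */
bool tlb_lookup(uint64_t va, uint8_t asn, uint64_t *pa)
{
    uint64_t vpn    = va >> PAGE_OFFSET_BITS;
    uint64_t offset = va & ((1u << PAGE_OFFSET_BITS) - 1);

    for (int i = 0; i < TLB_ENTRIES; i++)
        if (tlb[i].valid && tlb[i].asn == asn && tlb[i].tag == vpn) {
            *pa = (tlb[i].pfn << PAGE_OFFSET_BITS) | offset;
            return true;                 /* TLB hit */
        }
    return false;                        /* TLB miss: walk the page table */
}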

Dr. Sarvesh Vishwakarma TCS-704 ACA


Dr. Sarvesh Vishwakarma TCS-704 ACA
