Cache Memory: Computer Organization and Architecture

Characteristics of Memory Systems

Location
• CPU
— Registers and control unit memory
• Internal
— Main memory and cache

Capacity
• Word size
— The natural unit of organisation
— Typically number of bits used to represent an integer in the processor
Access Methods (2)
• Random
— Individual addresses identify locations exactly
— Access time is independent of location or previous access
— e.g. RAM
• Associative
— Data is located by a comparison with contents of a portion of the store
— Access time is independent of location or previous access
— All memory is checked simultaneously; access time is constant
— e.g. cache

Performance
• From the user's perspective the most important characteristics of memory are capacity and performance
• Three performance parameters:
— Access time
— Cycle time
— Transfer rate
• Access time (latency)
— For RAM, access time is the time between presenting an address to memory and getting the data on the bus
— For other memories the largest component is positioning the read/write mechanism
Organization
• Physical arrangement of bits into words
• Not always obvious, e.g., interleaved memory (examples later)

Memory Hierarchy
• For any memory:
— How fast?
— How much?
— How expensive?
• Faster memory => greater cost per bit
• Greater capacity => smaller cost per bit
• Greater capacity => slower access
• Going down the hierarchy:
— Decreasing cost per bit
— Increasing capacity
— Increasing access time
— Decreasing frequency of access by processor
Cache
• A small amount of fast memory that sits between normal main memory and CPU
• May be located on CPU chip or module
• Intended to allow access speed approaching register speed

Cache Memory Principles
• When processor attempts to read a word from memory, cache is checked first
• If data sought is not present in cache, a block of memory of fixed size is read into the cache
• Locality of reference makes it likely that other words in the same block will be accessed soon
Cache/Main Memory Structure
(figure not reproduced)

Cache view of memory
• n address lines => 2^n words of memory
• Cache stores fixed-length blocks of K words
• Cache views memory as an array of M blocks, where M = 2^n / K
• A block of memory in cache is referred to as a line; K is the line size
• Cache size of C blocks, where C < M (considerably)
• Each line includes a tag that identifies the block being stored
• Tag is usually the upper portion of the memory address
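To make the arithmetic concrete, here is a small C sketch of this view of memory; the sizes (n = 24 address bits, K = 4 words per block) are assumptions chosen to match the examples later in these slides:

```c
#include <stdio.h>

/* Cache view of memory: n address lines give 2^n words, grouped into
 * M = 2^n / K blocks of K words. Sizes here are assumptions for the
 * sketch: n = 24 address bits, K = 4 words per block (line size). */
enum { N_BITS = 24, K_WORDS = 4 };

int main(void) {
    unsigned long words  = 1UL << N_BITS;    /* 2^n addressable words */
    unsigned long blocks = words / K_WORDS;  /* M = 2^n / K           */

    unsigned long addr = 0xFFFFFCUL;         /* sample 24-bit address */
    printf("words=%lu  blocks(M)=%lu\n", words, blocks);
    printf("addr %06lX falls in block %06lX, word %lu of that block\n",
           addr, addr / K_WORDS, addr % K_WORDS);
    return 0;
}
```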
Elements of Cache Design
• Addresses (logical or physical)
• Size
• Mapping Function (direct, associative, set associative)
• Replacement Algorithm (LRU, LFU, FIFO, random)
• Write Policy (write through, write back, write once)
• Line Size
• Number of Caches (how many levels, unified or split)

Note that cache design for High Performance Computing (HPC) is very different from cache design for other computers; some HPC applications perform poorly with typical cache designs.

Cache Size does matter
• Cost
— More cache is expensive
— Would like cost/bit to approach cost of main memory
• Speed
— But we want speed to approach cache speed for all memory accesses
— More cache is faster (up to a point)
— Checking cache for data takes time
— Larger caches are slower to operate
Look-aside and Look-through
• Look-aside cache is parallel with main memory
• Cache and main memory both see the bus cycle
— Cache hit: processor loaded from cache, bus cycle terminates
— Cache miss: processor AND cache loaded from memory in parallel
• Pro: less expensive, better response to cache miss
• Con: processor cannot access cache while another bus master accesses memory

Look-through cache
• Cache checked first when processor requests data from memory
— Hit: data loaded from cache
— Miss: cache loaded from memory, then processor loaded from cache
• Pro:
— Processor can run on cache while another bus master uses the bus
• Con:
— More expensive than look-aside, cache misses slower
Direct Mapping Cache Organization
(figure not reproduced)

Example
(figure not reproduced)
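The organization and example diagrams are not reproduced above. As a rough stand-in, here is a minimal direct-mapped lookup in C; the field widths (8-bit tag, 14-bit line, 2-bit word) and all names are assumptions for illustration, not taken from the slides:

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/* Illustrative direct-mapped cache: a 24-bit address splits into
 * tag | line | word. Field widths are assumptions for this sketch:
 * 8-bit tag, 14-bit line, 2-bit word (4 words per line). */
enum { WORD_BITS = 2, LINE_BITS = 14, NUM_LINES = 1 << LINE_BITS };

struct cache_line {
    bool     valid;
    uint32_t tag;
    uint32_t data[1 << WORD_BITS];    /* one block of words */
};

static struct cache_line cache[NUM_LINES];

/* A block can live in exactly one line, the one selected by the line
 * field of its address; it is a hit iff that line holds its tag. */
static bool lookup(uint32_t addr, uint32_t *word_out) {
    uint32_t word = addr & ((1u << WORD_BITS) - 1);
    uint32_t line = (addr >> WORD_BITS) & (NUM_LINES - 1);
    uint32_t tag  = addr >> (WORD_BITS + LINE_BITS);

    if (cache[line].valid && cache[line].tag == tag) {
        *word_out = cache[line].data[word];   /* hit */
        return true;
    }
    return false;   /* miss: the controller would fetch the block */
}

int main(void) {
    uint32_t w;
    /* Cold cache, so the first access misses. */
    printf("%s\n", lookup(0xFFFFFCu, &w) ? "hit" : "miss");
    return 0;
}
```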
Associative Mapping
(figure not reproduced)

Direct Mapping compared to Associative
• Because no bit field in the address specifies a line number, the cache size is not determined by the address size
• Associative-mapped memory is also called “content-addressable memory”
• Items are found not by their address but by their content
— Used extensively in routers and other network devices
— Corresponds to associative arrays in Perl and other languages
• Primary disadvantage is the cost of circuitry
Associative Mapping Address Structure
Tag: 22 bits | Word: 2 bits
• 22-bit tag stored with each 32-bit block of data
• Compare tag field with tag entry in cache to check for hit
• Least significant 2 bits of address identify which 16-bit word is required from 32-bit data block
• e.g.
— Address: FFFFFC, Tag: 3FFFFF, Data: 24682468, Cache line: 3FFF
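A quick C check of this split; the tag is simply the address with the 2-bit word field dropped, reproducing the FFFFFC → 3FFFFF example above:

```c
#include <stdio.h>

/* Associative mapping: 24-bit address = 22-bit tag + 2-bit word.
 * Hardware compares the tag against every cache line at once; here
 * we only extract the fields for the slide's example address. */
int main(void) {
    unsigned long addr = 0xFFFFFCUL;
    unsigned long tag  = addr >> 2;   /* upper 22 bits: 0x3FFFFF */
    unsigned long word = addr & 3;    /* lower 2 bits:  0        */
    printf("addr=%06lX tag=%06lX word=%lu\n", addr, tag, word);
    return 0;
}
```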
Example
(figure not reproduced)

Associative Mapping
• Parking lot analogy: there are more permits than spaces
• Any student can park in any space
• Makes full use of the parking lot
— With direct mapping many spaces may be unfilled
Associative Mapping Summary
• Address length = (s + w) bits, where w = log2(block size)
• Number of addressable units = 2^(s+w) words or bytes
• Block size = line size = 2^w words or bytes
• Number of blocks in main memory = 2^(s+w) / 2^w = 2^s
• Number of lines in cache = undetermined
• Size of tag = s bits

Set Associative Mapping
• A compromise that provides strengths of both direct and associative approaches
• Cache is divided into a number of sets of lines
• Each set contains a fixed number of lines
• A given block maps to any line in a given set determined by that block's address
— e.g. Block B can be in any line of set i
• e.g. 2 lines per set
— 2-way associative mapping
— A given block can be in one of 2 lines in only one set
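As a concrete instance, a small C sketch evaluating these summary formulas for the s = 22, w = 2 split used in the associative address structure above:

```c
#include <stdio.h>

/* Associative mapping summary formulas with s = 22 tag bits and
 * w = 2 word bits (the 24-bit example used in these slides). */
int main(void) {
    int s = 22, w = 2;
    printf("address length    = %d bits\n", s + w);
    printf("addressable units = 2^%d = %lu\n", s + w, 1UL << (s + w));
    printf("block size        = 2^%d = %lu words\n", w, 1UL << w);
    printf("blocks in memory  = 2^%d = %lu\n", s, 1UL << s);
    return 0;
}
```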
Set Associative Mapping Address Structure
Tag: 9 bits | Set: 13 bits | Word: 2 bits
• Cache control logic sees the address as three fields: tag, set and word
• Use set field to determine which cache set to look in
• Compare tag field to see if we have a hit
• e.g.
— Address: 1FF 7FFC, Tag: 1FF, Data: 12345678, Set number: 1FFF
— Address: 001 7FFC, Tag: 001, Data: 11223344, Set number: 1FFF
• Tags are much smaller than in a fully associative cache, and the comparators needed for simultaneous lookup are much less expensive

Example
(figure not reproduced)
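A minimal C sketch of the three-field split, using the widths from this slide; note that both example addresses fall in set 1FFF and are distinguished there only by their tags:

```c
#include <stdio.h>

/* Set associative mapping: 24-bit address = 9-bit tag | 13-bit set |
 * 2-bit word. The slide writes addresses as "tag setword", so
 * "1FF 7FFC" is the 24-bit address 0xFFFFFC and "001 7FFC" is
 * 0x00FFFC. */
enum { WORD_BITS = 2, SET_BITS = 13 };

static void split(unsigned long addr) {
    unsigned long word = addr & ((1UL << WORD_BITS) - 1);
    unsigned long set  = (addr >> WORD_BITS) & ((1UL << SET_BITS) - 1);
    unsigned long tag  = addr >> (WORD_BITS + SET_BITS);
    printf("addr=%06lX  tag=%03lX  set=%04lX  word=%lu\n",
           addr, tag, set, word);
}

int main(void) {
    split(0xFFFFFCUL);   /* "1FF 7FFC": tag 1FF, set 1FFF */
    split(0x00FFFCUL);   /* "001 7FFC": tag 001, set 1FFF */
    return 0;
}
```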
Associative Mapped Implementation
(figure not reproduced)

Varying associativity over cache size
(figure not reproduced)
Other Algorithms
• First in first out (FIFO)
— replace block that has been in cache longest
— Implemented as circular queue
• Least frequently used (LFU)
— replace block which has had fewest hits
• Random
— Almost as good as the other choices
• LRU is often favored because of ease of hardware implementation

Write Policy
• When a block of memory is about to be overwritten in cache:
— No problem if not modified in cache
— Has to be written back to main memory if modified (dirty)
• Must not overwrite a cache block unless main memory is up to date
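A sketch tying these two slides together: LRU victim selection for one set, with a write-back step guarded by a dirty bit. The 4-way set size and every identifier are illustrative assumptions:

```c
#include <stdint.h>
#include <stdbool.h>

/* LRU replacement with a write-back policy, for one cache set.
 * The 4-way set size and all names are illustrative. */
enum { WAYS = 4 };

struct line {
    bool     valid;
    bool     dirty;      /* modified since it was loaded?       */
    uint32_t tag;
    uint32_t last_used;  /* LRU timestamp from a simple counter */
};

static uint32_t ticks;

/* Stand-ins for the bus transfers a real controller would issue. */
static void write_block_to_memory(uint32_t tag) { (void)tag; }
static void read_block_from_memory(uint32_t tag) { (void)tag; }

/* On a miss, evict the least recently used line of the set. A dirty
 * victim is written back first, so main memory is up to date before
 * its block is overwritten (the write-policy rule above). */
static struct line *replace(struct line set[WAYS], uint32_t new_tag) {
    struct line *victim = &set[0];
    for (int i = 1; i < WAYS; i++)
        if (set[i].last_used < victim->last_used)
            victim = &set[i];

    if (victim->valid && victim->dirty)
        write_block_to_memory(victim->tag);   /* write-back */

    read_block_from_memory(new_tag);
    victim->valid = true;
    victim->dirty = false;                    /* clean copy loaded */
    victim->tag = new_tag;
    victim->last_used = ++ticks;
    return victim;
}

int main(void) {
    struct line set[WAYS] = {0};
    replace(set, 0x1FF);   /* cold miss fills the least recent way */
    return 0;
}
```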
Line Size
• When a cache line is filled it normally includes more than the requested data – some adjacent words are retrieved
• As block size increases, cache hit ratio will also increase because of locality of reference – to a limit
• If block size is too large, the possibility of reference to other parts of the block decreases; there are fewer blocks in cache, so more chance of a block being overwritten

Line Size
• Relationship between block size and hit ratio is complex and program-dependent
• No optimal formula exists
• General purpose computing uses blocks of 8 to 64 bytes
• In HPC, 64- and 128-byte lines are most common
Split Cache
• Current trend favors split caches
— Useful for superscalar machines with parallel execution of instructions and prefetching of predicted instructions
— Split cache eliminates contention for the cache between the instruction fetch/decode unit and the execution unit (when accessing data)
— Helps to keep the pipeline full because the EU will block the fetch/decode unit otherwise

Pentium Cache Evolution
• 80386 – no on-chip cache
• 80486 – 8k using 16-byte lines and four-way set associative organization
• Pentium (all versions) – two on-chip L1 caches
— Data & instructions
• Pentium III – L3 cache added off chip
• Pentium 4
— L1 caches
– 8k bytes
– 64-byte lines
– four-way set associative
— L2 cache
– Feeding both L1 caches
– 256k
– 128-byte lines
– 8-way set associative
— L3 cache on chip
Pentium 4 Cache Operating Modes
(table not reproduced)

ARM Cache Organization
• ARM3 started with 4KB of cache
• ARM design emphasis on few transistors and small, low-power chips has kept cache fairly small