Parallel 2
CS448
Terminology
Coherence
Defines what values can be returned by a read
Coherent if:
If P writes to X and then reads X, with no writes to X by other processors
in between, then the value read should be the value written by P
If P writes to X and another processor then reads X, and the read and write
are sufficiently separated in time, then the value read should be the value written by P
Writes to the same location are serialized; two writes to the same
location by any two processors are seen in the same order by all
processors
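As a concrete illustration, here is a small hypothetical access trace (Python; the trace format is invented for this sketch) annotated with the condition each entry exercises:

```python
# Hypothetical trace of accesses to one location X by processors P0 and P1.
# Each entry is (processor, operation, value); comments map entries to the
# three coherence conditions above.
trace = [
    ("P0", "write", 1),  # P0 writes X = 1
    ("P0", "read",  1),  # cond. 1: P0 reads its own write (no other writes intervened)
    ("P1", "read",  1),  # cond. 2: a sufficiently later read by P1 also returns 1
    ("P0", "write", 2),
    ("P1", "write", 3),
    # cond. 3 (write serialization): all processors must observe the writes
    # X=2 and X=3 in the same order; they cannot disagree on the final value.
]
```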
Consistency
Determines when a written value will be returned by a read;
we'll need to define a memory consistency model
For now, assume a write does not complete until all processors
have seen the effect of that write
Snooping
No centralized directory
Each cache snoops (listens) on the bus to maintain coherence
among the caches
Used with centralized shared-memory (CSM) machines built around a bus
Write Invalidate: when one processor writes, invalidate all copies of this data
that may be in other caches (the usual choice with a write-back cache)
Write Broadcast (update): when one processor writes, broadcast the value and update any
copies that may be in other caches
Performance Differences
Multiple writes to the same word
Multiple broadcasts using update protocol
Only one initial invalidation using invalidate protocol
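A minimal sketch of that difference, assuming a cost model that counts only bus transactions for N consecutive writes to one word while other caches hold a copy:

```python
# Assumed cost model: count only bus transactions; all other costs ignored.

def update_protocol_traffic(n_writes):
    # Write broadcast (update): every write is broadcast on the bus.
    return n_writes

def invalidate_protocol_traffic(n_writes):
    # Write invalidate: only the first write invalidates other copies;
    # later writes hit in the now-exclusive local copy with no bus traffic.
    return 1 if n_writes > 0 else 0

print(update_protocol_traffic(10))      # 10 broadcasts
print(invalidate_protocol_traffic(10))  # 1 invalidation
```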
Implementing Invalidation
Bus-based scheme
The writing processor acquires the bus
Broadcasts the address to invalidate
All other processors continuously snoop on the bus,
watching the addresses
If a broadcast address matches an address in its
cache, then the corresponding block is invalidated
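A minimal sketch of this scheme in Python; the SnoopingCache and Bus classes are hypothetical stand-ins for the snoop tags and bus arbitration a real machine would use:

```python
class SnoopingCache:
    def __init__(self, name):
        self.name = name
        self.blocks = {}                 # address -> cached data

    def snoop_invalidate(self, address):
        # Snooped invalidation: drop the block if we hold a matching address.
        self.blocks.pop(address, None)

class Bus:
    def __init__(self, caches):
        self.caches = caches

    def broadcast_invalidate(self, writer, address):
        # The writer has already acquired the bus, so invalidations for a
        # given block are serialized; every other cache snoops the address.
        for cache in self.caches:
            if cache is not writer:
                cache.snoop_invalidate(address)

caches = [SnoopingCache("P0"), SnoopingCache("P1"), SnoopingCache("P2")]
bus = Bus(caches)
caches[1].blocks[0x40] = "old"
caches[2].blocks[0x40] = "old"
bus.broadcast_invalidate(writer=caches[0], address=0x40)   # P1 and P2 drop 0x40
```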
Implementing Invalidation
Write-through cache
To locate a data item when a cache miss occurs, just
go to memory (since memory will contain the most
up-to-date value in a write-through cache)
Write-back cache
What problem do we have reading in data on a cache
miss when all processors use write-back caches?
Implementing Invalidation
Write-Back Cache
May need to find the most recent value of a data item in some
other processor's cache rather than in memory
We can do this using the same snooping scheme for cache
misses and writes
Each processor snoops every address placed on the bus when a read is
requested from memory
If a processor has a dirty copy of the requested cache block (i.e., one it has
written to, so it holds the most recent value), it provides that cache block to the
requestor and the memory access is aborted
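A minimal sketch of that read-miss path, assuming each cache is modeled as a dict mapping addresses to {"data", "dirty"} entries (a made-up layout for illustration):

```python
def service_read_miss(address, caches, memory):
    # Every cache snoops the read miss; a dirty owner supplies the block.
    for cache in caches:
        block = cache.get(address)
        if block is not None and block["dirty"]:
            memory[address] = block["data"]   # write the block back as it is supplied
            block["dirty"] = False            # the owner's copy is now clean/shared
            return block["data"]              # the stale memory response is aborted
    return memory[address]                    # no dirty copy: memory is up to date
```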
Example protocol
Each cache uses a finite-state transition diagram to determine
the proper state and action
Write-Invalidate Write-Back
Cache Coherence Protocol
CPU 1 starts in Invalid; it places a read miss, reads block X, and goes to the Shared state
CPU 1 re-reads block X; these are read hits
CPU 2 reads block X: it places a read miss, reads block X, and goes to the Shared state
CPU 1 writes block X: it always places a write miss and moves to the Exclusive state
CPU 2, using the bus-side (right-hand) transitions, snoops the write miss and moves to the Invalid state
CPU 1 writes or reads block X and stays in the Exclusive state
CPU 2 reads block X and places a read miss
CPU 1, using the bus-side transitions, snoops the read miss, moves to the Shared state,
and supplies the correct memory block to CPU 2
CPU 2 moves to the Shared state
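The walkthrough can be replayed with a small sketch in Python; the CPU-side and bus-side (right-hand) transitions are written as functions, and the bus signaling is simplified to returned strings (an assumption of this sketch, not the full protocol):

```python
INVALID, SHARED, EXCLUSIVE = "Invalid", "Shared", "Exclusive"

def cpu_event(state, op):
    """CPU-side (left-hand) transitions for one cache block."""
    if op == "read":
        # A read from Invalid places a read miss; otherwise it is a read hit.
        return (SHARED, "read miss") if state == INVALID else (state, None)
    if op == "write":
        # A write hit in Exclusive stays put; any other write places a write miss.
        return (EXCLUSIVE, None) if state == EXCLUSIVE else (EXCLUSIVE, "write miss")
    raise ValueError(op)

def bus_event(state, bus_op):
    """Bus-side (right-hand) transitions taken while snooping."""
    if bus_op == "write miss":
        return INVALID                 # another CPU is taking exclusive ownership
    if bus_op == "read miss" and state == EXCLUSIVE:
        return SHARED                  # supply the dirty block, downgrade to Shared
    return state

# Replay of the walkthrough for block X.
cpu1 = cpu2 = INVALID
cpu1, _   = cpu_event(cpu1, "read")    # CPU 1: Invalid -> Shared (read miss)
cpu1, _   = cpu_event(cpu1, "read")    # CPU 1: read hit, stays Shared
cpu2, _   = cpu_event(cpu2, "read")    # CPU 2: Invalid -> Shared (read miss)
cpu1, bus = cpu_event(cpu1, "write")   # CPU 1: Shared -> Exclusive, places write miss
cpu2      = bus_event(cpu2, bus)       # CPU 2 snoops the write miss -> Invalid
cpu2, bus = cpu_event(cpu2, "read")    # CPU 2: Invalid -> Shared, places read miss
cpu1      = bus_event(cpu1, bus)       # CPU 1 snoops the read miss -> Shared, supplies block
print(cpu1, cpu2)                      # Shared Shared
```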
Performance of Snooping
Coherence Protocols
Use the four parallel programs described earlier as
a benchmark
Split cache misses into two sets
Coherence misses: misses caused by cache invalidations
Capacity misses: actually capacity, compulsory, and
conflict misses, but most of these are capacity; these are the normal
cache misses seen on a uniprocessor
DSM Coherency
One solution: a hardware directory protocol (covered next)
Another solution: software-based coherency
Possible but slow and conservative; every block that might be
shared is treated as if it is shared
Directory Protocol
Each directory must track the following states for its
cache blocks
Shared?
If shared, which processors are sharing this block?
This avoids a broadcast when we need to invalidate those blocks; instead we
can send a message to only those specific processors
Uncached?
Set if no processor has a copy of the cache block
Exclusive?
Exactly one processor has a copy of the cache block and has written to it,
so the memory copy is out of date
The processor holding the exclusive block is called the owner of the block
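A minimal sketch of one directory entry in Python (the field names and layout are assumptions for illustration), showing how the sharer set lets invalidations be sent point-to-point rather than broadcast:

```python
from dataclasses import dataclass, field

@dataclass
class DirectoryEntry:
    state: str = "uncached"                      # "uncached", "shared", or "exclusive"
    sharers: set = field(default_factory=set)    # IDs of nodes holding a copy

    def owner(self):
        # In the exclusive state exactly one node holds the (dirty) copy.
        assert self.state == "exclusive" and len(self.sharers) == 1
        return next(iter(self.sharers))

    def invalidation_targets(self, requester):
        # Only the recorded sharers need an invalidate message, not every node.
        return self.sharers - {requester}
```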
Home node
The node where the memory location and its directory entry reside
Could be the local node as well
Remote node
A node that has a copy of the cache block
The copy might be exclusive or shared
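A minimal sketch of how a home node might be determined, assuming a simple static interleaving of blocks across nodes (block size and node count are made-up values):

```python
BLOCK_SIZE = 64          # bytes, assumed
NUM_NODES  = 16          # assumed machine size

def home_node(address):
    # The directory entry for a block lives at a fixed node derived from
    # the address; this may or may not be the requesting (local) node.
    block_number = address // BLOCK_SIZE
    return block_number % NUM_NODES

local_node = 3
addr = 0x12340
print(home_node(addr), home_node(addr) == local_node)
```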
Directory-Based Performance
Use the same parallel programs as for the snooping
protocol as our benchmark
Miss rate broken into two categories
Local misses
Remote misses
Remote misses are much more expensive than local misses
Longer read latencies to traverse the interconnect
We will want bigger caches to avoid these latencies
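A back-of-the-envelope sketch of why remote misses matter so much, using made-up latencies and miss rates in a simple average-memory-access-time calculation:

```python
hit_time    = 1      # cycles, assumed
local_miss  = 75     # cycles to local memory, assumed
remote_miss = 400    # cycles across the interconnect, assumed

local_rate, remote_rate = 0.02, 0.005   # hypothetical miss rates

amat = hit_time + local_rate * local_miss + remote_rate * remote_miss
print(amat)   # 4.5: the rarer remote misses contribute 2.0 cycles vs 1.5 for local
```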
Summary
Coherence protocols may be needed for correct program
behavior
The most common protocol is write invalidation with write-back
caches
Can use a snooping or directory-based mechanism to
implement coherence
Coherence requests become more important in programs
that are less optimized
Optimized programs will access most data locally and generate
fewer coherence requests
Exactly how the cache miss rates affect CPU performance
depends on the memory system, interconnect, latency,
bandwidth, etc.