Cache Coherence (Part 1)

CS 740

[Figure: the cache coherence problem — processors P1, P2, P3 with private caches on a shared bus to memory and I/O devices; two caches hold copies of u:5]
Topics
• The Cache Coherence Problem
• Snoopy Protocols
Slides from Todd Mowry/Culler&Singh
[Figure: write-through invalidate protocol state diagram — two states, V and I, per block; processor-generated transitions PrRd/BusRd (I to V), PrRd/—, PrWr/BusWr; observed (snooped) transition BusWr/— (V to I)]
• Two states per block in each cache, as in uniprocessor
  – state of a block can be seen as p-vector
• Hardware state bits associated with only blocks that are in the cache
  – other blocks can be seen as being in invalid (not-present) state in that cache
• Write will invalidate all other caches (no local change of state)
  – can have multiple simultaneous readers of block, but write invalidates them
– 11 – CS 740 F’03
Is WT Coherent?

• Assume:
  • A bus transaction completes before next one starts
  • Atomic bus transactions
  • Atomic memory transactions
• Write propagation?
• Write serialization?
• Key is the bus: writes serialized by bus

Problem with Write-Through

High bandwidth requirements
• Every write from every processor goes to shared bus and memory
• Consider a 500MHz, 1CPI processor, where 15% of instructions are 8-byte stores
• Each processor generates 75M stores or 600MB data per second
• 1GB/s bus can support only 1 processor without saturating
• Write-through especially unpopular for SMPs

Write-back caches absorb most writes as cache hits
• Write hits don’t go on bus
• But now how do we ensure write propagation and serialization?
• Need more sophisticated protocols: large design space
Example (why coherence alone is not enough):

/* Assume initial value of A and flag is 0 */

P1:                         P2:
A = 1;                      while (flag == 0); /* spin idly */
flag = 1;                   print A;

• Coherence talks about 1 memory location; this example depends on the ordering of writes to two locations (A and flag)
Sufficient conditions for SC:
1. Every process issues memory ops in program order
2. After a write op is issued, the issuing process waits for the write to complete before issuing its next op
3. After a read op is issued, the issuing process waits for the read to complete, and for the write whose value is being returned by the read to complete, before issuing its next operation (provides write atomicity)

• Issues:
  • Compilers (loop transforms, register allocation)
  • Hardware (write buffers, OO-execution)
• Reason: uniprocessors care only about dependences to same location (i.e., above conditions VERY restrictive)

Provides SC, not just coherence

Extend arguments used for coherence
• Writes and read misses to all locations serialized by bus into bus order
• If read obtains value of write W, W guaranteed to have completed
  – since it caused a bus transaction
• When write W is performed w.r.t. any processor, all previous writes in bus order have completed
Write-Back Snoopy Protocols

No need to change processor, main memory, cache …
• Extend cache controller and exploit bus (provides serialization)
Dirty state now also indicates exclusive ownership
• Exclusive: only cache with a valid copy (main memory may be too)
• Owner: responsible for supplying block upon a request for it
Design space
• Invalidation versus Update-based protocols
• Set of states

Invalidation-based Protocols

“Exclusive” state means can modify without notifying anyone else
• i.e. without bus transaction
• Must first get block in exclusive state before writing into it
• Even if already in valid state, need transaction, so called a write miss
Store to non-dirty data generates a read-exclusive bus transaction
• Tells others about impending write, obtains exclusive ownership
  – makes the write visible, i.e. write is performed
  – may be actually observed (by a read miss) only later
  – write hit made visible (performed) when block updated in writer’s cache
• Only one RdX can succeed at a time for a block: serialized by bus
Read and Read-exclusive bus transactions drive coherence actions
• Writeback transactions also, but not caused by memory operation and quite incidental to coherence protocol
  – note: replaced block that is not in modified state can be dropped
Update-based Protocols

A write operation updates values in other caches
• New, update bus transaction

Advantages
• Other processors don’t miss on next access: reduced latency
  – In invalidation protocols, they would miss and cause more transactions
• Single bus transaction to update several caches can save bandwidth
  – Also, only the word written is transferred, not whole block

Disadvantages
• Multiple writes by same processor cause multiple update transactions
  – In invalidation, first write gets exclusive ownership, others local
Detailed tradeoffs more complex

Invalidate versus Update

Basic question of program behavior
• Is a block written by one processor read by others before it is rewritten?

Invalidation:
• Yes => readers will take a miss
• No => multiple writes without additional traffic
  – and clears out copies that won’t be used again

Update:
• Yes => readers will not miss if they had a copy previously
  – single bus transaction to update all copies
• No => multiple useless updates, even to dead copies

Need to look at program behavior and hardware complexity
Invalidation protocols much more popular
• Some systems provide both, or even hybrid
Basic MSI Writeback Inval Protocol: State Transition Diagram

States
– invalid (I)
– exclusive or exclusive-clean (only this cache has copy, but not modified)
– modified (dirty)

[Figure: state transition diagram — states M, S, I; transitions include PrRd/—, PrWr/—, PrRd/BusRd(S), BusRd/Flush, BusRdX/Flush]
[Figure: Dragon update protocol state transition diagram — states E, Sc, Sm, M; transitions include PrRdMiss/BusRd(S), PrWr/—, PrWr/BusUpd(S), BusUpd/Update, BusRd/Flush, PrWrMiss/(BusRd(S); BusUpd), PrWrMiss/BusRd(S)]
• If those that used continue to use, and writes between use are few, …
  – e.g. producer-consumer pattern
• If those that use unlikely to use again, or many writes between reads, which should do better?
  – “pack rat” phenomenon particularly bad under process migration
  – useless updates where only last one will be used

Can construct scenarios where one or other is much better

[Figure: address-bus and data-bus traffic (MB/s) for the two approaches]
[Figure: bus traffic per workload under Ill (Illinois), 3St, and 3St-RdEx protocols — Appl-Code, Appl-Data, OS-Code, OS-Data, Barnes, LU, Ocean, Radiosity, Radix, Raytrace]
[Figure: true-sharing and false-sharing miss rates, and upgrade/update rate (%), for LU, Ocean, Radix, Raytrace under inv, upd, and mix protocols]

• Update traffic is substantial
  – could delay updates or use merging
• Overall, trend is away from update
• Will see later that updates have greater problems for scalable systems
• Lots of coherence misses: updates help
• Lots of capacity misses: updates hurt (keep data in cache uselessly)
• Updates seem to help, but this ignores upgrade and update traffic
Impact of Block Size on Miss Rate

[Figure: miss rate (including upgrades) vs. block size (8–256 bytes) for Barnes, LU, Ocean, Radiosity, Radix, Raytrace]

Larger blocks can:
  – increase misses due to false sharing if spatial locality not good
  – increase misses due to conflicts in fixed-size cache
  – increase traffic due to fetching unnecessary data and due to false sharing
  – increase miss penalty and perhaps hit cost
• Working set doesn’t fit: impact on capacity misses much more critical
Impact of Block Size on Traffic

Traffic affects performance indirectly through contention

[Figure: address-bus and data-bus traffic (bytes/instruction and bytes/FLOP) vs. block size (8–256 bytes) for Barnes, LU, Ocean, Radiosity, Radix, Raytrace]

• Results different than for miss rate: traffic almost always increases with block size
• When working set fits, overall traffic still small, except for Radix
• Fixed overhead is significant component
  – So total traffic often minimized at 16-32 byte block, not smaller
• Working set doesn’t fit: even 128-byte good for Ocean due to capacity

Making Large Blocks More Effective

Software
• Improve spatial locality by better data structuring
• Compiler techniques

Hardware
• Proposals for adjustable cache size
• Use update instead of invalidate protocols to reduce false sharing effect
• More subtle: delay propagation of invalidations and perform all at once
  – But can change consistency model: discuss later in course