Sequential Consistency
and
Cache Coherence Protocols
Arvind
Computer Science and Artificial Intelligence Lab
M.I.T.
Based on the material prepared by
Arvind and Krste Asanovic
6.823 L17- 2
Arvind
Memory Consistency in SMPs
[Figure: CPU-1 and CPU-2, each with a private cache holding A = 100, connected over the CPU-Memory bus to a memory that also holds A = 100.]
Suppose CPU-1 updates A to 200.
write-back: memory and cache-2 have stale values
write-through: cache-2 has a stale value
Do these stale values matter?
What is the view of shared memory for programming?
November 9, 2005
Write-back Caches & SC
prog T1:              prog T2:
  ST X, 1               LD Y, R1
  ST Y, 11              ST Y', R1
                        LD X, R2
                        ST X', R2

Initially memory holds X=0, Y=10.

• T1 is executed:              cache-1: X=1, Y=11         memory: X=0, Y=10
• cache-1 writes back Y:                                  memory: X=0, Y=11
• T2 executed:                 cache-2: Y=11, Y'=11,
                                        X=0,  X'=0        memory: X=0, Y=11
• cache-1 writes back X:                                  memory: X=1, Y=11
• cache-2 writes back X' & Y':                            memory: X=1, Y=11, X'=0, Y'=11

The final state, X'=0 with Y'=11, is incoherent under SC: T2's LD Y returned 11, so T1's ST X, 1 must already have happened, yet T2's LD X returned the stale 0.
Write-through Caches & SC
prog T1:              prog T2:
  ST X, 1               LD Y, R1
  ST Y, 11              ST Y', R1
                        LD X, R2
                        ST X', R2

Initially: cache-1: X=0, Y=10;  cache-2: X=0;  memory: X=0, Y=10.

• T1 executed:  cache-1: X=1, Y=11    memory: X=1, Y=11 (writes go through)
                cache-2 still holds the stale copy X=0
• T2 executed:  cache-2: Y=11, Y'=11, X=0 (stale hit), X'=0
                memory: X=1, Y=11, X'=0, Y'=11
Write-through caches don’t preserve
sequential consistency either
Maintaining Sequential Consistency
SC is sufficient for correct producer-consumer
and mutual exclusion code (e.g., Dekker)
Multiple copies of a location in various caches
can cause SC to break down.
Hardware support is required such that
• only one processor at a time has write
permission for a location
• no processor can load a stale copy of
the location after a write
⇒ cache coherence protocols
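The producer-consumer handoff that SC makes correct can be sketched in C11 (an illustrative sketch, not from the slides; C11 seq_cst atomics provide the SC-style ordering that the hardware protocols below are meant to preserve):

```c
#include <stdatomic.h>

/* Illustrative sketch: under sequential consistency the consumer can
 * never observe flag == 1 and then read the old value of data, because
 * the two stores in produce() cannot appear reordered. */
static int data = 0;
static _Atomic int flag = 0;

void produce(void) {
    data = 42;                /* ST X */
    atomic_store(&flag, 1);   /* ST Y */
}

int consume(void) {
    while (atomic_load(&flag) == 0)   /* LD Y: wait for the flag */
        ;
    return data;                      /* LD X: must see 42 */
}
```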
A System with Multiple Caches
[Figure: several processors, each with a private L1 cache; groups of L1s share an L2 cache; the L2s connect through an interconnect to memory M.]
• Modern systems often have hierarchical caches
• Each cache has exactly one parent but can have zero
or more children
• Only a parent and its children can communicate
directly
• Inclusion property is maintained between a parent
and its children, i.e.,
a ∈ Li ⇒ a ∈ Li+1
Cache Coherence Protocols for SC
write request:
the address is invalidated (updated) in all other caches before (after) the write is performed
read request:
if a dirty copy is found in some cache, a write-back is performed before the memory is read
We will focus on Invalidation protocols
as opposed to Update protocols
Warmup: Parallel I/O
[Figure: the processor's cache and a DMA engine (connected to a disk) both sit on the Memory Bus, each driving address (A), data (D), and R/W lines to physical memory.]
• Page transfers occur while the processor is running
• Either the Cache or the DMA engine can be the Bus Master and effect transfers
DMA stands for Direct Memory Access
Problems with Parallel I/O
[Figure: cached portions of a page sit in the processor's cache while DMA transfers move the page between physical memory and disk.]
Memory → Disk: physical memory may be stale if the cache's copy is dirty.
Disk → Memory: the cache may hold stale data corresponding to the overwritten memory.
Snoopy Cache (Goodman 1983)
• Idea: Have cache watch (or snoop upon)
DMA transfers, and then “do the right
thing”
• Snoopy cache tags are dual-ported
[Figure: the cache has two tag ports. One address/R/W port drives the Memory Bus when the cache is Bus Master; a snoopy read port, with tags and state attached to the Memory Bus, watches bus traffic. Data (D, lines) moves between processor and cache.]
Snoopy Cache Actions
Observed Bus Cycle       Cache State          Cache Action
Read Cycle               Address not cached   No action
(Memory → Disk)          Cached, unmodified   No action
                         Cached, modified     Cache intervenes
Write Cycle              Address not cached   No action
(Disk → Memory)          Cached, unmodified   Cache purges its copy
                         Cached, modified     ???
Shared Memory Multiprocessor
[Figure: processors M1, M2, and M3, each behind a snoopy cache, plus a DMA engine with disks, all attached to the memory bus and physical memory.]
Use snoopy mechanism to keep all
processors’ view of memory coherent
Cache State Transition Diagram
The MSI protocol
Each cache line has an address tag and state bits:
  M: Modified
  S: Shared
  I: Invalid

State transitions for a line in processor P1's cache:
  M → M: P1 reads or writes
  M → S: other processor reads (P1 writes back)
  M → I: other processor intends to write (P1 writes back)
  S → M: P1 intends to write
  S → S: read by any processor
  S → I: other processor intends to write
  I → M: P1 write miss (intent to write)
  I → S: P1 read miss
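The MSI diagram can be transcribed as a next-state function; a minimal sketch (the event names are mine, not from the slides):

```c
/* MSI next-state function for one cache line, following the diagram
 * above. "Bus" events are actions observed from another processor. */
typedef enum { MSI_M, MSI_S, MSI_I } msi_state;

typedef enum {
    EV_PR_READ,        /* this processor reads                */
    EV_PR_WRITE,       /* this processor writes               */
    EV_BUS_READ,       /* another processor reads             */
    EV_BUS_WR_INTENT   /* another processor intends to write  */
} msi_event;

msi_state msi_next(msi_state s, msi_event e) {
    switch (s) {
    case MSI_M:
        if (e == EV_BUS_READ)      return MSI_S;  /* write back, then share */
        if (e == EV_BUS_WR_INTENT) return MSI_I;  /* write back, invalidate */
        return MSI_M;                             /* own reads/writes hit   */
    case MSI_S:
        if (e == EV_PR_WRITE)      return MSI_M;  /* intent to write on bus */
        if (e == EV_BUS_WR_INTENT) return MSI_I;
        return MSI_S;                             /* reads keep it shared   */
    default: /* MSI_I */
        if (e == EV_PR_READ)       return MSI_S;  /* read miss  */
        if (e == EV_PR_WRITE)      return MSI_M;  /* write miss */
        return MSI_I;                             /* other bus traffic: no copy here */
    }
}
```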
2 Processor Example
Access sequence: P1 reads, P1 writes, P2 reads, P2 writes, P1 reads, P1 writes, P2 writes, P1 writes.

The block's state in P1's cache follows the MSI diagram, with "other processor" = P2:
  M → M: P1 reads or writes;   M → S: P2 reads (P1 writes back);
  M → I: P2 intends to write;  S → M: P1 intends to write;
  S → I: P2 intends to write;  I → M: write miss;  I → S: read miss.

The block's state in P2's cache follows the same diagram with the roles of P1 and P2 exchanged.
Observation
[The MSI state diagram for P1, repeated from the Cache State Transition Diagram slide.]
• If a line is in the M state then no other
cache can have a copy of the line!
– Memory stays coherent, multiple differing copies
cannot exist
MESI: An Enhanced MSI protocol
Each cache line has an address tag and state bits:
  M: Modified Exclusive
  E: Exclusive, unmodified
  S: Shared
  I: Invalid

State transitions for a line in processor P1's cache:
  M → M: P1 writes or reads
  E → M: P1 writes (no bus transaction needed)
  E → E: P1 reads
  M, E → S: other processor reads (P1 writes back if modified)
  M, E → I: other processor intends to write
  S → M: P1 intends to write
  S → S: read by any processor
  S → I: other processor intends to write
  I → S: P1 read miss, shared (another cache holds the line)
  I → E: P1 read miss, not shared
  I → M: P1 write miss
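The MESI transitions can likewise be written as a next-state function; a sketch (event names are mine) showing the key payoff, the silent E → M upgrade on a write:

```c
/* MESI next-state function. Extends MSI with the Exclusive state so a
 * write to an unshared, clean line needs no bus transaction. A read
 * miss reports whether another cache already holds the line. */
typedef enum { MESI_M, MESI_E, MESI_S, MESI_I } mesi_state;

typedef enum {
    EV2_PR_READ_SHARED,   /* read miss; another cache has the line  */
    EV2_PR_READ_EXCL,     /* read miss; no other cache has the line */
    EV2_PR_READ,          /* read hit                               */
    EV2_PR_WRITE,
    EV2_BUS_READ,
    EV2_BUS_WR_INTENT
} mesi_event;

mesi_state mesi_next(mesi_state s, mesi_event e) {
    switch (s) {
    case MESI_M:
        if (e == EV2_BUS_READ)      return MESI_S;  /* write back, share */
        if (e == EV2_BUS_WR_INTENT) return MESI_I;  /* write back, drop  */
        return MESI_M;
    case MESI_E:
        if (e == EV2_PR_WRITE)      return MESI_M;  /* silent: no bus op */
        if (e == EV2_BUS_READ)      return MESI_S;
        if (e == EV2_BUS_WR_INTENT) return MESI_I;
        return MESI_E;
    case MESI_S:
        if (e == EV2_PR_WRITE)      return MESI_M;  /* intent to write */
        if (e == EV2_BUS_WR_INTENT) return MESI_I;
        return MESI_S;
    default: /* MESI_I */
        if (e == EV2_PR_READ_SHARED) return MESI_S;
        if (e == EV2_PR_READ_EXCL)   return MESI_E;
        if (e == EV2_PR_WRITE)       return MESI_M; /* write miss */
        return MESI_I;
    }
}
```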
Five-minute break to stretch your legs
Cache Coherence State Encoding
[Figure: the block address splits into tag, index, and offset; each cache line holds an address tag, a valid (V) bit, a modified/dirty (M) bit, and the data block; a tag match produces Hit? and the offset selects the word.]

Valid and dirty bits can be used to encode the S, I, and (E, M) states:
  V=0, D=x ⇒ Invalid
  V=1, D=0 ⇒ Shared (not dirty)
  V=1, D=1 ⇒ Exclusive (dirty)
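The encoding can be read back with a trivial helper (a sketch; note that E and M share one encoding, so the decoder cannot tell them apart):

```c
/* Decode the valid (V) and dirty (D) bits of a cache line into a
 * coherence-state letter. 'M' here stands for the merged (E, M) case. */
char vd_decode(int v, int d) {
    if (!v)
        return 'I';          /* V=0, D=x: Invalid          */
    return d ? 'M' : 'S';    /* V=1: dirty => E/M, else S  */
}
```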
2-Level Caches
CPU CPU CPU CPU
L1 $ L1 $ L1 $ L1 $
L2 $ L2 $ L2 $ L2 $
Snooper Snooper Snooper Snooper
• Processors often have two-level caches
• Small L1 on chip, large L2 off chip
• Inclusion property: entries in L1 must be in L2
invalidation in L2 ⇒ invalidation in L1
• Snooping on L2 does not affect CPU-L1 bandwidth
What problem could occur?
Intervention
[Figure: cache-1 holds A = 200 (modified); cache-2 has no copy; memory still holds the stale value A = 100.]
When a read-miss for A occurs in cache-2,
a read request for A is placed on the bus
• Cache-1 needs to supply the data and change its state to Shared
• The memory may respond to the request also!
Does memory know it has stale data?
Cache-1 needs to intervene through the memory controller to supply the correct data to cache-2
False Sharing
state | blk addr | data0 | data1 | ... | dataN
A cache block contains more than one word
Cache-coherence is done at the block-level and
not word-level
Suppose M1 writes word i and M2 writes word k, and both words have the same block address.
What can happen?
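The block ping-pongs between M1's and M2's caches even though the two words are logically independent. A common software mitigation is to pad or align the hot words into different blocks; a sketch, assuming a 64-byte block:

```c
#include <stddef.h>

#define BLOCK_SIZE 64  /* assumed cache block size in bytes */

/* Both counters share one block: each processor's write invalidates
 * the other's copy, so the line ping-pongs (false sharing). */
struct counters_false_sharing {
    long m1_count;
    long m2_count;
};

/* Padding pushes the second counter into its own block, so each
 * processor's writes invalidate only its own line. */
struct counters_padded {
    long m1_count;
    char pad[BLOCK_SIZE - sizeof(long)];
    long m2_count;
};
```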
Synchronization and Caches:
Performance Issues
Processors 1, 2, and 3 each execute:
     R ← 1
L:   swap(mutex, R);
     if <R> then goto L;
     <critical section>
     M[mutex] ← 0;

[Figure: the three processors' caches on the CPU-Memory bus, with mutex = 1 held in one cache.]
Cache-coherence protocols will cause mutex to ping-pong
between P1’s and P2’s caches.
Ping-ponging can be reduced by first reading the mutex
location (non-atomically) and executing a swap only if it is
found to be zero.
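That test-and-test-and-set idea can be sketched with C11 atomics (an illustration, not from the slides; the slide's swap corresponds to atomic_exchange):

```c
#include <stdatomic.h>

static _Atomic int mutex = 0;   /* 0 = free, 1 = held */

void mutex_lock(void) {
    for (;;) {
        /* Spin on an ordinary load first: it hits in the local cache
         * and generates no bus traffic while the lock is held. */
        while (atomic_load(&mutex) != 0)
            ;
        /* Only when the lock looks free, attempt the bus-visible swap. */
        if (atomic_exchange(&mutex, 1) == 0)
            return;   /* swap returned 0: we acquired the lock */
    }
}

void mutex_unlock(void) {
    atomic_store(&mutex, 0);
}
```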
Performance Related to Bus Occupancy
In general, a read-modify-write instruction
requires two memory (bus) operations without
intervening memory operations by other
processors
In a multiprocessor setting, the bus needs to be locked for the entire duration of the atomic read and write operation
⇒ expensive for simple buses
⇒ very expensive for split-transaction buses
⇒ modern processors instead use load-reserve & store-conditional
Load-reserve & Store-conditional
Special register(s) to hold reservation flag and
address, and the outcome of store-conditional
Load-reserve(R, a):
    <flag, adr> ← <1, a>;
    R ← M[a];

Store-conditional(a, R):
    if <flag, adr> == <1, a> then
        cancel other procs' reservation on a;
        M[a] ← <R>;
        status ← succeed;
    else status ← fail;
If the snooper sees a store transaction to the address
in the reserve register, the reserve bit is set to 0
• Several processors may reserve ‘a’ simultaneously
• These instructions are like ordinary loads and stores
with respect to the bus traffic
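The reservation mechanism can be modeled directly (a toy, single-reservation sketch; the function names are mine, and snoop_store models the snooper observing another processor's store to the reserved address):

```c
#include <stdbool.h>

/* Toy model of one processor's reservation register, following the
 * pseudocode above. mem[] stands in for M[]. */
static int  mem[16];
static bool res_flag = false;
static int  res_adr  = -1;

int load_reserve(int a) {
    res_flag = true;           /* <flag, adr> <- <1, a> */
    res_adr  = a;
    return mem[a];             /* R <- M[a] */
}

bool store_conditional(int a, int v) {
    if (res_flag && res_adr == a) {
        mem[a] = v;            /* M[a] <- R */
        res_flag = false;
        return true;           /* status <- succeed */
    }
    return false;              /* status <- fail */
}

/* Snooper sees a store transaction to the reserved address:
 * the reservation is cleared, so a later SC fails. */
void snoop_store(int a) {
    if (res_flag && res_adr == a)
        res_flag = false;
}
```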
Performance:
Load-reserve & Store-conditional
The total number of memory (bus) transactions is not necessarily reduced, but splitting an atomic instruction into load-reserve & store-conditional:
• increases bus utilization (and reduces processor stall time), especially in split-transaction buses
• reduces the cache ping-pong effect, because processors trying to acquire a semaphore do not have to perform a store each time
Out-of-Order Loads/Stores & CC
[Figure, CPU/Memory interface: the CPU's load/store buffers sit in front of the cache (line states I/S/E); a snooper between cache and memory handles Wb-req, Inv-req, and Inv-rep; the cache issues S-req/E-req to memory and receives S-rep/E-rep; write-backs leave via pushout (Wb-rep).]
Blocking caches: one request at a time + CC ⇒ SC
Non-blocking caches: multiple requests (to different addresses) concurrently + CC ⇒ relaxed memory models
CC ensures that all processors observe the same
order of loads and stores to an address
Next time: Designing a Cache Coherence Protocol
Thank you !
2 Processor Example
For a block b, the state in P1's cache follows the MESI diagram:
  E → M: P1 writes;  E → E: P1 reads;  M → M: P1 writes or reads;
  M, E → S: P2 reads (P1 writes back if modified);
  M, E → I: P2 intends to write;
  S → M: P1 intends to write;  S → I: P2 intends to write;
  I → M: write miss;  I → S or E: read miss.

The state of block b in P2's cache follows the same diagram with the roles of P1 and P2 exchanged.