
Sequential Consistency and Cache Coherence Protocols

Arvind
Computer Science and Artificial Intelligence Lab
M.I.T.

Based on the material prepared by Arvind and Krste Asanovic

6.823, November 9, 2005


Memory Consistency in SMPs

[Figure: CPU-1 and CPU-2, each with a private cache holding A = 100, are connected by the CPU-Memory bus to a memory that also holds A = 100]

Suppose CPU-1 updates A to 200.
• write-back: memory and cache-2 have stale values
• write-through: cache-2 has a stale value

Do these stale values matter?
What is the view of shared memory for programming?
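As a concrete illustration of why stale values matter, here is a minimal producer/consumer sketch (not from the slides; the thread names and the use of C11 seq_cst atomics are assumptions). The consumer's reasoning — "if I see flag == 1, I must also see data == 200" — is exactly the sequentially consistent view of shared memory that the caches must preserve.

```c
/* A minimal sketch: producer/consumer signalling that relies on a
 * sequentially consistent view of memory.  C11 seq_cst atomics are used so
 * the example is well-defined C; the hardware question in the lecture is
 * what the caches must do so that this reasoning remains valid. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

atomic_int data = 0;   /* payload, like location A in the figure */
atomic_int flag = 0;   /* 1 once data has been published         */

static void *producer(void *arg) {
    atomic_store(&data, 200);   /* CPU-1 updates A to 200 */
    atomic_store(&flag, 1);     /* then signals readiness */
    return NULL;
}

static void *consumer(void *arg) {
    while (atomic_load(&flag) == 0)
        ;                                          /* spin until signalled   */
    printf("data = %d\n", atomic_load(&data));     /* prints 200 under SC    */
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```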

Write-back Caches & SC

prog T1               prog T2
ST X, 1               LD Y, R1
ST Y, 11              ST Y', R1
                      LD X, R2
                      ST X', R2

                               cache-1      memory                  cache-2
• T1 is executed               X=1, Y=11    X=0, Y=10               —
• cache-1 writes back Y        X=1, Y=11    X=0, Y=11               —
• T2 executed                  X=1, Y=11    X=0, Y=11               Y=11, X=0, Y'=11, X'=0
• cache-1 writes back X        X=1, Y=11    X=1, Y=11               Y=11, X=0, Y'=11, X'=0
• cache-2 writes back X' & Y'  X=1, Y=11    X=1, Y=11, Y'=11, X'=0  Y=11, X=0, Y'=11, X'=0

The final values Y' = 11 and X' = 0 are inconsistent with sequential
consistency: T2 observed T1's store to Y but then loaded the old value of X,
even though T1 stored X before Y. No interleaving of T1 and T2 can produce
this result.

Write-through Caches & SC

prog T1               prog T2
ST X, 1               LD Y, R1
ST Y, 11              ST Y', R1
                      LD X, R2
                      ST X', R2

                      cache-1      memory                  cache-2
• initially           X=0, Y=10    X=0, Y=10               X=0
• T1 executed         X=1, Y=11    X=1, Y=11               X=0 (stale)
• T2 executed         X=1, Y=11    X=1, Y=11, Y'=11, X'=0  Y=11, X=0, Y'=11, X'=0

Because cache-2 already holds a stale copy of X, T2 reads Y=11 but X=0.
Write-through caches don't preserve sequential consistency either.

Maintaining Sequential Consistency

SC is sufficient for correct producer-consumer and mutual exclusion code
(e.g., Dekker).

Multiple copies of a location in various caches can cause SC to break down.

Hardware support is required such that
• only one processor at a time has write permission for a location
• no processor can load a stale copy of the location after a write

⇒ cache coherence protocols


A System with Multiple Caches

[Figure: several processors, each with a private L1 cache; groups of L1 caches share an L2 cache; the L2 caches connect through an interconnect to memory M]

• Modern systems often have hierarchical caches
• Each cache has exactly one parent but can have zero or more children
• Only a parent and its children can communicate directly
• Inclusion property is maintained between a parent and its children, i.e.,
  a ∈ Li ⇒ a ∈ Li+1


Cache Coherence Protocols for SC

write request:
  the address is invalidated in all other caches before the write is
  performed (in an update protocol, the other copies are instead updated
  after the write)

read request:
  if a dirty copy is found in some cache, a write-back is performed before
  the memory is read

We will focus on invalidation protocols as opposed to update protocols.


Warmup: Parallel I/O


[Figure: a processor with a cache, and a DMA engine attached to a disk, both
connected by address (A), data (D), and R/W lines to the memory bus and
physical memory]

• Page transfers occur while the processor is running
• Either the cache or the DMA engine can be the bus master and effect
  transfers

DMA stands for Direct Memory Access.


Problems with Parallel I/O


[Figure: as before, but DMA transfers move pages between the disk and
physical memory while portions of those pages are cached]

Memory → Disk: physical memory may be stale if the cache copy is dirty

Disk → Memory: the cache may still hold old data corresponding to the
memory just overwritten by the DMA transfer


Snoopy Cache (Goodman 1983)

• Idea: Have the cache watch (or snoop upon) DMA transfers, and then
  "do the right thing"
• Snoopy cache tags are dual-ported: one address port is used to drive the
  memory bus when the cache is bus master, the other is a snoopy read port
  attached to the memory bus

[Figure: the cache sits between the processor (address, data, R/W lines) and
the memory bus; its tags and state array has a second, snoopy read port
watching the bus]


Snoopy Cache Actions


Observed Bus Cycle    Cache State           Cache Action

Read Cycle            Address not cached    No action
(Memory → Disk)       Cached, unmodified    No action
                      Cached, modified      Cache intervenes

Write Cycle           Address not cached    No action
(Disk → Memory)       Cached, unmodified    Cache purges its copy
                      Cached, modified      ???
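The table translates directly into a dispatch routine. The sketch below uses hypothetical names (not from the slides) to show one way a snooper might encode it; the modified-line-on-a-write-cycle case is left as the same open question as in the table.

```c
/* Snoop-action dispatch for DMA traffic, mirroring the table above.
 * Names (bus_cycle_t, snoop_state_t, snoop_action) are illustrative. */
#include <stdio.h>

typedef enum { READ_CYCLE, WRITE_CYCLE } bus_cycle_t;        /* observed DMA bus cycle */
typedef enum { NOT_CACHED, CLEAN, MODIFIED } snoop_state_t;  /* state of the address   */
typedef enum { NO_ACTION, INTERVENE, PURGE, UNRESOLVED } action_t;

action_t snoop_action(bus_cycle_t cycle, snoop_state_t st) {
    if (st == NOT_CACHED)
        return NO_ACTION;                      /* address not cached        */
    if (cycle == READ_CYCLE)                   /* Memory -> Disk            */
        return (st == MODIFIED) ? INTERVENE    /* supply the dirty data     */
                                : NO_ACTION;   /* clean copy: memory is ok  */
    /* Write cycle: Disk -> Memory */
    return (st == CLEAN) ? PURGE               /* drop the now-stale copy   */
                         : UNRESOLVED;         /* the "???" in the table    */
}

int main(void) {
    printf("%d\n", snoop_action(READ_CYCLE, MODIFIED));  /* prints 1 (INTERVENE) */
    return 0;
}
```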


Shared Memory Multiprocessor

[Figure: processors M1, M2, and M3, each with a snoopy cache, share a memory
bus with physical memory and a DMA engine connected to disks]

Use the snoopy mechanism to keep all processors' view of memory coherent.


Cache State Transition Diagram

The MSI protocol

Each cache line has a state tag in addition to its address tag:
  M: Modified
  S: Shared
  I: Invalid

State transitions for a line in processor P1's cache:
  M → M: P1 reads or writes
  M → S: other processor reads; P1 writes the line back
  M → I: other processor signals intent to write
  S → M: P1 signals intent to write
  S → S: read by any processor
  S → I: other processor signals intent to write
  I → M: P1 write miss
  I → S: P1 read miss
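A minimal sketch of how these transitions could be encoded, for example in a simple coherence simulator. The code is not from the lecture; the type and function names (msi_state_t, event_t, msi_next) are illustrative assumptions.

```c
/* Minimal MSI transition sketch for one cache line. */
#include <stdio.h>

typedef enum { I, S, M } msi_state_t;

typedef enum {
    PROC_READ,           /* this processor loads the line            */
    PROC_WRITE,          /* this processor stores to the line        */
    BUS_READ,            /* another processor's read miss is snooped */
    BUS_INTENT_TO_WRITE  /* another processor wants to write         */
} event_t;

/* Returns the next state; *writeback is set when the dirty line must be
 * written back to memory (or supplied by intervention). */
msi_state_t msi_next(msi_state_t cur, event_t ev, int *writeback) {
    *writeback = 0;
    switch (cur) {
    case M:
        if (ev == BUS_READ)            { *writeback = 1; return S; }
        if (ev == BUS_INTENT_TO_WRITE) { *writeback = 1; return I; }
        return M;                      /* own reads/writes hit          */
    case S:
        if (ev == PROC_WRITE)          return M;  /* broadcast intent to write */
        if (ev == BUS_INTENT_TO_WRITE) return I;
        return S;                      /* reads by anyone keep it shared */
    case I:
    default:
        if (ev == PROC_READ)           return S;  /* read miss  */
        if (ev == PROC_WRITE)          return M;  /* write miss */
        return I;
    }
}

int main(void) {
    int wb;
    msi_state_t st = I;
    st = msi_next(st, PROC_WRITE, &wb);          /* I -> M                 */
    st = msi_next(st, BUS_READ, &wb);            /* M -> S, with writeback */
    printf("state=%d writeback=%d\n", st, wb);   /* prints state=1 writeback=1 */
    return 0;
}
```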

2 Processor Example
Trace the following access sequence through the state of the line in each
cache:

  P1 reads; P1 writes; P2 reads; P2 writes; P1 reads; P1 writes; P2 writes; P1 writes

[Figure: two copies of the MSI state diagram, one for the line's state in
P1's cache and one for its state in P2's cache. Each cache follows the
transitions of the previous slide; an access by the other processor appears
on the bus as a read or an intent-to-write and drives the "other processor"
arcs (P2's reads move P1's line M → S with a write-back, P2's intent to
write moves it to I, and symmetrically for P2's diagram).]

Observation
[Figure: the MSI state diagram for P1's cache, repeated from the previous
slide]

• If a line is in the M state then no other cache can have a copy of the
  line!
  – Memory stays coherent; multiple differing copies cannot exist

MESI: An Enhanced MSI protocol

Each cache line has a state tag in addition to its address tag:
  M: Modified Exclusive
  E: Exclusive, unmodified
  S: Shared
  I: Invalid

State transitions for a line in processor P1's cache:
  M → M: P1 writes or reads
  M → S: other processor reads; P1 writes the line back
  M → I: other processor signals intent to write
  E → M: P1 writes (no bus transaction needed, since no other cache has a copy)
  E → E: P1 reads
  E → S: other processor reads
  E → I: other processor signals intent to write
  S → M: P1 signals intent to write
  S → S: read by any processor
  S → I: other processor signals intent to write
  I → S: P1 read miss, line is shared (another cache has a copy)
  I → E: P1 read miss, line is not shared
  I → M: P1 write miss

Five-minute break to stretch your legs


Cache Coherence State Encoding


[Figure: the block address is split into tag, index, and offset; each cache
entry holds an address tag, a valid (V) bit, a dirty (D) bit, and the data
block; on a hit the offset selects a word from the block]

Valid and dirty bits can be used to encode the S, I, and (E, M) states:
  V=0, D=x ⇒ Invalid
  V=1, D=0 ⇒ Shared (not dirty)
  V=1, D=1 ⇒ Exclusive (dirty)
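A small sketch of this encoding (not from the slides; the struct and helper names are assumptions): the coherence state is recovered from the V and D bits stored alongside the tag.

```c
/* Cache line whose coherence state is encoded with a valid bit and a dirty
 * bit, plus a helper that decodes the two bits into the states above. */
#include <stdint.h>
#include <stdio.h>

#define BLOCK_WORDS 8          /* assumed block size: 8 x 32-bit words */

typedef enum { LINE_INVALID, LINE_SHARED, LINE_EXCLUSIVE } line_state_t;

typedef struct {
    uint32_t tag;              /* address tag                       */
    unsigned valid : 1;        /* V bit                             */
    unsigned dirty : 1;        /* D bit (the "M" bit in the figure) */
    uint32_t data[BLOCK_WORDS];
} cache_line_t;

line_state_t line_state(const cache_line_t *l) {
    if (!l->valid) return LINE_INVALID;          /* V=0, D=x  */
    return l->dirty ? LINE_EXCLUSIVE             /* V=1, D=1  */
                    : LINE_SHARED;               /* V=1, D=0  */
}

int main(void) {
    cache_line_t l = { .tag = 0x12, .valid = 1, .dirty = 0 };
    printf("%d\n", line_state(&l));   /* prints 1 (LINE_SHARED) */
    return 0;
}
```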


2-Level Caches
[Figure: four CPUs, each with a private L1 cache backed by its own L2 cache;
each L2 has a snooper watching the shared bus]

• Processors often have two-level caches
• Small L1 on chip, large L2 off chip
• Inclusion property: entries in L1 must be in L2
  invalidation in L2 ⇒ invalidation in L1
• Snooping on L2 does not affect CPU-L1 bandwidth

What problem could occur?

Intervention

[Figure: cache-1 holds A = 200 (modified), cache-2 does not hold A, and
memory still holds the stale value A = 100; both caches sit on the
CPU-Memory bus]

When a read miss for A occurs in cache-2, a read request for A is placed on
the bus
• Cache-1 needs to supply the data & change its state to shared
• The memory may respond to the request also!
  Does memory know it has stale data?
Cache-1 needs to intervene through the memory controller to supply the
correct data to cache-2.

False Sharing

[Cache block layout: state | blk addr | data0 | data1 | ... | dataN]

A cache block contains more than one word.

Cache coherence is done at the block level, not at the word level.

Suppose M1 writes word_i and M2 writes word_k and both words have the same
block address.

What can happen?
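What can happen is false sharing: even though M1 and M2 never touch the same word, the block containing both words ping-pongs between their caches, and every write by one processor invalidates the other's copy. The sketch below is hypothetical (not from the slides) and assumes a 64-byte block; it shows the effect and the usual fix of padding each word out to its own block.

```c
/* Two threads update different words.  In shared_words[] the words share
 * one cache block, so the block ping-pongs between the caches; in
 * counters[] each word is padded to its own block, which avoids this. */
#include <pthread.h>
#include <stdio.h>

#define ITERS 10000000
#define BLOCK_BYTES 64          /* assumed cache block size */

struct { long count; char pad[BLOCK_BYTES - sizeof(long)]; }
    counters[2];                /* one padded counter per thread        */

long shared_words[2];           /* same two counters, but in one block  */

static void *bump_shared(void *arg) {
    long idx = (long)arg;
    for (long i = 0; i < ITERS; i++)
        shared_words[idx]++;    /* falsely shared: block ping-pongs     */
    return NULL;
}

static void *bump_padded(void *arg) {
    long idx = (long)arg;
    for (long i = 0; i < ITERS; i++)
        counters[idx].count++;  /* each word in its own block           */
    return NULL;
}

int main(void) {
    pthread_t t[2];
    /* Time each phase with your favourite timer; the padded version is
     * typically several times faster on a multicore machine. */
    for (long i = 0; i < 2; i++) pthread_create(&t[i], NULL, bump_shared, (void *)i);
    for (long i = 0; i < 2; i++) pthread_join(t[i], NULL);
    for (long i = 0; i < 2; i++) pthread_create(&t[i], NULL, bump_padded, (void *)i);
    for (long i = 0; i < 2; i++) pthread_join(t[i], NULL);
    printf("%ld %ld\n", shared_words[0], counters[1].count);
    return 0;
}
```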


Synchronization and Caches: Performance Issues
Processors 1, 2, and 3 each execute:

     R ← 1
  L: swap(mutex, R);
     if <R> then goto L;
     <critical section>
     M[mutex] ← 0;

[Figure: the three processors' caches on the CPU-Memory bus; the line
holding mutex = 1 sits in one cache at a time]

Cache-coherence protocols will cause mutex to ping-pong between P1's and
P2's caches.

Ping-ponging can be reduced by first reading the mutex location
(non-atomically) and executing a swap only if it is found to be zero.
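A sketch of that optimization (test-and-test-and-set) in C11 atomics — not from the slides; atomic_exchange plays the role of swap(mutex, R).

```c
/* Spin by *reading* the lock and attempt the atomic swap only when it
 * looks free, so the line can stay Shared in every spinner's cache. */
#include <stdatomic.h>

atomic_int mutex = 0;           /* 0 = free, 1 = held */

void acquire(void) {
    for (;;) {
        /* Plain read: misses once, then hits in the local cache while the
         * lock is held, so no bus traffic and no ping-pong. */
        while (atomic_load(&mutex) != 0)
            ;
        /* Lock looks free: now do the atomic swap, like swap(mutex, R). */
        if (atomic_exchange(&mutex, 1) == 0)
            return;             /* we got it */
        /* Someone else won the race; go back to read-only spinning. */
    }
}

void release(void) {
    atomic_store(&mutex, 0);    /* M[mutex] <- 0 */
}
```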


Performance Related to Bus Occupancy
In general, a read-modify-write instruction requires two memory (bus)
operations without intervening memory operations by other processors.

In a multiprocessor setting, the bus needs to be locked for the entire
duration of the atomic read and write operations
⇒ expensive for simple buses
⇒ very expensive for split-transaction buses

Modern processors instead use
  load-reserve
  store-conditional

Load-reserve & Store-conditional

Special register(s) hold the reservation flag and address, and the outcome
of store-conditional.

Load-reserve(R, a):              Store-conditional(a, R):
  <flag, adr> ← <1, a>;            if <flag, adr> == <1, a>
  R ← M[a];                        then cancel other procs'
                                          reservation on a;
                                        M[a] ← <R>;
                                        status ← succeed;
                                   else status ← fail;

If the snooper sees a store transaction to the address in the reserve
register, the reserve bit is set to 0.
• Several processors may reserve 'a' simultaneously
• These instructions are like ordinary loads and stores with respect to the
  bus traffic
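For illustration, here is a sketch (not from the slides) of an atomic fetch-and-add written as a load / compute / conditional-store retry loop in C11. On LL/SC machines such as RISC-V, ARM, or POWER, compilers typically lower the weak compare-exchange to a load-reserve / store-conditional pair, so the loop retries exactly when the store-conditional reports failure.

```c
/* Atomic fetch-and-add built from a retry loop around a weak CAS. */
#include <stdatomic.h>
#include <stdio.h>

int fetch_and_add(atomic_int *a, int delta) {
    int old = atomic_load(a);                        /* like load-reserve */
    /* Retry while the conditional store "fails", i.e. someone else wrote
     * the location (or the reservation was lost) between load and store. */
    while (!atomic_compare_exchange_weak(a, &old, old + delta))
        ;                                            /* old is reloaded on failure */
    return old;
}

int main(void) {
    atomic_int counter = 5;
    int prev = fetch_and_add(&counter, 3);
    printf("prev=%d now=%d\n", prev, atomic_load(&counter));  /* prev=5 now=8 */
    return 0;
}
```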


Performance: Load-reserve & Store-conditional

The total number of memory (bus) transactions is not necessarily reduced,
but splitting an atomic instruction into load-reserve & store-conditional:

• increases bus utilization (and reduces processor stall time), especially
  in split-transaction buses

• reduces the cache ping-pong effect, because processors trying to acquire
  a semaphore do not have to perform a store each time


Out-of-Order Loads/Stores & CC

[Figure: the CPU's load/store buffers feed a cache whose lines are in I/S/E
states; a snooper watches the bus for Wb-req, Inv-req, and Inv-rep messages
and pushes out dirty lines (Wb-rep); the cache issues S-req/E-req and
receives S-rep/E-rep over the CPU/Memory interface]

Blocking caches
  One request at a time + CC ⇒ SC

Non-blocking caches
  Multiple requests (to different addresses) concurrently + CC
  ⇒ relaxed memory models

CC ensures that all processors observe the same order of loads and stores
to an address.

next time

Designing a Cache Coherence Protocol


Thank you!


2 Processor Example

[Figure: the MESI state diagram instantiated twice for a block b — once for
its state in P1's cache and once for its state in P2's cache. Each cache
follows the MESI transitions of the earlier slide: P2's reads move P1's copy
of b from M (with a write-back) or E to S, P2's intent-to-write transactions
move it to I, and P1's own read and write misses and upgrades drive the
I → E/M, E → M, and S → M arcs; P2's diagram is symmetric, driven by P1's
bus transactions.]
