Sequential Consistency
and
Cache Coherence Protocols
Arvind
Computer Science and Artificial Intelligence Lab
M.I.T.
Based on the material prepared by
Arvind and Krste Asanovic
6.823 L17- 2
Arvind
Memory Consistency in SMPs
[Figure: CPU-1 and CPU-2, each with a private cache holding A = 100, connected over the CPU-Memory bus to a memory that also holds A = 100.]
Suppose CPU-1 updates A to 200.
write-back: memory and cache-2 have stale values
write-through: cache-2 has a stale value
Do these stale values matter?
What is the view of shared memory for programming?
November 9, 2005
Write-back Caches & SC
prog T1:              prog T2:
  ST X, 1               LD Y, R1
  ST Y, 11              ST Y', R1
                        LD X, R2
                        ST X', R2

Initially memory holds X=0, Y=10.

• T1 is executed:              cache-1: X=1, Y=11         memory: X=0, Y=10
• cache-1 writes back Y:                                  memory: X=0, Y=11
• T2 executed:                 cache-2: Y=11, Y'=11,
                                        X=0,  X'=0        memory: X=0, Y=11
• cache-1 writes back X:                                  memory: X=1, Y=11
• cache-2 writes back X' & Y':                            memory: X=1, Y=11, X'=0, Y'=11

The final state, X'=0 with Y'=11, is incoherent under SC: T2's LD Y returned 11, so T1's ST X, 1 must already have happened, yet T2's LD X returned the stale 0.
Write-through Caches & SC
prog T1:              prog T2:
  ST X, 1               LD Y, R1
  ST Y, 11              ST Y', R1
                        LD X, R2
                        ST X', R2

Initially: cache-1: X=0, Y=10;  cache-2: X=0;  memory: X=0, Y=10.

• T1 executed:  cache-1: X=1, Y=11    memory: X=1, Y=11 (writes go through)
                cache-2 still holds the stale copy X=0
• T2 executed:  cache-2: Y=11, Y'=11, X=0 (stale hit), X'=0
                memory: X=1, Y=11, X'=0, Y'=11
Write-through caches don’t preserve
sequential consistency either
Maintaining Sequential Consistency
SC is sufficient for correct producer-consumer
and mutual exclusion code (e.g., Dekker)
Multiple copies of a location in various caches
can cause SC to break down.
Hardware support is required such that
• only one processor at a time has write
permission for a location
• no processor can load a stale copy of
the location after a write
⇒ cache coherence protocols
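The producer-consumer handoff that SC makes correct can be sketched in C11 (an illustrative sketch, not from the slides; C11 seq_cst atomics provide the SC-style ordering that the hardware protocols below are meant to preserve):

```c
#include <stdatomic.h>

/* Illustrative sketch: under sequential consistency the consumer can
 * never observe flag == 1 and then read the old value of data, because
 * the two stores in produce() cannot appear reordered. */
static int data = 0;
static _Atomic int flag = 0;

void produce(void) {
    data = 42;                /* ST X */
    atomic_store(&flag, 1);   /* ST Y */
}

int consume(void) {
    while (atomic_load(&flag) == 0)   /* LD Y: wait for the flag */
        ;
    return data;                      /* LD X: must see 42 */
}
```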
A System with Multiple Caches
[Figure: several processors, each with a private L1 cache; groups of L1s share an L2 cache; the L2s connect through an interconnect to memory M.]
• Modern systems often have hierarchical caches
• Each cache has exactly one parent but can have zero
or more children
• Only a parent and its children can communicate
directly
• Inclusion property is maintained between a parent
and its children, i.e.,
a ∈ Li ⇒ a ∈ Li+1
Cache Coherence Protocols for SC
write request:
the address is invalidated (updated) in all other caches before (after) the write is performed
read request:
if a dirty copy is found in some cache, a write-back is performed before the memory is read
We will focus on Invalidation protocols
as opposed to Update protocols
Warmup: Parallel I/O
[Figure: the processor's cache and a DMA engine (connected to a disk) both sit on the Memory Bus, each driving address (A), data (D), and R/W lines to physical memory.]
• Page transfers occur while the processor is running
• Either the Cache or the DMA engine can be the Bus Master and effect transfers
DMA stands for Direct Memory Access
Problems with Parallel I/O
[Figure: cached portions of a page sit in the processor's cache while DMA transfers move the page between physical memory and disk.]
Memory → Disk: physical memory may be stale if the cache's copy is dirty.
Disk → Memory: the cache may hold stale data corresponding to the overwritten memory.
Snoopy Cache (Goodman 1983)
• Idea: Have cache watch (or snoop upon)
DMA transfers, and then “do the right
thing”
• Snoopy cache tags are dual-ported
[Figure: the cache has two tag ports. One address/R/W port drives the Memory Bus when the cache is Bus Master; a snoopy read port, with tags and state attached to the Memory Bus, watches bus traffic. Data (D, lines) moves between processor and cache.]
Snoopy Cache Actions
Observed Bus Cycle       Cache State          Cache Action
Read Cycle               Address not cached   No action
(Memory → Disk)          Cached, unmodified   No action
                         Cached, modified     Cache intervenes
Write Cycle              Address not cached   No action
(Disk → Memory)          Cached, unmodified   Cache purges its copy
                         Cached, modified     ???
Shared Memory Multiprocessor
[Figure: processors M1, M2, and M3, each behind a snoopy cache, plus a DMA engine with disks, all attached to the memory bus and physical memory.]
Use snoopy mechanism to keep all
processors’ view of memory coherent
Cache State Transition Diagram
The MSI protocol
Each cache line has an address tag and state bits:
  M: Modified
  S: Shared
  I: Invalid

State transitions for a line in processor P1's cache:
  M → M: P1 reads or writes
  M → S: other processor reads (P1 writes back)
  M → I: other processor intends to write (P1 writes back)
  S → M: P1 intends to write
  S → S: read by any processor
  S → I: other processor intends to write
  I → M: P1 write miss (intent to write)
  I → S: P1 read miss
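The MSI diagram can be transcribed as a next-state function; a minimal sketch (the event names are mine, not from the slides):

```c
/* MSI next-state function for one cache line, following the diagram
 * above. "Bus" events are actions observed from another processor. */
typedef enum { MSI_M, MSI_S, MSI_I } msi_state;

typedef enum {
    EV_PR_READ,        /* this processor reads                */
    EV_PR_WRITE,       /* this processor writes               */
    EV_BUS_READ,       /* another processor reads             */
    EV_BUS_WR_INTENT   /* another processor intends to write  */
} msi_event;

msi_state msi_next(msi_state s, msi_event e) {
    switch (s) {
    case MSI_M:
        if (e == EV_BUS_READ)      return MSI_S;  /* write back, then share */
        if (e == EV_BUS_WR_INTENT) return MSI_I;  /* write back, invalidate */
        return MSI_M;                             /* own reads/writes hit   */
    case MSI_S:
        if (e == EV_PR_WRITE)      return MSI_M;  /* intent to write on bus */
        if (e == EV_BUS_WR_INTENT) return MSI_I;
        return MSI_S;                             /* reads keep it shared   */
    default: /* MSI_I */
        if (e == EV_PR_READ)       return MSI_S;  /* read miss  */
        if (e == EV_PR_WRITE)      return MSI_M;  /* write miss */
        return MSI_I;                             /* other bus traffic: no copy here */
    }
}
```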
2 Processor Example
Access sequence: P1 reads, P1 writes, P2 reads, P2 writes, P1 reads, P1 writes, P2 writes, P1 writes.

The block's state in P1's cache follows the MSI diagram, with "other processor" = P2:
  M → M: P1 reads or writes;   M → S: P2 reads (P1 writes back);
  M → I: P2 intends to write;  S → M: P1 intends to write;
  S → I: P2 intends to write;  I → M: write miss;  I → S: read miss.

The block's state in P2's cache follows the same diagram with the roles of P1 and P2 exchanged.
Observation
[The MSI state diagram for P1, repeated from the Cache State Transition Diagram slide.]
• If a line is in the M state then no other
cache can have a copy of the line!
– Memory stays coherent, multiple differing copies
cannot exist
MESI: An Enhanced MSI protocol
Each cache line has an address tag and state bits:
  M: Modified Exclusive
  E: Exclusive, unmodified
  S: Shared
  I: Invalid

State transitions for a line in processor P1's cache:
  M → M: P1 writes or reads
  E → M: P1 writes (no bus transaction needed)
  E → E: P1 reads
  M, E → S: other processor reads (P1 writes back if modified)
  M, E → I: other processor intends to write
  S → M: P1 intends to write
  S → S: read by any processor
  S → I: other processor intends to write
  I → S: P1 read miss, shared (another cache holds the line)
  I → E: P1 read miss, not shared
  I → M: P1 write miss
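The MESI transitions can likewise be written as a next-state function; a sketch (event names are mine) showing the key payoff, the silent E → M upgrade on a write:

```c
/* MESI next-state function. Extends MSI with the Exclusive state so a
 * write to an unshared, clean line needs no bus transaction. A read
 * miss reports whether another cache already holds the line. */
typedef enum { MESI_M, MESI_E, MESI_S, MESI_I } mesi_state;

typedef enum {
    EV2_PR_READ_SHARED,   /* read miss; another cache has the line  */
    EV2_PR_READ_EXCL,     /* read miss; no other cache has the line */
    EV2_PR_READ,          /* read hit                               */
    EV2_PR_WRITE,
    EV2_BUS_READ,
    EV2_BUS_WR_INTENT
} mesi_event;

mesi_state mesi_next(mesi_state s, mesi_event e) {
    switch (s) {
    case MESI_M:
        if (e == EV2_BUS_READ)      return MESI_S;  /* write back, share */
        if (e == EV2_BUS_WR_INTENT) return MESI_I;  /* write back, drop  */
        return MESI_M;
    case MESI_E:
        if (e == EV2_PR_WRITE)      return MESI_M;  /* silent: no bus op */
        if (e == EV2_BUS_READ)      return MESI_S;
        if (e == EV2_BUS_WR_INTENT) return MESI_I;
        return MESI_E;
    case MESI_S:
        if (e == EV2_PR_WRITE)      return MESI_M;  /* intent to write */
        if (e == EV2_BUS_WR_INTENT) return MESI_I;
        return MESI_S;
    default: /* MESI_I */
        if (e == EV2_PR_READ_SHARED) return MESI_S;
        if (e == EV2_PR_READ_EXCL)   return MESI_E;
        if (e == EV2_PR_WRITE)       return MESI_M; /* write miss */
        return MESI_I;
    }
}
```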
Five-minute break to stretch your legs
Cache Coherence State Encoding
[Figure: the block address splits into tag, index, and offset; each cache line holds an address tag, a valid (V) bit, a modified/dirty (M) bit, and the data block; a tag match produces Hit? and the offset selects the word.]

Valid and dirty bits can be used to encode the S, I, and (E, M) states:
  V=0, D=x ⇒ Invalid
  V=1, D=0 ⇒ Shared (not dirty)
  V=1, D=1 ⇒ Exclusive (dirty)
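The encoding can be read back with a trivial helper (a sketch; note that E and M share one encoding, so the decoder cannot tell them apart):

```c
/* Decode the valid (V) and dirty (D) bits of a cache line into a
 * coherence-state letter. 'M' here stands for the merged (E, M) case. */
char vd_decode(int v, int d) {
    if (!v)
        return 'I';          /* V=0, D=x: Invalid          */
    return d ? 'M' : 'S';    /* V=1: dirty => E/M, else S  */
}
```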
2-Level Caches
CPU CPU CPU CPU
L1 $ L1 $ L1 $ L1 $
L2 $ L2 $ L2 $ L2 $
Snooper Snooper Snooper Snooper
• Processors often have two-level caches
• Small L1 on chip, large L2 off chip
• Inclusion property: entries in L1 must be in L2
invalidation in L2 ⇒ invalidation in L1
• Snooping on L2 does not affect CPU-L1 bandwidth
What problem could occur?
Intervention
[Figure: cache-1 holds A = 200 (modified); cache-2 has no copy; memory still holds the stale value A = 100.]
When a read-miss for A occurs in cache-2,
a read request for A is placed on the bus
• Cache-1 needs to supply the data and change its state to Shared
• The memory may respond to the request also!
Does memory know it has stale data?
Cache-1 needs to intervene through the memory controller to supply the correct data to cache-2
False Sharing
state | blk addr | data0 | data1 | ... | dataN
A cache block contains more than one word
Cache-coherence is done at the block-level and
not word-level
Suppose M1 writes word i and M2 writes word k, and both words have the same block address.
What can happen?
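The block ping-pongs between M1's and M2's caches even though the two words are logically independent. A common software mitigation is to pad or align the hot words into different blocks; a sketch, assuming a 64-byte block:

```c
#include <stddef.h>

#define BLOCK_SIZE 64  /* assumed cache block size in bytes */

/* Both counters share one block: each processor's write invalidates
 * the other's copy, so the line ping-pongs (false sharing). */
struct counters_false_sharing {
    long m1_count;
    long m2_count;
};

/* Padding pushes the second counter into its own block, so each
 * processor's writes invalidate only its own line. */
struct counters_padded {
    long m1_count;
    char pad[BLOCK_SIZE - sizeof(long)];
    long m2_count;
};
```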
Synchronization and Caches:
Performance Issues
Processors 1, 2, and 3 each execute:
     R ← 1
L:   swap(mutex, R);
     if <R> then goto L;
     <critical section>
     M[mutex] ← 0;

[Figure: the three processors' caches on the CPU-Memory bus, with mutex = 1 held in one cache.]
Cache-coherence protocols will cause mutex to ping-pong
between P1’s and P2’s caches.
Ping-ponging can be reduced by first reading the mutex
location (non-atomically) and executing a swap only if it is
found to be zero.
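That test-and-test-and-set idea can be sketched with C11 atomics (an illustration, not from the slides; the slide's swap corresponds to atomic_exchange):

```c
#include <stdatomic.h>

static _Atomic int mutex = 0;   /* 0 = free, 1 = held */

void mutex_lock(void) {
    for (;;) {
        /* Spin on an ordinary load first: it hits in the local cache
         * and generates no bus traffic while the lock is held. */
        while (atomic_load(&mutex) != 0)
            ;
        /* Only when the lock looks free, attempt the bus-visible swap. */
        if (atomic_exchange(&mutex, 1) == 0)
            return;   /* swap returned 0: we acquired the lock */
    }
}

void mutex_unlock(void) {
    atomic_store(&mutex, 0);
}
```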
Performance Related to Bus Occupancy
In general, a read-modify-write instruction
requires two memory (bus) operations without
intervening memory operations by other
processors
In a multiprocessor setting, the bus needs to be locked for the entire duration of the atomic read and write operation
⇒ expensive for simple buses
⇒ very expensive for split-transaction buses
⇒ modern processors instead use load-reserve & store-conditional
Load-reserve & Store-conditional
Special register(s) to hold reservation flag and
address, and the outcome of store-conditional
Load-reserve(R, a):
    <flag, adr> ← <1, a>;
    R ← M[a];

Store-conditional(a, R):
    if <flag, adr> == <1, a> then
        cancel other procs' reservation on a;
        M[a] ← <R>;
        status ← succeed;
    else status ← fail;
If the snooper sees a store transaction to the address
in the reserve register, the reserve bit is set to 0
• Several processors may reserve ‘a’ simultaneously
• These instructions are like ordinary loads and stores
with respect to the bus traffic
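The reservation mechanism can be modeled directly (a toy, single-reservation sketch; the function names are mine, and snoop_store models the snooper observing another processor's store to the reserved address):

```c
#include <stdbool.h>

/* Toy model of one processor's reservation register, following the
 * pseudocode above. mem[] stands in for M[]. */
static int  mem[16];
static bool res_flag = false;
static int  res_adr  = -1;

int load_reserve(int a) {
    res_flag = true;           /* <flag, adr> <- <1, a> */
    res_adr  = a;
    return mem[a];             /* R <- M[a] */
}

bool store_conditional(int a, int v) {
    if (res_flag && res_adr == a) {
        mem[a] = v;            /* M[a] <- R */
        res_flag = false;
        return true;           /* status <- succeed */
    }
    return false;              /* status <- fail */
}

/* Snooper sees a store transaction to the reserved address:
 * the reservation is cleared, so a later SC fails. */
void snoop_store(int a) {
    if (res_flag && res_adr == a)
        res_flag = false;
}
```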
Performance:
Load-reserve & Store-conditional
The total number of memory (bus) transactions is not necessarily reduced, but splitting an atomic instruction into load-reserve & store-conditional:
• increases bus utilization (and reduces processor stall time), especially in split-transaction buses
• reduces the cache ping-pong effect, because processors trying to acquire a semaphore do not have to perform a store each time
Out-of-Order Loads/Stores & CC
[Figure, CPU/Memory interface: the CPU's load/store buffers sit in front of the cache (line states I/S/E); a snooper between cache and memory handles Wb-req, Inv-req, and Inv-rep; the cache issues S-req/E-req to memory and receives S-rep/E-rep; write-backs leave via pushout (Wb-rep).]
Blocking caches: one request at a time + CC ⇒ SC
Non-blocking caches: multiple requests (to different addresses) concurrently + CC ⇒ relaxed memory models
CC ensures that all processors observe the same
order of loads and stores to an address
Next time: Designing a Cache Coherence Protocol
Thank you !
2 Processor Example
For a block b, the state in P1's cache follows the MESI diagram:
  E → M: P1 writes;  E → E: P1 reads;  M → M: P1 writes or reads;
  M, E → S: P2 reads (P1 writes back if modified);
  M, E → I: P2 intends to write;
  S → M: P1 intends to write;  S → I: P2 intends to write;
  I → M: write miss;  I → S or E: read miss.

The state of block b in P2's cache follows the same diagram with the roles of P1 and P2 exchanged.