Cache coherence protocols ensure that all processors see a consistent view of shared memory locations. Snooping and directory-based protocols are commonly used to enforce coherence. Snooping relies on broadcasting cache coherence messages on a shared bus, while directory protocols track shared data using a centralized directory. Both aim to minimize coherence misses through cache invalidation or updating as data moves between caches and memory. The performance impacts of coherence protocols include increased miss rates with more processors and communication-intensive workloads.


Multiprocessor Cache Coherency

CS448

What is Cache Coherence?


Without coherence, two processors can hold two different values for
the same memory location.

[Figure: write-through cache example in which two processors end up with different values for the same location]

2

Terminology

Coherence
Defines what values can be returned by a read. A memory system is coherent if:
- If P writes to X and then reads X, with no writes to X by other processors in
  between, the value read is the value P wrote.
- If P writes to X and another processor later reads X, and the read and write
  are sufficiently separated in time, the read returns the value P wrote.
- Writes to the same location are serialized: two writes to the same location by
  any two processors are seen in the same order by all processors.

Consistency
Determines when a written value will be returned by a read; we'll need to define
a memory consistency model.
For now, assume a write does not complete until all processors have seen its effect.

Techniques to Enforce Coherence

Directory-based
- A directory holds the sharing status of each block of physical memory.
- Used with DSM machines; scales to larger processor counts, but with higher
  overhead than snooping.

Snooping
- No centralized directory.
- Each cache snoops, or listens on the bus, to maintain coherency among caches.
- Used with centralized shared-memory (CSM) machines built around a bus.
4

Snooping Protocols for Coherency

Write invalidate: when one processor writes, invalidate all copies of the data
that may be in other caches.

Write broadcast (write update): when one processor writes, broadcast the new
value and update any copies that may be in other caches.

[Figure: write-invalidate example with a write-back cache]

Performance Differences

Multiple writes to the same word
- Multiple broadcasts with an update protocol.
- Only one initial invalidation with an invalidate protocol.

Multiword cache blocks
- Invalidation works on whole cache blocks.
- Update must work on the individual words (or bytes) that are written.

Delay between writing a word and reading it
- Lower with write update: the data goes straight into the reader's cache.
- Higher with invalidate: the reader takes a cache miss and must go to memory
  (or the writer's cache) to fetch the data.

Usually write invalidate is used, since it needs less bus bandwidth. A rough
bus-traffic comparison is sketched below.


6
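Not from the slides, just a rough illustration of the bandwidth argument above:
a toy model in C that counts bus transactions when one processor writes the same
word W times and R other processors then read it. The one-transaction-per-event
cost and the assumption that the writer already holds the block shared are mine.

/* Toy bus-traffic model: write update vs. write invalidate. */
#include <stdio.h>

int main(void) {
    int W = 10;  /* writes by one processor to the same word */
    int R = 3;   /* other processors that later read the word */

    /* Write update: every write is broadcast on the bus. */
    int update_bus_ops = W;

    /* Write invalidate: one invalidation for the first write (later writes hit
     * in the now-exclusive block), plus one read miss per reader afterward. */
    int invalidate_bus_ops = 1 + R;

    printf("update: %d bus ops, invalidate: %d bus ops\n",
           update_bus_ops, invalidate_bus_ops);
    return 0;
}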

Implementing Invalidation

Bus-based scheme
- The processor performing the invalidation acquires the bus and broadcasts the
  address to invalidate.
- All other processors continuously snoop on the bus, watching the addresses.
- If an invalidated address matches an address in a processor's cache, the
  corresponding data is invalidated.

Serialization of the bus automatically forces serialization of accesses.
7
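A minimal sketch of the snooping check just described, assuming a direct-mapped
cache and invented structure names (cache, cache_line); illustrative only, not
the course's code.

#include <stdbool.h>
#include <stdint.h>

#define NUM_LINES 256
#define BLOCK_BITS 6                 /* 64-byte blocks (assumed) */

struct cache_line {
    bool     valid;
    uint64_t tag;                    /* block address stored in this line */
};

struct cache {
    struct cache_line lines[NUM_LINES];
};

/* Called by the snoop hardware for every invalidation address seen on the bus. */
void snoop_invalidate(struct cache *c, uint64_t addr) {
    uint64_t block = addr >> BLOCK_BITS;
    unsigned index = block % NUM_LINES;      /* direct-mapped for simplicity */
    if (c->lines[index].valid && c->lines[index].tag == block)
        c->lines[index].valid = false;       /* drop our stale copy */
}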

Implementing Invalidation

Write-through cache
- To locate a data item on a cache miss, just go to memory, since memory always
  holds the most up-to-date value in a write-through cache.

Write-back cache
- What problem do we have reading in data on a cache miss when all processors
  use write-back caches?

Implementing Invalidation

Write-back cache
- We may need to find the most recent value of a data item in some other
  processor's cache, not in memory.
- We can do this using the same snooping scheme used for writes:
  - Each processor snoops every address placed on the bus when a read from
    memory is requested.
  - If a processor has a dirty copy of the requested cache block (i.e., one it
    has written, so its copy is the most recent), it provides that cache block
    to the requestor and the memory access is aborted. A sketch of this snoop
    response follows this slide.
- Since write-back caches place lower demands on memory bandwidth, they are
  preferred in multiprocessors despite the added complexity.
9
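A minimal sketch of that dirty-block supply, with assumed structure and function
names (wb_line, snoop_read_miss) and an assumed 64-byte block size; a sketch of
the idea, not a definitive implementation.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 64

struct wb_line {
    bool     valid;
    bool     dirty;                  /* set when this cache has written the block */
    uint64_t tag;                    /* block address held in this line */
    uint8_t  data[BLOCK_SIZE];
};

/* Snoop response to a read miss on the bus. Returns true if this cache supplied
 * the block, in which case the memory access is aborted. */
bool snoop_read_miss(struct wb_line *line, uint64_t block, uint8_t *bus_data) {
    if (line->valid && line->dirty && line->tag == block) {
        memcpy(bus_data, line->data, BLOCK_SIZE);   /* provide the dirty block */
        line->dirty = false;                        /* copy on the bus is now current */
        return true;
    }
    return false;
}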

Invalidation with Write-Back Cache

Snooping
- Can use the normal cache valid and dirty bits to invalidate a block or to
  determine whether we have the most up-to-date copy.
- Add an extra state bit indicating whether a block is shared.
  - A write to a block in the shared state generates an invalidation on the bus
    and marks the block as private (exclusive).
  - Writes to a block in the private state generate no invalidations, since any
    other copies have already been invalidated.
  - A block moves back to the shared state when another processor has a read
    miss on it (tries to read the block from memory).

Example protocol
- Each cache uses a finite-state transition diagram to determine the proper
  state and action.
10

Write-Invalidate Write-Back Cache Coherence Protocol

[Figure: protocol state-transition diagrams. Normal font = stimulus, bold font =
action. Left diagram: actions by the CPU that owns the cache; right diagram:
actions arriving from the bus.]
11

Explanation of Previous Slide

- Left side: state transitions caused by actions of the CPU associated with the
  cache.
- Right side: state transitions caused by actions of other CPUs as seen on the
  bus.

Example (sketched in code after this slide)
- CPU 1 starts in Invalid; it places a read miss on the bus, reads block X, and
  moves to the Shared state.
- CPU 1 re-reads block X; these are read hits.
- CPU 2 reads block X: it places a read miss, reads the block, and moves to the
  Shared state.
- CPU 1 writes block X: it always places a write miss on the bus and moves to
  the Exclusive state.
- CPU 2, using the right-side diagram, sees the write miss and moves to the
  Invalid state.
- CPU 1 writes or reads block X and stays in the Exclusive state.
- CPU 2 reads block X and places a read miss on the bus.
- CPU 1, using the right-side diagram, sees the read miss, moves to the Shared
  state, and supplies the correct block to CPU 2.
- CPU 2 moves to the Shared state.
12
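A minimal sketch of those transitions in C, under my own simplifications (three
states, one block, no data movement shown); the enum and function names are
invented, not the course's.

/* Per-block states and stimuli for a write-invalidate, write-back snooping protocol. */
enum block_state { INVALID, SHARED, EXCLUSIVE };
enum cpu_op      { CPU_READ, CPU_WRITE };
enum bus_msg     { BUS_NONE, BUS_READ_MISS, BUS_WRITE_MISS };

/* Left side of the diagram: transition driven by this CPU's own access.
 * Returns the new state and reports what must be placed on the bus. */
enum block_state cpu_access(enum block_state s, enum cpu_op op, enum bus_msg *out) {
    *out = BUS_NONE;
    switch (s) {
    case INVALID:
        *out = (op == CPU_READ) ? BUS_READ_MISS : BUS_WRITE_MISS;
        return (op == CPU_READ) ? SHARED : EXCLUSIVE;
    case SHARED:
        if (op == CPU_WRITE) { *out = BUS_WRITE_MISS; return EXCLUSIVE; }
        return SHARED;                  /* read hit */
    case EXCLUSIVE:
        return EXCLUSIVE;               /* read or write hit, no bus traffic */
    }
    return s;
}

/* Right side of the diagram: transition driven by another CPU's bus message. */
enum block_state snoop(enum block_state s, enum bus_msg msg) {
    if (s == INVALID || msg == BUS_NONE) return s;
    if (msg == BUS_WRITE_MISS) return INVALID;   /* someone else will write */
    /* BUS_READ_MISS: an Exclusive owner supplies the block and both copies
     * become Shared; a Shared copy stays Shared. */
    return SHARED;
}

Calling cpu_access for CPU 1's write while CPU 2 holds the block in SHARED, then
feeding the resulting BUS_WRITE_MISS to snoop on CPU 2, reproduces the
invalidation step in the walkthrough above.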

Merged State Transition Diagram

- In practice, we'll have a single state-transition diagram with both types of
  stimulus merged together.
- It is functionally the same as the split diagram on the previous slide.
- The protocol is somewhat simplified compared with those in use today.
13

Performance of Snooping Coherence Protocols

Use the four parallel programs described earlier as a benchmark.
Split cache misses into two sets:
- Coherence misses: misses caused by cache invalidations.
- Capacity misses: actually capacity, compulsory, and conflict misses, but most
  are capacity misses; these are the normal cache misses seen on a uniprocessor.
14

Miss Rate vs. Processor Count

- The coherence miss rate goes up with processor count: more communication.
- The overall miss rate goes slightly down, because total cache capacity grows
  as we add processors.
- A high-communication application would fare badly.

15

Miss Rate vs. Cache Size

- Processor count fixed at 16.
- The miss rate drops as cache size increases, but by how much varies with the
  application.
- Other variations are possible (block size, set associativity, etc.); the
  behavior is similar to the uniprocessor case.
16

Distributed Shared Memory Architectures

A snooping protocol is not very efficient on most DSM machines.
- Snooping requires a broadcast mechanism, which is easy to provide on a bus.
- Most DSM systems don't have a bus but a more complex interconnect (LAN, mesh,
  hypercube, etc.), so broadcast becomes a much more expensive operation.

One solution: avoid the coherence problem by marking shared data as uncacheable.
- Private data can still be cached.
- Shared data must always be accessed through memory.
- Simple to implement, but it can slow things down if programs are not written
  with this model in mind, and access to remote memory can be quite slow.
17

DSM Coherency

Another solution: software-based coherence.
- Possible, but slow and conservative: every block that might be shared is
  treated as if it is shared.

Most popular alternative: a directory protocol.
- The directory keeps the state of every block that may be cached.
- The directory information includes which caches have copies of the block,
  whether it is dirty, and so on.
- A centralized version of snooping: the directory entry for a block is always
  in the same, known location.
- Memory requirements for the directory are proportional to the number of
  memory blocks times the number of processors (see the sizing sketch after
  this slide).
18
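A minimal sketch of a directory entry with one presence bit per processor,
assuming a 64-processor machine and invented names (dir_entry, sharers); it also
illustrates the blocks-times-processors storage cost mentioned above.

#include <stdint.h>

#define NUM_PROCS 64                 /* assumed machine size */

enum dir_state { DIR_UNCACHED, DIR_SHARED, DIR_EXCLUSIVE };

struct dir_entry {
    enum dir_state state;
    uint64_t       sharers;          /* bit i set => processor i has a copy;      */
                                     /* in DIR_EXCLUSIVE exactly one bit: the owner */
};

/* Sizing example: 64 presence bits per 64-byte (512-bit) block is 64/512 = 12.5%
 * overhead, i.e. on the order of 128 MiB of directory state per 1 GiB of memory. */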

Directory Protocol

Each directory must track the following states for its cache blocks:

Shared
- One or more processors have the block cached.
- The directory records which processors are sharing the block; this avoids a
  broadcast when the block must be invalidated, since messages can be sent to
  only those specific processors.

Uncached
- No processor has a copy of the cache block.

Exclusive
- Exactly one processor has a copy of the cache block and has written to it, so
  the memory copy is out of date.
- The processor holding the exclusive copy is called the owner of the block.

Very similar to the snooping protocol, but the directory keeps track of who has
what data.
19

Directory Protocol for DSM

- A directory is added to each node to implement cache coherence.
- Each directory tracks caching for the memory in its node.
20


Directory Protocol Terminology

Local node
- The node where a request originates.

Home node
- The node where the memory location and its directory entry reside.
- Could be the local node as well.

Remote node
- A node that has a copy of a cache block; the copy might be exclusive or shared.

Nodes pass messages to one another; messages move a directory entry between
states in a transition diagram, just as in the snooping protocol.
21

Directory Protocol Messages

[Figure: table of directory protocol messages, grouped as follows.]
- Messages 1-2: miss requests sent by the local cache to the home node.
- Messages 3-5: messages the home node sends to a remote cache when the home
  must act on the remote copy to satisfy a request.
- Message 6: the home node sends the requested data to the local cache.
- Message 7: a block is replaced and needs to be written back to the home node,
  or a fetch of it has been requested.
22
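The message table itself is not reproduced above, so the following enum is an
assumption: the usual message types of the textbook directory protocol, grouped
to match the numbering on this slide.

/* Directory protocol message types (assumed names). */
enum dir_msg {
    /* 1-2: local cache -> home node (miss requests) */
    MSG_READ_MISS,
    MSG_WRITE_MISS,
    /* 3-5: home node -> remote cache (to satisfy a request) */
    MSG_INVALIDATE,
    MSG_FETCH,              /* fetch the block, leave the remote copy shared */
    MSG_FETCH_INVALIDATE,   /* fetch the block and invalidate the remote copy */
    /* 6: home node -> local cache */
    MSG_DATA_VALUE_REPLY,
    /* 7: remote cache -> home node */
    MSG_DATA_WRITE_BACK
};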


Directory-Based State Transition Diagram

Same as the snooping-protocol diagram, except that we send explicit messages
instead of putting requests on a common bus.
23

Directory-Based Performance

Use the same parallel programs as for the snooping protocol as our benchmark.
The miss rate is broken into two categories:
- Local misses.
- Remote misses: much more expensive than local misses, with longer read
  latencies to traverse the interconnect, so we will want bigger caches to
  avoid them. Remote misses are mostly coherence misses.
24


Miss Rate vs. Num Processors

- As with snooping caches, the remote miss rate increases somewhat as the
  processor count increases.
- One application shows an anomaly in the plotted data.
25

Miss Rate vs. Cache Size

- Processor count fixed at 64.
- Miss rates decrease as the cache size grows, as you would expect.
- Where the curves plateau varies with the application.
26


Summary

- Coherence protocols may be needed for correct program behavior.
- The most common protocol is write invalidation with write-back caches.
- Either a snooping or a directory-based mechanism can be used to implement
  coherence.
- Coherence requests matter more in programs that are less optimized; optimized
  programs access most of their data locally and generate fewer coherence
  requests.
- Exactly how cache miss rates affect CPU performance depends on the memory
  system: interconnect, latency, bandwidth, and so on.

27
