Cache Coherence (Part 1)

The document discusses the cache coherence problem in multiprocessor systems. It introduces the problem that the notion of the "last write" to a memory location is not well-defined in parallel systems. It also discusses different approaches to solving cache coherence, including software-based solutions, hardware-based snoopy protocols, and directory-based solutions. It provides examples of snoopy cache coherence schemes and how they work by broadcasting coherence actions to all processors.


Cache Coherence: Part 1

CS 740
October 27, 2003

Topics
• The Cache Coherence Problem
• Snoopy Protocols

Slides from Todd Mowry/Culler&Singh

The Cache Coherence Problem

[Figure: processors P1, P2, and P3 with private caches ($) on a shared bus with memory and I/O devices. u is initially 5 in memory; P1 and P3 read u into their caches (events 1, 2), P3 then writes u = 7 into its cache (event 3), and subsequent reads of u by P1 and P2 (events 4, 5) may return stale values.]

A Coherent Memory System: Intuition

Reading a location should return latest value written (by any process)

Easy in uniprocessors
• Except for I/O: coherence between I/O devices and processors
• But infrequent so software solutions work
  – uncacheable operations, flush pages, pass I/O data through caches

Would like same to hold when processes run on different processors
• E.g. as if the processes were interleaved on a uniprocessor

The coherence problem is more pervasive and performance-critical in multiprocessors
• has a much larger impact on hardware design

Problems with the Intuition

Recall:
• Value returned by read should be last value written
But "last" is not well-defined!

Even in sequential case:
• "last" is defined in terms of program order, not time
  – Order of operations in the machine language presented to processor
  – "Subsequent" defined in analogous way, and well defined

In parallel case:
• program order defined within a process, but need to make sense of orders across processes

Must define a meaningful semantics
• the answer involves both "cache coherence" and an appropriate "memory consistency model" (to be discussed in a later lecture)


Formal Definition of Coherence

Results of a program: values returned by its read operations

A memory system is coherent if the results of any execution of a program are such that for each location, it is possible to construct a hypothetical serial order of all operations to the location that is consistent with the results of the execution and in which:

1. operations issued by any particular process occur in the order issued by that process, and
2. the value returned by a read is the value written by the last write to that location in the serial order

Two necessary features:
• Write propagation: value written must become visible to others
• Write serialization: writes to location seen in same order by all
  – if I see w1 after w2, you should not see w2 before w1
  – no need for analogous read serialization since reads not visible to others

Cache Coherence Solutions

Software Based:
• often used in clusters of workstations or PCs (e.g., "Treadmarks")
• extend virtual memory system to perform more work on page faults
  – send messages to remote machines if necessary

Hardware Based:
• two most common variations:
  – "snoopy" schemes
    » rely on broadcast to observe all coherence traffic
    » well suited for buses and small-scale systems
    » example: SGI Challenge
  – directory schemes
    » use centralized information to avoid broadcast
    » scale well to large numbers of processors
    » example: SGI Origin 2000

Shared Caches

• Processors share a single cache, essentially punting the problem.
• Useful for very small machines.
• E.g., DPC in the Encore, Alliant FX/8.
• Problems are limited cache bandwidth and cache interference
• Benefits are fine-grain sharing and prefetch effects

[Figure: processors connected through a crossbar to a 2-4 way interleaved shared cache, which connects to memory.]

Snoopy Cache Coherence Schemes

Basic Idea:
• all coherence-related activity is broadcast to all processors
  – e.g., on a global bus
• each processor (or its representative) monitors (aka "snoops") these actions and reacts to any which are relevant to the current contents of its cache
  – examples:
    » if another processor wishes to write to a line, you may need to "invalidate" (i.e. discard) the copy in your own cache
    » if another processor wishes to read a line for which you have a dirty copy, you may need to supply it

Most common approach in commercial multiprocessors. Examples:
• SGI Challenge, SUN Enterprise, multiprocessor PCs, etc.


Implementing a Snoopy Protocol

Cache controller now receives inputs from both sides:
• Requests from processor, bus requests/responses from snooper

In either case, takes zero or more actions
• Updates state, responds with data, generates new bus transactions

Protocol is a distributed algorithm: cooperating state machines
• Set of states, state transition diagram, actions

Granularity of coherence is typically a cache block
• Like that of allocation in cache and transfer to/from cache

Coherence with Write-through Caches

[Figure: processors P1…Pn with private caches on a shared bus with memory and I/O devices; a bus snoop in each cache controller observes every cache-memory transaction.]

• Key extensions to uniprocessor: snooping, invalidating/updating caches
  – no new states or bus transactions in this case
  – invalidation- versus update-based protocols
• Write propagation: even in inval case, later reads will see new value
  – inval causes miss on later access, and memory up-to-date via write-through

Write-through State Transition Diagram

[Diagram: two states per block, V (valid) and I (invalid).
 V: PrRd/—, PrWr/BusWr (stay in V)
 V -> I on BusWr/—
 I -> V on PrRd/BusRd
 I: PrWr/BusWr (stay in I)
 Solid arcs: processor-initiated transactions; dashed arcs: bus-snooper-initiated transactions.]

• Two states per block in each cache, as in uniprocessor
  – state of a block can be seen as p-vector
• Hardware state bits associated with only blocks that are in the cache
  – other blocks can be seen as being in invalid (not-present) state in that cache
• Write will invalidate all other caches (no local change of state)
  – can have multiple simultaneous readers of block, but write invalidates them
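As a minimal executable sketch (names and the printed "bus" are illustrative, not from the slides), here is the two-state write-through, write-no-allocate protocol above as a transition function for one block in one cache:

/* Two-state write-through, write-no-allocate protocol for one block
 * in one cache. Bus transactions are printed rather than performed. */
#include <stdio.h>

typedef enum { I, V } State;
typedef enum { PR_RD, PR_WR, BUS_WR_OBSERVED } Event;

static State step(State s, Event e) {
    switch (e) {
    case PR_RD:
        if (s == I) { printf("BusRd\n"); return V; } /* miss: fetch block */
        return V;                                    /* hit: no transaction */
    case PR_WR:
        printf("BusWr\n"); /* every write goes on the bus and to memory */
        return s;          /* no-allocate: I stays I, V stays V */
    case BUS_WR_OBSERVED:
        return I;          /* another cache wrote this block: invalidate */
    }
    return s;
}

int main(void) {
    State s = I;
    s = step(s, PR_RD);           /* BusRd, now V  */
    s = step(s, PR_WR);           /* BusWr, stay V */
    s = step(s, BUS_WR_OBSERVED); /* remote write: back to I */
    printf("final state: %s\n", s == V ? "V" : "I");
    return 0;
}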
Is WT Coherent?

Assume:
• A bus transaction completes before next one starts
• Atomic bus transactions
• Atomic memory transactions

Write propagation? Write serialization?
• Key is the bus: writes serialized by bus

Problem with Write-Through

High bandwidth requirements
• Every write from every processor goes to shared bus and memory
• Consider a 500MHz, 1 CPI processor, where 15% of instructions are 8-byte stores
• Each processor generates 75M stores or 600MB of data per second
• 1GB/s bus can support only 1 processor without saturating
• Write-through especially unpopular for SMPs

Write-back caches absorb most writes as cache hits
• Write hits don't go on bus
• But now how do we ensure write propagation and serialization?
• Need more sophisticated protocols: large design space
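The arithmetic behind the example: at 500 MHz and 1 CPI the processor executes 500M instructions/s; 15% of those are stores, giving 0.15 × 500M = 75M stores/s; at 8 bytes each that is 75M × 8 B = 600 MB/s of store traffic per processor, so a 1 GB/s bus saturates before even a second processor is added.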

Write-Back Snoopy Protocols

No need to change processor, main memory, cache …
• Extend cache controller and exploit bus (provides serialization)

Dirty state now also indicates exclusive ownership
• Exclusive: only cache with a valid copy (main memory may be too)
• Owner: responsible for supplying block upon a request for it

Design space
• Invalidation versus Update-based protocols
• Set of states

Pause: Is Coherence enough?

        P1                              P2
/* Assume initial value of A and flag is 0 */
A = 1;                        while (flag == 0); /* spin idly */
flag = 1;                     print A;

• Coherence talks about 1 memory location


Pause: Is Coherence enough?

        P1                              P2
/* Assume initial value of A and flag is 0 */
A = 1;                        while (flag == 0); /* spin idly */
flag = 1;                     print A;

• Coherence talks about 1 memory location
• Consistency talks about different locations

Is there an interleaving of the partial orders of each processor that yields a total order that obeys program order?

Sequential Consistency

[Figure: processors P1, P2, …, Pn issuing memory references in program order to a single shared memory; a conceptual "switch" is randomly set after each memory reference.]

• (as if there were no caches, and a single memory)
• Total order achieved by interleaving accesses from different processes
• Maintains program order, and memory operations, from all processes, appear to [issue, execute, complete] atomically w.r.t. others
• Programmer's intuition is maintained
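As a runnable aside (a sketch, not from the slides; assumes a C11 compiler with <threads.h>), here is the flag example with sequentially consistent atomics. The seq_cst operations mimic the SC behavior the lecture assumes; with plain non-atomic variables, neither coherence alone nor the C memory model would guarantee that P2 prints 1:

/* The flag example using C11 threads and seq_cst atomics. */
#include <stdatomic.h>
#include <stdio.h>
#include <threads.h>

atomic_int A = 0, flag = 0;

int p1(void *arg) {
    (void)arg;
    atomic_store(&A, 1);    /* A = 1;    */
    atomic_store(&flag, 1); /* flag = 1; */
    return 0;
}

int p2(void *arg) {
    (void)arg;
    while (atomic_load(&flag) == 0) /* spin idly */
        ;
    printf("A = %d\n", atomic_load(&A)); /* prints 1 under SC */
    return 0;
}

int main(void) {
    thrd_t t1, t2;
    thrd_create(&t1, p1, NULL);
    thrd_create(&t2, p2, NULL);
    thrd_join(t1, NULL);
    thrd_join(t2, NULL);
    return 0;
}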

What Really is Program Order?

Intuitively, order in which operations appear in source code
• Straightforward translation of source code to assembly
• At most one memory operation per instruction

But not the same as order presented to hardware by compiler

So which is program order?

Depends on which layer, and who's doing the reasoning

We assume order as seen by programmer

SC Example

What matters is: the order in which it appears to execute, not the order in which it executes

        P1                              P2
/* Assume initial values of A and B are 0 */
(1a) A = 1;                   (2a) print B;
(1b) B = 2;                   (2b) print A;

• possible outcomes for (A,B): (0,0), (1,0), (1,2); impossible under SC: (0,2)
• we know 1a->1b and 2a->2b by program order
• A = 0 implies 2b->1a, which implies 2a->1b
• B = 2 implies 1b->2a, which leads to a contradiction
• BUT, actual execution 1b->1a->2b->2a is SC.
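A small brute-force check of this claim (a sketch, not from the slides): enumerate every interleaving of {1a,1b} and {2a,2b} that respects each process's program order, and collect the printed (A,B) pairs; (0,2) never appears.

/* Enumerate all interleavings of P1 = {1a: A=1; 1b: B=2} and
 * P2 = {2a: print B; 2b: print A} that respect program order. */
#include <stdio.h>

int main(void) {
    int saw_0_2 = 0;
    /* P1's two ops (1a then 1b) occupy slots i < j of 4 total slots;
     * P2's ops (2a then 2b) fill the remaining slots in order. */
    for (int i = 0; i < 4; i++) {
        for (int j = i + 1; j < 4; j++) {
            int A = 0, B = 0, outA = 0, outB = 0, p2_done = 0;
            for (int slot = 0; slot < 4; slot++) {
                if (slot == i)      A = 1;                   /* 1a: A = 1   */
                else if (slot == j) B = 2;                   /* 1b: B = 2   */
                else if (!p2_done) { outB = B; p2_done = 1; } /* 2a: print B */
                else                outA = A;                /* 2b: print A */
            }
            printf("1a@%d 1b@%d -> (A,B) = (%d,%d)\n", i, j, outA, outB);
            if (outA == 0 && outB == 2) saw_0_2 = 1;
        }
    }
    printf("(0,2) %s under SC\n", saw_0_2 ? "occurred (bug!)" : "never occurs");
    return 0;
}

Running it prints exactly the three outcomes the slide lists: (1,2), (1,0), and (0,0).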
Implementing SC

Two kinds of requirements
• Program order
  – memory operations issued by a process must appear to become visible (to others and itself) in program order
• Atomicity
  – in the overall total order, one memory operation should appear to complete with respect to all processes before the next one is issued
  – needed to guarantee that total order is consistent across processes
  – tricky part is making writes atomic

Write Atomicity

Write Atomicity: Position in total order at which a write appears to perform should be the same for all processes
• Nothing a process does after it has seen the new value produced by a write W should be visible to other processes until they too have seen W
• In effect, extends write serialization to writes from multiple processes

        P1                    P2                       P3
        A=1;                  while (A==0);
                              B=1;                     while (B==0);
                                                       print A;

• Transitivity implies A should print as 1 under SC
• Problem if P2 leaves loop, writes B, and P3 sees new B but old A (from its cache, say)

Sufficient Conditions for SC

1. Every process issues memory ops in program order
2. After a write op is issued, the issuing process waits for the write to complete before issuing its next op
3. After a read op is issued, the issuing process waits for the read to complete, and for the write whose value is being returned by the read to complete, before issuing its next operation (provides write atomicity)

Issues:
• Compilers (loop transforms, register allocation)
• Hardware (write buffers, OO-execution)
• Reason: uniprocessors care only about dependences to same location (i.e., above conditions VERY restrictive)

SC in Write-through Example

Provides SC, not just coherence

Extend arguments used for coherence
• Writes and read misses to all locations serialized by bus into bus order
• If read obtains value of write W, W guaranteed to have completed
  – since it caused a bus transaction
• When write W is performed w.r.t. any processor, all previous writes in bus order have completed
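To see why write buffers are on the "issues" list, consider the classic store-buffer litmus test below (a sketch, not from the slides; assumes C11 with <threads.h>). Under SC at least one thread must read 1, since whichever store comes first in the total order is visible to the other thread's load. A processor with a store buffer can let each load bypass the other thread's still-buffered store, so both may read 0; relaxed atomics are used here to legally expose that hardware behavior.

/* Store-buffer litmus test: (r1,r2) = (0,0) is impossible under SC
 * but may be observed on hardware with write buffers. */
#include <stdatomic.h>
#include <stdio.h>
#include <threads.h>

atomic_int x, y;
int r1, r2;

int t1(void *arg) {
    (void)arg;
    atomic_store_explicit(&x, 1, memory_order_relaxed); /* write x */
    r1 = atomic_load_explicit(&y, memory_order_relaxed); /* then read y */
    return 0;
}

int t2(void *arg) {
    (void)arg;
    atomic_store_explicit(&y, 1, memory_order_relaxed); /* write y */
    r2 = atomic_load_explicit(&x, memory_order_relaxed); /* then read x */
    return 0;
}

int main(void) {
    for (int i = 0; i < 100000; i++) {
        atomic_store(&x, 0);
        atomic_store(&y, 0);
        thrd_t a, b;
        thrd_create(&a, t1, NULL);
        thrd_create(&b, t2, NULL);
        thrd_join(a, NULL);
        thrd_join(b, NULL);
        if (r1 == 0 && r2 == 0) {
            printf("non-SC outcome (0,0) observed on iteration %d\n", i);
            return 0;
        }
    }
    printf("no (0,0) outcome observed\n");
    return 0;
}

Whether the (0,0) outcome actually shows up depends on the machine and timing; the point is that SC forbids it while real memory systems do not.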
Write-Back Snoopy Protocols

No need to change processor, main memory, cache …
• Extend cache controller and exploit bus (provides serialization)

Dirty state now also indicates exclusive ownership
• Exclusive: only cache with a valid copy (main memory may be too)
• Owner: responsible for supplying block upon a request for it

Design space
• Invalidation versus Update-based protocols
• Set of states

Invalidation-based Protocols

"Exclusive" state means can modify without notifying anyone else
• i.e. without bus transaction
• Must first get block in exclusive state before writing into it
• Even if already in valid state, need transaction, so called a write miss

Store to non-dirty data generates a read-exclusive bus transaction
• Tells others about impending write, obtains exclusive ownership
  – makes the write visible, i.e. write is performed
  – may be actually observed (by a read miss) only later
  – write hit made visible (performed) when block updated in writer's cache
• Only one RdX can succeed at a time for a block: serialized by bus

Read and Read-exclusive bus transactions drive coherence actions
• Writeback transactions also, but not caused by memory operation and quite incidental to coherence protocol
  – note: replaced block that is not in modified state can be dropped

Update-based Protocols

A write operation updates values in other caches
• New, update bus transaction

Advantages
• Other processors don't miss on next access: reduced latency
  – In invalidation protocols, they would miss and cause more transactions
• Single bus transaction to update several caches can save bandwidth
  – Also, only the word written is transferred, not whole block

Disadvantages
• Multiple writes by same processor cause multiple update transactions
  – In invalidation, first write gets exclusive ownership, others local

Detailed tradeoffs more complex

Invalidate versus Update

Basic question of program behavior
• Is a block written by one processor read by others before it is rewritten?

Invalidation:
• Yes => readers will take a miss
• No => multiple writes without additional traffic
  – and clears out copies that won't be used again

Update:
• Yes => readers will not miss if they had a copy previously
  – single bus transaction to update all copies
• No => multiple useless updates, even to dead copies

Need to look at program behavior and hardware complexity

Invalidation protocols much more popular
• Some systems provide both, or even hybrid (see the competitive sketch below)
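One way such a hybrid can work (a hedged sketch of "competitive" snooping in the spirit of the slide, not any specific machine's mechanism; the threshold K is an illustrative parameter): each cached copy carries a small counter that remote updates decrement and local accesses reset; when it hits zero the copy self-invalidates, bounding the useless-update traffic to a dead copy.

/* Competitive update/invalidate hybrid: tolerate at most K consecutive
 * remote updates to a line without a local access before dropping it. */
#include <stdbool.h>
#include <stdio.h>

#define K 4 /* updates tolerated before self-invalidation */

typedef struct {
    bool valid;
    int  countdown; /* remaining remote updates before we drop the line */
} Line;

static void local_access(Line *l) {
    if (l->valid)
        l->countdown = K; /* line is live here: keep absorbing updates */
}

static void remote_update(Line *l) {
    if (!l->valid)
        return;
    if (--l->countdown == 0) {
        l->valid = false; /* looks dead locally: stop absorbing updates */
        printf("self-invalidated after %d unread updates\n", K);
    }
}

int main(void) {
    Line l = { true, K };
    local_access(&l);
    for (int i = 0; i < 6; i++)
        remote_update(&l); /* producer keeps writing, we never read */
    printf("valid = %d\n", l.valid);
    return 0;
}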
Basic MSI Writeback Inval Protocol

States
• Invalid (I)
• Shared (S): one or more
• Dirty or Modified (M): one only

Processor Events:
• PrRd (read)
• PrWr (write)

Bus Transactions
• BusRd: asks for copy with no intent to modify
• BusRdX: asks for copy with intent to modify
• BusWB: updates memory

Actions
• Update state, perform bus transaction, flush value onto bus

State Transition Diagram

[Diagram: three states M, S, I.
 M: PrRd/—, PrWr/— (stay in M)
 M -> S on BusRd/Flush
 M -> I on BusRdX/Flush
 S: PrRd/—, BusRd/— (stay in S)
 S -> M on PrWr/BusRdX
 S -> I on BusRdX/—
 I -> S on PrRd/BusRd
 I -> M on PrWr/BusRdX]

• Write to shared block:
  – Already have latest data; can use upgrade (BusUpgr) instead of BusRdX
• Replacement changes state of two blocks: outgoing and incoming
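The diagram above as a transition function (a minimal sketch for one block in one cache; bus transactions and the Flush are printed rather than performed):

/* The MSI diagram for one block in one cache. */
#include <stdio.h>

typedef enum { I, S, M } State;
typedef enum { PR_RD, PR_WR, BUS_RD, BUS_RDX } Event;

static State msi(State s, Event e) {
    switch (e) {
    case PR_RD:
        if (s == I) { printf("BusRd\n"); return S; }
        return s;                       /* hit in S or M: no transaction */
    case PR_WR:
        if (s != M) printf("BusRdX\n"); /* (or BusUpgr if already in S) */
        return M;
    case BUS_RD:                        /* another cache reads the block */
        if (s == M) { printf("Flush\n"); return S; }
        return s;
    case BUS_RDX:                       /* another cache wants to write */
        if (s == M) printf("Flush\n");
        return I;
    }
    return s;
}

int main(void) {
    State s = I;
    s = msi(s, PR_RD);   /* I -> S via BusRd    */
    s = msi(s, PR_WR);   /* S -> M via BusRdX   */
    s = msi(s, BUS_RD);  /* M -> S, flush dirty */
    s = msi(s, BUS_RDX); /* S -> I              */
    printf("final: %d (0=I, 1=S, 2=M)\n", s);
    return 0;
}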

State Transition Diagram

[Same MSI diagram as on the previous slide, repeated for reference.]

Satisfying Coherence

Write propagation is clear

Write serialization?
• All writes that appear on the bus (BusRdX) ordered by the bus
  – Write performed in writer's cache before it handles other transactions, so ordered in same way even w.r.t. writer
• Reads that appear on the bus ordered wrt these
• Writes that don't appear on the bus:
  – sequence of such writes between two bus xactions for the block must come from same processor, say P
  – in serialization, the sequence appears between these two bus xactions
  – reads by P will see them in this order w.r.t. other bus transactions
  – reads by other processors separated from sequence by a bus xaction, which places them in the serialized order w.r.t the writes
  – so reads by all processors see writes in same order
Satisfying Sequential Consistency

1. Appeal to definition:
• Bus imposes total order on bus xactions for all locations
• Between xactions, procs perform reads/writes locally in program order
• So any execution defines a natural partial order
• In segment between two bus transactions, any interleaving of ops from different processors leads to consistent total order
• In such a segment, writes observed by processor P serialized as follows:
  – Writes from other processors by the previous bus xaction P issued
  – Writes from P by program order

2. Show sufficient conditions are satisfied:
• Write completion: can detect when write appears on bus
• Write atomicity: if a read returns the value of a write, that write has already become visible to all others (can reason different cases)

Lower-level Protocol Choices

BusRd observed in M state: what transition to make?

Depends on expectations of access patterns
• S: assumption that I'll read again soon, rather than other will write
  – good for mostly read data
  – what about "migratory" data?
    » I read and write, then you read and write, then X reads and writes...
    » better to go to I state, so I don't have to be invalidated on your write
• Synapse transitioned to I state
• Sequent Symmetry and MIT Alewife use adaptive protocols

Choices can affect performance of memory system

MESI (4-state) Invalidation Protocol

Problem with MSI protocol
• Reading and modifying data is 2 bus transactions, even if no sharing
  – e.g. even in sequential program
  – BusRd (I->S) followed by BusRdX or BusUpgr (S->M)

Add exclusive state: write locally without xaction, but not modified
• Main memory is up to date, so cache not necessarily owner
• States
  – invalid
  – exclusive or exclusive-clean (only this cache has copy, but not modified)
  – shared (two or more caches may have copies)
  – modified (dirty)
• I -> E on PrRd if no other processor has a copy
  – needs "shared" signal on bus: wired-or line asserted in response to BusRd

MESI State Transition Diagram

[Diagram: four states M, E, S, I.
 M: PrRd/—, PrWr/— (stay in M); M -> S on BusRd/Flush; M -> I on BusRdX/Flush
 E: PrRd/— (stay in E); E -> M on PrWr/—; E -> S on BusRd/Flush'; E -> I on BusRdX/Flush'
 S: PrRd/—, BusRd/Flush' (stay in S); S -> M on PrWr/BusRdX; S -> I on BusRdX/Flush'
 I -> E on PrRd/BusRd(S̄) (shared line not asserted); I -> S on PrRd/BusRd(S) (shared line asserted); I -> M on PrWr/BusRdX]

• BusRd(S) means shared line asserted on BusRd transaction
• Flush': if cache-to-cache sharing (see next), only one cache flushes data
• MOESI protocol: Owned state: exclusive but memory not valid
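Extending the earlier MSI sketch with the E state (hypothetical names; the shared-line input is modeled as a boolean, and the point is the I->E read followed by the silent E->M write):

/* MESI sketch for one block: a block read in private (shared line not
 * asserted) can later be written with no bus transaction at all. */
#include <stdbool.h>
#include <stdio.h>

typedef enum { I, S, E, M } State;

static State pr_rd(State s, bool shared_line) {
    if (s == I) {
        printf("BusRd\n");
        return shared_line ? S : E; /* I->S if others have it, else I->E */
    }
    return s;                       /* hit: no transaction */
}

static State pr_wr(State s) {
    if (s == E) return M;           /* silent upgrade: no bus transaction */
    if (s != M) printf("BusRdX\n"); /* from I or S */
    return M;
}

static State snoop_bus_rd(State s) {
    if (s == M) printf("Flush\n");  /* supply dirty data */
    return (s == I) ? I : S;        /* M, E, S all drop to S */
}

static State snoop_bus_rdx(State s) {
    if (s == M) printf("Flush\n");
    return I;
}

int main(void) {
    State s = I;
    s = pr_rd(s, false);  /* BusRd, no sharers: I -> E          */
    s = pr_wr(s);         /* E -> M with no bus transaction     */
    s = snoop_bus_rd(s);  /* M -> S, Flush                      */
    s = snoop_bus_rdx(s); /* S -> I                             */
    printf("final: %d (0=I, 1=S, 2=E, 3=M)\n", s);
    return 0;
}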
Lower-level Protocol Choices

Who supplies data on miss when not in M state: memory or cache?

Original, Illinois MESI: cache, since assumed faster than memory
• Cache-to-cache sharing

Not true in modern systems
• Intervening in another cache more expensive than getting from memory

Cache-to-cache sharing also adds complexity
• How does memory know it should supply data (must wait for caches)
• Selection algorithm if multiple caches have valid data

But valuable for cache-coherent machines with distributed memory
• May be cheaper to obtain from nearby cache than distant memory
• Especially when constructed out of SMP nodes (Stanford DASH)

Dragon Write-back Update Protocol

4 states
• Exclusive-clean or exclusive (E): I and memory have it
• Shared clean (Sc): I, others, and maybe memory, but I'm not owner
• Shared modified (Sm): I and others but not memory, and I'm the owner
  – Sm and Sc can coexist in different caches, with only one Sm
• Modified or dirty (D): I and nobody else

No invalid state
• If in cache, cannot be invalid
• If not present in cache, can view as being in not-present or invalid state

New processor events: PrRdMiss, PrWrMiss
• Introduced to specify actions when block not present in cache

New bus transaction: BusUpd
• Broadcasts single word written on bus; updates other relevant caches
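A sketch of the Dragon write side for one block in one cache (hypothetical names; the shared line, modeled as a boolean, tells the writer whether other copies exist and hence whether a BusUpd is needed):

/* Dragon update protocol, write side. Bus actions are printed. */
#include <stdbool.h>
#include <stdio.h>

typedef enum { E, Sc, Sm, M } State; /* no invalid state in Dragon */

static State pr_wr(State s, bool shared) {
    switch (s) {
    case E:  return M;          /* exclusive-clean: silent upgrade */
    case M:  return M;          /* already dirty and alone */
    case Sc:
    case Sm:
        printf("BusUpd\n");     /* broadcast the written word */
        return shared ? Sm : M; /* last copy? then own it outright */
    }
    return s;
}

static State pr_wr_miss(bool shared) {
    printf("BusRd");            /* fetch the block */
    if (shared) printf("; BusUpd"); /* then update the other copies */
    printf("\n");
    return shared ? Sm : M;
}

int main(void) {
    State s = pr_wr_miss(true); /* others have it: BusRd; BusUpd -> Sm */
    s = pr_wr(s, true);         /* still shared: BusUpd, stay Sm       */
    s = pr_wr(s, false);        /* sharers gone: BusUpd, move to M     */
    printf("final: %d (0=E, 1=Sc, 2=Sm, 3=M)\n", s);
    return 0;
}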

Dragon State Transition Diagram

[Diagram: four states E, Sc, Sm, M; no invalid state.
 From not-present: PrRdMiss/BusRd(S̄) -> E; PrRdMiss/BusRd(S) -> Sc; PrWrMiss/BusRd(S̄) -> M; PrWrMiss/(BusRd(S); BusUpd) -> Sm
 E: PrRd/— (stay); E -> Sc on BusRd/—; E -> M on PrWr/—
 Sc: PrRd/—, BusRd/—, BusUpd/Update (stay); Sc -> Sm on PrWr/BusUpd(S); Sc -> M on PrWr/BusUpd(S̄)
 Sm: PrRd/—, BusRd/Flush, PrWr/BusUpd(S) (stay); Sm -> Sc on BusUpd/Update; Sm -> M on PrWr/BusUpd(S̄)
 M: PrRd/—, PrWr/— (stay); M -> Sm on BusRd/Flush]


Lower-level Protocol Choices

Can shared-modified state be eliminated?
• If update memory as well on BusUpd transactions (DEC Firefly)
• Dragon protocol doesn't (assumes DRAM memory slow to update)

Should replacement of an Sc block be broadcast?
• Would allow last copy to go to E state and not generate updates
• Replacement bus xaction is not in critical path, later update may be

Shouldn't update local copy on write hit before controller gets bus
• Can mess up serialization

Coherence, consistency considerations much like write-through case

In general, many subtle race conditions in protocols

Assessing Protocol Tradeoffs

Tradeoffs affected by performance and organization characteristics

Part art and part science
• Art: experience, intuition and aesthetics of designers
• Science: workload-driven evaluation for cost-performance
  – want a balanced system: no expensive resource heavily underutilized

Methodology:
• Use simulator; choose parameters per earlier methodology (default 1MB, 4-way cache, 64-byte block, 16 processors; 64K cache for some)
• Focus on frequencies, not end performance for now
  – transcends architectural details, but not what we're really after
• Use idealized memory performance model to avoid changes of reference interleaving across processors with machine parameters
  – Cheap simulation: no need to model contention

But first, let's illustrate quantitative assessment at logical level

Impact of Protocol Optimizations

(Computing traffic from state transitions discussed in book)
Effect of E state, and of BusUpgr instead of BusRdX

[Figure: bar charts of address-bus and data-bus traffic (MB/s) for the Illinois MESI (Ill), three-state (3St), and three-state read-exclusive (3St-RdEx) protocol variants across the workloads Barnes, LU, Ocean, Radiosity, Radix, Raytrace, Appl-Code, Appl-Data, OS-Code, and OS-Data.]

• MSI versus MESI doesn't seem to matter for bw for these workloads
• Upgrades instead of read-exclusive helps
• Same story when working sets don't fit for Ocean, Radix, Raytrace

Update versus Invalidate

Much debate over the years: tradeoff depends on sharing patterns

Intuition:
• If those that used continue to use, and writes between use are few, which should do better?
  – e.g. producer-consumer pattern
• If those that use unlikely to use again, or many writes between reads, which should do better?
  – "pack rat" phenomenon particularly bad under process migration
  – useless updates where only last one will be used

Can construct scenarios where one or other is much better

Can combine them in hybrid schemes!
• E.g. competitive: observe patterns at runtime and change protocol

Let's look at real workloads
Update vs Invalidate: Miss Rates

[Figure: miss rate (%) broken into cold, capacity, true-sharing, and false-sharing components for LU, Ocean, Raytrace, and Radix under invalidation (inv), update (upd), and mixed (mix) protocols.]

• Lots of coherence misses: updates help
• Lots of capacity misses: updates hurt (keep data in cache uselessly)
• Updates seem to help, but this ignores upgrade and update traffic

Upgrade and Update Rates (Traffic)

[Figure: upgrade/update rate (%) for the same workloads and protocol variants.]

• Update traffic is substantial
• Main cause is multiple writes by a processor before a read by other
  – many bus transactions versus one in invalidation case
  – could delay updates or use merging
• Overall, trend is away from update-based protocols as default
  – bandwidth, complexity, large-blocks trend, pack rat for process migration
• Will see later that updates have greater problems for scalable systems

Impact of Cache Block Size

Multiprocessors add new kind of miss to cold, capacity, conflict
• Coherence misses: true sharing and false sharing
  – latter due to granularity of coherence being larger than a word
• Both miss rate and traffic matter

Reducing misses architecturally in invalidation protocol
• Capacity: enlarge cache; increase block size (if spatial locality)
• Conflict: increase associativity
• Cold and Coherence: only block size

Increasing block size has advantages and disadvantages
• Can reduce misses if spatial locality is good
• Can hurt too
  – increase misses due to false sharing if spatial locality not good
  – increase misses due to conflicts in fixed-size cache
  – increase traffic due to fetching unnecessary data and due to false sharing
  – can increase miss penalty and perhaps hit cost

Impact of Block Size on Miss Rate

Results shown only for default problem size: varied behavior
• Need to examine impact of problem size and p as well

[Figure: miss rate (%) versus block size (8 to 256 bytes) for Barnes, LU, Ocean, Radiosity, Radix, and Raytrace, broken into cold, capacity, true-sharing, false-sharing, and upgrade components.]

• Working set doesn't fit: impact on capacity misses much more critical
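False sharing is easy to demonstrate (a sketch, not from the slides; assumes C11 with <threads.h>, and the 64-byte block size is an assumption): two threads each increment their own counter, but if the counters sit in the same cache block, every write invalidates the other cache's copy and the block ping-pongs between caches.

/* False-sharing demo: compare run time with PAD = 1 (counters share a
 * block) versus PAD = 64 (counters in separate blocks, on typical
 * 64-byte-line machines). The effect is machine-dependent. */
#include <stdio.h>
#include <threads.h>

#define PAD 64 /* set to 1 to force the counters into one block */
#define ITERS 100000000L

struct { volatile long v; char pad[PAD]; } counters[2];

int worker(void *arg) {
    int id = *(int *)arg;
    for (long i = 0; i < ITERS; i++)
        counters[id].v++; /* write hits stay local only if no sharing */
    return 0;
}

int main(void) {
    thrd_t t[2];
    int ids[2] = { 0, 1 };
    thrd_create(&t[0], worker, &ids[0]);
    thrd_create(&t[1], worker, &ids[1]);
    thrd_join(t[0], NULL);
    thrd_join(t[1], NULL);
    printf("%ld %ld\n", counters[0].v, counters[1].v);
    return 0;
}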
Impact of Block Size on Traffic

Traffic affects performance indirectly through contention

[Figure: address-bus and data-bus traffic (bytes/instruction; bytes/FLOP for LU and Ocean) versus block size (8 to 256 bytes) for Barnes, LU, Radiosity, Radix, Raytrace, and Ocean.]

• Results different than for miss rate: traffic almost always increases
• When working sets fit, overall traffic still small, except for Radix
• Fixed overhead is significant component
  – So total traffic often minimized at 16-32 byte block, not smaller
• Working set doesn't fit: even 128-byte good for Ocean due to capacity

Making Large Blocks More Effective

Software
• Improve spatial locality by better data structuring
• Compiler techniques

Hardware
• Retain granularity of transfer but reduce granularity of coherence
  – use subblocks: same tag but different state bits
  – one subblock may be valid but another invalid or dirty
• Reduce both granularities, but prefetch more blocks on a miss
• Proposals for adjustable cache size
• More subtle: delay propagation of invalidations and perform all at once
  – But can change consistency model: discuss later in course
• Use update instead of invalidate protocols to reduce false sharing effect
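The subblock idea in miniature (a hedged sketch; block and subblock sizes are illustrative): one tag covers the whole transfer block, while coherence state is tracked per subblock, so a remote write to one word need not invalidate its neighbors.

/* Subblocked cache line: one tag per 64-byte transfer block, but a
 * separate MSI state per 16-byte subblock (sizes are illustrative). */
#include <stdio.h>

typedef enum { I, S, M } SubState;

#define SUBBLOCKS 4 /* 64-byte block / 16-byte subblocks */

typedef struct {
    unsigned long tag;
    SubState state[SUBBLOCKS]; /* coherence granularity < transfer granularity */
} Line;

/* A snooped BusRdX for one word invalidates only its subblock. */
static void snoop_bus_rdx(Line *l, int offset_bytes) {
    int sb = (offset_bytes / 16) % SUBBLOCKS;
    l->state[sb] = I; /* neighbors keep their state: less false sharing */
}

int main(void) {
    Line l = { 0x40, { S, S, S, S } };
    snoop_bus_rdx(&l, 20); /* remote write to bytes 16..31 */
    for (int i = 0; i < SUBBLOCKS; i++)
        printf("subblock %d: %s\n", i, l.state[i] == I ? "I" : "S");
    return 0;
}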
