Cache Coherence (Part 1)

The document discusses the cache coherence problem in multiprocessor systems. It introduces the problem that the notion of the "last write" to a memory location is not well-defined in parallel systems. It also discusses different approaches to solving cache coherence, including software-based solutions, hardware-based snoopy protocols, and directory-based solutions. It provides examples of snoopy cache coherence schemes and how they work by broadcasting coherence actions to all processors.


Cache Coherence: Part 1

CS 740
October 27, 2003

Topics
• The Cache Coherence Problem
• Snoopy Protocols

Slides from Todd Mowry/Culler&Singh

The Cache Coherence Problem

[Figure: processors P1, P2, and P3 with private caches ($) on a shared bus with memory and I/O devices. u is initially 5 in memory; P1 and P3 read u into their caches (events 1, 2), P3 then writes u = 7 into its cache (event 3), and subsequent reads of u by P1 and P2 (events 4, 5) may return stale values.]

A Coherent Memory System: Intuition

Reading a location should return latest value written (by any process)

Easy in uniprocessors
• Except for I/O: coherence between I/O devices and processors
• But infrequent so software solutions work
  – uncacheable operations, flush pages, pass I/O data through caches

Would like same to hold when processes run on different processors
• E.g. as if the processes were interleaved on a uniprocessor

The coherence problem is more pervasive and performance-critical in multiprocessors
• has a much larger impact on hardware design

Problems with the Intuition

Recall:
• Value returned by read should be last value written
But "last" is not well-defined!

Even in sequential case:
• "last" is defined in terms of program order, not time
  – Order of operations in the machine language presented to processor
  – "Subsequent" defined in analogous way, and well defined

In parallel case:
• program order defined within a process, but need to make sense of orders across processes

Must define a meaningful semantics
• the answer involves both "cache coherence" and an appropriate "memory consistency model" (to be discussed in a later lecture)


Formal Definition of Coherence

Results of a program: values returned by its read operations

A memory system is coherent if the results of any execution of a program are such that for each location, it is possible to construct a hypothetical serial order of all operations to the location that is consistent with the results of the execution and in which:

1. operations issued by any particular process occur in the order issued by that process, and
2. the value returned by a read is the value written by the last write to that location in the serial order

Two necessary features:
• Write propagation: value written must become visible to others
• Write serialization: writes to location seen in same order by all
  – if I see w1 after w2, you should not see w2 before w1
  – no need for analogous read serialization since reads not visible to others

Cache Coherence Solutions

Software Based:
• often used in clusters of workstations or PCs (e.g., "Treadmarks")
• extend virtual memory system to perform more work on page faults
  – send messages to remote machines if necessary

Hardware Based:
• two most common variations:
  – "snoopy" schemes
    » rely on broadcast to observe all coherence traffic
    » well suited for buses and small-scale systems
    » example: SGI Challenge
  – directory schemes
    » use centralized information to avoid broadcast
    » scale well to large numbers of processors
    » example: SGI Origin 2000

Shared Caches

• Processors share a single cache, essentially punting the problem.
• Useful for very small machines.
• E.g., DPC in the Encore, Alliant FX/8.
• Problems are limited cache bandwidth and cache interference
• Benefits are fine-grain sharing and prefetch effects

[Figure: processors connected through a crossbar to a 2-4 way interleaved shared cache, which connects to memory.]

Snoopy Cache Coherence Schemes

Basic Idea:
• all coherence-related activity is broadcast to all processors
  – e.g., on a global bus
• each processor (or its representative) monitors (aka "snoops") these actions and reacts to any which are relevant to the current contents of its cache
  – examples:
    » if another processor wishes to write to a line, you may need to "invalidate" (i.e. discard) the copy in your own cache
    » if another processor wishes to read a line for which you have a dirty copy, you may need to supply it

Most common approach in commercial multiprocessors. Examples:
• SGI Challenge, SUN Enterprise, multiprocessor PCs, etc.


Implementing a Snoopy Protocol

Cache controller now receives inputs from both sides:
• Requests from processor, bus requests/responses from snooper

In either case, takes zero or more actions
• Updates state, responds with data, generates new bus transactions

Protocol is a distributed algorithm: cooperating state machines
• Set of states, state transition diagram, actions

Granularity of coherence is typically a cache block
• Like that of allocation in cache and transfer to/from cache

Coherence with Write-through Caches

[Figure: processors P1…Pn with private caches on a shared bus with memory and I/O devices; a bus snoop in each cache controller observes every cache-memory transaction.]

• Key extensions to uniprocessor: snooping, invalidating/updating caches
  – no new states or bus transactions in this case
  – invalidation- versus update-based protocols
• Write propagation: even in inval case, later reads will see new value
  – inval causes miss on later access, and memory up-to-date via write-through

Write-through State Transition Diagram

[Diagram: two states per block, V (valid) and I (invalid).
 V: PrRd/—, PrWr/BusWr (stay in V)
 V -> I on BusWr/—
 I -> V on PrRd/BusRd
 I: PrWr/BusWr (stay in I)
 Solid arcs: processor-initiated transactions; dashed arcs: bus-snooper-initiated transactions.]

• Two states per block in each cache, as in uniprocessor
  – state of a block can be seen as p-vector
• Hardware state bits associated with only blocks that are in the cache
  – other blocks can be seen as being in invalid (not-present) state in that cache
• Write will invalidate all other caches (no local change of state)
  – can have multiple simultaneous readers of block, but write invalidates them
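As a minimal executable sketch (names and the printed "bus" are illustrative, not from the slides), here is the two-state write-through, write-no-allocate protocol above as a transition function for one block in one cache:

/* Two-state write-through, write-no-allocate protocol for one block
 * in one cache. Bus transactions are printed rather than performed. */
#include <stdio.h>

typedef enum { I, V } State;
typedef enum { PR_RD, PR_WR, BUS_WR_OBSERVED } Event;

static State step(State s, Event e) {
    switch (e) {
    case PR_RD:
        if (s == I) { printf("BusRd\n"); return V; } /* miss: fetch block */
        return V;                                    /* hit: no transaction */
    case PR_WR:
        printf("BusWr\n"); /* every write goes on the bus and to memory */
        return s;          /* no-allocate: I stays I, V stays V */
    case BUS_WR_OBSERVED:
        return I;          /* another cache wrote this block: invalidate */
    }
    return s;
}

int main(void) {
    State s = I;
    s = step(s, PR_RD);           /* BusRd, now V  */
    s = step(s, PR_WR);           /* BusWr, stay V */
    s = step(s, BUS_WR_OBSERVED); /* remote write: back to I */
    printf("final state: %s\n", s == V ? "V" : "I");
    return 0;
}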
Is WT Coherent?

Assume:
• A bus transaction completes before next one starts
• Atomic bus transactions
• Atomic memory transactions

Write propagation? Write serialization?
• Key is the bus: writes serialized by bus

Problem with Write-Through

High bandwidth requirements
• Every write from every processor goes to shared bus and memory
• Consider a 500MHz, 1 CPI processor, where 15% of instructions are 8-byte stores
• Each processor generates 75M stores or 600MB of data per second
• 1GB/s bus can support only 1 processor without saturating
• Write-through especially unpopular for SMPs

Write-back caches absorb most writes as cache hits
• Write hits don't go on bus
• But now how do we ensure write propagation and serialization?
• Need more sophisticated protocols: large design space
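The arithmetic behind the example: at 500 MHz and 1 CPI the processor executes 500M instructions/s; 15% of those are stores, giving 0.15 × 500M = 75M stores/s; at 8 bytes each that is 75M × 8 B = 600 MB/s of store traffic per processor, so a 1 GB/s bus saturates before even a second processor is added.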

Write-Back Snoopy Protocols

No need to change processor, main memory, cache …
• Extend cache controller and exploit bus (provides serialization)

Dirty state now also indicates exclusive ownership
• Exclusive: only cache with a valid copy (main memory may be too)
• Owner: responsible for supplying block upon a request for it

Design space
• Invalidation versus Update-based protocols
• Set of states

Pause: Is Coherence enough?

        P1                              P2
/* Assume initial value of A and flag is 0 */
A = 1;                        while (flag == 0); /* spin idly */
flag = 1;                     print A;

• Coherence talks about 1 memory location


Pause: Is Coherence enough?

        P1                              P2
/* Assume initial value of A and flag is 0 */
A = 1;                        while (flag == 0); /* spin idly */
flag = 1;                     print A;

• Coherence talks about 1 memory location
• Consistency talks about different locations

Is there an interleaving of the partial orders of each processor that yields a total order that obeys program order?

Sequential Consistency

[Figure: processors P1, P2, …, Pn issuing memory references in program order to a single shared memory; a conceptual "switch" is randomly set after each memory reference.]

• (as if there were no caches, and a single memory)
• Total order achieved by interleaving accesses from different processes
• Maintains program order, and memory operations, from all processes, appear to [issue, execute, complete] atomically w.r.t. others
• Programmer's intuition is maintained
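As a runnable aside (a sketch, not from the slides; assumes a C11 compiler with <threads.h>), here is the flag example with sequentially consistent atomics. The seq_cst operations mimic the SC behavior the lecture assumes; with plain non-atomic variables, neither coherence alone nor the C memory model would guarantee that P2 prints 1:

/* The flag example using C11 threads and seq_cst atomics. */
#include <stdatomic.h>
#include <stdio.h>
#include <threads.h>

atomic_int A = 0, flag = 0;

int p1(void *arg) {
    (void)arg;
    atomic_store(&A, 1);    /* A = 1;    */
    atomic_store(&flag, 1); /* flag = 1; */
    return 0;
}

int p2(void *arg) {
    (void)arg;
    while (atomic_load(&flag) == 0) /* spin idly */
        ;
    printf("A = %d\n", atomic_load(&A)); /* prints 1 under SC */
    return 0;
}

int main(void) {
    thrd_t t1, t2;
    thrd_create(&t1, p1, NULL);
    thrd_create(&t2, p2, NULL);
    thrd_join(t1, NULL);
    thrd_join(t2, NULL);
    return 0;
}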

What Really is Program Order?

Intuitively, order in which operations appear in source code
• Straightforward translation of source code to assembly
• At most one memory operation per instruction

But not the same as order presented to hardware by compiler

So which is program order?

Depends on which layer, and who's doing the reasoning

We assume order as seen by programmer

SC Example

What matters is: the order in which it appears to execute, not the order in which it executes

        P1                              P2
/* Assume initial values of A and B are 0 */
(1a) A = 1;                   (2a) print B;
(1b) B = 2;                   (2b) print A;

• possible outcomes for (A,B): (0,0), (1,0), (1,2); impossible under SC: (0,2)
• we know 1a->1b and 2a->2b by program order
• A = 0 implies 2b->1a, which implies 2a->1b
• B = 2 implies 1b->2a, which leads to a contradiction
• BUT, actual execution 1b->1a->2b->2a is SC.
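A small brute-force check of this claim (a sketch, not from the slides): enumerate every interleaving of {1a,1b} and {2a,2b} that respects each process's program order, and collect the printed (A,B) pairs; (0,2) never appears.

/* Enumerate all interleavings of P1 = {1a: A=1; 1b: B=2} and
 * P2 = {2a: print B; 2b: print A} that respect program order. */
#include <stdio.h>

int main(void) {
    int saw_0_2 = 0;
    /* P1's two ops (1a then 1b) occupy slots i < j of 4 total slots;
     * P2's ops (2a then 2b) fill the remaining slots in order. */
    for (int i = 0; i < 4; i++) {
        for (int j = i + 1; j < 4; j++) {
            int A = 0, B = 0, outA = 0, outB = 0, p2_done = 0;
            for (int slot = 0; slot < 4; slot++) {
                if (slot == i)      A = 1;                   /* 1a: A = 1   */
                else if (slot == j) B = 2;                   /* 1b: B = 2   */
                else if (!p2_done) { outB = B; p2_done = 1; } /* 2a: print B */
                else                outA = A;                /* 2b: print A */
            }
            printf("1a@%d 1b@%d -> (A,B) = (%d,%d)\n", i, j, outA, outB);
            if (outA == 0 && outB == 2) saw_0_2 = 1;
        }
    }
    printf("(0,2) %s under SC\n", saw_0_2 ? "occurred (bug!)" : "never occurs");
    return 0;
}

Running it prints exactly the three outcomes the slide lists: (1,2), (1,0), and (0,0).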
Implementing SC

Two kinds of requirements
• Program order
  – memory operations issued by a process must appear to become visible (to others and itself) in program order
• Atomicity
  – in the overall total order, one memory operation should appear to complete with respect to all processes before the next one is issued
  – needed to guarantee that total order is consistent across processes
  – tricky part is making writes atomic

Write Atomicity

Write Atomicity: Position in total order at which a write appears to perform should be the same for all processes
• Nothing a process does after it has seen the new value produced by a write W should be visible to other processes until they too have seen W
• In effect, extends write serialization to writes from multiple processes

        P1                    P2                       P3
        A=1;                  while (A==0);
                              B=1;                     while (B==0);
                                                       print A;

• Transitivity implies A should print as 1 under SC
• Problem if P2 leaves loop, writes B, and P3 sees new B but old A (from its cache, say)

Sufficient Conditions for SC

1. Every process issues memory ops in program order
2. After a write op is issued, the issuing process waits for the write to complete before issuing its next op
3. After a read op is issued, the issuing process waits for the read to complete, and for the write whose value is being returned by the read to complete, before issuing its next operation (provides write atomicity)

Issues:
• Compilers (loop transforms, register allocation)
• Hardware (write buffers, OO-execution)
• Reason: uniprocessors care only about dependences to same location (i.e., above conditions VERY restrictive)

SC in Write-through Example

Provides SC, not just coherence

Extend arguments used for coherence
• Writes and read misses to all locations serialized by bus into bus order
• If read obtains value of write W, W guaranteed to have completed
  – since it caused a bus transaction
• When write W is performed w.r.t. any processor, all previous writes in bus order have completed
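To see why write buffers are on the "issues" list, consider the classic store-buffer litmus test below (a sketch, not from the slides; assumes C11 with <threads.h>). Under SC at least one thread must read 1, since whichever store comes first in the total order is visible to the other thread's load. A processor with a store buffer can let each load bypass the other thread's still-buffered store, so both may read 0; relaxed atomics are used here to legally expose that hardware behavior.

/* Store-buffer litmus test: (r1,r2) = (0,0) is impossible under SC
 * but may be observed on hardware with write buffers. */
#include <stdatomic.h>
#include <stdio.h>
#include <threads.h>

atomic_int x, y;
int r1, r2;

int t1(void *arg) {
    (void)arg;
    atomic_store_explicit(&x, 1, memory_order_relaxed); /* write x */
    r1 = atomic_load_explicit(&y, memory_order_relaxed); /* then read y */
    return 0;
}

int t2(void *arg) {
    (void)arg;
    atomic_store_explicit(&y, 1, memory_order_relaxed); /* write y */
    r2 = atomic_load_explicit(&x, memory_order_relaxed); /* then read x */
    return 0;
}

int main(void) {
    for (int i = 0; i < 100000; i++) {
        atomic_store(&x, 0);
        atomic_store(&y, 0);
        thrd_t a, b;
        thrd_create(&a, t1, NULL);
        thrd_create(&b, t2, NULL);
        thrd_join(a, NULL);
        thrd_join(b, NULL);
        if (r1 == 0 && r2 == 0) {
            printf("non-SC outcome (0,0) observed on iteration %d\n", i);
            return 0;
        }
    }
    printf("no (0,0) outcome observed\n");
    return 0;
}

Whether the (0,0) outcome actually shows up depends on the machine and timing; the point is that SC forbids it while real memory systems do not.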
Write-Back Snoopy Protocols

No need to change processor, main memory, cache …
• Extend cache controller and exploit bus (provides serialization)

Dirty state now also indicates exclusive ownership
• Exclusive: only cache with a valid copy (main memory may be too)
• Owner: responsible for supplying block upon a request for it

Design space
• Invalidation versus Update-based protocols
• Set of states

Invalidation-based Protocols

"Exclusive" state means can modify without notifying anyone else
• i.e. without bus transaction
• Must first get block in exclusive state before writing into it
• Even if already in valid state, need transaction, so called a write miss

Store to non-dirty data generates a read-exclusive bus transaction
• Tells others about impending write, obtains exclusive ownership
  – makes the write visible, i.e. write is performed
  – may be actually observed (by a read miss) only later
  – write hit made visible (performed) when block updated in writer's cache
• Only one RdX can succeed at a time for a block: serialized by bus

Read and Read-exclusive bus transactions drive coherence actions
• Writeback transactions also, but not caused by memory operation and quite incidental to coherence protocol
  – note: replaced block that is not in modified state can be dropped

Update-based Protocols

A write operation updates values in other caches
• New, update bus transaction

Advantages
• Other processors don't miss on next access: reduced latency
  – In invalidation protocols, they would miss and cause more transactions
• Single bus transaction to update several caches can save bandwidth
  – Also, only the word written is transferred, not whole block

Disadvantages
• Multiple writes by same processor cause multiple update transactions
  – In invalidation, first write gets exclusive ownership, others local

Detailed tradeoffs more complex

Invalidate versus Update

Basic question of program behavior
• Is a block written by one processor read by others before it is rewritten?

Invalidation:
• Yes => readers will take a miss
• No => multiple writes without additional traffic
  – and clears out copies that won't be used again

Update:
• Yes => readers will not miss if they had a copy previously
  – single bus transaction to update all copies
• No => multiple useless updates, even to dead copies

Need to look at program behavior and hardware complexity

Invalidation protocols much more popular
• Some systems provide both, or even hybrid (see the competitive sketch below)
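One way such a hybrid can work (a hedged sketch of "competitive" snooping in the spirit of the slide, not any specific machine's mechanism; the threshold K is an illustrative parameter): each cached copy carries a small counter that remote updates decrement and local accesses reset; when it hits zero the copy self-invalidates, bounding the useless-update traffic to a dead copy.

/* Competitive update/invalidate hybrid: tolerate at most K consecutive
 * remote updates to a line without a local access before dropping it. */
#include <stdbool.h>
#include <stdio.h>

#define K 4 /* updates tolerated before self-invalidation */

typedef struct {
    bool valid;
    int  countdown; /* remaining remote updates before we drop the line */
} Line;

static void local_access(Line *l) {
    if (l->valid)
        l->countdown = K; /* line is live here: keep absorbing updates */
}

static void remote_update(Line *l) {
    if (!l->valid)
        return;
    if (--l->countdown == 0) {
        l->valid = false; /* looks dead locally: stop absorbing updates */
        printf("self-invalidated after %d unread updates\n", K);
    }
}

int main(void) {
    Line l = { true, K };
    local_access(&l);
    for (int i = 0; i < 6; i++)
        remote_update(&l); /* producer keeps writing, we never read */
    printf("valid = %d\n", l.valid);
    return 0;
}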
Basic MSI Writeback Inval Protocol

States
• Invalid (I)
• Shared (S): one or more
• Dirty or Modified (M): one only

Processor Events:
• PrRd (read)
• PrWr (write)

Bus Transactions
• BusRd: asks for copy with no intent to modify
• BusRdX: asks for copy with intent to modify
• BusWB: updates memory

Actions
• Update state, perform bus transaction, flush value onto bus

State Transition Diagram

[Diagram: three states M, S, I.
 M: PrRd/—, PrWr/— (stay in M)
 M -> S on BusRd/Flush
 M -> I on BusRdX/Flush
 S: PrRd/—, BusRd/— (stay in S)
 S -> M on PrWr/BusRdX
 S -> I on BusRdX/—
 I -> S on PrRd/BusRd
 I -> M on PrWr/BusRdX]

• Write to shared block:
  – Already have latest data; can use upgrade (BusUpgr) instead of BusRdX
• Replacement changes state of two blocks: outgoing and incoming
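The diagram above as a transition function (a minimal sketch for one block in one cache; bus transactions and the Flush are printed rather than performed):

/* The MSI diagram for one block in one cache. */
#include <stdio.h>

typedef enum { I, S, M } State;
typedef enum { PR_RD, PR_WR, BUS_RD, BUS_RDX } Event;

static State msi(State s, Event e) {
    switch (e) {
    case PR_RD:
        if (s == I) { printf("BusRd\n"); return S; }
        return s;                       /* hit in S or M: no transaction */
    case PR_WR:
        if (s != M) printf("BusRdX\n"); /* (or BusUpgr if already in S) */
        return M;
    case BUS_RD:                        /* another cache reads the block */
        if (s == M) { printf("Flush\n"); return S; }
        return s;
    case BUS_RDX:                       /* another cache wants to write */
        if (s == M) printf("Flush\n");
        return I;
    }
    return s;
}

int main(void) {
    State s = I;
    s = msi(s, PR_RD);   /* I -> S via BusRd    */
    s = msi(s, PR_WR);   /* S -> M via BusRdX   */
    s = msi(s, BUS_RD);  /* M -> S, flush dirty */
    s = msi(s, BUS_RDX); /* S -> I              */
    printf("final: %d (0=I, 1=S, 2=M)\n", s);
    return 0;
}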

State Transition Diagram

[Same MSI diagram as on the previous slide, repeated for reference.]

Satisfying Coherence

Write propagation is clear

Write serialization?
• All writes that appear on the bus (BusRdX) ordered by the bus
  – Write performed in writer's cache before it handles other transactions, so ordered in same way even w.r.t. writer
• Reads that appear on the bus ordered wrt these
• Writes that don't appear on the bus:
  – sequence of such writes between two bus xactions for the block must come from same processor, say P
  – in serialization, the sequence appears between these two bus xactions
  – reads by P will see them in this order w.r.t. other bus transactions
  – reads by other processors separated from sequence by a bus xaction, which places them in the serialized order w.r.t the writes
  – so reads by all processors see writes in same order
Satisfying Sequential Consistency

1. Appeal to definition:
• Bus imposes total order on bus xactions for all locations
• Between xactions, procs perform reads/writes locally in program order
• So any execution defines a natural partial order
• In segment between two bus transactions, any interleaving of ops from different processors leads to consistent total order
• In such a segment, writes observed by processor P serialized as follows:
  – Writes from other processors by the previous bus xaction P issued
  – Writes from P by program order

2. Show sufficient conditions are satisfied:
• Write completion: can detect when write appears on bus
• Write atomicity: if a read returns the value of a write, that write has already become visible to all others (can reason different cases)

Lower-level Protocol Choices

BusRd observed in M state: what transition to make?

Depends on expectations of access patterns
• S: assumption that I'll read again soon, rather than other will write
  – good for mostly read data
  – what about "migratory" data?
    » I read and write, then you read and write, then X reads and writes...
    » better to go to I state, so I don't have to be invalidated on your write
• Synapse transitioned to I state
• Sequent Symmetry and MIT Alewife use adaptive protocols

Choices can affect performance of memory system

MESI (4-state) Invalidation Protocol

Problem with MSI protocol
• Reading and modifying data is 2 bus transactions, even if no sharing
  – e.g. even in sequential program
  – BusRd (I->S) followed by BusRdX or BusUpgr (S->M)

Add exclusive state: write locally without xaction, but not modified
• Main memory is up to date, so cache not necessarily owner
• States
  – invalid
  – exclusive or exclusive-clean (only this cache has copy, but not modified)
  – shared (two or more caches may have copies)
  – modified (dirty)
• I -> E on PrRd if no other processor has a copy
  – needs "shared" signal on bus: wired-or line asserted in response to BusRd

MESI State Transition Diagram

[Diagram: four states M, E, S, I.
 M: PrRd/—, PrWr/— (stay in M); M -> S on BusRd/Flush; M -> I on BusRdX/Flush
 E: PrRd/— (stay in E); E -> M on PrWr/—; E -> S on BusRd/Flush'; E -> I on BusRdX/Flush'
 S: PrRd/—, BusRd/Flush' (stay in S); S -> M on PrWr/BusRdX; S -> I on BusRdX/Flush'
 I -> E on PrRd/BusRd(S̄) (shared line not asserted); I -> S on PrRd/BusRd(S) (shared line asserted); I -> M on PrWr/BusRdX]

• BusRd(S) means shared line asserted on BusRd transaction
• Flush': if cache-to-cache sharing (see next), only one cache flushes data
• MOESI protocol: Owned state: exclusive but memory not valid
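Extending the earlier MSI sketch with the E state (hypothetical names; the shared-line input is modeled as a boolean, and the point is the I->E read followed by the silent E->M write):

/* MESI sketch for one block: a block read in private (shared line not
 * asserted) can later be written with no bus transaction at all. */
#include <stdbool.h>
#include <stdio.h>

typedef enum { I, S, E, M } State;

static State pr_rd(State s, bool shared_line) {
    if (s == I) {
        printf("BusRd\n");
        return shared_line ? S : E; /* I->S if others have it, else I->E */
    }
    return s;                       /* hit: no transaction */
}

static State pr_wr(State s) {
    if (s == E) return M;           /* silent upgrade: no bus transaction */
    if (s != M) printf("BusRdX\n"); /* from I or S */
    return M;
}

static State snoop_bus_rd(State s) {
    if (s == M) printf("Flush\n");  /* supply dirty data */
    return (s == I) ? I : S;        /* M, E, S all drop to S */
}

static State snoop_bus_rdx(State s) {
    if (s == M) printf("Flush\n");
    return I;
}

int main(void) {
    State s = I;
    s = pr_rd(s, false);  /* BusRd, no sharers: I -> E          */
    s = pr_wr(s);         /* E -> M with no bus transaction     */
    s = snoop_bus_rd(s);  /* M -> S, Flush                      */
    s = snoop_bus_rdx(s); /* S -> I                             */
    printf("final: %d (0=I, 1=S, 2=E, 3=M)\n", s);
    return 0;
}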
Lower-level Protocol Choices

Who supplies data on miss when not in M state: memory or cache?

Original, Illinois MESI: cache, since assumed faster than memory
• Cache-to-cache sharing

Not true in modern systems
• Intervening in another cache more expensive than getting from memory

Cache-to-cache sharing also adds complexity
• How does memory know it should supply data (must wait for caches)
• Selection algorithm if multiple caches have valid data

But valuable for cache-coherent machines with distributed memory
• May be cheaper to obtain from nearby cache than distant memory
• Especially when constructed out of SMP nodes (Stanford DASH)

Dragon Write-back Update Protocol

4 states
• Exclusive-clean or exclusive (E): I and memory have it
• Shared clean (Sc): I, others, and maybe memory, but I'm not owner
• Shared modified (Sm): I and others but not memory, and I'm the owner
  – Sm and Sc can coexist in different caches, with only one Sm
• Modified or dirty (D): I and nobody else

No invalid state
• If in cache, cannot be invalid
• If not present in cache, can view as being in not-present or invalid state

New processor events: PrRdMiss, PrWrMiss
• Introduced to specify actions when block not present in cache

New bus transaction: BusUpd
• Broadcasts single word written on bus; updates other relevant caches
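A sketch of the Dragon write side for one block in one cache (hypothetical names; the shared line, modeled as a boolean, tells the writer whether other copies exist and hence whether a BusUpd is needed):

/* Dragon update protocol, write side. Bus actions are printed. */
#include <stdbool.h>
#include <stdio.h>

typedef enum { E, Sc, Sm, M } State; /* no invalid state in Dragon */

static State pr_wr(State s, bool shared) {
    switch (s) {
    case E:  return M;          /* exclusive-clean: silent upgrade */
    case M:  return M;          /* already dirty and alone */
    case Sc:
    case Sm:
        printf("BusUpd\n");     /* broadcast the written word */
        return shared ? Sm : M; /* last copy? then own it outright */
    }
    return s;
}

static State pr_wr_miss(bool shared) {
    printf("BusRd");            /* fetch the block */
    if (shared) printf("; BusUpd"); /* then update the other copies */
    printf("\n");
    return shared ? Sm : M;
}

int main(void) {
    State s = pr_wr_miss(true); /* others have it: BusRd; BusUpd -> Sm */
    s = pr_wr(s, true);         /* still shared: BusUpd, stay Sm       */
    s = pr_wr(s, false);        /* sharers gone: BusUpd, move to M     */
    printf("final: %d (0=E, 1=Sc, 2=Sm, 3=M)\n", s);
    return 0;
}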

Dragon State Transition Diagram

[Diagram: four states E, Sc, Sm, M; no invalid state.
 From not-present: PrRdMiss/BusRd(S̄) -> E; PrRdMiss/BusRd(S) -> Sc; PrWrMiss/BusRd(S̄) -> M; PrWrMiss/(BusRd(S); BusUpd) -> Sm
 E: PrRd/— (stay); E -> Sc on BusRd/—; E -> M on PrWr/—
 Sc: PrRd/—, BusRd/—, BusUpd/Update (stay); Sc -> Sm on PrWr/BusUpd(S); Sc -> M on PrWr/BusUpd(S̄)
 Sm: PrRd/—, BusRd/Flush, PrWr/BusUpd(S) (stay); Sm -> Sc on BusUpd/Update; Sm -> M on PrWr/BusUpd(S̄)
 M: PrRd/—, PrWr/— (stay); M -> Sm on BusRd/Flush]


Lower-level Protocol Choices

Can shared-modified state be eliminated?
• If update memory as well on BusUpd transactions (DEC Firefly)
• Dragon protocol doesn't (assumes DRAM memory slow to update)

Should replacement of an Sc block be broadcast?
• Would allow last copy to go to E state and not generate updates
• Replacement bus xaction is not in critical path, later update may be

Shouldn't update local copy on write hit before controller gets bus
• Can mess up serialization

Coherence, consistency considerations much like write-through case

In general, many subtle race conditions in protocols

Assessing Protocol Tradeoffs

Tradeoffs affected by performance and organization characteristics

Part art and part science
• Art: experience, intuition and aesthetics of designers
• Science: workload-driven evaluation for cost-performance
  – want a balanced system: no expensive resource heavily underutilized

Methodology:
• Use simulator; choose parameters per earlier methodology (default 1MB, 4-way cache, 64-byte block, 16 processors; 64K cache for some)
• Focus on frequencies, not end performance for now
  – transcends architectural details, but not what we're really after
• Use idealized memory performance model to avoid changes of reference interleaving across processors with machine parameters
  – Cheap simulation: no need to model contention

But first, let's illustrate quantitative assessment at logical level

Impact of Protocol Optimizations

(Computing traffic from state transitions discussed in book)
Effect of E state, and of BusUpgr instead of BusRdX

[Figure: bar charts of address-bus and data-bus traffic (MB/s) for the Illinois MESI (Ill), three-state (3St), and three-state read-exclusive (3St-RdEx) protocol variants across the workloads Barnes, LU, Ocean, Radiosity, Radix, Raytrace, Appl-Code, Appl-Data, OS-Code, and OS-Data.]

• MSI versus MESI doesn't seem to matter for bw for these workloads
• Upgrades instead of read-exclusive helps
• Same story when working sets don't fit for Ocean, Radix, Raytrace

Update versus Invalidate

Much debate over the years: tradeoff depends on sharing patterns

Intuition:
• If those that used continue to use, and writes between use are few, which should do better?
  – e.g. producer-consumer pattern
• If those that use unlikely to use again, or many writes between reads, which should do better?
  – "pack rat" phenomenon particularly bad under process migration
  – useless updates where only last one will be used

Can construct scenarios where one or other is much better

Can combine them in hybrid schemes!
• E.g. competitive: observe patterns at runtime and change protocol

Let's look at real workloads
Update vs Invalidate: Miss Rates

[Figure: miss rate (%) broken into cold, capacity, true-sharing, and false-sharing components for LU, Ocean, Raytrace, and Radix under invalidation (inv), update (upd), and mixed (mix) protocols.]

• Lots of coherence misses: updates help
• Lots of capacity misses: updates hurt (keep data in cache uselessly)
• Updates seem to help, but this ignores upgrade and update traffic

Upgrade and Update Rates (Traffic)

[Figure: upgrade/update rate (%) for the same workloads and protocol variants.]

• Update traffic is substantial
• Main cause is multiple writes by a processor before a read by other
  – many bus transactions versus one in invalidation case
  – could delay updates or use merging
• Overall, trend is away from update-based protocols as default
  – bandwidth, complexity, large-blocks trend, pack rat for process migration
• Will see later that updates have greater problems for scalable systems

Impact of Cache Block Size

Multiprocessors add new kind of miss to cold, capacity, conflict
• Coherence misses: true sharing and false sharing
  – latter due to granularity of coherence being larger than a word
• Both miss rate and traffic matter

Reducing misses architecturally in invalidation protocol
• Capacity: enlarge cache; increase block size (if spatial locality)
• Conflict: increase associativity
• Cold and Coherence: only block size

Increasing block size has advantages and disadvantages
• Can reduce misses if spatial locality is good
• Can hurt too
  – increase misses due to false sharing if spatial locality not good
  – increase misses due to conflicts in fixed-size cache
  – increase traffic due to fetching unnecessary data and due to false sharing
  – can increase miss penalty and perhaps hit cost

Impact of Block Size on Miss Rate

Results shown only for default problem size: varied behavior
• Need to examine impact of problem size and p as well

[Figure: miss rate (%) versus block size (8 to 256 bytes) for Barnes, LU, Ocean, Radiosity, Radix, and Raytrace, broken into cold, capacity, true-sharing, false-sharing, and upgrade components.]

• Working set doesn't fit: impact on capacity misses much more critical
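False sharing is easy to demonstrate (a sketch, not from the slides; assumes C11 with <threads.h>, and the 64-byte block size is an assumption): two threads each increment their own counter, but if the counters sit in the same cache block, every write invalidates the other cache's copy and the block ping-pongs between caches.

/* False-sharing demo: compare run time with PAD = 1 (counters share a
 * block) versus PAD = 64 (counters in separate blocks, on typical
 * 64-byte-line machines). The effect is machine-dependent. */
#include <stdio.h>
#include <threads.h>

#define PAD 64 /* set to 1 to force the counters into one block */
#define ITERS 100000000L

struct { volatile long v; char pad[PAD]; } counters[2];

int worker(void *arg) {
    int id = *(int *)arg;
    for (long i = 0; i < ITERS; i++)
        counters[id].v++; /* write hits stay local only if no sharing */
    return 0;
}

int main(void) {
    thrd_t t[2];
    int ids[2] = { 0, 1 };
    thrd_create(&t[0], worker, &ids[0]);
    thrd_create(&t[1], worker, &ids[1]);
    thrd_join(t[0], NULL);
    thrd_join(t[1], NULL);
    printf("%ld %ld\n", counters[0].v, counters[1].v);
    return 0;
}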
Impact of Block Size on Traffic

Traffic affects performance indirectly through contention

[Figure: address-bus and data-bus traffic (bytes/instruction; bytes/FLOP for LU and Ocean) versus block size (8 to 256 bytes) for Barnes, LU, Radiosity, Radix, Raytrace, and Ocean.]

• Results different than for miss rate: traffic almost always increases
• When working sets fit, overall traffic still small, except for Radix
• Fixed overhead is significant component
  – So total traffic often minimized at 16-32 byte block, not smaller
• Working set doesn't fit: even 128-byte good for Ocean due to capacity

Making Large Blocks More Effective

Software
• Improve spatial locality by better data structuring
• Compiler techniques

Hardware
• Retain granularity of transfer but reduce granularity of coherence
  – use subblocks: same tag but different state bits
  – one subblock may be valid but another invalid or dirty
• Reduce both granularities, but prefetch more blocks on a miss
• Proposals for adjustable cache size
• More subtle: delay propagation of invalidations and perform all at once
  – But can change consistency model: discuss later in course
• Use update instead of invalidate protocols to reduce false sharing effect
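The subblock idea in miniature (a hedged sketch; block and subblock sizes are illustrative): one tag covers the whole transfer block, while coherence state is tracked per subblock, so a remote write to one word need not invalidate its neighbors.

/* Subblocked cache line: one tag per 64-byte transfer block, but a
 * separate MSI state per 16-byte subblock (sizes are illustrative). */
#include <stdio.h>

typedef enum { I, S, M } SubState;

#define SUBBLOCKS 4 /* 64-byte block / 16-byte subblocks */

typedef struct {
    unsigned long tag;
    SubState state[SUBBLOCKS]; /* coherence granularity < transfer granularity */
} Line;

/* A snooped BusRdX for one word invalidates only its subblock. */
static void snoop_bus_rdx(Line *l, int offset_bytes) {
    int sb = (offset_bytes / 16) % SUBBLOCKS;
    l->state[sb] = I; /* neighbors keep their state: less false sharing */
}

int main(void) {
    Line l = { 0x40, { S, S, S, S } };
    snoop_bus_rdx(&l, 20); /* remote write to bytes 16..31 */
    for (int i = 0; i < SUBBLOCKS; i++)
        printf("subblock %d: %s\n", i, l.state[i] == I ? "I" : "S");
    return 0;
}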
