Presentation On Multithreading/Vector

Multithreading and simultaneous multithreading (SMT) are approaches to exploiting threads to increase performance beyond single-thread instruction-level parallelism (ILP). With multithreading, multiple threads share the functional units of a processor via overlapping execution, allowing other threads to execute when one thread stalls. SMT aims to exploit both ILP and thread-level parallelism (TLP) simultaneously by issuing instructions from multiple threads each cycle, to better utilize functional units and hide stalls.


CS252 Graduate Computer Architecture
Lecture 12: Multithreading / Vector Processing
March 2nd, 2011

John Kubiatowicz
Electrical Engineering and Computer Sciences
University of California, Berkeley
http://www.eecs.berkeley.edu/~kubitron/cs252

Performance beyond single thread ILP

• There can be much higher natural parallelism in some applications
  – e.g., database or scientific codes
  – Explicit Thread Level Parallelism or Data Level Parallelism
• Thread: instruction stream with its own PC and data
  – A thread may be a process that is part of a parallel program of multiple processes, or it may be an independent program
  – Each thread has all the state (instructions, data, PC, register state, and so on) necessary to allow it to execute
• Thread Level Parallelism (TLP):
  – Exploit the parallelism inherent between threads to improve performance
• Data Level Parallelism (DLP):
  – Perform identical operations on data, and lots of data (see the sketch below)
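To make TLP and DLP concrete, here is a minimal C sketch (the function and variable names are illustrative, not from the lecture): TLP runs independent instruction streams as POSIX threads, while DLP applies the identical operation across an array.

    #include <pthread.h>
    #include <stdio.h>

    #define N 1024

    static double a[N], b[N], c[N];

    /* DLP: identical operation applied to lots of data (a vectorizable loop). */
    static void dlp_add(void) {
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];
    }

    /* TLP: each thread is an independent instruction stream with its own PC. */
    static void *tlp_worker(void *arg) {
        long id = (long)arg;
        printf("thread %ld running its own instruction stream\n", id);
        return NULL;
    }

    int main(void) {
        pthread_t t[2];
        for (long i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, tlp_worker, (void *)i);
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);
        dlp_add();
        return 0;
    }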

One approach to exploiting threads:
Multithreading (TLP within processor)

• Multithreading: multiple threads share the functional units of 1 processor via overlapping
  – processor must duplicate independent state of each thread, e.g., a separate copy of register file, a separate PC, and for running independent programs, a separate page table
  – memory shared through the virtual memory mechanisms, which already support multiple processes
  – HW for fast thread switch; much faster than full process switch (100s to 1000s of clocks)
• When to switch?
  – Alternate instruction per thread (fine grain)
  – When a thread is stalled, perhaps for a cache miss, another thread can be executed (coarse grain)

Fine-Grained Multithreading

• Switches between threads on each instruction, causing the execution of multiple threads to be interleaved
  – Usually done in a round-robin fashion, skipping any stalled threads (see the sketch below)
  – CPU must be able to switch threads every clock
• Advantage:
  – can hide both short and long stalls, since instructions from other threads are executed when one thread stalls
• Disadvantage:
  – slows down execution of individual threads, since a thread ready to execute without stalls will be delayed by instructions from other threads
• Used on Sun's Niagara (recent), several research multiprocessors, Tera
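A toy model of fine-grained issue, as a sketch (the thread states and numbers are invented for illustration): each cycle the scheduler picks the next ready thread round-robin, skipping any thread that is stalled.

    #include <stdio.h>

    #define NTHREADS 4

    /* Toy per-thread state: stalled_until models an outstanding miss. */
    static int stalled_until[NTHREADS];   /* cycle at which thread is ready again */
    static int last = NTHREADS - 1;       /* last thread issued, for round-robin */

    /* Pick the next ready thread round-robin, skipping stalled threads.
     * Returns -1 if every thread is stalled this cycle (an idle slot). */
    static int pick_thread(int cycle) {
        for (int i = 1; i <= NTHREADS; i++) {
            int t = (last + i) % NTHREADS;
            if (stalled_until[t] <= cycle) {
                last = t;
                return t;
            }
        }
        return -1;
    }

    int main(void) {
        stalled_until[1] = 5;   /* pretend thread 1 missed in the cache */
        for (int cycle = 0; cycle < 8; cycle++) {
            int t = pick_thread(cycle);
            if (t < 0)
                printf("cycle %d: idle (all threads stalled)\n", cycle);
            else
                printf("cycle %d: issue from thread %d\n", cycle, t);
        }
        return 0;
    }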


Coarse-Grained Multithreading

• Switches threads only on costly stalls, such as L2 cache misses
• Advantages
  – Relieves need to have very fast thread-switching
  – Doesn't slow down the thread, since instructions from other threads are issued only when the thread encounters a costly stall
• Disadvantage: hard to overcome throughput losses from shorter stalls, due to pipeline start-up costs
  – Since CPU issues instructions from 1 thread, when a stall occurs, the pipeline must be emptied or frozen
  – New thread must fill pipeline before instructions can complete
• Because of this start-up overhead, coarse-grained multithreading is better for reducing the penalty of high-cost stalls, where pipeline refill << stall time (see the sketch below)
• Used in IBM AS/400, Sparcle (for Alewife)

Simultaneous Multithreading (SMT): Do both ILP and TLP

• TLP and ILP exploit two different kinds of parallel structure in a program
• Could a processor oriented at ILP exploit TLP?
  – functional units are often idle in a data path designed for ILP because of either stalls or dependences in the code
• Could TLP be used as a source of independent instructions that might keep the processor busy during stalls?
• Could TLP be used to employ the functional units that would otherwise lie idle when insufficient ILP exists?
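A back-of-envelope sketch of the "pipeline refill << stall time" condition (all cost parameters below are assumptions for illustration, not figures from the lecture): switching threads on a stall pays off only when the stall being hidden is much longer than the refill penalty.

    /* Toy model: effective cycles per instruction under coarse-grained
     * multithreading. Switching on a stall costs refill_cycles, so it only
     * helps when stall_cycles >> refill_cycles. All numbers are illustrative. */
    #include <stdio.h>

    static double effective_cpi(double base_cpi, double stall_freq,
                                double stall_cycles, double refill_cycles) {
        /* Without switching: pay the full stall every time. */
        double no_switch = base_cpi + stall_freq * stall_cycles;
        /* With switching: hide the stall behind another thread, pay the refill. */
        double with_switch = base_cpi + stall_freq * refill_cycles;
        return with_switch < no_switch ? with_switch : no_switch;
    }

    int main(void) {
        /* L2 miss: long stall, refill comparatively cheap -> switching wins. */
        printf("L2 miss: %.3f CPI\n", effective_cpi(1.0, 0.01, 200.0, 15.0));
        /* L1 miss: short stall, refill dominates -> switching doesn't help. */
        printf("L1 miss: %.3f CPI\n", effective_cpi(1.0, 0.05, 10.0, 15.0));
        return 0;
    }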

Justification: For most apps, most execution units lie idle

[Figure: issue-slot utilization for an 8-way superscalar, showing most execution units idle on a single thread. From Tullsen, Eggers, and Levy, "Simultaneous Multithreading: Maximizing On-chip Parallelism," ISCA 1995.]

Simultaneous Multi-threading ...

[Figure: cycle-by-cycle issue slots for one thread vs. two threads on 8 units. M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes.]
Simultaneous Multithreading Details

• Simultaneous multithreading (SMT): insight that a dynamically scheduled processor already has many HW mechanisms to support multithreading
  – Large set of virtual registers that can be used to hold the register sets of independent threads
  – Register renaming provides unique register identifiers, so instructions from multiple threads can be mixed in the datapath without confusing sources and destinations across threads (see the sketch below)
  – Out-of-order completion allows the threads to execute out of order, and get better utilization of the HW
• Just adding a per-thread renaming table and keeping separate PCs
  – Independent commitment can be supported by logically keeping a separate reorder buffer for each thread

Source: Microprocessor Report, December 6, 1999, "Compaq Chooses SMT for Alpha"

Design Challenges in SMT

• Since SMT makes sense only with a fine-grained implementation, what is the impact of fine-grained scheduling on single-thread performance?
  – A preferred thread approach sacrifices neither throughput nor single-thread performance?
  – Unfortunately, with a preferred thread, the processor is likely to sacrifice some throughput when the preferred thread stalls
• Larger register file needed to hold multiple contexts
• Clock cycle time, especially in:
  – Instruction issue - more candidate instructions need to be considered
  – Instruction completion - choosing which instructions to commit may be challenging
• Ensuring that cache and TLB conflicts generated by SMT do not degrade performance
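A minimal sketch of the per-thread renaming idea (the data structures are hypothetical simplifications, not a real design): each thread keeps its own map from architectural to physical registers, while physical registers come from one shared pool, so instructions from different threads can mix in the datapath without confusing operands.

    #include <stdio.h>

    #define NTHREADS 2
    #define NARCH    8    /* architectural registers per thread (toy size) */
    #define NPHYS    32   /* shared physical register pool */

    /* Per-thread rename table: architectural -> physical register. */
    static int rename_table[NTHREADS][NARCH];
    static int next_free = 0;   /* trivial bump allocator for the pool */

    /* Allocate a fresh physical register for a destination and record the
     * mapping in the issuing thread's own table. */
    static int rename_dest(int thread, int arch_reg) {
        int phys = next_free++ % NPHYS;   /* toy allocation, no reclamation */
        rename_table[thread][arch_reg] = phys;
        return phys;
    }

    /* Look up a source operand through the thread's own table. */
    static int rename_src(int thread, int arch_reg) {
        return rename_table[thread][arch_reg];
    }

    int main(void) {
        /* Both threads write "r3" but receive distinct physical registers. */
        int p0 = rename_dest(0, 3);
        int p1 = rename_dest(1, 3);
        printf("thread 0 r3 -> p%d, thread 1 r3 -> p%d\n", p0, p1);
        printf("thread 0 reads r3 from p%d\n", rename_src(0, 3));
        return 0;
    }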

Power 4

[Figure: Power 4 pipeline. Single-threaded predecessor to Power 5; 8 execution units in the out-of-order engine, each may issue an instruction each cycle.]

Power 5

[Figure: Power 5 pipeline. 2 fetch (PC), 2 initial decodes, 2 commits (architected register sets).]

Power 5 data flow ...

[Figure: Power 5 data flow.]

Why only 2 threads? With 4, one of the shared resources (physical registers, cache, memory bandwidth) would be prone to bottleneck.

Power 5 thread performance ...

[Figure: Power 5 thread performance.]

Relative priority of each thread is controllable in hardware. For balanced operation, both threads run slower than if they "owned" the machine.

Changes in Power 5 to support SMT

• Increased associativity of L1 instruction cache and the instruction address translation buffers
• Added per-thread load and store queues
• Increased size of the L2 (1.92 vs. 1.44 MB) and L3 caches
• Added separate instruction prefetch and buffering per thread
• Increased the number of virtual registers from 152 to 240
• Increased the size of several issue queues
• The Power5 core is about 24% larger than the Power4 core because of the addition of SMT support

Initial Performance of SMT

• Pentium 4 Extreme SMT yields 1.01 speedup for the SPECint_rate benchmark and 1.07 for SPECfp_rate
  – Pentium 4 is dual-threaded SMT
  – SPECRate requires that each SPEC benchmark be run against a vendor-selected number of copies of the same benchmark
• Running on Pentium 4 each of 26 SPEC benchmarks paired with every other (26² runs): speed-ups from 0.90 to 1.58; average was 1.20
• Power 5, 8-processor server: 1.23 faster for SPECint_rate with SMT, 1.16 faster for SPECfp_rate
• Power 5 running 2 copies of each app: speedup between 0.89 and 1.41
  – Most gained some
  – Fl.Pt. apps had most cache conflicts and least gains


Multithreaded Categories

[Figure: issue-slot diagrams over time (processor cycles) for Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing, and Simultaneous Multithreading; slots shaded by Thread 1 through Thread 5, or left as idle slots.]

Administrivia

• Exam: Wednesday 3/30
  Location: 320 Soda
  Time: 2:30-5:30
  – This info is on the Lecture page (has been)
  – Get one 8½ by 11 sheet of notes (both sides)
  – Meet at LaVal's afterwards for Pizza and Beverages
• CS252 First Project proposal due by Friday 3/4
  – Need two people/project (although can justify three for the right project)
  – Complete research project in 9 weeks
    » Typically investigate a hypothesis by building an artifact and measuring it against a "base case"
    » Generate conference-length paper/give oral presentation
    » Often, can lead to an actual publication.

Discussion of SPARCLE paper

• Example of close coupling between processor and memory controller (CMMU)
  – All of the features mentioned in this paper are implemented by a combination of processor and memory controller
  – Some functions implemented as special "coprocessor" instructions
  – Others implemented as "tagged" loads/stores/swaps
• Coarse Grained Multithreading
  – Using SPARC register windows
  – Automatic synchronous trap on cache miss
  – Fast handling of all other traps/interrupts (great for message interface!)
  – Multithreading half in hardware/half software (hence 14 cycles)
• Fine Grained Synchronization (see the sketch after this slide)
  – Full/Empty bit per 32-bit word (effectively 33 bits)
    » Groups of 4 words/cache line: F/E bits put into memory TAG
  – Fast TRAP on bad condition
  – Multiple instructions. Examples:
    » LDT (load/trap if empty)
    » LDET (load/set empty/trap if empty)
    » STF (store/set full)
    » STFT (store/set full/trap if full)

Discussion of Papers: Sparcle (Con't)

• Message Interface
  – Closely coupled with processor
    » Interface at speed of first-level cache
  – Atomic message launch:
    » Describe message (including DMA ops) with simple stio insts
    » Atomic launch instruction (ipilaunch)
  – Message Reception
    » Possible interrupt on message receive: use fast context switch
    » Examine message with simple ldio instructions
    » Discard in pieces, possibly with DMA
    » Free message (ipicst, i.e., "coherent storeback")
• We will talk about the message interface in greater detail
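A software sketch of full/empty-bit synchronization in the style of Sparcle's LDET/STFT (this C emulation is illustrative only; the real operations are hardware instructions that trap rather than return a status):

    #include <stdbool.h>
    #include <stdio.h>

    /* A 32-bit word plus its full/empty tag bit: effectively 33 bits. */
    typedef struct {
        unsigned value;
        bool     full;
    } fe_word_t;

    /* Emulate LDET (load, set empty, trap if empty): consume a value.
     * Returns false where the hardware would take a fast trap. */
    static bool ldet(fe_word_t *w, unsigned *out) {
        if (!w->full)
            return false;       /* hardware: fast TRAP on bad condition */
        *out = w->value;
        w->full = false;        /* mark empty: value consumed */
        return true;
    }

    /* Emulate STFT (store, set full, trap if full): produce a value. */
    static bool stft(fe_word_t *w, unsigned v) {
        if (w->full)
            return false;       /* hardware: trap if already full */
        w->value = v;
        w->full = true;         /* mark full: value available */
        return true;
    }

    int main(void) {
        fe_word_t w = { 0, false };
        unsigned v;
        printf("stft: %s\n", stft(&w, 42) ? "ok" : "trap");
        printf("ldet: %s\n", ldet(&w, &v) ? "ok" : "trap");
        printf("ldet again: %s (word now empty)\n", ldet(&w, &v) ? "ok" : "trap");
        return 0;
    }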
Supercomputers

Definitions of a supercomputer:
• Fastest machine in world at given task
• A device to turn a compute-bound problem into an I/O-bound problem
• Any machine costing $30M+
• Any machine designed by Seymour Cray

CDC6600 (Cray, 1964) regarded as first supercomputer

Vector Supercomputers

Epitomized by Cray-1, 1976: Scalar Unit + Vector Extensions
• Load/Store Architecture
• Vector Registers
• Vector Instructions
• Hardwired Control
• Highly Pipelined Functional Units
• Interleaved Memory System
• No Data Caches
• No Virtual Memory

Cray-1 (1976)

[Photo of the Cray-1.]

[Figure: Cray-1 block diagram. Eight 64-element vector registers (V0-V7) with vector mask and vector length registers; scalar registers S0-S7 backed by 64 T registers; address registers A0-A7 backed by 64 B registers; functional units for FP Add, FP Mul, FP Recip, Int Add, Int Logic, Int Shift, Pop Cnt, Addr Add, and Addr Mul; 4 instruction buffers (16 x 64 bits) feeding NIP/CIP/LIP. Single-port memory: 16 banks of 64-bit words with 8-bit SECDED; 80 MW/sec data load/store; 320 MW/sec instruction buffer refill. Memory bank cycle 50 ns; processor cycle 12.5 ns (80 MHz).]
Vector Programming Model

[Figure: scalar registers r0-r15 beside vector registers v0-v15, each vector register holding elements [0] through [VLRMAX-1], with a Vector Length Register VLR.]

• Vector Arithmetic Instructions, e.g. ADDV v3, v1, v2: elementwise v3[i] = v1[i] + v2[i] for i = 0 .. VLR-1
• Vector Load and Store Instructions, e.g. LV v1, r1, r2: load v1 from memory starting at base address r1 with stride r2

(A C sketch of these semantics follows the summary below.)

Multithreading and Vector Summary

• Explicit parallelism (data-level parallelism or thread-level parallelism) is the next step to performance
• Coarse-grain vs. fine-grained multithreading
  – Only on big stall vs. every clock cycle
• Simultaneous Multithreading is fine-grained multithreading based on an OOO superscalar microarchitecture
  – Instead of replicating registers, reuse rename registers
• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, then simpler hardware, more energy efficient, and better real-time model than out-of-order machines
  – Design issues include number of lanes, number of functional units, number of vector registers, length of vector registers, exception handling, conditional operations
• Fundamental design issue is memory bandwidth
  – With virtual address translation and caching
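As promised above, a minimal C sketch of the semantics of ADDV and a strided LV (the helper names and the toy VLR value are illustrative, not part of the Cray ISA):

    #include <stdio.h>

    #define VLRMAX 64          /* Cray-1 style: 64-element vector registers */

    static double v1[VLRMAX], v2[VLRMAX], v3[VLRMAX];
    static int vlr = 8;        /* vector length register (toy value) */

    /* LV v1, r1, r2 -- load vlr elements from base address r1 with stride r2
     * (stride measured in elements here, for simplicity). */
    static void lv(double *vreg, const double *base, long stride) {
        for (int i = 0; i < vlr; i++)
            vreg[i] = base[i * stride];
    }

    /* ADDV v3, v1, v2 -- elementwise add under the vector length register. */
    static void addv(double *vd, const double *va, const double *vb) {
        for (int i = 0; i < vlr; i++)
            vd[i] = va[i] + vb[i];
    }

    int main(void) {
        double mem[64];
        for (int i = 0; i < 64; i++)
            mem[i] = (double)i;
        lv(v1, mem, 1);                  /* unit stride */
        lv(v2, mem, 2);                  /* stride of 2 elements */
        addv(v3, v1, v2);
        printf("v3[3] = %g\n", v3[3]);   /* 3 + 6 = 9 */
        return 0;
    }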
