
http://inst.eecs.berkeley.edu/~cs152

CS 152/252A Computer Architecture and Engineering
Sophia Shao
Lecture 6 – Memory
NASA, Microchip, SiFive Announce Partnership for RISC-V Spaceflight Computing Platform

NASA has confirmed a partnership with Microchip and SiFive to create a space-centric processor built around the free and open-source RISC-V architecture: the High-Performance Spaceflight Computing (HPSC) chip.
https://www.hackster.io/news/nasa-microchip-sifive-announces-partnership-for-risc-v-spaceflight-computing-platform-f52c55cf14f6
Last time in Lecture 4
§ Handling exceptions in pipelined machines by passing exceptions down the pipeline until instructions cross the commit point in order
§ Can use values before commit through the bypass network
§ Pipeline hazards can be avoided through software techniques: scheduling, loop unrolling
§ Decoupled architectures use queues between “access” and “execute” pipelines to tolerate long memory latency
§ Regularizing all functional units to have the same latency simplifies more complex pipeline design by avoiding structural hazards, and can be extended to in-order superscalar designs

2
More Complex In-Order Pipeline

[Pipeline diagram: PC → Inst. Mem → Decode reads GPRs/FPRs; integer pipe X1, X2 + Data Mem, X3, W; FAdd pipe X1, X2, X3, W; FMul occupies X2, X3; an unpipelined divider (FDiv) sits in X2, X3. Commit point at W; bypassing forwards results to earlier stages.]

§ Delay writeback so all operations have same latency to W stage
– Write ports never oversubscribed (one inst. in & one inst. out every cycle)
– Stall pipeline on long-latency operations, e.g., divides, cache misses
– Handle exceptions in-order at commit point
§ How to prevent increased writeback latency from slowing down single-cycle integer operations? Bypassing.
3
In-Order Superscalar Pipeline

[Pipeline diagram: as on the previous slide, but PC feeds a dual-ported Inst. Mem that fetches 2 instructions per cycle into Decode; the integer/memory pipe and the floating-point pipes (FAdd, FMul, unpipelined FDiv) operate side by side up to the shared commit point.]

§ Fetch two instructions per cycle; issue both simultaneously if one is integer/memory and the other is floating point
§ Inexpensive way of increasing throughput; examples include Alpha 21064 (1992) & MIPS R5000 series (1996)
§ Same idea can be extended to wider issue by duplicating functional units (e.g., 4-issue UltraSPARC & Alpha 21164), but regfile ports and bypassing costs grow quickly
4
Early Read-Only Memory Technologies

[Photos:]
• Punched cards, from early 1700s through Jacquard Loom, Babbage, and then IBM
• Punched paper tape, instruction stream in Harvard Mk 1
• Diode Matrix, EDSAC-2 µcode store
• IBM Balanced Capacitor ROS
• IBM Card Capacitor ROS
5
Early Read/Write Main Memory Technologies

[Photos:]
• Babbage, 1800s: digits stored on mechanical wheels
• Williams Tube, Manchester Mark 1, 1947
• Rotating magnetic drum memory on IBM 650, 1954
6
MIT Whirlwind Core Memory, 1950

7
Core Memory
§ Core memory was first large-scale reliable main memory
– invented by Forrester in late ’40s/early ’50s at MIT for Whirlwind project
§ Bits stored as magnetization polarity on small ferrite cores threaded onto two-dimensional grid of wires
§ Coincident current pulses on X and Y wires would write cell and also sense original state (destructive reads)
§ Robust, non-volatile storage
§ Used on space shuttle computers
§ Cores threaded onto wires by hand (25 billion a year at peak production)
§ Core access time ~1µs

[Photo: DEC PDP-8/E board, 4K words × 12 bits (1968)]
8
Semiconductor Memory
§ Semiconductor memory began to be competitive in early 1970s
– Intel formed to exploit market for semiconductor memory
– Early semiconductor memory was Static RAM (SRAM). SRAM cell internals similar to a latch (cross-coupled inverters).
§ First commercial Dynamic RAM (DRAM) was Intel 1103
– 1Kbit of storage on single chip
– charge on a capacitor used to hold value
§ Semiconductor memory quickly replaced core in ’70s

[Photo of Intel 1103: Thomas Nguyen, CC-BY-SA]
9
One-Transistor Dynamic RAM [Dennard, IBM]

[Diagrams: 1-T DRAM cell, in which the word line gates an access transistor connecting the bit line to a storage capacitor (built as FET gate, trench, or stack) referenced to VREF; modern cross-section with TiN top electrode (VREF), Ta2O5 dielectric, and poly-W bottom electrode.]
10
Modern DRAM Structure

[Samsung, sub-70nm DRAM, 2004]


11
DRAM Architecture

[Diagram: a 2^N × 2^M cell array; an N-bit row address drives a row decoder that enables one of 2^N word lines, and an M-bit column address drives the column decoder & sense amplifiers across the bit lines to deliver data D; each memory cell stores one bit.]

§ Bits stored in 2-dimensional arrays on chip
§ Modern chips have around 4-8 logical banks on each chip
– each logical bank physically implemented as many smaller arrays
12
DRAM Packaging (Laptops/Desktops/Servers)

[Diagram: DRAM chip with ~7 clock and control signals, ~12 address lines multiplexed for row/column address, and a data bus of 4b, 8b, 16b, or 32b.]

§ DIMM (Dual Inline Memory Module) contains multiple chips with clock/control/address signals connected in parallel (sometimes need buffers to drive signals to all chips)
§ Data pins work together to return wide word (e.g., 64-bit data bus using 16×4-bit parts)
13
DRAM Packaging, Apple M1

[Photo: two DRAM chips on same package as system SoC]

• 128b data bus, running at 4.2Gb/s per pin
• 68GB/s bandwidth
14
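(A quick sanity check on those numbers, my arithmetic rather than the slide’s: 128 bits × 4.2 Gb/s per pin = 537.6 Gb/s, and 537.6 / 8 ≈ 67 GB/s, consistent with the quoted ~68 GB/s.)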
DRAM Operation
§ Three steps in read/write access to a given bank
§ Row access (RAS)
– decode row address, enable addressed row (often multiple Kb in row)
– bitlines share charge with storage cell
– small change in voltage detected by sense amplifiers which latch whole row of bits
– sense amplifiers drive bitlines full rail to recharge storage cells
§ Column access (CAS)
– decode column address to select small number of sense amplifier latches (4, 8, 16, or 32 bits depending on DRAM package)
– on read, send latched bits out to chip pins
– on write, change sense amplifier latches which then charge storage cells to required value
– can perform multiple column accesses on same row without another row access (burst mode)
§ Precharge
– charges bit lines to known value, required before next row access
§ Each step has a latency of around 15-20ns in modern DRAMs
§ Various DRAM standards (DDR, RDRAM) have different ways of encoding the signals for transmission to the DRAM, but all share same core architecture

15
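To make the three steps concrete, here is a minimal C sketch (mine, not from the lecture) of a controller that splits an address into row/bank/column fields and counts per-step latency, including the burst-mode case where a second access to an already-open row needs only a column access. The field widths and the flat 15ns per step are illustrative assumptions.

#include <stdio.h>
#include <stdint.h>

#define COL_BITS  10   /* assumed: 1024 columns per row */
#define ROW_BITS  14   /* assumed: 16384 rows per bank  */
#define BANK_BITS 3    /* assumed: 8 logical banks      */
#define STEP_NS   15   /* per-step latency (slides: ~15-20ns) */

static int open_row[1 << BANK_BITS]; /* row latched in each bank's sense amps, -1 = none */

/* Latency in ns for one access, modeling row-buffer (burst) hits. */
static int access_ns(uint64_t addr) {
    uint64_t bank = (addr >> COL_BITS) & ((1u << BANK_BITS) - 1);
    uint64_t row  = (addr >> (COL_BITS + BANK_BITS)) & ((1u << ROW_BITS) - 1);
    int ns = 0;
    if (open_row[bank] != (int)row) {     /* row miss */
        if (open_row[bank] != -1)
            ns += STEP_NS;                /* precharge: close the old row */
        ns += STEP_NS;                    /* row access (RAS): latch the row */
        open_row[bank] = (int)row;
    }
    ns += STEP_NS;                        /* column access (CAS) */
    return ns;
}

int main(void) {
    for (int b = 0; b < (1 << BANK_BITS); b++) open_row[b] = -1;
    printf("first access:     %2d ns\n", access_ns(0x12345)); /* RAS+CAS = 30 */
    printf("same row (burst): %2d ns\n", access_ns(0x12346)); /* CAS only = 15 */
    printf("different row:    %2d ns\n", access_ns(0x92345)); /* pre+RAS+CAS = 45 */
    return 0;
}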
Double-Data Rate (DDR2) DRAM

[Timing diagram: 200MHz clock; Row, Column, Precharge, and Row’ commands; data is transferred on both clock edges, giving a 400Mb/s per-pin data rate. Micron, 256Mb DDR2 SDRAM datasheet]
16
Computer Architecture Terminology
Latency (in seconds or cycles): Time taken for a single operation from start to finish (initiation to useable result)
Bandwidth (in operations/second or operations/cycle): Rate at which operations can be performed
Occupancy (in seconds or cycles): Time during which the unit is blocked on an operation (structural hazard)
Note, for a single functional unit:
§ Occupancy can be much less than latency (how?)
§ Occupancy can be greater than latency (how?)
§ Bandwidth can be greater than 1/latency (how?)
§ Bandwidth can be less than 1/latency (how?)

17
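For intuition on those questions (my examples, not the slide’s): a pipelined multiplier with a 3-cycle latency that accepts a new operation every cycle has an occupancy of only 1 cycle, and a bandwidth of 1 op/cycle, greater than 1/latency = 1/3 op/cycle. Conversely, a unit that must recover after each result before accepting the next can have occupancy greater than latency and bandwidth below 1/latency.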
CS152 Administrivia
§ HW1 released
– Due Today
§ Lab1 released
– Due Feb 09
§ Lab reports must be readable English summaries – not
dumps of log files!!!!!!
– We will reward good reports, and penalize undecipherable reports
– Page limit (check lab spec/Ed)
§ Lecture Ed thread
– One thread per lecture
– Post your questions following the format:
• [Slide #] Your question
– The staff team will address and clarify the questions asynchronously.
§ Tell us what you think
– http://tinyurl.com/cs152feedback
CS252 18
CS252 Administrivia
§ CS252 Readings on
– https://ucb-cs252-sp23.hotcrp.com/u/0/
– Use hotcrp to upload reviews before Wednesday:
• Write one paragraph on main content of paper including good/bad points of paper
• Also, answer/ask 1-3 questions about paper for discussion
• First two papers: “360 Architecture”, “VAX11-780”
– 2-3pm Wednesday, Soda 606/Zoom
§ CS252 Project Timeline
– Proposal Wed Feb 22
– Use 252A GSIs (Abe and Prashanth) and my OHs to get feedback.

CS252 19
CPU-Memory Bottleneck

[Diagram: CPU ↔ Memory]

Performance of high-speed computers is usually limited by memory bandwidth & latency
§ Latency (time for a single access)
– Memory access time >> Processor cycle time
§ Bandwidth (number of accesses per unit time)
– if fraction m of instructions access memory ⇒ 1+m memory references / instruction (one fetch per instruction plus m data accesses; see the example below)
§ Also, Occupancy (time a memory bank is busy with one request)
20
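Worked example (mine, not the slide’s): with m = 0.3, i.e., 30% of instructions are loads or stores, the processor makes 1 + 0.3 = 1.3 memory references per instruction, since every instruction must itself be fetched from memory.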
Processor-DRAM Gap (latency)

[Plot, 1980-2000, log scale: CPU performance improves ~60%/year (µProc) while DRAM improves only ~7%/year, so the processor-memory performance gap grows ~50%/yr.]

Four-issue 3GHz superscalar accessing 100ns DRAM could execute 1,200 instructions during time for one memory access!

21
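(Checking that arithmetic: 3×10^9 cycles/s × 4 instructions/cycle × 100×10^-9 s = 1,200 instruction slots per DRAM access.)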
Physical Size Affects Latency

[Diagram: a CPU beside a small memory vs. a CPU beside a big memory, where paths to the far corners are much longer.]

§ Signals have further to travel
§ Fan out to more locations
22
Memory Hierarchy

[Diagram: CPU ↔ (A) small, fast memory (RF, SRAM), which holds frequently used data ↔ (B) big, slow memory (DRAM).]

• capacity: Register << SRAM << DRAM
• latency: Register << SRAM << DRAM
• bandwidth: on-chip >> off-chip
On a data access:
if data ∈ fast memory ⇒ low latency access (SRAM)
if data ∉ fast memory ⇒ high latency access (DRAM)
24
Management of Memory Hierarchy
§ Small/fast storage, e.g., registers
– Address usually specified in instruction
– Generally implemented directly as a register file
• but hardware might do things behind software’s back, e.g., stack management, register renaming
§ Larger/slower storage, e.g., main memory
– Address usually computed from values in register
– Generally implemented as a hardware-managed cache hierarchy (hardware decides what is kept in fast memory)
• but software may provide “hints”, e.g., prefetch
25
Real Memory Reference Patterns

[Plot: memory address (one dot per access) vs. time. Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971)]
26
Typical Memory Reference Patterns

[Plot: address vs. time across n loop iterations. Instruction fetches ramp repeatedly through the loop body; stack accesses grow and shrink around subroutine call and return, including argument accesses; data accesses show diagonal vector-access sweeps and flat scalar accesses.]
27
Two predictable properties of memory references:
§ Temporal Locality: If a location is referenced it is likely to be referenced again in the near future.
§ Spatial Locality: If a location is referenced it is likely that locations near it will be referenced in the near future.
28
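As a small illustration (my example, not the slides’), both properties show up in the most ordinary of loops:

#include <stdio.h>

#define N 1024

int main(void) {
    static int a[N];   /* zero-initialized array */
    long sum = 0;
    /* Temporal locality: the loop's instructions (and sum) are
     * referenced again on every iteration.
     * Spatial locality: a[0], a[1], ... are adjacent addresses, so
     * several elements fall within each fetched cache block. */
    for (int i = 0; i < N; i++)
        sum += a[i];
    printf("sum = %ld\n", sum);
    return 0;
}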
Memory Reference Patterns

[Plot: the same address-vs.-time data, annotated to mark temporal locality (repeated references to the same addresses over time) and spatial locality (references clustered at neighboring addresses). Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971)]

29
Caches exploit both types of predictability:
§ Exploit temporal locality by remembering the contents of recently accessed locations.
§ Exploit spatial locality by fetching blocks of data around recently accessed locations.
30
Inside a Cache

[Diagram: Processor ↔ CACHE ↔ Main Memory, linked by address and data lines. The cache holds copies of main-memory locations (e.g., the bytes at addresses 100 and 304), organized as lines that pair an address tag with a data block (e.g., tag 416 for the block at address 6848).]
31
Cache Algorithm (Read)
Look at Processor Address, search cache tags to find match. Then either:

§ Found in cache, a.k.a. HIT
– Return copy of data from cache
§ Not in cache, a.k.a. MISS
– Read block of data from Main Memory
– Wait …
– Return data to processor and update cache

Q: Which line do we replace?

32
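A minimal sketch of this read algorithm in C (my illustration, assuming a direct-mapped cache of 64 lines × 16-byte blocks and a toy array standing in for main memory):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define LINES 64   /* assumed number of cache lines */
#define BLOCK 16   /* assumed block size in bytes   */

struct line { int valid; uint32_t tag; uint8_t data[BLOCK]; };
static struct line cache[LINES];
static uint8_t memory[1 << 20];          /* toy main memory */

uint8_t cache_read_byte(uint32_t addr) {
    uint32_t offset = addr % BLOCK;
    uint32_t index  = (addr / BLOCK) % LINES;
    uint32_t tag    = addr / (BLOCK * LINES);
    struct line *l  = &cache[index];

    if (l->valid && l->tag == tag)       /* HIT: return copy from cache */
        return l->data[offset];

    /* MISS: read block from main memory, wait..., update cache.
     * Direct-mapped, so "which line do we replace?" has exactly one
     * answer: the line this index maps to. */
    memcpy(l->data, &memory[addr - offset], BLOCK);
    l->valid = 1;
    l->tag   = tag;
    return l->data[offset];
}

int main(void) {
    memory[100] = 42;
    printf("%d\n", cache_read_byte(100)); /* MISS, fills the line */
    printf("%d\n", cache_read_byte(100)); /* HIT */
    return 0;
}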
Placement Policy

[Diagram: memory blocks numbered 0-31 above a cache of 8 blocks, viewed three ways: fully associative, 2-way set associative (4 sets), and direct mapped.]

e.g., block 12 can be placed:
• Fully Associative: anywhere
• (2-way) Set Associative: anywhere in set 0 (12 mod 4)
• Direct Mapped: only into block 4 (12 mod 8)
33
Direct-Mapped Cache

[Diagram: the address splits into Tag (t bits), Index (k bits), and Block Offset (b bits). The index selects one of 2^k lines, each holding a valid bit V, a tag, and a data block; the stored tag is compared (=) with the address tag, a match raises HIT, and the offset selects the data word or byte.]
34
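In C, the t/k/b field extraction from the figure looks like this (a sketch assuming a 32-bit address; k and b are the index and offset widths):

#include <stdint.h>
#include <stdio.h>

struct fields { uint32_t tag, index, offset; };

/* Split an address into Tag | Index | Block Offset, as in the figure. */
static struct fields decode(uint32_t addr, unsigned k, unsigned b) {
    struct fields f;
    f.offset = addr & ((1u << b) - 1);         /* low b bits       */
    f.index  = (addr >> b) & ((1u << k) - 1);  /* next k bits      */
    f.tag    = addr >> (k + b);                /* remaining t bits */
    return f;
}

int main(void) {
    /* With 2^6 = 64 lines of 2^4 = 16 bytes (geometry assumed, not
     * the slide's), address 6848 decodes to tag 6, index 44, offset 0. */
    struct fields f = decode(6848, 6, 4);
    printf("tag=%u index=%u offset=%u\n", f.tag, f.index, f.offset);
    return 0;
}

A line hits when cache[f.index].valid && cache[f.index].tag == f.tag.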
Direct Map Address Selection
higher-order vs. lower-order address bits

[Diagram: same structure as the previous slide, but the Index (k bits) is taken from the higher-order end of the address and the Tag (t bits) from the lower-order bits, with the Block Offset (b bits) unchanged.]
35
2-Way Set-Associative Cache

[Diagram: the address splits into Tag (t bits), Index (k bits), and Block Offset (b bits). The index selects one set containing two ways, each with a valid bit V, a tag, and a data block; both stored tags are compared (=) in parallel, a match in either way raises HIT, and the offset selects the data word or byte.]
36
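A matching lookup sketch (mine; tracking an LRU bit per set is one common answer to the earlier replacement question):

#include <stdint.h>

#define SETS 32                    /* assumed: 2^k sets */

struct way { int valid; uint32_t tag; };
struct set { struct way w[2]; int lru; };   /* lru = way to evict next */
static struct set cache[SETS];

/* Probe both ways of the indexed set; hardware compares both tags
 * in parallel, modeled sequentially here. */
static int lookup(uint32_t tag, uint32_t index) {
    struct set *s = &cache[index % SETS];
    for (int i = 0; i < 2; i++) {
        if (s->w[i].valid && s->w[i].tag == tag) {
            s->lru = 1 - i;        /* the other way becomes the victim */
            return 1;              /* HIT */
        }
    }
    return 0;                      /* MISS: a refill would evict s->w[s->lru] */
}

int main(void) {
    cache[3].w[1] = (struct way){ .valid = 1, .tag = 7 };
    return lookup(7, 3) ? 0 : 1;   /* expect HIT */
}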
Fully Associative Cache

[Diagram: no index field; the address is just Tag (t bits) and Block Offset (b bits). Every line’s stored tag is compared (=) with the address tag in parallel, any match raises HIT, and the offset selects the word or byte from that line’s data block.]
37
Acknowledgements
§ This course is partly inspired by previous MIT 6.823 and
Berkeley CS252 computer architecture courses created by
my collaborators and colleagues:
– Arvind (MIT)
– Krste Asanovic (MIT/UCB)
– Joel Emer (Intel/MIT)
– James Hoe (CMU)
– John Kubiatowicz (UCB)
– David Patterson (UCB)

38
