DTCO/STCO in the Era of Vertical Integration
YK Chong
Fellow, Arm
Acknowledgement: Thomas Wong, Leah Schuth, Kiran Burli, Ron Preston,
Vivek Asthana, Sriram Thyagarajan, Andy Chen, Rahul Mathur
Outline
• Introduction
• DTCO
• Methodology
• Foundation IPs (Logic, SRAM)
• DTCO examples
• Backside power delivery network (BSPDN)
• Performance Gains
• STCO
o Overview & Knobs
o Compute RTL & Physical IP co-optimization
o 3D-IC, multi-die, and System Partitioning
• Conclusion
User Experiences Driving Design Complexity
• More processing & storage of private & confidential data
• System-wide security
• Gaming moving from traditional graphics to ML & ray tracing
• Multi-camera use cases with AI
• Heterogeneous ML
• Large displays, foldables, multiple concurrent apps
• Generative AI enabling new, natural conversation-based interactions with devices
Product Development Timelines are Compressed
(Figure: OEM development timeline; devices launch annually, and the device & SW launch dates are fixed.)
Form Factor & Cost Drive Design Choices
• CPU/GPU die sizes keep increasing to gain performance
• Batteries are growing to improve Days-of-Use, but the package cannot grow
• Increasing wafer costs and turnaround times make mistakes expensive
• Cost drivers: SRAM content, yield, package size, die size, process complexity
Source: semiengineering.com/big-trouble-at-3nm
Technology Scaling (1)
(Figure: technology node vs. year, 2005-2027. Happy scaling era: 90nm-28nm; less happy scaling era: 20nm-3nm (now). EUV arrived at 7-5nm.)
Technology Scaling: DTCO (2)
(Figure: the same node-vs-year trend, annotated with DTCO levers in the less happy scaling era: scaling boosters, track height reduction, special constructs.)
Technology Scaling: DTCO + STCO (3)
(Figure: cost scaling, plotted as # of transistors/$ on a log scale vs. year, 2005-2027. Annotations: 20nm: cost/transistor stalled; 16/14nm: FinFET arrived; 10-7nm: multi-patterning costs; 7-5nm: EUV arrived; 3nm: double whammy; 2nm and beyond: GAA, down to 1.4nm. At 3nm and below, scaling is challenging and more than one metric is broken: performance, power, cost. STCO levers: buried power rails, functional backside, sequential 3D, and more.)
DTCO — Definition
• Moore's Law is slowing
• DTCO is Design and Technology Co-Optimization to improve PPAC (performance, power, area, cost)
• Deliver the best technology within the given timeline
o Collaborate between design and process engineers to achieve PPA and high-volume yield
o Start from the early v0.01 PDK to maximize benefits
o Propose solutions for improvement instead of simply telling the foundry to make it better; solutions should be driven by engineers who understand the process challenges
o Measure in a way that the foundry can respond to
o Break results down into components such as wire, via, gate, and SRAM delays
DTCO - Bias
DTCO — Component Based Methodology
First DTCO Feedback
• Device (Idsat/Ieff, Ioff, capacitance)
o Start with what didn't scale
o Lower Idsat is acceptable if device capacitance is reduced
o Device FOMs such as ring oscillators and critical paths
• Start with logic density and power
o Cell architecture; new cells with different trade-offs
o Area projection at the block level
• Wire and via (RC and metal stack)
o ~3x worse M0 resistance vs. logic area scaling
o Wafer cost increases for each additional metal layer
• SRAM
o Bitcell vs. macro area (including redundancy, Vmin assist, white space)
o Performance/Watt: CV/Iread, dynamic and leakage power (see the sketch below)
(Figure: die plot highlighting logic, logic edge, SRAM edge, and SRAM macro regions.)
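To make the Performance/Watt bullet concrete, here is a minimal Python sketch of the figure of merit it names, treating access delay as roughly CV/Iread and dynamic energy as roughly CV^2 per access; all numeric values are placeholders for illustration, not foundry data.

```python
# Minimal sketch of the Performance/Watt figure of merit named above:
# access delay ~ C*V/Iread, dynamic energy ~ C*V^2 per access, static power
# ~ Ileak*V. All numbers below are placeholders, not foundry data.
def sram_fom(c_bitline_f, vdd_v, iread_a, ileak_a):
    delay_s = c_bitline_f * vdd_v / iread_a      # first-order read delay
    dyn_energy_j = c_bitline_f * vdd_v ** 2      # per-access dynamic energy
    leak_power_w = ileak_a * vdd_v               # leakage power
    return delay_s, dyn_energy_j, leak_power_w

# Hypothetical values: 2 fF effective bitline cap, 0.75 V, 25 uA read current, 1 nA leakage
delay, e_dyn, p_leak = sram_fom(2e-15, 0.75, 25e-6, 1e-9)
print(f"delay ~{delay*1e12:.0f} ps, dynamic ~{e_dyn*1e15:.2f} fJ/access, leakage ~{p_leak*1e9:.2f} nW")
```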
DTCO — Component Based Methodology
Second DTCO Feedback
• Early synthesis and place-and-route (P&R) CPU benchmarks
o Arm Cortex-M0 core: logic only (turnaround in 2-4 hours)
   Debug the overall P&R EDA flow setup and techfile
o Arm Cortex-A9 core: ~120K instances, logic only (turnaround in 8-12 hours)
   Quicker feedback on logic PPA; study PDN, NDR, and via-pillar choices
o Arm Cortex-A75 D-engine: ~500K instances, logic only (turnaround 1-2 days)
   Early PPA sweeps compared against previous technologies
   Resolve major decisions such as PDN choice and library content
o Arm Cortex-X core: logic + SRAM (turnaround >1 week)
   Feedback on whether logic, SRAM, wire RC, or the metal stack needs to be improved
Logic Cell Architecture vs PPA
• Max performance at the min-area point
o Cells are all min area, so pin density is higher
o Timing matters, so P&R must balance both congestion and performance
• Area is limited by BEOL density, not by cell area
• Area and power are dominated by process decisions
(Figure: CPU area vs. performance curve marking the min-area point, the max-performance point, and max performance at the min-area point; cell layout detail showing POLY and CPODE.)
FinFET vs GAA Widths Trade-off
(Figure: FinFET vs. GAA width trade-off. Source: newsroom.lamresearch.com/FinFETs-Give-Way-to-Gate-All-Around)

Via pillar resistance for an INV_X10 driver (per-via resistance in Ω: V0=50, V1=40, V2=50, V3=30, V4=30, V5=30):

Via configuration        V0   V1   V2   V3   V4   V5   Req to M4 (Ω)   Req to M6 (Ω)
Single vias               1    1    1    1    1    1       170             230
Via bars on V0-V2        bar  bar  bar   1    1    1       123             183
Via bars, doubled V3-V5  bar  bar  bar   2    2    2       108             138
Via pillar                4    2    2    2    2    2        73             103
Larger via pillar         8    4    4    4    4    2        36              59

Via pillar improvement: 12ps (original: 43ps).
Source: techdesignforums.com/blog/2017/06/21/synopsys-arm-tsmc-10nm-dac-panels/
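As a quick check of the table, here is a minimal Python sketch of the equivalent-resistance arithmetic, assuming the vias at each level act in parallel and the levels add in series; it reproduces the single-via and via-pillar rows (the "bar" rows are skipped because the slide gives no per-bar resistance).

```python
# Minimal sketch: equivalent resistance of a via stack, assuming vias at each
# level are in parallel and the levels add in series. Per-via resistances (ohms)
# are taken from the table above.
VIA_R = [50, 40, 50, 30, 30, 30]          # V0..V5

def stack_resistance(via_counts, levels=6):
    """levels=4 stops after V3 (to M4); levels=6 goes through V5 (to M6)."""
    return sum(r / n for r, n in zip(VIA_R[:levels], via_counts[:levels]))

single     = [1, 1, 1, 1, 1, 1]
pillar     = [4, 2, 2, 2, 2, 2]
big_pillar = [8, 4, 4, 4, 4, 2]

for name, cfg in [("single", single), ("pillar", pillar), ("big pillar", big_pillar)]:
    print(f"{name}: to M4 ~{stack_resistance(cfg, 4):.1f} ohm, "
          f"to M6 ~{stack_resistance(cfg):.1f} ohm")
# -> 170/230, 72.5/102.5, 36.25/58.75, matching the table's rounded 170/230, 73/103, 36/59
```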
M1 Pitch
(Figure: 2 M1 tracks per 2 CPP vs. 3 M1 tracks per 2 CPP; WL-driver to core-array routing on M1/M3, showing "continue WL" and "WL jump" options.)
• GWL width/space/colors
o Use different widths for green and red GWLs to match the RC of the different colors
• M1/M3 WL widths and the transition from SRAM to periphery need to be logic-rule compliant
o Avoid jumps from the WL driver to the bitcell core-array
• BL width/space trades off resistance vs. capacitance
• Support legal BL jogs, or allow via pillars from core-array to periphery to mitigate V0/V1 resistance
SRAM DTCO
(Figure: SRAM macro configurations labeled 4C/4L, 4C/5L, 4C/6L, 4C/8L.)
IR-Drop in Advanced Technologies
(Figure: IR-drop heat maps for the same design at each node (7nm panel shown); legend: blue = low, green = OK, yellow = marginal, orange = high, red = very high.)
• Meeting IR drop requirements is increasingly difficult in advanced processes
• IR drop for the same design run at each technology’s Vnom and Fmax
• 5% higher IR-drop leads to either ~10% higher power or ~5% loss in performance
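A back-of-the-envelope check of the last bullet, assuming dynamic power scales roughly as V^2 and frequency roughly linearly with V (first-order assumptions, not stated on the slide):

```python
# Back-of-the-envelope check: compensating a 5% IR drop by raising the supply
# costs ~(1.05^2 - 1) ~ 10% dynamic power (P ~ V^2); accepting the lower
# effective voltage instead costs ~5% frequency (f roughly ~ V to first order).
ir_drop = 0.05
power_penalty = (1 + ir_drop) ** 2 - 1
perf_penalty = ir_drop
print(f"~{power_penalty:.0%} more power or ~{perf_penalty:.0%} lower performance")
```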
Power Rails Scaling
• Most of the PDN resistance and IR drop comes from the M0 power rails
• Resistance ∝ Length / (Width × Height)
o Cell height reductions mean narrower power rails (width ↓)
o Min M0 pitch reduction: the wire width sets the thickness of the M0/M1 layer (height ↓)
o Length reduction: CPP scaling is only ~6% per generation and cannot keep up with the area reduction
• By 2nm, M0 power rails will contribute 50-100 Ω of resistance
• It will be challenging to keep scaling cell area while maintaining acceptable IR drop (a rough rail-drop sketch follows the figure below)
(Figure: std-cell power rail resistance and M0 cross-sections: ~20 Ω/µm at 7nm, ~50 Ω/µm at 5nm, ~150 Ω/µm at 3nm.)
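A rough rail-drop sketch using the R ∝ L/(W·H) relation above; only the per-µm resistances come from the slide, while the tap spacing, load current, and both-ends feed are illustrative assumptions.

```python
# Rough rail-drop sketch. Only the ohm-per-um values come from the slide; the
# tap spacing and load current below are illustrative assumptions.
RAIL_OHM_PER_UM = {"7nm": 20, "5nm": 50, "3nm": 150}

def peak_rail_drop(node, tap_spacing_um=20.0, load_current_a=0.1e-3):
    """Uniformly distributed load on a rail fed from taps at both ends:
    peak drop at the midpoint = I_total * r * L / 8."""
    r = RAIL_OHM_PER_UM[node]
    return load_current_a * r * tap_spacing_um / 8.0

for node in RAIL_OHM_PER_UM:
    print(f"{node}: ~{peak_rail_drop(node) * 1e3:.1f} mV")
# Same load and tap spacing: the 3nm rail drops ~7.5x more than the 7nm rail,
# which is why buried rails and backside power delivery become attractive.
```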
IR-Drop with BPR vs Conventional
BPR – Buried power rail
BSPDN — IP Compatibility
• Memory: SRAM bitcell modification
• Logic: new logic cells
• Metal stack: metal stack optimization
• Performance: >200MHz, >1% CPU IPC, 6-8% boost in CPU
(Figure callouts: VDD, signal, "a must-have RTL feature".)
System Technology Co-optimization (STCO)
• Decades of process pitch scaling progress with little interaction between system architecture, design, and technology
• The motivation for STCO is to enable higher levels of integrated functionality at lower cost
• STCO requires cross-disciplinary collaboration
• 3D stacking may be required for system integration; options like partitioning will introduce dependencies between tiers
• STCO starts with workload analysis to assess and optimize the technology to enable higher levels of performance and functionality
Source: www.newelectronics.co.uk/electronics-technology/a-question-of-scale/233822/
STCO
• STCO means different things to different teams
• Some decisions require changes to the RTL architecture
• A 3D vs. 2D decision could redefine the RTL hierarchy
• Should we keep the SLC in the top die or move it to the lower die?
• Which interfaces can tolerate the extra cross-die sign-off overhead without impacting system performance?
• How many crossing signals can meet the micro-bump or TSV pitch? (see the sizing sketch after this list)
• Start with redefining the RTL, or get nothing
• Collaboration with the RTL team needs to start much earlier; it could be a multi-year development schedule
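For the crossing-signal question, a rough sizing sketch assuming a square micro-bump/TSV array over a given interface area; the pitches, area, and power/ground fraction are illustrative assumptions, not values from the talk.

```python
# Rough sizing: how many die-to-die crossing signals fit in a given interface
# area for a square bump/TSV array. Pitches, area, and the power/ground share
# are illustrative assumptions.
def max_crossings(area_mm2, pitch_um, signal_fraction=0.75):
    """signal_fraction reserves the remaining bumps for power and ground."""
    bumps = area_mm2 * 1e6 / (pitch_um ** 2)   # (um^2 per mm^2) / pitch^2
    return int(bumps * signal_fraction)

for pitch_um in (40, 25, 9):   # e.g. micro-bump, fine-pitch micro-bump, hybrid bond
    print(f"{pitch_um} um pitch: ~{max_crossings(1.0, pitch_um):,} signals per mm^2")
```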
What Knobs Do We Have?
Compute & Physical IP Co-optimization
Motivation for 3D System Integration
Source: wikichip.org/wiki/chiplet
3D Block-Level Partitioning: Optimize PPAC
(Figure: example partitioning of a system across nodes: high performance (N node); efficiency over raw performance (N node); high efficiency (N node); many analog functions that don't want aggressive scaling (N-1 node); yield- and leakage-limited blocks (N-1 node).)
Different blocks favor different technologies. The challenges are less in 3DIC itself and more in costs, thermal management, design tools, and supply chain issues.
Source: Greg Yeric, Arm TechCon 2016
Multi-die Everywhere!
The Era of Multi-die Designs
(Figure: multi-die products combining SoC1/SoC2/SoC3 with Chiplet1/Chiplet2/Chiplet3 and LPDDR. Source: nextplatform.com/2022/01/04/inside-amazons-graviton3-arm-server-processor/)
TSV Columns for 3D
(Figure: TSV array delivering power from die1 to die2; C4 bumps deliver power to die1.)
TSV-Integrated RAM for 3D
• RAM redesigned to have keep-out zones (KOZ) for TSVs
• Original SRAM on die1
(Figure: SRAM macro layout with cell arrays, WLs, row decoders, CK, colmux8/colmux8_b blocks, and the TSV and C4 bump locations.)
Traditional Monolithic Premium Smartphone SoC
(Figure: monolithic SoC block diagram: CPU cluster with big/mid/little CPUs, L1$/L2$ per core, and a shared L3$ in the DSU; GPU cluster with shader cores, LSC, and GPU top L2$; modem, ISPs, DSPs, I/O, sensors, and other devices; interconnect (NCI); system level cache (SLC); DRAM. The adjacent chiplet view splits this into a compute die (N process node) and a companion die holding the memory system (N-1 process node), connected die-to-die over UCIe. UCIe = Universal Chiplet Interconnect Express.)
Potential Future Design 2 (3D)
(Figure: 3D-stacked variant: CPU cluster with big/mid/little CPUs and L1$/L2$, GPU cluster with shader cores, LSC, and GPU top L2$, a neural engine, modem, ISPs, DSPs, I/O, sensors, and other devices, with the companion die (N-1 process node) stacked in 3D.)