
DTCO/STCO in the Era of Vertical Integration
YK Chong
Fellow, Arm
Acknowledgement: Thomas Wong, Leah Schuth, Kiran Burli, Ron Preston, Vivek Asthana, Sriram Thyagarajan, Andy Chen, Rahul Mathur

1
Outline
• Introduction
• DTCO
  • Methodology
  • Foundation IPs (Logic, SRAM)
  • DTCO examples
  • Backside power delivery network (BSPDN)
  • Performance gains
• STCO
  • Overview & knobs
  • Compute RTL & physical IP co-optimization
  • 3D-IC, multi-die, and system partitioning
• Conclusion

2
User Experiences Driving Design Complexity
• More processing & storage of private & confidential data → system-wide security
• Gaming moving from traditional graphics to ML & ray tracing
• Multi-camera use cases w/ heterogeneous AI/ML
• Large displays, foldables → multiple concurrent apps
• Generative AI is enabling new, natural conversation-based interactions with devices

3
Product Development Timelines are Compressed
• OEM: device & SW development; launch annually, so the schedule is FIXED
• Silicon partners: system integration, verification, implementation, post-silicon; complexity in design requires more time
• Foundry: tape-out to fab-out takes longer
• IP providers: IP development; more & more IP to develop & deliver

4
Form Factor & Cost Drive Design Choices
• CPU/GPU die sizes keep increasing to gain performance
• Batteries are growing to improve Days-of-Use, but the package cannot grow
• Increasing wafer costs and turnaround times make mistakes expensive
• Cost drivers: SRAM content, yield, package size, die size, process complexity

5
Source: semiengineering.com/big-trouble-at-3nm
Technology Scaling (1)
[Chart: cost (# of transistors per $), log scale, vs year (2005-2027), for nodes from 90nm down to 1.4nm]
• Happy scaling era: 90nm through 28nm
• Less happy scaling era (now): 20nm and below
  • 20nm: first sign of trouble
  • 16/14nm: FinFET arrived
  • 10-7nm: multi-patterning costs
  • 7-5nm: EUV arrived

6
Technology Scaling: DTCO (2)
[Chart: cost (# of transistors per $), log scale, vs year (2005-2027), for nodes from 90nm down to 1.4nm]
• Happy scaling era: 90nm through 28nm
• Less happy scaling era (now): 20nm and below
  • 20nm: cost/transistor stalled
  • 16/14nm: FinFET arrived
  • 10-7nm: multi-patterning costs
  • 7-5nm: EUV arrived
• DTCO scaling boosters: track height reduction, special constructs

7
Technology Scaling: DTCO + STCO (3)
[Chart: cost (# of transistors per $), log scale, vs year (2005-2027), for nodes from 90nm down to 1.4nm]
• 3nm and below: challenging scaling era; more than one metric is broken (performance, power, cost)
• 3nm: double whammy; 2nm and beyond: GAA
• DTCO scaling boosters: track height reduction, special constructs
• STCO: buried power rails, functional backside, sequential 3D, ...

8
DTCO — Definition
• Moore's Law is slowing
• DTCO is Design and Technology Co-Optimization to improve PPAC
• Goal: the best technology in the given timeline
  o Collaborate between design and process engineers to achieve PPA and high-volume yield
  o Start from the early v0.01 PDK to maximize benefits
  o Propose solutions for improvement instead of telling the foundry to "make it better"; solutions should be driven by engineers who recognize the process challenges
  o Measure in a way the foundry can respond to
  o Break the analysis down into components like wire, via, gate, and SRAM delays
• This short course does not cover DTCO for analog or IO

9
DTCO - Bias
• DTCO bias: prejudice in favor of one thing
• How do we detect and avoid it?
  • Benchmarks should be done fairly for the targeted applications
  • Some DTCO feedback might conflict; we need to find the best trade-off
    • SRAM: yield vs Vmin vs area scaling
    • Logic: cell height vs transistor width vs wire pitch vs wire/via resistance scaling
    • SoC: power efficiency vs Fmax vs area scaling

10
DTCO — Component Based Methodology
First DTCO Feedback
• Device (Idsat/Ieff, Ioff, capacitance)
  • Start with what didn't scale
  • Lower Idsat is OK if device capacitance is reduced
  • Device FOMs like ring oscillators and critical paths
• Start with logic density and power
  • Cell architecture, new cells with different trade-offs
  • Area projection for the block level
• Wire and via (RC and metal stack)
  • 3x worse M0 resistance vs logic area scaling
  • Wafer cost increases for each additional metal layer
• SRAM
  • Bitcell vs macro area (including redundancy, Vmin assist, white space)
  • Performance/Watt: CV/Iread, dynamic, and leakage powers (see the sketch below)
[Figure: block floorplan showing Logic, Logic edge, SRAM edge, and SRAM macro regions]

11
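The device and SRAM figures of merit above can be made concrete with a small calculation. Below is a minimal sketch (Python) using made-up device numbers, not foundry data; it shows why lower Ieff is acceptable if capacitance drops proportionally (the CV/I delay metric), alongside the dynamic and leakage power terms.

```python
"""Component-level FOMs for device/SRAM DTCO feedback (all numbers are made-up examples)."""

def cv_over_i_ps(c_ff, v_volt, i_eff_ua):
    # Intrinsic delay metric CV/I: fF * V / uA -> ps
    return 1000.0 * c_ff * v_volt / i_eff_ua

def dynamic_power_uw(c_ff, v_volt, f_ghz, activity=0.1):
    # P_dyn = activity * C * V^2 * f: fF * V^2 * GHz -> uW
    return activity * c_ff * v_volt ** 2 * f_ghz

def leakage_power_nw(i_off_na, v_volt):
    return i_off_na * v_volt

# Two hypothetical device options: B trades ~10% lower Ieff for ~15% lower capacitance
options = {
    "device A": (1.00, 0.65, 100.0, 5.0),   # (C in fF, V, Ieff in uA, Ioff in nA)
    "device B": (0.85, 0.65,  90.0, 4.0),
}

for name, (c, v, i_eff, i_off) in options.items():
    print(f"{name}: CV/I = {cv_over_i_ps(c, v, i_eff):.2f} ps, "
          f"P_dyn = {dynamic_power_uw(c, v, 3.0):.3f} uW, "
          f"P_leak = {leakage_power_nw(i_off, v):.1f} nW")
```

With these example numbers, device B is both faster and lower power despite its lower drive current, which is exactly the "lower Idsat is OK" argument above.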
DTCO — Component Based Methodology
Second DTCO Feedback
• Early synthesis and place-and-route (P&R) CPU benchmarks
  • Arm Cortex-M0 core: logic only (turnaround in 2-4 hours)
    • Debug the overall P&R EDA flow setup and techfile
  • Arm Cortex-A9 core: ~120K, logic only (turnaround in 8-12 hours)
    • Quicker feedback on logic PPA; study PDN, NDR, and via-pillar choices
  • Arm Cortex-A75 D-engine: ~500K, logic only (turnaround 1-2 days)
    • Early PPA sweeps with comparison to previous technologies
    • Resolve major decisions like PDN choice and library content
  • Arm Cortex-X core: logic + SRAM (turnaround >1 week)
    • Feedback on whether logic, SRAM, wire RC, or the metal stack needs to be improved

12
Logic Cell Architecture vs PPA
Max performance @ min area point
• Cells are all min area, so pin density is higher
• Timing matters, so P&R must balance both congestion and performance
• Area is limited by BEOL density, not cell area
• Area and power are dominated by process decisions
Max performance point
• Cells are folded to achieve performance, so pin density is lower
• Primary challenges are placement optimization and minimizing resistance in the BEOL to maximize performance
[Chart: CPU area vs performance, marking the min area point, the max perf @ min area point, and the max perf point; the shape of the curve is strongly influenced by path distribution]

13
Big Core DTCO/STCO
Third DTCO Feedback
• Provide the physical IP, techfile, and P&R flow to the advanced physical implementation team for benchmarking
  • Advanced via pillars and NDRs are critical for optimizing long wire paths
• Co-optimize with the micro-architecture team to identify new memory types
• Study next-gen Client and Infrastructure cores to see what changes need to be made around the number of gates per pipeline stage and logic/SRAM bottlenecks
• Once we understand the timeline of process adoption (2nm or 1.4nm) and the multi-die requirements, we can investigate potential changes in micro-architecture to align with the foundry roadmap, or request that the foundry improve the process for next-gen CPU microarchitecture requirements

14
Nanosheet Alignment
• Outbound: better vdd/vss R, higher output RC, lower power
• Inbound: higher vdd/vss R, better output RC, lower LLE impact
• Midbound: medium vdd/vss R, medium output RC
LLE = Local Layout Effect

15
Logic Cell Architecture: Flexible Width vs Fixed Width
Nanosheet widths (flexible)
• Need to model LLE impact
• Support different beta ratios; better for performance
Fin width (fixed)
• No LLE impact
• Less device variation
• Ease of device SPICE modeling
[Figure: nanosheet layouts (POLY, CPODE) with width configurations 2/2, 3/3, 3/2, 2/3, and fixed-fin configurations 6P/4N, 4P/6N]

16
FinFET vs GAA Widths Trade-off
• FinFET, 2 fingers of 2 fins: fins are discrete; 2 fingers of 2 fins is equivalent to 1 finger of 4 fins
• GAA, 2 fingers of Size2: more input cap, slightly less performance, higher area, higher dynamic power
• GAA, 1 finger of Size4: less input cap, better performance, lower area, lower dynamic power
• 2 fingers of Size2 is less efficient than 1 finger of Size4
Source: https://newsroom.lamresearch.com/FinFETs-Give-Way-to-Gate-All-Around

17
Via Selection
As via resistance continues to increase in sub-3nm nodes, via pillars (VP) are required.
Many via types (self-aligned via, non-self-aligned via, ViaBar, ViaLrg) are available.
Proper selection of vias is critical to improve Fmax.

                  V0    V1    V2    V3    V4    V5    Equiv. R to M4    Equiv. R to M6
Via resistance    50    40    50    30    30    30
INV_X10            1     1     1     1     1     1         170               230
INV_X10          bar   bar   bar     1     1     1         123               183
INV_X10          bar   bar   bar     2     2     2         108               138
INV_X10            4     2     2     2     2     2          73               103
INV_X10            8     4     4     4     4     2          36                59

Note: These are arbitrary via resistances for discussion purposes.
Source: https://www.techdesignforums.com/blog/2017/06/21/synopsys-arm-tsmc-10nm-dac-panels/ (via pillar)

18
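One way to read the table is to treat each via level as a resistor in series, with N parallel cuts dividing that level's resistance by N. The sketch below approximately reproduces the equivalent-resistance columns under the added assumption that a ViaBar behaves like roughly 2/3 of a single-cut via; all resistances are the arbitrary discussion values from the table, not process data.

```python
"""Equivalent resistance of via stacks / via pillars (all values illustrative)."""

VIA_R = [50, 40, 50, 30, 30, 30]   # V0..V5 single-cut resistances (ohms) from the table
BAR_FACTOR = 0.66                  # assumption: a ViaBar ~ 2/3 of a single-cut via

def level_r(spec, r_single):
    """One via level: 'bar' for a ViaBar, or an integer count of parallel cuts."""
    if spec == "bar":
        return BAR_FACTOR * r_single
    return r_single / spec         # N parallel cuts divide the resistance by N

def equiv_r(stack, levels):
    """Series sum of the first `levels` via levels (V0 up to V{levels-1})."""
    return sum(level_r(s, r) for s, r in zip(stack[:levels], VIA_R[:levels]))

configs = {
    "single cut everywhere": [1, 1, 1, 1, 1, 1],
    "bar + single cut":      ["bar", "bar", "bar", 1, 1, 1],
    "bar + double cut":      ["bar", "bar", "bar", 2, 2, 2],
    "via pillar (4/2 cuts)": [4, 2, 2, 2, 2, 2],
    "via pillar (8/4 cuts)": [8, 4, 4, 4, 4, 2],
}

for name, stack in configs.items():
    print(f"{name:22s} to M4: {equiv_r(stack, 4):4.0f} ohm   to M6: {equiv_r(stack, 6):4.0f} ohm")
```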
Improper Via Selection Impact
• If we use improper vias and have extra jogs in routing, the effect shows up in the waveforms
• The signal does not reach full vdd/vss near the output driver
• By the time this signal reaches its destination, it has slowed down by 12ps (at the 50% point) and 43ps (at the 80% point)
[Waveforms: original vs improved, showing the 12ps and 43ps differences]

19
M1 Pitch
2 M1 tracks per 2 CPP (1 CPP M1 pitch)
• Smaller M1 resistance due to larger width => thickness increase (same aspect ratio)
• Lower SRAM WL resistance => performance ↑
• Lower M1 capacitance => better logic performance and lower power
• Lower mask/processing cost: fewer double-patterned layers
• Reduced stdcell layout/characterization complexity for 0-offset, M1-shift, M1-flip
3 M1 tracks per 2 CPP (2/3 CPP M1 pitch)
• Larger M1 resistance due to narrower width => thickness decrease (same aspect ratio)
• Higher SRAM WL resistance
• Larger M1 capacitance
• Longer run length and smaller spacing
• Higher mask/processing cost: double-patterned M1, M1-cut, and via masks
• More local routing resource; slightly better for low performance/area designs

20
M1 Overlay — More Pessimism for 2/3 CPP Pitch
[Figure: M1 wires at max overlay (setup PVT) vs a real implementation that skips 1 or 2 neighboring M1 tracks]

Normalized                      Fully populated    Skip M1 on one side    Skip M1 on both sides
Resistance    M1c1 (ohms/um)          1                  -12%                   -24%
              M1c2 (ohms/um)          1                  -30%                   -50%
Capacitance   M1c1 (fF/um)            1                  -12%                   -24%
              M1c2 (fF/um)            1                   -9%                   -18%

Due to the M1 color and M1 overlay assumptions used in logic characterization, significant inaccuracy and pessimism are introduced into the logic library (see the sketch below).

21
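To gauge how much pessimism the fully-populated, max-overlay characterization corner can add, here is a minimal sketch using the M1c2 deratings from the table and a simple distributed-RC (Elmore-style) wire-delay model; the delay model and the wire length are illustrative assumptions, not part of any real characterization flow.

```python
"""Rough estimate of characterization pessimism from M1 overlay/neighbor assumptions.
Deratings are the M1c2 values from the table; the distributed-RC (Elmore-style)
delay model and the 10 um wire length are illustrative assumptions."""

R_BASE, C_BASE = 1.0, 1.0   # normalized ohm/um and fF/um for fully populated M1

# (resistance reduction, capacitance reduction) vs the fully populated corner
scenarios = {
    "fully populated (library corner)": (0.00, 0.00),
    "skip M1 on one side":              (0.30, 0.09),
    "skip M1 on both sides":            (0.50, 0.18),
}

def wire_delay(r_per_um, c_per_um, length_um=10.0):
    # Distributed RC (Elmore) delay of a uniform wire, in normalized units
    return 0.5 * r_per_um * c_per_um * length_um ** 2

ref = wire_delay(R_BASE, C_BASE)
for name, (dr, dc) in scenarios.items():
    rel = wire_delay(R_BASE * (1.0 - dr), C_BASE * (1.0 - dc)) / ref
    print(f"{name:34s} relative M1 wire delay: {rel:.2f}")
```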
SRAM DTCO
• GWL width/space/colors
  • Use different widths for green and red GWLs to match the RC of the different colors
• M1/M3 WL width and the transition from SRAM to periphery need to be logic-rule compliant
  • Avoid jumps from the WL driver into the bitcell core-array
• BL width/space trades off resistance vs capacitance
• Support legal BL jogs, or allow via pillars from the core-array to the periphery to mitigate V0/V1 resistance
[Figure: 4 bitcells with GWL0-GWL3 over the core-array, M1/M3 WL routing from the WL driver, comparing a continuous WL with a WL jump]

22
SRAM DTCO
• At sub-5nm, SRAM area scaling primarily comes from periphery scaling
• Logic cell height continues to shrink, but SRAM cell height reduction has slowed
• Additional logic cells fit in the SRAM periphery (4 bitcells pitch-match with 4/5/6/8/9 logic cells)
• Compared to 16/14nm, SRAM periphery area has been reduced by 30-50%
[Figure: bitcell column with BL/nBL, and periphery pitch-match options 4C/4L, 4C/5L, 4C/6L, 4C/8L]

23
IR-Drop in Advanced Technologies
[Figure: IR-drop maps of the same design at 7nm, 5nm, and 3nm; blue = low, green = OK, yellow = marginal, orange = high, red = very high]
• Meeting IR-drop requirements is increasingly difficult in advanced processes
• IR drop is shown for the same design run at each technology's Vnom and Fmax
• 5% higher IR drop leads to either ~10% higher power or ~5% loss in performance (see the sketch below)
Source: Ron Preston et al., IMEC VLSI 2023

24
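The ~10% power / ~5% performance trade-off follows from first-order scaling. A back-of-envelope check, assuming dynamic power scales with V² and that delay varies roughly linearly with the lost voltage in this regime (both simplifications):

```python
"""Back-of-envelope check of the IR-drop trade-off quoted above (illustrative only)."""

extra_ir_drop = 0.05   # 5% more of the supply lost in the PDN

# Option 1: raise VDD to restore the voltage seen by the transistors.
# Dynamic power ~ C * V^2 * f, so power grows roughly with the square of the supply bump.
power_penalty = (1.0 + extra_ir_drop) ** 2 - 1.0
print(f"raise VDD to compensate: ~{power_penalty:.0%} higher dynamic power")   # ~10%

# Option 2: keep VDD fixed and accept the lower effective voltage at the cells.
# To first order, delay (and hence Fmax) tracks the lost voltage in this regime.
perf_penalty = extra_ir_drop
print(f"keep VDD fixed:          ~{perf_penalty:.0%} lower Fmax")              # ~5%
```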
Power Rails Scaling
• Most of the PDN resistance and IR drop comes from the M0 rails
  • Resistance ∝ Length / (Width × Height)
  • Cell height reductions → narrower power rails (width ↓)
  • Min-M0 pitch reduction → width sets the thickness of the M0/M1 layer (height ↓)
  • Length reduction → CPP scaling is ~6%/generation (can't keep up with the area reduction)
• By 2nm, M0 power rails will contribute 50-100 Ω of resistance
• It will be challenging to scale cell area and maintain acceptable IR drop (see the sketch below)
[Figure: std-cell power rail resistance of ~20 Ω/µm at 7nm, ~50 Ω/µm at 5nm, ~150 Ω/µm at 3nm, with std-cell M0 cross-sections]

25
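A minimal sketch of the R ∝ L/(W×T) relation above. The width, thickness, and effective-resistivity scaling factors are arbitrary illustrative values chosen so that the 7nm/5nm/3nm results land near the ~20/50/150 Ω/µm figures on the slide; they are not foundry data.

```python
"""M0 power-rail resistance scaling: R per um ~ rho_eff / (W * T).
All scaling factors below are illustrative assumptions, not foundry data."""

R_7NM_OHM_PER_UM = 20.0   # reference value from the slide (~20 ohm/um at 7nm)

# (relative rail width, relative thickness, relative effective resistivity) vs 7nm
nodes = {
    "7nm": (1.00, 1.00, 1.0),
    "5nm": (0.75, 0.85, 1.6),   # narrower/thinner rail, higher effective resistivity
    "3nm": (0.55, 0.65, 2.7),
}

for node, (w_rel, t_rel, rho_rel) in nodes.items():
    # Resistance per unit length rises as the cross-section (W*T) shrinks and as
    # scattering/barrier effects push up the effective resistivity.
    r = R_7NM_OHM_PER_UM * rho_rel / (w_rel * t_rel)
    print(f"{node}: ~{r:.0f} ohm/um")
```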
IR-Drop with BPR vs Conventional
BPR = Buried Power Rail
[Figure: IR-drop maps for a 3nm process with a conventional PDN at an 8 CPP stitch pitch vs BPR at 48, 32, and 24 CPP stitch pitches]
• Lower-resistance power rails with BPR allow fewer stitches/taps
• Designs can adjust the stitch pitch based upon the power requirements
• The expectation is that BPR will allow IR-drop solutions to extend beyond 2nm nodes
• Continuing increases in current density will require a smaller power-strapping pitch

26
Backside PDN (BSPDN)
• All major foundries have announced plans to implement BSPDN at around the 2nm node
• Improve SoC PDN IR drop and EM capability
  • Reduce the IR-drop overhead by 5%; this translates into ~5% higher frequency or ~10% reduction in power
  • Remove critical EM fails on lower-layer vias
• Free up frontside routing resources that were used by the PDN
  • Increase utilization => reduce block area
  • Reduced area can offset the structural wafer cost increases associated with BSPDN
  • Simplified P&R closure: fewer DRC violations and congestion issues
• Is BSPDN needed for mobile?

27
BSPDN — IP Compatibility
• DTCO with foundry partners can have significant benefits
  • Simplify the migration path from conventional PDN to BSPDN to avoid a full IP redesign
    • Convert wide VDD/VSS M0 tracks to extra signal tracks
    • Increase M0 pitch to reduce M0 RC and the number of EUV layers
  • BSPDN can potentially improve SRAM bitcell PPA
    • Possibly enabling smaller cell area or wider BL/WL for lower resistance
• Compatibility with 3D designs
  • Die thinning and wafer handling need to be understood
  • Power delivery across multiple dies is an open question
  • How do we deal with backside power for front-to-back 3D?
• Thermal issues need to be understood in both single-die and 3D configurations
  • How is heat removed, and what is the Si thickness in the BSPDN solution?
[Figure: conventional PDN vs BSPDN cell cross-sections showing VDD/VSS and signal tracks]

28
BSPDN Scheme and Metal Layers
• Which BSPDN scheme is better: VPR (Via Power Rail) or DBC (Direct Backside Connect)?
  • Increasing integration complexity vs PPA scaling
• Metal layer count and pitch choices are another area for DTCO collaboration
• For BSPDN, the minimum number of backside layers may be set by:
  • Pitch transition requirements, from the lowest-layer (BM0) pitch to the pitch needed for bonding
  • Thermal/mechanical issues, which can also set the minimum number of layers
• The number of frontside layers will be reduced to offset the cost of BSPDN, but DTCO is required to understand how to deal with critical wires like clocks and long signals
  • If wider-pitch layers (160, 320 pitch) are moved to the backside, we need to support wider NDRs in the remaining frontside layers or migrate these signals to the backside

29
CPU Critical Paths: BEOL vs FEOL
The plot takes 100K CPU critical paths:
• Plots the FEOL delay (orange dots) and BEOL delay (blue dots) for each path vs slack
• Values are cumulative for the entire critical path (from source to sink on a specific path)
Analyze the critical paths with negative slack:
• If a process has a high percentage of BEOL delay, find the root cause of the problem
• Check whether the FEOL delays on critical paths are reasonable (see the sketch below)

30
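A minimal sketch of this kind of analysis is shown below, assuming the paths have already been extracted from timing reports into (slack, cumulative FEOL delay, cumulative BEOL delay) tuples; the example data, the BEOL-fraction threshold, and the output file name are made up for illustration.

```python
"""Sketch of the BEOL-vs-FEOL critical-path analysis described above.
Assumes path data has already been extracted from timing reports; the values here are made up."""

import matplotlib.pyplot as plt

# Hypothetical paths: (slack, cumulative FEOL delay, cumulative BEOL delay) in ps
paths = [(-12.0, 310.0, 240.0), (-5.0, 290.0, 180.0), (3.0, 270.0, 150.0), (20.0, 220.0, 120.0)]

slack = [p[0] for p in paths]
feol = [p[1] for p in paths]
beol = [p[2] for p in paths]

# Scatter cumulative FEOL and BEOL delay against path slack
plt.scatter(slack, feol, s=8, color="orange", label="FEOL delay")
plt.scatter(slack, beol, s=8, color="blue", label="BEOL delay")
plt.xlabel("slack (ps)")
plt.ylabel("cumulative delay (ps)")
plt.legend()
plt.savefig("beol_vs_feol.png")

# Flag failing paths where wires dominate: candidates for metal-stack / via-pillar DTCO
for s, f, b in paths:
    if s < 0 and b / (f + b) > 0.45:   # 45% threshold is an arbitrary example
        print(f"slack {s:+.1f} ps: BEOL fraction {b / (f + b):.0%} -> investigate wire RC")
```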
DTCO Performance Gains
• Memory: SRAM bitcell modification, plus a must-have RTL feature → >200MHz and >1% CPU IPC
• Logic & metal stack: new logic cells and metal stack optimization → >200MHz
• Combined: 6-8% boost in CPU performance
IPC = Instructions Per Cycle
Optimize the EDA flow for the complex CPU and process:
• Hierarchical partitioning, H-tree and clock gating, pre-route vs post-route correlation
• Effective via pillars and NDRs to counter the high resistance of the process

31
STCO: System-Technology Co-Optimization
• Partition SoC-like systems into sub-systems or "chiplets"
• Each of these chiplets can be fabricated using the optimal process for that function
• All the chiplets are then reassembled using 2.5D/3D packaging
• Covers system architecture, partitioning, and workload
Source: Ann Kelleher's keynote at IEDM 2022

32
System Technology Co-optimization (STCO)
• Decades of process pitch scaling progressed with little interaction between system architecture, design, and technology
• The motivation for STCO is to enable higher levels of integrated functionality at lower cost
• STCO requires cross-disciplinary collaboration
• 3D stacking may be required for system integration; options like partitioning will introduce dependencies between tiers
• STCO starts with workload analysis to assess and optimize the technology to enable higher levels of performance and functionality
Source: www.newelectronics.co.uk/electronics-technology/a-question-of-scale/233822/

33
STCO
• STCO means different things to different teams
• Some decisions require changes to the RTL architecture
  • The 3D vs 2D decision could re-define the RTL hierarchy
  • Should we keep the SLC in the top die or move it to the lower die?
  • Which interfaces can tolerate the extra cross-die sign-off overhead without impacting system performance?
  • How many crossing signals can meet the micro-bump or TSV pitch?
• Start with redefining the RTL, or get nothing
  • Collaboration with the RTL team needs to start much earlier; it could be a multi-year development schedule

34
What Knobs Do We Have?
• Raw technology improvements: squeeze more out of process scaling; CPU micro-architecture & SoC architecture
• Rethink system architectures & workloads: OEM system-level design; optimize workloads
• Create workload-optimized custom silicon: custom CPU silicon; domain-specific accelerators
• Advanced packaging & EDA: advanced packaging; EDA/system co-design

35
Compute & Physical IP Co-optimization
• Closely coupled development of RTL and physical IP improves PPA
• Knowing where to invest for the best ROI on system performance
• Co-optimization starts 2-3 years before the final RTL release
LAC = Limited ACcess, EAC = Early ACcess

36
Custom IP Features from Co-Optimization
• High Bandwidth Instance (HBI), next-gen CPU: improves IPC; improves area by ~50% compared to single-port RAM
• Custom D-Data (4 copies of HBI), next-gen CPU: reduces wire/buffer delay by 10-15%; improves routing congestion; reduces D-Data area by ~15%

37
Custom D-Data
• Timing: 10-15% faster
• Area: ~15% smaller
• Reduced routing congestion and reduced DRC counts
[Figure: layouts of the HBI and the Custom D-Data]

38
Motivation for 3D System Integration
• The cost of yielding large dies continues to increase as we move to smaller process nodes
• A 360mm² monolithic die will have a yield of 15%, while a 4-chiplet design (each 99mm²) more than doubles the yield to 37% (see the sketch below)
[Figure: wafer maps of the monolithic die (yield = 15%) vs the 4-chiplet design (yield = 37%)]
Source: wikichip.org/wiki/chiplet

39
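The yield argument can be illustrated with a standard defect-limited yield model. The sketch below uses the common negative-binomial formula Y = (1 + A·D0/α)^(-α) with arbitrary example parameters; it shows the trend (smaller dies yield much better), not the exact model or defect density behind the slide's 15% and 37% figures.

```python
"""Defect-limited die yield vs die area (illustrative parameters only)."""

def die_yield(area_mm2, d0_per_mm2=0.005, alpha=2.0):
    """Negative-binomial yield model: Y = (1 + A*D0/alpha)^(-alpha)."""
    return (1.0 + area_mm2 * d0_per_mm2 / alpha) ** (-alpha)

# Compare one large monolithic die against the smaller chiplets it could be split into
for area_mm2 in (360.0, 180.0, 99.0):
    print(f"{area_mm2:5.0f} mm^2 die: yield ~ {die_yield(area_mm2):.0%}")

# Splitting a ~360 mm^2 SoC into four ~99 mm^2 chiplets means each defect kills far less
# silicon; with known-good-die testing before assembly, much more of the wafer ends up
# in shippable products (at the cost of packaging and die-to-die interface overheads).
```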
3D Block-Level Partitioning: Optimize PPAC
[Figure: SoC blocks mapped to different tiers and nodes: high performance (N node), efficiency over raw performance (N node), high efficiency (N node), many analog functions that don't want large scaling (N-1 node), yield and leakage (N-1 node)]
Different blocks favor different technologies. The challenges are less in the 3DIC itself and more in costs, thermal management, design tools, and supply chain issues.
Source: Greg Yeric, Arm TechCon 2016

40
Multi-die Everywhere!
• Increases in compute resources drive larger die sizes
• The lack of scaling of SRAM bitcells, IO, and analog has a huge impact
• Monolithic silicon economics no longer works
• The shift to multi-chip has already started in the HPC/AI segment
Source: nextplatform.com/2022/01/04/inside-amazons-graviton3-arm-server-processor/

41
The Era of Multi-die Designs
[Figure: a hyperscale data center package with SoC1/SoC2/SoC3 chiplets, and a premium smartphone package with Chiplet1/Chiplet2/Chiplet3 and LPDDR]
Source: nextplatform.com/2022/01/04/inside-amazons-graviton3-arm-server-processor/

42
TSV Columns for 3D
• TSV arrays carry power from die1 to die2; C4 bumps deliver power to die1
• A standard SRAM macro cannot have TSVs inside => either degrade IR drop or split the macro
[Figure: die stack showing TSV columns for power from die1 to die2, and C4 bumps for power to die1]
Source: 2022 IEEE ISSCC 2.7, "Zen3": The AMD 2nd-Generation 7nm x86-64 Microprocessor Core

43
TSV Integrated RAM for 3D
• The RAM is redesigned to have KOZs (Keep-Out Zones) for TSVs, compared to the original SRAM on die1
[Figure: RAM floorplan with cell arrays, colmux8, row decoders, CK, and WL drivers, showing the TSV and C4 bump keep-out zones]

44
Traditional Monolithic Premium Smart Phone SoC
[Block diagram: CPU cluster (big/mid/little CPUs with L1$ and L2$, DynamIQ Shared Unit (DSU) with L3$), GPU cluster (GPU shader cores with LSC, GPU Top with L2$), modem, sensors, other devices, ISPs, DSPs, I/O, all connected through a Non-coherent Interconnect (NCI) to the System Level Cache (SLC) and the DRAM memory system]

45
Potential Future Design 1 (2.5D)
[Block diagram: a Compute Die (N process node) with the CPU cluster (big/mid/little CPUs with L1$ and L2$, DSU with L3$), GPU cluster (shader cores with LSC, GPU Top with L2$), neural engine, NCI, System Level Cache (SLC), and the DRAM memory system; connected die-to-die over UCIe to a Companion Die (N-1 process node) with its own NCI]
UCIe = Universal Chiplet Interconnect Express

46
Potential Future Design 2 (3D)
[Block diagram: a Compute Die (N process node) with the CPU cluster, GPU cluster, neural engine, NCI, and System Level Cache (SLC), stacked die-to-die via TSVs on a Companion Die (N-1 process node) carrying the modem, sensors, other devices, ISPs, DSPs, I/O, and the DRAM memory system]
TSV = Through-Silicon Via

47
Potential Future Design 3 (3D)
[Block diagram: a variant of Design 2, with the Compute Die (N process node) stacked via TSVs on the Companion Die (N-1 process node) and a different split of the System Level Cache and DRAM memory system between the two dies]

48
We All Have a Role to Play
• Who: IP providers, OEMs, silicon partners, EDA, foundries, OSATs
• Shared challenges: IR drop, power delivery network, thermals, system floorplanning, performance per Watt, timing

49
