SoC Design
ICE of silicon [Roza]
[Figure: computational efficiency in MOPS/W (log scale, 10^0 to 10^6) versus feature size in µm (0.5, 0.25, 0.13, 0.07). The intrinsic computational efficiency of silicon is plotted against microprocessors (i386SX, i486DX, 68040, 601, 604, 604e, P5, P6, microsparc, Supersparc, Turbosparc, Ultrasparc, 21164a, 21364, 7400) and applications (3DTV, query by humming).]
http://bwrc.eecs.berkeley.edu/cic
Designing Embedded Systems on Silicon-1
J. van Meerbergen
2/7/13
Hardware Efficiency
[Figure: efficiency (low, medium, high) versus flexibility (low, medium, high) for ASIC, ASIP, DSP, GP proc and FPGA.]
ASIC Style
A Finite Impulse
Response (FIR) filter
! highly efficient for fixed algorithms
! Ok only for large market volumes (100Ms for 32 nm)
! No changes after processing at all (no field upgrades, tuning to
specific context, bug fixes, new standards)
! Irregular code leads to highly irregular floorplan with large wiring
impact (Edyn) and large leakage (Estat)
! Difficult to efficiently include time multiplexing for irregular code
ASIC + microcontroller style
CPU
MEM
ASIC
! highly efficient for fixed algorithms that use the µ-controller very seldom
! Ok only for large market volumes (100Ms for 32 nm)
! Limited changes after processing
! Changes only very locally in non-critical code (ok for some field
upgrades, tuning to specific context, bug fixes, new standards)
! Irregular code leads to highly irregular floorplan with large wiring
impact (Edyn) and large leakage (Estat)
! Difficult to efficiently include time multiplexing for irregular code
General-purpose microprocessors
! Highly flexible: easy field upgrades, tuning to specific context,
bug fixes, new standards
! Easy to use and compiler friendly
! Large market due to combination of smaller markets
! Large A+E overhead: data cache hierarchy, multi-port register file,
instr. hierarchy, very flexible data-path units (wide multiplier, ALU
with many instr.)
GP CPUs + custom accelerators
Accel
! Highly flexible: easy field upgrades, tuning to specific context,
bug fixes, new standards. But degraded when accelerators have
to be used too much
! Easy to use and compiler friendly
! Large market due to combination of smaller markets, but not
when accelerators used more
! Large A+E overhead: data cache hierarchy, multi-port register file,
instr hierarchy, very flexible data-path units (wide multiplier, ALU
with many instr). Partly mitigated when accelerators are used
sufficiently
! Large overhead in communication between microprocessor and
accelerators, except for large code segments (not flexible!)
SoC Design
Synthesis
DFT Insertion
Floorplanning
Power Planning
Clock tree insertion
Place and Route
RC extraction
Timing check
Design Tools
System Architecture
C/C++
SystemC
Matlab
RTL
Verilog-XL
NC-Verilog
NC-VHDL
Debussy
Synthesis
RC Compiler
Design Compiler
Physical Design
SoC Encounter
Magma (Synopsys)
Mentor
Simplified Flow
[Flow diagram, split into Front End and Back End.
Inputs: .lib, LEF, RTL, Timing Constraints
Steps shown: Logic Synthesis, Logic Simulation, Formal Verification, Test (ATPG), Static Timing Analysis, Floor planning, Clock Tree Synthesis, Place & Route, RC Extraction, DRC/LVS, Static Timing Analysis
Files exchanged and produced: Netlist, SPEF, SDF, GDSII]
TSMC's Design Flow
Flow with Multi-Vendor Tools
Design Abstraction Levels
[Figure: abstraction hierarchy SYSTEM, MODULE, GATE, CIRCUIT, DEVICE (MOS transistor with G, S and D terminals and n+ regions).]
[Figure: impact of a design decision versus complexity across the conceptual level, high level, RT level, gate level and transistor level.]
Design Flow: Summary
Level             Time concept                          Data type             Code lines
Concept           comm. processes with distinct rates   Tokens                1K
High level        frame, signal rate                    arrays, lists         10K
RT level          clock                                 scalars, int, float   100K
Gate level        set-up and hold times                 bits                  1M
Transistor level  Analog                                Volt, mA              10M
At higher levels the impact of a design decision is larger.
Vendors concentrate on lower levels (more general solutions).
Logic Synthesis
Synthesis is the process by which an abstract description (known as RTL) of
the circuit behaviour (generally in VHDL) is mapped to a set of primitive
standard cells in a library for a particular process technology.
Netlist Synthesis
Translation of RTL description into an intermediate format
Optimization of logic
Mapping of the optimized netlist to the gates of target library.
Synthesis tool requires
RTL code
Target ASIC cell library
User Constraints
  Timing and Area
  Environmental: Power, Load etc.
Output of the synthesis is a gate level netlist in the target technology
[Figure: refinement from Idea through Functional Description, Behavioral HDL and RTL to Gate-Level Netlist; synthesis blocks labelled Logic, DFT and Architecture.]
RTL Coding
RTL stands for Register Transfer Level
An RTL description describes the design in terms of registers and the logic
that resides between them
This captures the timing constraints of the design efficiently
Verilog and VHDL are the two most popular hardware description languages
used to write RTL descriptions
RTL description captures the change in data at each clock cycle
All the registers are updated at the same time in a clock cycle
RTL captures the data flow
Logic synthesis tools translate an RTL model more efficiently than a
behavioral model
Sample RTL code
    if IR(3) = '0' then
        PC := PC + 1;
        DBUF := MEM(PC);
        SP := SP - 1;
        PC := DBUF;
    else
        MEM(SP) := PC + 1;
    end if;
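The statement that all registers are updated at the same time in a clock cycle can be illustrated with a small Python model. This is only a sketch: the register names A, B and SUM and the function clock_edge are hypothetical and do not come from the slides. Each next value is computed from the current values, and the whole register set is replaced in one step.

```python
def clock_edge(regs):
    """Return the register values after one clock edge.

    Every next value is computed from the *current* values, and the whole
    set is returned together, mimicking a synchronous register update.
    """
    return {
        "A": regs["B"],                # A captures the old value of B
        "B": regs["A"],                # B captures the old value of A, not the new one
        "SUM": regs["A"] + regs["B"],  # also computed from the old values
    }


state = {"A": 1, "B": 2, "SUM": 0}
state = clock_edge(state)
print(state)
```

A sequential software-style update (A = B; B = A) would lose the old value of A; computing all next values first is what lets the two registers swap in a single cycle.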
Logic Synthesis
RTL
    process (CLK, RST)
    begin
        if RST = '1' then
            Q <= '0';
        elsif rising_edge(CLK) then
            Q <= A and B and not (C and D);
        end if;
    end process;
[Diagram: the RTL, the ASIC cell library and the user constraints feed the Logic Synthesis Tool, which produces the gate level netlist.]
Logic Synthesis: Technology Mapping
Z = (not S and A) or (S and B)
[Figure: the function with inputs A, B, S and output Z, drawn first with generic gates and then mapped to the standard cells I-002 and ANDOR-001.]
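Technology mapping must preserve the logic function. The sketch below checks this exhaustively for Z = (not S and A) or (S and B). It assumes, as a guess from the cell names only, that I-002 behaves as an inverter and ANDOR-001 as an AND-OR cell; the Python function names are hypothetical.

```python
from itertools import product


def generic_mux(a, b, s):
    """Z = (not S and A) or (S and B), built from generic NOT/AND/OR gates."""
    return ((1 - s) & a) | (s & b)


def inv_cell(x):
    """Assumed behaviour of an inverter standard cell."""
    return 1 - x


def andor_cell(a1, a2, b1, b2):
    """Assumed behaviour of an AND-OR standard cell: (a1 and a2) or (b1 and b2)."""
    return (a1 & a2) | (b1 & b2)


def mapped_mux(a, b, s):
    """The same function expressed with the two assumed library cells."""
    return andor_cell(inv_cell(s), a, s, b)


def equivalent():
    """Compare both implementations over all eight input combinations."""
    return all(
        generic_mux(a, b, s) == mapped_mux(a, b, s)
        for a, b, s in product((0, 1), repeat=3)
    )


print(equivalent())
```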
DfT Insertion
Testable Flip-Flops
Scan chain generation
Chain propagation
from core to output pin
DfT Insertion
DfT Insertion and Synthesis
DfT Analysis
Test generation
ATPG / Expansion
test validation
Handoff deliverables
Backend Design
Technology Information and
Physical Libraries
Corelib.lef
IOlib.lef
Rams.vclef
Timing libraries
Corelib_slow.lib
Corelib_fast.lib
Corelib_typ.lib
IOlib_slow.lib
RAM timing libraries
Timing constraints (user
defined)
Design Netlist
Add IO pads, power pads
Verilog design netlist
Chip Physical Architecture
I/O
& Hierarchical
Planning
Power Grid
Design
Analysis
Chip
Assembly
Hierarchical
STA
Floorplan
Implementation
Physical Synthesis
Placement
DFT
Clock Tree
Synthesis
Post Placement
Optimisation
Routing and Final Optimisation
Signal Routing
Antennas
Decap, Fillers
Crosstalk Fixing
Post Route Fix
Editing
IO pad location file
Floorplanning
Floorplanning is the task of deciding
how the chip area is to be used by
the leaf modules, taking wiring
considerations into account.
Two methods of floorplanning:
Top Down: Here the chip is
partitioned up during the
development of the RTL level
modelling. Area is assigned on the
basis of estimated block areas and
shapes, and blocks are placed
relative to each other depending on
connectivity.
Bottom up: Here the design is first
synthesised and then the resultant
gates are clustered together into
blocks on the basis of connectivity.
Most designs use a combination of
both of the above techniques, but the
emphasis is increasingly on the first.
[Figure: floorplan showing Std. Cells, IP Block and Pads.]
Floorplanning
Calculating core size, width and height
When calculating core size of standard cells, the core utilization must be
decided first. Usually the core utilization is higher than 85%
The core size is calculated as follows
Core Size of Standard Cells = (standard cell area) / (core utilization)
The recommended core shape is a square, i.e. Core Aspect Ratio = 1.
Width = Height = (Core Size of Standard Cells)^0.5
Example
Standard cell area = 2,000,000 µm²
Core utilization demanded = 85%
No macros
Core Size of Standard Cells = 2,000,000 / 0.85 = 2,352,941 µm²
Width = Height = (2,352,941)^0.5 = 1534 µm
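The core size calculation can be reproduced in a few lines of Python. This is a minimal sketch for a square core without macros, as in the example; the function name core_size is hypothetical.

```python
import math


def core_size(std_cell_area_um2, utilization):
    """Return (core area in um^2, side length in um) of a square core.

    Assumes no macros and a core aspect ratio of 1.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in the range (0, 1]")
    area = std_cell_area_um2 / utilization
    return area, math.sqrt(area)


area, side = core_size(2_000_000, 0.85)
print(round(area), round(side))
```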
Floorplanning
Core Margins
Space for power and ground
routing
Core limited / Pad limited designs
When pad width > (core width +
core margin), die size is decided
by the pads. This is called a pad
limited design
When pad width < (core width +
core margin), die size is decided
by the core. This is called a core
limited design
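The pad limited / core limited rule is a single comparison. The sketch below uses a hypothetical function name and hypothetical dimensions; the slides do not say how the equality case is treated, so classifying it as core limited is an assumption.

```python
def limiting_factor(pad_width_um, core_width_um, core_margin_um):
    """Classify a design as pad limited or core limited.

    Equality is treated as core limited (an assumption, not from the slides).
    """
    if pad_width_um > core_width_um + core_margin_um:
        return "pad limited"
    return "core limited"


# Hypothetical dimensions, for illustration only.
print(limiting_factor(pad_width_um=2000, core_width_um=1534, core_margin_um=100))
print(limiting_factor(pad_width_um=1500, core_width_um=1534, core_margin_um=100))
```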
Power Planning
Metal migration (also known as electromigration)
Under high current densities, collisions of electrons with
metal grains cause the metal to move. The
metal wire may become an open circuit or short to a neighbouring wire.
Prevention: sizing power supply lines to
ensure that the chip does not fail
Experience: make current density of power
ring < 1 mA/µm
IR drop
IR drop is the problem of voltage drop of the
power and ground due to high current flowing
through the power-ground resistive network
When there are excessive voltage drops in the
power network or voltage rises in the ground
network, the device will run at slower speed
IR drop can cause the chip to fail due to
Performance (circuit running slower than
specification)
Functionality problem (setup or hold violations)
Unreliable operation (less noise margin)
Power consumption (leakage power)
Latch up
Prevention: adding stripes to reduce IR drop on
the cells' power lines
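IR drop along a power rail can be sketched with Ohm's law. The model below is a simplification under stated assumptions: a single rail fed from one end, with one resistive segment in front of each cell tap. The function name, resistances and currents are hypothetical values chosen for illustration.

```python
def rail_voltages(vdd, segment_res_ohm, tap_currents_a):
    """Voltage seen at each tap of a rail fed from one end.

    Segment k carries the current of tap k and of every tap after it,
    so the drop accumulates towards the far end of the rail.
    """
    voltages = []
    v = vdd
    downstream = sum(tap_currents_a)
    for r, i in zip(segment_res_ohm, tap_currents_a):
        v -= downstream * r   # V = I * R for this segment
        voltages.append(v)
        downstream -= i       # this tap's current leaves the rail here
    return voltages


# Three cells drawing 10 mA each through 0.1 ohm segments from a 1.8 V supply.
print(rail_voltages(1.8, [0.1, 0.1, 0.1], [0.01, 0.01, 0.01]))
```

The last cell sees the lowest supply voltage, which is why additional stripes that shorten the resistive path help.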
Power Planning: IR Drop
Number of counts inversely proportional
to DSP clock frequency
FC = 10, 20 and 25 MHz
Ringo frequency 115 MHz @ VDD = 1.8 V
DSP induced PSN is clearly detected
Average PSN = 6 counts × 2.4 mV/count = 14.4 mV
[Figure: counter enable signal v(t) with window TC = 1/FC. Plot of C2 counts (691 to 699) versus tester ck-cycles (0 to 250), titled "C2 Counts vs. DSP activity (Fc = 20 MHz)" (Tambient = 27 °C), showing a difference of 6 counts.]
Source: J. Rius, UPC
Voltage Drop Verification
VoltageStorm (Cadence)
[Flow diagram: a virtual prototype with Partition 1, Partition 2 and an IP block (flat implementation) is analysed at two levels.
Block-level analysis: SoC Encounter, Encounter Power Analysis (block power consumption) and VoltageStorm block-level PG analysis, producing a block power-grid view.
Top-level analysis: Encounter Power Analysis (instance power consumption), the power grid view library and VoltageStorm top-level chip PG sign-off analysis.
Other labels: Create Hierarchy; results displayed in the SoC Encounter interface.]
Power Grid Design
Power Grid Design & Analysis
[Flow diagram. Power Grid Design: power grid creation, connect multiple power/ground, power plan refinement, power routing, power propagation. Extraction & Analysis: parasitics extraction, power grid analysis. Extraction & Hierarchical Analysis: parasitics extraction, power grid analysis.]
Power Ring Width
Experience
Gate count = 70 k
4000 Flip-Flops
80% FF with dynamic gated clock
Current needed = 0.2 mA/MHz
Note: multiply this value by 1.8~2 for a design without
gated clocks
Example:
Gate count = 200 k
No gated clock
Clock frequency = 20 MHz
Current needed = (200/70) * 0.2 * 20 * 2 = 22.86 mA
Current density < 1 mA/µm
The width of the P/G ring > 22.86 µm
In order to avoid the slot rule of wide metal, the
largest width is 20 µm (process dependent)
Use two sets of P/G ring for this case
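The rule of thumb above can be written as a short calculation. This is a sketch that scales the 70 k gate reference point linearly with gate count and clock frequency, as the example does; the function names and default parameters are hypothetical.

```python
import math

REF_GATES_K = 70       # reference design size, in thousands of gates
REF_MA_PER_MHZ = 0.2   # current of the reference design with gated clocks


def ring_current_ma(gate_count_k, clock_mhz, gated_clock=True, ungated_factor=2.0):
    """Estimate the supply current by linear scaling of the reference design."""
    current = gate_count_k / REF_GATES_K * REF_MA_PER_MHZ * clock_mhz
    return current if gated_clock else current * ungated_factor


def ring_plan(current_ma, max_density_ma_per_um=1.0, max_width_um=20.0):
    """Return (minimum total ring width in um, number of ring sets needed)."""
    min_width = current_ma / max_density_ma_per_um
    return min_width, math.ceil(min_width / max_width_um)


current = ring_current_ma(200, 20, gated_clock=False)
width, sets = ring_plan(current)
print(round(current, 2), round(width, 2), sets)
```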
Power Stripe Calculation
Experience
Add one strap set per 100 µm
Example
Core width = height = 1600 µm
Stripe sets added = 15
Core/IO power pad selection
Core power pad
One set core power pad
(PVDDC along with PVSSC)
can provide 40~50 mA current
IO power pad
One set IO power pad
(PVDDR along with PVSSR)
can provide the power for
3~4 output pads, or
6~8 input pads
[Figure: core power connection, stripes and power ring.]
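The stripe and power pad rules of thumb can be combined in a small estimator. This is a sketch: it assumes that stripe sets sit only at interior multiples of the pitch (which reproduces 15 sets for a 1600 µm core) and uses the conservative lower bounds of the quoted ranges. All function names and the sample pad counts are hypothetical.

```python
import math


def stripe_sets(core_width_um, pitch_um=100.0):
    """One strap set per pitch, counted at interior positions only (assumption)."""
    return max(math.ceil(core_width_um / pitch_um) - 1, 0)


def core_power_pad_sets(core_current_ma, ma_per_set=40.0):
    """Core pad sets needed, using the lower bound of 40~50 mA per set."""
    return math.ceil(core_current_ma / ma_per_set)


def io_power_pad_sets(output_pads, input_pads, outputs_per_set=3, inputs_per_set=6):
    """IO pad sets needed, using the lower bounds 3 outputs or 6 inputs per set."""
    return math.ceil(output_pads / outputs_per_set) + math.ceil(input_pads / inputs_per_set)


print(stripe_sets(1600))
print(core_power_pad_sets(22.86))
print(io_power_pad_sets(output_pads=10, input_pads=12))
```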
Placement
Placement decides the positions of components within allocated blocks
One cannot route until the components have been placed.
The quality of placement is decided solely on the basis of the quality of routing it allows.
Placement is performed using simple estimates of final routing.
Timing-driven P&R is the state of the art
Gates and flip-flops/latches are the common placement objects.
Smaller elements like logic gates are placed in a single row.
Larger blocks are placed across multiple rows.
[Figure: standard cells placed in a low-utilization core.]
Placement
Source: Magma
Clock Tree Synthesis
Clock signal is used as a timing reference
in a synchronous digital system for the
movement of data within that system.
The Clock Tree or clock distribution
network distributes the clock signal(s) from
a common point to all the elements that
need it
Properties of clock signals
They are loaded with the greatest fanout,
travel over the greatest distances and
operate at the highest speeds
The goal of clock tree synthesis
includes
Creating clock tree spec file
Building a buffer distribution network
In automatic CTS mode, Encounter will
do the following things
Build the clock buffer tree according to
the clock tree specification file
Balance the clock phase delay with
appropriately sized, inserted clock
buffers
Clock Tree Synthesis
Routing
Routing is the process of building the
physical connections between blocks
as defined by the logical connections.
Routing takes place in more than one
layer, the exact number available
depending on the process and design
conventions.
Layers are connected together using
vias
Global Routing
Assigns wires to channels
defined during the floor
planning phase
Detailed Routing
Assigns nets to individual
tracks in the channel
Routing: Signal Integrity Cross-talk
Parallel repeater insertion does not reduce
the cross-talk peak noise
For a 10 mm communication bus, the delay
noise is lowered by about 77%
Staggered repeaters reduce delay noise by
about 88%
[Figure: test structure with shield wires, pico pads and three parallel lines (aggressor, victim, aggressor), each with driver and receiver buffers (bfx3, bfx4, bfx50ohm), inputs T1IN to T3IN, outputs T1OUT to T3OUT, and a second power supply. Plots: peak noise and propagation delay versus wire length for a 20 mm wire.]
Source: M. Meijer and A. Katoch, Philips
Routing: SI Prevention
Verification Signoff
Timing & Crosstalk
Analysis
Power
Distribution
Analysis
Parasitic
Extraction
Static Timing Analysis
This involves three main steps:
Design is broken down into sets of
timing paths
The delay of each path is
calculated
All path delays are checked to see
if timing constraints have been met
[Figure: clocked circuit (CLK) with timing paths Path 1, Path 2 and Path 3.]
Path delay calculations
[Figure: one path, with labels D1 and U33, annotated with the delays 1.0, 0.54, 0.32, 0.66, 0.23, 0.43 and 0.25.]
path_delay = (1.0 + 0.54 + 0.32 + 0.66 + 0.23 + 0.43 + 0.25) = 3.43 ns
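The path delay sum and the final constraint check can be sketched as follows. The delays are the ones from the example; the clock period, the setup time and the function names are hypothetical, and a real STA tool also accounts for clock skew, hold checks and operating corners.

```python
STAGE_DELAYS_NS = [1.0, 0.54, 0.32, 0.66, 0.23, 0.43, 0.25]


def path_delay(delays_ns):
    """Step 2: the delay of a path is the sum of its element delays."""
    return sum(delays_ns)


def setup_slack(delays_ns, clock_period_ns, setup_ns=0.0):
    """Step 3: positive slack means the (assumed) setup constraint is met."""
    return clock_period_ns - setup_ns - path_delay(delays_ns)


print(round(path_delay(STAGE_DELAYS_NS), 2))
print(setup_slack(STAGE_DELAYS_NS, clock_period_ns=4.0) >= 0)
print(setup_slack(STAGE_DELAYS_NS, clock_period_ns=3.0) >= 0)
```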
Physical Verification
DRC: Design Rule Checking
LVS: Layout vs. Schematic verification
Chip Finishing
Seal-ring & Artefact Generation
Sometimes this step is simply
called Design Chip Finishing
Seal ring: helps to make the circuit moisture
resistant and prevents the
generation of cracks in the die
during sawing of the wafer
Artefacts: critical dimension structures, mask
ids, fuse markers, etc.
Tiling - dummy fill/pattern fill
Fabs' stringent min and max rules on
layer densities on active, poly and
metal must be met by all designs
Currently a back-end operation
Each step is followed by a
Physical Verification step
[Figure: die with tiles and seal ring.]
Package Fitting
Package options
Selection of appropriate
package
Route pads to pins
Wire length is important
Rule checking
GDS2 minimum required
information is the nitride or
pad opening layer or the
pad boundary layer
Packaging