Toward Dark Silicon in Servers
Kashif Rabbani
This report summarizes the technological trends that give rise to the phenomenon of dark silicon and its impact on servers, and reviews an effort to curb it, based on the research paper [6] published in 2011 by Hardavellas et al. Server chips do not scale beyond a certain limit; as a result, an increasing portion of the chip must remain powered off. This portion, which we can no longer afford to power, is known as dark silicon. Specialized multicore processors can make use of this abundant, underutilized, and power-constrained die area by providing diverse, application-specific heterogeneous cores that improve server performance and power efficiency.
1 DARK SILICON
Data is growing at an exponential rate, and processing it requires computational energy. It has been observed that data is growing faster than Moore's Law [1], which states that computer performance, CPU clock speed, and the number of transistors per chip double roughly every two years. An unprecedented amount of computational energy is required to cope with this growth; as an indication of the energy demands, a 1000 m² datacenter consumes about 1.5 MW. Nowadays, multicore processors are used to process this data, and it is widely believed that the performance of a system is directly proportional to the number of available cores. This belief does not hold in practice: performance does not follow Moore's Law, and real systems deliver much less than the expected results due to physical constraints such as bandwidth, power, and thermal limits, as shown in figure 1.
Figure 1: Physical Constraints
It is observed that off-chip bandwidth grows slowly, so cores cannot be fed with data fast enough. The increase in the number of transistors is also not matched by a corresponding decrease in supply voltage: a 10x increase in transistor count came with only a 30% voltage drop over the last decade. Power, meanwhile, is constrained by cooling limits, and cooling does not scale at all. To fuel the multicore revolution, the number of transistors on the chip keeps growing exponentially; however, operating all transistors simultaneously would require exponentially more power per chip, which is simply not possible under the physical constraints explained above. As a result, an exponentially growing area of the chip is left unutilized, known as dark silicon.
The dark silicon area grows exponentially, as shown by the trend line in figure 2, which plots over time the die sizes of the peak performance designs for different workloads. In simple words, we can use only a fraction of the transistors available on a large chip; the rest remain powered off.
Figure 2: Die size trend
Now a question arises: should we waste this large unutilized dark area of the chip? Hardavellas et al. [6] repurpose dark silicon for chip multiprocessors (CMPs) by building a sea of specialized, heterogeneous, application-specific cores. Such a chip dynamically powers up only a few selected cores designed explicitly for the given workload; most of the application-specific cores remain disabled (dark) when not in use.
Benefits of Specialized Cores: Specialized cores are better than conventional cores because they eliminate overheads. For example, accessing a piece of data from local memory, the L2 cache, and main memory costs roughly 50 pJ, 256-1000 pJ, and nearly 16000 pJ of energy, respectively. These overheads are inherent to general-purpose computing, while a carefully designed specialized core can eliminate most of them. Specialized cores thereby improve the aggregate performance and energy efficiency of server workloads by mitigating the effect of the physical constraints.
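As a rough illustration of these overheads, the expected energy of a single data access can be composed from the per-level costs quoted above. The following is a back-of-envelope sketch only; the hit rates are assumptions for illustration, not measurements from the paper.

```python
# Expected energy per data access, composed from the per-level costs quoted
# above (~50 pJ local memory, ~256-1000 pJ L2, ~16000 pJ main memory).
# The hit rates are illustrative assumptions, not measurements.
E_LOCAL_PJ, E_L2_PJ, E_MEM_PJ = 50, 600, 16000

def expected_access_energy_pj(local_hit=0.90, l2_hit=0.95):
    to_l2 = 1.0 - local_hit          # fraction of accesses reaching L2
    to_mem = to_l2 * (1.0 - l2_hit)  # fraction reaching main memory
    return E_LOCAL_PJ + to_l2 * E_L2_PJ + to_mem * E_MEM_PJ

# A general-purpose core paying the full hierarchy overhead:
print(f"{expected_access_energy_pj():.0f} pJ/access")               # ~190 pJ
# A specialized core engineered to keep its working set local:
print(f"{expected_access_energy_pj(local_hit=1.0):.0f} pJ/access")  # 50 pJ
```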
1.1 Methodology
To assess the extent of dark silicon, it is crucial to jointly optimize a large number of design parameters to compose CMPs capable of attaining peak performance while staying within the physical constraints. To this end, first-order analytical models are developed over the principal components of the processor, such as supply and threshold voltage, clock frequency, cache size, memory hierarchy, and core count. The goal of these analytical models is to derive peak performance designs and to describe the physical constraints of the processor. Detailed parameterized models are constructed according to ITRS (International Technology Roadmap for Semiconductors, https://en.wikipedia.org/wiki/International_Technology_Roadmap_for_Semiconductors) standards; these models help in exploring the design space of multicores. Note that the models do not prescribe the absolute number of cores or cache size required to achieve peak performance in a processor. Instead, they capture the first-order effects of technology scaling to uncover the trends leading to dark silicon. Performance is measured in terms of aggregate server throughput, and the models are applied to both general-purpose and heterogeneous designs. To construct such models, several design configuration choices are made for the hardware, technology, area, performance, cache, bandwidth, and power models, as described in detail in the next section.
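The joint optimization can be pictured as an exhaustive sweep over candidate designs that discards every configuration violating a physical constraint. The sketch below conveys this flavor only: the parameter ranges, constants, and constraint formulas are invented placeholders, not the paper's ITRS-derived model.

```python
from itertools import product

# Flavor of a first-order design-space sweep. All constants and formulas
# below are invented placeholders, not the paper's actual model.
DIE_AREA_MM2 = 310 * 0.72   # die area available to cores + cache (Sec. 2.3)
POWER_BUDGET_W = 130        # assumed air-cooling power limit
BW_LIMIT_GBS = 100          # assumed off-chip bandwidth limit
CORE_AREA_MM2 = 5           # assumed area per core
CACHE_MM2_PER_MB = 2        # assumed cache density

def throughput(cores, cache_mb, freq_ghz):
    """Toy aggregate throughput: bigger caches cut the miss rate (a power
    law, Sec. 2.5), which raises per-core performance (Sec. 2.4)."""
    miss_rate = 0.05 * (cache_mb + 1) ** -0.5
    uipc = 1.0 / (1.0 + 200 * miss_rate)  # assumed 200-cycle miss penalty
    return cores * freq_ghz * uipc

best = None
for cores, cache_mb, freq in product(
        range(4, 257, 4), (4, 8, 16, 24, 32, 64), (1.0, 1.5, 2.0, 2.5)):
    area = cores * CORE_AREA_MM2 + cache_mb * CACHE_MM2_PER_MB
    power = cores * (0.4 + 0.5 * freq)                       # toy W per core
    bandwidth = 0.4 * cores * freq * (cache_mb + 1) ** -0.5  # toy GB/s demand
    if area > DIE_AREA_MM2 or power > POWER_BUDGET_W or bandwidth > BW_LIMIT_GBS:
        continue  # design violates a physical constraint: discard it
    perf = throughput(cores, cache_mb, freq)
    if best is None or perf > best[0]:
        best = (perf, cores, cache_mb, freq)

print("peak-performance design (perf, cores, cache MB, GHz):", best)
```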
2 DESIGN CHOICES
2.1 Hardware Model
CMPs are built from three types of cores: general purpose (GPP), embedded (EMB), and specialized (SP). GPPs are scalar, in-order, four-way multithreaded cores that provide high throughput
in a server environment, achieving a 1.7x speedup over a single-threaded core [7]. EMB cores represent a power-conscious design paradigm and are similar to GPP cores in performance. SP cores are CMPs with specialized hardware, e.g. GPUs, digital signal processors, and field-programmable gate arrays; only the hardware components best suited to the given workload are powered up at any time instance. SP cores outperform GPP cores by 20x while drawing 10x less power.
2.2 Technology Model
CMPs are modeled across the 65nm, 45nm, 32nm, and 20nm fabrication technologies following ITRS projections. Transistors with a high threshold voltage (Vth) are best at limiting leakage current, so high-Vth transistors are used to mitigate the effect of the power wall [3]. Three configurations are explored to characterize the behavior of the model: high-performance (HP) transistors for the entire chip, HP transistors for the cores with LOP (low operating power) transistors for the cache, and LOP transistors for the entire chip.
2.3 Area Model
The model restricts the die area to 310 mm². Interconnect and system-on-chip components occupy 28% of this area; the remaining 72% is available for cores and cache. Core areas are estimated by scaling an existing design for each core type according to ITRS standards: an UltraSPARC T1 core is scaled for GPP cores, and an ARM11 core for EMB and SP cores.
2.4 Performance Model
Amdahl's law [9] is the basis of the performance model, which assumes 99% application parallelism. The performance of a single core is measured by its UIPC (user instructions committed per cycle), which is computed in terms of the average memory access time, given by:
AverageMemoryAccessTime = HitTime + MissRate × MissPenalty
The aggregate UIPC is proportional to the overall system throughput. Detailed formulas, derivations, and calculations of the performance model are available in [4][5].
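The following sketch shows how such a per-core estimate composes with Amdahl's law, assuming the classic memory-stall CPI formulation; all constants are placeholders, and the real derivations are in [4][5].

```python
# Sketch of the performance model: per-core UIPC from memory stalls, then
# aggregate throughput under Amdahl's law. All constants are illustrative
# placeholders; the actual derivations are in [4][5].

def uipc(base_cpi=1.0, accesses_per_instr=0.3, miss_rate=0.02,
         miss_penalty=300):
    """UIPC = 1 / (base CPI + memory stall cycles per instruction), where
    stalls per access follow the AMAT formula: MissRate * MissPenalty."""
    cpi = base_cpi + accesses_per_instr * miss_rate * miss_penalty
    return 1.0 / cpi

def aggregate_throughput(n_cores, parallel_fraction=0.99):
    """Amdahl's law with the model's assumed 99% parallelism."""
    speedup = 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_cores)
    return speedup * uipc()

for n in (16, 64, 256):
    print(n, "cores ->", round(aggregate_throughput(n), 2))
# Returns diminish as the serial 1% dominates (Amdahl ceiling: 100x speedup).
```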
2.5 L2 Cache Miss Rate and Data-set Evolution Models
Estimating the cache miss rate for a given workload is important, as the miss rate plays a governing role in performance. Empirical miss-rate measurements for L2 caches of sizes between 256KB and 64MB are curve-fitted to estimate the miss rate at other cache sizes. An x-shifted power law, y = α(x + β)^γ, provides the best fit for the data, with only a 1.3% average error. The miss-rate scaling formulas are listed in detail in [4].
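Such a fit is easy to reproduce with standard tooling. In the sketch below the sample points are invented for illustration; the paper fits its own empirical measurements.

```python
import numpy as np
from scipy.optimize import curve_fit

# Fit the x-shifted power law y = alpha * (x + beta) ** gamma to miss-rate
# samples. The data points below are invented for illustration; the paper
# fits empirical measurements over 256KB-64MB caches.
def xshifted_power_law(x, alpha, beta, gamma):
    return alpha * (x + beta) ** gamma

cache_mb = np.array([0.25, 0.5, 1, 2, 4, 8, 16, 32, 64])
miss_rate = np.array([0.20, 0.16, 0.13, 0.10, 0.08, 0.065, 0.05, 0.04, 0.033])

(alpha, beta, gamma), _ = curve_fit(
    xshifted_power_law, cache_mb, miss_rate, p0=(0.15, 1.0, -0.5))
print(f"alpha={alpha:.3f}  beta={beta:.3f}  gamma={gamma:.3f}")

# Extrapolate beyond the measured range, e.g. to a 128MB cache:
print("predicted miss rate @128MB:",
      xshifted_power_law(128.0, alpha, beta, gamma))
```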
2.6 Off-chip Bandwidth Model
Chip bandwidth requirements are modeled by estimating the off-chip activity rate from the clock frequency and core performance: off-chip bandwidth demand is proportional to the L2 miss rate, the core count, and the core activity. The maximum available bandwidth is determined by the number of pads and the maximum off-chip clock rate. In our model, 3D-stacked memory is treated as a large L3 cache due to its high capacity and high bandwidth. Each layer of 3D-stacked memory holds 8 Gbits at the 45nm technology and consumes 3.7 W in the worst case. We model eight layers, for a total capacity of 8 GBytes, plus one extra layer for control logic. Adding these nine layers raises the chip temperature by 10°C; nevertheless, this extra power dissipation is accounted for to counter its effects. We estimate that 3D stacking improves memory access time by 32.5%, because it makes communication between the cores and the 3D memory very efficient.
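These proportionalities translate directly into a feasibility check. In the sketch below, the pad count, signaling rate, and cache-line size are assumptions for illustration, not the paper's parameters.

```python
# Sketch of the off-chip bandwidth check: demand grows with core count,
# activity, and L2 miss rate; supply comes from the chip's pads and the
# off-chip clock rate. All constants here are illustrative assumptions.
LINE_BYTES = 64   # assumed cache-line size
PAD_COUNT = 128   # assumed data pads devoted to memory traffic
PAD_GBITS = 1.6   # assumed per-pad signaling rate (Gbit/s)

def demand_gb_s(cores, freq_ghz, activity, l2_misses_per_instr, uipc=0.5):
    """Off-chip traffic = instructions/s * misses/instruction * line size."""
    instr_per_s = cores * activity * uipc * freq_ghz * 1e9
    return instr_per_s * l2_misses_per_instr * LINE_BYTES / 1e9

def supply_gb_s():
    return PAD_COUNT * PAD_GBITS / 8  # bits to bytes

cfg = dict(cores=64, freq_ghz=2.0, activity=0.8, l2_misses_per_instr=0.005)
print(demand_gb_s(**cfg), "GB/s needed vs", supply_gb_s(), "GB/s available")
# ~16.4 GB/s needed vs 25.6 GB/s available: this design fits the budget.
```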
2.7 Power Model
Total chip power is calculated by adding the static and dynamic power of each component, such as the cores, caches, I/O, and interconnect. ITRS data determine the maximum power available to air-cooled chips with heat sinks. The model takes this maximum power limit as input and discards every CMP design exceeding it. Liquid cooling technologies could raise the maximum power, but such cooling methods have not yet been applied successfully to processor cores. The dynamic power of the N cores and the L2 cache is computed using the formulas detailed in the paper.
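A standard decomposition, dynamic switching power plus static leakage summed over components and checked against the cooling budget, captures the shape of this computation. The constants below are illustrative assumptions, not ITRS values.

```python
# Total chip power = static (leakage) + dynamic (switching) power, summed
# over cores and cache, then checked against the cooling budget. The classic
# dynamic-power term is activity * C * V^2 * f; all constants below are
# illustrative assumptions, not ITRS values.
def core_power_w(v_dd, freq_ghz, activity=0.8, c_eff_nf=1.0, leakage_w=0.3):
    dynamic = activity * c_eff_nf * 1e-9 * v_dd**2 * freq_ghz * 1e9
    return dynamic + leakage_w

def chip_power_w(n_cores, cache_mb, v_dd=0.9, freq_ghz=2.0,
                 cache_w_per_mb=0.15, other_w=10.0):
    cores = n_cores * core_power_w(v_dd, freq_ghz)
    cache = cache_mb * cache_w_per_mb  # mostly leakage
    return cores + cache + other_w     # + I/O, interconnect, ...

POWER_BUDGET_W = 130                   # assumed air-cooled limit
total = chip_power_w(64, 32)
print(f"{total:.1f} W; within budget: {total <= POWER_BUDGET_W}")
```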
3 ANALYSIS
Having constructed the analytical models, we now demonstrate their use. The next two subsections explore the peak performance designs of general-purpose and specialized multicore processors, evaluate the core counts of these designs, and conclude with a comparative analysis.

3.1 General purpose multicore processors
We begin by explaining the progression of our peak-performance design-space exploration through the results shown in figure 3. Figure 3a shows the performance of 20nm GPP CMPs running Apache, built with high performance (HP) transistors for both cores and cache. The graph plots aggregate chip performance as a function of the L2 cache size, i.e., of the fraction of the die area dedicated to the L2 cache (shown in MB on the x-axis).
The Area curve shows the performance of a design with unlimited power and off-chip bandwidth but a constrained on-chip die area. The larger the cache, the fewer the cores; yet even though fewer cores fit on the remaining die area, each core performs better thanks to the higher hit rate of the bigger cache. Enlarging the L2 cache improves performance up to 64MB; beyond that, the benefit is outweighed by the cost of further reducing the number of cores.
The Power curve shows the performance of a design running at maximum frequency, with power limited by the air-cooling constraint but unlimited off-chip bandwidth and area. The power constraint severely restricts aggregate chip performance, because running the cores at maximum frequency requires so much energy that only very few cores can be powered.
The Bandwidth curve shows the performance of a design with unlimited power and die area but limited off-chip bandwidth. Larger caches relieve the off-chip bandwidth pressure and thereby improve performance.
The Area+Power curve shows the performance of a design limited in both power and area but with unlimited off-chip bandwidth. Such a design jointly optimizes the frequency and voltage of the cores, selecting the peak-performance design for each L2 cache size.
The Peak performance curve represents the multicore design that respects all the physical constraints. Performance is limited by off-chip bandwidth at small cache sizes, but beyond 24MB power becomes the main performance limiter; the peak-performance design is found at the intersection of the power and bandwidth curves.
Figure 3: Performance of general-purpose (GPP) chip multiprocessors
A large gap between the peak performance curve and the area curve indicates that a vast area of the silicon in GPP designs cannot be used for more cores because of the power constraint.
Figure 3b shows the performance of designs that use high performance (HP) transistors for the cores and low operating power (LOP) transistors for the cache, and figure 3c shows designs with LOP transistors for both cores and cache. Designs using HP transistors throughout can power up only 20% of the cores that fit in the die area at 20nm. In contrast, designs using LOP transistors for the cache yield higher performance, because the larger caches they enable support approximately double the number of cores (35-40% of the cores that fit, in our case), and LOP devices yield even higher power efficiency when used to implement both the cores and the cache.
Hence we can conclude that the peak performance designs offered by general-purpose multicore processors leave a large area of dark silicon when cores and caches are built with HP transistors, while the use of LOP transistors reduces the dark area to some extent, as explained above and shown in figure 3.
Figure 4: Core Counts Analysis
Core Counts Analysis: To analyze the number of cores actually utilized, figure 4a plots the theoretical number of cores that fit on the specified die area of each technology, along with the core counts of the peak performance designs. Due to chip power limits, HP-based designs become infeasible after 2013. Although LOP-based designs provide a way forward, the large gap between the die area limit and the LOP designs indicates that an increasing fraction of the die area will remain dark because of underutilized cores.
3.2 Specialized multicore processors
We now demonstrate the peak performance designs using GPP, embedded (EMB), and specialized (SP) cores built with LOP transistors at the 20nm technology node.
An extreme application of SP cores is evaluated by considering a specialized computing environment in which a multicore chip contains hundreds of diverse application-specific cores. Only the cores most useful for the running application are activated; the rest of the on-chip cores remain powered off. The SP design delivers high performance with fewer but more capable cores. SP cores are highly power efficient, and they significantly outperform the GPP and EMB cores.
Core Counts Analysis: Figure 4b shows a comparative analysis of core counts for the peak performing designs across the three core types. Peak performance SP designs employ only 16-32 cores, and the cache occupies a large portion of the die area. These low-core-count SP designs outperform the other designs at 99.9% parallelism. The high-performance characteristics of SP cores push the power envelope further than is possible with the other core designs. SP multicores attain a 2x to 12x speedup over EMB and GPP multicore designs and are ultimately constrained by the limited off-chip bandwidth. A 3D-stacked memory is used to mitigate the bandwidth constraint beyond the power limits: it pushes the bandwidth constraint back and leads to a high-performance, power-constrained design (figure 4c). Eliminating the off-chip bandwidth bottleneck returns the design to the power-limited regime with underutilized die area (figure 4b). Reducing off-chip bandwidth pressure by combining 3D memory with specialized cores improves the speedup by 3x at the 20nm node and reduces the pressure on the on-chip cache size. In contrast, GPP and EMB chip multiprocessors attain less than a 35 percent performance improvement.

4 CURRENT STATE-OF-THE-ART
The phenomenon of dark silicon emerged around 2005, when processor designers started increasing the core count to exploit Moore's law scaling rather than improving single-core performance. It turned out that Moore's Law and Dennard scaling behave conversely in reality; Dennard scaling states that power density per unit area remains constant as transistors shrink [2]. Initially, the tasks of the processor were divided across different units to achieve efficient processing and minimize the impact of dark silicon. This division led to concepts such as floating-point units, and it was later realized that dividing and distributing the processor's tasks across specialized modules could also help to alleviate the problem of dark silicon. These specialized modules resulted in a smaller processor area and more efficient task execution, making it possible to turn off one group of transistors before starting another; using few transistors efficiently for one task allows other parts of the processor to keep working. These concepts advanced into System on Chip (SoC) and System in Chip (SiC) processors, and transistors in Intel processors likewise turn on and off according to the workload. However, the specialized multicore design discussed in this report requires further research to realize its impact on other SoC and SiC multicore processors with different bandwidth and temperature requirements.
5 RELATED WORK
In this section, we discuss other strategies, techniques, and trends proposed in the literature around the phenomenon of dark silicon.
Jörg Henkel et al. introduced new trends in dark silicon in 2015, focusing on its thermal aspects. Extensive experiments show that the chip's total power budget is not the only reason behind dark silicon; power density and the related thermal effects also play a major role. They therefore propose Thermal Safe Power (TSP) as a more efficient power budget. They further observe that taking the peak temperature constraint into account reduces the dark area of the silicon, and that the use of Dynamic Voltage and Frequency Scaling increases overall system performance and decreases dark silicon [8].
Kanduri et al. presented a run-time resource management system known as adBoost in 2018. It employs a dark-silicon-aware run-time application mapping strategy to achieve thermal-aware performance boosting in multicore processors. It benefits from patterning (PAT) of dark silicon, a mapping strategy that evenly distributes temperature across the chip to enhance the utilizable power budget. It achieves lower temperatures and a higher power budget, and it sustains longer periods of boosting. Experiments show that it yields 37 percent better throughput than other state-of-the-art performance boosters [11].
Lei Yang et al. proposed a thermal model to solve the fundamental problem of determining whether an on-chip multiprocessor system can run a desired job while maintaining reliability and keeping every core within a safe temperature range. The model enables quick chip temperature prediction and finds the optimal task-to-core assignment by predicting the minimum chip peak temperature. If even that minimum peak temperature exceeds the safe limit, a newly proposed heuristic algorithm, temperature-constrained task selection (TCTS), reacts to optimize system performance within the chip's safe temperature limit. The optimality of the TCTS algorithm is formally proved, and extensive performance evaluations show that the model reduces the chip peak temperature by 10°C compared with traditional techniques, while improving overall system performance by 19.8% under the safe temperature limitation. Finally, a real case study demonstrates the feasibility of this systematic technique [10].
6 CONCLUSION
Continuous scaling of multicore processors is constrained by power, temperature, and bandwidth. These constraints limit conventional multicore designs to a few tens or low hundreds of cores. As a result, a large portion of the processor chip is sacrificed to allow the rest of the chip to keep working. We have discussed a technique that repurposes this unused die area (dark silicon) by constructing specialized multicores. Specialized (SP) multicores implement a large number of workload-specific cores and power up only those cores that closely match the requirements of the executing workload. A detailed first-order model is proposed to analyze the design of SP multicores under all the physical constraints, and extensive workload experiments compare them with other general-purpose multicores. SP multicores outperform the other designs by 2x to 12x. Although SP multicores are an appealing design, modern workloads must be characterized to identify the computational segments that are candidates for off-loading to specialized cores. Moreover, software infrastructure and a runtime environment are required to facilitate code migration at the appropriate granularity.

REFERENCES
[1] 1965. Moore's Law. https://en.wikipedia.org/wiki/Moore%27s_law
[2] 1974. Dennard Scaling. https://en.wikipedia.org/wiki/Dennard_scaling
[3] Pradip Bose. 2011. Power Wall. Springer US, Boston, MA, 1593–1608. https://doi.org/10.1007/978-0-387-09766-4_499
[4] Nikolaos Hardavellas. 2009. Chip Multiprocessors for Server Workloads. Ph.D. Dissertation (supervisors: Babak Falsafi and Anastasia Ailamaki).
[5] Nikolaos Hardavellas, Michael Ferdman, Anastasia Ailamaki, and Babak Falsafi. 2010. Power Scaling: the Ultimate Obstacle to 1K-Core Chips. (2010).
[6] Nikos Hardavellas, Michael Ferdman, Babak Falsafi, and Anastasia Ailamaki. 2011. Toward Dark Silicon in Servers. IEEE Micro 31, 4 (2011), 6–15.
[7] Nikos Hardavellas, Ippokratis Pandis, Ryan Johnson, Naju Mancheril, Anastassia Ailamaki, and Babak Falsafi. 2007. Database Servers on Chip Multiprocessors: Limitations and Opportunities. In CIDR, Vol. 7. Citeseer, 79–87.
[8] Jörg Henkel, Heba Khdr, Santiago Pagani, and Muhammad Shafique. 2015. New Trends in Dark Silicon. In 2015 52nd ACM/EDAC/IEEE Design Automation Conference (DAC). IEEE, 1–6.
[9] Mark D. Hill and Michael R. Marty. 2008. Amdahl's Law in the Multicore Era. Computer 41, 7 (2008), 33–38.
[10] Mengquan Li, Weichen Liu, Lei Yang, Peng Chen, and Chao Chen. 2018. Chip Temperature Optimization for Dark Silicon Many-core Systems. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37, 5 (2018), 941–953.
[11] Amir M. Rahmani, Muhammad Shafique, Axel Jantsch, Pasi Liljeberg, et al. 2018. adBoost: Thermal Aware Performance Boosting through Dark Silicon Patterning. IEEE Trans. Comput. 67, 8 (2018), 1062–1077.