3 Madonna University, Nigeria
4 University of Bradford, United Kingdom
5 Ball State University, USA
6 Austin Peay State University, USA
Abstract—The evolution of computer architecture has led to a paradigm shift from traditional single-core processors to multi-core and domain-specific architectures that address the increasing demands of modern computational workloads. This paper provides a comprehensive study of this evolution, highlighting the challenges and key advancements in the transition from single-core to multi-core processors. It also examines state-of-the-art hardware accelerators, including Tensor Processing Units (TPUs) and their derivatives, RipTide and the Catapult fabric, and evaluates their strategies for optimizing critical performance metrics such as energy consumption, latency, and flexibility. Ultimately, this study emphasizes the role of reconfigurable systems in overcoming current architectural challenges and driving future advancements in computational efficiency.

I. INTRODUCTION

This paper explores the evolving landscape of computer architecture in the post-Moore's Law era, highlighting the challenges posed by the power wall and the end of Dennard scaling. It describes the transition from single-core to multicore architectures, which provided a temporary solution but introduced complexities in parallel computing.

With the advent of domain-specific architectures (DSAs), specialized accelerators such as the TPU have emerged, optimizing for specific workloads such as machine learning. However, as the demand for sparse matrix computations increases, variants such as Sparse-TPU and FlexTPU offer more efficient solutions. The paper also discusses RipTide, a flexible architecture designed to balance programmability and energy efficiency, and Catapult, which leverages FPGAs for datacenter flexibility. The study concludes by emphasizing the importance of innovative and flexible architectures that cater to diverse computational needs while optimizing energy consumption.

Section II provides an in-depth analysis of the evolution of computer architecture, highlighting the challenges posed by the slowing of Moore's law, the end of Dennard scaling, and the 'power wall'. It examines the transition from single-core to multi-core processors, and subsequent issues such as 'dark silicon' and the 'memory wall'. Additionally, it looks at various computing paradigms, such as heterogeneous computing and domain-specific architectures, which aim to maximize performance for specific tasks while reducing energy consumption.

Section III focuses on parallelism in computer architecture. It discusses instruction-level parallelism (ILP), data-level parallelism (DLP), and thread-level parallelism (TLP), each with their unique techniques and challenges. This section also compares various computing models, including the Von Neumann model, dataflow models, and hybrid models.

Sections IV through X explore state-of-the-art accelerators, including TPU, STPU, FlexTPU, and RipTide. These sections discuss how these accelerators are optimized for different performance metrics, such as energy consumption, flexibility, and latency.

II. EVOLUTION OF COMPUTER ARCHITECTURE

A. Moore's Law, Dennard Scaling and the Bottleneck

The history of computer architecture is deeply tied to Moore's law, formulated by Gordon Moore in 1965. Moore observed that the number of transistors on a microchip roughly doubles every two years. This exponential growth fueled miniaturization and advancements in integrated circuits, making them faster and more complex. Initially, chip designers focused on cramming more transistors into a single processor core to boost its clock speed, the rate at which it executes instructions. However, this approach soon hit physical limitations [1].

As transistors shrank, they became more susceptible to overheating and power consumption skyrocketed. This phenomenon is known as the power wall [2]. Compounding this issue was the end of Dennard scaling, another principle observed in the early days of microchips. Dennard scaling predicted that even as transistors miniaturized, their power consumption would remain relatively constant due to reduced voltage requirements. Unfortunately, this beneficial
effect diminished as transistors reached atomic scales, making it impractical to further reduce the voltage without compromising performance [1][2].

B. The Multicore Solution and Its Challenges

The combined pressure from the power wall and the end of Dennard scaling forced a paradigm shift in processor design. Around the early 2000s, chipmakers began transitioning from single-core to multicore architectures. Multicore processors essentially pack multiple independent processing cores onto a single chip. This approach allowed parallel processing, where multiple tasks could be executed simultaneously, effectively improving overall performance without significantly increasing power consumption [3].

However, the multicore revolution was not without its challenges. Software development had to adapt to efficiently utilize these multiple cores. Traditional software designed for single-core processors often could not leverage the parallelism offered by multi-core architectures. Furthermore, the increased number of cores introduced the concept of "dark silicon": not all cores on a chip could be powered on simultaneously, due to limitations in heat dissipation and power delivery. This essentially rendered some transistors inactive, reducing overall efficiency [4][7].

In addition, there was the memory wall problem, which refers to the speed limitations of data transfer between the processor and the memory. This problem intensifies as the number of cores and the demand for memory bandwidth increase.

Figure 1 illustrates the dramatic increase in transistor count on microprocessors over time, which is in accordance with Moore's law. This exponential growth fueled the advancements discussed previously, but ultimately led to the need for multicore architectures.

C. Beyond Multicore: Heterogeneous Computing and Domain-Specific Architectures

The pursuit of improved performance and energy efficiency continues to drive innovation in computer architecture. One such innovative approach is heterogeneous computing. In contrast to homogeneous computing, which relies exclusively on Central Processing Units (CPUs) for all tasks, heterogeneous computing combines different types of processors on a single chip: CPUs, which handle general-purpose tasks while controlling overall system operation; Graphics Processing Units (GPUs), which perform parallel processing tasks such as rendering and machine learning; and specialized accelerators. Each type of processor in a heterogeneous computing architecture is optimized for specific tasks, leading to improved performance and reduced power consumption for workloads that can be parallelized across these diverse cores, while also allowing efficient communication and resource sharing between all components [7].

These specialized accelerators are typically referred to as Domain-Specific Architectures (DSAs). DSAs are custom-designed processors that are tailored for specific computational tasks, such as machine learning or video processing. They take advantage of the inherent characteristics of these domains to achieve optimal performance and energy efficiency. For example, the Eyeriss accelerator is specifically designed for Convolutional Neural Networks (CNNs), a type of artificial neural network commonly used in deep learning applications. Eyeriss optimizes data movement and minimizes energy consumption, which are critical aspects of deep learning algorithms [5].

When designing a DSA, computer architects must make two crucial decisions:

1) Type of Parallelism: Parallelism is a fundamental concept in modern computer architecture aiming to improve performance by running multiple instructions or operations concurrently. Different types of parallelism exist, such as instruction-level parallelism, data-level parallelism, and thread-level parallelism. The architect needs to choose the type that best suits the target application to maximize efficiency.

2) Computing Model: The computing model defines how data are processed and flow within the DSA. Common models include the von Neumann model and the dataflow model. The architect must select the model that aligns best with the use case and the chosen type of parallelism.

The next sections will briefly discuss these concepts.

III. TYPES OF PARALLELISM IN COMPUTER ARCHITECTURE

There are three popular kinds of parallelism, which will be discussed below.

A. Instruction Level Parallelism (ILP)

ILP focuses on extracting parallelism within a single stream of instructions. It leverages the fact that some instructions in a program may be independent of each other, meaning they do not rely on the results of previous instructions and do not produce results that affect subsequent instructions. Identifying these independent instructions allows processors to execute them concurrently, improving overall program execution speed.

Several techniques are used to exploit ILP:

• Instruction pipelining: This technique overlaps the execution of different stages of an instruction (fetching, decoding, execution, memory access, and writing results) with those of subsequent instructions [6].

• Multiple instruction issue: This technique is a feature of modern processors that allows them to decode and issue multiple instructions simultaneously, given that they are independent. The primary advantage of this approach is that it enables the processor to keep its execution units active and reduce idle time, thus improving overall efficiency [6].

• Out-of-order execution: This involves the analysis of instructions, their reordering for efficient execution, and their execution outside of their original program order, which can enhance performance, even in relatively simple code.
Fig. 1: Trends in Microprocessor Technology from 1970 to 2020. Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten; new plot and data collected for 2010-2019 by K. Rupp.
Furthermore, the increased complexity in software design demands more intricate programming techniques to achieve optimal performance without introducing errors.

for(i = 0; i < SIZE; i++)
{
    a[i] = i * i + 5;
}

(a)

.L8:
    movdqa    %xmm2, %xmm1
.L3:
    movdqa    %xmm1, %xmm5
    movdqa    %xmm1, %xmm2
    addq      $16, %rax
    pmuludq   %xmm1, %xmm5
    psrlq     $32, %xmm1
    pshufd    $8, %xmm5, %xmm0
    pmuludq   %xmm1, %xmm1
    pshufd    $8, %xmm1, %xmm1
    paddd     %xmm4, %xmm2
    punpckldq %xmm1, %xmm0
    paddd     %xmm3, %xmm0
    movdqa    %xmm0, -16(%rax)
    cmpq      %rdx, %rax
    jne       .L8

(b)

Fig. 3: Demonstration of DLP. (a) Sample 'for loop' written in C. (b) Assembly code generated by the GCC compiler, highlighting the use of SIMD instructions via Intel's Streaming SIMD Extensions (SSE), which efficiently parallelizes data processing.

IV. COMPUTING MODELS

Computer architects must carefully select the computing model that best aligns with their use case when designing DSAs. There are two primary computing models: the Von Neumann model and the Dataflow model.

A. Von Neumann Model

The Von Neumann execution model, a classic architecture, fetches program instructions from memory, decodes them, executes them, and stores the results. The program counter (PC) determines the next instruction to be executed, facilitating sequential execution. Operands are retrieved from a centralized memory or registers, forming the foundation of traditional computer architectures. This process drives a linear instruction flow in sequential programming [6][8]. The Von Neumann architecture, offering high flexibility, is essential in most general-purpose processors.

Due to the inherent sequential execution of the model, computations tend to be slow. To enhance performance, various types of parallelism can be utilized. For instance, Instruction-Level Parallelism (ILP) can be exploited through techniques such as pipelining and superscalar architectures. Data-Level Parallelism (DLP), on the other hand, can be implemented using Single Instruction, Multiple Data (SIMD) extensions. Additionally, multi-threading can be used to improve throughput [8].

In Figure 4, we present an example of the Von Neumann computing model, showing its structure and the sequential instruction execution characteristic of this model.
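To make the fetch-decode-execute cycle concrete, the following C sketch models a toy Von Neumann machine: a unified memory holds data, a small program array holds instructions, and a program counter steps through them strictly one at a time. The instruction set and encoding are invented purely for illustration.

#include <stdio.h>

/* Toy instruction set, invented for this sketch only. */
enum { OP_LOAD, OP_ADD, OP_STORE, OP_HALT };

typedef struct { int op, addr; } Instr;

int main(void)
{
    Instr program[] = {            /* instructions stored in memory...        */
        {OP_LOAD, 0}, {OP_ADD, 1}, {OP_STORE, 2}, {OP_HALT, 0}
    };
    int data[3] = {40, 2, 0};      /* ...alongside the data (unified memory)  */
    int pc = 0, acc = 0, running = 1;

    while (running) {
        Instr ins = program[pc++]; /* fetch, then advance the program counter */
        switch (ins.op) {          /* decode and execute                      */
        case OP_LOAD:  acc = data[ins.addr];  break;
        case OP_ADD:   acc += data[ins.addr]; break;
        case OP_STORE: data[ins.addr] = acc;  break;
        case OP_HALT:  running = 0;           break;
        }
    }
    printf("result = %d\n", data[2]);  /* prints 42 */
    return 0;
}

Every step depends on the previous one through the program counter, which is exactly the sequential bottleneck that the parallelism techniques above, and the dataflow model below, try to relax.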
B. Dataflow Model

(a + b − 7b) × (a + b + 7b)

Fig. 5: A basic dataflow graph using an adder tree configuration [12]
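To connect the expression above to the dataflow view, the hedged C sketch below evaluates (a + b − 7b) × (a + b + 7b) the way a dataflow graph would: each node fires as soon as its operands are available, so nodes on the same level carry no dependence on one another and could execute in parallel. The node grouping and names are ours, not taken from the cited figure.

/* Level-by-level evaluation of (a + b - 7b) * (a + b + 7b).
   Nodes within a level are mutually independent (could fire in parallel). */
int dataflow_eval(int a, int b)
{
    /* level 1: both nodes depend only on the inputs a and b */
    int s  = a + b;      /* node: a + b        */
    int t7 = 7 * b;      /* node: 7b           */

    /* level 2: both nodes depend only on level-1 results */
    int left  = s - t7;  /* node: (a + b) - 7b */
    int right = s + t7;  /* node: (a + b) + 7b */

    /* level 3: the root node consumes the two level-2 results */
    return left * right;
}

Note how the intermediate values a + b and 7b are each consumed by two downstream nodes; this is the kind of data reuse the dataflow model is designed to exploit.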
Systolic arrays are often favored over adder trees for matrix multiplication because of their efficiency and their ability to handle high data reuse rates, a key requirement in matrix multiplication. A systolic array is composed of a grid of processing elements (PEs), typically Multiply-Accumulate (MAC) units, which are closely interconnected [9]. As shown in Figure 6, a systolic array performing matrix-matrix multiplication allows both input data and results to flow through the grid. This is in contrast to adder trees, where only the results flow. This characteristic of systolic arrays enables more efficient data processing and promotes optimal data reuse.
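A cycle-accurate systolic array is beyond the scope of this paper, but the hedged C sketch below captures the MAC-centric view described above: each output element is produced by a PE that repeatedly multiplies a streamed-in pair of operands and accumulates the result. The mapping of loop iterations onto physical PEs and the data-skewing logic of a real array are deliberately omitted.

#define N 3   /* grid of N x N processing elements (illustrative size) */

/* C = A * B, expressed as the per-PE multiply-accumulate recurrence:
   PE(i,j) accumulates a[i][k] * b[k][j] as operands stream past it. */
void systolic_style_matmul(const int a[N][N], const int b[N][N], int c[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            int acc = 0;                    /* accumulator inside PE(i,j)   */
            for (int k = 0; k < N; k++)     /* k-th beat of the stream      */
                acc += a[i][k] * b[k][j];   /* one MAC per beat             */
            c[i][j] = acc;                  /* result flows out of the grid */
        }
}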
The primary advantage of the dataflow model lies in its data reuse efficiency. By prioritizing data movement, dataflow architectures can significantly minimize unnecessary data transfers and memory accesses, which are often the leading causes of energy consumption and latency in traditional Von Neumann systems. In addition, the inherent flexibility of the dataflow model enables the architectures to be highly reconfigurable, allowing them to efficiently adapt to a variety of computational tasks and data types. This adaptability proves to be crucial when dealing with a diverse range of deep learning models and algorithms, each requiring different computational resources [9].

C. Hybrid

The hybrid model combines the strengths of both the Von Neumann and dataflow models to take advantage of sequential execution and parallelism. By integrating the programmability and sequential execution of the Von Neumann model with the data-driven parallel execution of the dataflow model, the hybrid model seeks to balance ease of programming and resource utilization. This approach is particularly beneficial for applications that require both sequential and parallel processing [8].

V. CHOOSING THE RIGHT ARCHITECTURE FOR DOMAIN-SPECIFIC ACCELERATORS

The choice of computing model and type of parallelism when designing a domain-specific architecture (DSA) depends on the specific requirements of the accelerator. For example, if the accelerator is intended for a device that is used in remote locations and operates on battery power, such as energy harvesting devices, energy efficiency is paramount. This is because it is not feasible to frequently replace the battery in these devices. In such cases, an Application-Specific Integrated Circuit (ASIC) might be the optimal choice as it offers superior energy efficiency, being optimized for a specific use case. Additionally, if the goal is to optimize latency and/or throughput for a specific workload, an ASIC would also be the ideal choice.

Currently, most ASICs used for neural network applications typically use the dataflow model because matrix operations are well suited for this model. However, there may be instances where the accelerator needs to perform multiple operations on the workload for which the dataflow model is not optimized. In such cases, using only the dataflow model would not be energy efficient. The hybrid model, which integrates the Von Neumann model for computations requiring sequential computation and the dataflow model for the rest of the accelerator, would be more suitable in this scenario.
It is common to want an accelerator to be able to perform computations on different types of workloads. Ideally, an architecture that offers the best flexibility and energy efficiency would be the best hardware choice. However, achieving this performance requirement on any hardware is challenging. In this case, an ASIC would be the least suitable choice because of its lack of flexibility. The CPU, which would provide the best flexibility, is very energy inefficient. A compromise could be to use a reconfigurable architecture, which offers flexibility while also being energy efficient.

Fig. 7: Trade-off Between Flexibility and Energy Efficiency in Various Hardware Choices

Figure 7 illustrates the trade-off between flexibility and energy efficiency for various hardware choices.

Two popular reconfigurable architectures that are currently in use are the Coarse-Grained Reconfigurable Array (CGRA) and the Field-Programmable Gate Array (FPGA). The choice between these two largely depends on the specific goals of the accelerator. If the accelerator requires maximum flexibility, FPGAs are generally the best choice. However, in applications where there is a need to optimize latency and energy efficiency, CGRAs tend to be the better option. This is primarily because CGRAs operate at a coarser granularity than FPGAs, leading to shorter reconfiguration times and higher energy efficiency.

In the following sections, we will discuss various popular DSAs that have been designed with a focus on latency and/or throughput, energy efficiency, and flexibility. These DSAs include the Tensor Processing Unit (TPU) and its variants Sparse-TPU and FlexTPU. We will also look at RipTide, an accelerator designed for both energy efficiency and flexibility. Lastly, we will explore the use cases of reconfigurable hardware in data centers.

VI. TENSOR PROCESSING UNIT

This section discusses the Tensor Processing Unit (TPU) as presented in the paper "In-Datacenter Performance Analysis of a Tensor Processing Unit" by Norman P. Jouppi et al. [19].

Machine learning and artificial intelligence applications present several unique challenges that traditional hardware architectures struggle to handle efficiently. These challenges include:

1) Computational demands: Machine learning models, especially neural networks, require substantial computational resources. The TPU is designed to efficiently handle these high computational demands.

2) Power Efficiency: As the scale of data and complexity of the models increase, so does the power consumption. The TPU provides a high throughput per watt, making it a more energy-efficient solution compared to traditional CPUs and GPUs.

3) Latency Requirements: Many real-world applications, such as autonomous driving or real-time translation, require low latency. The TPU's deterministic execution model is better suited to the 99th-percentile response time requirement of these applications.

4) Memory Management: Machine learning models often require large amounts of memory to store intermediate results, weights, and biases. The TPU comes with a large, software-managed on-chip memory to efficiently handle these requirements.

5) Scaling: As machine learning models become more complex and data sets grow larger, the need for hardware that can scale with these increases becomes apparent. TPUs are designed to work both independently and as part of a larger system, allowing easy scaling as computational needs grow.

A. TPU Architecture

The TPU is a hardware accelerator designed specifically for machine learning workloads. At the heart of the TPU is a large matrix multiplication unit, which is capable of performing 65,536 8-bit multiply-and-add operations in a single cycle.

It uses a systolic array architecture for its matrix multiplication unit. This design choice is based on the observation that a significant portion of computation in many machine learning workloads consists of matrix operations. The systolic array architecture allows for high computational efficiency and throughput by enabling multiple operations to be performed simultaneously. In addition, it minimizes the need for data movement, a major source of energy consumption in traditional architectures.

One of the standout features of the TPU is its large, software-managed on-chip memory. This design allows for efficient data reuse, a common requirement in machine learning workloads. It also reduces the need for costly data transfers between the processor and off-chip memory, further enhancing the efficiency of the TPU architecture.
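For context, the 65,536 figure quoted above corresponds to a square systolic grid: the matrix unit described in the TPU paper cited above is organized as a 256 × 256 array of 8-bit MAC units, and 256 × 256 = 65,536, so one step of operands through the array performs that many multiply-and-add operations in a single cycle.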
B. Deterministic Execution

In a TPU, operations are executed in a predictable and consistent manner. This is in contrast to CPUs and GPUs, which often employ dynamic scheduling and out-of-order execution to optimize performance.
Although these techniques can improve average-case performance, they can also lead to variability in execution times, making it difficult to guarantee a specific response time.

On the other hand, TPUs are designed to perform a large number of operations simultaneously in a highly structured way, specifically matrix multiplications, which are at the heart of many machine learning workloads. Data flow through the TPU systolic array in a predictable pattern and the same operation is performed at each step. This means that for a given input size, the execution time is constant, regardless of the specific values of the input data.

This deterministic execution model allows TPUs to consistently meet the strict latency requirements of many machine learning applications, making them particularly well suited for real-time or near-real-time applications where predictability is key.

C. Limitations of TPU

TPUs are highly energy efficient when performing computations on dense matrices. Dense matrices are those with a high number of nonzero elements. An example of where a dense matrix is commonly found is in image processing, where the intensity of each pixel is represented by a numeric value between 0 and 255 [11]. Operations on dense matrices are both memory and computationally intensive. As a result, it is common practice to prune a dense matrix to reduce the number of computations performed on it. This process introduces sparsity to the matrix.

Sparse matrices are matrices with a high number of zero elements. Sparsity in dense matrices can also be introduced through quantization. Furthermore, some applications, such as graph and recommendation systems, produce sparse matrices by default.
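As a concrete illustration of pruning, the hedged C sketch below zeroes out every weight whose magnitude falls below a chosen threshold, which is one simple way sparsity is introduced into an otherwise dense matrix; the threshold and matrix size are arbitrary.

#include <math.h>

#define ROWS 4
#define COLS 4

/* Magnitude pruning: zero every element with |w| < threshold.
   Returns the number of zeros (a rough measure of the resulting sparsity). */
int prune(float w[ROWS][COLS], float threshold)
{
    int zeros = 0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++) {
            if (fabsf(w[i][j]) < threshold)
                w[i][j] = 0.0f;          /* pruned away */
            if (w[i][j] == 0.0f)
                zeros++;
        }
    return zeros;
}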
While TPUs are energy efficient for dense matrix computations, they are less so when used for sparse matrices, although they are still more energy efficient than CPUs and GPUs. The reason for this lies in the structure of the TPU itself. Specifically, the presence of sparsity in the matrices can lead to unutilized PEs within the TPU, thereby reducing its overall computational efficiency.

Figure 8 provides an illustration of how a 6x6 dense matrix is mapped to a 3x3 TPU systolic array. As shown in the figure, the TPU partitions the input matrix directly, based on the shape of the systolic array. This process requires four iterations for execution. It is important to note that even if the matrix were sparse, the TPU would still require four iterations for execution due to its method of partitioning the input matrix. This is because the TPU partitioning approach does not account for the sparsity of the matrix.
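The four iterations in the Figure 8 example follow directly from the blocked mapping: a matrix larger than the array is cut into array-sized tiles, and each tile takes one pass. The hedged C sketch below computes that tile count for the 6x6-matrix-on-3x3-array case; it models only the partitioning arithmetic, not the actual dataflow.

#include <stdio.h>

/* Ceiling division: how many tiles of size `tile` cover `dim` elements. */
static int ceil_div(int dim, int tile) { return (dim + tile - 1) / tile; }

int main(void)
{
    int matrix_rows = 6, matrix_cols = 6;   /* dense 6x6 input    */
    int array_dim   = 3;                    /* 3x3 systolic array */

    int passes = ceil_div(matrix_rows, array_dim)
               * ceil_div(matrix_cols, array_dim);

    /* prints 4: the dense mapping needs four passes, sparse or not */
    printf("iterations = %d\n", passes);
    return 0;
}

Because the partitioning looks only at the shape of the matrix, the count stays at four even when many of the tiled elements are zero, which is exactly the inefficiency the Sparse-TPU targets next.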
VII. SPARSE-TPU

This section discusses the Sparse-TPU (STPU) as presented in the paper "Sparse-TPU: Adapting Systolic Arrays for Sparse Matrices" by Xin He et al. [20].

The Sparse Tensor Processing Unit (STPU) addresses the challenge of efficiently handling sparse matrices. It recognizes that the traditional systolic array architecture, while highly efficient for dense matrices, can be adapted to handle sparse matrices as well.

In a traditional systolic array, such as that of TPUs, the PEs containing zeros in a sparse matrix are unutilized, leading to wasted computational resources and energy. The STPU addresses this inefficiency by repurposing these PEs.

The STPU introduces a comprehensive framework that maps sparse data structures to a 2D systolic-based processor in a scalable manner. This allows the STPU to handle both dense and sparse matrices efficiently, thereby improving the utilization of the PEs and the overall energy efficiency. A key feature of the STPU is its ability to perform column merging when mapping the input matrix onto its systolic array.

Take, for example, Figure 9, which shows how a 6x6 sparse matrix is mapped onto a 3x3 STPU systolic array. The STPU merges columns before mapping them onto the systolic array, enabling computations on the sparse matrix to be completed in just three iterations. Despite this design offering superior energy efficiency compared to the TPU, there are instances where some PEs remain unutilized during computations. This occurs when there is a disproportionate number of non-zero columns across the rows, leading to an increase in unutilized PEs. For example, the value 12 of the sparse matrix required an additional iteration to map to the systolic array, despite the availability of slots in the systolic array in the previous iteration.

Additionally, the STPU handles both sparse matrix-vector (SpMV) and sparse matrix-matrix (SpMM) operations.
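The column-merging idea above can be approximated in software with a simple greedy pass: columns whose non-zero patterns do not collide are packed into the same physical column of the array, which is how a sparse 6x6 matrix can fit in fewer passes than the dense mapping. The C sketch below is a hedged simplification of that packing step; it only groups columns and reports the group count, and does not reproduce the STPU's actual mapping algorithm or its collision-handling details.

#include <stdbool.h>

#define N 6   /* 6x6 sparse matrix, as in the Figure 9 example */

/* Greedy column merging: place each column into the first group whose
   occupied row positions do not overlap with the column's non-zeros. */
int merge_columns(const int m[N][N])
{
    bool occupied[N][N] = {{false}};   /* occupied[g][r]: group g uses row r */
    int  groups = 0;

    for (int c = 0; c < N; c++) {
        int g = 0;
        for (; g < groups; g++) {
            bool collides = false;
            for (int r = 0; r < N; r++)
                if (m[r][c] != 0 && occupied[g][r]) { collides = true; break; }
            if (!collides) break;      /* column c fits into group g        */
        }
        if (g == groups) groups++;     /* no fit found: open a new group    */
        for (int r = 0; r < N; r++)
            if (m[r][c] != 0) occupied[g][r] = true;
    }
    return groups;                     /* merged columns the array must hold */
}

Fewer groups means fewer passes through the array, which is the source of the three-iteration result quoted above, although the real STPU must also handle the collisions that force values such as the 12 in Figure 9 into an extra iteration.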
Fig. 9: Mapping of (a) a sparse matrix of size 6x6 to (b) 3x3 STPU systolic array [16]