3 Madonna University, Nigeria
4 University of Bradford, United Kingdom
5 Ball State University, USA
6 Austin Peay State University, USA
Abstract—The evolution of computer architecture has led to a paradigm shift from traditional single-core processors to multi-core and domain-specific architectures that address the increasing demands of modern computational workloads. This paper provides a comprehensive study of this evolution, highlighting the challenges and key advancements in the transition from single-core to multi-core processors. It also examines state-of-the-art hardware accelerators, including Tensor Processing Units (TPUs) and their derivatives, RipTide and the Catapult fabric, and evaluates their strategies for optimizing critical performance metrics such as energy consumption, latency, and flexibility. Ultimately, this study emphasizes the role of reconfigurable systems in overcoming current architectural challenges and driving future advancements in computational efficiency.

I. INTRODUCTION

This paper explores the evolving landscape of computer architecture in the post-Moore's Law era, highlighting the challenges posed by the power wall and the end of Dennard scaling. It describes the transition from single-core to multicore architectures, which provided a temporary solution but introduced complexities in parallel computing.

With the advent of domain-specific architectures (DSAs), specialized accelerators such as the TPU have emerged, optimizing for specific workloads such as machine learning. However, as the demand for sparse matrix computations increases, variants such as Sparse-TPU and FlexTPU offer more efficient solutions. The paper also discusses RipTide, a flexible architecture designed to balance programmability and energy efficiency, and Catapult, which leverages FPGAs for datacenter flexibility. The study concludes by emphasizing the importance of innovative and flexible architectures that cater to diverse computational needs while optimizing energy consumption.

Section II provides an in-depth analysis of the evolution of computer architecture, highlighting the challenges posed by the slowing of Moore's law, the end of Dennard scaling, and the 'power wall'. It examines the transition from single-core to multi-core processors, and subsequent issues such as 'dark silicon' and the 'memory wall'. Additionally, it looks at various computing paradigms, such as heterogeneous computing and domain-specific architectures, which aim to maximize performance for specific tasks while reducing energy consumption.

Section III focuses on parallelism in computer architecture. It discusses instruction-level parallelism (ILP), data-level parallelism (DLP), and thread-level parallelism (TLP), each with their unique techniques and challenges. This section also compares various computing models, including the Von Neumann model, dataflow models, and hybrid models.

Sections IV through X explore state-of-the-art accelerators, including TPU, STPU, FlexTPU, and RipTide. These sections discuss how these accelerators are optimized for different performance metrics, such as energy consumption, flexibility, and latency.

II. EVOLUTION OF COMPUTER ARCHITECTURE

A. Moore's Law, Dennard Scaling and the Bottleneck

The history of computer architecture is deeply tied to Moore's law, formulated by Gordon Moore in 1965. Moore observed that the number of transistors on a microchip roughly doubles every two years. This exponential growth fueled miniaturization and advancements in integrated circuits, making them faster and more complex. Initially, chip designers focused on cramming more transistors into a single processor core to boost its clock speed, the rate at which it executes instructions. However, this approach soon hit physical limitations [1].

As transistors shrank, they became more susceptible to overheating and power consumption skyrocketed. This phenomenon is known as the power wall [2]. Compounding this issue was the end of Dennard scaling, another principle observed in the early days of microchips. Dennard scaling predicted that even as transistors miniaturized, their power consumption would remain relatively constant due to reduced voltage requirements. Unfortunately, this beneficial
effect diminished as transistors reached atomic scales, making it impractical to further reduce the voltage without compromising performance [1][2].

B. The Multicore Solution and Its Challenges

The combined pressure from the power wall and the end of Dennard scaling forced a paradigm shift in processor design. Around the early 2000s, chipmakers began transitioning from single-core to multicore architectures. Multicore processors essentially pack multiple independent processing cores onto a single chip. This approach allowed parallel processing, where multiple tasks could be executed simultaneously, effectively improving overall performance without significantly increasing power consumption [3].

However, the multicore revolution was not without its challenges. Software development had to adapt to efficiently utilize these multiple cores. Traditional software designed for single-core processors often could not leverage the parallelism offered by multi-core architectures. Furthermore, the increased number of cores introduced the concept of "dark silicon": not all cores on a chip could be powered on simultaneously, due to limitations in heat dissipation and power delivery. This essentially rendered some transistors inactive, reducing overall efficiency [4][7].

In addition, there was the memory wall problem, which refers to the speed limitations of data transfer between the processor and the memory. This problem intensifies as the number of cores and the demand for memory bandwidth increase.

Figure 1 illustrates the dramatic increase in transistor count on microprocessors over time, which is in accordance with Moore's law. This exponential growth fueled the advancements discussed previously, but ultimately led to the need for multicore architectures.

C. Beyond Multicore: Heterogeneous Computing and Domain-Specific Architectures

The pursuit of improved performance and energy efficiency continues to drive innovation in computer architecture. One such innovative approach is heterogeneous computing. In contrast to homogeneous computing, which relies exclusively on Central Processing Units (CPUs) for all tasks, heterogeneous computing combines different types of processors on a single chip: CPUs, which handle general-purpose tasks while controlling overall system operation; Graphics Processing Units (GPUs), which perform parallel processing tasks such as rendering and machine learning; and specialized accelerators. Each type of processor in a heterogeneous computing architecture is optimized for specific tasks, leading to improved performance and reduced power consumption for workloads that can be parallelized across these diverse cores, while also allowing efficient communication and resource sharing between all components [7].

These specialized accelerators are typically referred to as Domain-Specific Architectures (DSAs). DSAs are custom-designed processors that are tailored for specific computational tasks, such as machine learning or video processing. They take advantage of the inherent characteristics of these domains to achieve optimal performance and energy efficiency. For example, the Eyeriss accelerator is specifically designed for Convolutional Neural Networks (CNNs), a type of artificial neural network commonly used in deep learning applications. Eyeriss optimizes data movement and minimizes energy consumption, which are critical aspects of deep learning algorithms [5].

When designing a DSA, computer architects must make two crucial decisions:

1) Type of Parallelism: Parallelism is a fundamental concept in modern computer architecture aiming to improve performance by running multiple instructions or operations concurrently. Different types of parallelism exist, such as instruction-level parallelism, data-level parallelism, and thread-level parallelism. The architect needs to choose the type that best suits the target application to maximize efficiency.

2) Computing Model: The computing model defines how data are processed and flow within the DSA. Common models include the von Neumann model and the dataflow model. The architect must select the model that aligns best with the use case and the chosen type of parallelism.

The next sections will briefly discuss these concepts.

III. TYPES OF PARALLELISM IN COMPUTER ARCHITECTURE

There are three popular kinds of parallelism, which will be discussed below.

A. Instruction Level Parallelism (ILP)

ILP focuses on extracting parallelism within a single stream of instructions. It leverages the fact that some instructions in a program may be independent of each other, meaning they do not rely on the results of previous instructions and do not produce results that affect subsequent instructions. Identifying these independent instructions allows processors to execute them concurrently, improving overall program execution speed.

Several techniques are used to exploit ILP:

• Instruction pipelining: This technique overlaps the execution of different stages of an instruction (fetching, decoding, execution, memory access, and writing results) with those of subsequent instructions [6].

• Multiple instruction issue: This technique is a feature of modern processors that allows them to decode and issue multiple instructions simultaneously, given that they are independent. The primary advantage of this approach is that it enables the processor to keep its execution units active and reduce idle time, thus improving overall efficiency [6].

• Out-of-order execution: This involves the analysis of instructions, their reordering for efficient execution, and their execution outside of their original program order, which can enhance performance, even in relatively simple code.
Fig. 1: Trends in Microprocessor Technology from 1970 to 2020. Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten; new plot and data collected for 2010-2019 by K. Rupp.
Furthermore, the increased complexity in software design demands more intricate programming techniques to achieve optimal performance without introducing errors.

for(i = 0; i < SIZE; i++)
{
    a[i] = i * i + 5;
}

(a)

.L8:
    movdqa    %xmm2, %xmm1
.L3:
    movdqa    %xmm1, %xmm5
    movdqa    %xmm1, %xmm2
    addq      $16, %rax
    pmuludq   %xmm1, %xmm5
    psrlq     $32, %xmm1
    pshufd    $8, %xmm5, %xmm0
    pmuludq   %xmm1, %xmm1
    pshufd    $8, %xmm1, %xmm1
    paddd     %xmm4, %xmm2
    punpckldq %xmm1, %xmm0
    paddd     %xmm3, %xmm0
    movdqa    %xmm0, -16(%rax)
    cmpq      %rdx, %rax
    jne       .L8

(b)

Fig. 3: Demonstration of DLP. (a) Sample 'for loop' written in C. (b) Assembly code generated by the GCC compiler, highlighting the use of SIMD instructions via Intel's Streaming SIMD Extensions (SSE), which efficiently parallelizes data processing.

IV. COMPUTING MODELS

Computer architects must carefully select the computing model that best aligns with their use case when designing DSAs. There are two primary computing models: the Von Neumann model and the Dataflow model.

A. Von Neumann Model

The Von Neumann execution model, a classic architecture, fetches program instructions from memory, decodes them, executes them, and stores the results. The program counter (PC) determines the next instruction to be executed, facilitating sequential execution. Operands are retrieved from a centralized memory or registers, forming the foundation of traditional computer architectures. This process drives a linear instruction flow in sequential programming [6][8]. The Von Neumann architecture, offering high flexibility, is essential in most general-purpose processors.

Due to the inherent sequential execution of the model, computations tend to be slow. To enhance performance, various types of parallelism can be utilized. For instance, Instruction-Level Parallelism (ILP) can be exploited through techniques such as pipelining and superscalar architectures. Data-Level Parallelism (DLP), on the other hand, can be implemented using Single Instruction, Multiple Data (SIMD) extensions. Additionally, multi-threading can be used to improve throughput [8].

In Figure 4, we present an example of the Von Neumann computing model, showing its structure and the sequential instruction execution characteristic of this model.
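To make the fetch-decode-execute cycle concrete, the following C sketch models a toy Von Neumann machine: a unified memory holds data, a small program array holds instructions, and a program counter steps through them strictly one at a time. The instruction set and encoding are invented purely for illustration.

#include <stdio.h>

/* Toy instruction set, invented for this sketch only. */
enum { OP_LOAD, OP_ADD, OP_STORE, OP_HALT };

typedef struct { int op, addr; } Instr;

int main(void)
{
    Instr program[] = {            /* instructions stored in memory...        */
        {OP_LOAD, 0}, {OP_ADD, 1}, {OP_STORE, 2}, {OP_HALT, 0}
    };
    int data[3] = {40, 2, 0};      /* ...alongside the data (unified memory)  */
    int pc = 0, acc = 0, running = 1;

    while (running) {
        Instr ins = program[pc++]; /* fetch, then advance the program counter */
        switch (ins.op) {          /* decode and execute                      */
        case OP_LOAD:  acc = data[ins.addr];  break;
        case OP_ADD:   acc += data[ins.addr]; break;
        case OP_STORE: data[ins.addr] = acc;  break;
        case OP_HALT:  running = 0;           break;
        }
    }
    printf("result = %d\n", data[2]);  /* prints 42 */
    return 0;
}

Every step depends on the previous one through the program counter, which is exactly the sequential bottleneck that the parallelism techniques above, and the dataflow model below, try to relax.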
B. Dataflow Model

(a + b − 7b) × (a + b + 7b)

Fig. 5: A basic dataflow graph using an adder tree configuration [12]
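To connect the expression above to the dataflow view, the hedged C sketch below evaluates (a + b − 7b) × (a + b + 7b) the way a dataflow graph would: each node fires as soon as its operands are available, so nodes on the same level carry no dependence on one another and could execute in parallel. The node grouping and names are ours, not taken from the cited figure.

/* Level-by-level evaluation of (a + b - 7b) * (a + b + 7b).
   Nodes within a level are mutually independent (could fire in parallel). */
int dataflow_eval(int a, int b)
{
    /* level 1: both nodes depend only on the inputs a and b */
    int s  = a + b;      /* node: a + b        */
    int t7 = 7 * b;      /* node: 7b           */

    /* level 2: both nodes depend only on level-1 results */
    int left  = s - t7;  /* node: (a + b) - 7b */
    int right = s + t7;  /* node: (a + b) + 7b */

    /* level 3: the root node consumes the two level-2 results */
    return left * right;
}

Note how the intermediate values a + b and 7b are each consumed by two downstream nodes; this is the kind of data reuse the dataflow model is designed to exploit.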
Systolic arrays are often favored over adder trees for matrix multiplication because of their efficiency and their ability to handle high data reuse rates, a key requirement in matrix multiplication. A systolic array is composed of a grid of processing elements (PEs), typically Multiply-Accumulate (MAC) units, which are closely interconnected [9]. As shown in Figure 6, a systolic array performing matrix-matrix multiplication allows both input data and results to flow through the grid. This is in contrast to adder trees, where only the results flow. This characteristic of systolic arrays enables more efficient data processing and promotes optimal data reuse.
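A cycle-accurate systolic array is beyond the scope of this paper, but the hedged C sketch below captures the MAC-centric view described above: each output element is produced by a PE that repeatedly multiplies a streamed-in pair of operands and accumulates the result. The mapping of loop iterations onto physical PEs and the data-skewing logic of a real array are deliberately omitted.

#define N 3   /* grid of N x N processing elements (illustrative size) */

/* C = A * B, expressed as the per-PE multiply-accumulate recurrence:
   PE(i,j) accumulates a[i][k] * b[k][j] as operands stream past it. */
void systolic_style_matmul(const int a[N][N], const int b[N][N], int c[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            int acc = 0;                    /* accumulator inside PE(i,j)   */
            for (int k = 0; k < N; k++)     /* k-th beat of the stream      */
                acc += a[i][k] * b[k][j];   /* one MAC per beat             */
            c[i][j] = acc;                  /* result flows out of the grid */
        }
}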
The primary advantage of the dataflow model lies in its data reuse efficiency. By prioritizing data movement, dataflow architectures can significantly minimize unnecessary data transfers and memory accesses, which are often the leading causes of energy consumption and latency in traditional Von Neumann systems. In addition, the inherent flexibility of the dataflow model enables the architectures to be highly reconfigurable, allowing them to efficiently adapt to a variety of computational tasks and data types. This adaptability proves to be crucial when dealing with a diverse range of deep learning models and algorithms, each requiring different computational resources [9].

C. Hybrid

The hybrid model combines the strengths of both the Von Neumann and dataflow models to take advantage of sequential execution and parallelism. By integrating the programmability and sequential execution of the Von Neumann model with the data-driven parallel execution of the dataflow model, the hybrid model seeks to balance ease of programming and resource utilization. This approach is particularly beneficial for applications that require both sequential and parallel processing [8].

V. CHOOSING THE RIGHT ARCHITECTURE FOR DOMAIN-SPECIFIC ACCELERATORS

The choice of computing model and type of parallelism when designing a domain-specific architecture (DSA) depends on the specific requirements of the accelerator. For example, if the accelerator is intended for a device that is used in remote locations and operates on battery power, such as energy harvesting devices, energy efficiency is paramount. This is because it is not feasible to frequently replace the battery in these devices. In such cases, an Application-Specific Integrated Circuit (ASIC) might be the optimal choice as it offers superior energy efficiency, being optimized for a specific use case. Additionally, if the goal is to optimize latency and/or throughput for a specific workload, an ASIC would also be the ideal choice.

Currently, most ASICs used for neural network applications typically use the dataflow model because matrix operations are well suited for this model. However, there may be instances where the accelerator needs to perform multiple operations on the workload for which the dataflow model is not optimized. In such cases, using only the dataflow model would not be energy efficient. The hybrid model, which integrates the Von Neumann model for computations requiring sequential computation and the dataflow model for the rest of the accelerator, would be more suitable in this scenario.
It is common to want an accelerator to be able to perform computations on different types of workloads. Ideally, an architecture that offers the best flexibility and energy efficiency would be the best hardware choice. However, achieving this performance requirement on any hardware is challenging. In this case, an ASIC would be the least suitable choice because of its lack of flexibility. The CPU, which would provide the best flexibility, is very energy inefficient. A compromise could be to use a reconfigurable architecture, which offers flexibility while also being energy efficient.

Fig. 7: Trade-off Between Flexibility and Energy Efficiency in Various Hardware Choices

Figure 7 illustrates the trade-off between flexibility and energy efficiency for various hardware choices.

Two popular reconfigurable architectures that are currently in use are the Coarse-Grained Reconfigurable Array (CGRA) and the Field-Programmable Gate Array (FPGA). The choice between these two largely depends on the specific goals of the accelerator. If the accelerator requires maximum flexibility, FPGAs are generally the best choice. However, in applications where there is a need to optimize latency and energy efficiency, CGRAs tend to be the better option. This is primarily because CGRAs operate at a coarser granularity than FPGAs, leading to shorter reconfiguration times and higher energy efficiency.

In the following sections, we will discuss various popular DSAs that have been designed with a focus on latency and/or throughput, energy efficiency, and flexibility. These DSAs include the Tensor Processing Unit (TPU) and its variants Sparse-TPU and FlexTPU. We will also look at RipTide, an accelerator designed for both energy efficiency and flexibility. Lastly, we will explore the use cases of reconfigurable hardware in data centers.

VI. TENSOR PROCESSING UNIT

This section discusses the Tensor Processing Unit (TPU) as presented in the paper "In-Datacenter Performance Analysis of a Tensor Processing Unit" by Norman P. Jouppi et al. [19].

Machine learning and artificial intelligence applications present several unique challenges that traditional hardware architectures struggle to handle efficiently. These challenges include:

1) Computational demands: Machine learning models, especially neural networks, require substantial computational resources. The TPU is designed to efficiently handle these high computational demands.

2) Power Efficiency: As the scale of data and complexity of the models increase, so does the power consumption. The TPU provides a high throughput per watt, making it a more energy-efficient solution compared to traditional CPUs and GPUs.

3) Latency Requirements: Many real-world applications, such as autonomous driving or real-time translation, require low latency. The TPU's deterministic execution model is better suited to the 99th-percentile response time requirement of these applications.

4) Memory Management: Machine learning models often require large amounts of memory to store intermediate results, weights, and biases. The TPU comes with a large, software-managed on-chip memory to efficiently handle these requirements.

5) Scaling: As machine learning models become more complex and data sets grow larger, the need for hardware that can scale with these increases becomes apparent. TPUs are designed to work both independently and as part of a larger system, allowing easy scaling as computational needs grow.

A. TPU Architecture

The TPU is a hardware accelerator designed specifically for machine learning workloads. At the heart of the TPU is a large matrix multiplication unit, which is capable of performing 65,536 8-bit multiply-and-add operations in a single cycle.

It uses a systolic array architecture for its matrix multiplication unit. This design choice is based on the observation that a significant portion of computation in many machine learning workloads consists of matrix operations. The systolic array architecture allows for high computational efficiency and throughput by enabling multiple operations to be performed simultaneously. In addition, it minimizes the need for data movement, a major source of energy consumption in traditional architectures.

One of the standout features of the TPU is its large, software-managed on-chip memory. This design allows for efficient data reuse, a common requirement in machine learning workloads. It also reduces the need for costly data transfers between the processor and off-chip memory, further enhancing the efficiency of the TPU architecture.
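For context, the 65,536 figure quoted above corresponds to a square systolic grid: the matrix unit described in the TPU paper cited above is organized as a 256 × 256 array of 8-bit MAC units, and 256 × 256 = 65,536, so one step of operands through the array performs that many multiply-and-add operations in a single cycle.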
B. Deterministic Execution

In a TPU, operations are executed in a predictable and consistent manner. This is in contrast to CPUs and GPUs, which often employ dynamic scheduling and out-of-order execution to optimize performance.
Although these techniques can improve average-case performance, they can also lead to variability in execution times, making it difficult to guarantee a specific response time.

On the other hand, TPUs are designed to perform a large number of operations simultaneously in a highly structured way, specifically matrix multiplications, which are at the heart of many machine learning workloads. Data flow through the TPU systolic array in a predictable pattern and the same operation is performed at each step. This means that for a given input size, the execution time is constant, regardless of the specific values of the input data.

This deterministic execution model allows TPUs to consistently meet the strict latency requirements of many machine learning applications, making them particularly well suited for real-time or near-real-time applications where predictability is key.

C. Limitations of TPU

TPUs are highly energy efficient when performing computations on dense matrices. Dense matrices are those with a high number of nonzero elements. An example of where a dense matrix is commonly found is in image processing, where the intensity of each pixel is represented by a numeric value between 0 and 255 [11]. Operations on dense matrices are both memory and computationally intensive. As a result, it is common practice to prune a dense matrix to reduce the number of computations performed on it. This process introduces sparsity to the matrix.

Sparse matrices are matrices with a high number of zero elements. Sparsity in dense matrices can also be introduced through quantization. Furthermore, some applications, such as graph and recommendation systems, produce sparse matrices by default.
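As a concrete illustration of pruning, the hedged C sketch below zeroes out every weight whose magnitude falls below a chosen threshold, which is one simple way sparsity is introduced into an otherwise dense matrix; the threshold and matrix size are arbitrary.

#include <math.h>

#define ROWS 4
#define COLS 4

/* Magnitude pruning: zero every element with |w| < threshold.
   Returns the number of zeros (a rough measure of the resulting sparsity). */
int prune(float w[ROWS][COLS], float threshold)
{
    int zeros = 0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++) {
            if (fabsf(w[i][j]) < threshold)
                w[i][j] = 0.0f;          /* pruned away */
            if (w[i][j] == 0.0f)
                zeros++;
        }
    return zeros;
}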
While TPUs are energy efficient for dense matrix computations, they are less so when used for sparse matrices, although they are still more energy efficient than CPUs and GPUs. The reason for this lies in the structure of the TPU itself. Specifically, the presence of sparsity in the matrices can lead to unutilized PEs within the TPU, thereby reducing its overall computational efficiency.

Figure 8 provides an illustration of how a 6x6 dense matrix is mapped to a 3x3 TPU systolic array. As shown in the figure, the TPU partitions the input matrix directly, based on the shape of the systolic array. This process requires four iterations for execution. It is important to note that even if the matrix were sparse, the TPU would still require four iterations for execution due to its method of partitioning the input matrix. This is because the TPU partitioning approach does not account for the sparsity of the matrix.
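The four iterations in the Figure 8 example follow directly from the blocked mapping: a matrix larger than the array is cut into array-sized tiles, and each tile takes one pass. The hedged C sketch below computes that tile count for the 6x6-matrix-on-3x3-array case; it models only the partitioning arithmetic, not the actual dataflow.

#include <stdio.h>

/* Ceiling division: how many tiles of size `tile` cover `dim` elements. */
static int ceil_div(int dim, int tile) { return (dim + tile - 1) / tile; }

int main(void)
{
    int matrix_rows = 6, matrix_cols = 6;   /* dense 6x6 input    */
    int array_dim   = 3;                    /* 3x3 systolic array */

    int passes = ceil_div(matrix_rows, array_dim)
               * ceil_div(matrix_cols, array_dim);

    /* prints 4: the dense mapping needs four passes, sparse or not */
    printf("iterations = %d\n", passes);
    return 0;
}

Because the partitioning looks only at the shape of the matrix, the count stays at four even when many of the tiled elements are zero, which is exactly the inefficiency the Sparse-TPU targets next.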
VII. SPARSE-TPU

This section discusses the Sparse-TPU (STPU) as presented in the paper "Sparse-TPU: Adapting Systolic Arrays for Sparse Matrices" by Xin He et al. [20].

The Sparse Tensor Processing Unit (STPU) addresses the challenge of efficiently handling sparse matrices. It recognizes that the traditional systolic array architecture, while highly efficient for dense matrices, can be adapted to handle sparse matrices as well.

In a traditional systolic array, such as that of TPUs, the PEs containing zeros in a sparse matrix are unutilized, leading to wasted computational resources and energy. The STPU addresses this inefficiency by repurposing these PEs.

The STPU introduces a comprehensive framework that maps sparse data structures to a 2D systolic-based processor in a scalable manner. This allows the STPU to handle both dense and sparse matrices efficiently, thereby improving the utilization of the PEs and the overall energy efficiency. A key feature of the STPU is its ability to perform column merging when mapping the input matrix onto its systolic array.

Take, for example, Figure 9, which shows how a 6x6 sparse matrix is mapped onto a 3x3 STPU systolic array. The STPU merges columns before mapping them onto the systolic array, enabling computations on the sparse matrix to be completed in just three iterations. Despite this design offering superior energy efficiency compared to the TPU, there are instances where some PEs remain unutilized during computations. This occurs when there is a disproportionate number of non-zero columns across the rows, leading to an increase in unutilized PEs. For example, the value 12 of the sparse matrix required an additional iteration to map to the systolic array, despite the availability of slots in the systolic array in the previous iteration.

Additionally, the STPU handles both sparse matrix-vector (SpMV) and sparse matrix-matrix (SpMM) operations.
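The column-merging idea above can be approximated in software with a simple greedy pass: columns whose non-zero patterns do not collide are packed into the same physical column of the array, which is how a sparse 6x6 matrix can fit in fewer passes than the dense mapping. The C sketch below is a hedged simplification of that packing step; it only groups columns and reports the group count, and does not reproduce the STPU's actual mapping algorithm or its collision-handling details.

#include <stdbool.h>

#define N 6   /* 6x6 sparse matrix, as in the Figure 9 example */

/* Greedy column merging: place each column into the first group whose
   occupied row positions do not overlap with the column's non-zeros. */
int merge_columns(const int m[N][N])
{
    bool occupied[N][N] = {{false}};   /* occupied[g][r]: group g uses row r */
    int  groups = 0;

    for (int c = 0; c < N; c++) {
        int g = 0;
        for (; g < groups; g++) {
            bool collides = false;
            for (int r = 0; r < N; r++)
                if (m[r][c] != 0 && occupied[g][r]) { collides = true; break; }
            if (!collides) break;      /* column c fits into group g        */
        }
        if (g == groups) groups++;     /* no fit found: open a new group    */
        for (int r = 0; r < N; r++)
            if (m[r][c] != 0) occupied[g][r] = true;
    }
    return groups;                     /* merged columns the array must hold */
}

Fewer groups means fewer passes through the array, which is the source of the three-iteration result quoted above, although the real STPU must also handle the collisions that force values such as the 12 in Figure 9 into an extra iteration.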
Fig. 9: Mapping of (a) a sparse matrix of size 6x6 to (b) 3x3 STPU systolic array [16]