
Journal of the Brazilian Computer Society, 2023, 29:1, doi: 10.5753/jbcs.2023.2219

This work is licensed under a Creative Commons Attribution 4.0 International License.

Challenges in High-Performance Computing

Philippe Olivier Alexandre Navaux [Federal University of Rio Grande do Sul | [email protected]]
Arthur Francisco Lorenzon [Federal University of Rio Grande do Sul | [email protected]]
Matheus da Silva Serpa [Federal University of Rio Grande do Sul | [email protected]]

Institute of Informatics, Federal University of Rio Grande do Sul - PO Box 15064, Av. Bento Gonçalves, 9500, 91501-970, Porto Alegre, RS, Brazil

Received: 15 September 2021 • Accepted: 30 March 2023 • Published: 01 August 2023

Abstract. High-Performance Computing, HPC, has become one of the most active computer science fields, driven mainly by the need for the high processing capabilities required by algorithms from many areas, such as Big Data, Artificial Intelligence, Data Science, and subjects related to chemistry, physics, and biology. State-of-the-art algorithms from these fields are notoriously demanding in terms of computer resources, so choosing the right computer system to optimize their performance is paramount. This article presents the main challenges of future supercomputer systems, highlighting the areas that demand the most from HPC servers; the new architectures, including heterogeneous processors composed of artificial intelligence chips, quantum processors, and the adoption of HPC on cloud servers; and the challenges software developers face when parallelizing applications. We also discuss challenges regarding non-functional requirements, such as energy consumption and resilience.

Keywords: High-Performance Computing, Supercomputers, Exascale, Computer Architecture, Parallel Programming

1 Introduction

High-Performance Computing, HPC, has established itself as the area concerned with the machines with the greatest processing power of a given time, represented by supercomputers. Traditionally, two rankings are published every year to indicate which machines in the world have reached the top of this processing power, disclosed in the TOP500 list [Dongarra and Strohmaier, 2020].

The HPC scenario has been changing in recent years, pressured by the processing power requirements of Artificial Intelligence, AI, Machine Learning, ML, and Deep Learning, DL, algorithms, and by the need for this power to learn from the data available in Big Data environments [Verbraeken et al., 2020]. In summary, HPC is not only an area that meets a niche of needs of researchers in the fields of physics, chemistry, and biology, among other specific ones, but an area that needs to meet the demands of society as a whole.

Another, deeper change has occurred: cloud vendors are investing in global networks of massive-scale systems for HPC [Reed et al., 2022]. Simultaneously, one can observe a massive migration of parallel application executions from dedicated HPC servers to cloud environments due to many factors, such as the ease of accessing them over the Internet and of provisioning high-performance architectures on demand while keeping operating costs low. In this scenario, hardware and software technologies have been proposed over the years to allow a sustainable execution of HPC applications in cloud environments.

On top of that, HPC environments are becoming increasingly heterogeneous to support applications' performance and energy demands in scientific areas. In this scenario, most supercomputers are equipped with hardware accelerators and techniques with distinct computing power capabilities, such as graphical processing units, GPUs, field-programmable gate arrays, FPGAs, processing in memory, PIM, and, recently, quantum processors. Hence, several parallel programming libraries and patterns are proposed yearly to get the most performance from such architectures.

Hence, we present in this article a comprehensive discussion regarding: (i) the demands for HPC servers from different fields of science and society; (ii) the main challenges in the HPC area that researchers will face shortly, in hardware and software, to further optimize the execution of applications in terms of performance, energy consumption, and resilience; and (iii) the trends related to the increasing availability of heterogeneous architectures in HPC systems and how to get the most out of each device.

2 The Need for HPC

In this section, we discuss how HPC systems can be employed to optimize the execution of applications widely known in scientific areas.

2.1 Demands from the Big Data area

The term Big Data was used for the first time in 1997, referring to the growing amount of data generated every second in the world in a structured or unstructured way. Michael Cox and David Ellsworth, both working at NASA, wrote the article "Managing Big Data for Scientific Visualization" for the 1997 Conference on Visualization, introducing the concept of Big Data to the academic community [Cox and Ellsworth, 1997].

The amount of data generated doubles roughly every two years and grows exponentially [Desjardins, 2019]. With this evolution, we are reaching the level of yotta-scale data, which corresponds to 10^24 bytes, or 1,000 zettabytes. This amount of

data, also known as Big Data, comes from different sources (e.g., sensors on the Internet of Things) and needs treatment by non-traditional systems for manipulating, analyzing, and extracting information from the dataset. However, only around 20% of this extracted information would be helpful. Therefore, Big Data is a collection of large and complex datasets that becomes difficult to process using database management tools; it is often a collection of legacy data. In addition to dealing with new ways of managing data, the Big Data area needs great computing power to process this data. Furthermore, we must remember that data are numbers and codes without treatment. Information, on the other hand, is processed data; it is the processing of data that creates meaning. Finally, knowledge is knowing about a specific subject; it is having an application for information.

It is also important to mention that knowledge about information is power nowadays. It always has been, but with the advent of large amounts of data and the ability to process them, it became possible to conduct analysis and decision-making. Information has become the most critical asset. Companies that hold data and have the power to process and analyze it hold today a capacity many times greater than that of governments. Furthermore, there are no frontiers for information, and these companies are capturing data from almost all over the world.

In summary, machines with high processing power, such as supercomputers, are needed to assist in data storage and extraction in the Big Data era.

2.2 Demands from the Artificial Intelligence Area

According to the US Department of Energy's "AI for Science" report [Stevens et al., 2020], new Artificial Intelligence techniques will be indispensable to support the continued growth and expansion of Science infrastructure through Exascale systems. The experience of the scientific community, the use of Machine Learning, simulation with HPC, and data analysis methods allowed a unique and new growth of opportunities for Science, discoveries, and more robust approaches to accelerated Science and its applications for the benefit of humanity.

The convergence of HPC with AI allows simulation environments to employ deep reinforcement learning in various problems, such as simulating robots, aircraft, autonomous vehicles, etc. DL techniques accelerate simulations by replacing models in climate prediction, geoscience, pharmaceuticals, etc. New frontiers in physics are being reached by increasing the application of Partial Differential Equations (PDE) with DL for simulations.

On the other hand, one major bottleneck in using ML and DL algorithms is the learning phase before their use. Depending on the available computing infrastructure, this activity can take weeks and even months. That is where high-performance processing accelerates this step.

In the end, due to the HPC demands from the AI area, companies are investing in parallel frameworks to optimize the execution of AI software. As an example, NVIDIA provides different solutions so that users can take advantage of GPUs optimized for AI algorithms: TensorFlow, an open-source platform for Machine Learning that provides tools and libraries to allow easy deployment of codes across heterogeneous devices; PyTorch, a GPU-accelerated tensor computational framework; and TorchANI, an implementation of the Accurate Neural Network Engine for Molecular Energies. Another example is the Intel Neural Compressor tool that helps software developers easily and quickly deploy inference solutions on popular deep learning frameworks, including TensorFlow and PyTorch.

Therefore, one of the challenges software developers will face in the convergence of HPC with AI is how to efficiently use the available hardware devices to get the most out of all the provided frameworks. Simultaneously, hardware vendors have the challenging task of designing optimized processors for AI computing, as we discuss in Section 3.3.

2.3 Demands from Data Science and Related Areas

In conjunction with advancements in supercomputing, the rise of data generation technologies has resulted in the convergence of HPC and data science applications. The widespread adoption of high-throughput data generation technologies, scalable algorithms and analytics, and efficient methods for managing large-scale data has established scalable Data Science as a central pillar for accelerating scientific discovery and innovation [Xenopoulos et al., 2016].

Data science can be defined as a field that combines different areas (e.g., math, statistics, artificial intelligence, machine learning, and specialized programming) with subject-matter expertise to uncover insights that are hidden in an organization's data. The lifecycle of data science involves tools, roles, and processes that usually undergo the following steps: data collection, data storage and processing, data analysis, and the presentation of reports and data visualizations that communicate the insights [Kelleher and Tierney, 2018].

All steps are accelerated with the employment of HPC systems and can be used to provide insights beneficial to society: fraud and risk detection, search engines, advanced image recognition, speech recognition, and airline route planning. In summary, even with the use of HPC to optimize the execution of data science algorithms, one of the main challenges faced by the area is to use hardware resources better to reduce the training cost of learning algorithms.

3 HPC Architectures and Processors Challenges

The last section (Section 2) showed how advances in scientific areas will demand more processing power to run the models, manage the files, and extract the data, among other demands. Hence, this section presents the main hardware challenges that HPC needs to overcome to deliver results on time in such areas.

Figure 1. Fugaku [Fujitsu, 2021], Frontier [Frontier, 2021], and El Capitan [LLNL, 2021] supercomputers.

[Table reproduced from the DoE "ASCR Computing Upgrades At-a-Glance" overview (November 24, 2020), comparing the current, pre-exascale, and exascale systems at ALCF, NERSC, and OLCF — Theta, Cori, Summit, Perlmutter, Polaris, Frontier, and Aurora — in terms of installation year, system peak performance, peak power, total system memory, node performance, node processors, system size, interconnect, and file system.]

Figure 2. Evolution of supercomputers

3.1 New Generation of Processors and Accelerators

Several proposals for parallel architectures have emerged recently, as depicted in Figures 1 and 2, which should shake and change the processor market. Next, we highlight the supercomputers with the highest computing power currently.

Fujitsu designed the Arm-based A64FX processor that allowed the Fugaku supercomputer [Fujitsu, 2021], in Japan, to remain for two years (2020 and 2021) as the fastest computer on the TOP500 list, reaching a performance of 442 Petaflops (Figure 1).

The first supercomputer to deliver performance greater than 1 Exaflop was Frontier (Figure 1), from the Oak Ridge laboratory, USA. It was designed by Cray-HPE with AMD EPYC processors and Radeon accelerators and reached a performance of 1.1 Exaflops [Frontier, 2021], according to the TOP500 list.

After breaking the exaflop performance barrier, it is natural that the number of supercomputers achieving such a performance level increases. In this scenario, Aurora (Figure 1) is a supercomputer from Argonne's laboratory that is being developed with Intel processors adopting the Ponte Vecchio architecture, which integrates the Xeon processor and the Xe accelerator on the same chip [Aurora, 2021].

Also, the El Capitan supercomputer (Figure 1) from the Lawrence Livermore National Laboratory, LLNL, designed with AMD EPYC processors (code-named Genoa, based on the Zen 4 core) and AMD Radeon Instinct GPUs, is expected to exceed 2 Exaflops [LLNL, 2021].

In addition to these machines, other Exascale systems are planned to be delivered in the next few years. The DoE, the Department of Energy of the USA, financed national computing facility centers to receive new systems with distinct architectures from different companies. Figure 2 summarizes the evolution of the subsequent machine installations, starting with Perlmutter at NERSC - Berkeley, Polaris at ALCF - Argonne, Frontier at OLCF - Oak Ridge, and Aurora at ALCF - Argonne. In other countries, there are also Exascale machines to be delivered, such as in China, with the evolution of the Sunway and the Tianhe, and in Japan, with the new Fugaku.

It is worth mentioning that the architectures of these processors and machines are increasingly heterogeneous, allowing the execution of applications to be processed on the device with the best performance.

3.2 Heterogeneous Architectures

The new processors, which are already appearing in the market, have more and more heterogeneous architectures. In HPC systems, processing units with different computing capabilities are combined to provide better performance and energy efficiency when compared to homogeneous systems (e.g., CPU-only).

In this scenario, besides traditional CPU-GPU and FPGA-based systems, the following heterogeneous architectures will become popular in the near future in HPC (Figure 5). Heterogeneous memory systems employ a combination of memories, such as DRAM and SRAM, to improve the trade-off between performance and energy consumption in data-intensive workloads. Neural Processing Units, NPUs, are processors designed to speed up AI workloads. Quantum processors are in the early stages of development but have the potential to perform calculations exponentially faster than classical computers [Fu et al., 2016].

In the market, one can find systems composed of CPUs and GPUs on the same chip [Dávila et al., 2019], such as those from Intel (Ponte Vecchio) and AMD (APUs). Another possibility is the SoC, System on Chip, where the same chip (Figure 3) holds processors, accelerators, memory, and the I/O system. These chips are generally used in sensors, Internet of Things, IoT, and edge computing environments. In addition to this on-chip option, heterogeneity is possible at the board level, where the processor coexists with the GPU or the FPGA (Intel A10).

Figure 3. NVIDIA Tegra K1 chip integrating processors with GPUs in an SoC.

In this increasingly heterogeneous environment with different accelerators, the programming model must change from serving a specific type of accelerator to being prepared as a programming environment that suits different accelerators [Vetter et al., 2022]. For example, with the emergence of GPUs from AMD and Intel, new parallel programming interfaces should be used instead of the CUDA language, which is proprietary to NVIDIA.
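To make the idea of a single programming environment for different accelerators concrete, the minimal C sketch below uses OpenMP target offload directives, one of the vendor-neutral options alongside OpenCL, OpenACC, and oneAPI. It is an illustrative example, not code from any cited system, and it assumes a compiler built with offloading support; without it, the loop simply runs on the host CPU.

```c
#include <stdio.h>
#include <stdlib.h>

/* Minimal sketch: vector addition offloaded to whichever accelerator the
 * OpenMP runtime selects (a GPU if available, otherwise the host CPU). */
int main(void) {
    const int n = 1 << 20;
    double *a = malloc(n * sizeof(double));
    double *b = malloc(n * sizeof(double));
    double *c = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) { a[i] = i; b[i] = 2.0 * i; }

    /* map clauses copy the inputs to the device and the result back */
    #pragma omp target teams distribute parallel for \
            map(to: a[0:n], b[0:n]) map(from: c[0:n])
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];

    printf("c[42] = %f\n", c[42]);
    free(a); free(b); free(c);
    return 0;
}
```

The same directive-based source can thus be retargeted to GPUs from different vendors by switching compilers or offload back-ends, which is precisely the portability concern raised above.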
3.3 AI chips

In the upcoming years, AI is set to play a crucial part in many fields, specifically in national and international security [Khan and Mann, 2020]. However, general-purpose AI software, datasets, and algorithms are not effective when running on traditional HPC systems, so the focus has shifted towards computer hardware specialized to execute modern AI applications.

This hardware is called an "AI chip", which may include accelerators such as GPUs, FPGAs, and application-specific integrated circuits, ASICs, that are specialized for AI. Even though general-purpose processors (e.g., CPUs) can be employed to execute simple AI tasks, they are becoming less useful as AI advances. Hence, AI chips have optimized design features that accelerate the calculations required by AI algorithms, e.g., a large number of parallel calculations; the use of mixed precision (discussed in Section 5.3); speeding up memory accesses; and providing programming languages to efficiently translate AI code for execution on AI chips.

In this scenario, one of the main challenges is to rightly choose the AI chip that will execute a given AI algorithm. GPUs are often used for the training step. FPGAs efficiently perform the inference phase. On the other hand, ASICs can be used to execute both phases.

At the 2020 SuperComputing conference, two new AI chips were announced, Cerebras [Rocki et al., 2020] and SambaNova [2021]. Cerebras chips employ the second generation of the Wafer Scale Engine, WSE-2, a central processor for deep learning and sparse tensor operations that contains 2.6 trillion transistors, 850,000 AI-optimized cores, and 40 gigabytes of high-performance on-wafer memory. Figure 4, extracted from [Cerebras, 2021], compares the Cerebras WSE-2 chip with the largest GPU at the moment.

Figure 4. Comparison of Cerebras WSE-2 with the largest GPU at the moment.

The SambaNova chip has the Reconfigurable Dataflow Architecture, RDA [SambaNova, 2021], facilitating machine learning and HPC convergence. The RDA provides a flexible dataflow execution model for all types of dataflow computation problems. Furthermore, it does not have a fixed Instruction Set Architecture (ISA) but is programmed for each model.

In summary, AI chips with specialized hardware devices will become widespread in the next few years, which will likely change how AI algorithms are executed. Therefore, one can envision challenges regarding the adoption of such chips on traditional HPC systems and the emergence of new AI frameworks.

3.4 Aware Computing

In recent years, there has been a growth of different types of aware computing in HPC systems. When it is applied to the execution of applications, the hardware and software knobs (e.g., number of threads, cores, processor frequency, and the amount of memory used) are adapted, according to a given optimization heuristic, to the best configuration, so that the outcome in non-functional metrics (i.e., performance, energy, and power consumption) is improved. Consequently, this is changing the complexity of hardware and software behavior.

With the need for reaching more performance with every new generation of supercomputers, techniques like power- and energy-aware computing have become widely used. The reason is that such HPC systems must improve performance without increasing power and energy consumption, so the costs of cooling and electricity infrastructure do not grow. This type of aware computing employs strategies to adapt the hardware and software to keep power and energy demands below a safety threshold. A popular technique widely used in hardware is dynamic voltage and frequency scaling, which automatically manages hardware components' operating frequency and voltage levels based on their workload usage. Another technique is thread throttling, where the number of running threads is adjusted according to the thread-level parallelism degree of a given application.
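As a hedged illustration of the thread-throttling idea (not a technique taken from the cited works), the C sketch below probes a few thread counts for a hypothetical OpenMP kernel and keeps the fastest configuration; a real runtime system would typically also weigh power or energy readings into the decision.

```c
#include <omp.h>
#include <stdio.h>

#define N 10000000L
static double x[N];

/* Hypothetical compute kernel used only to illustrate the idea. */
static void kernel(int nthreads) {
    #pragma omp parallel for num_threads(nthreads)
    for (long i = 0; i < N; i++)
        x[i] = x[i] * 0.5 + 1.0;
}

int main(void) {
    int best = 1;
    double best_time = 1e30;
    /* Thread throttling: probe a few thread counts and keep the fastest,
     * instead of always using all available cores. */
    for (int t = 1; t <= omp_get_max_threads(); t *= 2) {
        double start = omp_get_wtime();
        kernel(t);
        double elapsed = omp_get_wtime() - start;
        if (elapsed < best_time) { best_time = elapsed; best = t; }
    }
    printf("selected %d threads (%.3f s)\n", best, best_time);
    kernel(best);   /* continue the application with the chosen thread count */
    return 0;
}
```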
Similarly, since new HPC applications from scientific fields increasingly require more memory space to store data, memory-aware computing is becoming popular. In this type, the design of a computer system is optimized for memory performance, taking into account the increasing gap between processor speed and memory latency. Memory-aware computing involves data compression, prefetching, memory hierarchy optimization, and efficient data placement. This type of computing is particularly challenging for applications that rely heavily on memory, such as big data processing, machine learning, and scientific simulations. One example is High Bandwidth Memory, HBM, pioneered by AMD, designed to provide high-performance memory for graphics and other memory-intensive applications.

Other types of aware computing are also becoming important in HPC servers. In network-aware computing, the network performance is optimized with load balancing, network topology optimization, and congestion control. Security-aware computing focuses on ensuring the security of systems. Furthermore, one can also find data-aware and user-aware computing, where the former focuses on efficient management of data and the latter is concerned with providing personalized user experiences.

3.5 Processing In Memory

Processing In Memory, PIM, is a way to execute instructions in memory without the traditional movement of data to the processor [Lee et al., 2021]. With that, the time lost in data transfer is eliminated, which is the main benefit of this technique. Research on PIM has been growing, and some chips are appearing on the market. However, even though it can allow data-intensive applications to avoid moving data from memory to the CPU, new challenges are introduced for HPC system architects and software developers.

Even though many challenges in designing PIM architectures have been overcome during the last years, some points are still open [Ghose et al., 2019]. One of them relies on how HPC programmers can extract the benefits of PIM without resorting to complex programming models. Another challenge is understanding the constraints of distinct substrates when designing PIM logic. Therefore, developing a PIM programming model, data mapping, and runtime scheduling for PIM are challenging topics that should be addressed before most HPC developers can employ PIM.

High-tech companies (e.g., Samsung, Micron, and Synopsys) have started developing memories with computing capabilities. In the end, they believe that memory and AI computing will converge into the same architecture, making AI-based memory chips.

3.6 Quantum Computing

With quantum processors becoming a reality, it is possible to imagine heterogeneous machines with quantum processing units. The prospect of atom-sized devices built with germanium transistor techniques [Hendrickx et al., 2021] is enabling the arrival of qubit processors on the market. With this, future architectures would have x86 processor units alongside qubit units (Figure 5).

Figure 5. Future of heterogeneous architectures with quantum processors - based on Fu et al. [2016].

In this perspective, quantum processors will be used as accelerators at first. They will solve problems especially in security, cryptography, meteorology, pharmaceuticals, biotechnology, and economic models.

In the subsequent years, quantum computing will not substitute classical computing [Matsuoka et al., 2023]. Instead, quantum computing will help to solve complex problems that take a lot of time with classical computers. Examples of problems that will benefit from it can be found in modeling proteins or molecular simulations, developing new robust encryption, processing data from accelerators, like CERN, and other complex problems.

In recent years, many quantum machines have been announced, like the Zuchongzhi with 66 qubits, Google with

its 54-qubit Sycamore processor, and IBM's 14th quantum computer model with 53 qubits. However, as scientists foresee practical quantum computing, machines will need nearly a thousand qubits. In this scenario, IBM has the ambitious goal of building a quantum computer containing 1000 qubits by 2025.

In the end, one of the big challenges of quantum machines is correcting the myriad errors that usually come with quantum operations, so researchers focus on achieving a lower error rate for the entire system. Quantum computing will be the next frontier in the changes in processing capacity, which will allow solving problems in Science that are currently unfeasible to solve on time.

3.7 Cloud Computing

With the increasing demand for HPC in several areas, as mentioned in Section 2, the big cloud computing providers have become interested in providing these facilities. The emergence of instances with more powerful processors, GPUs, FPGAs, and improved interconnection systems within the cloud allows users to instantiate a set of machines able to meet a demand for greater processing power.

In this scenario, HPC as a Service (HPCaaS) is becoming employed. Figure 6, extracted from [Paillard et al., 2015], shows an example of infrastructure for executing parallel workloads on the cloud, where end users submit their jobs through the Internet, and the cloud environment (HPCaaS) is responsible for allocating machines and deploying the jobs for execution on HPC systems.

On the websites of the leading cloud providers, there are advertisements such as "High-Performance Computing on AWS Redefines What is Possible," "Cray in Azure - a dedicated supercomputer on your virtual network," "Build your high-performance computing solution on IBM Cloud," and "Google Cloud - HPC in the cloud becomes a reality," clearly showing the importance and interest these companies place on offering HPC in the cloud as an essential and growing part of HPC.

The initial cloud performance issues are gradually being overcome, allowing satisfactory results when running HPC applications in this environment. The increased use of better and faster connections, the availability of new-generation processors, and better storage administration allow enough processing speed to execute complex applications in the cloud. In the end, cloud servers must have a scalable and cost-effective infrastructure for running HPC applications, given their capability of provisioning resources on demand over the Internet.

On top of that, we also need to pay attention to Serverless Computing, a new paradigm for developing cloud applications. It is a new level of virtualization, containers, and Function as a Service, FaaS, reducing complexity for the user [Baldini et al., 2017].

3.8 Moving to Zettascale Computing

In a paper from Liao et al. [2018], the authors suggested that, by 2035, we will have HPC systems with Zettascale computing (10^21 floating-point operations per second). They expect the evolution to go from machines with 2 to 3 Eflops in the 2025-2030 period, with performance scaling to 50-80 Eflops, and the last step ending in 2035 with the Zflops machines.

To reach this level of computing, the authors list the expected technical metrics, presented in Table 1. The computing system will have a power efficiency of 10 Tflops/W for a total power consumption near 100 MW for the whole system. It is interesting to observe that the estimated performance per node will be 10 Pflops and the bandwidth between nodes 1.6 Tb/s. The storage capacity of the whole system will be around 1 ZB (zettabyte).

Table 1. Metrics for a Zflops machine [Liao et al., 2018]
Peak performance: 1 Zflops
Power consumption: 100 MW
Power efficiency: 10 Tflops/W
Peak performance per node: 10 Pflops/node
Bandwidth between nodes: 1.6 Tb/s
I/O bandwidth: 10-100 PB/s
Storage capacity: 1 ZB
Floor space: 1000 m²

4 Programming Challenges in HPC

This section discusses the impact of design decisions regarding the implementation of parallel applications that run on top of HPC servers. For that, we start by describing parallel algorithm design patterns and the parallel programming interfaces that can be used to extract the most from HPC systems. Furthermore, given the importance of memory on the application execution behavior, we discuss techniques to optimize data and thread locality.

4.1 Parallel Algorithms Design Patterns

Software developers can employ different communication models when parallelizing applications to ensure the cooperation between the cores that execute concurrently, such as shared memory and message passing. The former is based on the existence of an address space in memory that all processors can access. It is widely used when parallelism is exploited at the thread level, as threads share the same memory address space. On the other hand, message passing is employed in environments where the memory space is distributed and/or processes do not share the same memory address space. In summary, the challenge is to use the programming model that best benefits from the target architecture: while shared memory delivers better outcomes for multicore and many-core processors, message passing is most suitable for large computers whose nodes communicate via an interconnection link.

In addition to the communication model, another challenge is choosing the parallel programming style to extract parallelism from sequential codes. For many years, the fork-join model has been the style most used by software developers because of its ease in exploiting parallelism.

Figure 6. HPC as a Service on Cloud Computing - Paillard et al. [2015]

Figure 7. The fork-join shared-memory programming model.

When employing the fork-join model, as shown in Figure 7, the master thread (represented by the orange rectangle) starts the execution of the sequential phase. When it reaches a parallel region, a team of threads is created to execute the parallel region concurrently (fork operation). Then, at the end of the region, all threads perform a join operation to synchronize. From this moment on, only the master thread executes the application binary until reaching another parallel region. Considering that this model is widely employed when parallelizing applications that run on top of multicore systems and that processor companies (e.g., AMD, ARM, Intel, and NVIDIA) have been releasing processors with more and more cores, the challenge is the development of algorithms that can get the most out of the architectures. In summary, the main point is to scale the number of threads without losing performance and energy efficiency.
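A minimal OpenMP example of the fork-join flow described above (illustrative only): the master thread runs the sequential phases, and a team of threads is forked for the parallel region and joined at its implicit barrier.

```c
#include <omp.h>
#include <stdio.h>

int main(void) {
    printf("sequential phase: master thread only\n");

    /* fork: a team of threads executes the parallel region */
    #pragma omp parallel
    {
        int id   = omp_get_thread_num();
        int team = omp_get_num_threads();
        printf("parallel phase: thread %d of %d\n", id, team);
    }   /* join: implicit barrier, then only the master continues */

    printf("sequential phase again: master thread only\n");
    return 0;
}
```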

However, the rigid execution model and lack of malleability of parallel applications implemented with the fork-join style may not deal well with some hardware and software aspects (e.g., data synchronization and cache contention) when there is variability in the application behavior or execution environment, preventing linear performance improvements. When a scenario like this arises, rigid fork-join-based implementations can increase power consumption and jeopardize the performance of parallel applications. With that, one can note a popularization of task-based programming models, as discussed in the following section, which can overcome fork-join limitations by providing more malleability and better load balancing on multicore systems. However, the challenge software developers face when exploiting task-based parallelism is defining the data dependencies between different parallel regions, which may limit the gains with parallelization.

Regardless of the programming model, when exploiting parallelism, software developers can employ design patterns to help in this task, such as Map, Stencil, and Reduction, which are the most used nowadays. The Map pattern, shown in Figure 8, divides the workload (e.g., an array, list, or other related collection) into independent parts that can run in parallel with no data dependency, representing a parallelization referred to as embarrassing parallelism [Voss et al., 2019]. A function is applied to all collection elements, usually producing a new collection with the same shape as the input. Moreover, the number of iterations and the inputs may be the same and known in advance. For the Map pattern, the challenge with new computer architectures will be the need for a heavy function to be applied to each element of a collection and a high number of iterations to fill the processor's cores.

Figure 8. Map pattern example, where a function is applied to all elements of a collection, producing a new collection with the same shape as the input [Voss et al., 2019].

A generalization of the Map is the Stencil pattern, which focuses on combining and applying functions on a set of neighbors of each matrix point [Voss et al., 2019]. Figure 9 exemplifies a stencil operation over one element of the matrix, resulting from the computation of the values of its neighbors. Software developers face a critical challenge in handling boundary conditions in this scenario due to the communication overhead between threads/processes.

The Stencil pattern has good performance scalability on GPUs, as they are organized so that thousands of threads can be executed simultaneously. In the end, the main challenge will be to balance the memory operation time with the computation time, in the sense that the loading of the neighbors should be faster than the computation applied to them.

Figure 9. Stencil computation example [Voss et al., 2019].
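The following is a small, self-contained sketch of a 5-point stencil in C with OpenMP (it is not taken from [Voss et al., 2019]); the untouched boundary rows and columns are exactly where the boundary-condition handling mentioned above becomes an issue when the grid is split among threads or processes.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define N 1024   /* grid size, including a one-cell boundary */

/* One Jacobi-style sweep of a 5-point stencil: each interior point becomes
 * the average of its four neighbors. Boundary rows/columns are not updated,
 * which is where boundary-condition handling appears. */
static void stencil_sweep(const double *in, double *out) {
    #pragma omp parallel for collapse(2)
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++)
            out[i * N + j] = 0.25 * (in[(i - 1) * N + j] + in[(i + 1) * N + j] +
                                     in[i * N + j - 1]  + in[i * N + j + 1]);
}

int main(void) {
    double *grid = calloc(N * N, sizeof(double));
    double *next = calloc(N * N, sizeof(double));
    for (int j = 0; j < N; j++) grid[j] = 100.0;   /* hot top boundary */
    memcpy(next, grid, N * N * sizeof(double));

    for (int it = 0; it < 100; it++) {             /* alternate the buffers */
        stencil_sweep(grid, next);
        double *tmp = grid; grid = next; next = tmp;
    }
    printf("center value: %f\n", grid[(N / 2) * N + N / 2]);
    free(grid); free(next);
    return 0;
}
```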
There are also many applications where many threads/processes simultaneously apply the same operation over different data. In this case, the Reduction pattern combines every element calculated by each thread/process using an associative function, called the combiner function [Voss et al., 2019]. Figure 10 shows an example of a reduction operation executing in parallel, where pairs of elements are combined simultaneously until the final result is reached. Even though it achieves satisfactory performance results when combining values computed by each thread, its implementation gets more complex, as the number of reductions after each iteration changes at every step until the final result is reached. Therefore, software developers face the challenge of coordinating the computation during each reduction operation to maximize the usage of hardware resources.

Figure 10. Parallel reduction, where threads produce subresults that are combined to produce a final single answer [Voss et al., 2019].
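A minimal illustration of the Reduction pattern, assuming OpenMP's reduction clause as the combining mechanism: each thread accumulates a private partial sum, and the runtime combines the partial results with the associative "+" operator, much like the pairwise combination shown in Figure 10.

```c
#include <stdio.h>
#include <stdlib.h>

/* Parallel sum with OpenMP's reduction clause. */
int main(void) {
    const long n = 10000000;
    double *v = malloc(n * sizeof(double));
    for (long i = 0; i < n; i++) v[i] = 1.0 / (i + 1);

    double sum = 0.0;
    #pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < n; i++)
        sum += v[i];   /* each thread keeps a private partial sum */

    printf("harmonic sum = %f\n", sum);
    free(v);
    return 0;
}
```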
4.2 Parallel Programming Libraries

With the emergence of modern architectures that present different computing capabilities, many programming languages and libraries to help software developers exploit parallelism were introduced in the literature. We illustrate in Figure 11 the evolution of the number of citations on the Scopus database of the most used parallel programming languages over the years and discuss each one next.

As observed, the Message-Passing Interface, MPI, has established itself over the years as one of the main alternatives for exploiting parallelism. This popularity happens due to the need for a library to create processes and manage communication between them in distributed memory environments, which are found in every HPC system. Given that, along with the evolution of MPI [Gabriel et al., 2004] (e.g., MPI-1, MPI-2, and MPI-3) and of the technology available in HPC systems, the challenges are related to (i) balancing communication and computation between computing nodes to better use hardware resources; (ii) efficiently using asynchronous communication to overlap communication and computation; and (iii) ensuring fault tolerance mechanisms.
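As a sketch of challenge (ii), the example below overlaps communication and computation with nonblocking MPI calls; it is a toy ring exchange written for illustration, not a recipe from the MPI specification, and the buffer size and message tag are arbitrary choices.

```c
#include <mpi.h>
#include <stdio.h>

/* Each rank exchanges a halo buffer with its neighbors using nonblocking
 * calls, so that local computation can proceed while messages are in flight. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double send[1024], recv[1024], local[1024];
    for (int i = 0; i < 1024; i++) { send[i] = rank; local[i] = i; }

    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;

    MPI_Request reqs[2];
    MPI_Irecv(recv, 1024, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(send, 1024, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* overlap: work on data that does not depend on the incoming halo */
    double acc = 0.0;
    for (int i = 0; i < 1024; i++) acc += local[i];

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);   /* halo is now usable */
    printf("rank %d received halo from rank %d (acc=%.0f)\n", rank, left, acc);

    MPI_Finalize();
    return 0;
}
```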
Simultaneously, with the increase in the number of cores in multicore architectures, there was a popularization of libraries that exploit parallelism in shared memory (for example, OpenMP and POSIX Threads). The POSIX Threads library, a lightweight thread implementation, was the first used to create parallel applications [Barney, 2009]. However, due to its complexity and the need for the programmer to deal with different aspects of the operating system, it is no longer used for this purpose; it continues to be used only for developing applications aimed at the operating system.

In this scenario, the OpenMP library replaced the POSIX Threads library. OpenMP consists of compiler directives, library functions, and environment variables that ease the burden of managing threads in the code [Chandra et al., 2001]. Therefore, extracting parallelism using OpenMP usually requires less effort when compared to POSIX Threads. Parallelism is exploited by inserting directives in the code that inform the compiler how and which parts of the application should be executed in parallel. With the increasing number of cores and the complexity of the memory hierarchy in shared-memory architectures, the challenges when using OpenMP will be related to (i) finding the optimal number of threads to execute each parallel region; (ii) defining thread and data placement strategies that reduce cache contention and last-level cache misses; and (iii) efficiently employing directives for heterogeneous computing.
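A short sketch of how the knobs behind challenges (i) and (ii) are usually expressed in OpenMP; the thread count and binding policy shown are illustrative defaults, since the right values are application- and machine-dependent.

```c
#include <omp.h>
#include <stdio.h>

/* The number of threads is set per region and threads are pinned close to
 * each other (and to their data). Typically combined with environment
 * settings such as: OMP_PLACES=cores OMP_PROC_BIND=close ./a.out */
int main(void) {
    #pragma omp parallel num_threads(8) proc_bind(close)
    {
        #pragma omp single
        printf("region runs with %d threads, bound close together\n",
               omp_get_num_threads());
    }
    return 0;
}
```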
Furthermore, other libraries, such as Cilk Plus [Schardl et al., 2018], developed by MIT and later maintained by Intel, emerged to facilitate vector and task programming; however, due to the evolution of the OpenMP library, they also ceased to be used. In the last years, OmpSs-2 has emerged as an alternative to OpenMP when exploiting parallelism through tasks, by providing more malleability and better load balancing on multi-core systems.

The scenario dominated by OpenMP and MPI drastically changed with the popularization of GPU architectures. From then on, CUDA became one of the most used parallel programming interfaces to accelerate multi-domain applications on NVIDIA GPUs [Sanders and Kandrot, 2010].

Figure 11. Number of citations of the foremost parallel programming libraries (OpenMP, MPI, CUDA, OpenCL, and Intel oneAPI) over the years in the Scopus database.

The CUDA library consists of a set of extensions for C, C++, and FORTRAN that allows the programmer to create kernels, functions that can run on NVIDIA GPUs [Cook, 2012]. In addition to this library, OpenCL also emerged as a framework for heterogeneous computing, allowing the software developer to implement applications that run on both multicore processors and graphics cards from any vendor [Munshi et al., 2011]. Finally, the OpenACC library has become popular, as it looks similar to the OpenMP library when exploiting parallelism [Farber, 2016]. It also uses pragmas and makes programming for GPUs fast and straightforward.

Considering the imminent increase in the use of heterogeneous architectures by end users, Intel released oneAPI at the end of 2018. It simplifies software development by providing the same programming languages and models across all accelerator architectures. oneAPI seeks to provide software developers with source-level compatibility, performance transparency, and software stack portability. In this scenario, the challenge will be defining the ideal architecture to execute each piece of parallel code without jeopardizing the performance of the entire HPC system.

Current and efficient parallel applications employ more than one parallel programming interface. The MPI library is generally used for message exchange between different computational nodes, while the OpenMP and CUDA/OpenACC libraries exploit the use of multiple processor cores and GPUs. The trend is the emergence and evolution of compilers, both in automatic parallelization and in data locality optimization.

5 Other Challenges

Given the increasing heterogeneity of hardware resources and the availability of frameworks and libraries to exploit parallelism in HPC servers shown in the last sections, other challenges become important to be addressed. Hence, we discuss in this section the trends related to energy consumption, that is, what one should care about in HPC systems to improve their energy efficiency. Similarly, new HPC environments will require more resilience to reduce the impact of software and hardware failures, as discussed in Section 5.2. We also discuss two alternatives to optimize the energy efficiency of HPC servers, mixed precision and data locality, in Sections 5.3 and 5.4, respectively.

5.1 Energy Demand

HPC machines' growing demand for processing capabilities has led manufacturers to combine thousands of processors in clusters, which generates high power consumption. In this scenario, some machines from the TOP500 list consume around 30 MW, corresponding to a city of approximately 300,000 inhabitants. Consequently, machine builders and processor manufacturers aim to optimize architectures so that consumption decreases. These optimizations include changing the processor architecture, managing non-active units, and reducing the processor operating frequency, among other techniques [Padoin et al., 2019]. Today, new machines are expected to increase instruction throughput without increasing energy consumption, although sometimes the opposite occurs.

Figure 12, extracted from the analysis of a modern multicore processor with the WattWatcher tool when executing well-known parallel workloads [LeBeane et al., 2015], clearly demonstrates one of the challenges for the evolution of processor architectures. It shows that the percentage of energy dedicated to the effective processing and execution of instructions is about 17% of all energy (OoO and ALU). In comparison, about 34% is spent on static energy (leakage), the energy the circuit consumes even with nothing running. Furthermore, about 35% is spent on accesses to the different levels of cache memory and 11% on registers. In the end, even if these percentages improve during the next few years, the energy invested in the final goal, which is the execution of the instruction, is small compared to that spent in other parts of the processor's operation. Therefore, this is a significant challenge to be addressed in future processor architectures.

Figure 12. Distribution of energy consumption in a core.

5.2 Resilience

Resilience deals with the ability of a system to continue operating in the presence of failures or performance fluctuations. In supercomputers, which process high-performance applications and have thousands of cores, memories, and circuits connecting them, the probability of a failure in any of their elements becomes more prominent. Therefore, resilience is one of the challenges to be faced in order to continue delivering high-performance execution of scientific applications while keeping infrastructure-related costs low.

Analysis over the last years shows that newer HPC systems will be much less reliable as a consequence of three main factors: the increasing complexity of the system design and the number of hardware resources (e.g., cores and memories); the decreasing device dependability; and the shrinking process technology. Figure 13 shows the mean time between failures (MTBF) of distinct HPC servers with different numbers of nodes and cores per node, extracted from [Gupta et al., 2017]. As can be observed, the greater the number of nodes in the HPC system, the shorter the mean time (in hours) between failures. This means that newer supercomputers with thousands of processors and nodes will have servers crashing every day, hence the importance of resilience to keep the machine running.

Resilience is achieved through hardware and software that dynamically detect a failure, diagnose it, reconfigure, and repair it, allowing processing to continue without user awareness. Therefore, this area becomes essential in future machines to meet the needs of high-performance processing and will enable this processing to run until results are obtained, without interruption.
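One common software-level building block for this kind of resilience is application-level checkpoint/restart. The C sketch below is a simplified illustration (not the mechanism used by the systems in Figure 13): it periodically saves the iteration counter and state vector so a restarted job can resume from the last checkpoint; the file name and checkpoint interval are arbitrary choices.

```c
#include <stdio.h>
#include <stdlib.h>

#define N 1000000

/* Try to load a previous checkpoint; returns 1 on success. */
static int load_checkpoint(double *state, long n, int *iter) {
    FILE *f = fopen("state.ckpt", "rb");
    if (!f) return 0;
    int ok = fread(iter, sizeof(int), 1, f) == 1 &&
             fread(state, sizeof(double), n, f) == (size_t)n;
    fclose(f);
    return ok;
}

/* Write state to a temporary file, then atomically replace the old one. */
static void save_checkpoint(const double *state, long n, int iter) {
    FILE *f = fopen("state.ckpt.tmp", "wb");
    if (!f) return;
    fwrite(&iter, sizeof(int), 1, f);
    fwrite(state, sizeof(double), n, f);
    fclose(f);
    rename("state.ckpt.tmp", "state.ckpt");
}

int main(void) {
    double *state = calloc(N, sizeof(double));
    int start = 0;
    if (load_checkpoint(state, N, &start))
        printf("resuming from iteration %d\n", start);

    for (int it = start; it < 1000; it++) {
        for (long i = 0; i < N; i++) state[i] += 1.0;  /* stand-in for real work */
        if (it % 100 == 0) save_checkpoint(state, N, it + 1);
    }
    free(state);
    return 0;
}
```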

Figure 13. Mean time between failures (in hours) on different HPC servers (two IBM AC922 systems, EoS Cray XC30, Jaguar XT4, Jaguar XK6, and Titan XK7) [Gupta et al., 2017].
5.3 HPC and Mixed Precision

The amount of energy and time spent on high-precision operations induces researchers to optimize operations considering the precision actually needed in each calculation. Lowering the precision can save cost, time, and energy. The proper reduction of accuracy is the new challenge in application execution.

Mixed-precision architectures usually support two or more floating-point precisions for arithmetic operations and allow the reduction of storage, energy, and computational requirements. By reducing the precision of some data and arithmetic operations of a problem, it is possible to trade off the quality of the result for the performance and energy efficiency of the execution [Freytag et al., 2022].

Beyond precision reduction, there is also Approximate Arithmetic, which works with simplified arithmetic units that are less expensive in area, and thus in energy, and whose results are less precise.
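The toy C program below illustrates the underlying trade-off (it is not the methodology of [Freytag et al., 2022]): the same sum is accumulated in single and in double precision, and the single-precision result drifts as rounding errors accumulate, which is the kind of accuracy loss that must be weighed against the storage and energy savings.

```c
#include <stdio.h>

/* Accumulate the same series in float and in double: the float version
 * uses half the memory per element but loses accuracy as it grows. */
int main(void) {
    const int n = 10000000;
    float  sum_f = 0.0f;
    double sum_d = 0.0;

    for (int i = 1; i <= n; i++) {
        sum_f += 1.0f / (float)i;
        sum_d += 1.0  / (double)i;
    }
    printf("float  accumulation: %.8f\n", sum_f);
    printf("double accumulation: %.8f\n", sum_d);
    printf("absolute difference: %.8f\n", sum_d - (double)sum_f);
    return 0;
}
```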
5.4 Data Locality

In future microprocessors, where the memory hierarchy will become more complex, the energy spent just to move data will critically affect the application's performance and energy consumption. Hence, every nanojoule employed to move data up and down the memory hierarchy decreases the energy available for computation. In this scenario, task mapping and scheduling need to be optimized with respect to the interconnection network, prioritizing locality to restrict data movement as much as possible [Cruz et al., 2021]. Therefore, the tendency is to prioritize data locality over processor speed, although local data also tends to aid faster processing. Energy conservation becomes the priority.

Figure 14, extracted from the DoE report [Vetter et al., 2022], shows that despite the evolution of chip technology, with increasingly thinner process nodes reaching 7 nm, the decrease in energy consumption of the entire chip does not keep pace with the reduction of the energy spent on computation. This evolution shows that moving data consumes more energy than performing computing operations on the chip, which implies an important revolution, not only in the architecture of processors but also in the way instructions are executed. Compilers should be concerned with bringing data closer to the processors.

Figure 14. Energy consumption in picojoules versus technology evolution [Vetter et al., 2022].
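A minimal sketch of locality-aware programming on NUMA systems that use a first-touch page placement policy (an assumption about the target machine, not a claim from the cited report): initializing the array with the same parallel schedule that later computes on it tends to place each page in the memory closest to the thread that uses it, reducing remote-memory traffic.

```c
#include <stdio.h>
#include <stdlib.h>

#define N 50000000L

int main(void) {
    double *a = malloc(N * sizeof(double));

    /* First touch decides page placement on many NUMA systems. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = 0.0;

    /* Same static schedule: each thread works on the pages it touched. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = a[i] * 0.5 + 1.0;

    printf("a[0] = %f\n", a[0]);
    free(a);
    return 0;
}
```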

6 Conclusion

As discussed throughout the text, High-Performance Computing has gone from a specific area, meeting certain processing needs, to a central theme in the evolution of computing, considering the growing need for processing power in science fields such as Big Data, Artificial Intelligence, and Data Science, among others. This evolution goes through important transformations of machines and processors, including the growing use of the cloud to meet these demands for processing power. The search for greater performance is not always the main priority; today, there is often an attempt to optimize energy consumption. Heterogeneity is an integral part of processors and machines, and the arrival of quantum processors should increase this diversity. Resilience is crucial for new HPC systems, since system failures, errors, or interruptions can result in data loss, delays, or system downtime, leading to significant financial losses, productivity declines, or even safety risks in critical applications. Furthermore, new forms of programming and storage are an essential part of HPC development. It is concluded that the evolution of computing involves the continuous growth of processing power and new ways of achieving it.

Declarations

Authors' Contributions

All authors contributed to the writing of this article, and read and approved the final manuscript.

Competing interests

The authors declare that they have no competing interests.

Funding

This study was partially supported by the Coordination for the Improvement of Higher Education Personnel - Brazil (CAPES) - Finance Code 001, by Petrobras grant n.º 2020/00182-5, by CNPq/MCTI/FNDCT grant nº 308877/2022-5, and by CNPq Universal grant 18/2021 nº 406182/2021-3.

Availability of data and materials

Data can be made available upon request.

References

Aurora (2021). Argonne Leadership Computing Facility. Available at: https://www.alcf.anl.gov/aurora. Accessed: Apr. 11, 2021.

Baldini, I., Castro, P., Chang, K., Cheng, P., Fink, S., Ishakian, V., Mitchell, N., Muthusamy, V., Rabbah, R., Slominski, A., et al. (2017). Serverless computing: Current trends and open problems. Research Advances in Cloud Computing, pages 1-20. DOI: 10.1007/978-981-10-5026-8_1.

Barney, B. (2009). POSIX threads programming. Lawrence Livermore National Laboratory. Available at: https://computing.llnl.gov/tutorials/pthreads. Accessed: May 4, 2022.

Cerebras (2021). The future of AI is here. Available at: https://cerebras.net/chip/. Accessed: Sep. 10, 2021.

Chandra, R., Dagum, L., Menon, R., Kohr, D., Maydan, D., and McDonald, J. (2001). Parallel Programming in OpenMP. Morgan Kaufmann.

Cook, S. (2012). CUDA Programming: A Developer's Guide to Parallel Computing with GPUs. Newnes.

Cox, M. and Ellsworth, D. (1997). Managing big data for scientific visualization. In ACM SIGGRAPH, volume 97, pages 21-38. MRJ/NASA Ames Research Center. Available at: https://www.researchgate.net/profile/David-Ellsworth-2/publication/238704525_Managing_big_data_for_scientific_visualization/links/54ad79d20cf2213c5fe4081a/Managing-big-data-for-scientific-visualization.pdf.

Cruz, E. H., Diener, M., Pilla, L. L., and Navaux, P. O. (2021). Online thread and data mapping using a sharing-aware memory management unit. ACM Transactions on Modeling and Performance Evaluation of Computing Systems (TOMPECS), 5(4):1-28. DOI: 10.1145/3433687.

Dávila, G. P., Oliveira, D., Navaux, P., and Rech, P. (2019). Identifying the most reliable collaborative workload distribution in heterogeneous devices. In 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE), pages 1325-1330. IEEE. DOI: 10.23919/DATE.2019.8715107.

Desjardins, J. (2019). How much data is generated each day? Available at: https://www.visualcapitalist.com/how-much-data-is-generated-each-day. Accessed: Mar. 12, 2021.

Dongarra, J. H. M. and Strohmaier, E. (2020). Top500 supercomputer. Available at: https://www.top500.org/lists/top500/2020/11/. Accessed: Mar. 10, 2021.

Farber, R. (2016). Parallel Programming with OpenACC. Newnes.

Freytag, G., Lima, J. V., Rech, P., and Navaux, P. O. (2022). Impact of reduced and mixed-precision on the efficiency of a multi-GPU platform on CFD applications. In Computational Science and Its Applications - ICCSA 2022 Workshops: Malaga, Spain, July 4-7, 2022, Proceedings, Part IV, pages 570-587. Springer. DOI: 10.1007/978-3-031-10542-5_39.

Frontier (2021). ORNL exascale supercomputer. Available at: https://www.olcf.ornl.gov/frontier/. Accessed: Apr. 10, 2021.

Fu, X., Riesebos, L., Lao, L., Almudever, C. G., Sebastiano, F., Versluis, R., Charbon, E., and Bertels, K. (2016). A heterogeneous quantum computer architecture. In Proceedings of the ACM International Conference on Computing Frontiers, pages 323-330. DOI: 10.1145/2903150.2906827.

Fujitsu (2021). Supercomputer Fugaku. Available at: https://www.fujitsu.com/. Accessed: Apr. 10, 2021.

Gabriel, E., Fagg, G. E., Bosilca, G., Angskun, T., Dongarra, J. J., Squyres, J. M., Sahay, V., Kambadur, P., Barrett, B., Lumsdaine, A., et al. (2004). Open MPI: Goals, concept, and design of a next generation MPI implementation. In Recent Advances in Parallel Virtual Machine and Message Passing Interface: 11th European PVM/MPI Users' Group Meeting, Budapest, Hungary, September 19-22, 2004. Proceedings 11, pages 97-104. Springer. DOI: 10.1007/978-3-540-30218-6_19.

Ghose, S., Boroumand, A., Kim, J. S., Gómez-Luna, J., and Mutlu, O. (2019). Processing-in-memory: A workload-driven perspective. IBM Journal of Research and Development, 63(6). DOI: 10.1147/JRD.2019.2934048.

Gupta, S., Patel, T., Engelmann, C., and Tiwari, D. (2017). Failures in large scale systems: long-term measurement, analysis, and implications. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-12. DOI: 10.1145/3126908.3126937.

Hendrickx, N. W., Lawrie, W. I., Russ, M., van Riggelen, F., de Snoo, S. L., Schouten, R. N., Sammak, A., Scappucci, G., and Veldhorst, M. (2021). A four-qubit germanium quantum processor. Nature, 591(7851):580-585. DOI: 10.1038/s41586-021-03332-6.

Kelleher, J. D. and Tierney, B. (2018). Data Science. MIT Press.

Khan, S. M. and Mann, A. (2020). AI chips: what they are and why they matter. Center for Security and Emerging Technology. DOI: 10.51593/20190014.

LeBeane, M., Ryoo, J. H., Panda, R., and John, L. K. (2015). Watt watcher: fine-grained power estimation for emerging workloads. In 2015 27th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), pages 106-113. IEEE. DOI: 10.1109/SBAC-PAD.2015.26.

Lee, S., Kang, S.-h., Lee, J., Kim, H., Lee, E., Seo, S., Yoon, H., Lee, S., Lim, K., Shin, H., et al. (2021). Hardware architecture and software stack for PIM based on commercial DRAM technology: Industrial product. In 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), pages 43-56. IEEE. DOI: 10.1109/ISCA52012.2021.00013.

Liao, X.-k., Lu, K., Yang, C.-q., Li, J.-w., Yuan, Y., Lai, M.-c., Huang, L.-b., Lu, P.-j., Fang, J.-b., Ren, J., et al. (2018). Moving from exascale to zettascale computing: challenges and techniques. Frontiers of Information Technology & Electronic Engineering, 19:1236-1244. DOI: 10.1631/FITEE.1800494.

LLNL (2021). DOE/NNSA Lab announces a partnership with Cray to develop NNSA's first exascale supercomputer. Jeremy Thomas. Available at: https://www.llnl.gov/news/. Accessed: Sep. 10, 2021.

Matsuoka, S., Domke, J., Wahib, M., Drozd, A., and Hoefler, T. (2023). Myths and legends in high-performance computing. arXiv preprint arXiv:2301.02432. DOI: 10.48550/arXiv.2301.02432.

Munshi, A., Gaster, B., Mattson, T. G., and Ginsburg, D. (2011). OpenCL Programming Guide. Pearson Education.

Padoin, E. L., Diener, M., Navaux, P. O., and Méhaut, J.-F. (2019). Managing power demand and load imbalance to save energy on systems with heterogeneous CPU speeds. In 2019 31st International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), pages 72-79. IEEE. DOI: 10.1109/SBAC-PAD.2019.00024.

Paillard, G. A. L., Coutinho, E. F., de Lima, E. T., and Moreira, L. O. (2015). An architecture proposal for high performance computing in cloud computing environments. In 4th International Workshop on Advances in ICT Infrastructures and Services (ADVANCE 2015), Recife. Available at: https://www.researchgate.net/profile/Emanuel-Coutinho/publication/293481549_An_Architecture_Proposal_for_High_Performance_Computing_in_Cloud_Computing_Environments/links/56b897a708ae3c1b79b2dff5/An-Architecture-Proposal-for-High-Performance-Computing-in-Cloud-Computing-Environments.pdf.

Reed, D., Gannon, D., and Dongarra, J. (2022). Reinventing high performance computing: Challenges and opportunities. arXiv preprint arXiv:2203.02544. DOI: 10.48550/arXiv.2203.02544.

Rocki, K., Van Essendelft, D., Sharapov, I., Schreiber, R., Morrison, M., Kibardin, V., Portnoy, A., Dietiker, J. F., Syamlal, M., and James, M. (2020). Fast stencil-code computation on a wafer-scale processor. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-14. IEEE. DOI: 10.1109/SC41405.2020.00062.

SambaNova (2021). Accelerated computing with a reconfigurable dataflow architecture. White paper. Available at: https://sambanova.ai/. Accessed: Sep. 10, 2021.

Sanders, J. and Kandrot, E. (2010). CUDA by Example: An Introduction to General-Purpose GPU Programming. Addison-Wesley Professional.

Schardl, T. B., Lee, I.-T. A., and Leiserson, C. E. (2018). Brief announcement: Open Cilk. In Proceedings of the 30th Symposium on Parallelism in Algorithms and Architectures, pages 351-353. DOI: 10.1145/3210377.3210658.

Stevens, R., Taylor, V., Nichols, J., Maccabe, A. B., Yelick, K., and Brown, D. (2020). AI for Science: Report on the Department of Energy (DOE) town halls on artificial intelligence (AI) for science. Technical report, Argonne National Laboratory (ANL), Argonne, IL (United States).

Verbraeken, J., Wolting, M., Katzy, J., Kloppenburg, J., Verbelen, T., and Rellermeyer, J. S. (2020). A survey on distributed machine learning. ACM Computing Surveys (CSUR), 53(2):1-33. DOI: 10.1145/3377454.

Vetter, J. S., Brightwell, R., Gokhale, M., McCormick, P., Ross, R., Shalf, J., Antypas, K., Donofrio, D., Humble, T., Schuman, C., et al. (2022). Extreme heterogeneity 2018 - productive computational science in the era of extreme heterogeneity: Report for DOE ASCR workshop on extreme heterogeneity. DOI: 10.2172/1473756.

Voss, M., Asenjo, R., and Reinders, J. (2019). Mapping parallel patterns to TBB. In Pro TBB: C++ Parallel Programming with Threading Building Blocks, pages 233-248. DOI: 10.1007/978-1-4842-4398-5_8.

Xenopoulos, P., Daniel, J., Matheson, M., and Sukumar, S. (2016). Big data analytics on HPC architectures: Performance and cost. In 2016 IEEE International Conference on Big Data (Big Data), pages 2286-2295. IEEE. DOI: 10.1109/BigData.2016.7840861.