Institute of Informatics, Federal University of Rio Grande do Sul – PO Box 15064, Av. Bento Gonçalves, 9500, 91501-970, Porto Alegre, RS, Brazil
Abstract. High-Performance Computing (HPC) has become one of the most active computer science fields, driven mainly by the high processing capabilities required by algorithms from many areas, such as Big Data, Artificial Intelligence, Data Science, and subjects related to chemistry, physics, and biology; the state-of-the-art algorithms in these fields are notoriously demanding of computing resources. Therefore, choosing the right computer system to optimize their performance is paramount. This article presents the main challenges of future supercomputer systems, highlighting the areas that demand the most from HPC servers; the new architectures, including heterogeneous processors composed of artificial intelligence chips and quantum processors, and the adoption of HPC on cloud servers; and the challenges software developers face when parallelizing applications. We also discuss challenges regarding non-functional requirements, such as energy consumption and resilience.
data, also known as Big Data, comes from different sources (e.g., sensors on the Internet of Things) and needs treatment by non-traditional systems for manipulating, analyzing, and extracting information from the dataset. However, only around 20% of this extracted information would be helpful. Therefore, Big Data is a collection of large and complex datasets that becomes difficult to process using database management tools. It is often a collection of legacy data. In addition to dealing with new ways of managing data, the Big Data area needs great computing power to process this data. Furthermore, we must remember that data are numbers and codes without treatment. Information, on the other hand, is processed data; it is the processing of data that creates meaning. Finally, knowledge is knowing about a specific subject; it is having an application for the information.

It is also important to mention that knowledge about information is power nowadays. It always has been, but with the advent of large amounts of data and the ability to process them, it became possible to conduct analysis and decision-making. Information has become the most critical asset. Companies that hold data and have the power to process and analyze it hold today a capacity many times greater than that of governments. Furthermore, there are no frontiers for information, and these companies are capturing data from almost all over the world.

In summary, machines with high processing power, such as supercomputers, are needed to assist in data storage and extraction in the Big Data era.
2.2 Demands from the Artificial Intelligence Area

According to the US Department of Energy's "AI for Science" report [Stevens et al., 2020], new Artificial Intelligence techniques will be indispensable to support the continued growth and expansion of the Science infrastructure through Exascale systems. The experience of the scientific community, the use of Machine Learning, simulation with HPC, and data analysis methods allowed a unique and new growth of opportunities for Science, discoveries, and more robust approaches to accelerated Science and its applications for the benefit of humanity.

The convergence of HPC with AI allows simulation environments to employ deep reinforcement learning in various problems, such as simulating robots, aircraft, autonomous vehicles, etc. DL techniques accelerate simulations by replacing models in climate prediction, geoscience, pharmaceuticals, etc. New frontiers in physics are being reached by increasing the application of Partial Differential Equations (PDEs) with DL for simulations.

On the other hand, one major bottleneck in using ML and DL algorithms is the learning phase before their use. Depending on the available computing infrastructure, this activity can take weeks and even months. That is where high-performance processing accelerates this step.

In the end, due to the HPC demands from the AI area, companies are investing in parallel frameworks to optimize the execution of AI software. As an example, NVIDIA provides different solutions so that users can take advantage of GPUs optimized for AI algorithms: TensorFlow, an open-source platform for Machine Learning that provides tools and libraries to allow easy deployment of code across heterogeneous devices; PyTorch, a GPU-accelerated tensor computation framework; and TorchANI, an implementation of the Accurate Neural Network Engine for Molecular Energies. Another example is the Intel Neural Compressor tool, which helps software developers easily and quickly deploy inference solutions on popular deep learning frameworks, including TensorFlow and PyTorch.

Therefore, one of the challenges software developers will face in the convergence of HPC with AI is how to efficiently use the available hardware devices to get the most out of all the provided frameworks. Simultaneously, hardware vendors have the challenging task of designing optimized processors for AI computing, as we discuss in Section 3.3.

2.3 Demands from Data Science and Related Areas

In conjunction with advancements in supercomputing, the rise of data generation technologies has resulted in the convergence of HPC and data science applications. The widespread adoption of high-throughput data generation technologies, scalable algorithms and analytics, and efficient methods for managing large-scale data has established scalable Data Science as a central pillar for accelerating scientific discovery and innovation [Xenopoulos et al., 2016].

Data science can be defined as a field that combines different areas (e.g., math, statistics, artificial intelligence, machine learning, and specialized programming) with a subject to uncover insights that are hidden in an organization's data. The lifecycle of data science involves tools, roles, and processes that usually undergo the following steps: data collection, data storage and processing, data analysis, and presentation of reports and data visualizations that communicate the insights [Kelleher and Tierney, 2018].

All steps are accelerated with the employment of HPC systems and can be used to provide insights beneficial to society: fraud and risk detection, search engines, advanced image recognition, speech recognition, and airline route planning. In summary, even with the use of HPC to optimize the execution of data science algorithms, one of the main challenges faced by the area is to use hardware resources better to reduce the training cost of learning algorithms.

3 HPC Architectures and Processors Challenges

The last section (Section 2) showed how advances in scientific areas will demand more processing power to run the models, manage the files, and extract the data, among other demands. Hence, this section presents the main challenges HPC needs to overcome, in terms of hardware, to deliver results on time for its employment in such areas.
Figure 1. Fugaku [Fujitsu, 2021], Frontier [Frontier, 2021], and El Capitan [LLNL, 2021] supercomputers.
Figure 2. System attributes (system peak, total system memory, system size in nodes, node-to-node interconnect, and file system) of the current, pre-exascale, and exascale systems at the ALCF, NERSC, and OLCF facilities:
ALCF Now: system peak > 15.6 PF; 847 TB DDR4 + 70 TB HBM + 7.5 TB GPU memory; 4,392 KNL nodes and 24 DGX-A100 nodes; Aries (KNL nodes) and HDR200 (GPU nodes) interconnect; 10 PB, 210 GB/s Lustre file system.
NERSC Now: system peak > 30 PF; ~1 PB DDR4 + High Bandwidth Memory (HBM) + 1.5 PB persistent memory; 9,300 nodes plus 1,900 nodes in the data partition; Aries interconnect; 28 PB, 744 GB/s Lustre file system.
OLCF Now: system peak 200 PF; 2.4 PB DDR4 + 0.4 PB HBM + 7.4 PB persistent memory; 4,608 nodes; Dual Rail EDR-IB interconnect; 250 PB, 2.5 TB/s GPFS file system.
NERSC Pre-Exascale: system peak > 120 PF; 1.92 PB DDR4 + 240 TB HBM; > 1,500 GPU nodes and > 3,000 CPU nodes; HPE Slingshot NIC; 35 PB all-flash Lustre file system.
ALCF Pre-Exascale: system peak 35–45 PF; > 250 TB HBM2e; > 500 nodes; HPE Slingshot NIC; 200 PB, 1.3 TB/s Lustre file system.
OLCF Exascale: system peak > 1.5 EF; 4.6 PB DDR4 + 4.6 PB HBM + 36 PB persistent memory; > 9,000 nodes; HPE Slingshot interconnect; 695 PB + 10 PB flash performance tier, Lustre file system.
ALCF Exascale: system peak ≥ 1 EF DP sustained; > 10 PB of memory; > 9,000 nodes; HPE Slingshot interconnect; ≥ 230 PB, ≥ 25 TB/s DAOS file system.
3.1 New Generation of Processors and Accelerators

Several proposals for parallel architectures have emerged recently, as depicted in Figures 1 and 2, which should shake up and change the processor market. Next, we highlight the four supercomputers with the highest computing power currently.

The Arm-based A64FX processor, designed by Fujitsu, allowed Fujitsu's Fugaku computer [Fujitsu, 2021], in Japan, to remain for two years (2020 and 2021) the fastest computer on the TOP500 list, reaching a performance of 442 Petaflops (Figure 1).

The first supercomputer to deliver performance greater than 1 Exaflop was Frontier (Figure 1), from the Oak Ridge laboratory, USA. It was designed by Cray-HPE with AMD EPYC processors and Radeon Instinct accelerators and reached a performance of 1.1 Exaflops [Frontier, 2021], according to the TOP500 list.

After breaking the exaflop performance barrier, it is natural that the number of supercomputers achieving such a performance level increases. In this scenario, Aurora (Figure 1) is a supercomputer from Argonne's laboratory that is being developed with Intel processors adopting the Ponte Vecchio architecture, which integrates the Xeon processor and the Xe accelerator on the same chip [Aurora, 2021].

Also, the El Capitan supercomputer (Figure 1), from the Lawrence Livermore National Laboratory (LLNL), designed with AMD EPYC processors (code-named Genoa, with the Zen 4 processor core) and AMD Radeon Instinct GPUs, is expected to exceed 2 Exaflops [LLNL, 2021].

In addition to these two machines, other exascale systems are planned to be delivered in the next few years. The DoE, the Department of Energy of the USA, financed national computing facility centers to receive new systems with distinct architectures from different companies. Figure 2 summarizes the evolution of the subsequent machine installations, starting with Perlmutter at NERSC (Berkeley), Polaris at ALCF (Argonne), Frontier at OLCF (Oak Ridge), and Aurora at ALCF (Argonne). In other countries, there are also exascale machines to be delivered, such as in China, with the evolution of the Sunway and the Tianhe, and in Japan, with the new Fugaku.

It is worth mentioning that the architectures of these processors and machines are increasingly heterogeneous, allowing applications to be processed in the device with the best performance.
Figure 4. Comparison of the Cerebras WSE-2 with the largest GPU at the moment.
Figure 3. NVIDIA Tegra K1 chip integrating processors with GPUs in an SoC.
… intensive applications.
Other types of aware computing are also becoming important in HPC servers. In network-aware computing, the network performance is optimized with load balancing, network topology optimization, and congestion control. Security-aware computing focuses on ensuring the security of systems. Furthermore, one can also find data-aware and user-aware computing, where the former focuses on the efficient management of data and the latter is concerned with providing personalized user experiences.

Figure 5. Future of heterogeneous architectures with quantum processors, based on Fu et al. [2016].

In recent years, many quantum machines have been announced, like the Zuchongzhi with 66 qubits, Google with its 54-qubit Sycamore processor, and IBM's 14th quantum computer model with 53 qubits.
However, as scientists envision practical quantum computing, machines will need nearly a thousand qubits. In this scenario, IBM has the ambitious goal of building a quantum computer containing 1,000 qubits by 2025.

In the end, one of the big challenges of quantum machines is correcting the myriad errors that usually come with quantum operations, so researchers focus on lowering the error rate of the entire system. Quantum computing will be the next frontier in the growth of processing capacity, allowing the solution of problems in Science that are currently unfeasible to solve on time.

They understand that the evolution will increase from machines with 2 to 3 Eflops in 2025 to 2030, with performance scaling to 50–80 Eflops, and the last step will end in 2035 with the Zflops machines. To reach this level of computing, the authors project the list of technical metrics presented in Table 1. The computing system will have a power efficiency of 10 Tflops/W for a total power consumption near 100 MW for the whole system. It is interesting to observe that the estimated performance per node will be 10 Pflops and the bandwidth between nodes 1.6 Tb/s. The storage capacity of the whole system will be around 1 ZB (Zettabyte).
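As a quick sanity check (our own back-of-the-envelope arithmetic, not a figure from the cited roadmap), the power-efficiency and power-budget targets are indeed consistent with a zettascale peak:

    $10~\text{Tflops/W} \times 100~\text{MW} = 10^{13}~\text{flops/W} \times 10^{8}~\text{W} = 10^{21}~\text{flops} = 1~\text{Zflops}$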
… level parallelism is defining the data dependency between different parallel regions, which may limit the gains with parallelization.

Figure 9. Stencil Computation Example [Voss et al., 2019].

The Stencil pattern has good performance scalability on GPUs, as they are organized so that thousands of threads can be executed simultaneously. In the end, the main challenge will be to balance the memory operation time with the computation time, so that loading the neighbors is faster than the computation applied to them.
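To make the pattern concrete, the sketch below shows a minimal 5-point stencil sweep in C with an OpenMP directive (an illustrative example of ours; the grid layout, coefficients, and parallelization strategy are assumptions, not code from the cited works). Each output point loads four neighbors, which is exactly the memory traffic that must be balanced against the computation applied to them:

    /* One sweep of a 5-point stencil over an n x n grid (row-major).
       Interior points only; boundary values are left untouched. */
    void stencil_sweep(int n, const double *in, double *out)
    {
        #pragma omp parallel for schedule(static)
        for (int i = 1; i < n - 1; i++) {
            for (int j = 1; j < n - 1; j++) {
                /* Four neighbor loads per point: the memory operations whose
                   cost must stay below the cost of the computation itself. */
                out[i * n + j] = 0.25 * (in[(i - 1) * n + j] +
                                         in[(i + 1) * n + j] +
                                         in[i * n + (j - 1)] +
                                         in[i * n + (j + 1)]);
            }
        }
    }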
There are also many applications where many threads/processes simultaneously apply the same operation over different data. In this case, the Reduction pattern combines every element calculated by each thread/process using an associative function called the combine function [Voss et al., 2019]. Figure 10 shows an example of a reduction operation executing in parallel, where pairs of elements are combined simultaneously until the final result is reached. Even though it achieves satisfactory performance results when combining values computed by each thread, its implementation gets more complex, as the number of reductions after each iteration changes at every step until the final result is reached. Therefore, software developers face the challenge of coordinating the computation during each reduction operation to maximize the usage of hardware resources.

Figure 10. Parallel reduction, where threads produce subresults that are combined to produce a final single answer [Voss et al., 2019].

4.2 Parallel Programming Libraries

With the emergence of modern architectures that present different computing capabilities, many programming languages and libraries to help software developers exploit parallelism were introduced in the literature. We illustrate in Figure 11 the evolution of the number of citations on the Scopus database of the most used parallel programming languages over the years and discuss each one next.

As observed, the Message-Passing Interface (MPI) has established itself over the years as one of the main alternatives for exploiting parallelism. This popularity happens due to the need for a library to create processes and manage communication between them in distributed-memory environments, which are found in every HPC system. Given that, along with the evolution of MPI [Gabriel et al., 2004] (e.g., MPI-1, MPI-2, and MPI-3) and of the technology available in HPC systems, the challenges are related to (i) balancing communication and computation between computing nodes to better use hardware resources; (ii) efficiently using asynchronous communication to overlap communication and computation; and (iii) ensuring fault tolerance mechanisms.
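As an illustration of challenge (ii), the sketch below shows the usual structure for overlapping communication and computation with non-blocking MPI calls (a hypothetical halo-exchange skeleton of ours; buffer names, neighbor ranks, and tags are placeholders, not code from the cited references):

    #include <mpi.h>

    /* One step of a hypothetical 1D halo exchange overlapped with computation.
       'lo' and 'hi' are the neighbor ranks; buffers and counts are placeholders. */
    void halo_step(double *recv_lo, double *recv_hi,
                   double *send_lo, double *send_hi,
                   int count, int lo, int hi)
    {
        MPI_Request reqs[4];

        /* Post the non-blocking receives and sends first. */
        MPI_Irecv(recv_lo, count, MPI_DOUBLE, lo, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(recv_hi, count, MPI_DOUBLE, hi, 1, MPI_COMM_WORLD, &reqs[1]);
        MPI_Isend(send_lo, count, MPI_DOUBLE, lo, 1, MPI_COMM_WORLD, &reqs[2]);
        MPI_Isend(send_hi, count, MPI_DOUBLE, hi, 0, MPI_COMM_WORLD, &reqs[3]);

        /* Update the interior points here: this work does not depend on the
           halos, so it overlaps with the communication in flight. */

        /* Only the border update must wait for the messages to complete. */
        MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);

        /* Update the border points that depend on recv_lo / recv_hi here. */
    }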
Simultaneously, with the increase in the number of cores in multicore architectures, there was a popularization of libraries that exploit parallelism in shared memory (for example, OpenMP and POSIX Threads). The POSIX Threads library, a lightweight thread implementation, was the first used to create parallel applications [Barney, 2009]. However, due to its complexity and the need for the programmer to decide different aspects of the operating system, it is no longer used for this purpose. It continues to be used only for developing applications aimed at the operating system.

In this scenario, the OpenMP library replaced the POSIX Threads library. OpenMP consists of compiler directives, library functions, and environment variables that ease the burden of managing threads in the code [Chandra et al., 2001]. Therefore, extracting parallelism using OpenMP usually requires less effort when compared to POSIX Threads. Parallelism is exploited by inserting directives in the code that inform the compiler how and which parts of the application should be executed in parallel. With the increasing number of cores and the complexity of the memory hierarchy in shared-memory architectures, the challenges when using OpenMP are related to (i) finding the optimal number of threads to execute each parallel region; (ii) defining thread and data placement strategies that reduce cache contention and last-level cache misses; and (iii) efficiently employing directives for heterogeneous computing. Furthermore, other libraries, such as Cilk Plus [Schardl et al., 2018], developed by MIT and later maintained by Intel, emerged to facilitate vector and task programming; however, due to the evolution of the OpenMP library, they also ceased to be used. In the last years, OmpSs-2 has emerged as an alternative to OpenMP when exploiting parallelism through tasks, providing more malleability and better load balancing on multi-core systems.
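As a small illustration of the directive style just described (and of the Reduction pattern of Figure 10), the C sketch below lets OpenMP manage the tree of partial combinations through the reduction clause; the environment variables shown in the comment are one common way to address challenges (i) and (ii). The example is ours, not taken from the cited references:

    #include <stdio.h>

    /* Sum of an array using OpenMP's built-in reduction: each thread
       accumulates a private partial sum, and the runtime combines the
       partial results with the associative '+' operator at the end. */
    double parallel_sum(const double *data, int n)
    {
        double sum = 0.0;
        #pragma omp parallel for reduction(+ : sum) schedule(static)
        for (int i = 0; i < n; i++)
            sum += data[i];
        return sum;
    }

    int main(void)
    {
        /* Thread count and placement are typically tuned per machine, e.g.:
           OMP_NUM_THREADS=16 OMP_PLACES=cores OMP_PROC_BIND=close ./a.out */
        double data[1000];
        for (int i = 0; i < 1000; i++)
            data[i] = 1.0;
        printf("sum = %f\n", parallel_sum(data, 1000));
        return 0;
    }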
The scenario dominated by OpenMP and MPI drastically changed with the popularization of GPU architectures. From then on, CUDA became one of the most used parallel programming interfaces to accelerate multi-domain applications on NVIDIA GPUs [Sanders and Kandrot, 2010].
The CUDA library consists of a set of extensions for C, C++ and …

Figure 11. Number of citations per year of the most used parallel programming libraries (Scopus database).

HPC machines' growing demand for processing capabilities has led manufacturers to associate thousands of processors in clusters, which … and energy consumption, which sometimes occurs the opposite.
Analysis over the last years shows that newer HPC systems will be much less reliable as a consequence of three main factors: the increasing complexity of the system design and the number of hardware resources (e.g., cores and memories); the decreasing device dependability; and the shrinking process technology. Figure 13 shows the mean time between failures (MTBF) of distinct HPC servers with different numbers of nodes and cores per node, extracted from Gupta et al. [2017]. As can be observed, the greater the number of nodes in the HPC system, the shorter the mean time (in hours) between failures. This scenario means that newer supercomputers with thousands of processors and nodes will have servers crashing every day, hence the importance of resilience to keep the machine running.

Analysis over time shows the growth of failures with the evolution of supercomputers, with more and more cores, reaching thousands of them. Figure 13 shows this evolution, which again means that a supercomputer with thousands of processors will have servers crashing every day. Resilience is achieved through hardware and software that dynamically detect the failure, diagnose, reconfigure, and repair it, allowing processing to continue without user awareness. Therefore, this area becomes essential in future machines to meet the needs of high-performance processing and will enable this processing to run until results are obtained without interruption.

Figure 13. Mean time between failures (MTBF, in hours, logarithmic scale) of HPC systems with different numbers of nodes and cores per node [Gupta et al., 2017]: IBM AC922 (4,608 nodes, 44 cores/node), IBM AC922 (4,662 nodes, 44 cores/node), EoS Cray XC30 (9,840 nodes, 12 cores/node), Jaguar XT4 (7,832 nodes, 12 cores/node), Jaguar XK6 (18,662 nodes, 16 cores/node), and Titan XK7 (18,688 nodes, 16 cores/node); MTBF values range from 2,918 hours down to 5.2 hours.

5.4 Data Locality

In future microprocessors, where the memory hierarchy will become more complex, the energy spent only to move data will critically affect the application's performance and energy consumption. Hence, any nanojoule employed to move data up and down the memory hierarchy decreases the energy available for computation. In this scenario, task mapping and scheduling need to be optimized in the interconnection network, prioritizing locality to restrict data movement as much as possible [Cruz et al., 2021]. Therefore, the tendency is to prioritize data locality over processor speed, although local data also tends to aid in faster processing; energy conservation becomes the priority.

Figure 14, extracted from the DoE report [Vetter et al., 2022], shows that, despite the evolution of chip technology, with increasingly thinner technologies reaching 7 nm, the decrease in energy consumption of the entire chip does not keep the same pace as the reduction of the energy spent on computing. This evolution shows that moving data spends more power than performing computing operations on the chip, which implies an important revolution, not only in the architecture of processors but also in the way instructions are executed. Compilers should be concerned with bringing data closer to the processors.
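One classic way in which programmers and compilers "bring data closer to the processors" is loop tiling (blocking), which reorganizes a computation so that a block of data is reused from cache before being evicted, directly reducing the data movement discussed above. The sketch below is a simple illustrative example of ours, not a technique taken from the cited report; the block size is a placeholder that would be tuned to the actual cache hierarchy:

    #define BS 64  /* block size: a placeholder, tuned to the cache in practice */

    /* Tiled (blocked) matrix multiplication C += A * B for n x n matrices.
       Working on BS x BS blocks keeps each block's data in cache while it is
       reused, trading extra loop overhead for far less memory traffic. */
    void matmul_tiled(int n, const double *A, const double *B, double *C)
    {
        for (int ii = 0; ii < n; ii += BS)
            for (int kk = 0; kk < n; kk += BS)
                for (int jj = 0; jj < n; jj += BS)
                    for (int i = ii; i < ii + BS && i < n; i++)
                        for (int k = kk; k < kk + BS && k < n; k++) {
                            double a = A[i * n + k];
                            for (int j = jj; j < jj + BS && j < n; j++)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }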
… an essential part of HPC development. It is concluded that the evolution of computing involves the continuous growth of processing power and of new ways of achieving it.

Declarations

Authors' Contributions
All authors contributed to the writing of this article, and read and approved the final manuscript.

Competing interests
The authors declare that they have no competing interests.

Funding
This study was partially supported by the Coordination for the Improvement of Higher Education Personnel – Brazil (CAPES) – Finance Code 001, by Petrobras grant n.º 2020/00182-5, by CNPq/MCTI/FNDCT n.º 308877/2022-5, and by grant Universal 18/2021 n.º 406182/2021-3.

Availability of data and materials
Data can be made available upon request.

References

Aurora (2021). Argonne Leadership Computing Facility. Available at: https://www.alcf.anl.gov/aurora. Accessed: Apr. 11, 2021.

Baldini, I., Castro, P., Chang, K., Cheng, P., Fink, S., Ishakian, V., Mitchell, N., Muthusamy, V., Rabbah, R., Slominski, A., et al. (2017). Serverless computing: Current trends and open problems. Research Advances in Cloud Computing, pages 1–20. DOI: 10.1007/978-981-10-5026-8_1.

Barney, B. (2009). POSIX threads programming. Lawrence Livermore National Laboratory. Available at: https://computing.llnl.gov/tutorials/pthreads. Accessed: May 4, 2022.

Cerebras (2021). The future of AI is here. Available at: https://cerebras.net/chip/. Accessed: Sep. 10, 2021.

Chandra, R., Dagum, L., Menon, R., Kohr, D., Maydan, D., and McDonald, J. (2001). Parallel Programming in OpenMP. Morgan Kaufmann.

Cook, S. (2012). CUDA Programming: A Developer's Guide to Parallel Computing with GPUs. Newnes.

Cox, M. and Ellsworth, D. (1997). Managing big data for scientific visualization. In ACM SIGGRAPH, volume 97, pages 21–38. MRJ/NASA Ames Research Center. Available at: https://www.researchgate.net/profile/David-Ellsworth-2/publication/238704525_Managing_big_data_for_scientific_visualization/links/54ad79d20cf2213c5fe4081a/Managing-big-data-for-scientific-

Cruz, E. H., Diener, M., Pilla, L. L., and Navaux, P. O. (2021). Online thread and data mapping using a sharing-aware memory management unit. ACM Transactions on Modeling and Performance Evaluation of Computing Systems (TOMPECS), 5(4):1–28. DOI: 10.1145/3433687.

Dávila, G. P., Oliveira, D., Navaux, P., and Rech, P. (2019). Identifying the most reliable collaborative workload distribution in heterogeneous devices. In 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE), pages 1325–1330. IEEE. DOI: 10.23919/DATE.2019.8715107.

Desjardins, J. (2019). How much data is generated each day? Available at: https://www.visualcapitalist.com/how-much-data-is-generated-each-day. Accessed: Mar. 12, 2021.

Dongarra, J. H. M. and Strohmaier, E. (2020). Top500 supercomputer. Available at: https://www.top500.org/lists/top500/2020/11/. Accessed: Mar. 10, 2021.

Farber, R. (2016). Parallel Programming with OpenACC. Newnes.

Freytag, G., Lima, J. V., Rech, P., and Navaux, P. O. (2022). Impact of reduced and mixed-precision on the efficiency of a multi-GPU platform on CFD applications. In Computational Science and Its Applications – ICCSA 2022 Workshops: Malaga, Spain, July 4–7, 2022, Proceedings, Part IV, pages 570–587. Springer. DOI: 10.1007/978-3-031-10542-5_39.

Frontier (2021). ORNL exascale supercomputer. Available at: https://www.olcf.ornl.gov/frontier/. Accessed: Apr. 10, 2021.

Fu, X., Riesebos, L., Lao, L., Almudever, C. G., Sebastiano, F., Versluis, R., Charbon, E., and Bertels, K. (2016). A heterogeneous quantum computer architecture. In Proceedings of the ACM International Conference on Computing Frontiers, pages 323–330. DOI: 10.1145/2903150.2906827.

Fujitsu (2021). Supercomputer Fugaku. Available at: https://www.fujitsu.com/. Accessed: Apr. 10, 2021.

Gabriel, E., Fagg, G. E., Bosilca, G., Angskun, T., Dongarra, J. J., Squyres, J. M., Sahay, V., Kambadur, P., Barrett, B., Lumsdaine, A., et al. (2004). Open MPI: Goals, concept, and design of a next generation MPI implementation. In Recent Advances in Parallel Virtual Machine and Message Passing Interface: 11th European PVM/MPI Users' Group Meeting, Budapest, Hungary, September 19–22, 2004, Proceedings 11, pages 97–104. Springer. DOI: 10.1007/978-3-540-30218-6_19.

Ghose, S., Boroumand, A., Kim, J. S., Gómez-Luna, J., and Mutlu, O. (2019). Processing-in-memory: A workload-driven perspective. IBM Journal of Research and Development, 63(6):3:1–3:19. DOI: 10.1147/JRD.2019.2934048.

Gupta, S., Patel, T., Engelmann, C., and Tiwari, D. (2017). Failures in large scale systems: long-term measurement, analysis, and implications. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–12. DOI: 10.1145/3126908.3126937.

Hendrickx, N. W., Lawrie, W. I., Russ, M., van Riggelen, F., Scappucci, G., and Veldhorst, M. (2021). A four-qubit germanium quantum processor. Nature, 591(7851):580–585. DOI: 10.1038/s41586-021-03332-6.

Kelleher, J. D. and Tierney, B. (2018). Data Science. MIT Press.

Khan, S. M. and Mann, A. (2020). AI chips: What they are and why they matter. Center for Security and Emerging Technology. DOI: 10.51593/20190014.

LeBeane, M., Ryoo, J. H., Panda, R., and John, L. K. (2015). Watt Watcher: fine-grained power estimation for emerging workloads. In 2015 27th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), pages 106–113. IEEE. DOI: 10.1109/SBAC-PAD.2015.26.

Lee, S., Kang, S.-h., Lee, J., Kim, H., Lee, E., Seo, S., Yoon, H., Lee, S., Lim, K., Shin, H., et al. (2021). Hardware architecture and software stack for PIM based on commercial DRAM technology: Industrial product. In 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), pages 43–56. IEEE. DOI: 10.1109/ISCA52012.2021.00013.

Liao, X.-k., Lu, K., Yang, C.-q., Li, J.-w., Yuan, Y., Lai, M.-c., Huang, L.-b., Lu, P.-j., Fang, J.-b., Ren, J., et al. (2018). Moving from exascale to zettascale computing: challenges and techniques. Frontiers of Information Technology & Electronic Engineering, 19:1236–1244. DOI: 10.1631/FITEE.1800494.

LLNL (2021). DOE/NNSA Lab announces partnership with Cray to develop NNSA's first exascale supercomputer. Jeremy Thomas. Available at: https://www.llnl.gov/news/. Accessed: Sep. 10, 2021.

Matsuoka, S., Domke, J., Wahib, M., Drozd, A., and Hoefler, T. (2023). Myths and legends in high-performance computing. arXiv preprint arXiv:2301.02432. DOI: 10.48550/arXiv.2301.02432.

Munshi, A., Gaster, B., Mattson, T. G., and Ginsburg, D. (2011). OpenCL Programming Guide. Pearson Education.

Padoin, E. L., Diener, M., Navaux, P. O., and Méhaut, J.-F. (2019). Managing power demand and load imbalance to save energy on systems with heterogeneous CPU speeds. In 2019 31st International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), pages 72–79. IEEE. DOI: 10.1109/SBAC-PAD.2019.00024.

Paillard, G. A. L., Coutinho, E. F., de Lima, E. T., and Moreira, L. O. (2015). An architecture proposal for high performance computing in cloud computing environments. In 4th International Workshop on Advances in ICT Infrastructures and Services (ADVANCE 2015), Recife. Available at: https://www.researchgate.net/profile/Emanuel-Coutinho/publication/293481549_An_Architecture_Proposal_for_High_Performance_Computing_in_Cloud_Computing_Environments/links/56b897a708ae3c1b79b2dff5/An-Architecture-Proposal-for-High-Performance-Computing-in-Cloud-Computing-Environments.pdf.

… opportunities. arXiv preprint arXiv:2203.02544. DOI: 10.48550/arXiv.2203.02544.

Rocki, K., Van Essendelft, D., Sharapov, I., Schreiber, R., Morrison, M., Kibardin, V., Portnoy, A., Dietiker, J. F., Syamlal, M., and James, M. (2020). Fast stencil-code computation on a wafer-scale processor. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–14. IEEE. DOI: 10.1109/SC41405.2020.00062.

SambaNova (2021). Accelerated computing with a reconfigurable dataflow architecture. White paper. Available at: https://sambanova.ai/. Accessed: Sep. 10, 2021.

Sanders, J. and Kandrot, E. (2010). CUDA by Example: An Introduction to General-Purpose GPU Programming. Addison-Wesley Professional.

Schardl, T. B., Lee, I.-T. A., and Leiserson, C. E. (2018). Brief announcement: Open Cilk. In Proceedings of the 30th Symposium on Parallelism in Algorithms and Architectures, pages 351–353. DOI: 10.1145/3210377.3210658.

Stevens, R., Taylor, V., Nichols, J., Maccabe, A. B., Yelick, K., and Brown, D. (2020). AI for Science: Report on the Department of Energy (DOE) town halls on artificial intelligence (AI) for science. Technical report, Argonne National Laboratory (ANL), Argonne, IL (United States).

Verbraeken, J., Wolting, M., Katzy, J., Kloppenburg, J., Verbelen, T., and Rellermeyer, J. S. (2020). A survey on distributed machine learning. ACM Computing Surveys (CSUR), 53(2):1–33. DOI: 10.1145/3377454.

Vetter, J. S., Brightwell, R., Gokhale, M., McCormick, P., Ross, R., Shalf, J., Antypas, K., Donofrio, D., Humble, T., Schuman, C., et al. (2022). Extreme Heterogeneity 2018 – Productive computational science in the era of extreme heterogeneity: Report for DOE ASCR workshop on extreme heterogeneity. DOI: 10.2172/1473756.

Voss, M., Asenjo, R., and Reinders, J. (2019). Mapping parallel patterns to TBB. In Pro TBB: C++ Parallel Programming with Threading Building Blocks, pages 233–248. DOI: 10.1007/978-1-4842-4398-5_8.

Xenopoulos, P., Daniel, J., Matheson, M., and Sukumar, S. (2016). Big data analytics on HPC architectures: Performance and cost. In 2016 IEEE International Conference on Big Data (Big Data), pages 2286–2295. IEEE. DOI: 10.1109/BigData.2016.7840861.