
Accelerating Fortran Codes: A Method for Integrating Coarray Fortran with CUDA Fortran and OpenMP

James McKevitt (a,b), Eduard I. Vorobyov (a), Igor Kulikov (c)

(a) University of Vienna, Department of Astrophysics, Türkenschanzstrasse 17, Vienna, A-1180, Austria
(b) University College London, Mullard Space Science Laboratory, Holmbury St Mary, Dorking, RH5 6NT, Surrey, United Kingdom
(c) Institute of Computational Mathematics and Mathematical Geophysics SB RAS, Lavrentieva ave. 6, Novosibirsk, 630090, Russia

Email address: [email protected] (James McKevitt)

Abstract

Fortran’s prominence in scientific computing requires strategies to ensure both that legacy codes are efficient on high-performance
computing systems, and that the language remains attractive for the development of new high-performance codes. Coarray Fortran
(CAF), part of the Fortran 2008 standard introduced for parallel programming, facilitates distributed memory parallelism with a
syntax familiar to Fortran programmers, simplifying the transition from single-processor to multi-processor coding. This research
focuses on innovating and refining a parallel programming methodology that fuses the strengths of Intel Coarray Fortran, Nvidia
CUDA Fortran, and OpenMP for distributed memory parallelism, high-speed GPU acceleration and shared memory parallelism
respectively. We consider the management of pageable and pinned memory, CPU-GPU affinity in NUMA multiprocessors, and
robust compiler interfacing with speed optimisation. We demonstrate our method through its application to a parallelised Poisson
solver and compare the methodology, implementation, and scaling performance to that of the Message Passing Interface (MPI),
finding CAF offers similar speeds with easier implementation. For new codes, this approach offers a faster route to optimised
parallel computing. For legacy codes, it eases the transition to parallel computing, allowing their transformation into scalable,
high-performance computing applications without the need for extensive re-design or additional syntax.
Keywords: Coarray Fortran (CAF), CUDA Fortran, OpenMP, MPI
PACS: 0000, 1111
2000 MSC: 0000, 1111

1. Introduction

Across the many fields which make use of scientific computing, the enduring importance of Fortran-written codes is undeniable. Most notably, Intel's compiler remains a popular choice for these Fortran codes, primarily due to its robust performance and reliable support. However, with the exponential growth in computational demands, there is an imperative need to enhance the speed and efficiency of these codes. Shared memory parallelism techniques, like OpenMP, though useful and easy to implement, often fall short in meeting these demands. Hence, turning to distributed memory parallelism and graphics processing units (GPUs) becomes essential, given their capacity to exploit modern and computationally efficient hardware [1, 2].

To optimise the use of GPUs in general computing tasks (general purpose computing on GPUs; GPGPU; [3, 4, 5]), Nvidia's CUDA, a parallel computing platform and programming model, is often employed [6, 7]. Fortran users can leverage CUDA Fortran, an adapted language also provided by Nvidia, which offers all the speed advantages of CUDA, but with the familiar Fortran syntax [8, 9, 10]. The true potential of CUDA Fortran is unlocked when applied to tasks that involve heavy parallelisation like Fast Fourier Transform (FFT) operations [e.g., 11], often a common and performance-critical component in astrophysics simulations and image or data processing [e.g., 12, 13].

For distributed memory parallelism the Message Passing Interface (MPI) is commonly used [14]. However, its implementation can be resource-intensive and often requires a full re-write of the original serialised code. We turn, however, to Coarray Fortran, as a simpler yet powerful alternative [15, 16]. Coarray Fortran, introduced in the Fortran 2008 standard, is designed specifically for parallel programming, both with shared and distributed memory [17]. It not only offers a simple syntax but also ensures efficient performance, especially in the Intel implementation, easing the transition from single-to-multiple-node programming.

The fusion of these two paradigms offered by separate providers, while non-trivial, offers a powerful combination to accelerate Fortran codes on the most modern hardware using intuitive Fortran syntax. This paper provides a comprehensive guide on how to perform such an acceleration of Fortran codes using CUDA Fortran and Coarray Fortran. Regardless of the specific problem at hand, our methodology can be adapted and implemented by scientists in their computational work. We demonstrate the advantages and drawbacks of various available configurations, using a well-established and highly parallelisable potential field solver as a case study [13].
We also explore various techniques and strategies, such as the usage of pointers to optimise communication speed and implement direct memory access. Additionally, we delve into the definition of variables required by both the central processing unit (CPU) and GPU memory, highlighting the treatment of variables describing a potential field in our case study as an example.

Our proposed approach has a broad focus, providing a roadmap that can be adapted to any Fortran code. Through this detailed guide, we aim to enable researchers to streamline their computational workflows, augment their codes' speed, and thereby accelerate their scientific work.

1.1. Coarray Fortran

Coarray Fortran (CAF), introduced in the Fortran 2008 standard, has gained recognition for its ability to simplify parallel programming. CAF offers an intuitive method for data partitioning and communication, enabling scientists to focus on problem-solving rather than the intricacies of parallel computing.

Coarray Fortran integrates parallel processing constructs directly into the Fortran language, eliminating the need for external libraries like MPI. It adopts the Partitioned Global Address Space (PGAS) model, which views memory as a global entity but partitions it among different processors [18]. This allows straightforward programming of distributed data structures, facilitating simpler, more readable codes.

In CAF, the global computational domain is decomposed into a series of images, which represent self-contained Fortran execution environments. Each image holds its local copy of data and can directly access data on other images via coarray variables. These variables are declared in a similar fashion to traditional Fortran variables but with the addition of codimensions. The codimension has a size equal to the number of images. The variables defined without codimensions are local to each image.

Coarray Fortran introduces synchronisation primitives, such as critical, lock, unlock, and sync statements, to ensure proper sequencing of events across different images. This provides programmers with the tools to manage and prevent race conditions and to coordinate communication between images.
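As a brief illustration of these constructs, the minimal sketch below (with hypothetical variable names, not taken from our solver) declares a coarray with a codimension of [*], so that every image holds its own copy, and shows one image reading data held by another after a synchronisation point.

   program caf_example
      implicit none
      real :: phi(64,64,64)[*]          ! coarray: one copy of phi on every image
      real :: remote_max
      integer :: me, n

      me = this_image()                 ! index of this image, starting at 1
      n  = num_images()                 ! total number of images

      phi = real(me)                    ! each image fills its local copy
      sync all                          ! ensure all images have finished writing

      if (me == 1) then
         remote_max = maxval(phi(:,:,:)[n])   ! image 1 reads the copy held by image n
         print *, 'maximum value on image', n, ':', remote_max
      end if
   end program caf_example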
1.2. CUDA Fortran

CUDA Fortran is a programming model that extends Fortran to allow direct programming of Nvidia GPUs. It is essentially an amalgamation of CUDA and Fortran, providing a pathway to leverage the computational power of Nvidia GPUs while staying within the Fortran environment.

The CUDA programming model works on the basis that the host (CPU) and the device (GPU) have separate memory spaces. Consequently, data must be explicitly transferred between these two entities. This introduces new types of routines to manage device memory and execute kernels on the device. In CUDA Fortran, routines executed on the device are known as kernels. Kernels can be written in a syntax very similar to standard Fortran but with the addition of qualifiers to specify the grid and block dimensions. This is how CUDA Fortran harnesses the power of GPUs: by organising threads into a hierarchy of grids and blocks around which the hardware is constructed.

An essential aspect of CUDA Fortran programming is understanding how to manage memory effectively, which involves strategically allocating and deallocating memory on the device and copying data between the host and the device. Special attention needs to be given to optimising memory usage to ensure that the full computational capability of the GPU is used.

CUDA Fortran can be integrated seamlessly with existing Fortran codes, offering a less labour-intensive path to GPU programming than rewriting codes in another language [9].
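To make the kernel and launch syntax concrete, the sketch below (hypothetical names, a minimal example rather than part of our solver) defines an attributes(global) kernel in a module and launches it with explicit grid and block dimensions using the chevron syntax; the array assignments perform the host-to-device and device-to-host copies.

   module scale_kernel
      use cudafor
      implicit none
   contains
      attributes(global) subroutine scale(a, c, n)
         integer, value :: n
         real, value :: c
         real, device :: a(n)             ! device array operated on by the kernel
         integer :: i
         i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
         if (i <= n) a(i) = c * a(i)
      end subroutine scale
   end module scale_kernel

   program launch_example
      use cudafor
      use scale_kernel
      implicit none
      integer, parameter :: n = 1024
      real :: a(n)
      real, device :: a_d(n)
      a   = 1.0
      a_d = a                             ! explicit host-to-device transfer
      call scale<<<(n + 255) / 256, 256>>>(a_d, 2.0, n)   ! grid and block dimensions
      a   = a_d                           ! device-to-host transfer
      print *, a(1)                       ! expected value: 2.0
   end program launch_example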
1.3. Heterogeneous Computing with CPUs and GPUs

When developing a code capable of running on a heterogeneous system — that being a system containing more than one type of processor, in this case a CPU-GPU system — it is important to understand the differences in the architecture of parallel execution between the hardware to best distribute tasks and optimize code performance.

Single Instruction, Multiple Data (SIMD) is a parallel computing model where one instruction is executed simultaneously across multiple data elements within the same processor [19]. This process is known as vectorisation, and it allows a single processor to perform the same operation on multiple data points at once.

In GPU architectures, SIMD is combined with multi-threading and implemented in a way where warps (groups of threads) receive the same instruction. This means that while multiple threads perform identical operations in parallel with each other, each thread also performs vectorised (SIMD) operations on its assigned data. This broader GPU parallelism is known as Single Instruction, Multiple Threads (SIMT) [20].

While each GPU thread is capable of performing hundreds or thousands of simultaneous identical calculations through vectorisation, individual CPU cores also support SIMD but on a smaller scale, typically performing tens of operations simultaneously each. However, unlike in GPUs, each core within a CPU multiprocessor can receive its own set of instructions. This means that while individual cores perform SIMD operations, the overall CPU executes multiple instructions across its multiple cores, following the Multiple Instruction, Multiple Data (MIMD) model [19].

This means that besides the difference in scale of simultaneous operations between CPUs and GPUs, there are also architectural differences in how CPU MIMD and GPU SIMT handle parallel tasks. In MIMD architectures like those in CPUs, each core operates independently, allowing more flexibility when encountering conditional branching such as an if statement. This independence helps minimise performance losses during thread divergence because each core can process different instructions simultaneously without waiting for others.

Conversely, in GPU architectures using SIMT, all threads in a warp execute the same instruction. This synchronisation can lead to performance bottlenecks during thread divergence —
for instance, when an if statement causes only some threads within a warp to be active. In such cases, GPUs typically handle divergence by executing all conditional paths and then selecting the relevant outcomes for each thread, a process that can be less efficient than the CPU's approach. This synchronisation requirement makes GPUs highly effective for large data sets where operations are uniform but potentially less efficient for tasks with varied execution paths. Thus, while GPUs excel in handling massive, uniform operations due to their SIMD capabilities within each thread, CPUs offer advantages in scenarios where operations diverge significantly.

It is, therefore, important to understand which problems are best suited to which type of processor, and design the code to distribute different tasks to the appropriate hardware.
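As a small illustration of this divergence, the sketch below (a hypothetical kernel, not part of our solver) contains an if statement that sends even- and odd-indexed threads down different paths; within a warp the GPU steps through both paths in turn, whereas independent CPU cores following the MIMD model could take different branches concurrently.

   module divergence_example
      use cudafor
      implicit none
   contains
      attributes(global) subroutine divergent(a, n)
         integer, value :: n
         real, device :: a(n)
         integer :: i
         i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
         if (i <= n) then
            if (mod(i, 2) == 0) then
               a(i) = a(i) * 2.0      ! even-indexed threads take this path
            else
               a(i) = a(i) + 1.0      ! odd-indexed threads take this path
            end if                    ! the warp serialises the two paths
         end if
      end subroutine divergent
   end module divergence_example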
The layout of this work is as follows. In Sect. 2 we outline our methodology and consider a number of options which are available to solve this problem, depending on the use case. We also present a detailed guide on how compiler linking can be achieved. In Sect. 3 we present the results of our tests on different hardware. In Sect. 4 we summarise our main conclusions.

2. Methodology

The methodology we propose hinges on the robust combination of CUDA Fortran and Coarray Fortran, leveraging their unique strengths to develop efficient high-performance computing applications. The primary challenge is the complex interfacing required to integrate the GPU-accelerated capabilities of Nvidia CUDA Fortran with the distributed memory parallelism of Intel Coarray Fortran. CUDA Fortran is chosen as it allows the user high levels of control over GPU operations - some of which are highlighted below - while remaining close to the traditional Fortran syntax. Likewise, Coarray Fortran, particularly when implemented by Intel, allows for high distributed memory performance with no augmentations to the standard Fortran syntax. Below, we detail the steps involved in our approach.

2.1. Selection of Compilers

CUDA Fortran, being proprietary to Nvidia, requires the use of Nvidia's nvfortran compiler. However, nvfortran does not support Coarray Fortran. In contrast, Intel's ifort compiler supports Coarray Fortran - with performance levels that rival MPI and without the complexity of its syntax - but does not support CUDA Fortran. One other alternative compiler supporting Coarray Fortran is OpenCoarrays. According to our experience, its implementation falls short in terms of speed. However, we note that experiences with OpenCoarrays may vary depending on a variety of factors such as hardware configurations and compiler versions.

When considering most Fortran programmes in general, Intel's ifort compiler is a common choice and offers a high-performance, robust and portable compiler. Consequently, here we demonstrate a hybrid ifort for Coarray Fortran and nvfortran for CUDA Fortran solution.

2.2. Memory Space Configuration

When creating a single code which requires the use of two compilers, a few key considerations are required. For the following text, 'Intel code' refers to code compiled with ifort and 'Nvidia code' refers to code compiled with nvfortran. The various execution streams mentioned below refer to an executed command being made within either the Intel code or the Nvidia code.

2.2.1. Pageable and Pinned Memory

Before continuing, a short background on pageable and pinned memory is required. Pageable memory is the default memory type in most systems. It is so-called because the operating system can 'page' it out to the disk, freeing up physical memory for other uses. This paging process involves writing the contents of the memory to a slower form of physical memory, which can then be read back into high-speed physical memory when needed.

The main advantage of pageable memory is that it allows for efficient use of limited physical memory. By paging out memory that is not currently needed, the operating system can free up high-speed physical memory for other uses. This can be particularly useful in systems with limited high-speed physical memory.

However, the paging process can be slow, particularly when data is transferred between the host and a device such as a GPU. When data is transferred from pageable host memory to device memory, the CUDA driver must first allocate a temporary pinned memory buffer, copy the host memory to this buffer, and then transfer the data from the buffer to the device. This double buffering incurs overhead and can significantly slow down memory transfers, especially for large datasets.

Pinned memory, also known as page-locked memory, is a type of memory that cannot be paged out to the disk. This means that the data is constantly resident in the high-speed physical memory of the system, which can result in considerably faster memory transfers between the host and device.

The main advantage of pinned memory is its speed. Because it can be accessed directly by the device, it eliminates the need for the double-buffering process required by pageable memory on GPUs. This can result in significantly faster memory transfers, particularly for large datasets.

However, pinned memory has its drawbacks. The allocation of pinned memory is more time-consuming than pageable memory - something of concern if not all memory allocation is done at the beginning - and it consumes physical memory that cannot be used for other purposes. This can be a disadvantage in systems with limited physical memory. Additionally, excessive use of pinned memory can cause the system to run out of physical memory, leading to a significant slowdown as the system starts to swap other data to disk. Pinned memory is not part of the Fortran standard and is, therefore, not supported by the Intel compiler. This leads to additional intricacies during the combination of compilers, as we discuss below.
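As an illustration of how pinned host memory is requested in practice, the sketch below (nvfortran-compiled code only, with hypothetical array names and sizes) declares a page-locked host array with the pinned attribute and a device array, and moves data between them with simple assignments.

   program pinned_example
      use cudafor
      implicit none
      real, pinned, allocatable :: h(:,:,:)   ! page-locked (pinned) host memory
      real, device, allocatable :: d(:,:,:)   ! device memory
      integer :: istat

      allocate(h(128,128,128), d(128,128,128), stat=istat)
      h = 1.0
      d = h            ! host-to-device copy from pinned memory, no staging buffer
      h = d            ! device-to-host copy
      deallocate(h, d)
   end program pinned_example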
2.2.2. Host memory: Coarray Fortran

Returning to our problem, when using Intel code and Nvidia code together, the division of parameters and variables between the two is one of the key areas to which attention should be paid. The execution stream within the Nvidia code can only access and operate on variables and parameters which have been declared in the Nvidia code. Likewise, the Intel execution stream can only access and operate on variables and parameters which have been declared in the Intel code. For this reason, we consider a virtual partition within the host physical memory, which can be crossed through the use of subroutines and interfaces, which we discuss in more detail below. This clear division of the physical memory does not happen in reality, as Intel and Nvidia declared variables and parameters will be mixed together when placed in the physical memory, but it is a helpful tool to consider the possible configurations we detail below.

Which side of the partition in the host memory the variables are placed on defines where and at what speed they can be transferred. Figure 1 shows a selection of these options. In the leftmost case, a variable has two identical copies in the host memory, one defined by the Intel compiler to allow CAF parallelisation (simply using the [*] attribute) and one by the Nvidia compiler to allow it to be pinned. Proceeding to the right, in the next case, the variable is defined by the Intel compiler for CAF communication, but a pointer to this variable is defined by the Nvidia compiler, meaning it cannot be pinned and so suffers from the slower pageable memory transfers to the GPU. In the next cases, we consider the options available with MPI. The variable is defined by the Nvidia compiler which allows it to be pinned, but a pointer to this variable is defined by the Intel compiler. This means that CAF transfers are not possible as they are not supported for pointers, meaning MPI is required for distributed memory parallelism. In the final case, the Intel code is not required at all, as the Nvidia compiler supports these MPI transfers natively. To understand any overhead associated with the pointing procedure which allows the combination of CAF and CUDA, we implemented the two MPI solutions in our potential solver and found no appreciable difference in performance. Any small speedup or slowdown between the two options — one using pointers to combine both compilers and one only using the Nvidia compiler — is likely due to the different MPI versions and implementations used in the compilers, rather than by the pointing procedure.

Figure 1: The relative transfer speeds of moving variables between various parts of the host memory and the device memory, where green represents a fast to negligible speed, and yellow represents a slower speed. As CUDA Fortran is used for GPU operations, only Nvidia code can be used to define device variables. MPI is added for comparison.

The partition-crossing use of such a variable is much slower when copying the array than when using a pointer, which has practically no speed overhead at all. In the case that the variable is copied across the partition, the version which is defined in the Nvidia code can be defined with attributes allowed by the Nvidia compiler, one of these being that it is pinned. This would not be possible in the Intel version of the variable because, as previously discussed, pinned memory is not part of the Fortran standard and so is not supported by the Intel compiler.

It is important to note that, as mentioned above, the Nvidia compiler natively supports MPI, and can therefore run Fortran codes parallelised across shared memory, distributed memory and GPUs. While Coarray Fortran offers a simpler syntax, it requires a slightly more complex compilation process and setup when using GPU parallelisation. We, therefore, lay out this process as clearly as possible in this article, to make the simplified speed-up of Coarray Fortran as accessible as possible.

The pointer solution, while allowing a faster cross-partition solution, does not allow for this pinned attribute, as the Nvidia code simply points to the array defined by the Intel compiler, which resides in pageable memory. This means a pageable transfer rather than a pinned transfer takes place when moving the values onto the device, which is slower.

It is important to think about the problem which is being addressed as to which solution is more optimal. On our hardware and testing with a 128^3 array, pinned memory transfers took around 1 ms, as opposed to 3 ms taken by pageable memory transfers. However, during cross-partition operations, using a pointer resulted in approximately ~0 ms of delay, as opposed to 3 ms when using a copy operation. Given the choice between 1) 1 ms GPU transfers but 3 ms cross-partition transfers, or 2) 3 ms GPU transfers but ~0 ms cross-partition transfers, it is important to consider the specific use case.

In our case, and therefore also in most cases, it is more optimal to use a pointer in the Nvidia code. This does not mean pinned memory transfers to the device cannot be used at all. They simply cannot be used for any coarrayed variables, which reside in the Intel code partition. During implementation, it is also clearly easier to implement the pointer configuration,
which we describe now.

First, we show the implementation of two copies of the same array stored in the physical memory of the host, either side of our partition. We note that this is inefficient and makes poor use of the memory available given that parameters and variables requiring transfer are stored twice, although this is not typically a problem for modern high-performance computing systems. Furthermore, every time a variable is updated by the execution stream of one side of the code, a transfer is required across the host physical memory causing some considerable slowdown — particularly in the case of large arrays which are frequently updated. Technically, a transfer could only be made if it is known that the data will be changed on the other side of the partition before being used again, but this introduces a large and difficult-to-detect source for coding errors, especially in complex codes, and so is not advisable. This transfer could also be performed asynchronously to mask the transfer time, but the inefficient use of available physical memory is intrinsic to this solution. An illustration of how this setup works can be seen in Figure 2, the stages of which are as follows:

1. The Nvidia execution stream calls a subroutine, which has been defined and compiled within the Intel code, to get the value of the array (in this case a coarray), which is to be operated on. To do this, an interface is defined within the Nvidia code with the bind(C) attribute, meaning that while the source code is Fortran, the compiled code is C, ensuring a robust connection between the two compilers. The destination subroutine is also written with the bind(C) attribute for the same reason. An example subroutine showing how to use C-binding is provided in the appendix, and a minimal sketch is also given after this list.

2-3. The subroutine returns the value of the array and sets the value of the counterpart array in the Nvidia code to the same value.

4. The array is now updated on the Nvidia partition and can be operated on by the Nvidia execution stream.

5. A pinned memory transfer can now take place, moving the array to the device memory.

6. The array is operated on by the device.

7. A pinned memory transfer allows the array to be moved back to the Nvidia partition of the host memory.

8-11. An inverse of the first four steps takes place, allowing the updated array to be used by the Intel execution stream, and accessible for coarray transfer to the rest of the distributed memory programme.
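The following is a hedged sketch of such a copy-based exchange, with hypothetical module, variable, and subroutine names (the actual routine used in our code is given in the appendix of this work). The coarray and a bind(C) getter belong to the ifort-compiled code; the nvfortran-compiled code declares a pinned counterpart array and an interface to the getter.

   ! Compiled with ifort: the coarray and a bind(C) routine that copies its value out.
   module intel_side
      use iso_c_binding
      implicit none
      real(c_double) :: phi(128,128,128)[*]          ! coarrayed variable, CAF-communicable
   contains
      subroutine get_phi(buf) bind(C, name="get_phi")
         real(c_double), intent(out) :: buf(128,128,128)
         buf = phi                                    ! copy across the virtual partition
      end subroutine get_phi
   end module intel_side

   ! Compiled with nvfortran: a pinned counterpart array and an interface to the getter.
   module nvidia_side
      use iso_c_binding
      use cudafor
      implicit none
      real(c_double), pinned, allocatable :: phi_nv(:,:,:)   ! can be pinned, unlike the coarray
      interface
         subroutine get_phi(buf) bind(C, name="get_phi")
            use iso_c_binding
            real(c_double), intent(out) :: buf(128,128,128)
         end subroutine get_phi
      end interface
   contains
      subroutine refresh_phi()
         if (.not. allocated(phi_nv)) allocate(phi_nv(128,128,128))
         call get_phi(phi_nv)       ! steps 1-3: retrieve the current value from the Intel code
      end subroutine refresh_phi
   end module nvidia_side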
What was found to be more efficient in our use case, and in general something which is true for the majority of cases in terms of execution speed, memory usage and coding simplicity, was to declare a parameter or variable once in either the Intel or Nvidia code, and then create a pointer to it on the other side of the partition. This means that whenever, for example, the Intel execution stream requires a large array from the Nvidia code, no slowdown is caused during the transfer of the array to the other partition and no unnecessary overhead in the physical memory is present.

This pointing can be accomplished using an intermediate series of subroutines, calls, and pointers, all of which are written in Fortran but declared with the C binding to ensure a robust and common connection between the two compilers, as mentioned above. In the prior duplication case, these subroutines and calls are required before the use of any parameters or variables which have been changed by the alternate execution stream since their last use, to ensure an up-to-date version is used. However, with this new pointer case, the subroutines and calls are only required once at the beginning of the run to set up and initialise the pointing, making it harder to mistakenly forget to update an array before using it in the source code. This configuration can be seen in Figure 3. The steps of this configuration, to set up a coarrayed variable capable of distributed memory transfer and a counterpart pointer which allows transfer to the device for GPU-accelerated operations, are as follows (a minimal code sketch is given below):

1. The Nvidia execution stream again calls a subroutine, defined and compiled within the Intel code, to get the address in memory of the array (in this case a coarray), which is to be operated on. This must be declared with the target attribute in the Intel code, so it can be pointed to. As before, an interface is defined within the Nvidia code with the bind(C) attribute, to allow this connection. The destination subroutine is also written with the bind(C) attribute, and the pointer is made a C pointer for the same reason.

2. The subroutine returns the address of the array as a C pointer.

3. The C pointer is converted to a Fortran pointer for use by the Nvidia code. The original coarray, as defined by the Intel code and residing in the Intel partition of the host memory, can now be operated on and transferred to the device for GPU-accelerated operations. We note that in this case the data are transferred from the pageable host memory to the device, thus incurring a certain overhead, as described above.

This handling of sensitive cross-compiler operations with C bindings was found to be essential, as relying on Fortran-Fortran interfacing between the compilers led to non-descript segmentation errors. The use of C binding ensures a robust solution to this issue, with no overhead in the execution speed and no deviation from the Fortran syntax in the source code.
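A hedged sketch of this pointer-based exchange is given below, again with hypothetical names. The ifort-compiled side exposes the address of its coarray through c_loc; the nvfortran-compiled side retrieves that address once at start-up and converts it to a Fortran pointer with c_f_pointer, after which both execution streams operate on the same memory.

   ! Compiled with ifort: the coarray, declared with the target attribute, and a
   ! bind(C) routine that returns its address as a C pointer.
   module intel_side_ptr
      use iso_c_binding
      implicit none
      real(c_double), target :: phi(128,128,128)[*]
   contains
      subroutine get_phi_addr(p) bind(C, name="get_phi_addr")
         type(c_ptr), intent(out) :: p
         p = c_loc(phi)
      end subroutine get_phi_addr
   end module intel_side_ptr

   ! Compiled with nvfortran: a Fortran pointer that is associated with the
   ! Intel-owned coarray once, during initialisation.
   module nvidia_side_ptr
      use iso_c_binding
      implicit none
      real(c_double), pointer :: phi_nv(:,:,:) => null()
      interface
         subroutine get_phi_addr(p) bind(C, name="get_phi_addr")
            import :: c_ptr
            type(c_ptr), intent(out) :: p
         end subroutine get_phi_addr
      end interface
   contains
      subroutine init_phi_pointer()
         type(c_ptr) :: cp
         call get_phi_addr(cp)                          ! step 2: address returned as a C pointer
         call c_f_pointer(cp, phi_nv, [128,128,128])    ! step 3: converted to a Fortran pointer
      end subroutine init_phi_pointer
   end module nvidia_side_ptr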
2.2.3. Host memory: MPI

This technique is almost identically applicable when using MPI. However, it should be noted that in that case, an additional benefit is present in that arrays which are communicated using the MPI protocol do not require a coarray attribute in their definition. This means that, opposite to the case in Figure 3, the array can be defined in the Nvidia code, and the pointer in the Intel code. In this case, the full speed-up of data transfer for the pinned memory can be utilized. An illustration of this can be seen in Figure 4.
Figure 2: Configuration of physical memory and the procedure required for retrieval across the partition allowing pinned memory for fast device transfers but
slow partition crossing by copying memory. In this case, a variable has been changed by the Intel execution stream before this procedure starts, and so the Nvidia
execution stream is required to retrieve an updated copy of this array before operating on it, transferring it to the device and performing further GPU-accelerated
operations. The Intel execution stream then takes over, retrieving the value(s) of the updated variable, performing its operations, and then allowing for transfer to
other host memory spaces in the distributed memory. This host-host transfer must occur every time a device operation is performed with a coarrayed variable.

Figure 3: Configuration of physical memory and the procedure required for retrieval across the partition using pointers allowing for no transfer overhead in host-host
transfers, but requiring the use of pageable memory for device transfers and incurring an overhead. In this case, the linking is performed only once during the first
stages of the running of the code, meaning both host execution streams are always accessing the same version of host variables.

2.3. CPU Affinity with GPUs

In any high-performance computing application using either PGAS, or more widely the single program multiple data (SPMD) technique, the ability to control the allocation of programme processes to specified processing units — commonly referred to as CPU affinity or process pinning — is pivotal for performance optimisation. This is especially relevant in systems with a Non-Uniform Memory Access (NUMA) architecture (common in HPC), and/or in systems with multiple CPUs and GPUs.

In a NUMA architecture, the term 'NUMA node' refers to a subsystem that groups sets of processors within one multiprocessor together and connects them with a memory pool, forming a closely-knit computational unit. Unlike a Uniform Memory Access (UMA) configuration where each processor within the multiprocessor has equal latency for memory access across the system, NUMA nodes have differing latencies depending on whether the accessed memory resides within the same NUMA node or in a different one. Taking the example of one of our systems, described in detail later, this can be seen in Figures 5 and 6. When using a hybrid programming model, such as a PGAS language like Coarray Fortran in combination with OpenMP in our case, it becomes specifically important to remember these differences.

Therefore, a well-configured system strives to keep data and tasks localised within the same NUMA node whenever possible, mitigating the impact of varying memory access latencies across NUMA nodes. This concept is critical for understanding the intricacies of CPU affinity and process pinning in multi-socket, multi-GPU systems.

The Linux command numactl --hardware can be used to determine relative distances between NUMA nodes, the memory available to each group, and the ID of each processor within each NUMA node. If more exact latency numbers are required between NUMA nodes, tools like the Intel Memory Latency Checker can be used. If hyperthreading is enabled on a system, then additional processor IDs will be shown in addition to those referring to physical processors. In our case, our first NUMA node, NUMA node 0, encompasses the processors 0-15 and 128-143, with the former referring to the physical processors and the latter referring to their hyperthreaded counterparts. We do not concern ourselves with hyperthreading in this paper, as we observed no speed improvements when using it.

A second important aspect of the NUMA architecture when considering GPUs is the connection of GPUs to the processors, found using the Linux command nvidia-smi topo --matrix. In our architecture, a GPU is connected to one NUMA node, with one device connected per socket. As seen in Figure 6, each NUMA node within a socket has a slight speed penalty associated with communication with NUMA nodes in the same socket (20% increase) and a high overhead associated with communicating with the other socket (220% increase). When a coarray image on one socket, therefore, attempts to communicate with the GPU of the other, there is a big overhead in the communication. Where our GPUs are connected is shown in the figure.

For practical application within Intel Coarray Fortran, the Intel MPI Library offers environment variables to facilitate CPU affinity settings:

export I_MPI_PIN=on
export I_MPI_PIN_DOMAIN=[HEX_CONFIG]

where HEX_CONFIG is a hexadecimal (hex) code corresponding to the IDs of the compute cores within the multiprocessors which are to be connected to the GPUs. This is described in more detail below.

To effectively set CPU affinity through Intel MPI's environment variables, one must provide the correct hexadecimal (hex) codes corresponding to the CPU or core IDs. These hex codes serve as unique identifiers for the CPUs and are critical for pinning computational tasks accurately. In the case of our code, each coarray image is designed to require one GPU. It is, therefore, important to ensure that each coarray image is running on the socket containing the NUMA node that is connected to the correct GPU, namely, to the GPU the coarray image is using. Other situations where multiple coarray images use the same GPU are also possible, in which case these images should be pinned within the same socket the GPU is connected to. However, if more images are added to a socket, they each naturally must reside on fewer cores, meaning that any CPU parallelisation is limited. Determining the balance between CPU and GPU parallelisation is something that will vary between use cases.

As aforementioned, in our example, we are interested in CPU affinity because we want to ensure that each coarray image runs on the socket containing the NUMA node that is connected to the GPU it is using.

Once the desired destination CPU IDs have been identified, they can be converted to hexadecimal format in the following way. Given two coarray images are running on one node (in this case, one image on each socket), the CPU code of the first image should be pinned to NUMA node 3 and the CPU code of the second image to NUMA node 7, with the corresponding GPU code pinned to GPUs 0 and 1 respectively. The GPU number used by the coarray image can simply be set using the Nvidia CUDA Fortran line

istat = cudaSetDevice(mod(irank-1,gpusPerNode))

where irank is the Coarray Fortran image number, starting from 1, and gpusPerNode is the number of GPUs attached to each computational node, this being 2 in the case shown in Figures 5 and 6. This will effectively alternate between 0 and 1 as the image number (irank) increases from 1 to the last image.
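A hedged sketch of how this line might be wrapped is shown below (hypothetical routine name): an nvfortran-compiled, bind(C) subroutine receives the image number, obtained with this_image() in the ifort-compiled code, and binds that image to one of the node's GPUs.

   subroutine bind_image_to_gpu(irank, gpusPerNode) bind(C, name="bind_image_to_gpu")
      use cudafor
      use iso_c_binding
      implicit none
      integer(c_int), value :: irank          ! coarray image number, from this_image()
      integer(c_int), value :: gpusPerNode    ! GPUs attached to each node
      integer :: istat
      istat = cudaSetDevice(mod(irank - 1, gpusPerNode))
      if (istat /= cudaSuccess) print *, 'cudaSetDevice failed on image', irank
   end subroutine bind_image_to_gpu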
Taking our test case, it is known that NUMA node 3 corresponds with cores 48-63. This first needs to be encoded in binary, considering all the 128 cores we have available:

00000000000000000000000000000000
00000000000000001111111111111111
00000000000000000000000000000000
00000000000000000000000000000000
This is then converted to a hex code: FFFF0000000000000000. The same is done for NUMA node 7 (cores 112-127) and the appropriate hex code (FFFF) is generated. Once obtained, these hex codes can be placed in the I_MPI_PIN_DOMAIN environment variable to set CPU affinity precisely as:

export I_MPI_PIN=on
export I_MPI_PIN_DOMAIN=[FFFF0000000000000000,FFFF]

I_MPI_PIN_DOMAIN here receives two arguments for positions within the node ([position 1, position 2]), given that in our case we want to pin two coarray images to each node. Each coarray image is then allocated, starting with the first image (1) and proceeding in order to the last image, to position 1 on node 1, position 2 on node 1, position 1 on node 2, position 2 on node 2, and so on.

Activating these environment variables and running the Coarray Fortran executable binds the computation to the designated CPU affinity. The fidelity of this configuration can be corroborated via the I_MPI_DEBUG environment variable. The performance difference between pinning coarray images on the correct, and incorrect, sockets for their associated GPUs can be seen in our scaling tests below.

Figure 4: Configuration of physical memory and the procedure required for retrieval across the partition using pointers, allowing for no transfer overhead in host-host transfers and allowing for the use of pinned memory for device transfers, incurring the minimal overhead. As in the previous case, the linking is performed only once during the first stages of the running of the code, meaning both host execution streams are always accessing the same version of host variables.

Figure 5: Architectural schematic detailing the interconnect between GPUs and NUMA nodes within a dual-socket configuration.

Figure 6: Schematic representation of NUMA node distances within a multi-socket, multi-GPU environment. NUMA nodes closely connected to the GPUs are emphasised.

2.4. Compiling and Linking

Finally, we compile the CUDA Fortran code with nvfortran into constituent object files and then into one shared object library file. The Coarray Fortran code is compiled with ifort into constituent object files. To link the two together, we use ifort, ensuring that the linking is performed in an environment where both CUDA Fortran and Coarray Fortran libraries are included in the library path. We outline a simplified procedure here:
1. Compile the CUDA Fortran device code into a position-independent machine code object file:

   nvfortran -cuda -c -fpic device.cuf

2. Create a shared object library file using the object file:

   nvfortran -cuda -shared -o libdevice.so device.o

3. Compile the Fortran host code into a machine code object file:

   ifort -c host.f90

4. Create a CAF configuration file:

   echo '-n 2 ./a.out' > config.caf

5. Link the host machine code object file and the device machine code shared object library file, also using CAF distributed memory flags:

   ifort -coarray=distributed -coarray-config-file=config.caf -L. host.o -ldevice -o a.out

Before execution of the programme, the relative path to the source directory should also be appended to the dynamic shared libraries path list environment variable LD_LIBRARY_PATH.

Our methodology involves a meticulous combination of CUDA Fortran's GPU acceleration and Coarray Fortran's efficient data distribution across multiple compute nodes. It requires careful attention to compiler selection, memory allocation, and interfacing between two different programming paradigms. The result is a seamless integration of these distinct models into a single high-performance computing application with high-performance GPU and distributed memory acceleration.

Figure 7: Results of one-sided memory transfers between two images on separate nodes both with and without the adapted RMA procedure, showing a slight speed overhead. (a) Entire data range from 0 to 10^7 bytes shown. (b) Data range cropped to that between 10^4 and 10^6 bytes.

2.5. Intel Coarray Fortran with AMD processors

Intel's ifort does not officially support the running of code on AMD processors. Therefore, there are inevitably some complications when attempting to do this. Many high-performance computing centres use AMD processors, including the ones used by us for this study. We faced errors when running our code on AMD processors from the remote memory access (RMA) part of Coarray Fortran (CAF). Unlike pure MPI, CAF uses a one-sided communications protocol, which means that instead of each image being told to send and receive data, one image is allowed to access the data from another without any action from the target image. CAF still uses MPI commands, and therefore requires an MPI version, but only uses the one-sided protocols.

We used the ifort compiler in intel/2019, as it was the most recent version of the Intel compiler available to us, and found that it cannot perform such RMA operations when AMD processors are being used, and therefore CAF programmes cannot run. There are three solutions we identified to this problem, the first of which is to change the version of MPI which is used by ifort to one released by OpenMPI. The second is to change the Open Fabric Interface (OFI) version. The third, and in our experience the fastest, solution is to change the internal default transport-specific implementation. This still allows RMA operations to take place but through a slightly different, AMD processor-compatible, method.

This can be done by setting the environment variable MPIR_CVAR_CH4_OFI_ENABLE_RMA=0. This is an environment variable setting for the Message Passing Interface (MPI) implementation, specifically for the MPICH library, which is a widely used open-source MPI implementation. The Intel Fortran compiler supports MPI for parallel programming, and the MPICH library can be used in conjunction with the Intel Fortran compiler for this purpose. The environment variable MPIR_CVAR_CH4_OFI_ENABLE_RMA is related to the CH4 device layer in MPICH, which is responsible for communication between processes. CH4 is a generic communication layer that can work with different network interfaces, and OFI (Open Fabric Interface) is one such interface. The MPIR_CVAR_CH4_OFI_ENABLE_RMA variable controls the usage of RMA operations in the CH4 device layer when using the OFI network interface. RMA operations enable processes to directly access the memory of other processes without involving the target process, thus providing low-latency communication and better performance.

To understand the overhead associated with changing the internal default transport-specific implementation, we ran speed tests using the Intel Benchmarking tool on some of the MPI operations which are used by CAF both with and without OFI
RMA disabled. We used the unidir_put and unidir_get commands for this comparison, the results of which can be seen in Figure 7. As is seen in this figure, there is little effect above a few x10^6 bytes, saturating to no speed overhead incurred by changing the RMA implementation as we describe. Therefore, our comparisons of speeds between MPI and our slightly augmented CAF can be seen as representative of a direct comparison between MPI and CAF.

3. Performance Analysis and Results

To test our integration of CAF with CUDA Fortran, and to compare it with a hybrid MPI-CUDA Fortran approach, we employed the convolution method for solving potential fields on nested grids in three dimensions [13], referred to from hereon in as CM4NG. This method provides an appropriate benchmark due to its reliance on all forms of shared memory, distributed memory, and GPU parallelism. We performed testing on two different clusters with different hardware configurations. These are the Vienna Scientific Cluster 5 (VSC5, https://vsc.ac.at/) and Compute Canada's Narval (https://www.calculquebec.ca/, https://alliancecan.ca/). The specifications of the hardware configurations are shown in Table 1.

The domain distribution performed on each of the clusters for the convolution theorem on nested grids can be seen in Figure 8. In our case, each nested mesh was allocated one GPU, with the number of images per socket varying based on the cluster configuration. GPU utilisation was over 50% during the test cases, meaning that a computational deceleration would be observed if hosting multiple nested grids on one GPU. If hardware availability is an issue, this may be outweighed by HPC queue times, and a technique called 'CUDA Streams' could be used to 'split' the GPUs. However, these were not considered in this study.

In the following text, N is used to refer to the length of the three-dimensional array used by each mesh level in the convolution method. N_depth is used to denote the number of mesh levels, each one corresponding to a Coarray Fortran (and MPI when used for comparison purposes) image. For these tests, RMA operations in the CH4 device layer using the OFI interface were disabled, as explained above, for both the CAF and MPI tests to ensure comparability. In the following text, CAF+ is used to refer to the fully parallelised hybrid Coarray Fortran, CUDA, and OpenMP approach explained above. MPI+ is similarly used to refer to the aforementioned MPI, CUDA, and OpenMP approach.

3.1. GPU and distributed memory acceleration

To assess the performance of our integration, we first compared the execution times and scaling efficiency of our solution when only parallelised with OpenMP for shared memory parallelism, with shared memory parallelism combined with single-GPU acceleration, and then compared this with the fully parallelised and optimised Coarray Fortran-CUDA-OpenMP hybrid approach. The results of these tests and how they scale, for both N = 64 and N = 128, can be seen in Figure 9 for different numbers of nested levels. It should be noted that this figure uses VSC5 for the single-node results and Narval for the multi-node results. The performance of the two clusters using distributed memory is shown later.

It can be seen that GPU acceleration is highly desirable as it enables acceleration of a factor of approximately 10. Distributed memory parallelism further accelerates our solver by a factor of at least 2.5. In cases such as these, where a potential solver forms part of a larger code which is all Coarray Fortran parallelised, the acceleration is further enhanced by requiring fewer transfers all to one node to perform the GPU operations. However, such additional complexities are not considered here.

3.2. CPU-GPU affinity

To present the importance of CPU-GPU affinity, we confined each nested mesh level to a NUMA node and then performed speed testing by running the code with perfect affinity and perfectly inverse affinity. These optimal and pessimal configurations are illustrated in Figure 10. Case a) presents the perfect affinity configuration when the coarray image runs its GPU tasks on the device directly connected to its CPU socket, while case b) shows the worst affinity configuration when the coarray image has to communicate through the other socket to perform its GPU tasks.

The results of this testing can be seen in Figure 11. The degree to which CPU-GPU affinity impacts the performance of CM4NG is dependent on the size of the computational task, given that the CPU-GPU transfer time takes up a smaller portion of the solving time as the number of computations increases with grid size. In our case, when it has the most impact, the optimal configuration is approximately 40% faster than the pessimal configuration. This is seen when using two nested meshes, each with 64^3 resolution. For the 64^3 resolution cases, the difference between optimal and pessimal configurations saturates, to the point where for 12 nested mesh levels, the advantage is negligible. For the 128^3 case, the difference between optimal and pessimal configurations is negligible for all mesh depths.

3.3. GPUs per node

To test the performance of the code when running on a different number of GPUs per node, we performed scaling tests on VSC5 and Narval. The results of this can be seen in Figure 12. Notable here is the speed-up of the code when using 4 GPUs per node on Narval as opposed to 2 GPUs per node on VSC5. Although Narval's CPUs support a slightly higher clock speed and a larger L3 cache, these differences are mostly due to the fact that inter-node communication is more costly than intra-node communication. When 4 GPUs per node are used as opposed to 2, only 3 nodes in total are needed as opposed to 6. This reduces the transfer time introduced by distributed memory parallelism, as the distribution is across a smaller number of nodes.
Cluster     CPU/node                                      GPU/node                        Interconnect
1  VSC-5    2x AMD EPYC 7713, 2.0 GHz, 33 M L3 cache      2x Nvidia A100, 40 GB memory    Infiniband Mellanox HDR
2  Narval   2x AMD Milan 7413, 2.65 GHz, 128 M L3 cache   4x Nvidia A100, 40 GB memory    Infiniband Mellanox HDR

Table 1: Hardware configurations for performance testing.

Figure 8: a) Illustration of a two-dimensional slice of the nested grid structure employed in performance testing, showing four nested mesh levels, with each colour
corresponding to a new grid level. b) Schematic representation of the grid distribution across four GPUs on VSC5, where each socket has one NUMA node attached
to one GPU. c) Schematic representation of the grid distribution across four GPUs on Narval, where each socket has two GPUs attached, each to a different NUMA
node. More nested grids require more nodes.

3.4. Coarray Fortran vs MPI

To test the efficiency of the Coarray Fortran implementation explained above, we performed scaling tests for both the CAF and MPI versions of the solver. The results of these tests can be seen in Figure 13. We observe that the performance of the CAF and MPI versions of the code are comparable, with the MPI version performing approximately 10% faster for 12 nested mesh levels for the 64^3 resolution case, and approximately 5% faster for the 128^3 resolution case. This difference becomes more pronounced as the size of the computational domain decreases, with CAF being 30% slower for 64^3 with 2 nested meshes and 50% for 128^3. This is partly due to the ability of the MPI code to use pinned memory for device transfers, as described above. However, the main reason for this difference is the faster implementation of MPI than CAF by Intel in the ifort compiler. In the cases where the number of mesh levels is lower, the computation forms less than half of the total time to complete the solution, with the majority of time being spent on communication. This means the result is extremely sensitive to the speed of the communication. In real applications, for the production of useful results, CM4NG is used at 8 nested mesh levels and above. In cases where data transfers form the majority of the time to complete the solution, and actual computation forms the minority, MPI becomes more desirable.

3.5. Execution times

The execution times for the different hardware configurations, as seen, show that Coarray Fortran performs comparably to MPI, scaling in the same manner with no worsening performance as the computational domain increases, both for the nested mesh size and the number of nested meshes. All results demonstrate a near-proportional scaling of the code with nested mesh level. In our method, no asynchronous transfers can be performed, so all effects of transfers between the device and the host are reflected in the results. However, as is seen, when more GPUs are clustered on a computational node, the code performs better, given fewer inter-node transfers are required, which are more costly than intra-node transfers.

We found that for a CM4NG computational domain of a size producing adequate accuracy, the CAF-CUDA-OMP integration showed a 5% reduction in speed when compared to the MPI-CUDA Fortran approach. This, for us, is considered in the context of the simplicity of implementing Coarray Fortran when compared to MPI. According to our experience, many legacy Fortran codes were written to be run in a serial mode. Such codes can be parallelised with modest efforts to be run on a single node using OpenMP, unless common blocks were utilised, in which case parallelisation is difficult and requires introducing modules. In any case, a distributed memory parallelisation of legacy Fortran codes using MPI requires a significant rewrite of the code which, in terms of the required efforts, may be comparable to writing a new code from scratch. With Coarray Fortran the distributed memory parallelisation of legacy Fortran codes becomes feasible with relatively little effort but significant speed-up. This is because CAF requires introducing only several additional lines and variables to the code, keeping the existing code structure and syntax intact. The fact that Coarray Fortran can be integrated with CUDA Fortran makes the coarray language standard particularly attractive for scientists.
Figure 9: Scaling performance of the convolution method on nested grids when using OpenMP (OMP) parallelisation, OMP combined with CUDA GPU accelera-
tion, and when OMP and CUDA are combined with Coarray Fortran (CAF) distributed memory parallelisation. For this final case of full parallelisation, each nested
mesh is assigned one GPU. Left panel: 64^3 cells per nested mesh. Right panel: 128^3 cells per nested mesh.

Figure 10: Two configurations for a socket-GPU pair on nodes with two sockets containing NUMA-architecture multiprocessors and two GPUs, where GPUs 0 and 1 are connected to NUMA nodes 3 (socket 0) and 7 (socket 1) respectively. Two processes (light blue and dark blue) are running on the node, one on each socket. a) Shows the optimal configuration of the CPU and GPU processes, where the CPU image runs its GPU tasks on the device directly connected, in this case, process 0 on NUMA node 3 (socket 0) runs on GPU 0, and process 1 on NUMA node 7 (socket 1) runs on GPU 1. b) Shows the worst-case configuration where CPUs perform their GPU calculations on the device connected to the opposite socket, in this case, process 0 on NUMA node 0 (socket 0) runs on GPU 1, and process 1 on NUMA node 4 (socket 1) runs on GPU 0.

Figure 11: Scaling performance of CM4NG with full parallelisation via OpenMP (OMP), CUDA and Coarray Fortran (CAF), using 2 GPUs per node on VSC5. CPU-GPU affinity is shown for the optimal and pessimal configurations.

nificant rewrite of the code which, in terms of the required ef-
forts, may be comparable to writing a new code from scratch.
With Coarray Fortran the distributed memory parallelisation of
legacy Fortran codes becomes feasible with relatively little ef-
fort but significant speed-up. This is because CAF requires
introducing only several additional lines and variables to the
code, keeping the existing code structure and syntax intact. The
fact that Coarray Fortran can be integrated with CUDA Fortran
makes the coarray language standard particularly attractive for
scientists.

4. Conclusions

In this study, we have successfully demonstrated a robust


methodology for integrating Intel Coarray Fortran with Nvidia
CUDA Fortran, supplemented by OpenMP, to optimise Fortran
codes for high-speed distributed memory, GPU, and multipro-
cessor acceleration. Our approach, focusing on the fusion of
these distinct yet powerful paradigms, has shown significant
promise in enhancing the computational performance of For-
Figure 12: Scaling performance of CM4NG on clusters VSC5 and Narval, us-
ing 2 GPUs/node and 4 GPUs/node respectively. CPU-GPU affinity is optimal
tran codes without the need for extensive rewrites or departure
for these results. CM4NG is configured to allocate each nested mesh level one from familiar syntax.
GPU. The code is fully parallelised with Coarray Fortran, CUDA and OpenMP
(CAF+). Both grids with 643 and 1283 cells per nested mesh level are shown. 1. Performance Improvements: Our results indicate that the
Coarray Fortran-CUDA Fortran integration leads to sub-
stantial performance gains. We observed, for our use case,
only a 5% reduction in execution time compared to a sim-
ilar MPI-CUDA Fortran approach. The performance of
CAF-CUDA Fortran is significant, considering the com-
parative ease of implementation of Coarray Fortran over
MPI. Our findings underscore the potential of Coarray
Fortran, especially for the scientific community that re-
lies heavily on Fortran codes. Its straightforward syntax
and implementation make it an accessible and powerful
tool for researchers who may not have extensive experi-
ence with more complex distributed memory parallelism
techniques.

2. Scalability: The near-linear scaling of our potential solver code with the increase in nested mesh levels and the number of nested meshes highlights the efficiency of our approach, and this scaling is present in both the MPI and Coarray Fortran implementations of the code. This scalability allows the code to run at competitive speed on a range of hardware, with the achieved speed depending on the hardware's performance.

3. Hardware Utilisation: Our methodology's ability to leverage multiple GPUs effectively, as evidenced by improved performance on systems with a higher concentration of GPUs per computational node, points towards the importance of hardware-aware code optimisation and the minimisation of distributed memory transfers. The speed of these communications should therefore be as high as possible and should not be degraded by transfers other than those to and from GPU memory. Our methodology avoids such extra transfers by using pointers (see Appendix A), ensuring optimal Coarray Fortran and MPI performance.

Figure 13: Scaling performance of CM4NG using 2 GPUs per node on VSC5, when CPU-GPU affinity is optimal. Fully parallelised Coarray Fortran, CUDA and OpenMP (CAF+) and MPI, CUDA and OpenMP (MPI+) are compared for both 64³ and 128³ cells per nested mesh level. CM4NG uses one GPU per nested mesh level.
4. CPU-GPU affinity: When using Coarray Fortran and CUDA Fortran together, we observe an increase in performance when CPU-GPU affinity is optimised, and show that this is particularly important when data transfer times make up a significant proportion of the complete solution time. We demonstrate how CPU-GPU affinity can be optimised when using Coarray Fortran via the I_MPI_PIN_DOMAIN environment variable; a minimal sketch of the device-selection step follows this list.

5. Portability: As our approach relies in part on the Intel Fortran compiler, we outline effective solutions to enable the running of Coarray Fortran on AMD processors, the fastest of these being to change the remote memory access protocol used by CAF.
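To make the affinity step concrete, the sketch below is illustrative only: the subroutine name and the local_rank argument are our own and are not taken from CM4NG, and I_MPI_PIN_DOMAIN=socket (set at launch time) is just one possible pinning choice. The idea is that each process, once pinned to a socket, selects the GPU attached to that socket.

! Illustrative sketch: select the GPU belonging to this process's socket.
! Assumes two GPUs per node and that local_rank (0 or 1, the index of the
! process on the node) is supplied by the calling code.
subroutine select_local_gpu(local_rank)
  use cudafor
  implicit none
  integer, intent(in) :: local_rank
  integer :: istat, ngpus

  istat = cudaGetDeviceCount(ngpus)
  istat = cudaSetDevice(mod(local_rank, ngpus))
end subroutine select_local_gpu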
In conclusion, our integration of Coarray Fortran with CUDA Fortran and OpenMP offers a significant step forward in modernising Fortran-based scientific computing. Multiple implementations are available and are compared and contrasted depending on the use case.

Acknowledgements

We are thankful to the referee for the comments and suggestions that helped to improve the manuscript. This work was supported by the FWF project I4311-N27 (J.M., E.I.V.) and RFBR project 19-51-14002 (I.K.). Benchmarks were performed on the Vienna Scientific Cluster (VSC)⁴ and on the Narval Cluster provided by Calcul Québec⁵ and the Digital Research Alliance of Canada⁶.

⁴ https://vsc.ac.at/
⁵ https://www.calculquebec.ca/
⁶ https://alliancecan.ca/

References

[1] A. R. Brodtkorb, T. R. Hagen, M. L. Sætra, Graphics processing unit (GPU) programming strategies and trends in GPU computing, Journal of Parallel and Distributed Computing 73 (1) (2013) 4–13. doi:10.1016/J.JPDC.2012.04.003.
[2] J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, J. C. Phillips, GPU computing, Proceedings of the IEEE 96 (5) (2008) 879–899. doi:10.1109/JPROC.2008.917757.
[3] S. Santander-Jiménez, M. A. Vega-Rodríguez, J. Vicente-Viola, L. Sousa, Comparative assessment of GPGPU technologies to accelerate objective functions: A case study on parsimony, Journal of Parallel and Distributed Computing 126 (2019) 67–81. doi:10.1016/J.JPDC.2018.12.006.
[4] M. Khairy, A. G. Wassal, M. Zahran, A survey of architectural approaches for improving GPGPU performance, programmability and heterogeneity, Journal of Parallel and Distributed Computing 127 (2019) 65–88. doi:10.1016/J.JPDC.2018.11.012.
[5] J. D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Krüger, A. E. Lefohn, T. J. Purcell, A survey of general-purpose computation on graphics hardware, Computer Graphics Forum 26 (1) (2007) 80–113. doi:10.1111/J.1467-8659.2007.01012.X.
[6] M. Garland, NVIDIA GPU, in: D. Padua (Ed.), Encyclopedia of Parallel Computing, Springer, Boston, MA, 2011, pp. 1339–1345. doi:10.1007/978-0-387-09766-4.
[7] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, K. Skadron, A performance study of general-purpose applications on graphics processors using CUDA, Journal of Parallel and Distributed Computing 68 (10) (2008) 1370–1380. doi:10.1016/J.JPDC.2008.05.014.
[8] G. Ruetsch, An Easy Introduction to CUDA Fortran (10 2012). URL https://developer.nvidia.com/blog/easy-introduction-cuda-fortran/
[9] G. Ruetsch, M. Fatica, CUDA Fortran for Scientists and Engineers, Morgan Kaufmann, 2014. doi:10.1016/C2013-0-00006-0.
[10] NVIDIA, CUDA Fortran Programming Guide, Tech. rep. (11 2023).
[11] B. Van De Wiele, A. Vansteenkiste, B. Van Waeyenberge, L. Dupré, D. De Zutter, Fast Fourier transforms for the evaluation of convolution products: CPU versus GPU implementation, International Journal of Numerical Modelling: Electronic Networks, Devices and Fields 27 (3) (2014) 495–504. doi:10.1002/JNM.1960.
[12] J. Binney, S. Tremaine, Galactic dynamics, Princeton University Press, 1987.
[13] E. I. Vorobyov, J. McKevitt, I. Kulikov, V. Elbakyan, Computing the gravitational potential on nested meshes using the convolution method, Astronomy & Astrophysics 671 (2023) A81. doi:10.1051/0004-6361/202244701.
[14] W. Gropp, MPI (Message Passing Interface), in: D. Padua (Ed.), Encyclopedia of Parallel Computing, Springer, Boston, MA, 2011, pp. 1184–1190. doi:10.1007/978-0-387-09766-4.
[15] A. Sharma, I. Moulitsas, MPI to Coarray Fortran: Experiences with a CFD Solver for Unstructured Meshes, Scientific Programming 2017 (9 2017). doi:10.1155/2017/3409647.
[16] S. Garain, D. S. Balsara, J. Reid, Comparing Coarray Fortran (CAF) with MPI for several structured mesh PDE applications, Journal of Computational Physics 297 (2015) 237–253. doi:10.1016/J.JCP.2015.05.020.
[17] R. W. Numrich, J. Reid, Co-array Fortran for parallel programming, ACM SIGPLAN Fortran Forum 17 (2) (1998) 1–31. doi:10.1145/289918.289920.
[18] G. Almasi, PGAS (Partitioned Global Address Space) Languages, in: D. Padua (Ed.), Encyclopedia of Parallel Computing, Springer, Boston, MA, 2011, pp. 1539–1545. doi:10.1007/978-0-387-09766-4.
[19] M. J. Flynn, Some Computer Organizations and their Effectiveness, IEEE Transactions on Computers C-21 (9) (1972) 948–960. doi:10.1109/TC.1972.5009071.
[20] M. McCool, J. Reinders, A. Robison, Structured Parallel Programming: Patterns for Efficient Computation, 1st Edition, Elsevier Science, San Diego, 2012.

Appendix A. C-Binding in Fortran

C-binding in Fortran is a powerful feature that allows interoperability with C, enabling the use of C data types and the calling of C functions and libraries. This feature is particularly useful in high-performance computing, where leveraging both Fortran's computational efficiency and C's extensive library ecosystem can be advantageous.

In the context of our study, C-binding plays a critical role in enabling robust interaction between the nvfortran and ifort compilers, allowing them to communicate according to the C standard. When implementing pure Fortran interfaces between the compilers, we met numerous segmentation faults, which were avoided when communication was facilitated by binding the connection with C. This section provides an overview of how C-binding is used in our methodology and a simple example for illustration.

Appendix A.1. Using C-binding in Fortran to combine Fortran compilers and allow a robust common memory space

In the case below, code compiled with compiler A will receive the memory location of a variable owned by code compiled with compiler B. The code compiled by compiler A will then be able to access the variable directly.

To use C-binding in Fortran, specific steps must be followed:

B1 Define the subroutine to return the memory location: In code compiled with compiler B, define a subroutine with the bind(C) attribute, which returns the C address of a variable.

A1 Define the interface to the subroutine: In the code compiled with compiler A, define an interface to the previous subroutine, also using the bind(C) attribute. This must be done inside a module. Because an interface body does not see the host's use-associated names, any kinds or types from iso_c_binding must be made available inside it with an import statement.

A2 Call the subroutine: In the code compiled with compiler A, call the subroutine defined in step B1. This will return the C address of the variable to the code compiled with compiler A.

A3 Convert the C address to a Fortran pointer: In the code compiled with compiler A, convert the C address to a Fortran pointer. This can be done using the c_f_pointer subroutine from the iso_c_binding module.

Appendix A.2. Example of C-Binding for memory location passing

Below is a simple example demonstrating the use of C-binding in Fortran to pass the memory location of a variable from one compiler to another.

Appendix A.2.1. Code compiled with compiler B

! module containing the variable
module my_vars
  implicit none
  integer, target :: a
end module my_vars

! subroutine returning the C address of the variable
subroutine get_a(value) bind(C, name="get_a")
  use iso_c_binding, only: c_ptr, c_loc
  use my_vars, only: a
  implicit none
  type(c_ptr), intent(out) :: value
  value = c_loc(a)
end subroutine get_a

Appendix A.2.2. Code compiled with compiler A

! interface to the subroutine compiled with compiler B
module interface_compilers
  use iso_c_binding, only: c_ptr
  implicit none
  interface
    subroutine get_a(value) bind(C, name="get_a")
      import :: c_ptr            ! make c_ptr visible inside the interface body
      type(c_ptr), intent(out) :: value
    end subroutine get_a
  end interface
end module interface_compilers

! call the subroutine and convert the C address to a Fortran pointer
program main
  use iso_c_binding, only: c_ptr, c_f_pointer
  use interface_compilers
  implicit none
  type(c_ptr) :: a_c_ptr
  integer, pointer :: a_ptr

  call get_a(a_c_ptr)
  call c_f_pointer(a_c_ptr, a_ptr)
  ! a_ptr can now be used as a normal Fortran variable
end program main

In this example, the Fortran program defines a module interface_compilers containing an interface to a C-interoperable subroutine, written in Fortran, named get_a. This methodology can be used not only to pass C addresses but, more generally, to call functions and subroutines across Fortran code compiled with different compilers.

Appendix A.3. Integration in Our Methodology

In our methodology, C-binding is employed to create interfaces between CUDA Fortran and Coarray Fortran components. This is crucial for transferring data and control between different parts of the application, which are compiled with different compilers. By using C-binding, we ensure a robust and efficient interaction between these components.
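For the control-transfer case, a sketch is shown below. It is illustrative only: the subroutine name, its arguments, and the kernel placeholder are our own and are not taken from CM4NG. It shows the same bind(C) pattern used to let the Coarray Fortran side, compiled with ifort, invoke a routine compiled with nvfortran that performs the GPU work.

! Coarray Fortran side (ifort): interface to a GPU routine built with nvfortran.
module gpu_interface
  use iso_c_binding, only: c_int, c_double
  implicit none
  interface
    subroutine gpu_step(n, phi) bind(C, name="gpu_step")
      import :: c_int, c_double
      integer(c_int), value :: n
      real(c_double), intent(inout) :: phi(*)
    end subroutine gpu_step
  end interface
end module gpu_interface

! CUDA Fortran side (nvfortran): copy data to the device, do the work, copy back.
subroutine gpu_step(n, phi) bind(C, name="gpu_step")
  use iso_c_binding, only: c_int, c_double
  use cudafor
  implicit none
  integer(c_int), value :: n
  real(c_double), intent(inout) :: phi(n)
  real(c_double), device, allocatable :: phi_d(:)

  allocate(phi_d(n))
  phi_d = phi              ! host-to-device transfer
  ! ... kernel launches on phi_d would go here ...
  phi = phi_d              ! device-to-host transfer
  deallocate(phi_d)
end subroutine gpu_step

Depending on the application, the data itself may instead be shared through the pointer mechanism of Appendix A.2 rather than passed as an argument; the sketch only illustrates how control crosses the compiler boundary.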

