Accelerating Fortran Codes: A Method for Integrating Coarray Fortran with CUDA Fortran and OpenMP
Abstract
Fortran’s prominence in scientific computing requires strategies to ensure both that legacy codes are efficient on high-performance
computing systems, and that the language remains attractive for the development of new high-performance codes. Coarray Fortran
(CAF), part of the Fortran 2008 standard introduced for parallel programming, facilitates distributed memory parallelism with a
syntax familiar to Fortran programmers, simplifying the transition from single-processor to multi-processor coding. This research
focuses on innovating and refining a parallel programming methodology that fuses the strengths of Intel Coarray Fortran, Nvidia
CUDA Fortran, and OpenMP for distributed memory parallelism, high-speed GPU acceleration and shared memory parallelism
respectively. We consider the management of pageable and pinned memory, CPU-GPU affinity in NUMA multiprocessors, and
robust compiler interfacing with speed optimisation. We demonstrate our method through its application to a parallelised Poisson
solver and compare the methodology, implementation, and scaling performance to that of the Message Passing Interface (MPI),
finding CAF offers similar speeds with easier implementation. For new codes, this approach offers a faster route to optimised
parallel computing. For legacy codes, it eases the transition to parallel computing, allowing their transformation into scalable,
high-performance computing applications without the need for extensive re-design or additional syntax.
Keywords: Coarray Fortran (CAF), CUDA Fortran, OpenMP, MPI
1.1. Coarray Fortran

Coarray Fortran (CAF), introduced in the Fortran 2008 standard, has gained recognition for its ability to simplify parallel programming. CAF offers an intuitive method for data partitioning and communication, enabling scientists to focus on problem-solving rather than the intricacies of parallel computing.

Coarray Fortran integrates parallel processing constructs directly into the Fortran language, eliminating the need for external libraries like MPI. It adopts the Partitioned Global Address Space (PGAS) model, which views memory as a global entity but partitions it among different processors [18]. This allows straightforward programming of distributed data structures, facilitating simpler, more readable codes.

In CAF, the global computational domain is decomposed into a series of images, which represent self-contained Fortran execution environments. Each image holds its local copy of data and can directly access data on other images via coarray variables. These variables are declared in a similar fashion to traditional Fortran variables but with the addition of codimensions. The codimension has a size equal to the number of images. Variables defined without codimensions are local to each image.

Coarray Fortran introduces synchronisation primitives, such as critical, lock, unlock, and sync statements, to ensure proper sequencing of events across different images. This provides programmers with the tools to manage and prevent race conditions and to coordinate communication between images.
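As a brief illustration of these constructs, the following minimal sketch (the array size and variable names are our own illustrative choices) declares a coarray via the codimension syntax, performs local work on each image, synchronises, and then reads a remote value through the codimension:

! minimal sketch: coarray declaration, synchronisation and remote access
program caf_demo
  implicit none
  real    :: a(100)[*]   ! coarray: every image holds its own copy of a
  integer :: me
  me = this_image()
  a = real(me)           ! purely local work on this image's copy
  sync all               ! ensure all images have finished writing
  if (me == 1) print *, 'value on the last image:', a(1)[num_images()]
end program caf_demo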
1.2. CUDA Fortran

CUDA Fortran is a programming model that extends Fortran to allow direct programming of Nvidia GPUs. It is essentially an amalgamation of CUDA and Fortran, providing a pathway to leverage the computational power of Nvidia GPUs while staying within the Fortran environment.

The CUDA programming model works on the basis that the host (CPU) and the device (GPU) have separate memory spaces. Consequently, data must be explicitly transferred between these two entities. This introduces new types of routines to manage device memory and execute kernels on the device. In CUDA Fortran, routines executed on the device are known as kernels. Kernels can be written in a syntax very similar to standard Fortran, but with the addition of qualifiers that specify whether a routine executes on the host or on the device.
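To illustrate (the module, kernel and launch configuration below are our own illustrative choices, not code from the solver discussed later), a kernel is marked with the attributes(global) qualifier and launched from the host with the chevron syntax:

! minimal CUDA Fortran sketch: a kernel and its launch
module kernels_m
  use cudafor
contains
  attributes(global) subroutine scale_kernel(x, s, n)
    implicit none
    real           :: x(*)   ! device array
    real,    value :: s      ! scalars are passed by value
    integer, value :: n
    integer :: i
    i = (blockIdx%x - 1)*blockDim%x + threadIdx%x
    if (i <= n) x(i) = s*x(i)
  end subroutine scale_kernel
end module kernels_m

program cuda_demo
  use cudafor
  use kernels_m
  implicit none
  integer, parameter :: n = 4096
  real         :: x(n)
  real, device :: x_d(n)
  x = 1.0
  x_d = x                                           ! host-to-device transfer
  call scale_kernel<<<(n+255)/256, 256>>>(x_d, 2.0, n)
  x = x_d                                           ! device-to-host transfer
  print *, x(1)
end program cuda_demo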
1.3. Heterogeneous Computing with CPUs and GPUs

When developing a code capable of running on a heterogeneous system — that being a system containing more than one type of processor, in this case a CPU-GPU system — it is important to understand the differences in the architecture of parallel execution between the hardware to best distribute tasks and optimise code performance.

Single Instruction, Multiple Data (SIMD) is a parallel computing model where one instruction is executed simultaneously across multiple data elements within the same processor [19]. This process is known as vectorisation, and it allows a single processor to perform the same operation on multiple data points at once.

In GPU architectures, SIMD is combined with multithreading and implemented in a way where warps (groups of threads) receive the same instruction. This means that while multiple threads perform identical operations in parallel with each other, each thread also performs vectorised (SIMD) operations on its assigned data. This broader GPU parallelism is known as Single Instruction, Multiple Threads (SIMT) [20].

While each GPU thread is capable of performing hundreds or thousands of simultaneous identical calculations through vectorisation, individual CPU cores also support SIMD but on a smaller scale, typically performing tens of operations simultaneously each. However, unlike in GPUs, each core within a CPU multiprocessor can receive its own set of instructions. This means that while individual cores perform SIMD operations, the overall CPU executes multiple instructions across its multiple cores, following the Multiple Instruction, Multiple Data (MIMD) model [19].

This means that besides the difference in scale of simultaneous operations between CPUs and GPUs, there are also architectural differences in how CPU MIMD and GPU SIMT handle parallel tasks. In MIMD architectures like those in CPUs, each core operates independently, allowing more flexibility when encountering conditional branching such as an if statement. This independence helps minimise performance losses during thread divergence because each core can process different instructions simultaneously without waiting for others.

Conversely, in GPU architectures using SIMT, all threads in a warp execute the same instruction. This synchronisation can lead to performance bottlenecks during thread divergence — for instance, when an if statement causes only some threads within a warp to be active. In such cases, GPUs typically handle divergence by executing all conditional paths and then selecting the relevant outcomes for each thread, a process that can be less efficient than the CPU's approach. This synchronisation requirement makes GPUs highly effective for large data sets where operations are uniform, but potentially less efficient for tasks with varied execution paths. Thus, while GPUs excel in handling massive, uniform operations due to their SIMD capabilities within each thread, CPUs offer advantages in scenarios where operations diverge significantly.
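The following illustrative kernel (its logic is our own example, chosen only to demonstrate the effect) contains such a divergent branch: within each warp, even and odd threads request different instructions, so the two paths are serialised:

! illustrative sketch: odd and even threads diverge within each warp
module divergence_m
  use cudafor
contains
  attributes(global) subroutine branchy(x, n)
    implicit none
    real           :: x(*)
    integer, value :: n
    integer :: i
    i = (blockIdx%x - 1)*blockDim%x + threadIdx%x
    if (i <= n) then
      if (mod(i, 2) == 0) then   ! even threads take this path...
        x(i) = x(i) + 1.0
      else                       ! ...while odd threads wait, then take this one
        x(i) = x(i) - 1.0
      end if
    end if
  end subroutine branchy
end module divergence_m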
It is, therefore, important to understand which problems are best suited to which type of processor, and to design the code to distribute different tasks to the appropriate hardware.

The layout of this work is as follows. In Sect. 2 we outline our methodology and consider a number of options which are available to solve this problem, depending on the use case. We also present a detailed guide on how compiler linking can be achieved. In Sect. 3 we present the results of our tests on different hardware. In Sect. 4 we summarise our main conclusions.

2. Methodology

The methodology we propose hinges on the robust combination of CUDA Fortran and Coarray Fortran, leveraging their unique strengths to develop efficient high-performance computing applications. The primary challenge is the complex interfacing required to integrate the GPU-accelerated capabilities of Nvidia CUDA Fortran with the distributed memory parallelism of Intel Coarray Fortran. CUDA Fortran is chosen as it allows the user high levels of control over GPU operations - some of which are highlighted below - while remaining close to the traditional Fortran syntax. Likewise, Coarray Fortran, particularly when implemented by Intel, allows for high distributed memory performance with no augmentations to the standard Fortran syntax. Below, we detail the steps involved in our approach.
2.1. Selection of Compilers

CUDA Fortran, being proprietary to Nvidia, requires the use of Nvidia's nvfortran compiler. However, nvfortran does not support Coarray Fortran. In contrast, Intel's ifort compiler supports Coarray Fortran - with performance levels that rival MPI and without the complexity of its syntax - but does not support CUDA Fortran. One other alternative compiler supporting Coarray Fortran is OpenCoarrays. In our experience, its implementation falls short in terms of speed. However, we note that experiences with OpenCoarrays may vary depending on a variety of factors such as hardware configurations and compiler versions.

For most Fortran programmes in general, Intel's ifort compiler is a common choice and offers a high-performance, robust and portable compiler. Consequently, here we demonstrate a hybrid solution: ifort for Coarray Fortran and nvfortran for CUDA Fortran.

2.2. Memory Space Configuration

When creating a single code which requires the use of two compilers, a few key considerations are required. For the following text, 'Intel code' refers to code compiled with ifort and 'Nvidia code' refers to code compiled with nvfortran. The various execution streams mentioned below refer to an executed command being made within either the Intel code or the Nvidia code.

2.2.1. Pageable and Pinned Memory

Before continuing, a short background on pageable and pinned memory is required. Pageable memory is the default memory type in most systems. It is so-called because the operating system can 'page' it out to the disk, freeing up physical memory for other uses. This paging process involves writing the contents of the memory to a slower form of physical memory, which can then be read back into high-speed physical memory when needed.

The main advantage of pageable memory is that it allows for efficient use of limited physical memory. By paging out memory that is not currently needed, the operating system can free up high-speed physical memory for other uses. This can be particularly useful in systems with limited high-speed physical memory.

However, the paging process can be slow, particularly when data is transferred between the host and a device such as a GPU. When data is transferred from pageable host memory to device memory, the CUDA driver must first allocate a temporary pinned memory buffer, copy the host memory to this buffer, and then transfer the data from the buffer to the device. This double buffering incurs overhead and can significantly slow down memory transfers, especially for large datasets.

Pinned memory, also known as page-locked memory, is a type of memory that cannot be paged out to the disk. This means that the data is constantly resident in the high-speed physical memory of the system, which can result in considerably faster memory transfers between the host and device.

The main advantage of pinned memory is its speed. Because it can be accessed directly by the device, it eliminates the need for the double-buffering process required by pageable memory on GPUs. This can result in significantly faster memory transfers, particularly for large datasets.

However, pinned memory has its drawbacks. The allocation of pinned memory is more time-consuming than that of pageable memory - something of concern if not all memory allocation is done at the beginning - and it consumes physical memory that cannot be used for other purposes. This can be a disadvantage in systems with limited physical memory. Additionally, excessive use of pinned memory can cause the system to run out of physical memory, leading to a significant slowdown as the system starts to swap other data to disk. Pinned memory is not part of the Fortran standard and is, therefore, not supported by the Intel compiler. This leads to additional intricacies during the combination of compilers, as we discuss below.
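In CUDA Fortran, pinned host memory is requested by adding the pinned attribute to an allocatable host array. The following minimal sketch (array names and sizes are our own illustrative choices) shows the pattern:

! minimal sketch: a pinned host array and transfers to and from the device
program pinned_demo
  use cudafor
  implicit none
  integer, parameter :: n = 1024*1024
  real, allocatable, pinned :: a(:)     ! page-locked host array
  real, allocatable, device :: a_d(:)   ! device array
  allocate(a(n), a_d(n))
  a   = 1.0
  a_d = a      ! host-to-device copy; faster because a is pinned
  a   = a_d    ! device-to-host copy
  deallocate(a, a_d)
end program pinned_demo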
2.2.2. Host memory: Coarray Fortran

Returning to our problem, when using Intel code and Nvidia code together, the division of parameters and variables between the two is one of the key areas to which attention should be paid. The execution stream within the Nvidia code can only access and operate on variables and parameters which have been declared in the Nvidia code. Likewise, the Intel execution stream can only access and operate on variables and parameters which have been declared in the Intel code. For this reason, we consider a virtual partition within the host physical memory, which can be crossed through the use of subroutines and interfaces, which we discuss in more detail below. This clear division of the physical memory does not happen in reality, as Intel and Nvidia declared variables and parameters will be mixed together when placed in the physical memory, but it is a helpful tool for considering the possible configurations we detail below.

The side of the partition in the host memory on which a variable is placed determines where, and at what speed, it can be transferred. Figure 1 shows a selection of these options.

Figure 1: The relative transfer speeds of moving variables between various parts of the host memory and the device memory, where green represents a fast to negligible speed, and yellow represents a slower speed. As CUDA Fortran is used for GPU operations, only Nvidia code can be used to define device variables. MPI is added for comparison.

In the leftmost case, a variable has two identical copies in the host memory, one defined by the Intel compiler to allow CAF parallelisation (simply using the [*] attribute) and one by the Nvidia compiler to allow it to be pinned. Proceeding to the right, in the next case, the variable is defined by the Intel compiler for CAF communication, but a pointer to this variable is defined by the Nvidia compiler, meaning it cannot be pinned and so suffers from the slower pageable memory transfers to the GPU. In the next cases, we consider the options available with MPI. The variable is defined by the Nvidia compiler, which allows it to be pinned, but a pointer to this variable is defined by the Intel compiler. This means that CAF transfers are not possible, as they are not supported for pointers, so MPI is required for distributed memory parallelism. In the final case, the Intel code is not required at all, as the Nvidia compiler supports these MPI transfers natively. To understand any overhead associated with the pointing procedure which allows the combination of CAF and CUDA, we implemented the two MPI solutions in our potential solver and found no appreciable difference in performance. Any small speedup or slowdown between the two options — one using pointers to combine both compilers and one using only the Nvidia compiler — is likely due to the different MPI versions and implementations used in the compilers, rather than to the pointing procedure.

The partition-crossing use of such a variable is much slower when copying the array than when using a pointer, which has practically no speed overhead at all. In the case that the variable is copied across the partition, the version which is defined in the Nvidia code can be defined with attributes allowed by the Nvidia compiler, one of these being that it is pinned. This would not be possible in the Intel version of the variable because, as previously discussed, pinned memory is not part of the Fortran standard and so is not supported by the Intel compiler.

It is important to note that, as mentioned above, the Nvidia compiler natively supports MPI, and can therefore run Fortran codes parallelised across shared memory, distributed memory and GPUs. While Coarray Fortran offers a simpler syntax, it requires a slightly more complex compilation process and setup when using GPU parallelisation. We, therefore, lay out this process as clearly as possible in this article, to make the simplified speed-up of Coarray Fortran as accessible as possible.

The pointer solution, while allowing faster cross-partition access, does not allow for the pinned attribute, as the Nvidia code simply points to the array defined by the Intel compiler, which resides in pageable memory. This means a pageable transfer rather than a pinned transfer takes place when moving the values onto the device, which is slower.

It is important to consider which solution is more optimal for the problem being addressed. On our hardware, testing with a 128³ array, pinned memory transfers took around 1 ms, as opposed to 3 ms taken by pageable memory transfers. However, during cross-partition operations, using a pointer resulted in approximately ∼0 ms of delay, as opposed to 3 ms when using a copy operation. Given the choice between 1) 1 ms GPU transfers but 3 ms cross-partition transfers, or 2) 3 ms GPU transfers but ∼0 ms cross-partition transfers, it is important to consider the specific use case.

In our case, and therefore also in most cases, it is more optimal to use a pointer in the Nvidia code. This does not mean pinned memory transfers to the device cannot be used at all. They simply cannot be used for any coarrayed variables, which reside in the Intel code partition. During implementation, it is also clearly easier to implement the pointer configuration, which we describe now.
First, we show the implementation of two copies of the same array stored in the physical memory of the host, one on either side of our partition. We note that this is inefficient and makes poor use of the available memory, given that parameters and variables requiring transfer are stored twice, although this is not typically a problem for modern high-performance computing systems. Furthermore, every time a variable is updated by the execution stream of one side of the code, a transfer is required across the host physical memory, causing some considerable slowdown — particularly in the case of large arrays which are frequently updated. Technically, a transfer could be made only when it is known that the data will be changed on the other side of the partition before being used again, but this introduces a large and difficult-to-detect source of coding errors, especially in complex codes, and so is not advisable. This transfer could also be performed asynchronously to mask the transfer time, but the inefficient use of available physical memory is intrinsic to this solution. An illustration of how this setup works can be seen in Figure 2, the stages of which are as follows:

1. The Nvidia execution stream calls a subroutine, which has been defined and compiled within the Intel code, to get the value of the array (in this case a coarray), which is to be operated on. To do this, an interface is defined within the Nvidia code with the bind(C) attribute, meaning that while the source code is Fortran, the compiled code is C, ensuring a robust connection between the two compilers. The destination subroutine is also written with the bind(C) attribute for the same reason. An example subroutine showing how to use C-binding is provided in the appendix.

2–3. The subroutine returns the value of the array and sets the value of the counterpart array in the Nvidia code to the same value.

4. The array is now updated on the Nvidia partition and can be operated on by the Nvidia execution stream.

5. A pinned memory transfer can now take place, moving the array to the device memory.

6. The array is operated on by the device.

7. A pinned memory transfer allows the array to be moved back to the Nvidia partition of the host memory.

8–11. An inverse of the first four steps takes place, allowing the updated array to be used by the Intel execution stream, and accessible for coarray transfer to the rest of the distributed memory programme.

Figure 2: Configuration of physical memory and the procedure required for retrieval across the partition, allowing pinned memory for fast device transfers but slow partition crossing by copying memory. In this case, a variable has been changed by the Intel execution stream before this procedure starts, and so the Nvidia execution stream is required to retrieve an updated copy of this array before operating on it, transferring it to the device and performing further GPU-accelerated operations. The Intel execution stream then takes over, retrieving the value(s) of the updated variable, performing its operations, and then allowing for transfer to other host memory spaces in the distributed memory. This host-host transfer must occur every time a device operation is performed with a coarrayed variable.

What was found to be more efficient in our use case, and in general something which is true for the majority of cases in terms of execution speed, memory usage and coding simplicity, was to declare a parameter or variable once in either the Intel or Nvidia code, and then create a pointer to this parameter or variable in the other partition of the code. This means that whenever, for example, the Intel execution stream requires a large array from the Nvidia code, no slowdown is caused during the transfer of the array to the other partition and no unnecessary overhead in the physical memory is present.

This pointing can be accomplished using an intermediate series of subroutines, calls, and pointers, all of which are written in Fortran but declared with the C-binding to ensure a robust and common connection between the two compilers, as mentioned above. In the prior duplication case, these subroutines and calls are required before the use of any parameters or variables which have been changed by the alternate execution stream since their last use, to ensure an up-to-date version is used. However, with this new pointer case, the subroutines and calls are required only once, at the beginning of the run, to set up and initialise the pointing, making it harder to mistakenly forget to update an array before using it in the source code. This configuration can be seen in Figure 3. The steps of this configuration, to set up a coarrayed variable capable of distributed memory transfer and a counterpart pointer which allows transfer to the device for GPU-accelerated operations, are as follows:

1. The Nvidia execution stream again calls a subroutine, defined and compiled within the Intel code, to get the address in memory of the array (in this case a coarray), which is to be operated on. This must be declared with the target attribute in the Intel code, so it can be pointed to. As before, an interface is defined within the Nvidia code with the bind(C) attribute, to allow this connection. The destination subroutine is also written with the bind(C) attribute, and the pointer is made a C pointer for the same reason.

2. The subroutine returns the address of the array as a C-pointer.

3. The C-pointer is converted to a Fortran pointer for use by the Nvidia code (a sketch of this conversion is given after this list). The original coarray, as defined by the Intel code and residing in the Intel partition of the host memory, can now be operated on and transferred to the device for GPU-accelerated operations. We note that in this case the data are transferred from the pageable host memory to the device, thus incurring a certain overhead, as described above.
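To make step 3 concrete, the sketch below shows the address retrieval and conversion on the Nvidia side. The routine get_u_address, the array u and its dimensions are illustrative assumptions rather than names from our code; the matching Intel-side bind(C) subroutine follows the pattern given in the appendix.

! Nvidia-side sketch (hypothetical names): attach a Fortran pointer to the
! coarray owned by the Intel code, then transfer the data to the device.
subroutine attach_to_intel_array(nx, ny, nz)
  use iso_c_binding, only: c_ptr, c_f_pointer
  use cudafor
  implicit none
  integer, intent(in) :: nx, ny, nz
  interface
     subroutine get_u_address(p) bind(C, name="get_u_address")  ! compiled by ifort
       import :: c_ptr
       type(c_ptr), intent(out) :: p
     end subroutine get_u_address
  end interface
  type(c_ptr) :: addr
  real, pointer :: u(:,:,:)
  real, device, allocatable :: u_dev(:,:,:)
  call get_u_address(addr)                  ! steps 1-2: C address of the coarray
  call c_f_pointer(addr, u, [nx, ny, nz])   ! step 3: convert to a Fortran pointer
  allocate(u_dev(nx, ny, nz))
  u_dev = u                                 ! pageable host-to-device transfer
end subroutine attach_to_intel_array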
Figure 3: Configuration of physical memory and the procedure required for retrieval across the partition using pointers, allowing for no transfer overhead in host-host transfers, but requiring the use of pageable memory for device transfers and incurring an overhead. In this case, the linking is performed only once during the first stages of the running of the code, meaning both host execution streams are always accessing the same version of host variables.

This handling of sensitive cross-compiler operations with C-bindings was found to be essential, as relying on Fortran-Fortran interfacing between the compilers led to non-descript segmentation errors. The use of C-binding ensures a robust solution to this issue, with no overhead in the execution speed and no deviation from the Fortran syntax in the source code.

2.2.3. Host memory: MPI

This technique is almost identically applicable when using MPI. However, it should be noted that in that case, an additional benefit is present in that arrays which are communicated using the MPI protocol do not require a coarray attribute in their definition. This means that, opposite to the case in Figure 3, the array can be defined in the Nvidia code, and the pointer in the Intel code.
In this case, the full speed-up of data transfer for the pinned memory can be utilised. An illustration of this can be seen in Figure 4.

Figure 4: Configuration of physical memory and the procedure required for retrieval across the partition using pointers, allowing for no transfer overhead in host-host transfers, and allowing for the use of pinned memory for device transfers, incurring the minimal overhead. As in the previous case, the linking is performed only once during the first stages of the running of the code, meaning both host execution streams are always accessing the same version of host variables.
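As a sketch of this arrangement (the module, array and routine names are our own illustrative assumptions), the Nvidia code declares the array with the pinned and target attributes and exports its address for the Intel side to point to:

! Nvidia-side sketch (hypothetical names) for the MPI configuration
module mpi_vars
  implicit none
  real, allocatable, pinned, target :: b(:)   ! pinned for fast device transfers
end module mpi_vars

! called by the Intel code (after b has been allocated) to obtain its address;
! the Intel side converts the result with c_f_pointer for its own operations
subroutine get_b_address(p) bind(C, name="get_b_address")
  use iso_c_binding, only: c_ptr, c_loc
  use mpi_vars, only: b
  implicit none
  type(c_ptr), intent(out) :: p
  p = c_loc(b)
end subroutine get_b_address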
2.3. CPU Affinity with GPUs

In any high-performance computing application using either PGAS, or more widely the single program multiple data (SPMD) technique, the ability to control the allocation of programme processes to specified processing units—commonly referred to as CPU affinity or process pinning—is pivotal for performance optimisation. This is especially relevant in systems with a Non-Uniform Memory Access (NUMA) architecture (common in HPC), and/or in systems with multiple CPUs and GPUs.

In a NUMA architecture, the term 'NUMA node' refers to a subsystem that groups sets of processors within one multiprocessor together and connects them with a memory pool, forming a closely-knit computational unit. Unlike a Uniform Memory Access (UMA) configuration, where each processor within the multiprocessor has equal latency for memory access across the system, NUMA nodes have differing latencies depending on whether the accessed memory resides within the same NUMA node or in a different one. Taking the example of one of our systems, described in detail later, this can be seen in Figures 5 and 6. When using a hybrid programming model, such as a PGAS language like Coarray Fortran in combination with OpenMP in our case, it becomes especially important to keep these latencies in mind.

Therefore, a well-configured system strives to keep data and tasks localised within the same NUMA node whenever possible, mitigating the impact of varying memory access latencies across NUMA nodes. This concept is critical for understanding the intricacies of CPU affinity and process pinning in multi-socket, multi-GPU systems.

The Linux command numactl --hardware can be used to determine relative distances between NUMA nodes, the memory available to each group, and the ID of each processor within each NUMA node. If more exact latency numbers are required between NUMA nodes, tools like the Intel Memory Latency Checker can be used. If hyperthreading is enabled on a system, then additional processor IDs will be shown in addition to those referring to physical processors. In our case, our first NUMA node, NUMA node 0, encompasses the processors 0–15 and 128–143, with the former referring to the physical processors and the latter referring to their hyperthreaded counterparts. We do not concern ourselves with hyperthreading in this paper, as we observed no speed improvements when using it.

A second important aspect of the NUMA architecture when considering GPUs is the connection of GPUs to the processors, found using the Linux command nvidia-smi topo --matrix. In our architecture, a GPU is connected to one NUMA node, with one device connected per socket. As seen in Figure 6, each NUMA node within a socket has a slight speed penalty associated with communication with NUMA nodes in the same socket (20% increase) and a high overhead associated with communicating with the other socket (220% increase). When a coarray image on one socket, therefore, attempts to communicate with the GPU of the other, there is a large overhead in the communication. Where our GPUs are connected is shown in the figure.

For practical application within Intel Coarray Fortran, the Intel MPI Library offers environment variables to facilitate CPU affinity settings:

export I_MPI_PIN=on
export I_MPI_PIN_DOMAIN=[HEX_CONFIG]

where HEX_CONFIG is a hexadecimal (hex) code corresponding to the IDs of the compute cores within the multiprocessors which are to be connected to the GPUs. This is described in more detail below.

To effectively set CPU affinity through Intel MPI's environment variables, one must provide the correct hexadecimal (hex) codes corresponding to the CPU or core IDs. These hex codes serve as unique identifiers for the CPUs and are critical for pinning computational tasks accurately. In the case of our code, each coarray image is designed to require one GPU. It is, therefore, important to ensure that each coarray image is running on the socket containing the NUMA node that is connected to the correct GPU, namely, to the GPU the coarray image is using. Other situations where multiple coarray images use the same GPU are also possible, in which case these images should be pinned within the same socket the GPU is connected to. However, if more images are added to a socket, they each naturally must reside on fewer cores, meaning that any CPU parallelisation is limited. Determining the balance between CPU and GPU parallelisation is something that will vary between use cases.

As aforementioned, in our example, we are interested in CPU affinity because we want to ensure that each coarray image runs on the socket containing the NUMA node that is connected to the GPU it is using.

Once the desired destination CPU IDs have been identified, they can be converted to hexadecimal format in the following way. Given that two coarray images are running on one node (in this case, one image on each socket), the CPU code of the first image should be pinned to NUMA node 3 and the CPU code of the second image to NUMA node 7, with the corresponding GPU code pinned to GPUs 0 and 1 respectively. The GPU number used by the coarray image can simply be set using the Nvidia CUDA Fortran line

istat = cudaSetDevice(mod(irank-1, gpusPerNode))

where irank is the Coarray Fortran image number, starting from 1, and gpusPerNode is the number of GPUs attached to each computational node, this being 2 in the case shown in Figures 5 and 6. This will effectively alternate between 0 and 1 as the image number (irank) increases from 1 to the last image.

Taking our test case, it is known that NUMA node 3 corresponds to cores 48–63. This first needs to be encoded in binary, considering all 128 cores we have available:

00000000000000000000000000000000
00000000000000001111111111111111
00000000000000000000000000000000
00000000000000000000000000000000

Converted to hexadecimal, the masks for the two images are then supplied to the pinning environment variables:

export I_MPI_PIN=on
export I_MPI_PIN_DOMAIN=[FFFF0000000000000000,FFFF]
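If desired, the hex code for a contiguous block of core IDs can be generated programmatically rather than by hand. The following sketch is purely illustrative and assumes the core IDs in question fit within one 64-bit integer; masks covering more than 64 cores must be assembled in wider chunks:

! illustrative sketch: build and print the hex pinning mask for a core range
program pin_mask
  use iso_fortran_env, only: int64
  implicit none
  integer, parameter :: first_core = 0, last_core = 15   ! illustrative range
  integer(int64) :: mask
  integer :: c
  mask = 0_int64
  do c = first_core, last_core
     mask = ibset(mask, c)         ! set the bit corresponding to core c
  end do
  write(*, '(A, Z0)') 'hex mask: ', mask   ! prints FFFF for cores 0-15
end program pin_mask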
Figure 8: a) Illustration of a two-dimensional slice of the nested grid structure employed in performance testing, showing four nested mesh levels, with each colour
corresponding to a new grid level. b) Schematic representation of the grid distribution across four GPUs on VSC5, where each socket has one NUMA node attached
to one GPU. c) Schematic representation of the grid distribution across four GPUs on Narval, where each socket has two GPUs attached, each to a different NUMA
node. More nested grids require more nodes.
Figure 10: Two configurations for a socket-GPU pair on nodes with two sockets containing NUMA-architecture multiprocessors and two GPUs, where GPUs 0 and 1 are connected to NUMA nodes 3 (socket 0) and 7 (socket 1) respectively. Two processes (light blue and dark blue) are running on the node, one on each socket. a) Shows the optimal configuration of the CPU and GPU processes, where the CPU image runs its GPU tasks on the device directly connected; in this case, process 0 on NUMA node 3 (socket 0) runs on GPU 0, and process 1 on NUMA node 7 (socket 1) runs on GPU 1. b) Shows the worst-case configuration, where CPUs perform their GPU calculations on the device connected to the opposite socket; in this case, process 0 on NUMA node 0 (socket 0) runs on GPU 1, and process 1 on NUMA node 4 (socket 1) runs on GPU 0.

Figure 11: Scaling performance of CM4NG with full parallelisation via OpenMP (OMP), CUDA and Coarray Fortran (CAF), using 2 GPUs per node on VSC5. CPU-GPU affinity is shown for the optimal and pessimal configurations.
…significant rewrite of the code which, in terms of the required effort, may be comparable to writing a new code from scratch. With Coarray Fortran, the distributed memory parallelisation of legacy Fortran codes becomes feasible with relatively little effort but significant speed-up. This is because CAF requires introducing only several additional lines and variables to the code, keeping the existing code structure and syntax intact. The fact that Coarray Fortran can be integrated with CUDA Fortran makes the coarray language standard particularly attractive for scientists.
4. Conclusions
4 https://vsc.ac.at/
5 https://www.calculquebec.ca/
6 https://alliancecan.ca/
Appendix A. C-Binding in Fortran

C-binding in Fortran is a powerful feature that allows interoperability with C, enabling the use of C data types and the calling of C functions and libraries. This feature is particularly useful in high-performance computing, where leveraging both Fortran's computational efficiency and C's extensive library ecosystem can be advantageous.

In the context of our study, C-binding plays a critical role in enabling robust interaction between the nvfortran and ifort compilers, allowing them to communicate according to the C standard. When implementing pure Fortran interfaces between the compilers, we met numerous segmentation errors, which were avoided when communication was facilitated by binding the connection with C. This section provides an overview of how C-binding is used in our methodology and a simple example for illustration.
Appendix A.1. Using C-binding in Fortran to combine Fortran compilers and allow a robust common memory space

In the case below, code compiled with compiler A will receive the location in memory of a variable compiled by compiler B. The code compiled by compiler A will then be able to access the variable directly. To use C-binding in Fortran, specific steps must be followed, as shown in the two listings below.

Appendix A.2.1. Code compiled with compiler B

! module containing the variable
module my_vars
  implicit none
  integer, target :: a
end module my_vars

! subroutine returning the C address of the variable
subroutine get_a(value) bind(C, name="get_a")
  use iso_c_binding, only: c_ptr, c_loc
  use my_vars, only: a
  implicit none
  type(c_ptr), intent(out) :: value
  value = c_loc(a)
end subroutine get_a

Appendix A.2.2. Code compiled with compiler A

! interface to the subroutine
module interface_compilers
  use iso_c_binding, only: c_ptr
  implicit none
  interface
    subroutine get_a(value) bind(C, name="get_a")
      import :: c_ptr   ! make c_ptr visible inside the interface body
      type(c_ptr), intent(out) :: value
    end subroutine get_a
  end interface
end module interface_compilers
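For completeness, an illustrative sketch of how the code compiled with compiler A could then use this interface (the program and pointer names are our own assumptions):

program use_shared_a
  use iso_c_binding, only: c_ptr, c_f_pointer
  use interface_compilers
  implicit none
  type(c_ptr)      :: addr
  integer, pointer :: a_ptr
  call get_a(addr)               ! retrieve the address of 'a' from compiler-B code
  call c_f_pointer(addr, a_ptr)  ! convert the C address to a Fortran pointer
  a_ptr = 42                     ! both codes now observe the same variable
end program use_shared_a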