PGI Compilers for Heterogeneous Supercomputing, March 2018
PGI COMPILERS & TOOLS UPDATE
2
PGI - THE NVIDIA HPC SDK
Fortran, C & C++ Compilers
Optimizing, SIMD Vectorizing, OpenMP
Accelerated Computing Features
OpenACC Directives
CUDA Fortran
Multi-Platform Solution
Multicore x86-64 and OpenPOWER CPUs,
NVIDIA Tesla GPUs
Supported on Linux, macOS, Windows
MPI/OpenMP/OpenACC Tools
Debugger
Performance Profiler
Interoperable with DDT, TotalView
3
OPENACC FOR EVERYONE
PGI Community Edition Now Available
PROGRAMMING MODELS: OpenACC, CUDA Fortran, OpenMP, C/C++/Fortran Compilers and Tools (all editions)
PLATFORMS: X86, OpenPOWER, NVIDIA GPU (all editions)
UPDATES: Community Edition 1-2 times a year; Professional Edition 6-9 times a year; Enterprise Edition 6-9 times a year
SUPPORT: Community Edition User Forums; Professional Edition PGI Support; Enterprise Edition PGI Premier Services
LICENSE: Community Edition Annual (FREE); Professional Edition Perpetual; Enterprise Edition Volume/Site
4
Latest CPU Support
Intel Skylake
AMD Zen
IBM POWER9
Full OpenACC 2.6
OpenMP 4.5 for multicore CPUs
AVX-512 code generation
Integrated CUDA 9.1 toolkit/libraries
New fastmath intrinsics library
Partial C++17 support
Optional LLVM-based x86 code generator
pgicompilers.com/whats-new
5
SPEC ACCEL 1.2 BENCHMARKS
[Bar charts: OpenMP 4.5 geomean seconds, Intel 2018 vs. PGI 18.1, on 2-socket Skylake (40 cores / 80 threads), 2-socket EPYC (48 cores / 48 threads), and 2-socket Broadwell (40 cores / 80 threads); and OpenACC geomean seconds, PGI 18.1, on a 2-socket Broadwell vs. one Volta V100, a 4.4x speed-up.]
Performance measured February 2018. Skylake: two 20-core Intel Xeon Gold 6148 CPUs @ 2.4GHz w/ 376GB memory, hyperthreading enabled. EPYC: two 24-core AMD EPYC 7451 CPUs @ 2.3GHz w/ 256GB memory. Broadwell: two 20-core Intel Xeon E5-2698 v4 CPUs @ 3.6GHz w/ 256GB memory, hyperthreading enabled. Volta: NVIDIA DGX-1 system with two 20-core Intel Xeon E5-2698 v4 CPUs @ 2.20GHz, 256GB memory, one NVIDIA Tesla V100-SXM2-16GB GPU @ 1.53GHz. SPEC® is a registered trademark of the Standard Performance Evaluation Corporation (www.spec.org).
6
SPEC CPU 2017 FP SPEED BENCHMARKS
[Bar chart: geomean seconds, Intel 2018 vs. PGI 18.1, on 2-socket Skylake (40 cores / 80 threads), 2-socket EPYC (48 cores / 48 threads), and 2-socket Broadwell (40 cores / 80 threads).]
Performance measured February 2018. Skylake: two 20-core Intel Xeon Gold 6148 CPUs @ 2.4GHz w/ 376GB memory, hyperthreading enabled. EPYC: two 24-core AMD EPYC 7451 CPUs @ 2.3GHz w/ 256GB memory. Broadwell: two 20-core Intel Xeon E5-2698 v4 CPUs @ 3.6GHz w/ 256GB memory, hyperthreading enabled. SPEC® is a registered trademark of the Standard Performance Evaluation Corporation (www.spec.org).
7
OPENACC UPDATE
8
OPENACC DIRECTIVES
Manage Data Movement
Initiate Parallel Execution
Optimize Loop Mappings
#pragma acc data copyin(a,b) copyout(c)
{
...
#pragma acc parallel
{
#pragma acc loop gang vector
for (i = 0; i < n; ++i) {
c[i] = a[i] + b[i];
...
}
}
...
}
• CPU, GPU, Manycore
• Performance portable
• Interoperable
• Single source base
• Incremental
9
CPU
OPENACC IS FOR MULTICORE CPUS & GPUS
% pgfortran -ta=multicore -fast -Minfo=acc -c update_tile_halo_kernel.f90
. . .
100, Loop is parallelizable
Generating Multicore code
100, !$acc loop gang
102, Loop is parallelizable
GPU
% pgfortran -ta=tesla -fast -Minfo=acc -c update_tile_halo_kernel.f90
. . .
100, Loop is parallelizable
102, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
100, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
102, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
98 !$ACC KERNELS
99 !$ACC LOOP INDEPENDENT
100 DO k=y_min-depth,y_max+depth
101 !$ACC LOOP INDEPENDENT
102 DO j=1,depth
103 density0(x_min-j,k)=left_density0(left_xmax+1-j,k)
104 ENDDO
105 ENDDO
106 !$ACC END KERNELS
10
CLOVERLEAF
AWE Hydrodynamics mini-app, bm32 data set
http://uk-mac.github.io/CloverLeaf
[Bar chart: speedup vs. a single Haswell core. PGI 18.1 OpenACC and Intel 2018 OpenMP bars for multicore Haswell, Broadwell, and Skylake (annotated 7.6x, 7.9x, 10x, 10x, 11x, 14.8x, 15x), and PGI 18.1 OpenACC bars for Kepler, Pascal, and Volta V100 GPUs (annotated 40x, 67x, 109x, 142x); reference marks at 1x, 2x, 4x.]
Systems: Haswell: 2x16-core Haswell server, four K80s, CentOS 7.2 (perf-hsw10); Broadwell: 2x20-core Broadwell server, eight P100s (dgx1-prd-01); Broadwell server, eight V100s (dgx07); Skylake: 2x20-core Xeon Gold server (sky-4).
Compilers: Intel 2018.0.128, PGI 18.1
Benchmark: CloverLeaf v1.3 downloaded from http://uk-mac.github.io/CloverLeaf the week of November 7 2016; CloverLeaf_Serial; CloverLeaf_ref (MPI+OpenMP); CloverLeaf_OpenACC (MPI+OpenACC)
Data compiled by PGI February 2018.
11
OPENACC UPTAKE IN HPC
Applications
• 3 of Top 5 HPC Apps: ANSYS Fluent & Gaussian released; VASP in development
• 5 ORNL CAAR Codes: GTC, XGC, ACME, FLASH, LSDalton
• 109 Apps total being tracked
Hackathons
• 5 events in 2017; all hackathons are initiated by users
• 6–9 events in 2018
• 94 codes GPU-accelerated to date
• Expertise: 94 mentors registered
Community
• User Group: SC17 participation up 27% vs SC16
• Slack Channel: 2x growth in last 6 months
• Downloads: PGI Community Edition quarterly downloads up 136% in 2017
Training
• OpenACC and DLI: 10 new modules; instructor certification
• 18+ workshops in 2018: ECMWF, KAUST, PSC, CESGA
• Online Courses: 5K+ attended over last 3 years
• Online Labs: 4.3K+ taken over last 3 years
12
GAUSSIAN 16
Using OpenACC allowed us to continue
development of our fundamental
algorithms and software capabilities
simultaneously with the GPU-related
work. In the end, we could use the
same code base for SMP, cluster/
network and GPU parallelism. PGI's
compilers were essential to the success
of our efforts.
Mike Frisch, Ph.D.
President and CEO
Gaussian, Inc.
Project Contributors
Roberto Gomperts, NVIDIA
Michael Frisch, Gaussian
Brent Leback, NVIDIA/PGI
Gio
%GPUCPU=0-7=0-7: Use GPUs 0-7 with CPUs 0-7 as their controllers.
Detailed information is available on our website.
13
ANSYS FLUENT
We’ve effectively used
OpenACC for heterogeneous
computing in ANSYS Fluent
with impressive performance.
We’re now applying this work
to more of our models and
new platforms.
Sunil Sathe
Lead Software Developer
ANSYS Fluent
14
VASP
For VASP, OpenACC is the way
forward for GPU acceleration.
Performance is similar and in some
cases better than CUDA C, and
OpenACC dramatically decreases
GPU development and maintenance
efforts. We’re excited to collaborate
with NVIDIA and PGI as an early
adopter of CUDA Unified Memory.
Prof. Georg Kresse
Computational Materials Physics
University of Vienna
15
MPAS-A
Our team has been evaluating
OpenACC as a pathway to
performance portability for the Model
for Prediction Across Scales (MPAS) atmospheric
model. Using this approach on the
MPAS dynamical core, we have
achieved performance on a single
P100 GPU equivalent to 2.7 dual
socketed Intel Xeon nodes on our new
Cheyenne supercomputer.
Richard Loft
Director, Technology Development
NCAR
Image courtesy: NCAR
16
David Gutzwiller
Lead Software Developer
NUMECA
NUMECA FINE/Open
Porting our unstructured C++ CFD
solver FINE/Open to GPUs using
OpenACC would have been
impossible two or three years ago,
but OpenACC has developed
enough that we’re now getting
some really good results.
17
OpenACC made it practical to
develop for GPU-based hardware
while retaining a single source for
almost all the COSMO physics
code.
Dr. Oliver Fuhrer
Senior Scientist
MeteoSwiss
COSMO
18
GAMERA FOR GPU
With OpenACC and a compute
node based on NVIDIA's Tesla
P100 GPU, we achieved more
than a 14X speed up over a K
Computer node running our
earthquake disaster simulation
code.
Takuma Yamaguchi, Kohei Fujita, Tsuyoshi Ichimura, Muneo
Hori, Lalith Wijerathne
The University of Tokyo
Map courtesy University of Tokyo
19
QUANTUM ESPRESSO
CUDA Fortran gives us the full
performance potential of the
CUDA programming model and
NVIDIA GPUs. !$CUF KERNELS
directives give us productivity and
source code maintainability. It’s
the best of both worlds.
Filippo Spiga
Head of Research Software Engineering
University of Cambridge
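The CUF kernel directives mentioned in the quote generate GPU kernels directly from loops over device data. A minimal, illustrative CUDA Fortran sketch follows (not QUANTUM ESPRESSO code; the program name, array names, and sizes are arbitrary):

program cuf_kernels_demo
  ! Minimal sketch: compile with pgfortran -Mcuda cuf_kernels_demo.f90
  use cudafor
  implicit none
  integer, parameter :: n = 1000000
  integer :: i
  real, device, allocatable :: a_d(:), b_d(:)   ! arrays resident in GPU memory
  real, allocatable :: b(:)

  allocate(a_d(n), b_d(n), b(n))
  a_d = 1.0                        ! host scalar assigned to a device array

  ! The CUF kernel directive turns the loop below into a CUDA kernel;
  ! <<< *, * >>> lets the compiler choose the launch configuration.
  !$cuf kernel do <<< *, * >>>
  do i = 1, n
     b_d(i) = 2.0 * a_d(i)
  end do

  b = b_d                          ! implicit device-to-host copy
  print *, 'b(1) =', b(1)
end program cuf_kernels_demo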
20
OPENACC AND CUDA UNIFIED MEMORY
21
Programming GPU-Accelerated Systems
CUDA Unified Memory for Dynamically Allocated Data
[Diagrams: the conventional GPU developer view, with separate System Memory and GPU Memory connected over PCIe, versus the GPU developer view with CUDA Unified Memory, a single unified memory space.]
22
PGI OpenACC and CUDA Unified Memory
Compiling with the -ta=tesla:managed option
C malloc, C++ new, Fortran allocate all mapped to CUDA Unified Memory

#pragma acc data copyin(a,b) copyout(c)
{
...
#pragma acc parallel
{
#pragma acc loop gang vector
for (i = 0; i < n; ++i) {
c[i] = a[i] + b[i];
...
}
}
...
}

[Diagram: GPU developer view with CUDA Unified Memory.]
23
PGI OpenACC and CUDA Unified Memory
Compiling with the -ta=tesla:managed option
C malloc, C++ new, Fortran allocate all mapped to CUDA Unified Memory
The same loop nest, now with no explicit data directives:

...
#pragma acc parallel
{
#pragma acc loop gang vector
for (i = 0; i < n; ++i) {
c[i] = a[i] + b[i];
...
}
}
...

[Diagram: GPU developer view with CUDA Unified Memory.]
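For concreteness, here is a minimal Fortran sketch of this managed-memory workflow (an illustrative example, not taken from the slides; the program name and array size are arbitrary):

program managed_vec_add
  ! Minimal sketch: compile with
  !   pgfortran -ta=tesla:managed -fast -Minfo=accel managed_vec_add.f90
  ! Under -ta=tesla:managed, ALLOCATE is mapped to CUDA Unified Memory,
  ! so a, b, and c migrate between CPU and GPU without data directives.
  implicit none
  integer, parameter :: n = 1000000
  integer :: i
  real, allocatable :: a(:), b(:), c(:)

  allocate(a(n), b(n), c(n))
  a = 1.0
  b = 2.0

  !$acc parallel loop gang vector
  do i = 1, n
     c(i) = a(i) + b(i)
  end do

  print *, 'c(1) =', c(1), ' c(n) =', c(n)
  deallocate(a, b, c)
end program managed_vec_add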
24
GTC: An OpenACC Production Application
The gyrokinetic toroidal
code (GTC) is a massively
parallel, particle-in-cell
production code for
turbulence simulation in
support of the burning
plasma experiment ITER,
the crucial next step in the
quest for fusion energy.
Being ported for runs on the ORNL Summit supercomputer
http://phoenix.ps.uci.edu/gtc_group
25
GTC Performance using OpenACC
P8 : IBM POWER8NVL, 2 sockets, 20 cores, NVLINK
UM : No Data Directives in sources, compiled with -ta=tesla:managed
[Bar chart: speedup vs. a 20-core POWER8 node for P8+2xP100 (UM and explicit data directives), P8+4xP100 (UM and explicit data directives), and x64+4xV100 (data directives); annotated speedups of 6.1X, 5.9X, 12.1X, 12X, and 16.5X. Banner: OpenPOWER | NVLink | Unified Memory | P100 | V100.]
26
OPENACC RESOURCES
Guides ● Talks ● Tutorials ● Videos ● Books ● Spec ● Code Samples ● Teaching Materials ● Events ● Success Stories ● Courses ● Slack ● Stack Overflow
Resources: https://www.openacc.org/resources
Success Stories: https://www.openacc.org/success-stories
Events: https://www.openacc.org/events
Compilers and Tools: https://www.openacc.org/tools
Support Options
Slack: www.openacc.org/community#slack
PGI User Forums: pgicompilers.com/userforum
Stack Overflow questions: stackoverflow.com/questions/tagged/openacc