
GPU/Accelerator programming with

OpenMP 4.0:
yet another Significant Paradigm Shift in
High-level Parallel Computing
Michael Wong, Senior Compiler Technical Lead/Architect
[email protected]
OpenMP CEO
Chair of WG21 SG5 Transactional Memory
ISOCPP.org, Director, VP
Vice Chair of Programming Languages, Standards Council of Canada
WG21 C++ Standard, Head of Delegation for Canada and IBM

CPPCON 2014
Acknowledgement and Disclaimer
• Numerous people internal and external to the
OpenMP WG, in industry and academia, have
made contributions, influenced ideas, written
parts of this presentation, and offered
feedback that forms part of this talk.
• I even lifted this acknowledgement and
disclaimer from some of them.
• But I claim all credit for errors, and stupid
mistakes. These are mine, all mine!
• Any opinions expressed in this presentation are
my opinions and do not necessarily reflect the
opinions of IBM or OpenMP or ISO C++.
Legal Disclaimer
• This work represents the view of the author and
does not necessarily represent the view of IBM.
• IBM, PowerPC and the IBM logo are trademarks
or registered trademarks of IBM or its subsidiaries
in the United States and other countries.
• The OpenMP_Timeline files here are licensed
under the three clause BSD license,
http://opensource.org/licenses/BSD-3-Clause
• Other company, product, and service names may
be trademarks or service marks of others.
What is OpenMP about?

And how does it fit with C++?


Common-vendor Specification
Parallel Programming model on
Multiple compilers
AMD, Convey, Cray, Fujitsu, HP, IBM,
Intel, NEC, NVIDIA, Oracle, RedHat
(GNU), ST Mircoelectronics, TI,
clang/llvm
A de-facto Standard: Across 3
Major General Purpose
Languages
C++, C, Fortran
A de-facto Standard: One High-
Level Accelerator Language
One High-Level Vector SIMD
language too!
Support Multiple Devices and let
the local compiler generate the
best code
Xeon Phi, NVIDIA, GPU, GPGPU, DSP,
MIC, ARM and FPGA
So how does it fit with other
GPU/Accelerator efforts?
ISO C++ WG21 SG1 Parallelism TS
C++AMP
OpenCL
Cuda?
WG21 SG1 Parallelism TS
std::vector<int> v = ...;
using namespace std::experimental::parallel;

// standard sequential sort
std::sort(v.begin(), v.end());

// explicitly sequential sort
sort(seq, v.begin(), v.end());

// permitting parallel execution
sort(par, v.begin(), v.end());

// permitting vectorization as well
sort(par_vec, v.begin(), v.end());

// sort with dynamically-selected execution
size_t threshold = ...;
execution_policy exec = seq;
if (v.size() > threshold) {
  exec = par;
}
sort(exec, v.begin(), v.end());
C++AMP
void AddArrays(int n, int m, int * pA, int * pB, int * pSum) {
concurrency::array_view<int,2> a(n, m, pA), b(n, m, pB),
sum(n, m, pSum);
concurrency::parallel_for_each(sum.extent,
[=](concurrency::index<2> i) restrict(amp)
{
sum[i] = a[i] + b[i];
});
}
CUDA
texture<float, 2, cudaReadModeElementType> tex;
void foo() {
cudaArray* cu_array;
// Allocate array
cudaChannelFormatDesc description = cudaCreateChannelDesc<float>();
cudaMallocArray(&cu_array, &description, width, height);
// Copy image data to array

// Set texture parameters (default)

// Bind the array to the texture

// Run kernel

// Unbind the array from the texture
}
It's like the difference between:

An Aircraft Carrier Battle Group (ISO)


And a Cruiser (Consortium: OpenMP)
And a Destroyer (Company Specific
language)
Agenda
• What Now?
• OpenMP ARB Corporation
• A Quick Tutorial
• A few key features in 4.0
• Accelerators and GPU programming
• Implementation status and Design in clang/llvm
• The future of OpenMP
• IWOMP 2014 and OpenMPCon 2015

What now?
• Nearly every C and C++ feature makes for beautiful, elegant code for developers
(Disclaimer: I love C++)
– Please insert your beautiful code here:
– Elegance is efficiency, or is it? Or
– What we lack in beauty, we gain in efficiency; Or do we?
• The new C++11 Std is
– 1353 pages compared to 817 pages in C++03
• The new C++14 Std is
– 1373 pages (N3937), vs the free n3972
• The new C11 is
– 701 pages compared to 550 pages in C99
• OpenMP 3.1 is
– 354 pages and growing
• OpenMP 4.0 is
– 520 pages
Beautiful and elegant Lambdas

C++98:
vector<int>::iterator i = v.begin();
for( ; i != v.end(); ++i ) {
  if( *i > x && *i < y )
    break;
}

C++11:
auto i = find_if( begin(v), end(v),
  [=](int i) { return i > x && i < y; } );

• “Lambdas, Lambdas Everywhere”


http://vimeo.com/23975522
• Full Disclosure: I love C++ and have for many years
• But … What is wrong here?
The Truth

• Q: Does your language allow you to access all the GFLOPS of your
machine?
“Is there in Truth No Beauty?”
from Jordan by George Herbert

• Q: Does your language allow you to access all the GFLOPS of your
machine?
• A: What a quaint concept!
– I thought it's natural to drop out into OpenCL, CUDA, OpenGL, DirectX,
C++AMP, Assembler …. to get at my GPU
– Why? I just use my language as a cool driver, it’s a great scripting
language too. But for real kernel computation, I just use Fortran
– I write vectorized code, so my vendor offers me intrinsics, they also tell
me they can auto-vectorize, though I am not sure how much they really
do, so I am looking into OpenCL
– Well, I used to use one thread, but now that I use multiple threads, I
can get at it with C++11, OpenMP, TBB, GCD, PPL, MS then
continuation, Cilk
– I know I may have a TM core somewhere, so my vendor offers me
intrinsics
– No I like using a single thread, so I just use C, or C++
The Question

• Q: Is it true that there is a language that allows you to access all the
GFLOPS of your machine?
Power of Computing
• 1998, when C++ 98 was released
– Intel Pentium II: 0.45 GFLOPS
– No SIMD: SSE came in Pentium III
– No GPUs: GPU came out a year later
• 2011: when C++11 was released
– Intel Core-i7: 80 GFLOPS
– AVX: 8 DP flops/Hz x 4 cores x 4.4 GHz = 140 GFLOPS
– GTX 670: 2500 GFLOPS
• Computers have gotten so much faster; how come
software has not?
– Data structures and algorithms
– latency
In 1998, a typical machine had the
following flops
• .45 GFLOP, 1 core

• Single threaded C++98/C99 dominated this picture


In 2011, a typical machine had the
following flops
• 2500 GFLOP GPU

• To program the GPU, you use CUDA, OpenCL, OpenGL,


DirectX, Intrinsics, C++AMP
In 2011, a typical machine had the
following flops
• 2500 GFLOP GPU+140GFLOP AVX

• To program the GPU, you use CUDA, OpenCL, OpenGL,


DirectX, Intrinsics, C++AMP
• To program the vector unit, you use Intrinsics, OpenCL, or
auto-vectorization
In 2011, a typical machine had the
following flops
• 2500 GFLOP GPU+140GFLOP AVX+80GFLOP 4
cores

• To program the GPU, you use CUDA, OpenCL, OpenGL,


DirectX, Intrinsics, C++AMP
• To program the vector unit, you use Intrinsics, OpenCL, or
auto-vectorization
• To program the CPU, you use C/C++11, OpenMP, TBB,
Cilk, MS Async/then continuation, Apple GCD, Google
executors
In 2011, a typical machine had the
following flops
• 2500 GFLOP GPU+140GFLOP AVX+80GFLOP 4
cores+HTM

• To program the GPU, you use CUDA, OpenCL, OpenGL,


DirectX, Intrinsics, C++AMP
• To program the vector unit, you use Intrinsics, OpenCL, or
auto-vectorization
• To program the CPU, you use C/C++11, OpenMP, TBB,
Cilk, MS Async/then continuation, Apple GCD, Google
executors
• To program HTM, you have?
In 2014, a typical machine had the
following flops
• 2500 GFLOP GPU+140GFLOP AVX+80GFLOP 4
cores+HTM

• To program the GPU, you use CUDA, OpenCL, OpenGL,


DirectX, Intrinsics, C++AMP, OpenMP
• To program the vector unit, you use Intrinsics, OpenCL, or
auto-vectorization, OpenMP
• To program the CPU, you might use C/C++11/14,
OpenMP, TBB, Cilk, MS Async/then continuation, Apple
GCD, Google executors
• To program HTM, you have the upcoming C++ TM TS
OpenMP 4.0 released:
A Significant Paradigm Shift in Parallelism
Agenda
• What Now?
• OpenMP ARB Corporation
• A Quick Tutorial
• A few key features in 4.0
• Accelerators and GPU programming
• Implementation status and Design in clang/llvm
• The future of OpenMP
• IWOMP 2014 and OpenMPCon 2015

A brief history of OpenMP API
by Kelvin Li
Fortran V1.0 (1997)
C & C++ V1.0 (1998)
Fortran V1.1 (1999)
Fortran V2.0 (2000)
C & C++ V2.0 (2002)
Fortran, C & C++ V2.5 (2005)
Fortran, C & C++ V3.0 (2008)
Fortran, C & C++ V3.1 (2011)
Fortran, C & C++ V4.0 (2013)

2014 onwards, more agile
Next OpenMP revision cycle: faster, more predictable
Less monolithic: delivering concurrent TRs & language extensions
OpenMP is a living language
OpenMP members growth: 26 members and growing
• From Dieter an Mey, RWTH Aachen 2012; since 2012 added:
– Red Hat/GCC
– Barcelona Supercomputing Centre
– University of Houston
Major Features by Jim Cownie
OpenMP internal Organization

Today

Future
The New Mission Statement of
OpenMP
• OpenMP’s new mission statement
–“Standardize directive-based multi-
language high-level parallelism that is
performant, productive and portable”
–Updated from
• "Standardize and unify shared memory,
thread-level parallelism for HPC”

Agenda
• What Now?
• OpenMP ARB Corporation
• A Quick Tutorial
• A few key features in 4.0
• Accelerators and GPU programming
• VectorSIMD Programming
• The future of OpenMP
• IWOMP 2014 and OpenMPCon 2015

Hello Concurrent World
#include <iostream>
#include <thread> //#1
void hello() //#2
{
std::cout<<"Hello Concurrent World"<<std::endl;
}
int main()
{
std::thread t(hello); //#3
t.join(); //#4
}

Is this valid C++ today? Are
these equivalent?
// With explicit memory ordering:
int x = 0;
atomic<int> y = 0;

Thread 1:
  x = 17;
  y.store(1, memory_order_release);
  // or: y.store(1);

Thread 2:
  while (y.load(memory_order_acquire) != 1)
    continue;
  // or: while (y.load() != 1)
  assert(x == 17);

// With plain atomic operations:
int x = 0;
atomic<int> y = 0;

Thread 1:
  x = 17;
  y = 1;

Thread 2:
  while (y != 1)
    continue;
  assert(x == 17);
Hello World again
• What will this program print?

#include <stdlib.h>
#include <stdio.h>
int main(int argc, char *argv[]) {
printf("Hello ");
printf("World ");
printf("\n");
return(0);
}

2-threaded Hello World with OpenMP threads
#include <stdlib.h>
#include <stdio.h>
int main(int argc, char *argv[]) {
#pragma omp parallel
  {
    printf("Hello ");
    printf("World ");
  } // End of parallel region
  printf("\n");
  return(0);
}
Output: Hello World Hello World
or:     Hello Hello World World
More advanced 2-threaded
Hello World
#include <stdlib.h>
#include <stdio.h>
int main(int argc, char *argv[]) {
#pragma omp parallel
{
#pragma omp single
{
printf("Hello ");
printf("World ");
}
} // End of parallel region
printf("\n");
return(0);
}
Hello World

Hello World with OpenMP
tasks now run 3 times
int main(int argc, char *argv[]) {
#pragma omp parallel
{
#pragma omp single
{
#pragma omp task
{printf("Hello ");}
#pragma omp task
{printf("World ");}
}
} // End of parallel region
printf("\n");
return(0);
}
Hello World
Hello World
World Hello

Tasks are executed at a task
execution point
int main(int argc, char *argv[]) {
#pragma omp parallel
{
#pragma omp single
{
#pragma omp task
{printf("Hello ");}
#pragma omp task
{printf("World ");}
printf("\nThank You ");
}
} // End of parallel region
printf("\n");
return(0);
}
Thank You Hello World
Thank You Hello World
Thank You World Hello

Execute Tasks First
int main(int argc, char *argv[]) {
#pragma omp parallel
{
#pragma omp single
{
#pragma omp task
{printf("Hello ");}
#pragma omp task
{printf("World ");}
#pragma omp taskwait
printf("Thank You ");
}
} // End of parallel region
printf("\n");return(0);
}
Hello World Thank You
Hello World Thank You
World Hello Thank You

Execute Tasks First with Dependencies
• OpenMP 4.0 only
int main(int argc, char *argv[]) {
#pragma omp parallel
{
#pragma omp single
{
int x = 1;
#pragma omp task shared (x) depend (out:x)
{printf("Hello ");}
#pragma omp task shared (x) depend (in:x)
{printf("World ");}
#pragma omp taskwait
printf("Thank You ");
}
} // End of parallel region
printf("\n");return(0);
}
Hello World Thank You
Hello World Thank You
Hello World Thank You

Intro to OpenMP
• De-facto standard Application Programming
Interface (API) to write shared memory parallel
applications in C, C++, and Fortran
• Consists of:
– Compiler directives
– Run time routines
– Environment variables
• Specification maintained by the OpenMP
Architecture Review Board
(http://www.openmp.org)
– Version 4.0 was released 2013

When do you want to use OpenMP?
• If the compiler cannot parallelize the way you like
it even with auto-parallelization
– a loop is not parallelized
• Data dependency analyses are not able to
determine whether it is safe to parallelize or not
– Compiler finds a low level of parallelism
• But you know there is a higher level, but the compiler
lacks the information to parallelize at the highest
possible level
• No auto-parallelizing compiler? Then you have to
do it yourself
– Need explicit parallelization using directives
Advantages of OpenMP
• Good performance and scalability
–If you do it right ....
• De-facto and mature standard
• An OpenMP program is portable
–Supported by a large number of
compilers
• Allows the program to be parallelized
incrementally
Can OpenMP work with
MultiCore, Heterogeneous
• OpenMP is ideally suited for
multicore architectures
–Memory and threading model
map naturally
–Lightweight
–Mature
–Widely available and used
The OpenMP Execution Model

Directive Format
• C/C++
– #pragma omp directive [clause [clause] …]
– Continuation: \
– Conditional compilation: _OPENMP macro is set
• Fortran:
– Fortran: directives are case insensitive
• Syntax: sentinel directive [clause [[,] clause]...]
• The sentinel is one of the following:
– ✔ !$OMP or C$OMP or *$OMP (fixed format)
– ✔ !$OMP (free format)
– Continuation: follows the language syntax
– Conditional compilation: !$ or C$ -> 2 spaces
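Putting the C/C++ pieces above together, here is a minimal sketch (not from the original slides) that uses a directive with a clause, a line continuation, and the _OPENMP macro for conditional compilation; the thread count and printf text are illustrative only:

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main(void) {
  #pragma omp parallel \
              num_threads(4)
  {
#ifdef _OPENMP
    printf("Thread %d of %d\n",
           omp_get_thread_num(), omp_get_num_threads());
#else
    printf("Compiled without OpenMP\n");
#endif
  }
  return 0;
}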
Components of OpenMP
• Directives
– Tasking
– Parallel region
– Work sharing
– Synchronization
– Data scope attributes
• private
• firstprivate
• lastprivate
• shared
• reduction
– Orphaning
• Environment Variables
– Number of threads
– Scheduling type
– Dynamic thread adjustment
– Nested parallelism
– Stacksize
– Idle threads
– Active levels
– Thread limit
• Runtime Variables
– Number of threads
– Thread id
– Dynamic thread adjustment
– Nested parallelism
– Schedule
– Active levels
– Thread limit
– Nesting level
– Ancestor thread
– Team size
– Wallclock timer
– Locking
But why does OpenMP use
pragmas?
It is an intentional design …
Pragmas can support 3 General
Purpose Programming Languages
and maintain the same style
C++
C
Fortran
And national labs, weather research,
and nuclear simulations still have
substantial kernels written in a mix of
Fortran and C driven by C++
Agenda
• What Now?
• OpenMP ARB Corporation
• A Quick Tutorial
• A few key features in 4.0
• Accelerators and GPU programming
• Implementation status and Design in clang/llvm
• The future of OpenMP
• IWOMP 2014 and OpenMPCon 2015

Goals
• Thread-rich computing environments are becoming more
prevalent
– more computing power, more threads
– less memory relative to compute
• There is parallelism, it comes in many forms
– hybrid MPI - OpenMP parallelism
– mixed mode OpenMP / Pthread parallelism
– nested OpenMP parallelism
• Have to exploit parallelism efficiently
– providing ease of use for casual programmers
– providing full control for power programmers
– providing timing feedback
What did we accomplish in
OpenMP 4.0?
• Broad form of accelerator support
• SIMD
• Cancellation (start of a full error model)
• Task dependencies and task groups
• Thread Affinity
• User-defined reductions
• Initial Fortran 2003
• C/C++ array sections
• Sequentially Consistent Atomics
• Display initial OpenMP internal control variable state
Compilers are here!
• Intel 13.1 compiler supports
Accelerators/SIMD
• Oracle/Sun Studio 12.4 Beta just
announced full OpenMP 4.0
• GCC 4.9 shipped April 9, 2014 supports
4.0
• Clang support for OpenMP is being merged
into trunk, first appearing in 3.5 last week
• Cray, TI, IBM coming online
In 2014, a typical machine had the
following flops
• 2500 GFLOP GPU+140GFLOP AVX+80GFLOP 4
cores+HTM

• To program the GPU, you have to use CUDA, OpenCL,


OpenGL, DirectX, Intrinsics, C++AMP, OpenMP
• To program the vector unit, you have to use Intrinsics,
OpenCL, or auto-vectorization, OpenMP
• To program the CPU, you might use C/C++11/14,
OpenMP, TBB, Cilk, MS Async/then continuation, Apple
GCD, Google executors
• To program HTM, you have the upcoming C++ TM TS
Agenda
• What Now?
• OpenMP ARB Corporation
• A Quick Tutorial
• A few key features in 4.0
• Accelerators and GPU Programming
• Implementation status and Design in clang/llvm
• The future of OpenMP
• IWOMP 2014 and OpenMPCon 2015

OpenMP Accelerator Subcommittee
• Co-chairs / Technical leads
– James Beyer – Cray (courtesy for slides)
– Eric Stotzer – TI (courtesy for slides)
• Active subcommittee members
– Xinmin Tian – Intel (courtesy for slides)
– Ravi Narayanaswamy – Intel (courtesy for slides)
– Jeff Larkin – Nvidia
– Kent Milfeld – TACC
– Henry Jin – NASA
– Kevin O’Brien, Kelvin Li, Alexandre Eichenberger, IBM
– Christian Terboven– RWTH Aachen (courtesy for slides)
– Michael Klemm – Intel (courtesy for slides)
– Stephane Cheveau – CAPS
– Convey, AMD, ORNL, TU Dresden,

So, how do you program GPU?
Why is GPU important now?
• Or is it a flash in the pan?
• The race to exascale computing: 10^18 flops
[Chart: Top500 contenders; vertical scale is in GFLOPS]
What is OpenMP Model’s aim?
• All forms of accelerators, DSP, GPU, APU, GPGPU
– Networked heterogeneous consumer devices
– Kitchen appliances, drones, signal processors, medical
imaging, auto, telecom, automation, not just graphics
engines
Heterogeneous Device model
• OpenMP 4.0 supports accelerators/coprocessors
• Device model:
– One host
– Multiple accelerators/coprocessors of the same kind

Heterogeneous SoC
Glossary

OpenMP 4.0 Device
Constructs
• Execute code on a target device
– omp target [clause[[,] clause],…]
structured-block
– omp declare target
[function-definitions-or-declarations]
• Map variables to a target device
– map ([map-type:] list) // map clause
map-type := alloc | tofrom | to | from
– omp target data [clause[[,] clause],…]
structured-block
– omp target update [clause[[,] clause],…]
– omp declare target
[variable-definitions-or-declarations]
• Workshare for acceleration
– omp teams [clause[[,] clause],…]
structured-block
– omp distribute [clause[[,] clause],…]
for-loops
target Construct

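The figure from this slide is not reproduced; as a hedged sketch of what the target construct does, the following offloads a vector add to the default device (the function and array names are illustrative, not from the original deck):

void vec_add(float *a, float *b, float *c, int n)
{
  int i;
  /* Control transfers to the device; the host thread waits
     until the offloaded region completes. */
  #pragma omp target map(to: a[0:n], b[0:n]) map(from: c[0:n])
  #pragma omp parallel for
  for (i = 0; i < n; i++)
    c[i] = a[i] + b[i];
}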
target data Construct

target update Construct

Execution Model

Execution Model and Data
Environment

map Clause
• The target construct creates a new device data environment and explicitly maps the array sections v1[0:N], v2[:N] and p[0:N] to the new device data environment.
• The variable N is implicitly mapped into the new device data environment from the encountering task's data environment.

extern void init(float*, float*, int);
extern void output(float*, int);
void vec_mult(float *p, float *v1, float *v2, int N)
{
  int i;
  init(v1, v2, N);
  #pragma omp target map(to: v1[0:N], v2[:N]) \
                     map(from: p[0:N])
  #pragma omp parallel for
  for (i=0; i<N; i++)
    p[i] = v1[i] * v2[i];
  output(p, N);
}

Map-types:
• alloc: allocate storage for the corresponding variable
• to: alloc, and assign the value of the original variable to the corresponding variable on entry
• from: alloc, and assign the value of the corresponding variable to the original variable on exit
• tofrom: the default; both to and from
target Construct Example
• Use target construct to
– Transfer control from the host to the device
– Establish a device data environment (if not yet done)
• Host thread waits until offloaded region completed
– Use other OpenMP constructs for asynchronicity

// host
#pragma omp target map(to: b[0:count]) map(to: c,d) map(from: a[0:count])
{
  // target
  #pragma omp parallel for
  for (i=0; i<count; i++) {
    a[i] = b[i] * c + d;
  }
}
// host
Data Environments

target data Construct Example
• The target data construct creates a device data environment and encloses target regions, which have their own device data environments.
• The device data environment of the target data region is inherited by the device data environment of an enclosed target region.
• The target data construct is used to create variables that will persist throughout the target data region.
• v1 and v2 are mapped at each target construct.
• Instead of mapping the variable p twice, once at each target construct, p is mapped once by the target data construct.

extern void init(float*, float*, int);
extern void init_again(float*, float*, int);
extern void output(float*, int);

void vec_mult(float *p, float *v1, float *v2, int N)
{
  int i;
  init(v1, v2, N);
  #pragma omp target data map(from: p[0:N])
  {
    #pragma omp target map(to: v1[:N], v2[:N])
    #pragma omp parallel for
    for (i=0; i<N; i++)
      p[i] = v1[i] * v2[i];

    init_again(v1, v2, N);

    #pragma omp target map(to: v1[:N], v2[:N])
    #pragma omp parallel for
    for (i=0; i<N; i++)
      p[i] = p[i] + (v1[i] * v2[i]);
  }
  output(p, N);
}
Data mapping: shared or distributed memory
[Figure: shared memory (one memory, processors X and Y with caches sharing variable A) vs. distributed memory (host memory and accelerator memory each holding a copy of A)]
• The corresponding variable in the device data environment may share storage with the original variable.
• Writes to the corresponding variable may alter the value of the original variable.
if Clause Example
• The if clause on the target construct indicates that if the variable N is smaller than a given threshold, the target region will be executed by the host device.
• The if clause on the parallel construct indicates that if the variable N is smaller than a second threshold, the parallel region is inactive.

#define THRESHOLD1 1000000
#define THRESHOLD2 1000
extern void init(float*, float*, int);
extern void output(float*, int);
void vec_mult(float *p, float *v1, float *v2, int N)
{
  int i;
  init(v1, v2, N);
  #pragma omp target if(N>THRESHOLD1) \
                     map(to: v1[0:N], v2[:N]) map(from: p[0:N])
  #pragma omp parallel for if(N>THRESHOLD2)
  for (i=0; i<N; i++)
    p[i] = v1[i] * v2[i];
  output(p, N);
}
declare target Construct

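The slide body is missing; the following is a minimal sketch, assuming the OpenMP 4.0 declare target syntax listed earlier, of marking a global variable and a function for device compilation (the names are illustrative):

#pragma omp declare target
float coeff;                                /* allocated on host and device */
float scale(float x) { return coeff * x; }  /* compiled for host and device */
#pragma omp end declare target

void run(float *p, int n)
{
  int i;
  coeff = 2.0f;
  #pragma omp target update to(coeff)       /* refresh the device copy */
  #pragma omp target map(tofrom: p[0:n])
  #pragma omp parallel for
  for (i = 0; i < n; i++)
    p[i] = scale(p[i]);                     /* callable on the device too */
}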
Host and device functions

Explicit Data Transfers: target update Construct Example

#pragma omp target data device(0) map(alloc: tmp[:N]) map(to: input[:N]) map(from: res)
{                                          // host
  #pragma omp target device(0)
  #pragma omp parallel for
  for (i=0; i<N; i++)                      // target
    tmp[i] = some_computation(input[i], i);

  update_input_array_on_the_host(input);   // host
  #pragma omp target update device(0) to(input[:N])

  #pragma omp target device(0)
  #pragma omp parallel for reduction(+:res)
  for (i=0; i<N; i++)                      // target
    res += final_computation(input[i], tmp[i], i);
}                                          // host
Asynchronous Offloading

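The figure is lost; since OpenMP 4.0 has no nowait clause on target, one common pattern (shown here as a hedged sketch; do_other_host_work is a placeholder, not from the original deck) is to wrap the target region in an explicit task so the host can keep working:

void async_offload(float *a, int n)
{
  #pragma omp parallel
  #pragma omp single
  {
    #pragma omp task                 /* the offload runs inside this task */
    {
      #pragma omp target map(tofrom: a[0:n])
      #pragma omp parallel for
      for (int i = 0; i < n; i++)
        a[i] = 2.0f * a[i];
    }
    do_other_host_work();            /* overlaps with the offload */
    #pragma omp taskwait             /* synchronize before using a[] */
  }
}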
Teams Constructs
C/C++
#pragma omp teams [clause[[,] clause],...] new-line
structured-block

Fortran
!$omp teams [clause[[,] clause],...]
structured-block
!$omp end teams

Clauses: num_teams( integer-expression )
num_threads( integer-expression )
default(shared | none)
private( list )
firstprivate( list )
shared( list )
reduction( operator : list )

Restrictions on teams
Construct

Teams Execution Model
Teams Constructs
#pragma omp teams num_teams(3), num_threads(3)
structured-block

Team 0 Team 1 Team 2


Thread0 Thread1 Thread2 Thread0 Thread1 Thread2 Thread0 Thread1 Thread2

Thread0 Thread0 Thread0

Structured-block Structured-block Structured-block


SAXPY: Serial (host)

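The code on this slide did not survive extraction; a plausible serial SAXPY baseline (a reconstruction, not the author's exact listing) is:

void saxpy(int n, float a, float *x, float *y)
{
  for (int i = 0; i < n; ++i)
    y[i] = a * x[i] + y[i];
}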
SAXPY:
Coprocessor/Accelerator

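The original listing is also missing here; a hedged sketch of the offloaded version using the target and map constructs introduced above:

void saxpy(int n, float a, float *x, float *y)
{
  #pragma omp target map(to: x[0:n]) map(tofrom: y[0:n])
  #pragma omp parallel for
  for (int i = 0; i < n; ++i)
    y[i] = a * x[i] + y[i];
}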
Distribute Constructs
C/C++
#pragma omp distribute [clause[[,] clause],...] new-line
for-loops

Fortran
!$omp distribute [clause[[,] clause],...]
do-loops
[ !$omp end distribute ]

Clauses: private( list )
firstprivate( list )
collapse( n )
dist_schedule( kind[, chunk_size] )

A distribute construct must be closely nested in a teams region.

distribute Construct

Teams + Distribute Execution Model
#pragma omp teams num_teams(3), num_threads(3)
#pragma omp distribute
for (int i=0; i<9; i++) {

Team 0 Team 1 Team 2


Thread0 Thread1 Thread2 Thread0 Thread1 Thread2 Thread0 Thread1 Thread2

Thread0 Thread0 Thread0

i = 0,1,2 i = 3,4,5 i = 6,7,8


Teams + Distribute Constructs
#pragma omp teams num_teams(3), num_threads(3)
#pragma omp distribute
for (int i=0; i<9; i++) {
# pragma omp parallel for
for (int j=0;j<6; j++) {

Team 0 Team 1 Team 2


Thread0 Thread1 Thread2 Thread0 Thread1 Thread2 Thread0 Thread1 Thread2

Thread0 Thread0 Thread0

i = 0,1,2 i = 3,4,5 i = 6,7,8

Thread0 Thread1 Thread2 Thread0 Thread1 Thread2 Thread0 Thread1 Thread2

j=0,1 j=2,3 j=4,5 j=0,1 j=2,3 j=4,5 j=0,1 j=2,3 j=4,5


SAXPY:
Coprocessor/Accelerator

Combined Constructs

SAXPY: Combined Constructs

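The slide's listing is missing; a hedged sketch using the combined construct spelling from OpenMP 4.0 (not necessarily the author's exact code):

void saxpy(int n, float a, float *x, float *y)
{
  #pragma omp target teams distribute parallel for simd \
              map(to: x[0:n]) map(tofrom: y[0:n])
  for (int i = 0; i < n; ++i)
    y[i] = a * x[i] + y[i];
}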
Additional Runtime Support

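The slide content is lost; OpenMP 4.0 added device-related runtime calls such as those below. This is a hedged sketch of how they compose (the printf strings are illustrative, and printing inside a target region depends on the device):

#include <stdio.h>
#include <omp.h>

int main(void)
{
  int ndev = omp_get_num_devices();      /* number of target devices */
  if (ndev > 0)
    omp_set_default_device(0);           /* choose a default device */
  printf("devices: %d, default: %d\n",
         ndev, omp_get_default_device());

  #pragma omp target
  {
    if (omp_is_initial_device())
      printf("target region ran on the host\n");  /* fallback case */
  }
  return 0;
}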
Multi-device Example

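The original example is not reproduced; one hedged sketch (the function and variable names are mine) that spreads chunks of a SAXPY across all available devices with the device() clause:

#include <omp.h>

void multi_device_saxpy(int n, float a, float *x, float *y)
{
  int ndev = omp_get_num_devices();
  if (ndev == 0) ndev = 1;               /* fall back to the host device */
  int chunk = n / ndev;

  #pragma omp parallel for
  for (int d = 0; d < ndev; ++d) {
    int lo  = d * chunk;
    int len = (d == ndev - 1) ? n - lo : chunk;
    /* Array sections use [lower-bound : length] */
    #pragma omp target device(d) map(to: x[lo:len]) map(tofrom: y[lo:len])
    #pragma omp parallel for
    for (int i = lo; i < lo + len; ++i)
      y[i] = a * x[i] + y[i];
  }
}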
OpenACC1 compared to OpenMP 4.0 (by Dr. James Beyer)

OpenACC1 -> OpenMP 4.0
• Parallel (offload) -> Target
• Parallel (multiple "threads") -> Teams/Parallel
• Kernels -> (no direct equivalent)
• Data -> Target Data
• Loop -> Distribute/Do/For/Simd
• Host data -> (no direct equivalent)
• Cache -> (no direct equivalent)
• Update -> Target Update
• Wait -> (no direct equivalent)
• Declare -> Declare Target
Future OpenACC vs. future OpenMP (by Dr. James Beyer)

OpenACC2: enter data, exit data, data api, routine, async wait, parallel in parallel, tile, linkable, device_type
OpenMP future: unstructured data environment, declare target, parallel in parallel or team, tile, linkable or deferred_map, device_type
Preliminary results: AXPY (Y = a*X)
[Chart: AXPY execution time (s) vs. vector size (float), from 5000 to 500000000, comparing Sequential, OpenMP FOR (16 threads), HOMP ACC, PGI OpenACC and HMPP OpenACC]

Hardware configuration:
• 4 quad-core Intel Xeon processors (16 cores) at 2.27 GHz with 32 GB DRAM
• NVIDIA Tesla K20c GPU (Kepler architecture)
Software configuration:
• PGI OpenACC compiler version 13.4
• HMPP OpenACC compiler version 3.3.3
• GCC 4.4.7 and the CUDA 5.0 compiler
Jacobi
[Chart: Jacobi execution time (s) vs. matrix size (float), from 128x128 to 2048x2048, comparing Sequential, HMPP, PGI, HOMP, OpenMP, HMPP Collapse and HOMP Collapse]
Agenda
• What Now?
• OpenMP ARB Corporation
• A Quick Tutorial
• A few key features in 4.0
• Accelerators and GPU Programming
• Implementation status and Design in clang/llvm
• The future of OpenMP
• IWOMP 2014 and OpenMPCon 2015

Compilers are here!
• Oracle/Sun Studio 12.4 Beta just
announced full OpenMP 4.0
• GCC 4.9 shipped April 9, 2014 supports
OpenMP 4.0
• Clang support for OpenMP injecting
into trunk, first appears in 3.5
• Intel 13.1 compiler supports
Accelerators/SIMD
• Cray, TI, IBM coming
OpenMP in Clang update
• I chaired the weekly OpenMP Clang review WG (Intel, IBM, AMD, TI, Micron) to
help speed up OpenMP upstreaming into clang: April - ongoing
– Joint code reviews, code refactoring
– Delivered most of the OpenMP 3.1 constructs (except atomic and ordered) into the Clang 3.5
stream for AST/semantic analysis support
– Have OpenMP -fsyntax-only, the runtime, and a basic parallel for loop region to
demonstrate code capability
– Added U of Houston OpenMP tests into clang
– IBM team Delivered changes for OpenMP RT for PPC, other teams added their
platform/architecture
– Released Joint design on Multi-device target interface for LLVM to llvm-dev for
comment
• Future:
– Clang 3.5 (Sept 2, 2014): Initial support for AST/SEMA for OpenMP 3.1 (except
atomic and ordered) + OpenMP library for AMD, ARM, TI, IBM, Intel
– Clang 3.6 (~Feb 2015): aim for functional codegen of all OpenMP 3.1 + accelerator
support(from 4.0)
– Clang 3.7 (~Sept 2015): aim for full OpenMP 4.0 functional support
Release note committed by me
to clang/llvm 3.5
• Clang 3.5 now has parsing and semantic-analysis support for all
OpenMP 3.1 pragmas (except atomics and ordered). LLVM's OpenMP
runtime library, originally developed by Intel, has been modified to
work on ARM, PowerPC, as well as X86. Code generation support is
minimal at this point and will continue to be developed for 3.6,
along with the rest of OpenMP 3.1. Support for OpenMP 4.0 features,
such as SIMD and target accelerator directives, is also in progress.
Contributors to this work include AMD, Argonne National Lab., IBM,
Intel, Texas Instruments, University of Houston and many others.

Many Participants/companies
• Ajay Jayaraj, TI
• Alexander Musman, Intel
• Alex Eichenberger, IBM
• Alexey Bataev, Intel
• Andrey Bokhanko, Intel
• Carlo Bertolli, IBM
• Eric Stotzer, TI
• Guansong Zhang, AMD
• Hal Finkel, ANL
• Ilia Verbyn, Intel
• James Cownie, Intel
• Kelvin Li, IBM
• Kevin O'Brien, IBM
• Samuel Antao, IBM
• Sergey Ostanevich, Intel
• Sunita Chandrasekaran, UH
• Michael Wong, IBM
• Priya Unikhrishnan, IBM
• Robert Ho, IBM
• Wael Yehia, IBM
• Yan Liu, IBM
Summary of upstream
progress of OpenMP clang
• Upstream progress to clang 3.5
– https://github.com/clang-omp/clang/wiki/Status-of-
supported-OpenMP-constructs
• Benchmark OpenMP clang vs OpenMP GCC
– http://www.phoronix.com/scan.php?page=article&ite
m=llvm_clang_openmp&num=1
• Unfairly used -O3 for GCC and no optimization for clang
• Link to OpenMP offload infrastructure in LLVM
– http://lists.cs.uiuc.edu/pipermail/llvmdev/attachment
s/20140809/cd6c7f7a/attachment-0001.pdf
OpenMP offload/target in LLVM
• Samuel Antao (IBM)
• Carlo Bertolli (IBM)
• Andrey Bokhanko (Intel)
• Alexandre Eichenberger (IBM)
• Hal Finkel ( Argonne National Laboratory )
• Sergey Ostanevich (Intel)
• Eric Stotzer (Texas Instruments)
• Guansong Zhang (AMD)
Goal of Design
1. support multiple target platforms at runtime and
be extensible in the future with minimal or no
changes
2. determine the availability of the target platform
at runtime and be able to make a decision to
offload depending on the availability and load of
the target platform
Clang/llvm offload design
Example code
#pragma omp declare target
int foo(int[1000]);
#pragma omp end declare target
...
int device_count = omp_get_num_devices();
int device_no;
int *red = malloc(device_count * sizeof(int));
#pragma omp parallel
for (i = 0; i < 1000; i++) {
  device_no = i % device_count;
  #pragma omp target device(device_no) map(to:c) map(red[i])
  {
    red[i] += foo(c);
  }
}

for (i = 0; i < device_count; i++)
  total += red[i];
Generation of fat binary
1. The driver called on a source code should spawn a
number of front-end executions for each available
target. This should generate a set of object files for each
target
2. Target linkers combine dedicated target objects into
target shared libraries – one for each target
3. The host linker combines host object files into an
executable/shared library and incorporates shared
libraries for each target into a separate section within
host binary. This process and format is target-
dependent and will be thereafter handled by the target
RTL at runtime
Agenda
• What Now?
• OpenMP ARB Corporation
• A Quick Tutorial
• A few key features in 4.0
• Accelerators and GPU programming
• Implementation status and Design in clang/llvm
• The future of OpenMP
• IWOMP 2014 and OpenMPCon 2015

What did we accomplish in
OpenMP 4.0?
• Much broader form of accelerator support
• SIMD
• Cancellation (start of a full error model)
• Task dependencies and task groups
• Thread Affinity
• User-defined reductions
• Initial Fortran 2003
• C/C++ array sections
• Sequentially Consistent Atomics
• Display initial OpenMP internal control variable state
OpenMP future features
• OpenMP Tools: Profilers and Debuggers
– Just released as TR2
• Consumer style parallelism: event/async/futures
• Enhance Accelerator support/FPGA
– Multiple device type, linkable to match OpenACC2
• Additional Looping constructs
• Transactional Memory, Speculative Execution
• Task Model refinements
• CPU Affinity
• Common Array Shaping
• Full Error Model
• Interoperability
• Rebase to new C/C++/Fortran Standards, C/C++11 memory model
Agenda
• What Now?
• OpenMP ARB Corporation
• A Quick Tutorial
• A few key features in 4.0
• Accelerators
– OpenMP and OpenACC
• Affinity
• VectorSIMD
• The future of OpenMP
• IWOMP 2014 and OpenMPCon 2015
IWOMP, SC14 and
OpenMPCon
• International Workshop on OpenMP
– 2014 to be held in Brazil
– A strongly academic conference, with refereed papers, and a Springer-
Verlag published proceedings
• SC14
– Chairing OpenMP BoF, steering committee for LLVM in HPC,
giving keynote at the OpenMP Exhibitor's Forum
• What is missing is a user conference similar to ACCU, pyCON, CPPCON (next
week presenting 2 talks), C++Now
• OpenMPCON
– A user conference paired with IWOMP
– Non-refereed, user abstracts
– 1st one will be held in Europe in 2015 to pair with the 2015 IWOMP
IWOMP 2014
September 28-30, 2014
SENAI CIMATEC – Salvador, Brazil
Salvador

• Salvador is the largest city on the northeast coast of Brazil


– The capital of the Northeastern Brazilian state of Bahia
– It is also known as Brazil's capital of happiness
• Salvador was the first colonial capital of Brazil
– The city is one of the oldest in the Americas

• Getting There (SSA):


– Direct flights from US (Miami) and Europe (Lisbon, Madrid, &
Frankfurt)
– Alternatively, fly to Rio (GIG) or Sao Paulo (GRU) and connect to
Salvador (SSA)

• Average Temperatures in September:


– Average high: 27 °C / 81 °F
– Daily mean: 25 °C / 77 °F
– Average low 22 °C / 72 °F
Common-vendor Specification
Parallel Programming model on
Multiple compilers
AMD, Convey, Cray, Fujitsu, HP, IBM,
Intel, NEC, NVIDIA, Oracle, RedHat
(GNU), STMicroelectronics, TI,
clang/llvm
A de-facto Standard: Across 3
Major General Purpose
Languages
C++, C, Fortran
A de-facto Standard: One High-
Level Accelerator Language
One High-Level Vector SIMD
language too!
Support Multiple Devices and let
the local compiler generate the
best code
Xeon Phi, NVIDIA, GPU, GPGPU, DSP,
MIC, ARM and FPGA
My blogs and email address
• ISOCPP.org Director, VP
http://isocpp.org/wiki/faq/wg21#michael-wong
OpenMP CEO: http://openmp.org/wp/about-openmp/
My Blogs: http://ibm.co/pCvPHR
C++11 status: http://tinyurl.com/43y8xgf
Boost test results
http://www.ibm.com/support/docview.wss?rs=2239&context=SS
JT9L&uid=swg27006911
C/C++ Compilers Feature Request Page
http://www.ibm.com/developerworks/rfe/?PROD_ID=700
Chair of WG21 SG5 Transactional Memory:
https://groups.google.com/a/isocpp.org/forum/?hl=en&fromgro
ups#!forum/tm

QUESTIONS?

I look forward to your feedback!

Thank you very much!
Michael Wong
