
GPU/Accelerator programming with

OpenMP 4.0:
yet another Significant Paradigm Shift in
High-level Parallel Computing
Michael Wong, Senior Compiler Technical Lead/Architect
[email protected]
OpenMP CEO
Chair of WG21 SG5 Transactional Memory
ISOCPP.org, Director, VP
Vice Chair of Programming Languages, Standards Council of Canada
WG21 C++ Standard, Head of Delegation for Canada and IBM

CPPCON 2014
Acknowledgement and Disclaimer
• Numerous people internal and external to the
OpenMP WG, in industry and academia, have
made contributions, influenced ideas, written
parts of this presentation, and offered
feedback that forms part of this talk.
• I even lifted this acknowledgement and
disclaimer from some of them.
• But I claim all credit for errors, and stupid
mistakes. These are mine, all mine!
• Any opinions expressed in this presentation are
my opinions and do not necessarily reflect the
opinions of IBM or OpenMP or ISO C++.
Legal Disclaimer
• This work represents the view of the author and
does not necessarily represent the view of IBM.
• IBM, PowerPC and the IBM logo are trademarks
or registered trademarks of IBM or its subsidiaries
in the United States and other countries.
• The OpenMP_Timeline files here are licensed
under the three clause BSD license,
http://opensource.org/licenses/BSD-3-Clause
• Other company, product, and service names may
be trademarks or service marks of others.
What is OpenMP about?

And how does it fit with C++?


Common-vendor Specification
Parallel Programming model on
Multiple compilers
AMD, Convey, Cray, Fujitsu, HP, IBM,
Intel, NEC, NVIDIA, Oracle, RedHat
(GNU), ST Mircoelectronics, TI,
clang/llvm
A de-facto Standard: Across 3
Major General Purpose
Languages
C++, C, Fortran
A de-facto Standard: One High-
Level Accelerator Language
One High-Level Vector SIMD
language too!
Support Multiple Devices and let
the local compiler generate the
best code
Xeon Phi, NVIDIA, GPU, GPGPU, DSP,
MIC, ARM and FPGA
So how does it fit with other
GPU/Accelerator efforts?
ISO C++ WG21 SG1 Parallelism TS
C++AMP
OpenCL
Cuda?
WG21 SG1 Parallelism TS
std::vector<int> v = ...;
using namespace std::experimental::parallel;

// standard sequential sort
std::sort(v.begin(), v.end());

// explicitly sequential sort
sort(seq, v.begin(), v.end());

// permitting parallel execution
sort(par, v.begin(), v.end());

// permitting vectorization as well
sort(par_vec, v.begin(), v.end());

// sort with dynamically-selected execution
size_t threshold = ...;
execution_policy exec = seq;
if (v.size() > threshold) {
  exec = par;
}
sort(exec, v.begin(), v.end());
C++AMP
void AddArrays(int n, int m, int * pA, int * pB, int * pSum) {
concurrency::array_view<int,2> a(n, m, pA), b(n, m, pB),
sum(n, m, pSum);
concurrency::parallel_for_each(sum.extent,
[=](concurrency::index<2> i) restrict(amp)
{
sum[i] = a[i] + b[i];
});
}
CUDA
texture<float, 2, cudaReadModeElementType> tex;
void foo() {
cudaArray* cu_array;
// Allocate array
cudaChannelFormatDesc description = cudaCreateChannelDesc<float>();
cudaMallocArray(&cu_array, &description, width, height);
// Copy image data to array

// Set texture parameters (default)

// Bind the array to the texture

// Run kernel

// Unbind the array from the texture
}
It's like the difference between:

An Aircraft Carrier Battle Group (ISO)


And a Cruiser (Consortium: OpenMP)
And a Destroyer (Company Specific
language)
Agenda
• What Now?
• OpenMP ARB Corporation
• A Quick Tutorial
• A few key features in 4.0
• Accelerators and GPU programming
• Implementation status and Design in clang/llvm
• The future of OpenMP
• IWOMP 2014 and OpenMPCon 2015

What now?
• Nearly every C and C++ feature makes for beautiful, elegant code for developers
(Disclaimer: I love C++)
– Please insert your beautiful code here:
– Elegance is efficiency, or is it? Or
– What we lack in beauty, we gain in efficiency; Or do we?
• The new C++11 Std is
– 1353 pages compared to 817 pages in C++03
• The new C++14 Std is
– 1373 pages (N3937), vs the free n3972
• The new C11 is
– 701 pages compared to 550 pages in C99
• OpenMP 3.1 is
– 354 pages and growing
• OpenMP 4.0 is
– 520 pages
Beautiful and elegant Lambdas

C++98:
vector<int>::iterator i = v.begin();
for( ; i != v.end(); ++i ) {
  if( *i > x && *i < y )
    break;
}

C++11:
auto i = find_if( begin(v), end(v),
  [=](int i) { return i > x && i < y; } );

• “Lambdas, Lambdas Everywhere”


http://vimeo.com/23975522
• Full Disclosure: I love C++ and have for many years
• But … What is wrong here?
The Truth

• Q: Does your language allow you to access all the GFLOPS of your
machine?
“Is there in Truth No Beauty?”
from Jordan by George Herbert

• Q: Does your language allow you to access all the GFLOPS of your
machine?
• A: What a quaint concept!
– I thought it's natural to drop out into OpenCL, CUDA, OpenGL, DirectX,
C++AMP, Assembler …. to get at my GPU
– Why? I just use my language as a cool driver, it’s a great scripting
language too. But for real kernel computation, I just use Fortran
– I write vectorized code, so my vendor offers me intrinsics, they also tell
me they can auto-vectorize, though I am not sure how much they really
do, so I am looking into OpenCL
– Well, I used to use one thread, but now that I use multiple threads, I
can get at it with C++11, OpenMP, TBB, GCD, PPL, MS then
continuation, Cilk
– I know I may have a TM core somewhere, so my vendor offers me
intrinsics
– No I like using a single thread, so I just use C, or C++
The Question

• Q: Is it true that there is a language that allows you to access all the
GFLOPS of your machine?
Power of Computing
• 1998, when C++ 98 was released
– Intel Pentium II: 0.45 GFLOPS
– No SIMD: SSE came in Pentium III
– No GPUs: GPU came out a year later
• 2011: when C++11 was released
– Intel Core-i7: 80 GFLOPS
– AVX: 8 DP flops/Hz x 4 cores x 4.4 GHz = 140 GFLOPS
– GTX 670: 2500 GFLOPS
• Computers have gotten so much faster; how come
software has not?
– Data structures and algorithms
– latency
In 1998, a typical machine had the
following flops
• .45 GFLOP, 1 core

• Single threaded C++98/C99 dominated this picture


In 2011, a typical machine had the
following flops
• 2500 GFLOP GPU

• To program the GPU, you use CUDA, OpenCL, OpenGL,


DirectX, Intrinsics, C++AMP
In 2011, a typical machine had the
following flops
• 2500 GFLOP GPU+140GFLOP AVX

• To program the GPU, you use CUDA, OpenCL, OpenGL,


DirectX, Intrinsics, C++AMP
• To program the vector unit, you use Intrinsics, OpenCL, or
auto-vectorization
In 2011, a typical machine had the
following flops
• 2500 GFLOP GPU+140GFLOP AVX+80GFLOP 4
cores

• To program the GPU, you use CUDA, OpenCL, OpenGL,


DirectX, Intrinsics, C++AMP
• To program the vector unit, you use Intrinsics, OpenCL, or
auto-vectorization
• To program the CPU, you use C/C++11, OpenMP, TBB,
Cilk, MS Async/then continuation, Apple GCD, Google
executors
In 2011, a typical machine had the
following flops
• 2500 GFLOP GPU+140GFLOP AVX+80GFLOP 4
cores+HTM

• To program the GPU, you use CUDA, OpenCL, OpenGL,


DirectX, Intrinsics, C++AMP
• To program the vector unit, you use Intrinsics, OpenCL, or
auto-vectorization
• To program the CPU, you use C/C++11, OpenMP, TBB,
Cilk, MS Async/then continuation, Apple GCD, Google
executors
• To program HTM, you have?
In 2014, a typical machine had the
following flops
• 2500 GFLOP GPU+140GFLOP AVX+80GFLOP 4
cores+HTM

• To program the GPU, you use CUDA, OpenCL, OpenGL,


DirectX, Intrinsics, C++AMP, OpenMP
• To program the vector unit, you use Intrinsics, OpenCL, or
auto-vectorization, OpenMP
• To program the CPU, you might use C/C++11/14,
OpenMP, TBB, Cilk, MS Async/then continuation, Apple
GCD, Google executors
• To program HTM, you have the upcoming C++ TM TS
OpenMP 4.0 released:
A Significant Paradigm Shift in Parallelism
Agenda
• What Now?
• OpenMP ARB Corporation
• A Quick Tutorial
• A few key features in 4.0
• Accelerators and GPU programming
• Implementation status and Design in clang/llvm
• The future of OpenMP
• IWOMP 2014 and OpenMPCon 2015

A brief history of OpenMP API
by Kelvin Li
Fortran V1.0 (1997)
C & C++ V1.0 (1998)
Fortran V1.1 (1999)
Fortran V2.0 (2000)
C & C++ V2.0 (2002)
Fortran, C & C++ V2.5 (2005)
Fortran, C & C++ V3.0 (2008)
Fortran, C & C++ V3.1 (2011)
Fortran, C & C++ V4.0 (2013)

2014 onwards, more agile
Next OpenMP revision cycle: faster, more predictable
Less monolithic: delivering concurrent TRs & language extensions
OpenMP is a living language
OpenMP members growth: 26 members and growing
• From Dieter an Mey, RWTH Aachen 2012; since 2012 added:
– Red Hat/GCC
– Barcelona Supercomputing Centre
– University of Houston
Major Features by Jim Cownie
OpenMP internal Organization

Today

Future
The New Mission Statement of
OpenMP
• OpenMP’s new mission statement
–“Standardize directive-based multi-
language high-level parallelism that is
performant, productive and portable”
–Updated from
• "Standardize and unify shared memory,
thread-level parallelism for HPC”

Agenda
• What Now?
• OpenMP ARB Corporation
• A Quick Tutorial
• A few key features in 4.0
• Accelerators and GPU programming
• VectorSIMD Programming
• The future of OpenMP
• IWOMP 2014 and OpenMPCon 2015

Hello Concurrent World
#include <iostream>
#include <thread> //#1
void hello() //#2
{
std::cout<<"Hello Concurrent World"<<std::endl;
}
int main()
{
std::thread t(hello); //#3
t.join(); //#4
}

Is this valid C++ today? Are
these equivalent?
// With explicit memory ordering:
int x = 0;
atomic<int> y = 0;

Thread 1:
  x = 17;
  y.store(1, memory_order_release);
  // or: y.store(1);

Thread 2:
  while (y.load(memory_order_acquire) != 1)
    continue;
  // or: while (y.load() != 1)
  assert(x == 17);

// With plain atomic operations:
int x = 0;
atomic<int> y = 0;

Thread 1:
  x = 17;
  y = 1;

Thread 2:
  while (y != 1)
    continue;
  assert(x == 17);
Hello World again
• What will this program print?

#include <stdlib.h>
#include <stdio.h>
int main(int argc, char *argv[]) {
printf("Hello ");
printf("World ");
printf("\n");
return(0);
}

2-threaded Hello World with OpenMP threads
#include <stdlib.h>
#include <stdio.h>
int main(int argc, char *argv[]) {
#pragma omp parallel
  {
    printf("Hello ");
    printf("World ");
  } // End of parallel region
  printf("\n");
  return(0);
}
Output: Hello World Hello World
or:     Hello Hello World World
More advanced 2-threaded
Hello World
#include <stdlib.h>
#include <stdio.h>
int main(int argc, char *argv[]) {
#pragma omp parallel
{
#pragma omp single
{
printf("Hello ");
printf("World ");
}
} // End of parallel region
printf("\n");
return(0);
}
Hello World

Hello World with OpenMP
tasks now run 3 times
int main(int argc, char *argv[]) {
#pragma omp parallel
{
#pragma omp single
{
#pragma omp task
{printf("Hello ");}
#pragma omp task
{printf("World ");}
}
} // End of parallel region
printf("\n");
return(0);
}
Hello World
Hello World
World Hello

Tasks are executed at a task
execution point
int main(int argc, char *argv[]) {
#pragma omp parallel
{
#pragma omp single
{
#pragma omp task
{printf("Hello ");}
#pragma omp task
{printf("World ");}
printf("\nThank You ");
}
} // End of parallel region
printf("\n");
return(0);
}
Thank You Hello World
Thank You Hello World
Thank You World Hello

Execute Tasks First
int main(int argc, char *argv[]) {
#pragma omp parallel
{
#pragma omp single
{
#pragma omp task
{printf("Hello ");}
#pragma omp task
{printf("World ");}
#pragma omp taskwait
printf("Thank You ");
}
} // End of parallel region
printf("\n");return(0);
}
Hello World Thank You
Hello World Thank You
World Hello Thank You

Execute Tasks First with Dependencies
• OpenMP 4.0 only
int main(int argc, char *argv[]) {
#pragma omp parallel
{
#pragma omp single
{
int x = 1;
#pragma omp task shared (x) depend (out:x)
{printf("Hello ");}
#pragma omp task shared (x) depend (in:x)
{printf("World ");}
#pragma omp taskwait
printf("Thank You ");
}
} // End of parallel region
printf("\n");return(0);
}
Hello World Thank You
Hello World Thank You
Hello World Thank You

Intro to OpenMP
• De-facto standard Application Programming
Interface (API) to write shared memory parallel
applications in C, C++, and Fortran
• Consists of:
– Compiler directives
– Run time routines
– Environment variables
• Specification maintained by the OpenMP
Architecture Review Board
(http://www.openmp.org)
– Version 4.0 was released 2013

When do you want to use OpenMP?
• If the compiler cannot parallelize the way you like
it even with auto-parallelization
– a loop is not parallelized
• Data dependency analyses are not able to
determine whether it is safe to parallelize or not
– Compiler finds a low level of parallelism
• But you know there is a higher level, but the compiler
lacks the information to parallelize at the highest
possible level
• No auto-parallelizing compiler? Then you have to
do it yourself
– Need explicit parallelization using directives
Advantages of OpenMP
• Good performance and scalability
–If you do it right ....
• De-facto and mature standard
• An OpenMP program is portable
–Supported by a large number of
compilers
• Allows the program to be parallelized
incrementally
Can OpenMP work with
MultiCore, Heterogeneous
• OpenMP is ideally suited for
multicore architectures
–Memory and threading model
map naturally
–Lightweight
–Mature
–Widely available and used
The OpenMP Execution Model

Directive Format
• C/C++
– #pragma omp directive [clause [clause] …]
– Continuation: \
– Conditional compilation: _OPENMP macro is set
• Fortran:
– Fortran: directives are case insensitive
• Syntax: sentinel directive [clause [[,] clause]...]
• The sentinel is one of the following:
– ✔ !$OMP or C$OMP or *$OMP (fixed format)
– ✔ !$OMP (free format)
– Continuation: follows the language syntax
– Conditional compilation: !$ or C$ -> 2 spaces
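Putting the C/C++ pieces above together, here is a minimal sketch (not from the original slides) that uses a directive with a clause, a line continuation, and the _OPENMP macro for conditional compilation; the thread count and printf text are illustrative only:

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main(void) {
  #pragma omp parallel \
              num_threads(4)
  {
#ifdef _OPENMP
    printf("Thread %d of %d\n",
           omp_get_thread_num(), omp_get_num_threads());
#else
    printf("Compiled without OpenMP\n");
#endif
  }
  return 0;
}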
Components of OpenMP
• Directives
– Tasking
– Parallel region
– Work sharing
– Synchronization
– Data scope attributes
• private
• firstprivate
• lastprivate
• shared
• reduction
– Orphaning
• Environment Variables
– Number of threads
– Scheduling type
– Dynamic thread adjustment
– Nested parallelism
– Stacksize
– Idle threads
– Active levels
– Thread limit
• Runtime Variables
– Number of threads
– Thread id
– Dynamic thread adjustment
– Nested parallelism
– Schedule
– Active levels
– Thread limit
– Nesting level
– Ancestor thread
– Team size
– Wallclock timer
– Locking
But why does OpenMP use
pragmas?
It is an intentional design …
Pragmas can support 3 General
Purpose Programming Languages
and maintain the same style
C++
C
Fortran
And national labs, weather research,
and nuclear simulations still have
substantial kernels written in a mix of
Fortran and C driven by C++
Agenda
• What Now?
• OpenMP ARB Corporation
• A Quick Tutorial
• A few key features in 4.0
• Accelerators and GPU programming
• Implementation status and Design in clang/llvm
• The future of OpenMP
• IWOMP 2014 and OpenMPCon 2015

Goals
• Thread-rich computing environments are becoming more
prevalent
– more computing power, more threads
– less memory relative to compute
• There is parallelism, it comes in many forms
– hybrid MPI - OpenMP parallelism
– mixed mode OpenMP / Pthread parallelism
– nested OpenMP parallelism
• Have to exploit parallelism efficiently
– providing ease of use for casual programmers
– providing full control for power programmers
– providing timing feedback
What did we accomplish in
OpenMP 4.0?
• Broad form of accelerator support
• SIMD
• Cancellation (start of a full error model)
• Task dependencies and task groups
• Thread Affinity
• User-defined reductions
• Initial Fortran 2003
• C/C++ array sections
• Sequentially Consistent Atomics
• Display initial OpenMP internal control variable state
Compilers are here!
• Intel 13.1 compiler supports
Accelerators/SIMD
• Oracle/Sun Studio 12.4 Beta just
announced full OpenMP 4.0
• GCC 4.9 shipped April 9, 2014 supports
4.0
• Clang support for OpenMP is being merged
into trunk, first appearing in 3.5 last week
• Cray, TI, IBM coming online
In 2014, a typical machine had the
following flops
• 2500 GFLOP GPU+140GFLOP AVX+80GFLOP 4
cores+HTM

• To program the GPU, you have to use CUDA, OpenCL,


OpenGL, DirectX, Intrinsics, C++AMP, OpenMP
• To program the vector unit, you have to use Intrinsics,
OpenCL, or auto-vectorization, OpenMP
• To program the CPU, you might use C/C++11/14,
OpenMP, TBB, Cilk, MS Async/then continuation, Apple
GCD, Google executors
• To program HTM, you have the upcoming C++ TM TS
Agenda
• What Now?
• OpenMP ARB Corporation
• A Quick Tutorial
• A few key features in 4.0
• Accelerators and GPU Programming
• Implementation status and Design in clang/llvm
• The future of OpenMP
• IWOMP 2014 and OpenMPCon 2015

OpenMP Accelerator Subcommittee
• Co-chairs / Technical leads
– James Beyer – Cray (courtesy for slides)
– Eric Stotzer – TI (courtesy for slides)
• Active subcommittee members
– Xinmin Tian – Intel (courtesy for slides)
– Ravi Narayanaswamy – Intel (courtesy for slides)
– Jeff Larkin – Nvidia
– Kent Milfeld – TACC
– Henry Jin – NASA
– Kevin O’Brien, Kelvin Li, Alexandre Eichenberger, IBM
– Christian Terboven– RWTH Aachen (courtesy for slides)
– Michael Klemm – Intel (courtesy for slides)
– Stephane Cheveau – CAPS
– Convey, AMD, ORNL, TU Dresden,

So, how do you program GPU?
Why is GPU important now?
• Or is it a flash in the pan?
• The race to exascale computing: 10^18 flops
[Chart: Top500 contenders; vertical scale is in GFLOPS]
What is OpenMP Model’s aim?
• All forms of accelerators, DSP, GPU, APU, GPGPU
– Networked heterogeneous consumer devices
– Kitchen appliances, drones, signal processors, medical
imaging, auto, telecom, automation, not just graphics
engines
Heterogeneous Device model
• OpenMP 4.0 supports accelerators/coprocessors
• Device model:
– One host
– Multiple accelerators/coprocessors of the same kind

Heterogeneous SoC
Glossary

OpenMP 4.0 Device
Constructs
• Execute code on a target device
– omp target [clause[[,] clause],…]
structured-block
– omp declare target
[function-definitions-or-declarations]
• Map variables to a target device
– map ([map-type:] list) // map clause
map-type := alloc | tofrom | to | from
– omp target data [clause[[,] clause],…]
structured-block
– omp target update [clause[[,] clause],…]
– omp declare target
[variable-definitions-or-declarations]
• Workshare for acceleration
– omp teams [clause[[,] clause],…]
structured-block
– omp distribute [clause[[,] clause],…]
for-loops
target Construct

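The figure from this slide is not reproduced; as a hedged sketch of what the target construct does, the following offloads a vector add to the default device (the function and array names are illustrative, not from the original deck):

void vec_add(float *a, float *b, float *c, int n)
{
  int i;
  /* Control transfers to the device; the host thread waits
     until the offloaded region completes. */
  #pragma omp target map(to: a[0:n], b[0:n]) map(from: c[0:n])
  #pragma omp parallel for
  for (i = 0; i < n; i++)
    c[i] = a[i] + b[i];
}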
target data Construct

target update Construct

Execution Model

Execution Model and Data
Environment

map Clause
• The target construct creates a new device data environment and explicitly maps the array sections v1[0:N], v2[:N] and p[0:N] to the new device data environment.
• The variable N is implicitly mapped into the new device data environment from the encountering task's data environment.

extern void init(float*, float*, int);
extern void output(float*, int);
void vec_mult(float *p, float *v1, float *v2, int N)
{
  int i;
  init(v1, v2, N);
  #pragma omp target map(to: v1[0:N], v2[:N]) \
                     map(from: p[0:N])
  #pragma omp parallel for
  for (i=0; i<N; i++)
    p[i] = v1[i] * v2[i];
  output(p, N);
}

Map-types:
• alloc: allocate storage for the corresponding variable
• to: alloc, and assign the value of the original variable to the corresponding variable on entry
• from: alloc, and assign the value of the corresponding variable to the original variable on exit
• tofrom: the default; both to and from
target Construct Example
• Use target construct to
– Transfer control from the host to the device
– Establish a device data environment (if not yet done)
• Host thread waits until offloaded region completed
– Use other OpenMP constructs for asynchronicity

// host
#pragma omp target map(to: b[0:count]) map(to: c,d) map(from: a[0:count])
{
  // target
  #pragma omp parallel for
  for (i=0; i<count; i++) {
    a[i] = b[i] * c + d;
  }
}
// host
Data Environments

target data Construct Example
• The target data construct creates a device data environment and encloses target regions, which have their own device data environments.
• The device data environment of the target data region is inherited by the device data environment of an enclosed target region.
• The target data construct is used to create variables that will persist throughout the target data region.
• v1 and v2 are mapped at each target construct.
• Instead of mapping the variable p twice, once at each target construct, p is mapped once by the target data construct.

extern void init(float*, float*, int);
extern void init_again(float*, float*, int);
extern void output(float*, int);

void vec_mult(float *p, float *v1, float *v2, int N)
{
  int i;
  init(v1, v2, N);
  #pragma omp target data map(from: p[0:N])
  {
    #pragma omp target map(to: v1[:N], v2[:N])
    #pragma omp parallel for
    for (i=0; i<N; i++)
      p[i] = v1[i] * v2[i];

    init_again(v1, v2, N);

    #pragma omp target map(to: v1[:N], v2[:N])
    #pragma omp parallel for
    for (i=0; i<N; i++)
      p[i] = p[i] + (v1[i] * v2[i]);
  }
  output(p, N);
}
Data mapping: shared or distributed memory
[Figure: shared memory (one memory, processors X and Y with caches sharing variable A) vs. distributed memory (host memory and accelerator memory each holding a copy of A)]
• The corresponding variable in the device data environment may share storage with the original variable.
• Writes to the corresponding variable may alter the value of the original variable.
if Clause Example
• The if clause on the target construct indicates that if the variable N is smaller than a given threshold, the target region will be executed by the host device.
• The if clause on the parallel construct indicates that if the variable N is smaller than a second threshold, the parallel region is inactive.

#define THRESHOLD1 1000000
#define THRESHOLD2 1000
extern void init(float*, float*, int);
extern void output(float*, int);
void vec_mult(float *p, float *v1, float *v2, int N)
{
  int i;
  init(v1, v2, N);
  #pragma omp target if(N>THRESHOLD1) \
                     map(to: v1[0:N], v2[:N]) map(from: p[0:N])
  #pragma omp parallel for if(N>THRESHOLD2)
  for (i=0; i<N; i++)
    p[i] = v1[i] * v2[i];
  output(p, N);
}
declare target Construct

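The slide body is missing; the following is a minimal sketch, assuming the OpenMP 4.0 declare target syntax listed earlier, of marking a global variable and a function for device compilation (the names are illustrative):

#pragma omp declare target
float coeff;                                /* allocated on host and device */
float scale(float x) { return coeff * x; }  /* compiled for host and device */
#pragma omp end declare target

void run(float *p, int n)
{
  int i;
  coeff = 2.0f;
  #pragma omp target update to(coeff)       /* refresh the device copy */
  #pragma omp target map(tofrom: p[0:n])
  #pragma omp parallel for
  for (i = 0; i < n; i++)
    p[i] = scale(p[i]);                     /* callable on the device too */
}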
Host and device functions

Explicit Data Transfers: target update Construct Example

#pragma omp target data device(0) map(alloc: tmp[:N]) map(to: input[:N]) map(from: res)
{                                          // host
  #pragma omp target device(0)
  #pragma omp parallel for
  for (i=0; i<N; i++)                      // target
    tmp[i] = some_computation(input[i], i);

  update_input_array_on_the_host(input);   // host
  #pragma omp target update device(0) to(input[:N])

  #pragma omp target device(0)
  #pragma omp parallel for reduction(+:res)
  for (i=0; i<N; i++)                      // target
    res += final_computation(input[i], tmp[i], i);
}                                          // host
Asynchronous Offloading

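The figure is lost; since OpenMP 4.0 has no nowait clause on target, one common pattern (shown here as a hedged sketch; do_other_host_work is a placeholder, not from the original deck) is to wrap the target region in an explicit task so the host can keep working:

void async_offload(float *a, int n)
{
  #pragma omp parallel
  #pragma omp single
  {
    #pragma omp task                 /* the offload runs inside this task */
    {
      #pragma omp target map(tofrom: a[0:n])
      #pragma omp parallel for
      for (int i = 0; i < n; i++)
        a[i] = 2.0f * a[i];
    }
    do_other_host_work();            /* overlaps with the offload */
    #pragma omp taskwait             /* synchronize before using a[] */
  }
}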
Teams Constructs
C/C++
#pragma omp teams [clause[[,] clause],...] new-line
structured-block

Fortran
!$omp teams [clause[[,] clause],...]
structured-block
!$omp end teams

Clauses: num_teams( integer-expression )
num_threads( integer-expression )
default(shared | none)
private( list )
firstprivate( list )
shared( list )
reduction( operator : list )

Restrictions on teams
Construct

Teams Execution Model
Teams Constructs
#pragma omp teams num_teams(3), num_threads(3)
structured-block

Team 0 Team 1 Team 2


Thread0 Thread1 Thread2 Thread0 Thread1 Thread2 Thread0 Thread1 Thread2

Thread0 Thread0 Thread0

Structured-block Structured-block Structured-block


SAXPY: Serial (host)

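The code on this slide did not survive extraction; a plausible serial SAXPY baseline (a reconstruction, not the author's exact listing) is:

void saxpy(int n, float a, float *x, float *y)
{
  for (int i = 0; i < n; ++i)
    y[i] = a * x[i] + y[i];
}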
SAXPY:
Coprocessor/Accelerator

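The original listing is also missing here; a hedged sketch of the offloaded version using the target and map constructs introduced above:

void saxpy(int n, float a, float *x, float *y)
{
  #pragma omp target map(to: x[0:n]) map(tofrom: y[0:n])
  #pragma omp parallel for
  for (int i = 0; i < n; ++i)
    y[i] = a * x[i] + y[i];
}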
Distribute Constructs
C/C++
#pragma omp distribute [clause[[,] clause],...] new-line
for-loops

Fortran
!$omp distribute [clause[[,] clause],...]
do-loops
[ !$omp end distribute ]

Clauses: private( list )
firstprivate( list )
collapse( n )
dist_schedule( kind[, chunk_size] )

A distribute construct must be closely nested in a teams region.

distribute Construct

Teams + Distribute Execution Model
#pragma omp teams num_teams(3), num_threads(3)
#pragma omp distribute
for (int i=0; i<9; i++) {

Team 0 Team 1 Team 2


Thread0 Thread1 Thread2 Thread0 Thread1 Thread2 Thread0 Thread1 Thread2

Thread0 Thread0 Thread0

i = 0,1,2 i = 3,4,5 i = 6,7,8


Teams + Distribute Constructs
#pragma omp teams num_teams(3), num_threads(3)
#pragma omp distribute
for (int i=0; i<9; i++) {
# pragma omp parallel for
for (int j=0;j<6; j++) {

Team 0 Team 1 Team 2


Thread0 Thread1 Thread2 Thread0 Thread1 Thread2 Thread0 Thread1 Thread2

Thread0 Thread0 Thread0

i = 0,1,2 i = 3,4,5 i = 6,7,8

Thread0 Thread1 Thread2 Thread0 Thread1 Thread2 Thread0 Thread1 Thread2

j=0,1 j=2,3 j=4,5 j=0,1 j=2,3 j=4,5 j=0,1 j=2,3 j=4,5


SAXPY:
Coprocessor/Accelerator

Combined Constructs

SAXPY: Combined Constructs

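The slide's listing is missing; a hedged sketch using the combined construct spelling from OpenMP 4.0 (not necessarily the author's exact code):

void saxpy(int n, float a, float *x, float *y)
{
  #pragma omp target teams distribute parallel for simd \
              map(to: x[0:n]) map(tofrom: y[0:n])
  for (int i = 0; i < n; ++i)
    y[i] = a * x[i] + y[i];
}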
Additional Runtime Support

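The slide content is lost; OpenMP 4.0 added device-related runtime calls such as those below. This is a hedged sketch of how they compose (the printf strings are illustrative, and printing inside a target region depends on the device):

#include <stdio.h>
#include <omp.h>

int main(void)
{
  int ndev = omp_get_num_devices();      /* number of target devices */
  if (ndev > 0)
    omp_set_default_device(0);           /* choose a default device */
  printf("devices: %d, default: %d\n",
         ndev, omp_get_default_device());

  #pragma omp target
  {
    if (omp_is_initial_device())
      printf("target region ran on the host\n");  /* fallback case */
  }
  return 0;
}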
Multi-device Example

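The original example is not reproduced; one hedged sketch (the function and variable names are mine) that spreads chunks of a SAXPY across all available devices with the device() clause:

#include <omp.h>

void multi_device_saxpy(int n, float a, float *x, float *y)
{
  int ndev = omp_get_num_devices();
  if (ndev == 0) ndev = 1;               /* fall back to the host device */
  int chunk = n / ndev;

  #pragma omp parallel for
  for (int d = 0; d < ndev; ++d) {
    int lo  = d * chunk;
    int len = (d == ndev - 1) ? n - lo : chunk;
    /* Array sections use [lower-bound : length] */
    #pragma omp target device(d) map(to: x[lo:len]) map(tofrom: y[lo:len])
    #pragma omp parallel for
    for (int i = lo; i < lo + len; ++i)
      y[i] = a * x[i] + y[i];
  }
}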
OpenACC1 compared to OpenMP 4.0 (by Dr. James Beyer)

OpenACC1 -> OpenMP 4.0
• Parallel (offload) -> Target
• Parallel (multiple "threads") -> Teams/Parallel
• Kernels -> (no direct equivalent)
• Data -> Target Data
• Loop -> Distribute/Do/For/Simd
• Host data -> (no direct equivalent)
• Cache -> (no direct equivalent)
• Update -> Target Update
• Wait -> (no direct equivalent)
• Declare -> Declare Target
Future OpenACC vs. future OpenMP (by Dr. James Beyer)

OpenACC2: enter data, exit data, data api, routine, async wait, parallel in parallel, tile, linkable, device_type
OpenMP future: unstructured data environment, declare target, parallel in parallel or team, tile, linkable or deferred_map, device_type
Preliminary results: AXPY (Y = a*X)
[Chart: AXPY execution time (s) vs. vector size (float), from 5000 to 500000000, comparing Sequential, OpenMP FOR (16 threads), HOMP ACC, PGI OpenACC and HMPP OpenACC]

Hardware configuration:
• 4 quad-core Intel Xeon processors (16 cores) at 2.27 GHz with 32 GB DRAM
• NVIDIA Tesla K20c GPU (Kepler architecture)
Software configuration:
• PGI OpenACC compiler version 13.4
• HMPP OpenACC compiler version 3.3.3
• GCC 4.4.7 and the CUDA 5.0 compiler
Jacobi
[Chart: Jacobi execution time (s) vs. matrix size (float), from 128x128 to 2048x2048, comparing Sequential, HMPP, PGI, HOMP, OpenMP, HMPP Collapse and HOMP Collapse]
Agenda
• What Now?
• OpenMP ARB Corporation
• A Quick Tutorial
• A few key features in 4.0
• Accelerators and GPU Programming
• Implementation status and Design in clang/llvm
• The future of OpenMP
• IWOMP 2014 and OpenMPCon 2015

Compilers are here!
• Oracle/Sun Studio 12.4 Beta just
announced full OpenMP 4.0
• GCC 4.9 shipped April 9, 2014 supports
OpenMP 4.0
• Clang support for OpenMP injecting
into trunk, first appears in 3.5
• Intel 13.1 compiler supports
Accelerators/SIMD
• Cray, TI, IBM coming
OpenMP in Clang update
• I chaired the weekly OpenMP Clang review WG (Intel, IBM, AMD, TI, Micron) to
help speed up OpenMP upstreaming into clang: April - ongoing
– Joint code reviews, code refactoring
– Delivered most of the OpenMP 3.1 constructs (except atomic and ordered) into the Clang 3.5
stream for AST/semantic analysis support
– Have OpenMP -fsyntax-only, the runtime, and a basic parallel for loop region to
demonstrate code capability
– Added U of Houston OpenMP tests into clang
– IBM team Delivered changes for OpenMP RT for PPC, other teams added their
platform/architecture
– Released Joint design on Multi-device target interface for LLVM to llvm-dev for
comment
• Future:
– Clang 3.5 (Sept 2, 2014): Initial support for AST/SEMA for OpenMP 3.1 (except
atomic and ordered) + OpenMP library for AMD, ARM, TI, IBM, Intel
– Clang 3.6 (~Feb 2015): aim for functional codegen of all OpenMP 3.1 + accelerator
support(from 4.0)
– Clang 3.7 (~Sept 2015): aim for full OpenMP 4.0 functional support
Release note committed by me
to clang/llvm 3.5
• Clang 3.5 now has parsing and semantic-analysis support for all
OpenMP 3.1 pragmas (except atomics and ordered). LLVM's OpenMP
runtime library, originally developed by Intel, has been modified to
work on ARM, PowerPC, as well as X86. Code generation support is
minimal at this point and will continue to be developed for 3.6,
along with the rest of OpenMP 3.1. Support for OpenMP 4.0 features,
such as SIMD and target accelerator directives, is also in progress.
Contributors to this work include AMD, Argonne National Lab., IBM,
Intel, Texas Instruments, University of Houston and many others.

Many Participants/companies
• Ajay Jayaraj, TI
• Alexander Musman, Intel
• Alex Eichenberger, IBM
• Alexey Bataev, Intel
• Andrey Bokhanko, Intel
• Carlo Bertolli, IBM
• Eric Stotzer, TI
• Guansong Zhang, AMD
• Hal Finkel, ANL
• Ilia Verbyn, Intel
• James Cownie, Intel
• Kelvin Li, IBM
• Kevin O'Brien, IBM
• Samuel Antao, IBM
• Sergey Ostanevich, Intel
• Sunita Chandrasekaran, UH
• Michael Wong, IBM
• Priya Unikhrishnan, IBM
• Robert Ho, IBM
• Wael Yehia, IBM
• Yan Liu, IBM
Summary of upstream
progress of OpenMP clang
• Upstream progress to clang 3.5
– https://github.com/clang-omp/clang/wiki/Status-of-
supported-OpenMP-constructs
• Benchmark OpenMP clang vs OpenMP GCC
– http://www.phoronix.com/scan.php?page=article&ite
m=llvm_clang_openmp&num=1
• Unfairly used -O3 for GCC and no optimization for clang
• Link to OpenMP offload infrastructure in LLVM
– http://lists.cs.uiuc.edu/pipermail/llvmdev/attachment
s/20140809/cd6c7f7a/attachment-0001.pdf
OpenMP offload/target in LLVM
• Samuel Antao (IBM)
• Carlo Bertolli (IBM)
• Andrey Bokhanko (Intel)
• Alexandre Eichenberger (IBM)
• Hal Finkel ( Argonne National Laboratory )
• Sergey Ostanevich (Intel)
• Eric Stotzer (Texas Instruments)
• Guansong Zhang (AMD)
Goal of Design
1. support multiple target platforms at runtime and
be extensible in the future with minimal or no
changes
2. determine the availability of the target platform
at runtime and be able to make a decision to
offload depending on the availability and load of
the target platform
Clang/llvm offload design
Example code
#pragma omp declare target
int foo(int[1000]);
#pragma omp end declare target
...
int device_count = omp_get_num_devices();
int device_no;
int *red = malloc(device_count * sizeof(int));
#pragma omp parallel
for (i = 0; i < 1000; i++) {
  device_no = i % device_count;
  #pragma omp target device(device_no) map(to:c) map(red[i])
  {
    red[i] += foo(c);
  }
}

for (i = 0; i < device_count; i++)
  total += red[i];
Generation of fat binary
1. The driver called on a source code should spawn a
number of front-end executions for each available
target. This should generate a set of object files for each
target
2. Target linkers combine dedicated target objects into
target shared libraries – one for each target
3. The host linker combines host object files into an
executable/shared library and incorporates shared
libraries for each target into a separate section within
host binary. This process and format is target-
dependent and will be thereafter handled by the target
RTL at runtime
Agenda
• What Now?
• OpenMP ARB Corporation
• A Quick Tutorial
• A few key features in 4.0
• Accelerators and GPU programming
• Implementation status and Design in clang/llvm
• The future of OpenMP
• IWOMP 2014 and OpenMPCon 2015

What did we accomplish in
OpenMP 4.0?
• Much broader form of accelerator support
• SIMD
• Cancellation (start of a full error model)
• Task dependencies and task groups
• Thread Affinity
• User-defined reductions
• Initial Fortran 2003
• C/C++ array sections
• Sequentially Consistent Atomics
• Display initial OpenMP internal control variable state
OpenMP future features
• OpenMP Tools: Profilers and Debuggers
– Just released as TR2
• Consumer style parallelism: event/async/futures
• Enhance Accelerator support/FPGA
– Multiple device type, linkable to match OpenACC2
• Additional Looping constructs
• Transactional Memory, Speculative Execution
• Task Model refinements
• CPU Affinity
• Common Array Shaping
• Full Error Model
• Interoperability
• Rebase to new C/C++/Fortran Standards, C/C++11 memory model
Agenda
• What Now?
• OpenMP ARB Corporation
• A Quick Tutorial
• A few key features in 4.0
• Accelerators
– OpenMP and OpenACC
• Affinity
• VectorSIMD
• The future of OpenMP
• IWOMP 2014 and OpenMPCon 2015
IWOMP, SC14 and
OpenMPCon
• International Workshop on OpenMP
– 2014 to be held in Brazil
– A strongly academic conference, with refereed papers, and a Springer-
Verlag published proceedings
• SC14
– Chairing OpenMP BoF, steering committee for LLVM in HPC,
giving keynote at the OpenMP Exhibitor's Forum
• What is missing is a user conference similar to ACCU, pyCON, CPPCON (next
week presenting 2 talks), C++Now
• OpenMPCON
– A user conference paired with IWOMP
– Non-refereed, user abstracts
– 1st one will be held in Europe in 2015 to pair with the 2015 IWOMP
IWOMP 2014
September 28-30, 2014
SENAI CIMATEC – Salvador, Brazil
Salvador

• Salvador is the largest city on the northeast coast of Brazil


– The capital of the Northeastern Brazilian state of Bahia
– It is also known as Brazil's capital of happiness
• Salvador was the first colonial capital of Brazil
– The city is one of the oldest in the Americas

• Getting There (SSA):


– Direct flights from US (Miami) and Europe (Lisbon, Madrid, &
Frankfurt)
– Alternatively, fly to Rio (GIG) or Sao Paulo (GRU) and connect to
Salvador (SSA)

• Average Temperatures in September:


– Average high: 27 °C / 81 °F
– Daily mean: 25 °C / 77 °F
– Average low 22 °C / 72 °F
Common-vendor Specification
Parallel Programming model on
Multiple compilers
AMD, Convey, Cray, Fujitsu, HP, IBM,
Intel, NEC, NVIDIA, Oracle, RedHat
(GNU), STMicroelectronics, TI,
clang/llvm
A de-facto Standard: Across 3
Major General Purpose
Languages
C++, C, Fortran
A de-facto Standard: One High-
Level Accelerator Language
One High-Level Vector SIMD
language too!
Support Multiple Devices and let
the local compiler generate the
best code
Xeon Phi, NVIDIA, GPU, GPGPU, DSP,
MIC, ARM and FPGA
My blogs and email address
• ISOCPP.org Director, VP
http://isocpp.org/wiki/faq/wg21#michael-wong
OpenMP CEO: http://openmp.org/wp/about-openmp/
My Blogs: http://ibm.co/pCvPHR
C++11 status: http://tinyurl.com/43y8xgf
Boost test results
http://www.ibm.com/support/docview.wss?rs=2239&context=SS
JT9L&uid=swg27006911
C/C++ Compilers Feature Request Page
http://www.ibm.com/developerworks/rfe/?PROD_ID=700
Chair of WG21 SG5 Transactional Memory:
https://groups.google.com/a/isocpp.org/forum/?hl=en&fromgro
ups#!forum/tm

QUESTIONS?

I look forward to your feedback!

Thank you very much!
Michael Wong
