OpenMP 4.0 For GPU, Accelerators and Other Things - Michael Wong - CppCon 2014
OpenMP 4.0:
yet another Significant Paradigm Shift in
High-level Parallel Computing
Michael Wong, Senior Compiler Technical Lead/Architect
[email protected]
OpenMP CEO
Chair of WG21 SG5 Transactional Memory
ISOCPP.org, Director, VP
Vice Chair of Programming Languages, Standards Council of Canada
WG21 C++ Standard, Head of Delegation for Canada and IBM
CPPCON 2014
Acknowledgement and Disclaimer
• Numerous people internal and external to the
OpenMP WG, in industry and academia, have
made contributions, influenced ideas, written
parts of this presentation, and offered
feedback that forms part of this talk.
• I even lifted this acknowledgement and
disclaimer from some of them.
• But I claim all credit for errors, and stupid
mistakes. These are mine, all mine!
• Any opinions expressed in this presentation are
my opinions and do not necessarily reflect the
opinions of IBM or OpenMP or ISO C++.
Legal Disclaimer
• This work represents the view of the author and
does not necessarily represent the view of IBM.
• IBM, PowerPC and the IBM logo are trademarks
or registered trademarks of IBM or its subsidiaries
in the United States and other countries.
• The OpenMP_Timeline files here are licensed
under the three clause BSD license,
http://opensource.org/licenses/BSD-3-Clause
• Other company, product, and service names may
be trademarks or service marks of others.
What is OpenMP about?
What now?
• Nearly every C and C++ feature makes for beautiful, elegant code for developers
(Disclaimer: I love C++)
– Please insert your beautiful code here:
– Elegance is efficiency, or is it? Or
– What we lack in beauty, we gain in efficiency; Or do we?
• The new C++11 Std is
– 1353 pages compared to 817 pages in C++03
• The new C++14 Std is
– 1373 pages (N3937), vs. the free N3972
• The new C11 is
– 701 pages compared to 550 pages in C99
• OpenMP 3.1 is
– 354 pages and growing
• OpenMP 4.0 is
– 520 pages
Beautiful and elegant Lambdas
C++98:
vector<int>::iterator i = v.begin();
for( ; i != v.end(); ++i ) {
  if( *i > x && *i < y )
    break;
}

C++11:
auto i = find_if( begin(v), end(v),
                  [=](int i) { return i > x && i < y; } );
• Q: Does your language allow you to access all the GFLOPS of your
machine?
“Is there in Truth No Beauty?”
from Jordan by George Herbert
• Q: Does your language allow you to access all the GFLOPS of your
machine?
• A: What a quaint concept!
– I thought it's natural to drop out into OpenCL, CUDA, OpenGL, DirectX,
C++AMP, Assembler …. to get at my GPU
– Why? I just use my language as a cool driver, it’s a great scripting
language too. But for real kernel computation, I just use Fortran
– I write vectorized code, so my vendor offers me intrinsics, they also tell
me they can auto-vectorize, though I am not sure how much they really
do, so I am looking into OpenCL
– Well, I used to use one thread, but now that I use multiple threads, I
can get at it with C++11, OpenMP, TBB, GCD, PPL, MS .then() continuations,
Cilk
– I know I may have a TM core somewhere, so my vendor offers me
intrinsics
– No I like using a single thread, so I just use C, or C++
The Question
• Q: Is it true that there is a language that allows you to access all the
GFLOPS of your machine?
Power of Computing
• 1998, when C++ 98 was released
– Intel Pentium II: 0.45 GFLOPS
– No SIMD: SSE came in Pentium III
– No GPUs: GPU came out a year later
• 2011: when C++11 was released
– Intel Core-i7: 80 GFLOPS
– AVX: 8 DP flops/Hz × 4 cores × 4.4 GHz = 140 GFLOPS
– GTX 670: 2500 GFLOPS
• Computers have gotten so much faster; how come
software has not?
– Data structures and algorithms
– latency
In 1998, a typical machine had the
following FLOPS
• 0.45 GFLOPS, 1 core
Agenda
• What Now?
• OpenMP ARB Corporation
• A Quick Tutorial
• A few key features in 4.0
• Accelerators and GPU programming
• Implementation status and Design in clang/llvm
• The future of OpenMP
• IWOMP 2014 and OpenMPCon 2015
A brief history of OpenMP API
by Kelvin Li
[Timeline figure: OpenMP specifications for Fortran and C/C++, from V1.0 (Fortran 1997, C/C++ 1998) through V4.0 (2013).]
Major Features by Jim Cownie
OpenMP internal Organization (Today and Future)
The New Mission Statement of
OpenMP
• OpenMP’s new mission statement
–“Standardize directive-based multi-
language high-level parallelism that is
performant, productive and portable”
–Updated from
• "Standardize and unify shared memory,
thread-level parallelism for HPC”
Agenda
• What Now?
• OpenMP ARB Corporation
• A Quick Tutorial
• A few key features in 4.0
• Accelerators and GPU programming
• Vector/SIMD Programming
• The future of OpenMP
• IWOMP 2014 and OpenMPCon 2015
Hello Concurrent World
#include <iostream>
#include <thread> //#1
void hello() //#2
{
std::cout<<"Hello Concurrent World"<<std::endl;
}
int main()
{
std::thread t(hello); //#3
t.join(); //#4
}
Is this valid C++ today? Are
these equivalent?
Version 1 (explicit memory ordering):

int x = 0;
atomic<int> y = 0;

Thread 1:
  x = 17;
  y.store(1, memory_order_release);   // or: y.store(1);

Thread 2:
  while (y.load(memory_order_acquire) != 1)
    continue;
  // or: while (y.load() != 1)
  assert(x == 17);

Version 2 (default, sequentially consistent operations):

int x = 0;
atomic<int> y = 0;

Thread 1:
  x = 17;
  y = 1;

Thread 2:
  while (y != 1)
    continue;
  assert(x == 17);
Hello World again
• What will this program print?
#include <stdlib.h>
#include <stdio.h>
int main(int argc, char *argv[]) {
printf("Hello ");
printf("World ");
printf("\n");
return(0);
}
2-threaded Hello World with OpenMP threads
#include <stdlib.h>
#include <stdio.h>
int main(int argc, char *argv[]) {
#pragma omp parallel num_threads(2)   /* body reconstructed (assumption): each thread prints */
  { printf("Hello World "); }
  printf("\n"); return(0); }
Hello World with OpenMP tasks
(program run 3 times)
int main(int argc, char *argv[]) {
#pragma omp parallel
{
#pragma omp single
{
#pragma omp task
{printf("Hello ");}
#pragma omp task
{printf("World ");}
}
} // End of parallel region
printf("\n");
return(0);
}
Hello World
Hello World
World Hello
Tasks are executed at a task
execution point
int main(int argc, char *argv[]) {
#pragma omp parallel
{
#pragma omp single
{
#pragma omp task
{printf("Hello ");}
#pragma omp task
{printf("World ");}
printf("\nThank You ");
}
} // End of parallel region
printf("\n");
return(0);
}
Thank You Hello World
Thank You Hello World
Thank You World Hello
Execute Tasks First
int main(int argc, char *argv[]) {
#pragma omp parallel
{
#pragma omp single
{
#pragma omp task
{printf("Hello ");}
#pragma omp task
{printf("World ");}
#pragma omp taskwait
printf("Thank You ");
}
} // End of parallel region
printf("\n");return(0);
}
Hello World Thank You
Hello World Thank You
World Hello Thank You
Execute Tasks First with Dependencies
• OpenMP 4.0 only
int main(int argc, char *argv[]) {
#pragma omp parallel
{
#pragma omp single
{
int x = 1;
#pragma omp task shared (x) depend (out:x)
{printf("Hello ");}
#pragma omp task shared (x) depend (in:x)
{printf("World ");}
#pragma omp taskwait
printf("Thank You ");
}
} // End of parallel region
printf("\n");return(0);
}
Hello World Thank You
Hello World Thank You
Hello World Thank You
Intro to OpenMP
• De-facto standard Application Programming
Interface (API) to write shared memory parallel
applications in C, C++, and Fortran
• Consists of:
– Compiler directives
– Run time routines
– Environment variables
• Specification maintained by the OpenMP
Architecture Review Board
(http://www.openmp.org)
– Version 4.0 was released 2013
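As a rough illustration (not from the original slides), all three components appear together in even a tiny program: a directive parallelizes the loop, a runtime routine queries the thread number, and an environment variable such as OMP_NUM_THREADS controls the thread count at launch.

#include <stdio.h>
#include <omp.h>                      // declares runtime routines such as omp_get_thread_num()

int main(void) {
  // Directive: parallelize the loop across a team of threads.
  #pragma omp parallel for
  for (int i = 0; i < 8; i++)
    printf("iteration %d on thread %d\n", i, omp_get_thread_num());  // runtime routine
  return 0;
}
// Environment variable (set before running): e.g. OMP_NUM_THREADS=4 ./a.out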
When do you want to use OpenMP?
• If the compiler cannot parallelize the way you like
it even with auto-parallelization
– a loop is not parallelized
• Data dependency analyses are not able to
determine whether it is safe to parallelize or not
– Compiler finds a low level of parallelism
• But you know there is a higher level, and the compiler
lacks the information to parallelize at the highest
possible level
• If there is no auto-parallelizing compiler, then you have
to do it yourself
– Need explicit parallelization using directives
Advantages of OpenMP
• Good performance and scalability
–If you do it right ....
• De-facto and mature standard
• An OpenMP program is portable
–Supported by a large number of
compilers
• Allows the program to be parallelized
incrementally
Can OpenMP work with Multicore,
Heterogeneous systems?
• OpenMP is ideally suited for
multicore architectures
–Memory and threading model
map naturally
–Lightweight
–Mature
–Widely available and used
The OpenMP Execution Model
Directive Format
• C/C++
– #pragma omp directive [clause [clause] …]
– Continuation: \
– Conditional compilation: _OPENMP macro is set
• Fortran:
– Fortran: directives are case insensitive
• Syntax: sentinel directive [clause [[,] clause]...]
• The sentinel is one of the following:
– ✔ !$OMP or C$OMP or *$OMP (fixed format)
– ✔ !$OMP (free format)
– Continuation: follows the language syntax
– Conditional compilation: !$ or C$ -> 2 spaces
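For instance (an illustrative sketch, not from the slides; the arrays a, b, c and size n are assumed to be declared elsewhere), the C/C++ form with a continuation line, guarded so the code still compiles when OpenMP is disabled:

#ifdef _OPENMP                          /* set by OpenMP-aware compilers */
  #pragma omp parallel for \
              schedule(static) num_threads(4)
#endif
  for (int i = 0; i < n; i++)
    a[i] = b[i] + c[i];                 /* runs serially if _OPENMP is not defined */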
Components of OpenMP
• Directives
  – Tasking
  – Parallel region
  – Work sharing
  – Synchronization
  – Data scope attributes
    • private
    • firstprivate
    • lastprivate
    • shared
    • reduction
  – Orphaning
• Environment Variables
  – Number of threads
  – Scheduling type
  – Dynamic thread adjustment
  – Nested parallelism
  – Stacksize
  – Idle threads
  – Active levels
  – Thread limit
• Runtime Variables
  – Number of threads
  – Thread id
  – Dynamic thread adjustment
  – Nested parallelism
  – Schedule
  – Active levels
  – Thread limit
  – Nesting level
  – Ancestor thread
  – Team size
  – Wallclock timer
  – Locking
But why does OpenMP use
pragmas?
It is an intentional design …
Pragmas can support 3 General
Purpose Programming Languages
and maintain the same style
C++
C
Fortran
And national labs, weather research, and nuclear simulations
still have substantial kernels written
in a mix of Fortran and C, driven by C++
Agenda
• What Now?
• OpenMP ARB Corporation
• A Quick Tutorial
• A few key features in 4.0
• Accelerators and GPU programming
• Implementation status and Design in clang/llvm
• The future of OpenMP
• IWOMP 2014 and OpenMPCon 2015
Goals
• Thread-rich computing environments are becoming more
prevalent
– more computing power, more threads
– less memory relative to compute
• There is parallelism, it comes in many forms
– hybrid MPI - OpenMP parallelism
– mixed mode OpenMP / Pthread parallelism
– nested OpenMP parallelism
• Have to exploit parallelism efficiently
– providing ease of use for casual programmers
– providing full control for power programmers
– providing timing feedback
What did we accomplish in
OpenMP 4.0?
• Broad form of accelerator support
• SIMD
• Cancellation (start of a full error model)
• Task dependencies and task groups
• Thread Affinity
• User-defined reductions
• Initial Fortran 2003
• C/C++ array sections
• Sequentially Consistent Atomics
• Display initial OpenMP internal control variable state
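To give a flavor of one of these features (a hedged sketch, not from the slides; the data array, size n, and threshold predicate are made up), a user-defined reduction lets each thread build a private result that OpenMP then combines with your own operator:

#include <vector>

// User-defined reduction (OpenMP 4.0): concatenate the per-thread vectors at the end.
#pragma omp declare reduction(merge : std::vector<int> : \
        omp_out.insert(omp_out.end(), omp_in.begin(), omp_in.end()))

std::vector<int> find_hits(const int *data, int n, int threshold) {
  std::vector<int> hits;
  #pragma omp parallel for reduction(merge : hits)
  for (int i = 0; i < n; i++)
    if (data[i] > threshold)          // hypothetical predicate
      hits.push_back(i);              // each thread appends to its private copy
  return hits;
}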
Compilers are here!
• Intel 13.1 compiler supports
Accelerators/SIMD
• Oracle/Sun Studio 12.4 Beta just
announced full OpenMP 4.0
• GCC 4.9 shipped April 9, 2014 supports
4.0
• Clang support for OpenMP is being merged
into trunk; it first appears in 3.5, released last week
• Cray, TI, IBM coming online
In 2014, a typical machine had the
following FLOPS
• 2500 GFLOPS GPU + 140 GFLOPS AVX + 80 GFLOPS on 4
cores + HTM
OpenMP Accelerator Subcommittee
• Co-chairs / Technical leads
– James Beyer – Cray (courtesy for slides)
– Eric Stotzer – TI (courtesy for slides)
• Active subcommittee members
– Xinmin Tian – Intel (courtesy for slides)
– Ravi Narayanaswamy – Intel (courtesy for slides)
– Jeff Larkin – Nvidia
– Kent Milfeld – TACC
– Henry Jin – NASA
– Kevin O’Brien, Kelvin Li, Alexandre Eichenberger, IBM
– Christian Terboven– RWTH Aachen (courtesy for slides)
– Michael Klemm – Intel (courtesy for slides)
– Stephane Cheveau – CAPS
– Convey, AMD, ORNL, TU Dresden,
So, how do you program GPU?
Why is GPU important now?
• Or is it a flash in the pan?
• The race to exascale computing … 10^18 FLOPS
[Chart: Top500 contenders; vertical scale in GFLOPS.]
What is OpenMP Model’s aim?
• All forms of accelerators, DSP, GPU, APU, GPGPU
• Networked heterogeneous consumer devices
– Kitchen appliances, drones, signal processors, medical
imaging, auto, telecom, automation, not just graphics
engines
Heterogeneous Device model
• OpenMP 4.0 supports accelerators/coprocessors
• Device model:
– One host
– Multiple accelerators/coprocessors of the same kind
Heterogeneous SoC
Glossary
OpenMP 4.0 Device
Constructs
• Execute code on a target device
– omp target [clause[[,] clause],…]
structured-block
– omp declare target
[function-definitions-or-declarations]
• Map variables to a target device
– map ([map-type:] list) // map clause
map-type := alloc | tofrom | to | from
– omp target data [clause[[,] clause],…]
structured-block
– omp target update [clause[[,] clause],…]
– omp declare target
[variable-definitions-or-declarations]
• Workshare for acceleration
– omp teams [clause[[,] clause],…]
structured-block
– omp distribute [clause[[,] clause],…]
for-loops
target Construct
target data Construct
target update Construct
Execution Model
Execution Model and Data Environment
map Clause
extern void init(float*, float*, int);
extern void output(float*, int);

void vec_mult(float *p, float *v1, float *v2, int N)
{
  int i;
  init(v1, v2, N);
  #pragma omp target map(to:v1[0:N],v2[:N]) map(from:p[0:N])
  #pragma omp parallel for
  for (i=0; i<N; i++)
    p[i] = v1[i] * v2[i];
  output(p, N);
}

• The target construct creates a new device data environment and explicitly maps the array sections v1[0:N], v2[:N] and p[0:N] to the new device data environment.
• The variable N is implicitly mapped into the new device data environment from the encountering task's data environment.

Map-types:
• alloc: allocate storage for the corresponding variable
• to: alloc, and assign the value of the original variable to the corresponding variable on entry
• from: alloc, and assign the value of the corresponding variable to the original variable on exit
• tofrom: the default; both to and from
target Construct Example
• Use target construct to
– Transfer control from the host to the device
– Establish a device data environment (if not yet done)
• Host thread waits until the offloaded region is completed
– Use other OpenMP constructs for asynchronicity

#pragma omp target map(to:b[0:count]) map(to:c,d) map(from:a[0:count])
{                                     // execution switches from host to target
  #pragma omp parallel for
  for (i=0; i<count; i++) {
    a[i] = b[i] * c + d;
  }
}                                     // execution returns to the host
Data Environments
target data Construct Example
extern void init(float*, float*, int);
extern void init_again(float*, float*, int);
extern void output(float*, int);

void vec_mult(float *p, float *v1, float *v2, int N)
{
  int i;
  init(v1, v2, N);
  #pragma omp target data map(from: p[0:N])
  {
    #pragma omp target map(to: v1[:N], v2[:N])
    #pragma omp parallel for
    for (i=0; i<N; i++)
      p[i] = v1[i] * v2[i];

    init_again(v1, v2, N);

    #pragma omp target map(to: v1[:N], v2[:N])
    #pragma omp parallel for
    for (i=0; i<N; i++)
      p[i] = p[i] + (v1[i] * v2[i]);
  }
  output(p, N);
}

• The target data construct creates a device data environment and encloses target regions, which have their own device data environments.
• The device data environment of the target data region is inherited by the device data environment of an enclosed target region.
• The target data construct is used to create variables that will persist throughout the target data region.
• v1 and v2 are mapped at each target construct.
• Instead of mapping the variable p twice, once at each target construct, p is mapped once by the target data construct.
Data mapping: shared or distributed memory
[Figure: with shared memory, processors X and Y (each with a cache) see a single variable A in one shared memory; with distributed memory, the accelerator has its own separate memory, so A must be mapped/copied between host and accelerator memories.]
declare target Construct
Host and device functions
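The bodies of these two slides did not survive extraction; a minimal sketch (function name assumed) of declare target, which makes a function callable from both host and device code:

#pragma omp declare target
float scale(float x) { return 2.0f * x; }     // host and device versions are generated
#pragma omp end declare target

void run(float *out, const float *in, int n) {
  #pragma omp target map(to: in[0:n]) map(from: out[0:n])
  #pragma omp parallel for
  for (int i = 0; i < n; i++)
    out[i] = scale(in[i]);                    // the device version is called inside the target region
}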
Explicit Data Transfers: target update Construct Example

#pragma omp target data device(0) map(alloc:tmp[:N]) map(to:input[:N]) map(from:res)
{
  #pragma omp target device(0)
  #pragma omp parallel for
  for (i=0; i<N; i++)
    tmp[i] = some_computation(input[i], i);        // runs on the target

  update_input_array_on_the_host(input);           // runs on the host

  #pragma omp target update device(0) to(input[:N])

  #pragma omp target device(0)                     // directives for this second region
  #pragma omp parallel for reduction(+:res)        // reconstructed (assumption)
  for (i=0; i<N; i++)
    res += final_computation(input[i], tmp[i], i); // runs on the target
}
Asynchronous Offloading
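The slide body is missing; in OpenMP 4.0 the usual way to get an asynchronous offload is to wrap the target region in a task, roughly like this sketch (heavy_kernel, do_other_host_work, in, out and n are hypothetical):

#pragma omp parallel
#pragma omp single
{
  #pragma omp task                          // the host thread continues past this task
  {
    #pragma omp target map(to: in[0:n]) map(from: out[0:n])
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
      out[i] = heavy_kernel(in[i]);         // runs on the device
  }

  do_other_host_work();                     // overlaps with the offload

  #pragma omp taskwait                      // wait for the offload to finish
}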
Teams Constructs
C/C++
#pragma omp teams [clause[[,] clause],...] new-line
structured-block
Fortran
!$omp teams [clause[[,] clause],...]
structured-block
!$omp end teams
Restrictions on teams
Construct
Teams Execution Model
Teams Constructs
#pragma omp teams num_teams(3), num_threads(3)
structured-block
SAXPY: Serial (host)
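The code on this slide was lost; the serial host version is presumably along these lines:

void saxpy(int n, float a, float *x, float *y) {
  for (int i = 0; i < n; i++)
    y[i] = a * x[i] + y[i];   // y = a*x + y, one element at a time
}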
SAXPY: Coprocessor/Accelerator
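This slide's code was also lost; an offloaded version using the separate device constructs might look like this sketch (here the parallelism comes from the teams; within each team only one thread runs its chunk):

void saxpy(int n, float a, float *x, float *y) {
  #pragma omp target map(to: x[0:n]) map(tofrom: y[0:n])
  #pragma omp teams
  #pragma omp distribute
  for (int i = 0; i < n; i++)
    y[i] = a * x[i] + y[i];   // iterations are distributed across the teams
}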
Distribute Constructs
C/C++
#pragma omp distribute [clause[[,] clause],...] new-line
for-loops
Fortran
!$omp distribute [clause[[,] clause],...]
do-loops
[ !$omp end distribute ]
distribute Construct
Teams + Distribute Execution Model
#pragma omp teams num_teams(3), num_threads(3)
#pragma omp distribute
for (int i=0; i<9; i++) {
  /* loop body elided on the slide; the 9 iterations are distributed across the 3 teams */
}
Combined Constructs
SAXPY: Combined Constructs
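The example code is missing here as well; with the OpenMP 4.0 combined construct the whole offload collapses onto the loop, something like this sketch:

void saxpy(int n, float a, float *x, float *y) {
  #pragma omp target teams distribute parallel for \
              map(to: x[0:n]) map(tofrom: y[0:n])
  for (int i = 0; i < n; i++)
    y[i] = a * x[i] + y[i];   // offload, team creation, distribution and worksharing in one directive
}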
Additional Runtime Support
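The slide content is missing; OpenMP 4.0 added device-related runtime routines, used roughly like this sketch:

#include <omp.h>

void choose_device(void) {
  int ndev = omp_get_num_devices();        // how many target devices are available
  if (ndev > 1)
    omp_set_default_device(1);             // device used when no device() clause is given
                                           // (see also omp_get_default_device())
  #pragma omp target
  {
    if (omp_is_initial_device()) {
      /* the offload fell back to the host, e.g. no device was available */
    }
  }
}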
Multi-device Example
OpenACC1 compared to OpenMP 4.0 (by
Dr. James Beyer)
OpenACC1 → OpenMP 4.0
• Parallel (offload) → Target
  – Parallel (multiple "threads") → Team/Parallel
• Kernels → (no direct equivalent)
• Data → Target Data
• Loop → Distribute/Do/for/Simd
• Host data → (no direct equivalent)
• Cache → (no direct equivalent)
• Update → Target Update
• Wait → (no direct equivalent)
• Declare → Declare Target
Future OpenACC vs future OpenMP
(by Dr. James Beyer)
OpenACC2: enter data, exit data, data api, routine, async wait, parallel in parallel, tile, Linkable, Device_type
OpenMP future: unstructured data environments, declare target, parallel in parallel or team, tile, Linkable or Deferred_map, Device_type
Preliminary results: AXPY (Y=a*X)
[Chart: AXPY execution time (s) vs. vector size (float), from 5,000 to 500,000,000 elements, comparing Sequential, OpenMP FOR (16 threads), HOMP ACC, PGI OpenACC, and HMPP OpenACC.]
[Chart: execution time vs. matrix size (float), from 128x128 to 2048x2048, comparing PGI, HOMP, OpenMP, HMPP Collapse, and HOMP Collapse.]
Agenda
• What Now?
• OpenMP ARB Corporation
• A Quick Tutorial
• A few key features in 4.0
• Accelerators and GPU Programming
• Implementation status and Design in clang/llvm
• The future of OpenMP
• IWOMP 2014 and OpenMPCon 2015
Compilers are here!
• Oracle/Sun Studio 12.4 Beta just
announced full OpenMP 4.0
• GCC 4.9 shipped April 9, 2014 supports
OpenMP 4.0
• Clang support for OpenMP is being merged
into trunk; it first appears in 3.5
• Intel 13.1 compiler supports
Accelerators/SIMD
• Cray, TI, IBM coming
OpenMP in Clang update
• I chaired the weekly OpenMP Clang review WG (Intel, IBM, AMD, TI, Micron) to
help speed up upstreaming OpenMP into Clang: April - ongoing
– Joint code reviews, code refactoring
– Delivered most of the OpenMP 3.1 constructs (except atomic and ordered) into the Clang 3.5
stream for AST/semantic-analysis support
– Have OpenMP -fsyntax-only, the runtime, and a basic parallel for loop region to
demonstrate capability
– Added the U of Houston OpenMP tests into Clang
– The IBM team delivered changes to the OpenMP RT for PPC; other teams added their
platforms/architectures
– Released a joint design for a multi-device target interface for LLVM to llvm-dev for
comment
• Future:
– Clang 3.5 (Sept 2, 2014): Initial support for AST/SEMA for OpenMP 3.1 (except
atomic and ordered) + OpenMP library for AMD, ARM, TI, IBM, Intel
– Clang 3.6 (~Feb 2015): aim for functional codegen of all OpenMP 3.1 + accelerator
support (from 4.0)
– Clang 3.7 (~Sept 2015): aim for full OpenMP 4.0 functional support
Release note committed by me
to clang/llvm 3.5
• Clang 3.5 now has parsing and semantic-analysis support for all
OpenMP 3.1 pragmas (except atomics and ordered). LLVM's OpenMP
runtime library, originally developed by Intel, has been modified to
work on ARM, PowerPC, as well as X86. Code generation support is
minimal at this point and will continue to be developed for 3.6,
along with the rest of OpenMP 3.1. Support for OpenMP 4.0 features,
such as SIMD and target accelerator directives, is also in progress.
Contributors to this work include AMD, Argonne National Lab., IBM,
Intel, Texas Instruments, University of Houston and many others.
Many Participants/companies
• Ajay Jayaraj, TI
• Alexander Musman, Intel
• Alex Eichenberger, IBM
• Alexey Bataev, Intel
• Andrey Bokhanko, Intel
• Carlo Bertolli, IBM
• Eric Stotzer, TI
• Guansong Zhang, AMD
• Hal Finkel, ANL
• Ilia Verbyn, Intel
• James Cownie, Intel
• Kelvin Li, IBM
• Kevin O’Brien, IBM
• Samuel Antao, IBM
• Sergey Ostanevich, Intel
• Sunita Chandrasekaran, UH
• Michael Wong, IBM
• Priya Unikhrishnan, IBM
• Robert Ho, IBM
• Wael Yehia, IBM
• Yan Liu, IBM
Summary of upstream
progress of OpenMP clang
• Upstream progress to clang 3.5
– https://github.com/clang-omp/clang/wiki/Status-of-
supported-OpenMP-constructs
• Benchmark OpenMP clang vs OpenMP GCC
– http://www.phoronix.com/scan.php?page=article&ite
m=llvm_clang_openmp&num=1
• Unfairly used -O3 for GCC and no optimization for Clang
• Link to OpenMP offload infrastructure in LLVM
– http://lists.cs.uiuc.edu/pipermail/llvmdev/attachment
s/20140809/cd6c7f7a/attachment-0001.pdf
OpenMP offload/target in LLVM
• Samuel Antao (IBM)
• Carlo Bertolli (IBM)
• Andrey Bokhanko (Intel)
• Alexandre Eichenberger (IBM)
• Hal Finkel ( Argonne National Laboratory )
• Sergey Ostanevich (Intel)
• Eric Stotzer (Texas Instruments)
• Guansong Zhang (AMD)
Goal of Design
1. support multiple target platforms at runtime and
be extensible in the future with minimal or no
changes
2. determine the availability of the target platform
at runtime and be able to decide whether to
offload, depending on the availability and load of
the target platform
Clang/llvm offload design
Example code
#pragma omp declare target
int foo(int[1000]);
#pragma omp end declare target
...
int device_count = omp_get_num_devices();
int device_no;
int *red = malloc(device_count * sizeof(int));
#pragma omp parallel
for (i = 0; i < 1000; i++) {
  device_no = i % device_count;
  #pragma omp target device(device_no) map(to:c) map(red[i])
  {
    red[i] += foo(c);
  }
}

for (i = 0; i < device_count; i++)
  total_red += red[i];   /* accumulation reconstructed from garbled slide text */
Generation of fat binary
1. The driver called on a source code should spawn a
number of front-end executions for each available
target. This should generate a set of object files for each
target
2. Target linkers combine dedicated target objects into
target shared libraries – one for each target
3. The host linker combines host object files into an
executable/shared library and incorporates shared
libraries for each target into a separate section within
host binary. This process and format are target-
dependent and will thereafter be handled by the target
RTL at runtime
Agenda
• What Now?
• OpenMP ARB Corporation
• A Quick Tutorial
• A few key features in 4.0
• Accelerators and GPU programming
• Implementation status and Design in clang/llvm
• The future of OpenMP
• IWOMP 2014 and OpenMPCon 2015
What did we accomplish in
OpenMP 4.0?
• Much broader form of accelerator support
• SIMD
• Cancellation (start of a full error model)
• Task dependencies and task groups
• Thread Affinity
• User-defined reductions
• Initial Fortran 2003
• C/C++ array sections
• Sequentially Consistent Atomics
• Display initial OpenMP internal control variable state
OpenMP future features
• OpenMP Tools: Profilers and Debuggers
– Just released as TR2
• Consumer style parallelism: event/async/futures
• Enhance Accelerator support/FPGA
– Multiple device type, linkable to match OpenACC2
• Additional Looping constructs
• Transactional Memory, Speculative Execution
• Task Model refinements
• CPU Affinity
• Common Array Shaping
• Full Error Model
• Interoperability
• Rebase to new C/C++/Fortran Standards, C/C++11 memory model
Agenda
• What Now?
• OpenMP ARB Corporation
• A Quick Tutorial
• A few key features in 4.0
• Accelerators
– OpenMP and OpenACC
• Affinity
• Vector/SIMD
• The future of OpenMP
• IWOMP 2014 and OpenMPCon 2015
IWOMP, SC14 and
OpenMPCon
• International Workshop on OpenMP
– 2014 to be held in Brazil
– A strongly academic conference, with refereed papers and
proceedings published by Springer-Verlag
• SC14
– Chairing the OpenMP BoF, on the steering committee for LLVM in HPC,
giving the keynote at the OpenMP Exhibitor's Forum
• What is missing is a user conference similar to ACCU, pyCON, CPPCON (next
week presenting 2 talks), C++Now
• OpenMPCON
– A user conference paired with IWOMP
– Non-refereed, user abstracts
– 1st one will be held in Europe in 2015 to pair with the 2015 IWOMP
IWOMP 2014
September 28-30, 2014
SENAI CIMATEC – Salvador, Brazil
QUESTIONS?
I look forward to your feedback!
Thank you very much!
Michael Wong