OpenMP Chapter

OpenMP, which stands for Open Multi-Processing, is an API (Application Programming Interface) that supports multi-platform shared-memory multiprocessing programming in C, C++, and Fortran. Developed to facilitate parallel programming, OpenMP provides a set of directives for compilers that allow developers to parallelize their code easily.

Shared-Memory Programming

Not what we give, but what we share—
For the gift without the giver is bare;
Who gives himself with his alms feeds three—
Himself, his hungering neighbor, and me.
James Russell Lowell, The Vision of Sir Launfal

17.1 INTRODUCTION

In the 1980s commercial multiprocessors with a modest number of CPUs cost hundreds of thousands of dollars. Today, multiprocessors with dozens of processors are still quite expensive, but small systems are readily available for a low price. Dell, Gateway, and other companies sell dual-CPU multiprocessors for less than $5,000, and you can purchase a quad-processor system for less than $20,000.

It is possible to write parallel programs for multiprocessors using MPI, but you can often achieve better performance by using a programming language tailored for a shared-memory environment. Recently, OpenMP has emerged as a shared-memory standard. OpenMP is an application programming interface (API) for parallel programming on multiprocessors. It consists of a set of compiler directives and a library of support functions. OpenMP works in conjunction with standard Fortran, C, or C++.

This chapter introduces shared-memory parallel programming using OpenMP. You can use it in two different ways. Perhaps the only parallel computer you have access to is a multiprocessor. In that case, you may prefer to write programs using OpenMP rather than MPI. On the other hand, you may have access to a multicomputer consisting of many nodes, each of which is a multiprocessor. This is a popular way to build large multicomputers with hundreds or thousands of processors. Consider these examples (circa 2002): IBM's RS/6000 SP system contains up to 512 nodes, and each node can have up to 16 CPUs in it. Fujitsu's AP3000 Series supercomputer contains up to 1,024 nodes, and each node consists of one or two UltraSPARC processors. Dell's High Performance Computing Cluster has up to 64 nodes; each node is a multiprocessor with two Pentium III CPUs.

In this chapter you'll see how the shared-memory programming model is different from the message-passing model, and you'll learn enough OpenMP compiler directives and functions to be able to parallelize a wide variety of C code segments.

This chapter introduces a powerful set of OpenMP compiler directives:
- parallel, which precedes a block of code to be executed in parallel by multiple threads
- for, which precedes a for loop with independent iterations that may be divided among threads executing in parallel
- parallel for, a combination of the parallel and for directives
- sections, which precedes a series of blocks that may be executed in parallel
- parallel sections, a combination of the parallel and sections directives
- critical, which precedes a critical section
- single, which precedes a code block to be executed by a single thread

You'll also encounter four important OpenMP functions:
- omp_get_num_procs, which returns the number of CPUs in the multiprocessor on which this thread is executing
- omp_get_num_threads, which returns the number of threads active in the current parallel region
- omp_get_thread_num, which returns the thread identification number
- omp_set_num_threads, which allows you to fix the number of threads executing the parallel sections of code

17.2 THE SHARED-MEMORY MODEL

The shared-memory model (Figure 17.1) is an abstraction of the generic centralized multiprocessor described in Section 2.4.
The underlying hardware is assumed to be a collection of processors, each with access to the same shared memory. Because they have access to the same memory locations, processors can interact and synchronize with each other through shared variables.

Figure 17.1 The shared-memory model of parallel computation. Processors synchronize and communicate with each other through shared variables.

The standard view of parallelism in a shared-memory program is fork/join parallelism. When the program begins execution, only a single thread, called the master thread, is active (Figure 17.2). The master thread executes the sequential portions of the algorithm. At those points where parallel operations are required, the master thread forks (creates or awakens) additional threads. The master thread and the created threads work concurrently through the parallel section. At the end of the parallel code the created threads die or are suspended, and the flow of control returns to the single master thread. This is called a join.

Figure 17.2 The shared-memory model is characterized by fork/join parallelism, in which parallelism comes and goes. At the beginning of execution only a single thread, called the master thread, is active. The master thread executes the serial portions of the program. It forks additional threads to help it execute parallel portions of the program. These threads are deactivated when serial execution resumes.

A key difference, then, between the shared-memory model and the message-passing model is that in the message-passing model all processes typically remain active throughout the execution of the program, whereas in the shared-memory model the number of active threads is one at the program's start and finish and may change dynamically throughout the execution of the program.

You can view a sequential program as a special case of a shared-memory parallel program: it is simply one with no fork/joins in it. Parallel shared-memory programs range from those with only a single fork/join around a single loop to those in which most of the code segments are executed in parallel. Hence the shared-memory model supports incremental parallelization, the process of transforming a sequential program into a parallel program one block of code at a time.

The ability of the shared-memory model to support incremental parallelization is one of its greatest advantages over the message-passing model. It allows you to profile the execution of a sequential program, sort the program blocks according to how much time they consume, consider each block in turn beginning with the most time-consuming, parallelize each block amenable to parallel execution, and stop when the effort required to achieve further performance improvements is not warranted.

Consider, in contrast, message-passing programs. They have no shared memory to hold variables, and the parallel processes are active throughout the execution of the program. Transforming a sequential program into a parallel program is not incremental at all: the chasm must be crossed with one giant leap rather than many small steps.

In this chapter you'll encounter increasingly complicated blocks of sequential code and learn how to transform them into parallel code sections.
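To make the fork/join pattern concrete, here is a minimal sketch (not from the original text) of a program in which the master thread forks a team of threads for one parallel region and then resumes serial execution. It uses the omp_get_thread_num and omp_get_num_threads functions listed in the introduction; the gcc -fopenmp compilation flag mentioned in the comment is one common choice, and the order of the printed lines is nondeterministic.

    /* fork_join.c -- a minimal fork/join sketch (illustrative, not from the text).
       Compile with, for example, gcc -fopenmp fork_join.c */
    #include <stdio.h>
    #include <omp.h>

    int main (void)
    {
       printf ("Master thread: serial section before the fork\n");

       /* The master thread forks a team of threads; every thread executes the block. */
       #pragma omp parallel
       {
          int id = omp_get_thread_num();     /* this thread's identification number */
          int t  = omp_get_num_threads();    /* number of threads in the team */
          printf ("Thread %d of %d active in the parallel region\n", id, t);
       }  /* implicit join: only the master thread continues past this point */

       printf ("Master thread: serial section after the join\n");
       return 0;
    }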
17.3 PARALLEL for LOOPS

Inherently parallel operations are often expressed in C programs as for loops. OpenMP makes it easy to indicate when the iterations of a for loop may be executed in parallel. For example, consider the following loop, which accounts for a large proportion of the execution time in our MPI implementation of the Sieve of Eratosthenes:

    for (i = first; i < size; i += prime)
       marked[i] = 1;

Clearly there is no dependence between one iteration of the loop and another. How do we convert it into a parallel loop? In OpenMP we simply indicate to the compiler that the iterations of a for loop may be executed in parallel; the compiler takes care of generating the code that forks/joins threads and schedules the iterations, that is, allocates iterations to threads.

17.3.1 parallel for Pragma

A compiler directive in C or C++ is called a pragma. The word pragma is short for "pragmatic information." A pragma is a way to communicate information to the compiler. The information is nonessential in the sense that the compiler may ignore the information and still produce a correct object program. However, the information provided by the pragma can help the compiler optimize the program.

Like other lines that provide information to the preprocessor, a pragma begins with the # character. A pragma in C or C++ has this syntax:

    #pragma omp <rest of pragma>

The first pragma we are going to consider is the parallel for pragma. The simplest form of the parallel for pragma is:

    #pragma omp parallel for

Putting this line immediately before the for loop instructs the compiler to try to parallelize the loop:

    #pragma omp parallel for
    for (i = first; i < size; i += prime)
       marked[i] = 1;

In order for the compiler to successfully transform the sequential loop into a parallel loop, it must be able to verify that the run-time system will have the information it needs to determine the number of loop iterations when it evaluates the control clause. For this reason the control clause of the for loop must have canonical shape, as illustrated in Figure 17.3. In addition, the for loop must not contain statements that allow the loop to be exited prematurely. Examples include the break statement, return statement, exit statement, and goto statements to labels outside the loop. The continue statement is allowed, however, because its execution does not affect the number of loop iterations.

Figure 17.3 In order to be made parallel, the control clause of a for loop must have canonical shape. The index may only be modified in forms such as index++, ++index, index--, --index, index += inc, index -= inc, index = index + inc, index = inc + index, and index = index - inc. The identifiers start, end, and inc may be expressions.

Our example for loop

    for (i = first; i < size; i += prime)
       marked[i] = 1;

meets these criteria: the control clause has canonical shape, and there are no premature exits in the body of the loop. Hence the compiler can generate code that allows its iterations to execute in parallel.
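To see the pragma in context, here is a minimal, self-contained sketch built around a marking loop of this kind. The array size, the value of prime, the starting index, and the final count are assumptions invented for the example; they are not part of the sieve program.

    /* mark_multiples.c -- self-contained sketch of a parallel marking loop.
       The array size N, prime, and first are assumptions for illustration. */
    #include <stdio.h>

    #define N 1000

    int main (void)
    {
       char marked[N] = {0};        /* 0 = unmarked, 1 = marked */
       int  prime = 7;              /* assumed value; in the sieve it varies */
       int  first = 2 * prime;      /* assumed starting point for the example */
       int  i;

       /* Iterations are independent, so they may be divided among threads. */
       #pragma omp parallel for
       for (i = first; i < N; i += prime)
          marked[i] = 1;

       /* Serial check after the implicit join. */
       int count = 0;
       for (i = 0; i < N; i++)
          if (marked[i]) count++;
       printf ("%d multiples marked\n", count);
       return 0;
    }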
During parallel execution of the for loop, the master thread creates additional threads, and all threads work together to cover the iterations of the loop. Every thread has its own execution context: an address space containing all of the variables the thread may access. The execution context includes static variables, dynamically allocated data structures in the heap, and variables on the run-time stack. Each thread also has its own run-time stack, where the frames for the functions it invokes are kept. A variable in the execution context may be either shared or private. A shared variable has the same address in the execution context of every thread; all threads have access to shared variables. A private variable has a different address in the execution context of every thread. A thread can access its own private variables, but it cannot access the private variables of another thread. In the case of the parallel for pragma, variables are by default shared, with the exception of the loop index variable, which is private.

Figure 17.4 illustrates shared and private variables. In this example the iterations of the for loop are being divided among two threads. The loop index i is a private variable: each thread has its own copy. The remaining variables b and ptr, as well as the data allocated on the heap, are shared.

Figure 17.4 During parallel execution of the for loop, index i is a private variable, while b, ptr, and heap data are shared.

How does the run-time system know how many threads to create? The value of an environment variable called OMP_NUM_THREADS provides a default number of threads for parallel sections of code. In Unix you can use the printenv command to inspect the value of this variable and the setenv command to modify its value.

Another strategy is to set the number of threads equal to the number of multiprocessor CPUs. Let's explore the OpenMP functions that enable us to do this.

17.3.2 Function omp_get_num_procs

Function omp_get_num_procs returns the number of physical processors available for use by the parallel program. Here is the function header:

    int omp_get_num_procs (void)

The integer returned by this function may be less than the total number of physical processors in the multiprocessor, depending on how the run-time system gives processes access to processors.

17.3.3 Function omp_set_num_threads

Function omp_set_num_threads uses the parameter value to set the number of threads to be active in parallel sections of code. It has this function header:

    void omp_set_num_threads (int t)

Since this function may be called at multiple points in a program, you have the ability to tailor the level of parallelism to the grain size or other characteristics of the code block. Setting the number of threads equal to the number of available CPUs is straightforward:

    int t;
    t = omp_get_num_procs();
    omp_set_num_threads(t);

17.4 DECLARING PRIVATE VARIABLES

For our second example, let's look at a slightly more complicated loop structure: the computational heart of our MPI implementation of Floyd's algorithm, a doubly nested for loop in which an outer loop over index i contains an inner loop over index j. There is no dependence between iterations of the outer loop, so we would like to execute them in parallel. A parallel for pragma placed before the outer loop, however, makes only the outer index i private by default; the inner index j remains shared. If the threads share j, they interfere with one another as they execute their inner loops, creating a race condition. Each thread needs its own private copy of j. The private clause tells the compiler to give every thread its own copy of each variable named in the list:

    #pragma omp parallel for private(j)

With this clause, each thread can execute its share of the outer iterations without disturbing the inner loops of the other threads.
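Here is a minimal self-contained sketch of this pattern. The matrix, its size, and the assignment inside the inner loop are placeholders invented for the example; they are not the Floyd's algorithm code from the MPI chapter.

    /* private_demo.c -- parallelizing an outer loop while keeping the inner
       loop index private. The 4x4 matrix and the update are placeholders. */
    #include <stdio.h>

    #define N 4

    int main (void)
    {
       double a[N][N];
       int i, j;

       /* i is private automatically (it is the parallel loop index);
          j must be listed explicitly, or the threads would share it. */
       #pragma omp parallel for private(j)
       for (i = 0; i < N; i++)
          for (j = 0; j < N; j++)
             a[i][j] = (double) (i + j);   /* placeholder update */

       printf ("a[%d][%d] = %.1f\n", N-1, N-1, a[N-1][N-1]);
       return 0;
    }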
Sometimes we want each thread's private copy of a variable to begin with the value that the master thread's copy had just before the loop. The firstprivate clause does just that: it directs the compiler to create private variables having initial values identical to the value of the variable controlled by the master thread as the loop is entered. Here is the correct way to code such a parallel loop:

    x[0] = complex_function();
    #pragma omp parallel for private(j) firstprivate(x)
    for (i = 0; i < n; i++) {
       ...
    }

The complementary lastprivate clause copies back to the master thread's copy the value that a private variable holds at the end of the sequentially last iteration of the loop.

Now consider the rectangle-rule computation of π: each iteration of a for loop adds the height of one rectangle under the curve 4/(1 + x*x) to a running total, area, and dividing by n at the end gives the estimate of π. One way to parallelize this loop is to make area a shared variable and guard its update with the critical pragma, so that only one thread at a time performs the addition; as Table 17.2 shows, however, serializing the updates in this way severely limits performance. Sum-like operations of this kind are reductions, and OpenMP provides a reduction clause that lets the compiler generate efficient code for them. The reduction clause has this syntax:

    reduction(<op>:<variable>)

where <op> is one of the reduction operators shown in Table 17.1 and <variable> is the name of the shared variable that will end up with the result of the reduction. Here is an implementation of the π-finding code with the reduction clause replacing the critical section:

    double area, pi, x;
    int i, n;

    area = 0.0;
    #pragma omp parallel for private(x) reduction(+:area)
    for (i = 0; i < n; i++) {
       x = (i + 0.5) / n;
       area += 4.0 / (1.0 + x*x);
    }
    pi = area / n;

Table 17.2 compares our two implementations of the rectangle rule to compute π. We set n = 100,000 and execute the programs on a Sun Enterprise Server 4000. The implementation that uses the reduction clause is clearly superior to the one using the critical pragma: it is faster when only a single thread is active, and its execution time improves when additional threads are added.

Table 17.1 OpenMP reduction operators for C and C++.

    Operator   Meaning               Allowable types   Initial value
    +          Sum                   float, int        0
    *          Product               float, int        1
    &          Bitwise and           int               all bits 1
    |          Bitwise or            int               0
    ^          Bitwise exclusive or  int               0
    &&         Logical and           int               1
    ||         Logical or            int               0

Table 17.2 Execution times (in seconds) on a Sun Enterprise Server 4000 of two programs that compute π using the rectangle rule, one protecting the sum with the critical pragma and one using the reduction clause, for increasing numbers of active threads.

17.7 PERFORMANCE IMPROVEMENTS

Sometimes transforming a sequential for loop into a parallel for loop can actually increase a program's execution time. In this section we'll look at three ways of improving the performance of parallel loops.

17.7.1 Inverting Loops

Consider a doubly nested loop in which a dependence between the iterations of the outer loop means that only the iterations of the inner loop may be executed in parallel. Placing a parallel for pragma in front of the inner loop works, but it forces a fork/join step for every iteration of the outer loop, and that repeated overhead can outweigh the benefit of parallel execution. If the two loops can be interchanged without changing the result, inverting them so that the parallelizable loop becomes the outer loop means the threads are forked and joined only once.

17.7.2 Conditionally Parallelizing Loops

Because forking and joining threads takes time, executing a loop in parallel pays off only when the loop contains enough work. For small problem sizes, the parallel version of a loop may actually run more slowly than the sequential version. The if clause lets us defer the decision until run time. We add to the parallel for pragma a clause of the form

    if (<scalar expression>)

If the scalar expression evaluates to true, the loop will be executed in parallel; otherwise, it will be executed serially. For example, here is how we could add an if clause to the parallel for pragma in the parallel program computing π using the rectangle rule:

    #pragma omp parallel for private(x) reduction(+:area) if(n > 5000)
    for (i = 0; i < n; i++) {
       x = (i + 0.5) / n;
       area += 4.0 / (1.0 + x*x);
    }

With this clause, the loop will be executed in parallel only when n > 5,000.
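Pulling these pieces together, here is a self-contained sketch of the rectangle-rule program with the reduction clause and the if clause. The command-line handling, the default value of n, and the output format are assumptions added so the fragment compiles and runs; they are not part of the original code.

    /* pi_area.c -- self-contained sketch of the rectangle-rule pi program.
       Reading n from the command line is an assumption for this example. */
    #include <stdio.h>
    #include <stdlib.h>

    int main (int argc, char *argv[])
    {
       double area = 0.0, pi, x;
       int i, n;

       n = (argc > 1) ? atoi(argv[1]) : 100000;

       /* area is a reduction variable; x must be private to each thread.
          The if clause parallelizes the loop only when n is large enough. */
       #pragma omp parallel for private(x) reduction(+:area) if(n > 5000)
       for (i = 0; i < n; i++) {
          x = (i + 0.5) / n;
          area += 4.0 / (1.0 + x*x);
       }
       pi = area / n;

       printf ("Estimate of pi: %7.5f\n", pi);
       return 0;
    }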
17.7.3 Scheduling Loops

In some loops the time needed to execute different loop iterations varies considerably. For example, consider the following doubly nested loop that initializes an upper triangular matrix:

    for (i = 0; i < n; i++)
       for (j = i; j < n; j++)
          a[i][j] = ...;

Iteration i of the outer loop initializes n - i elements, so the first iterations perform far more work than the last ones. If each thread is simply handed a contiguous block of iterations before the loop executes (a static schedule), the workloads of the threads are badly unbalanced. The alternative is to allocate iterations to threads during the execution of the loop (a dynamic schedule): each time a thread finishes its current work, it asks for more. Dynamic scheduling balances the load at the cost of extra run-time overhead, since the threads must coordinate every allocation. In either case iterations may be handed out in chunks of contiguous iterations. Increasing the chunk size can reduce this overhead and increase the cache hit rate; reducing the chunk size allows finer balancing of the workload.

The schedule clause has this syntax:

    schedule (<type>[, <chunk>])

In other words, the schedule type is required, but the chunk size is optional. With these two parameters it's easy to describe a wide variety of schedules:

- schedule(static): a static allocation of about n/t contiguous iterations to each thread.
- schedule(static, C): an interleaved allocation of chunks to tasks; each chunk contains C contiguous iterations.
- schedule(dynamic): iterations are dynamically allocated, one at a time, to threads.
- schedule(dynamic, C): a dynamic allocation of C iterations at a time to the tasks.
- schedule(guided, C): a dynamic allocation of iterations to tasks using the guided self-scheduling heuristic. Guided self-scheduling begins by allocating a large chunk size to each task and responds to further requests for chunks by allocating chunks of decreasing size; the size of the chunks decreases exponentially to a minimum chunk size of C.
- schedule(guided): guided self-scheduling with a minimum chunk size of 1.
- schedule(runtime): the schedule type is chosen at run time based on the value of the environment variable OMP_SCHEDULE. For example, the Unix command

      setenv OMP_SCHEDULE "static,1"

  would set the run-time schedule to be an interleaved allocation.

When the schedule clause is not included in the parallel for pragma, most run-time systems default to a simple static scheduling of consecutive loop iterations to tasks. Going back to our original example, the run time of any particular iteration of the outermost for loop is predictable: it is proportional to the number of elements that iteration initializes. An interleaved allocation of loop iterations therefore balances the workload of the threads:

    #pragma omp parallel for private(j) schedule(static, 1)
    for (i = 0; i < n; i++)
       for (j = i; j < n; j++)
          a[i][j] = ...;

17.8 MORE GENERAL DATA PARALLELISM

So far the parallelism we have exploited has been in loops whose iteration counts could be determined before the loops began executing. Now consider a program that works its way through a "to do" list of tasks (Figure 17.7). The tasks are kept on a singly linked list headed by the shared pointer job_ptr. The program repeatedly calls function get_next_task to remove the task at the head of the list and then completes that task, until the list is empty. The heart of the sequential get_next_task looks like this:

    struct task_struct *get_next_task (struct job_struct **job_ptr) {
       struct task_struct *answer;

       if (*job_ptr == NULL) answer = NULL;
       else {
          answer = (*job_ptr)->task;
          *job_ptr = (*job_ptr)->next;
       }
       return answer;
    }

How would we like this algorithm to execute in parallel? We want every thread to do the same thing: repeatedly take the next task from the list and complete it, until there are no more tasks to do. We need to ensure that no two threads take the same task from the list. In other words, it is important to execute function get_next_task atomically.

Figure 17.7 Two threads work their way through a singly linked "to do" list. Variable job_ptr must be shared, while task_ptr must be a private variable.

17.8.1 parallel Pragma

The parallel pragma precedes a block of code that should be executed by all of the threads. It has this syntax:

    #pragma omp parallel

If the code we want executed in parallel is not a simple statement (such as an assignment statement, if statement, or for loop), we can use curly braces to create a block of code from a statement group. Note that unlike the parallel for pragma, which divides the iterations of the for loop among the active threads, the parallel pragma causes the code block that follows it to be replicated: every thread executes the entire block. Our section of function main now looks like this:

    int main (int argc, char *argv[])
    {
       struct job_struct *job_ptr;
       struct task_struct *task_ptr;
       ...
       #pragma omp parallel private(task_ptr)
       {
          task_ptr = get_next_task (&job_ptr);
          while (task_ptr != NULL) {
             complete_task (task_ptr);
             task_ptr = get_next_task (&job_ptr);
          }
       }
       ...
    }

Now we need to ensure that function get_next_task executes atomically. Otherwise, allowing two threads to execute get_next_task simultaneously may result in more than one thread returning from the function with the same value of task_ptr. We use the critical pragma, which ensures mutually exclusive execution of the critical section of code it precedes. Here is the rewritten function get_next_task:

    struct task_struct *get_next_task (struct job_struct **job_ptr) {
       struct task_struct *answer;

       #pragma omp critical
       {
          if (*job_ptr == NULL) answer = NULL;
          else {
             answer = (*job_ptr)->task;
             *job_ptr = (*job_ptr)->next;
          }
       }
       return answer;
    }
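To see the whole pattern in one place, here is a self-contained sketch that builds a small linked job list and processes it with the parallel pragma and the critical-protected get_next_task. The struct layouts, the trivial complete_task, and the list-building code are assumptions invented for the example.

    /* task_list.c -- self-contained sketch of the "to do" list pattern.
       The struct layouts and complete_task below are illustrative assumptions. */
    #include <stdio.h>
    #include <stdlib.h>

    struct task_struct { int id; };                    /* assumed task payload */
    struct job_struct {
       struct task_struct *task;
       struct job_struct  *next;
    };

    struct task_struct *get_next_task (struct job_struct **job_ptr)
    {
       struct task_struct *answer;

       #pragma omp critical
       {
          if (*job_ptr == NULL) answer = NULL;
          else {
             answer   = (*job_ptr)->task;
             *job_ptr = (*job_ptr)->next;
          }
       }
       return answer;
    }

    void complete_task (struct task_struct *t)
    {
       printf ("Completed task %d\n", t->id);          /* stand-in for real work */
    }

    int main (void)
    {
       struct job_struct  *job_ptr = NULL;             /* shared list head */
       struct task_struct *task_ptr;                   /* private per thread */
       int i;

       /* Build a short list of tasks (serial setup by the master thread). */
       for (i = 9; i >= 0; i--) {
          struct job_struct *node = malloc (sizeof *node);
          node->task = malloc (sizeof *node->task);
          node->task->id = i;
          node->next = job_ptr;
          job_ptr = node;
       }

       /* Every thread repeatedly grabs and completes tasks until none remain. */
       #pragma omp parallel private(task_ptr)
       {
          task_ptr = get_next_task (&job_ptr);
          while (task_ptr != NULL) {
             complete_task (task_ptr);
             task_ptr = get_next_task (&job_ptr);
          }
       }
       return 0;
    }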
17.8.2 Function omp_get_thread_num

Earlier in this chapter we computed π using the rectangle rule. In Chapter 10 we computed π using the Monte Carlo method. The idea, illustrated in Figure 17.8, is to generate pairs of points in the unit square (where each coordinate varies between 0 and 1). We count the fraction of points inside the unit circle (those points for which x² + y² ≤ 1). The expected value of this fraction is π/4; hence multiplying the fraction by 4 gives an estimate of π.

Figure 17.8 Example of a Monte Carlo algorithm to compute π. In this example we have generated 1,000 pairs from a uniform distribution between 0 and 1. Since 773 pairs are inside the unit circle, our estimate of π is 4(773/1000), or 3.092.

Here is the C code implementing the algorithm:

    int            count;        /* Points inside unit circle */
    unsigned short xi[3];        /* Random number seed */
    int            i;
    int            samples;      /* Points to generate */
    double         x, y;         /* Coordinates of point */

    samples = atoi (argv[1]);
    xi[0] = atoi (argv[2]);
    xi[1] = atoi (argv[3]);
    xi[2] = atoi (argv[4]);
    count = 0;
    for (i = 0; i < samples; i++) {
       x = erand48 (xi);
       y = erand48 (xi);
       if (x*x + y*y <= 1.0) count++;
    }
    printf ("Estimate of pi: %7.5f\n", 4.0*count/samples);

If we want to speed the execution of the program by using multiple threads, we must ensure that each thread generates a different stream of random numbers. Otherwise, each thread would generate the same sequence of (x, y) pairs, and there would be no increase in the precision of the answer through the use of parallelism. Hence xi must be a private variable, and we must find some way for each thread to initialize array xi with unique values. That means we need some way of distinguishing threads.

In OpenMP every thread on a multiprocessor has a unique identification number. We can retrieve this number using the function omp_get_thread_num, which has this header:

    int omp_get_thread_num (void)

If there are t active threads, the thread identification numbers are integers ranging from 0 through t - 1. The master thread always has identification number 0. Assigning the thread identification number to xi[2] ensures each thread has a different random number seed.

17.8.3 Function omp_get_num_threads

In order to divide the iterations among the threads, we must know the number of active threads. Function omp_get_num_threads, with this header

    int omp_get_num_threads (void)

returns the number of threads active in the current parallel region. We can use this information, as well as the thread identification number, to divide the iterations among the threads. Each thread will accumulate its count of points inside the circle in a private variable. When each thread completes the for loop, it will add its subtotal to count inside a critical section.

The OpenMP implementation of the Monte Carlo π-finding algorithm appears in Figure 17.9.

Figure 17.9 OpenMP implementation of the Monte Carlo π-finding algorithm.
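As a rough sketch of what such an implementation can look like (assembled from the description above, not the listing in Figure 17.9): xi, the coordinates, and the loop index are private; each thread seeds xi[2] with its identification number; and each thread adds its subtotal to count inside a critical section. The command-line handling and the way iterations are divided among threads are assumptions.

    /* monte_pi.c -- a sketch following the description of Figure 17.9;
       not the book's exact listing. Usage: monte_pi <samples> <seed1> <seed2> */
    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    int main (int argc, char *argv[])
    {
       int count = 0;                    /* shared total of points inside circle */
       int samples = atoi (argv[1]);     /* points to generate */

       #pragma omp parallel
       {
          unsigned short xi[3];          /* private random number seed */
          double x, y;
          int i, local_count = 0;        /* private subtotal */
          int id = omp_get_thread_num();
          int t  = omp_get_num_threads();

          xi[0] = atoi (argv[2]);
          xi[1] = atoi (argv[3]);
          xi[2] = id;                    /* distinct seed per thread */

          /* Divide the iterations among the t threads. */
          for (i = id; i < samples; i += t) {
             x = erand48 (xi);
             y = erand48 (xi);
             if (x*x + y*y <= 1.0) local_count++;
          }

          #pragma omp critical
          count += local_count;
       }

       printf ("Estimate of pi: %7.5f\n", 4.0*count/samples);
       return 0;
    }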
17.8.4 for Pragma

The parallel pragma can also come in handy when we are parallelizing for loops. Consider this doubly nested loop:

    for (i = 0; i < m; i++) {
       low = a[i];
       high = b[i];
       if (low > high) {
          printf ("Exiting during iteration %d\n", i);
          break;
       }
       for (j = low; j < high; j++)
          c[j] = (c[j] - a[i]) / b[i];
    }

We cannot execute the iterations of the outer loop in parallel, because it contains a break statement. If we put a parallel for pragma before the loop indexed by j, there will be a fork/join step for every iteration of the outer loop. We would like to avoid this overhead. Previously, we showed how inverting for loops could solve this problem, but that approach doesn't work here because of the data dependences. If we put the parallel pragma immediately in front of the loop indexed by i, then we'll have only a single fork/join. The default behavior, however, is that every thread executes all of the code inside the block, and we want the threads to divide up the iterations of the inner loop. The for pragma directs the compiler to do just that: placed inside a parallel block, immediately before a for loop, it indicates that the iterations of that loop should be divided among the threads.

    #pragma omp for

With the parallel pragma around the outer loop and the for pragma on the inner loop, the code segment becomes:

    #pragma omp parallel private(i,j)
    for (i = 0; i < m; i++) {
       low = a[i];
       high = b[i];
       if (low > high) {
          printf ("Exiting during iteration %d\n", i);
          break;
       }
       #pragma omp for
       for (j = low; j < high; j++)
          c[j] = (c[j] - a[i]) / b[i];
    }

Our work is not yet complete, however.

17.8.5 single Pragma

We have parallelized the execution of the loop indexed by j. What about the other code inside the outer loop? We certainly don't want to see the error message printed more than once. The single pragma tells the compiler that only a single thread should execute the block of code the pragma precedes. Its syntax is:

    #pragma omp single

Adding the single pragma to the code block, we now have:

    #pragma omp parallel private(i,j)
    for (i = 0; i < m; i++) {
       low = a[i];
       high = b[i];
       if (low > high) {
          #pragma omp single
          printf ("Exiting during iteration %d\n", i);
          break;
       }
       #pragma omp for
       for (j = low; j < high; j++)
          c[j] = (c[j] - a[i]) / b[i];
    }

The code block now executes correctly, but we can improve its performance.

17.8.6 nowait Clause

The compiler puts a barrier synchronization at the end of every parallel for statement. In the example we have been considering, this barrier is necessary, because we need to ensure that every thread has completed one iteration of the loop indexed by i before any thread begins the next iteration. Otherwise, a thread might change the value of low or high, altering the number of iterations of the j loop performed by another thread. On the other hand, if we make low and high private variables, there is no need for the barrier at the end of the loop indexed by j.

The nowait clause, added to a parallel for pragma, tells the compiler to omit the barrier synchronization at the end of the parallel for loop. After making low and high private and adding the nowait clause, here is the final version of our example code segment:

    #pragma omp parallel private(i,j,low,high)
    for (i = 0; i < m; i++) {
       low = a[i];
       high = b[i];
       if (low > high) {
          #pragma omp single
          printf ("Exiting during iteration %d\n", i);
          break;
       }
       #pragma omp for nowait
       for (j = low; j < high; j++)
          c[j] = (c[j] - a[i]) / b[i];
    }
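The fragments above assume that the arrays a, b, and c and the bound m are declared elsewhere. Here is a minimal self-contained sketch with made-up data, just to show the whole pattern in compilable form; the array sizes and contents are assumptions chosen so that the inner-loop ranges of different outer iterations do not overlap.

    /* parallel_for_single.c -- self-contained sketch of the parallel/for/single/
       nowait pattern. Array sizes and contents are made up for illustration. */
    #include <stdio.h>

    #define M 4
    #define N 16

    int main (void)
    {
       int    a[M] = {0, 4, 8, 12};      /* assumed lower bounds */
       int    b[M] = {4, 8, 12, 16};     /* assumed upper bounds */
       double c[N];
       int    i, j, low, high;

       for (j = 0; j < N; j++)           /* serial initialization */
          c[j] = (double) j;

       #pragma omp parallel private(i,j,low,high)
       for (i = 0; i < M; i++) {
          low = a[i];
          high = b[i];
          if (low > high) {
             #pragma omp single
             printf ("Exiting during iteration %d\n", i);
             break;
          }
          #pragma omp for nowait
          for (j = low; j < high; j++)
             c[j] = (c[j] - a[i]) / b[i];
       }

       printf ("c[%d] = %f\n", N-1, c[N-1]);
       return 0;
    }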
17.9 FUNCTIONAL PARALLELISM

To this point we have focused entirely on exploiting data parallelism. Another source of concurrency is functional parallelism. OpenMP allows us to assign different threads to different portions of code. Consider, for example, the following code segment:

    v = alpha();
    w = beta();
    x = gamma(v, w);
    y = delta();
    printf ("%6.2f\n", epsilon(x,y));

If all of the functions are side-effect free, we can represent the data dependences as shown in Figure 17.10. Clearly functions alpha, beta, and delta may be executed in parallel. If we execute these functions concurrently, there is no more functional parallelism to exploit, because function gamma must be called after functions alpha and beta and before function epsilon.

Figure 17.10 Data dependence diagram for the code segment of Section 17.9: alpha and beta feed gamma; gamma and delta feed epsilon.

17.9.1 parallel sections Pragma

The parallel sections pragma precedes a block of k blocks of code that may be executed concurrently by k threads. It has this syntax:

    #pragma omp parallel sections

17.9.2 section Pragma

The section pragma precedes each block of code within the encompassing block preceded by the parallel sections pragma. (The section pragma may be omitted for the first parallel section after the parallel sections pragma.)

In the example we considered, the calls to functions alpha, beta, and delta could be evaluated concurrently. In our parallelization of this code segment, we use curly braces to create a block of code containing these three assignment statements. (Recall that an assignment statement is a trivial example of a code block. Hence a block containing three assignment statements is a block of three blocks of code.)

    #pragma omp parallel sections
    {
       #pragma omp section   /* This pragma optional */
       v = alpha();
       #pragma omp section
       w = beta();
       #pragma omp section
       y = delta();
    }
    x = gamma(v, w);
    printf ("%6.2f\n", epsilon(x,y));

Note that we reordered the assignment statements to bring together the three that could be executed in parallel.

17.9.3 sections Pragma

Let's take another look at the data dependence diagram of Figure 17.10. There is a second way to exploit functional parallelism in this code segment. As we noted earlier, if we execute functions alpha, beta, and delta in parallel, there are no further opportunities for functional parallelism. However, if we execute only functions alpha and beta in parallel, then after they return we may execute functions gamma and delta in parallel. In this design we have two different parallel sections, one following the other. We can reduce fork/join costs by putting all four assignment statements in a single block preceded by the parallel pragma, then using the sections pragma to identify the first and second pairs of functions that may execute in parallel.

The sections pragma, with syntax

    #pragma omp sections

appears inside a parallel block of code. It has exactly the same meaning as the parallel sections pragma we have already described. Here is another way to express functional parallelism in the code segment we have been considering, using the sections pragma:

    #pragma omp parallel
    {
       #pragma omp sections
       {
          #pragma omp section   /* This pragma optional */
          v = alpha();
          #pragma omp section
          w = beta();
       }
       #pragma omp sections
       {
          #pragma omp section   /* This pragma optional */
          x = gamma(v, w);
          #pragma omp section
          y = delta();
       }
    }
    printf ("%6.2f\n", epsilon(x,y));

In one respect this solution is better than the first one we presented, because it has two parallel sections of code, each requiring two threads. Our first solution has only a single parallel section of code, which requires three threads. If only two processors are available, the second version could result in higher efficiency. Whether or not that is the case depends upon the execution times of the individual functions.
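To experiment with the two versions, the five functions can be replaced by trivial stand-ins. The function bodies below, and the renaming of gamma (which clashes with a traditional function of the same name in the C math library), are assumptions made only so that the fragment compiles and produces a deterministic result.

    /* sections_demo.c -- self-contained version of the functional-parallelism
       example. The function bodies are placeholders invented for illustration. */
    #include <stdio.h>

    double alpha (void) { return 1.0; }
    double beta  (void) { return 2.0; }
    double gamma_fn (double v, double w) { return v + w; }  /* renamed to avoid the
                                               traditional math-library gamma() */
    double delta (void) { return 4.0; }
    double epsilon (double x, double y) { return x * y; }

    int main (void)
    {
       double v, w, x, y;

       #pragma omp parallel sections
       {
          #pragma omp section        /* This pragma optional */
          v = alpha();
          #pragma omp section
          w = beta();
          #pragma omp section
          y = delta();
       }
       x = gamma_fn(v, w);
       printf ("%6.2f\n", epsilon(x, y));   /* prints 12.00 */
       return 0;
    }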
17.10 SUMMARY

OpenMP is an API for shared-memory parallel programming. The shared-memory model relies upon fork/join parallelism. You can envision the execution of a shared-memory program as periods of sequential execution alternating with periods of parallel execution. A master thread executes all of the sequential code. When it reaches a parallel code segment, it forks other threads. The threads communicate with each other via shared variables. At the end of the parallel code segment, these threads synchronize, rejoining the master thread.

This chapter has introduced OpenMP pragmas and clauses that can be used to transform a sequential C program into one that runs in parallel on a multiprocessor. First we considered the parallelization of for loops. In C programs data parallelism is often expressed in the form of for loops. We use the parallel for pragma to indicate to the compiler those loops whose iterations may be performed in parallel. There are certain restrictions on for loops that may be executed in parallel. The control clause must be simple, so that the run-time system can determine, before the loop executes, how many iterations it will have. The loop cannot have a break statement, goto statement, or another statement that allows early loop termination.

We also discussed how to take advantage of functional parallelism through the use of the parallel sections pragma. This pragma precedes a block of blocks of code, where each of the inner blocks, or sections, represents an independent task that may be performed in parallel with the other sections.

The parallel pragma precedes a block of code that should be executed in parallel by all threads. When all threads execute the same code, the result is SPMD-style parallel execution similar to that exhibited by many of our programs using MPI. A for pragma or a sections pragma may appear inside the block of code marked with a parallel pragma, allowing the compiler to exploit data or functional parallelism. We also use pragmas to point out areas within parallel sections that must be executed sequentially. The critical pragma indicates a block of code forming a critical section where mutual exclusion must be enforced. The single pragma indicates a block of code that should be executed by only one of the threads.

We can convey additional information to the compiler by adding clauses to pragmas. The private clause gives each thread its own copy of the listed variables. Values can be copied between the original variable and private variables using the firstprivate and/or the lastprivate clauses. The reduction clause allows the compiler to generate efficient code for reduction operations happening inside a parallel loop. The schedule clause lets you specify the way loop iterations are allocated to tasks. The if clause allows the system to determine at run time whether a construct should be executed sequentially or by multiple threads. The nowait clause eliminates the barrier synchronization at the end of the parallel construct.

While we have introduced clauses in the context of particular pragmas, most clauses can be applied to most pragmas. Table 17.4 lists which of the clauses we have introduced in this chapter may be attached to which pragmas.

Table 17.4 This table summarizes which clauses may be attached to which pragmas.

    Pragma              Clauses that may be attached
    for                 firstprivate, lastprivate, nowait, private, reduction, schedule
    parallel            firstprivate, if, private, reduction
    parallel for        firstprivate, if, lastprivate, private, reduction, schedule
    parallel sections   firstprivate, if, lastprivate, private, reduction
    sections            firstprivate, lastprivate, nowait, private, reduction
    single              firstprivate, nowait, private

    Note: OpenMP has additional clauses not introduced in this chapter.

We have examined various ways in which the performance of parallel for loops can be enhanced. The strategies are inverting loops, conditionally parallelizing loops, and changing the way in which loop iterations are scheduled.

Table 17.5 compares OpenMP with MPI. Both programming environments can be used to program multiprocessors. MPI is also suitable for programming multicomputers. Since OpenMP relies on shared variables, OpenMP is not appropriate for generic multicomputers in which there is no shared memory. MPI also makes it easier for the programmer to take control of the memory hierarchy.
Table 17.5 Comparison of OpenMP and MPI.

    Characteristic                         OpenMP   MPI
    Suitable for multiprocessors           Yes      Yes
    Suitable for multicomputers            No       Yes
    Supports incremental parallelization   Yes      No
    Minimal extra code                     Yes      No
    Explicit control of memory hierarchy   No       Yes

On the other hand, OpenMP has the significant advantage of allowing programs to be incrementally parallelized. In addition, unlike programs using MPI, which often are much longer than their sequential counterparts, programs using OpenMP are usually not much longer than the sequential codes they displace.

17.11 KEY TERMS

canonical shape, chunk, clause, critical section, dynamic schedule, execution context, fork/join parallelism, grain size, guided self-scheduling, incremental parallelization, master thread, pragma, private clause, private variable, race condition, reduction variable, schedule, sequentially last iteration, shared variable, static schedule

17.12 BIBLIOGRAPHIC NOTES

The URL for the official OpenMP Web site is www.OpenMP.org. You can download the official OpenMP specifications for the C/C++ and Fortran versions of OpenMP from this site.

Parallel Programming in OpenMP by Chandra et al. is an excellent introduction to this shared-memory application programming interface [16]. It provides broader and deeper coverage of the features of OpenMP. It also discusses performance tuning of OpenMP codes.

17.13 EXERCISES

17.1 Of the four OpenMP functions presented in this chapter, which two have the closest analogs to MPI functions? Name the MPI function each of these functions is similar to.

17.2 For each of the following code segments, use OpenMP pragmas to make the loop parallel, or explain why the code segment is not suitable for parallel execution.

    a. for (i = 0; i < (int) sqrt(x); i++) {
          a[i] = 2.3 * i;
          if (i < 10) b[i] = a[i];
       }

    b. flag = 0;
       for (i = 0; (i < n) && (!flag); i++) {
          ...
       }
