Medical image processing strategies for multi-core CPUs
Daniel Blezek, Mayo Clinic
blezek.daniel@mayo.edu
Poll
Does your primary computer have more than one core...?
Have you ever written parallel code?
It’s a parallel world...
SMP formerly was the domain of researchers
Thanks to Intel, now it’s everywhere!
... but most of us think in serial ...
Hardware has far outstripped software
Developers are not trained
Development of parallel software is difficult
Think outside the box (Erlang, Scala) ... or shoehorn parallelism into the languages we already have
Parallel Computing – according to Google
“parallel computing”: 1.4M hits on Google
“multithreading”: 10M hits
“multicore”: 2.4M hits
“parallel programming”: 1.1M hits
Why is it so hard?
  the world is parallel
  we all think in parallel (driving, for example)
  yet we are taught to program in serial
Degrees of parallelism (my take)
Serial – SISD, single thread of execution
Data parallel – SIMD (fine-grained parallelism)
Embarrassingly parallel – larger scale SIMD
  CT or MR reconstruction: each operation is independent, e.g. iFFT of slices
  Worker thread – e.g. virus scanning software
Coarse-grained parallelism – SMP or MIMD
  Focus of this presentation, more in GPU talk
  Concurrency, OpenMP, TBB, pthreads/Winthreads
Large scale – MPI on cluster, tight coupling
Large scale – Grid computing, loose coupling
Pragmatic approach
C/C++ and Fortran are the kings of performance
(I’ve never written a single line of Fortran, so don’t ask)
“Bolted on” parallel concepts
Zero language support
Huge existing codebase
Pragmatic approach
Briefly touch on SIMD
Introduce SMP concepts: threads, concurrency
Development models: pthreads/WinThreads, OpenMP, TBB, ITK
Medical image processing: example problems, common errors
Next steps
SIMD
SIMD – basic principles
http://en.wikipedia.org/wiki/SIMD
Data structures for SIMD
Array of Structures

struct Vec {
  float x, y, z;
};
Vec[] points = new Vec[sz];

[diagram: packing X/Y/Z components into a SIMD register and unpacking the result]
Data structures for SIMD
Structure of Arrays

struct Vec {
  float[] x;
  float[] y;
  float[] z;
  Vec ( int sz ) {
    x = new float[sz];
    y = new float[sz];
    z = new float[sz];
  };
};

Structure of Arrays

struct Vec {
  Vector4f[] v;
  Vec ( int sz ) {
    // must be word aligned
    v = new Vector4f[sz];
  };
};
SIMD pitfalls
Structure alignment: usually needs to be aligned on a word boundary
Structure considerations: may need to refactor existing code/structures
Generally not cross-platform: MMX, 3D Now!, SSE, SSE2, SSE4, AltiVec, AVX, etc...
Performance gains are modest: 2x – 4x common
Limited instructions: add, multiply, divide, round; not suitable for branching logic
Autovectorizing compilers handle simple loops: -ftree-vectorize (GCC), -fast, -O2 or -O3 (Intel Compiler)
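As a concrete illustration of the kind of loop an autovectorizing compiler can handle, here is a minimal sketch (the function and array names are hypothetical, not from the slides); with GCC, -O2 -ftree-vectorize or -O3 will typically turn this into SSE instructions.

// Element-wise operation over contiguous, independent data:
// no branches, no loop-carried dependence, unit-stride access.
// __restrict__ (a GCC spelling) tells the compiler the arrays do not alias,
// which makes vectorization easier to prove.
void scaleAndOffset(float* __restrict__ out, const float* __restrict__ in,
                    float a, float b, int n) {
  for (int i = 0; i < n; i++) {
    out[i] = a * in[i] + b;   // candidate for SIMD: 4 floats per SSE register
  }
}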
Threads
Threads – they’re everywhere
SMP concepts
Useful to think in terms of “cores”: 2 dual-core CPUs = 4 “cores”
Cores share main memory, may share cache
Threads in the same process share memory
Generally, one executing thread per core; other threads sleeping
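A quick way to see how many cores the runtime exposes is sketched below; this uses the standard OpenMP query functions (compile with -fopenmp or the equivalent), and the count reported depends on the OS and on hyper-threading.

#include <omp.h>
#include <cstdio>

int main() {
  // Logical processors the OpenMP runtime can see.
  printf("Logical processors: %d\n", omp_get_num_procs());
  #pragma omp parallel
  {
    #pragma omp master
    printf("Threads in this parallel region: %d\n", omp_get_num_threads());
  }
  return 0;
}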
Cores – they’re everywhere
How many cores does your laptop have?
Mine has 50(!)
  2 Intel CPU (Core 2 Duo)
  32 nVidia cores (9600M GT)
  16 nVidia cores (9400M)
Parallel concepts for SMP
Process
  Started by the OS
  Single thread executes “main”
  No direct access to memory of other processes
Threads
  Stream of execution under a process
  Access to memory in containing process
  Private memory
  Lifetime may be less than main thread
Concurrency
  Coordination between threads
  High level (mutex, locks, barriers)
  Low level (atomic operations)
Processes & Threads
[diagram: a process containing several threads]
Thread construction – pthread example

#include <pthread.h>

// Thread work function, must return pointer to void
void *doWork(void *work) {
  // Do work
  return work; // equivalent to pthread_exit ( myWork );
}
...
pthread_t child;
...
rc = pthread_create(&child, &attr, doWork, (void *)work);
...
rc = pthread_join ( child, &threadwork );
...
Thread construction – Win32 example

#include <windows.h>

DWORD WINAPI doWork( LPVOID work ) {};
...
PMYDATA work;
DWORD   childID;
HANDLE  child;
child = CreateThread(
        NULL,          // default security attributes
        0,             // use default stack size
        doWork,        // thread function name
        work,          // argument to thread function
        0,             // use default creation flags
        &childID);     // returns the thread identifier
WaitForMultipleObjects(NThreads, child, TRUE, INFINITE);
Thread construction – Java example

import java.lang.Thread;

class Worker implements Runnable {
  public Worker ( Work work ) {};
  public void run() {}; // Do work here
}
...
Worker worker = new Worker ( someWork );
new Thread ( worker ).start();
Race Conditions
[diagram: serial vs. parallel execution of the same updates; the parallel interleaving causes a problem]
Mutex
Mutex – Mutual exclusion lock
  Protects a section of code
  Only one thread has a lock on the object
  Threads may wait for the mutex, or return a status if the mutex is locked
Semaphore
  N threads
Critical Section
  One thread executes code
  Protects global resources
  Maintain consistent state
Race Conditions

...
N = 0;
...
// Start some threads
...
void* doWork() {
  N++; // get, incr, store – not a single step, so threads can interleave
}

Solution w/ Mutex:
Mutex mutex;
mutex.lock();
N++;
mutex.release();
Atomic operations
Locks are not perfect: cause blocking, relatively heavy-weight
Atomic operations
  Simple operations
  Hardware support
  Can implement w/ Mutex
Conditions
  Invisibility – no other thread knows about the change
  Atomicity – if operation fails, return to original state
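To make the comparison concrete, here is a sketch of one shared counter incremented three ways: with a mutex, with a GCC/Intel __sync builtin, and with tbb::atomic (which appears later in the talk). The builtin spelling is compiler-specific; treat the whole block as an illustrative sketch rather than the talk's own code.

#include <pthread.h>
#include <tbb/atomic.h>

long counter = 0;
pthread_mutex_t counterMutex = PTHREAD_MUTEX_INITIALIZER;
tbb::atomic<long> tbbCounter;

void incrementWithMutex() {           // heavier: other threads may block
  pthread_mutex_lock(&counterMutex);
  counter++;
  pthread_mutex_unlock(&counterMutex);
}

void incrementWithBuiltin() {         // lock-free, hardware-supported
  __sync_fetch_and_add(&counter, 1);  // GCC/Intel builtin
}

void incrementWithTBB() {             // portable wrapper over the same idea
  tbbCounter++;
}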
Deadlock
[diagram: two threads, each holding one of Mutex A and Mutex B and waiting for the other]
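A minimal sketch of how the classic two-mutex deadlock arises (the worker names are hypothetical): each thread holds one lock and waits forever for the other. Taking the locks in the same global order in both threads avoids it.

#include <pthread.h>

pthread_mutex_t mutexA = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t mutexB = PTHREAD_MUTEX_INITIALIZER;

void* worker1(void*) {
  pthread_mutex_lock(&mutexA);   // holds A...
  pthread_mutex_lock(&mutexB);   // ...waits for B
  /* work */
  pthread_mutex_unlock(&mutexB);
  pthread_mutex_unlock(&mutexA);
  return 0;
}

void* worker2(void*) {
  pthread_mutex_lock(&mutexB);   // holds B...
  pthread_mutex_lock(&mutexA);   // ...waits for A -> deadlock
  /* work */
  pthread_mutex_unlock(&mutexA);
  pthread_mutex_unlock(&mutexB);
  return 0;
}
// Fix: acquire mutexA before mutexB in *both* threads.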
Thread synchronization – barrier
Initialized with the number of threads expected
Threads signal when they are ready
Wait until all expected threads are there
A stalled or dead thread can stall all the threads
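A hedged pthread sketch of the barrier idea (pthread_barrier_t is part of POSIX but is not available on every platform, e.g. older macOS): each worker blocks at the barrier until the expected number of threads have arrived.

#include <pthread.h>

pthread_barrier_t barrier;

void* worker(void*) {
  // ... phase 1 work ...
  pthread_barrier_wait(&barrier);   // block until all expected threads arrive
  // ... phase 2 work: safe to read other threads' phase-1 results ...
  return 0;
}

// In main, before starting NThreads workers:
//   pthread_barrier_init(&barrier, NULL, NThreads);
// and after joining them:
//   pthread_barrier_destroy(&barrier);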
Thread synchronization – Condition variables
Workers atomically release the mutex and wait
Master atomically releases the mutex and signals
Workers wake up and acquire the mutex
[diagram: workers waiting on a condition variable protected by Mutex A while the master works, then signals]
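The same handshake as a minimal pthread sketch (the haveWork flag is a hypothetical predicate, not from the slides): pthread_cond_wait atomically releases the mutex while waiting and reacquires it on wake-up, and the while loop guards against spurious wake-ups.

#include <pthread.h>

pthread_mutex_t mutex    = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  workReady = PTHREAD_COND_INITIALIZER;
bool haveWork = false;               // predicate protected by the mutex

void* worker(void*) {
  pthread_mutex_lock(&mutex);
  while (!haveWork) {                          // re-check: wake-ups can be spurious
    pthread_cond_wait(&workReady, &mutex);     // releases mutex while waiting
  }
  haveWork = false;
  pthread_mutex_unlock(&mutex);
  // ... do the work ...
  return 0;
}

void postWork() {                    // called by the master thread
  pthread_mutex_lock(&mutex);
  haveWork = true;
  pthread_cond_signal(&workReady);   // wake one waiting worker
  pthread_mutex_unlock(&mutex);
}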
Thread pool & Futures
Maintains a “pool” of worker threads
Work queued until a thread is available
Optionally notify through a “Future”
Future can query status, holds return value
Thread returns to pool, no startup overhead
Core concept for OpenMP and TBB
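The future idea can be sketched with C++11's std::async, which is newer than the toolchains discussed in this talk but captures the concept: work is handed off to another thread, and the future is queried later for the result.

#include <future>
#include <cstdio>

int doSomeWork(int i) { return i * i; }   // hypothetical work function

int main() {
  // Hand the work off; it runs on another thread.
  std::future<int> result = std::async(std::launch::async, doSomeWork, 21);
  // ... the caller is free to do other things here ...
  printf("result = %d\n", result.get());  // blocks until the value is ready
  return 0;
}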
OpenMP
Introduction to OpenMP
Scatter / gather paradigm
Maintains a thread pool
Requires compiler support: Visual C++, gcc 4.0, Intel Compiler
Easy to adapt existing serial code, easy to debug
Simple paradigm
OpenMP – simple parallel sections

#pragma omp parallel sections num_threads ( 5 )
{
  // 5 threads scatter here
  #pragma omp section
  { /* Do task 1 */ }
  #pragma omp section
  { /* Do task 2 */ }
  ...
  #pragma omp section
  { /* Do task N */ }
  // Implicit barrier
}
OpenMP – parallel for

#pragma omp parallel for
for ( int i = 0; i < NumberOfIterations; i++ ) {
  // Threads scatter here
  // each thread has a private copy of i
  doSomeWork( i );
}
// Implicit barrier

// Scheduling the iterations is controlled with the schedule clause (see the sketch below)
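A minimal sketch of the schedule clause, using the same hypothetical doSomeWork and NumberOfIterations as above (the chunk size of 16 is arbitrary):

#include <omp.h>

void doAllWork(int NumberOfIterations) {
  // Iterations are handed out in chunks of 16 as threads become free;
  // useful when doSomeWork(i) takes very different times for different i.
  #pragma omp parallel for schedule(dynamic, 16)
  for (int i = 0; i < NumberOfIterations; i++) {
    doSomeWork(i);
  }
  // schedule(static), the common default, pre-assigns equal contiguous
  // chunks instead: lower overhead when all iterations cost about the same.
}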
OpenMP – reduction

int TotalAmountOfWork = 0;
#pragma omp parallel for reduction ( + : TotalAmountOfWork )
for ( int i = 0; i < NumberOfIterations; i++ ) {
  // Threads scatter here
  // each thread has a private copy of i & TotalAmountOfWork
  TotalAmountOfWork += doSomeWork( i );
}
// Implicit barrier
// TotalAmountOfWork was properly accumulated
// Each thread has a local copy, the barrier does the reduction
// No need to use critical sections
OpenMP – “atomic” reduction

int TotalAmountOfWork = 0;
#pragma omp parallel for
for ( int i = 0; i < NumberOfIterations; i++ ) {
  // Threads scatter here
  int myWork = doSomeWork( i );
  #pragma omp atomic
  TotalAmountOfWork += myWork;
}
// Implicit barrier
// TotalAmountOfWork was properly accumulated
// However, the atomic section can cause thread stalls
OpenMP – critical

int TotalAmountOfWork = 0;
#pragma omp parallel for reduction ( + : TotalAmountOfWork )
for ( int i = 0; i < NumberOfIterations; i++ ) {
  // Threads scatter here
  // each thread has a private copy of i
  TotalAmountOfWork += doSomeWork( i );
  #pragma omp critical
  {
    // Executed by one thread at a time, e.g., a “Mutex lock”
    criticalOperation();
  }
}
// Implicit barrier
OpenMP – single

int TotalAmountOfWork = 0;
#pragma omp parallel for reduction ( + : TotalAmountOfWork )
for ( int i = 0; i < NumberOfIterations; i++ ) {
  // Threads scatter here
  // each thread has a private copy of i
  TotalAmountOfWork += doSomeWork( i );
  #pragma omp single nowait
  {
    // Executed by one thread; use “master” for the main thread
    reportProgress ( TotalAmountOfWork );
  }
  // !! No implicit barrier because of “nowait” clause !!
}
// Implicit barrier
Threading Building Blocks (TBB)
Introduction to TBB
Commercial and Open Source licenses (GPL with runtime exception)
Cross-platform C++ library, similar to STL
Usual concurrency classes
Several different constructs for threading: for, do, reduction, pipeline
Finer control over scheduling
Maintains a thread pool to execute tasks
http://www.threadingbuildingblocks.org/
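One practical note for the TBB versions contemporary with this talk: the thread pool is created through tbb::task_scheduler_init (later TBB/oneTBB releases initialize automatically and deprecate this class). A minimal sketch:

#include "tbb/task_scheduler_init.h"

int main() {
  // Creates the worker thread pool; the default picks one thread per core.
  tbb::task_scheduler_init init;      // or tbb::task_scheduler_init init(4);
  // ... tbb::parallel_for / tbb::parallel_reduce calls go here ...
  return 0;
}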
TBB – parallel for

#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"

class Worker {
 public:
  Worker ( /* ... */ ) {...};
  void operator() ( const tbb::blocked_range<int>& r ) const {
    for ( int i = r.begin(); i != r.end(); ++i ) {
      doWork ( i );
    }
  }
};
...
tbb::parallel_for ( tbb::blocked_range<int> ( 0, N ),
    Worker ( /* ... */ ), tbb::auto_partitioner() );
TBB – parallel reduction

#include "tbb/blocked_range.h"
#include "tbb/parallel_reduce.h"

class ReducingWorker {
  int mLocalWork;
 public:
  ReducingWorker ( /* ... */ ) {...};
  ReducingWorker ( ReducingWorker& o, split ) : mLocalWork(0) {};
  void join ( const ReducingWorker& o ) { mLocalWork += o.mLocalWork; };
  void operator() ( const tbb::blocked_range<int>& r ) { ... }
};
...
ReducingWorker w;
tbb::parallel_reduce ( tbb::blocked_range<int> ( 0, N ),
    w, tbb::auto_partitioner() );
w.getLocalWork();
TBB – parallel reduction (diagram)
TBB – synchronization

tbb::spin_mutex MyMutex;
void doWork ( /* ... */ ) {
  // Enter critical section, exit when the lock goes out of scope
  tbb::spin_mutex::scoped_lock lock ( MyMutex );
  // NB: This is an error!!! The unnamed temporary is destroyed immediately:
  // tbb::spin_mutex::scoped_lock( MyMutex );
}
...
#include <tbb/atomic.h>
tbb::atomic<int> MyCounter;
...
MyCounter = 0;       // Atomic
int i = MyCounter;   // Atomic
MyCounter++; MyCounter--; ++MyCounter; --MyCounter;  // Atomic
...
MyCounter = 0; MyCounter += 2;  // Watch out for other threads between these statements!
ITK Model
ITK Implementation
Threads operate across slices – the only implemented behavior in ITK
itk::MultiThreader is somewhat flexible
  Requires that you break the ITK model
Uses thread join, higher overhead
No thread pool
Comparison

Threads (C/C++)
  + Fine-grain control
  - Not cross-platform
  - Few constructs

Language specific (Java)
  + Fine-grain control
  + Cross-platform easy(?)
  + Many constructs
  +/- Language-specific

ITK
  + Integrated
  + Simple
  - Limited control
  +/- ITK only

TBB
  +/- More complex
  + Fine-grain control
  + Intel (-?)
  + Open Source
  + Some constructs
  - Must re-write code

OpenMP
  + Simple
  + Adapt existing code
  +/- Industry standard
  +/- Compiler support
  - Coarse-grain control
Medical Imaging
Image class

class Image {
  public:
    short* mData;
    int mWidth, mHeight, mDepth;
    int mVoxelsPerSlice;
    int mVoxelsPerVolume;
    short** mSlicePointers; // Pointers to the start of each slice
    short getVoxel ( int x, int y, int z ) {...}
    void setVoxel ( int x, int y, int z, short v ) {...}
};
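The accessor bodies are elided on the slide; one plausible implementation, assuming x varies fastest, then y, then z (slice-by-slice storage with mVoxelsPerSlice == mWidth * mHeight), would be:

// Hypothetical bodies for the elided accessors; the layout is an assumption.
short getVoxel ( int x, int y, int z ) {
  return mData[ z * mVoxelsPerSlice + y * mWidth + x ];
}
void setVoxel ( int x, int y, int z, short v ) {
  mData[ z * mVoxelsPerSlice + y * mWidth + x ] = v;
}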
Trivial problem – threshold
Threshold an image: if intensity > 100, output 1; otherwise output 0
Present from simple to complex: OpenMP, TBB, ITK, pthread (see extra slides)
Threshold – OpenMP #1

void doThreshold ( Image* in, Image* out ) {
#pragma omp parallel for
  for ( int z = 0; z < in->mDepth; z++ ) {
    for ( int y = 0; y < in->mHeight; y++ ) {
      for ( int x = 0; x < in->mWidth; x++ ) {
        if ( in->getVoxel(x,y,z) > 100 ) {
          out->setVoxel(x,y,z,1);
        } else {
          out->setVoxel(x,y,z,0);
        }
      }
    }
  }
}
// NB: can loop over slices, rows or columns by moving the
// pragma, but must choose at compile time
Threshold – OpenMP #2

void doThreshold ( Image* in, Image* out ) {
#pragma omp parallel for
  for ( int s = 0; s < in->mVoxelsPerVolume; s++ ) {
    if ( in->mData[s] > 100 ) {
      out->mData[s] = 1;
    } else {
      out->mData[s] = 0;
    }
  }
}
// Likely a lot faster than the previous code
Threshold – TBB #1

class Threshold {
  public:
    Threshold ( Image* i, Image* o ) : in ( i ), out ( o ) {...}
    void operator() ( const tbb::blocked_range<int>& r ) const {
      for ( int x = r.begin(); x != r.end(); ++x ) {
        if ( in->mData[x] > 100 ) {
          out->mData[x] = 1;
        } else {
          out->mData[x] = 0;
        }
      }
    }
};
...
parallel_for ( tbb::blocked_range<int>( 0, in->mVoxelsPerVolume ),
    Threshold ( in, out ), auto_partitioner() );
// NB: default “grain size” for blocked_range is 1 pixel
// tbb::blocked_range<int>(..., in->mVoxelsPerVolume / NumberOfCPUs )
Threshold – TBB #2

class Threshold {
  public:
    Threshold ( Image* i, Image* o ) : in ( i ), out ( o ) {...}
    void operator() ( const tbb::blocked_range<int>& r ) const {...}
    void operator() ( const tbb::blocked_range2d<int,int>& r ) const {
      for ( int z = 0; z < in->mDepth; z++ ) {
        for ( int y = r.rows().begin(); y != r.rows().end(); y++ ) {
          for ( int x = r.cols().begin(); x != r.cols().end(); x++ ) {
            if ( in->getVoxel(x,y,z) > 100 ) {
              out->setVoxel(x,y,z,1);
            } else {
              out->setVoxel(x,y,z,0);
            }
          }
        }
      }
    }
};
...
parallel_for ( tbb::blocked_range2d<int,int>( 0, in->mHeight, 32,
                                              0, in->mWidth,  32 ),
    Threshold ( in, out ), auto_partitioner() );
Threshold – TBB #3

class Threshold {
  public:
    Threshold ( Image* i, Image* o ) : in ( i ), out ( o ) {...}
    void operator() ( const tbb::blocked_range<int>& r ) const {...}
    void operator() ( const tbb::blocked_range2d<int,int>& r ) const {...}
    void operator() ( const tbb::blocked_range3d<int,int,int>& r ) const {
      for ( int z = r.pages().begin(); z != r.pages().end(); z++ ) {
        for ( int y = r.rows().begin(); y != r.rows().end(); y++ ) {
          for ( int x = r.cols().begin(); x != r.cols().end(); x++ ) {
            if ( in->getVoxel(x,y,z) > 100 ) {
              out->setVoxel(x,y,z,1);
            } else {
              out->setVoxel(x,y,z,0);
            }
          }
        }
      }
    }
};
...
parallel_for ( tbb::blocked_range3d<int,int,int>( 0, in->mDepth,  1,
                                                  0, in->mHeight, 32,
                                                  0, in->mWidth,  32 ),
    Threshold ( in, out ), auto_partitioner() );
Threshold – ITK solution

ThreadedGenerateData( const OutputImageRegionType out, int threadId )
{
  ...
  // Define the iterators
  ImageRegionConstIterator<TIn> inputIt(inputPtr, out);
  ImageRegionIterator<TOut> outputIt(outputPtr, out);
  inputIt.GoToBegin();
  outputIt.GoToBegin();
  while( !inputIt.IsAtEnd() )
    {
    if ( inputIt.Get() > 100 ) {
      outputIt.Set ( 1 );
    } else {
      outputIt.Set ( 0 );
    }
    ++inputIt;
    ++outputIt;
    }
}
Interesting problem – anisotropic diffusion
Edge-preserving smoothing method
  Perona and Malik. Scale-space and edge detection using anisotropic diffusion. IEEE Transactions on Pattern Analysis and Machine Intelligence (1990) vol. 12 (7) pp. 629-639
Iterative process
Demonstrate OpenMP and TBB
  (ITK has an implementation)
  (pthreads are tedious at the very least)
Pop quiz – are the following correct?
Anisotropic diffusion – OpenMP #1

void doAD ( Image* in, Image* out ) {
#pragma omp parallel for
  for ( int t = 0; t < TotalTime; t++ ) {
    for ( int z = 0; z < in->mDepth; z++ ) {
      ...
    }
  }
}
Anisotropic diffusion – OpenMP #2

void doAD ( Image* in, Image* out ) {
  short *previousSlice, *slice, *nextSlice;
  for ( int t = 0; t < TotalTime; t++ ) {
#pragma omp parallel for
    for ( int z = 1; z < in->mDepth-1; z++ ) {
      previousSlice = in->mSlicePointers[z-1];
      slice = in->mSlicePointers[z];
      nextSlice = in->mSlicePointers[z+1];
      for ( int y = 1; y < in->mHeight-1; y++ ) {
        short* previousRow = slice + (y-1) * in->mWidth;
        short* row = slice + y * in->mWidth;
        short* nextRow = slice + (y+1) * in->mWidth;
        short* aboveRow = previousSlice + y * in->mWidth;
        short* belowRow = nextSlice + y * in->mWidth;
        for ( int x = 1; x < in->mWidth-1; x++ ) {
          dx = 2 * row[x] - row[x-1] - row[x+1];
          dy = 2 * row[x] - previousRow[x] - nextRow[x];
          dz = 2 * row[x] - aboveRow[x] - belowRow[x];
          ...
Anisotropic diffusion – OpenMP #3

void doAD ( Image* in, Image* out ) {
  for ( int t = 0; t < TotalTime; t++ ) {
#pragma omp parallel for
    for ( int z = 1; z < in->mDepth-1; z++ ) {
      short* previousSlice = in->mSlicePointers[z-1];
      short* slice = in->mSlicePointers[z];
      short* nextSlice = in->mSlicePointers[z+1];
      for ( int y = 1; y < in->mHeight-1; y++ ) {
        short* previousRow = slice + (y-1) * in->mWidth;
        short* row = slice + y * in->mWidth;
        short* nextRow = slice + (y+1) * in->mWidth;
        short* aboveRow = previousSlice + y * in->mWidth;
        short* belowRow = nextSlice + y * in->mWidth;
        for ( int x = 1; x < in->mWidth-1; x++ ) {
          dx = 2 * row[x] - row[x-1] - row[x+1];
          dy = 2 * row[x] - previousRow[x] - nextRow[x];
          dz = 2 * row[x] - aboveRow[x] - belowRow[x];
          ...
Anisotropic diffusion – TBB #1

class doAD {
  public:
    static ADConstants* sConstants;
    doAD ( Image* in, Image* out ) { ... }
    void operator() ( const tbb::blocked_range3d<int,int,int>& r ) {
      if ( !sConstants == NULL ) { initConstants(); }
      // process
      ...
    }
};
Anisotropic diffusion – TBB #2

class doAD {
  public:
    doAD ( ... ) {...}
    void operator() ( const tbb::blocked_range3d<int,int,int>& r ) {
      for ( int z = r.pages().begin(); z != r.pages().end(); z++ ) {
        for ( int y = r.rows().begin(); y != r.rows().end(); y++ ) {
          for ( int x = r.cols().begin(); x != r.cols().end(); x++ ) {
            ...
};
...
parallel_for ( tbb::blocked_range3d<int,int,int>( 0, in->mDepth,
                                                  0, in->mHeight,
                                                  0, in->mWidth ),
    doAD ( in, out ), auto_partitioner() );
Anisotropic diffusion – TBB #3

class doAD {
  public:
    static tbb::atomic<int> sProgress;
    tbb::spin_mutex mMutex;
    doAD ( ... ) {...}
    void reportProgress ( int p ) { ... }
    void operator() ( const tbb::blocked_range3d<int,int,int>& r ) {
      for ( int z = r.pages().begin(); z != r.pages().end(); z++ ) {
        tbb::spin_mutex::scoped_lock lock ( mMutex );
        sProgress++;
        reportProgress ( sProgress );
        for ( int y = r.rows().begin(); y != r.rows().end(); y++ ) {
          for ( int x = r.cols().begin(); x != r.cols().end(); x++ ) {
            ...
};
...
doAD::sProgress = 0;
parallel_for (...);
Anisotropic diffusion – TBB #4

class doAD {
  public:
    static tbb::atomic<int> sProgress;
    static tbb::spin_mutex mMutex;
    doAD ( ... ) {...}
    void reportProgress ( int p ) { ... }
    void operator() ( const tbb::blocked_range3d<int,int,int>& r ) {
      for ( int z = r.pages().begin(); z != r.pages().end(); z++ ) {
        tbb::spin_mutex::scoped_lock lock ( mMutex );
        sProgress++;
        reportProgress ( sProgress );
        for ( int y = r.rows().begin(); y != r.rows().end(); y++ ) {
          for ( int x = r.cols().begin(); x != r.cols().end(); x++ ) {
            ...
};
...
doAD::sProgress = 0;
parallel_for (...);
Anisotropic diffusion – OpenMP (Progress)

void doAD ( Image* in, Image* out ) {
  int progress = 0;
  for ( int t = 0; t < TotalTime; t++ ) {
#pragma omp parallel for
    for ( int s = 0; s < in->mDepth; s++ ) {
      #pragma omp atomic
      progress++;
      #pragma omp single
      reportProgress ( progress );
      ...
    }
  }
}
Real-life problem
Compute Frangi’s vesselness measure
  Frangi et al. Model-based quantitation of 3-D magnetic resonance angiographic images. IEEE Transactions on Medical Imaging (1999) vol. 18 (10) pp. 946-956
Memory constrained solution
  ITK implementation requires 1.2G for a 100M volume
  Antiga. Generalizing vesselness with respect to dimensionality and shape. Insight Journal (2007)
Possible solutions using OpenMP, TBB
Vesselness
ITK Implementation – computing the Hessian
6 volumes computed in serial
Individual filters are threaded
Good CPU usage
High memory requirements
Design considerations
Break the problem into blocks
Compute Hessian, eigenvalues, and vesselness per block
Reduces memory requirements
Incurs overhead, boundary conditions
Design considerations: keep CPUs full
Design considerations – boundary condition
Trade-offs
Algorithm sketch – Serial

int BlockSize = 32;
for ( int z = 0; z < image->mDepth; z += BlockSize ) {
  for ( int y = 0; y < image->mHeight; y += BlockSize ) {
    for ( int x = 0; x < image->mWidth; x += BlockSize ) {
      processBlock ( in, out, x, y, z, BlockSize );
    }
  }
}
Algorithm sketch – OpenMP

int BlockSize = 32;
#pragma omp parallel for
for ( int z = 0; z < image->mDepth; z += BlockSize ) {
  for ( int y = 0; y < image->mHeight; y += BlockSize ) {
    for ( int x = 0; x < image->mWidth; x += BlockSize ) {
      processBlock ( in, out, x, y, z, BlockSize );
    }
  }
}

Each thread is on a different slice
May cause cache contention
Similar problems for the “y” direction
Algorithm sketch – OpenMP

int BlockSize = 32;
for ( int z = 0; z < image->mDepth; z += BlockSize ) {
  for ( int y = 0; y < image->mHeight; y += BlockSize ) {
#pragma omp parallel for
    for ( int x = 0; x < image->mWidth; x += BlockSize ) {
      processBlock ( in, out, x, y, z, BlockSize );
    }
  }
}

All threads on the same rows
May not utilize all CPUs if the ratio of Width to BlockSize < # CPUs
Better cache utilization
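If the compiler supports OpenMP 3.0, a third option (not on the slides) is to collapse the three block loops into one iteration space, so every block becomes a schedulable work item; this is only a sketch, reusing image, in, out and processBlock from the code above, and it assumes the loops stay in this simple canonical form.

int BlockSize = 32;
#pragma omp parallel for collapse(3)
for ( int z = 0; z < image->mDepth;  z += BlockSize ) {
  for ( int y = 0; y < image->mHeight; y += BlockSize ) {
    for ( int x = 0; x < image->mWidth;  x += BlockSize ) {
      processBlock ( in, out, x, y, z, BlockSize );
    }
  }
}
// Many small blocks per thread: good load balance; cache behaviour
// depends on how the runtime orders the collapsed iterations.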
Algorithm sketch – TBB

class Vesselness {
  public:
    void operator() ( const tbb::blocked_range3d<int,int,int>& r ) const {
      // Process the block, could use ITK here
      processBlock ( r.cols().begin(), r.rows().begin(), r.pages().begin(),
                     r.cols().size(),  r.rows().size(),  r.pages().size() );
    }
};
...
parallel_for ( tbb::blocked_range3d<int,int,int>(
                 0, in->mDepth,  32,
                 0, in->mHeight, 32,
                 0, in->mWidth,  32 ),
    Vesselness( in, out ), auto_partitioner() );

Individual blocks
Full CPUs
May not have best cache performance
Next steps
Go try parallel development
Try threads to gain understanding and insight
Next OpenMP, adapting existing code
TBB: more constructs, different approaches
Experiment with new languages: Erlang, Scala, Reia, Chapel, X10, Fortress...
Check out some of the resources provided
Have fun! It’s a brave new world out there...
Resources
TBB (http://www.threadingbuildingblocks.org/)
OpenMP (http://openmp.org/wp/)
Books/Articles
  Java Concurrency in Practice (http://www.javaconcurrencyinpractice.com/)
  Parallel Programming (http://www-users.cs.umn.edu/~karypis/parbook/)
  ITK Software Guide (http://www.itk.org/ItkSoftwareGuide.pdf)
  The Problem with Threads (http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-1.pdf)
Tutorials
  Parallel Programming (https://computing.llnl.gov/tutorials/parallel_comp/)
  pthreads (https://computing.llnl.gov/tutorials/pthreads/)
  OpenMP (https://computing.llnl.gov/tutorials/openMP/)
Other
  LLNL (https://computing.llnl.gov/)
  Erlang (http://en.wikipedia.org/wiki/Erlang_programming_language)
  GCC-OpenMP (http://gcc.gnu.org/projects/gomp/)
  Intel Compiler (http://software.intel.com/en-us/intel-compilers/)
Resources
Languages
  Erlang (http://www.erlang.org/)
  Scala (http://www.scala-lang.org/)
  Chapel (http://chapel.cs.washington.edu/)
  X10 (http://x10-lang.org/)
  Unified Parallel C (http://upc.gwu.edu/)
  Titanium (http://titanium.cs.berkeley.edu/)
  Co-Array Fortran (http://www.co-array.org/)
  ZPL (http://www.cs.washington.edu/research/zpl/home/index.html)
  High Performance Fortran (http://hpff.rice.edu/)
  Fortress (http://projectfortress.sun.com/Projects/Community/)
  Others (http://www.google.com/search?q=parallel+programming+language)
Medical image processing strategies for multi-core CPUs
Daniel Blezek, Mayo Clinic
blezek.daniel@mayo.edu
Thread construction – pthread example (API)

#include <pthread.h>

void *(*start_routine)(void *);

int pthread_create(pthread_t *restrict thread,
                   const pthread_attr_t *restrict attr,
                   void *(*start_routine)(void *),
                   void *restrict arg);

void pthread_exit(void *value_ptr);

int pthread_join(pthread_t thread, void **value_ptr);
Mutex – pthread example

#include <pthread.h>

pthread_mutex_t myMutex;
...
pthread_mutex_init ( &myMutex, NULL );
...
pthread_mutex_lock ( &myMutex );
// Critical Section, only one thread at a time
...
pthread_mutex_unlock ( &myMutex );
...
if ( pthread_mutex_trylock ( &myMutex ) == 0 ) {
  // We did get the lock, so we are in the critical section
  ...
  pthread_mutex_unlock ( &myMutex );
}
Mutex – Java example

import java.lang.*;

class Foo {
  public synchronized int doWork () {
    // only one thread can execute doWork
  }

  Object resource;

  public int otherWork () {
    synchronized ( resource ) {
      // critical section, resource is the mutex
      ...
    }
  }
}


Threshold – pthread

struct Work { Image* in; Image* out; int start; int end; };
Work workArray[THREADCOUNT];
pthread_t thread[THREADCOUNT];

void* doThreshold ( void* inWork ) {
  Work* work = (Work*) inWork;
  for ( int s = work->start; s < work->end; s++ ) {...}
}
...
pthread_attr_t attributes;
pthread_attr_init ( &attributes );
pthread_attr_setdetachstate ( &attributes, PTHREAD_CREATE_JOINABLE );
for ( int t = 0; t < THREADCOUNT; t++ ) {
  initializeWork ( in, out, t, workArray[t] );
  pthread_create ( &thread[t], &attributes, doThreshold, (void*) &workArray[t] );
}
for ( int t = 0; t < THREADCOUNT; t++ ) {
  pthread_join ( thread[t], NULL );
}
Semaphore
Allow N threads access
Protects limited resources
Binary semaphore: N = 1, equivalent to Mutex
  • 94. ITK – itk::MultiTheader87#include <itkMultiThreader.h>// Win32DWORD doWork ( LPVOID lpThreadParameter );// Pthread - Linux, Mac, Unixvoid* doWork ( void* inWork );itk::MultiThreader::Pointerthreader = itk::MultiThreader::New();threader->SetNumberOfThreads ( NumberOfThreads );for ( int i = 0; i < NumberOfThreads; i++ ) {threader->SetMultipleMethod ( i, doWork, (void*) work[i] );}// Explicit barrier, waits for Thread jointhreader->MultipleMethodExecute();
Insight Toolkit

#include <itkImageToImageFilter.h>

template <In, Out> Worker : public ImageToImageFilter<In, Out> {
...
  void BeforeThreadedGenerateData() {
    // Master thread only
    ...
  }
  void ThreadedGenerateData( const OutputImageRegionType &r, int tid ) {
    // Generate output data for r
    ...
  }
  void AfterThreadedGenerateData() {
    // Master thread only
    ...
  }
};
// Output split on last dimension
// i.e. slices for 3D volumes
Anisotropic diffusion – OpenMP

void doAD ( Image* in, Image* out ) {
  for ( int t = 0; t < TotalTime; t++ ) {
#pragma omp parallel for
    for ( int slice = 0; slice < in->mDepth; slice++ ) {
      ...
    }
  }
}

Editor's Notes

  • #3: If I had asked this question 5 years ago, almost no one would have raised their hand.
  • #5: Driving is inherently a parallel task, we coordinate at stop signs, stop lights, we obey the rules of the road, but we can get deadlocked (grid lock).