Medical image processing strategies for multi-core CPUs
Daniel Blezek, Mayo Clinic
blezek.daniel@mayo.edu
Poll
Does your primary computer have more than one core...?
Have you ever written parallel code?
It’s a parallel world...
SMP formerly was the domain of researchers
Thanks to Intel, now it’s everywhere!
... but most of us think in serial ...
Hardware has far outstripped software
Developers are not trained
Development of parallel software is difficult
Think outside the box (Erlang, Scala) ... or shoehorn parallelism into the languages we already have
Parallel Computing – according to Google
“parallel computing”: 1.4M hits on Google
“multithreading”: 10M hits
“multicore”: 2.4M hits
“parallel programming”: 1.1M hits
Why is it so hard?
  the world is parallel
  we all think in parallel (driving, for example)
  yet we are taught to program in serial
Degrees of parallelism (my take)
Serial – SISD, single thread of execution
Data parallel – SIMD (fine-grained parallelism)
Embarrassingly parallel – larger scale SIMD
  CT or MR reconstruction: each operation is independent, e.g. iFFT of slices
  Worker thread – e.g. virus scanning software
Coarse-grained parallelism – SMP or MIMD
  Focus of this presentation, more in GPU talk
  Concurrency, OpenMP, TBB, pthreads/Winthreads
Large scale – MPI on cluster, tight coupling
Large scale – Grid computing, loose coupling
Pragmatic approach
C/C++ and Fortran are the kings of performance
(I’ve never written a single line of Fortran, so don’t ask)
“Bolted on” parallel concepts
Zero language support
Huge existing codebase
Pragmatic approach
Briefly touch on SIMD
Introduce SMP concepts: threads, concurrency
Development models: pthreads/WinThreads, OpenMP, TBB, ITK
Medical image processing: example problems, common errors
Next steps
SIMD
SIMD – basic principles
http://en.wikipedia.org/wiki/SIMD
Data structures for SIMD
Array of Structures

struct Vec {
  float x, y, z;
};
Vec[] points = new Vec[sz];

[diagram: packing X/Y/Z components into a SIMD register and unpacking the result]
Data structures for SIMD
Structure of Arrays

struct Vec {
  float[] x;
  float[] y;
  float[] z;
  Vec ( int sz ) {
    x = new float[sz];
    y = new float[sz];
    z = new float[sz];
  };
};

Structure of Arrays

struct Vec {
  Vector4f[] v;
  Vec ( int sz ) {
    // must be word aligned
    v = new Vector4f[sz];
  };
};
SIMD pitfalls
Structure alignment: usually needs to be aligned on a word boundary
Structure considerations: may need to refactor existing code/structures
Generally not cross-platform: MMX, 3D Now!, SSE, SSE2, SSE4, AltiVec, AVX, etc...
Performance gains are modest: 2x – 4x common
Limited instructions: add, multiply, divide, round; not suitable for branching logic
Autovectorizing compilers handle simple loops: -ftree-vectorize (GCC), -fast, -O2 or -O3 (Intel Compiler)
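As a concrete illustration of the kind of loop an autovectorizing compiler can handle, here is a minimal sketch (the function and array names are hypothetical, not from the slides); with GCC, -O2 -ftree-vectorize or -O3 will typically turn this into SSE instructions.

// Element-wise operation over contiguous, independent data:
// no branches, no loop-carried dependence, unit-stride access.
// __restrict__ (a GCC spelling) tells the compiler the arrays do not alias,
// which makes vectorization easier to prove.
void scaleAndOffset(float* __restrict__ out, const float* __restrict__ in,
                    float a, float b, int n) {
  for (int i = 0; i < n; i++) {
    out[i] = a * in[i] + b;   // candidate for SIMD: 4 floats per SSE register
  }
}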
Threads
Threads – they’re everywhere
SMP concepts
Useful to think in terms of “cores”: 2 dual-core CPUs = 4 “cores”
Cores share main memory, may share cache
Threads in the same process share memory
Generally, one executing thread per core; other threads sleeping
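A quick way to see how many cores the runtime exposes is sketched below; this uses the standard OpenMP query functions (compile with -fopenmp or the equivalent), and the count reported depends on the OS and on hyper-threading.

#include <omp.h>
#include <cstdio>

int main() {
  // Logical processors the OpenMP runtime can see.
  printf("Logical processors: %d\n", omp_get_num_procs());
  #pragma omp parallel
  {
    #pragma omp master
    printf("Threads in this parallel region: %d\n", omp_get_num_threads());
  }
  return 0;
}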
Cores – they’re everywhere
How many cores does your laptop have?
Mine has 50(!)
  2 Intel CPU (Core 2 Duo)
  32 nVidia cores (9600M GT)
  16 nVidia cores (9400M)
Parallel concepts for SMP
Process
  Started by the OS
  Single thread executes “main”
  No direct access to memory of other processes
Threads
  Stream of execution under a process
  Access to memory in containing process
  Private memory
  Lifetime may be less than main thread
Concurrency
  Coordination between threads
  High level (mutex, locks, barriers)
  Low level (atomic operations)
Processes & Threads
[diagram: a process containing several threads]
Thread construction – pthread example

#include <pthread.h>

// Thread work function, must return pointer to void
void *doWork(void *work) {
  // Do work
  return work; // equivalent to pthread_exit ( myWork );
}
...
pthread_t child;
...
rc = pthread_create(&child, &attr, doWork, (void *)work);
...
rc = pthread_join ( child, &threadwork );
...
Thread construction – Win32 example

#include <windows.h>

DWORD WINAPI doWork( LPVOID work ) {};
...
PMYDATA work;
DWORD   childID;
HANDLE  child;
child = CreateThread(
        NULL,          // default security attributes
        0,             // use default stack size
        doWork,        // thread function name
        work,          // argument to thread function
        0,             // use default creation flags
        &childID);     // returns the thread identifier
WaitForMultipleObjects(NThreads, child, TRUE, INFINITE);
Thread construction – Java example

import java.lang.Thread;

class Worker implements Runnable {
  public Worker ( Work work ) {};
  public void run() {}; // Do work here
}
...
Worker worker = new Worker ( someWork );
new Thread ( worker ).start();
Race Conditions
[diagram: serial vs. parallel execution of the same updates; the parallel interleaving causes a problem]
Mutex
Mutex – Mutual exclusion lock
  Protects a section of code
  Only one thread has a lock on the object
  Threads may wait for the mutex, or return a status if the mutex is locked
Semaphore
  N threads
Critical Section
  One thread executes code
  Protects global resources
  Maintain consistent state
Race Conditions

...
N = 0;
...
// Start some threads
...
void* doWork() {
  N++; // get, incr, store – not a single step, so threads can interleave
}

Solution w/ Mutex:
Mutex mutex;
mutex.lock();
N++;
mutex.release();
Atomic operations
Locks are not perfect: cause blocking, relatively heavy-weight
Atomic operations
  Simple operations
  Hardware support
  Can implement w/ Mutex
Conditions
  Invisibility – no other thread knows about the change
  Atomicity – if operation fails, return to original state
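To make the comparison concrete, here is a sketch of one shared counter incremented three ways: with a mutex, with a GCC/Intel __sync builtin, and with tbb::atomic (which appears later in the talk). The builtin spelling is compiler-specific; treat the whole block as an illustrative sketch rather than the talk's own code.

#include <pthread.h>
#include <tbb/atomic.h>

long counter = 0;
pthread_mutex_t counterMutex = PTHREAD_MUTEX_INITIALIZER;
tbb::atomic<long> tbbCounter;

void incrementWithMutex() {           // heavier: other threads may block
  pthread_mutex_lock(&counterMutex);
  counter++;
  pthread_mutex_unlock(&counterMutex);
}

void incrementWithBuiltin() {         // lock-free, hardware-supported
  __sync_fetch_and_add(&counter, 1);  // GCC/Intel builtin
}

void incrementWithTBB() {             // portable wrapper over the same idea
  tbbCounter++;
}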
Deadlock
[diagram: two threads, each holding one of Mutex A and Mutex B and waiting for the other]
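A minimal sketch of how the classic two-mutex deadlock arises (the worker names are hypothetical): each thread holds one lock and waits forever for the other. Taking the locks in the same global order in both threads avoids it.

#include <pthread.h>

pthread_mutex_t mutexA = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t mutexB = PTHREAD_MUTEX_INITIALIZER;

void* worker1(void*) {
  pthread_mutex_lock(&mutexA);   // holds A...
  pthread_mutex_lock(&mutexB);   // ...waits for B
  /* work */
  pthread_mutex_unlock(&mutexB);
  pthread_mutex_unlock(&mutexA);
  return 0;
}

void* worker2(void*) {
  pthread_mutex_lock(&mutexB);   // holds B...
  pthread_mutex_lock(&mutexA);   // ...waits for A -> deadlock
  /* work */
  pthread_mutex_unlock(&mutexA);
  pthread_mutex_unlock(&mutexB);
  return 0;
}
// Fix: acquire mutexA before mutexB in *both* threads.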
Thread synchronization – barrier
Initialized with the number of threads expected
Threads signal when they are ready
Wait until all expected threads are there
A stalled or dead thread can stall all the threads
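A hedged pthread sketch of the barrier idea (pthread_barrier_t is part of POSIX but is not available on every platform, e.g. older macOS): each worker blocks at the barrier until the expected number of threads have arrived.

#include <pthread.h>

pthread_barrier_t barrier;

void* worker(void*) {
  // ... phase 1 work ...
  pthread_barrier_wait(&barrier);   // block until all expected threads arrive
  // ... phase 2 work: safe to read other threads' phase-1 results ...
  return 0;
}

// In main, before starting NThreads workers:
//   pthread_barrier_init(&barrier, NULL, NThreads);
// and after joining them:
//   pthread_barrier_destroy(&barrier);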
Thread synchronization – Condition variables
Workers atomically release the mutex and wait
Master atomically releases the mutex and signals
Workers wake up and acquire the mutex
[diagram: workers waiting on a condition variable protected by Mutex A while the master works, then signals]
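The same handshake as a minimal pthread sketch (the haveWork flag is a hypothetical predicate, not from the slides): pthread_cond_wait atomically releases the mutex while waiting and reacquires it on wake-up, and the while loop guards against spurious wake-ups.

#include <pthread.h>

pthread_mutex_t mutex    = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  workReady = PTHREAD_COND_INITIALIZER;
bool haveWork = false;               // predicate protected by the mutex

void* worker(void*) {
  pthread_mutex_lock(&mutex);
  while (!haveWork) {                          // re-check: wake-ups can be spurious
    pthread_cond_wait(&workReady, &mutex);     // releases mutex while waiting
  }
  haveWork = false;
  pthread_mutex_unlock(&mutex);
  // ... do the work ...
  return 0;
}

void postWork() {                    // called by the master thread
  pthread_mutex_lock(&mutex);
  haveWork = true;
  pthread_cond_signal(&workReady);   // wake one waiting worker
  pthread_mutex_unlock(&mutex);
}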
Thread pool & Futures
Maintains a “pool” of worker threads
Work queued until a thread is available
Optionally notify through a “Future”
Future can query status, holds return value
Thread returns to pool, no startup overhead
Core concept for OpenMP and TBB
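The future idea can be sketched with C++11's std::async, which is newer than the toolchains discussed in this talk but captures the concept: work is handed off to another thread, and the future is queried later for the result.

#include <future>
#include <cstdio>

int doSomeWork(int i) { return i * i; }   // hypothetical work function

int main() {
  // Hand the work off; it runs on another thread.
  std::future<int> result = std::async(std::launch::async, doSomeWork, 21);
  // ... the caller is free to do other things here ...
  printf("result = %d\n", result.get());  // blocks until the value is ready
  return 0;
}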
OpenMP
Introduction to OpenMP
Scatter / gather paradigm
Maintains a thread pool
Requires compiler support: Visual C++, gcc 4.0, Intel Compiler
Easy to adapt existing serial code, easy to debug
Simple paradigm
OpenMP – simple parallel sections

#pragma omp parallel sections num_threads ( 5 )
{
  // 5 threads scatter here
  #pragma omp section
  { /* Do task 1 */ }
  #pragma omp section
  { /* Do task 2 */ }
  ...
  #pragma omp section
  { /* Do task N */ }
  // Implicit barrier
}
OpenMP – parallel for

#pragma omp parallel for
for ( int i = 0; i < NumberOfIterations; i++ ) {
  // Threads scatter here
  // each thread has a private copy of i
  doSomeWork( i );
}
// Implicit barrier

// Scheduling the iterations is controlled with the schedule clause (see the sketch below)
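A minimal sketch of the schedule clause, using the same hypothetical doSomeWork and NumberOfIterations as above (the chunk size of 16 is arbitrary):

#include <omp.h>

void doAllWork(int NumberOfIterations) {
  // Iterations are handed out in chunks of 16 as threads become free;
  // useful when doSomeWork(i) takes very different times for different i.
  #pragma omp parallel for schedule(dynamic, 16)
  for (int i = 0; i < NumberOfIterations; i++) {
    doSomeWork(i);
  }
  // schedule(static), the common default, pre-assigns equal contiguous
  // chunks instead: lower overhead when all iterations cost about the same.
}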
OpenMP – reduction

int TotalAmountOfWork = 0;
#pragma omp parallel for reduction ( + : TotalAmountOfWork )
for ( int i = 0; i < NumberOfIterations; i++ ) {
  // Threads scatter here
  // each thread has a private copy of i & TotalAmountOfWork
  TotalAmountOfWork += doSomeWork( i );
}
// Implicit barrier
// TotalAmountOfWork was properly accumulated
// Each thread has a local copy, the barrier does the reduction
// No need to use critical sections
OpenMP – “atomic” reduction

int TotalAmountOfWork = 0;
#pragma omp parallel for
for ( int i = 0; i < NumberOfIterations; i++ ) {
  // Threads scatter here
  int myWork = doSomeWork( i );
  #pragma omp atomic
  TotalAmountOfWork += myWork;
}
// Implicit barrier
// TotalAmountOfWork was properly accumulated
// However, the atomic section can cause thread stalls
OpenMP – critical

int TotalAmountOfWork = 0;
#pragma omp parallel for reduction ( + : TotalAmountOfWork )
for ( int i = 0; i < NumberOfIterations; i++ ) {
  // Threads scatter here
  // each thread has a private copy of i
  TotalAmountOfWork += doSomeWork( i );
  #pragma omp critical
  {
    // Executed by one thread at a time, e.g., a “Mutex lock”
    criticalOperation();
  }
}
// Implicit barrier
OpenMP – single

int TotalAmountOfWork = 0;
#pragma omp parallel for reduction ( + : TotalAmountOfWork )
for ( int i = 0; i < NumberOfIterations; i++ ) {
  // Threads scatter here
  // each thread has a private copy of i
  TotalAmountOfWork += doSomeWork( i );
  #pragma omp single nowait
  {
    // Executed by one thread; use “master” for the main thread
    reportProgress ( TotalAmountOfWork );
  }
  // !! No implicit barrier because of “nowait” clause !!
}
// Implicit barrier
Threading Building Blocks (TBB)
Introduction to TBB
Commercial and Open Source licenses (GPL with runtime exception)
Cross-platform C++ library, similar to STL
Usual concurrency classes
Several different constructs for threading: for, do, reduction, pipeline
Finer control over scheduling
Maintains a thread pool to execute tasks
http://www.threadingbuildingblocks.org/
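One practical note for the TBB versions contemporary with this talk: the thread pool is created through tbb::task_scheduler_init (later TBB/oneTBB releases initialize automatically and deprecate this class). A minimal sketch:

#include "tbb/task_scheduler_init.h"

int main() {
  // Creates the worker thread pool; the default picks one thread per core.
  tbb::task_scheduler_init init;      // or tbb::task_scheduler_init init(4);
  // ... tbb::parallel_for / tbb::parallel_reduce calls go here ...
  return 0;
}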
TBB – parallel for

#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"

class Worker {
 public:
  Worker ( /* ... */ ) {...};
  void operator() ( const tbb::blocked_range<int>& r ) const {
    for ( int i = r.begin(); i != r.end(); ++i ) {
      doWork ( i );
    }
  }
};
...
tbb::parallel_for ( tbb::blocked_range<int> ( 0, N ),
    Worker ( /* ... */ ), tbb::auto_partitioner() );
TBB – parallel reduction

#include "tbb/blocked_range.h"
#include "tbb/parallel_reduce.h"

class ReducingWorker {
  int mLocalWork;
 public:
  ReducingWorker ( /* ... */ ) {...};
  ReducingWorker ( ReducingWorker& o, split ) : mLocalWork(0) {};
  void join ( const ReducingWorker& o ) { mLocalWork += o.mLocalWork; };
  void operator() ( const tbb::blocked_range<int>& r ) { ... }
};
...
ReducingWorker w;
tbb::parallel_reduce ( tbb::blocked_range<int> ( 0, N ),
    w, tbb::auto_partitioner() );
w.getLocalWork();
TBB – parallel reduction (diagram)
TBB – synchronization

tbb::spin_mutex MyMutex;
void doWork ( /* ... */ ) {
  // Enter critical section, exit when the lock goes out of scope
  tbb::spin_mutex::scoped_lock lock ( MyMutex );
  // NB: This is an error!!! The unnamed temporary is destroyed immediately:
  // tbb::spin_mutex::scoped_lock( MyMutex );
}
...
#include <tbb/atomic.h>
tbb::atomic<int> MyCounter;
...
MyCounter = 0;       // Atomic
int i = MyCounter;   // Atomic
MyCounter++; MyCounter--; ++MyCounter; --MyCounter;  // Atomic
...
MyCounter = 0; MyCounter += 2;  // Watch out for other threads between these statements!
ITK Model
ITK Implementation
Threads operate across slices – the only implemented behavior in ITK
itk::MultiThreader is somewhat flexible
  Requires that you break the ITK model
Uses thread join, higher overhead
No thread pool
Comparison

Threads (C/C++)
  + Fine-grain control
  - Not cross-platform
  - Few constructs

Language specific (Java)
  + Fine-grain control
  + Cross-platform easy(?)
  + Many constructs
  +/- Language-specific

ITK
  + Integrated
  + Simple
  - Limited control
  +/- ITK only

TBB
  +/- More complex
  + Fine-grain control
  + Intel (-?)
  + Open Source
  + Some constructs
  - Must re-write code

OpenMP
  + Simple
  + Adapt existing code
  +/- Industry standard
  +/- Compiler support
  - Coarse-grain control
Medical Imaging
Image class

class Image {
  public:
    short* mData;
    int mWidth, mHeight, mDepth;
    int mVoxelsPerSlice;
    int mVoxelsPerVolume;
    short** mSlicePointers; // Pointers to the start of each slice
    short getVoxel ( int x, int y, int z ) {...}
    void setVoxel ( int x, int y, int z, short v ) {...}
};
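The accessor bodies are elided on the slide; one plausible implementation, assuming x varies fastest, then y, then z (slice-by-slice storage with mVoxelsPerSlice == mWidth * mHeight), would be:

// Hypothetical bodies for the elided accessors; the layout is an assumption.
short getVoxel ( int x, int y, int z ) {
  return mData[ z * mVoxelsPerSlice + y * mWidth + x ];
}
void setVoxel ( int x, int y, int z, short v ) {
  mData[ z * mVoxelsPerSlice + y * mWidth + x ] = v;
}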
Trivial problem – threshold
Threshold an image: if intensity > 100, output 1; otherwise output 0
Present from simple to complex: OpenMP, TBB, ITK, pthread (see extra slides)
Threshold – OpenMP #1

void doThreshold ( Image* in, Image* out ) {
#pragma omp parallel for
  for ( int z = 0; z < in->mDepth; z++ ) {
    for ( int y = 0; y < in->mHeight; y++ ) {
      for ( int x = 0; x < in->mWidth; x++ ) {
        if ( in->getVoxel(x,y,z) > 100 ) {
          out->setVoxel(x,y,z,1);
        } else {
          out->setVoxel(x,y,z,0);
        }
      }
    }
  }
}
// NB: can loop over slices, rows or columns by moving the
// pragma, but must choose at compile time
Threshold – OpenMP #2

void doThreshold ( Image* in, Image* out ) {
#pragma omp parallel for
  for ( int s = 0; s < in->mVoxelsPerVolume; s++ ) {
    if ( in->mData[s] > 100 ) {
      out->mData[s] = 1;
    } else {
      out->mData[s] = 0;
    }
  }
}
// Likely a lot faster than the previous code
Threshold – TBB #1

class Threshold {
  public:
    Threshold ( Image* i, Image* o ) : in ( i ), out ( o ) {...}
    void operator() ( const tbb::blocked_range<int>& r ) const {
      for ( int x = r.begin(); x != r.end(); ++x ) {
        if ( in->mData[x] > 100 ) {
          out->mData[x] = 1;
        } else {
          out->mData[x] = 0;
        }
      }
    }
};
...
parallel_for ( tbb::blocked_range<int>( 0, in->mVoxelsPerVolume ),
    Threshold ( in, out ), auto_partitioner() );
// NB: default “grain size” for blocked_range is 1 pixel
// tbb::blocked_range<int>(..., in->mVoxelsPerVolume / NumberOfCPUs )
Threshold – TBB #2

class Threshold {
  public:
    Threshold ( Image* i, Image* o ) : in ( i ), out ( o ) {...}
    void operator() ( const tbb::blocked_range<int>& r ) const {...}
    void operator() ( const tbb::blocked_range2d<int,int>& r ) const {
      for ( int z = 0; z < in->mDepth; z++ ) {
        for ( int y = r.rows().begin(); y != r.rows().end(); y++ ) {
          for ( int x = r.cols().begin(); x != r.cols().end(); x++ ) {
            if ( in->getVoxel(x,y,z) > 100 ) {
              out->setVoxel(x,y,z,1);
            } else {
              out->setVoxel(x,y,z,0);
            }
          }
        }
      }
    }
};
...
parallel_for ( tbb::blocked_range2d<int,int>( 0, in->mHeight, 32,
                                              0, in->mWidth,  32 ),
    Threshold ( in, out ), auto_partitioner() );
Threshold – TBB #3

class Threshold {
  public:
    Threshold ( Image* i, Image* o ) : in ( i ), out ( o ) {...}
    void operator() ( const tbb::blocked_range<int>& r ) const {...}
    void operator() ( const tbb::blocked_range2d<int,int>& r ) const {...}
    void operator() ( const tbb::blocked_range3d<int,int,int>& r ) const {
      for ( int z = r.pages().begin(); z != r.pages().end(); z++ ) {
        for ( int y = r.rows().begin(); y != r.rows().end(); y++ ) {
          for ( int x = r.cols().begin(); x != r.cols().end(); x++ ) {
            if ( in->getVoxel(x,y,z) > 100 ) {
              out->setVoxel(x,y,z,1);
            } else {
              out->setVoxel(x,y,z,0);
            }
          }
        }
      }
    }
};
...
parallel_for ( tbb::blocked_range3d<int,int,int>( 0, in->mDepth,  1,
                                                  0, in->mHeight, 32,
                                                  0, in->mWidth,  32 ),
    Threshold ( in, out ), auto_partitioner() );
Threshold – ITK solution

ThreadedGenerateData( const OutputImageRegionType out, int threadId )
{
  ...
  // Define the iterators
  ImageRegionConstIterator<TIn> inputIt(inputPtr, out);
  ImageRegionIterator<TOut> outputIt(outputPtr, out);
  inputIt.GoToBegin();
  outputIt.GoToBegin();
  while( !inputIt.IsAtEnd() )
    {
    if ( inputIt.Get() > 100 ) {
      outputIt.Set ( 1 );
    } else {
      outputIt.Set ( 0 );
    }
    ++inputIt;
    ++outputIt;
    }
}
Interesting problem – anisotropic diffusion
Edge-preserving smoothing method
  Perona and Malik. Scale-space and edge detection using anisotropic diffusion. IEEE Transactions on Pattern Analysis and Machine Intelligence (1990) vol. 12 (7) pp. 629-639
Iterative process
Demonstrate OpenMP and TBB
  (ITK has an implementation)
  (pthreads are tedious at the very least)
Pop quiz – are the following correct?
Anisotropic diffusion – OpenMP #1

void doAD ( Image* in, Image* out ) {
#pragma omp parallel for
  for ( int t = 0; t < TotalTime; t++ ) {
    for ( int z = 0; z < in->mDepth; z++ ) {
      ...
    }
  }
}
Anisotropic diffusion – OpenMP #2

void doAD ( Image* in, Image* out ) {
  short *previousSlice, *slice, *nextSlice;
  for ( int t = 0; t < TotalTime; t++ ) {
#pragma omp parallel for
    for ( int z = 1; z < in->mDepth-1; z++ ) {
      previousSlice = in->mSlicePointers[z-1];
      slice = in->mSlicePointers[z];
      nextSlice = in->mSlicePointers[z+1];
      for ( int y = 1; y < in->mHeight-1; y++ ) {
        short* previousRow = slice + (y-1) * in->mWidth;
        short* row = slice + y * in->mWidth;
        short* nextRow = slice + (y+1) * in->mWidth;
        short* aboveRow = previousSlice + y * in->mWidth;
        short* belowRow = nextSlice + y * in->mWidth;
        for ( int x = 1; x < in->mWidth-1; x++ ) {
          dx = 2 * row[x] - row[x-1] - row[x+1];
          dy = 2 * row[x] - previousRow[x] - nextRow[x];
          dz = 2 * row[x] - aboveRow[x] - belowRow[x];
          ...
Anisotropic diffusion – OpenMP #3

void doAD ( Image* in, Image* out ) {
  for ( int t = 0; t < TotalTime; t++ ) {
#pragma omp parallel for
    for ( int z = 1; z < in->mDepth-1; z++ ) {
      short* previousSlice = in->mSlicePointers[z-1];
      short* slice = in->mSlicePointers[z];
      short* nextSlice = in->mSlicePointers[z+1];
      for ( int y = 1; y < in->mHeight-1; y++ ) {
        short* previousRow = slice + (y-1) * in->mWidth;
        short* row = slice + y * in->mWidth;
        short* nextRow = slice + (y+1) * in->mWidth;
        short* aboveRow = previousSlice + y * in->mWidth;
        short* belowRow = nextSlice + y * in->mWidth;
        for ( int x = 1; x < in->mWidth-1; x++ ) {
          dx = 2 * row[x] - row[x-1] - row[x+1];
          dy = 2 * row[x] - previousRow[x] - nextRow[x];
          dz = 2 * row[x] - aboveRow[x] - belowRow[x];
          ...
Anisotropic diffusion – TBB #1

class doAD {
  public:
    static ADConstants* sConstants;
    doAD ( Image* in, Image* out ) { ... }
    void operator() ( const tbb::blocked_range3d<int,int,int>& r ) {
      if ( !sConstants == NULL ) { initConstants(); }
      // process
      ...
    }
};
Anisotropic diffusion – TBB #2

class doAD {
  public:
    doAD ( ... ) {...}
    void operator() ( const tbb::blocked_range3d<int,int,int>& r ) {
      for ( int z = r.pages().begin(); z != r.pages().end(); z++ ) {
        for ( int y = r.rows().begin(); y != r.rows().end(); y++ ) {
          for ( int x = r.cols().begin(); x != r.cols().end(); x++ ) {
            ...
};
...
parallel_for ( tbb::blocked_range3d<int,int,int>( 0, in->mDepth,
                                                  0, in->mHeight,
                                                  0, in->mWidth ),
    doAD ( in, out ), auto_partitioner() );
Anisotropic diffusion – TBB #3

class doAD {
  public:
    static tbb::atomic<int> sProgress;
    tbb::spin_mutex mMutex;
    doAD ( ... ) {...}
    void reportProgress ( int p ) { ... }
    void operator() ( const tbb::blocked_range3d<int,int,int>& r ) {
      for ( int z = r.pages().begin(); z != r.pages().end(); z++ ) {
        tbb::spin_mutex::scoped_lock lock ( mMutex );
        sProgress++;
        reportProgress ( sProgress );
        for ( int y = r.rows().begin(); y != r.rows().end(); y++ ) {
          for ( int x = r.cols().begin(); x != r.cols().end(); x++ ) {
            ...
};
...
doAD::sProgress = 0;
parallel_for (...);
Anisotropic diffusion – TBB #4

class doAD {
  public:
    static tbb::atomic<int> sProgress;
    static tbb::spin_mutex mMutex;
    doAD ( ... ) {...}
    void reportProgress ( int p ) { ... }
    void operator() ( const tbb::blocked_range3d<int,int,int>& r ) {
      for ( int z = r.pages().begin(); z != r.pages().end(); z++ ) {
        tbb::spin_mutex::scoped_lock lock ( mMutex );
        sProgress++;
        reportProgress ( sProgress );
        for ( int y = r.rows().begin(); y != r.rows().end(); y++ ) {
          for ( int x = r.cols().begin(); x != r.cols().end(); x++ ) {
            ...
};
...
doAD::sProgress = 0;
parallel_for (...);
Anisotropic diffusion – OpenMP (Progress)

void doAD ( Image* in, Image* out ) {
  int progress = 0;
  for ( int t = 0; t < TotalTime; t++ ) {
#pragma omp parallel for
    for ( int s = 0; s < in->mDepth; s++ ) {
      #pragma omp atomic
      progress++;
      #pragma omp single
      reportProgress ( progress );
      ...
    }
  }
}
Real-life problem
Compute Frangi’s vesselness measure
  Frangi et al. Model-based quantitation of 3-D magnetic resonance angiographic images. IEEE Transactions on Medical Imaging (1999) vol. 18 (10) pp. 946-956
Memory constrained solution
  ITK implementation requires 1.2G for a 100M volume
  Antiga. Generalizing vesselness with respect to dimensionality and shape. Insight Journal (2007)
Possible solutions using OpenMP, TBB
Vesselness
ITK Implementation – computing the Hessian
6 volumes computed in serial
Individual filters are threaded
Good CPU usage
High memory requirements
Design considerations
Break the problem into blocks
Compute Hessian, eigenvalues, and vesselness per block
Reduces memory requirements
Incurs overhead, boundary conditions
Design considerations: keep CPUs full
Design considerations – boundary condition
Trade-offs
Algorithm sketch – Serial

int BlockSize = 32;
for ( int z = 0; z < image->mDepth; z += BlockSize ) {
  for ( int y = 0; y < image->mHeight; y += BlockSize ) {
    for ( int x = 0; x < image->mWidth; x += BlockSize ) {
      processBlock ( in, out, x, y, z, BlockSize );
    }
  }
}
Algorithm sketch – OpenMP

int BlockSize = 32;
#pragma omp parallel for
for ( int z = 0; z < image->mDepth; z += BlockSize ) {
  for ( int y = 0; y < image->mHeight; y += BlockSize ) {
    for ( int x = 0; x < image->mWidth; x += BlockSize ) {
      processBlock ( in, out, x, y, z, BlockSize );
    }
  }
}

Each thread is on a different slice
May cause cache contention
Similar problems for the “y” direction
Algorithm sketch – OpenMP

int BlockSize = 32;
for ( int z = 0; z < image->mDepth; z += BlockSize ) {
  for ( int y = 0; y < image->mHeight; y += BlockSize ) {
#pragma omp parallel for
    for ( int x = 0; x < image->mWidth; x += BlockSize ) {
      processBlock ( in, out, x, y, z, BlockSize );
    }
  }
}

All threads on the same rows
May not utilize all CPUs if the ratio of Width to BlockSize < # CPUs
Better cache utilization
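If the compiler supports OpenMP 3.0, a third option (not on the slides) is to collapse the three block loops into one iteration space, so every block becomes a schedulable work item; this is only a sketch, reusing image, in, out and processBlock from the code above, and it assumes the loops stay in this simple canonical form.

int BlockSize = 32;
#pragma omp parallel for collapse(3)
for ( int z = 0; z < image->mDepth;  z += BlockSize ) {
  for ( int y = 0; y < image->mHeight; y += BlockSize ) {
    for ( int x = 0; x < image->mWidth;  x += BlockSize ) {
      processBlock ( in, out, x, y, z, BlockSize );
    }
  }
}
// Many small blocks per thread: good load balance; cache behaviour
// depends on how the runtime orders the collapsed iterations.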
Algorithm sketch – TBB

class Vesselness {
  public:
    void operator() ( const tbb::blocked_range3d<int,int,int>& r ) const {
      // Process the block, could use ITK here
      processBlock ( r.cols().begin(), r.rows().begin(), r.pages().begin(),
                     r.cols().size(),  r.rows().size(),  r.pages().size() );
    }
};
...
parallel_for ( tbb::blocked_range3d<int,int,int>(
                 0, in->mDepth,  32,
                 0, in->mHeight, 32,
                 0, in->mWidth,  32 ),
    Vesselness( in, out ), auto_partitioner() );

Individual blocks
Full CPUs
May not have best cache performance
Next steps
Go try parallel development
Try threads to gain understanding and insight
Next OpenMP, adapting existing code
TBB: more constructs, different approaches
Experiment with new languages: Erlang, Scala, Reia, Chapel, X10, Fortress...
Check out some of the resources provided
Have fun! It’s a brave new world out there...
Resources
TBB (http://www.threadingbuildingblocks.org/)
OpenMP (http://openmp.org/wp/)
Books/Articles
  Java Concurrency in Practice (http://www.javaconcurrencyinpractice.com/)
  Parallel Programming (http://www-users.cs.umn.edu/~karypis/parbook/)
  ITK Software Guide (http://www.itk.org/ItkSoftwareGuide.pdf)
  The Problem with Threads (http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-1.pdf)
Tutorials
  Parallel Programming (https://computing.llnl.gov/tutorials/parallel_comp/)
  pthreads (https://computing.llnl.gov/tutorials/pthreads/)
  OpenMP (https://computing.llnl.gov/tutorials/openMP/)
Other
  LLNL (https://computing.llnl.gov/)
  Erlang (http://en.wikipedia.org/wiki/Erlang_programming_language)
  GCC-OpenMP (http://gcc.gnu.org/projects/gomp/)
  Intel Compiler (http://software.intel.com/en-us/intel-compilers/)
Resources
Languages
  Erlang (http://www.erlang.org/)
  Scala (http://www.scala-lang.org/)
  Chapel (http://chapel.cs.washington.edu/)
  X10 (http://x10-lang.org/)
  Unified Parallel C (http://upc.gwu.edu/)
  Titanium (http://titanium.cs.berkeley.edu/)
  Co-Array Fortran (http://www.co-array.org/)
  ZPL (http://www.cs.washington.edu/research/zpl/home/index.html)
  High Performance Fortran (http://hpff.rice.edu/)
  Fortress (http://projectfortress.sun.com/Projects/Community/)
  Others (http://www.google.com/search?q=parallel+programming+language)
Medical image processing strategies for multi-core CPUs
Daniel Blezek, Mayo Clinic
blezek.daniel@mayo.edu
Thread construction – pthread example (API)

#include <pthread.h>

void *(*start_routine)(void *);

int pthread_create(pthread_t *restrict thread,
                   const pthread_attr_t *restrict attr,
                   void *(*start_routine)(void *),
                   void *restrict arg);

void pthread_exit(void *value_ptr);

int pthread_join(pthread_t thread, void **value_ptr);
Mutex – pthread example

#include <pthread.h>

pthread_mutex_t myMutex;
...
pthread_mutex_init ( &myMutex, NULL );
...
pthread_mutex_lock ( &myMutex );
// Critical Section, only one thread at a time
...
pthread_mutex_unlock ( &myMutex );
...
if ( pthread_mutex_trylock ( &myMutex ) == 0 ) {
  // We did get the lock, so we are in the critical section
  ...
  pthread_mutex_unlock ( &myMutex );
}
Mutex – Java example

import java.lang.*;

class Foo {
  public synchronized int doWork () {
    // only one thread can execute doWork
  }

  Object resource;

  public int otherWork () {
    synchronized ( resource ) {
      // critical section, resource is the mutex
      ...
    }
  }
}


Threshold – pthread

struct Work { Image* in; Image* out; int start; int end; };
Work workArray[THREADCOUNT];
pthread_t thread[THREADCOUNT];

void* doThreshold ( void* inWork ) {
  Work* work = (Work*) inWork;
  for ( int s = work->start; s < work->end; s++ ) {...}
}
...
pthread_attr_t attributes;
pthread_attr_init ( &attributes );
pthread_attr_setdetachstate ( &attributes, PTHREAD_CREATE_JOINABLE );
for ( int t = 0; t < THREADCOUNT; t++ ) {
  initializeWork ( in, out, t, workArray[t] );
  pthread_create ( &thread[t], &attributes, doThreshold, (void*) &workArray[t] );
}
for ( int t = 0; t < THREADCOUNT; t++ ) {
  pthread_join ( thread[t], NULL );
}
Semaphore
Allow N threads access
Protects limited resources
Binary semaphore: N = 1, equivalent to Mutex
  • 94. ITK – itk::MultiTheader87#include <itkMultiThreader.h>// Win32DWORD doWork ( LPVOID lpThreadParameter );// Pthread - Linux, Mac, Unixvoid* doWork ( void* inWork );itk::MultiThreader::Pointerthreader = itk::MultiThreader::New();threader->SetNumberOfThreads ( NumberOfThreads );for ( int i = 0; i < NumberOfThreads; i++ ) {threader->SetMultipleMethod ( i, doWork, (void*) work[i] );}// Explicit barrier, waits for Thread jointhreader->MultipleMethodExecute();
Insight Toolkit

#include <itkImageToImageFilter.h>

template <In, Out> Worker : public ImageToImageFilter<In, Out> {
...
  void BeforeThreadedGenerateData() {
    // Master thread only
    ...
  }
  void ThreadedGenerateData( const OutputImageRegionType &r, int tid ) {
    // Generate output data for r
    ...
  }
  void AfterThreadedGenerateData() {
    // Master thread only
    ...
  }
};
// Output split on last dimension
// i.e. slices for 3D volumes
Anisotropic diffusion – OpenMP

void doAD ( Image* in, Image* out ) {
  for ( int t = 0; t < TotalTime; t++ ) {
#pragma omp parallel for
    for ( int slice = 0; slice < in->mDepth; slice++ ) {
      ...
    }
  }
}

Editor's Notes

  • #3: If I had asked this question 5 years ago, almost no one would have raised their hand.
  • #5: Driving is inherently a parallel task, we coordinate at stop signs, stop lights, we obey the rules of the road, but we can get deadlocked (grid lock).