DS1822-ParallelComputing-unit4
Parallel Patterns – Convolution – Prefix Sum – Sparse Matrix-Vector Multiplication – Imaging Case Study
1. Parallel Patterns
• Parallel programming with CUDA offers a variety of patterns that can help efficiently utilize the
computational power of GPUs.
Data parallelism:
• One of the simplest and most common ways to use parallel computing is data parallelism.
• This means that you divide your data into smaller chunks and assign them to different threads or
blocks of threads that run on the GPU.
• Each thread or block performs the same operation on its own chunk of data, independently of the
others.
• This way, you can process large amounts of data in parallel, without worrying about synchronization
or communication between threads.
• Data parallelism is useful for tasks such as image processing, matrix operations, sorting, or
reduction.
Common parallel patterns:
• In addition to data parallelism and task parallelism, CUDA can also be used to implement common
parallel patterns.
• These are general strategies or templates that can be applied to different problems and domains, and
that can help streamline the design and optimization of parallel algorithms.
• Map is a pattern that applies a function to each element of an input array to produce an output array
of the same size - for example, scaling, filtering, or transforming an image.
• Scan computes a prefix sum or a prefix operation on an input array and produces an output array of
the same size - such as computing the cumulative sum, the maximum, or the minimum of an array.
• Stencil applies a function to each element of an input array and its neighbors to produce an output
array of the same size - like performing convolution, smoothing, or edge detection on an image.
• Lastly, reduce applies a function to all elements of an input array and produces a single output value
- such as the sum, the average, the maximum, or the minimum of an array.
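As a minimal sketch of the map pattern (one thread per element; the kernel name and launch configuration are illustrative assumptions):

#include <cuda_runtime.h>

// Map pattern: every thread applies the same function to its own element.
__global__ void scaleKernel(const float* in, float* out, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                       // guard threads past the end of the array
        out[i] = in[i] * factor;
}

// Launch with enough 256-thread blocks to cover all n elements:
// scaleKernel<<<(n + 255) / 256, 256>>>(d_in, d_out, 2.0f, n);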
CUDA libraries:
• If you want to implement parallel algorithms and patterns in CUDA, you don't have to start from
scratch.
• There are many libraries and frameworks available that offer ready-made solutions and
optimizations for common problems and domains.
• For instance, cuBLAS provides linear algebra subprograms, cuFFT offers fast Fourier transform
functions, cuRAND provides random number generation functions, cuDNN includes deep neural
network functions, and cuSPARSE offers sparse matrix operations.
• All of these libraries offer the potential for improved performance and efficiency.
• To implement parallel algorithms and patterns in CUDA effectively and efficiently, you must adhere
to some best practices and guidelines.
• This includes finding the right balance between the number and size of threads and blocks that run
on the GPU, minimizing data transfer between the host (CPU) and device (GPU), optimizing the
memory access patterns of your threads and blocks, avoiding branches or conditionals that cause
threads to execute different paths of code, and using synchronization primitives such as barriers,
atomics, or locks to coordinate the execution and communication of threads and blocks.
• Techniques such as memory pinning, asynchronous copy, coalescing, caching, shared memory,
predication, loop unrolling, warp voting, warp shuffle, cooperative groups, or dynamic parallelism
can help optimize performance.
2. Convolution
• The convolutional operation (or filtering) is another common operation in many applications,
especially in image and signal processing, as well as deep learning.
• Although this operation is based on products of sequential data from the input and the filter, we
take a different approach than for matrix multiplication.
• The convolution operation is a mathematical operation which depicts a rule of how to combine two
functions or pieces of information to form a third function.
• The feature map (or input data) and the kernel are combined to form a transformed feature map.
• The convolution algorithm is often interpreted as a filter, where the kernel filters the feature map
for certain information.
• A kernel, for example, might filter for edges and discard other information. The inverse of the
convolution operation is called deconvolution.
• In image processing, convolution is a commonly used algorithm that modifies the value of each
pixel in an image by using information from neighboring pixels.
• A convolution kernel, or filter, describes how each pixel will be influenced by its neighbors.
• For example, a blurring filter will take the weighted average of neighboring pixels so that large
differences between pixel values are reduced. By using the same source image and changing only
the filter, one can produce effects such as sharpening, blurring, edge enhancing, and embossing.
Convolution operation in CUDA:
• The convolutional operation consists of source data and a filter.
• The filter is also known as a kernel. By applying the filter against the input data, we can obtain
the modified result.
• A two-dimensional convolution is shown in the following diagram.
• We need to consider a couple of concepts when we implement the convolution operation: the kernel
and padding.
• The kernel is a set of coefficients that we want to apply to the source data.
• This is also known as a filter.
• The padding is extra virtual space around the source data so that we can apply kernel functions to
the edge.
• When the padding size is 0, we don't allow the filter to move beyond the source space.
• However, in general, the padding size is half the size of the filter.
• To start easily, we can design the kernel function with the following in mind (see the sketch below):
- Each CUDA thread generates one filtered output.
- Each CUDA thread applies the filter's coefficients to the data.
- The filter shape is a box filter.
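A minimal sketch of such a kernel, assuming a single-channel float image, an odd box-filter width, and clamp-to-border handling (all names are illustrative):

#define FILTER_SIZE 3                        // box filter width (odd)

__global__ void boxFilter(const float* src, float* dst, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;   // one thread per output pixel

    const int R = FILTER_SIZE / 2;           // filter radius
    float sum = 0.0f;
    for (int dy = -R; dy <= R; ++dy)
        for (int dx = -R; dx <= R; ++dx) {
            int sx = min(max(x + dx, 0), width - 1);   // clamp at the borders
            int sy = min(max(y + dy, 0), height - 1);
            sum += src[sy * width + sx];
        }
    // Box filter: every coefficient equals 1 / (FILTER_SIZE * FILTER_SIZE).
    dst[y * width + x] = sum / (FILTER_SIZE * FILTER_SIZE);
}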
Mathematical Foundation for Convolution:
1. Linear time-invariant (LTI) systems are widely used in applications related to signal processing.
LTI systems are both linear (output for a combination of inputs is the same as a combination of the
outputs for the individual inputs) and time invariant (output is not dependent on the time when an
input is applied). For an LTI system, the output signal is the convolution of the input signal with the
impulse response function of the system.
2. Convolution of two functions is an important mathematical operation that has found heavy application
in signal processing. In computer graphics and image processing fields, discrete functions (e.g. an
image) are used and a discrete form of the convolution is applied to remove high frequency noise,
sharpen details, detect edges, or otherwise modulate the frequency domain of the image. A general
2D convolution has a high bandwidth requirement as the final value of a given pixel is determined
by several neighboring pixels. Fig. 5.3.1 depicts the blurring effect of convolution.
3. Fig. 5.3.2 (above) depicts convolution using a small 3 × 3 kernel. The filter is defined as a matrix,
where the central item weights the center pixel, and the other items define the weights of the
neighbor pixels. The radius of a 3 × 3 kernel is 1, since only the one-ring neighborhood is
considered during the convolution. The convolution's behavior also has to be defined at the border
of the image, where the kernel maps to undefined values outside the image.
Generally, the filtered values outside the image boundaries are either treated as zeros or clamped to
the border pixels of the image. The design of the convolution filter requires a careful selection of
kernel weights to achieve the desired effect.
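For reference, the standard discrete 2D convolution of an image I with a kernel K of radius r can be written as

O(x, y) = \sum_{j=-r}^{r} \sum_{i=-r}^{r} K(i, j)\, I(x - i,\, y - j)

so each output pixel is a weighted sum over the (2r + 1) × (2r + 1) neighborhood of the corresponding input pixel.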
Applications of Convolution:
Convolution algorithms work by iterating over each pixel in the source image.
• For each source pixel, the filter is centered over the pixel, and the values of the filter multiply the
pixel values that they overlay.
• A sum of the products is then taken to produce a new pixel value.
Fourier Transforms in Convolution :
• Convolution is important in physics and mathematics as it defines a bridge between the
spatial and time domains (pixel with intensity 147 at position (0, 30)) and the frequency
domain (amplitude of 0.3, at 30 Hz, with 60-degree phase) through the convolution
theorem.
• This bridge is defined by the use of Fourier transforms:
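The convolution theorem states that

\mathcal{F}\{f * g\} = \mathcal{F}\{f\} \cdot \mathcal{F}\{g\}

that is, convolution in the spatial (or time) domain corresponds to pointwise multiplication in the frequency domain.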
• When the Fourier transform is applied to both the kernel and the feature map, the convolution
operation is simplified significantly (integration becomes mere multiplication).
• Convolution in the frequency domain can be faster than in the time domain by using the Fast
Fourier Transform (FFT) algorithm.
• Some of the fastest GPU implementations of convolutions (for example some
implementations in the NVIDIA cuDNN library) currently make use of Fourier transforms.
CPU Implementation:
• A serial code implementing image convolution on a CPU employs two loops over the output pixels
(plus two inner loops over the kernel window) to compute the value of each output pixel, as sketched below.
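A minimal serial reference sketch, assuming the same single-channel float image and clamp-to-border handling as before (illustrative names):

#include <algorithm>

// Serial CPU convolution: two loops over the output pixels (y, x),
// plus two inner loops over the kernel window (dy, dx).
void convolveCPU(const float* src, float* dst, const float* kernel,
                 int width, int height, int r /* kernel radius */)
{
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x) {
            float sum = 0.0f;
            for (int dy = -r; dy <= r; ++dy)
                for (int dx = -r; dx <= r; ++dx) {
                    int sx = std::min(std::max(x + dx, 0), width - 1);
                    int sy = std::min(std::max(y + dy, 0), height - 1);
                    sum += kernel[(dy + r) * (2 * r + 1) + (dx + r)]
                         * src[sy * width + sx];
                }
            dst[y * width + x] = sum;
        }
}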
GPU Implementation:
• The parallel implementation of convolution on the GPU uses multiple threads to calculate the
convolution operator for multiple pixels simultaneously. The total number of pixels calculated at
each step equals the total number of launched threads (NumberOfBlocks × ThreadsPerBlock).
• Each thread having access to the coefficients of the convolution kernel calculates the double sum of
the convolution operator.
• Since the kernel coefficients are constant during the whole execution, they can be stored in shared
memory (accessible by the threads of a block).
Hands On Image Convolution with CUDA:
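A minimal host-side sketch, assuming the boxFilter kernel from the earlier section (illustrative names; error checking omitted for brevity):

#include <cuda_runtime.h>

void runBoxFilter(const float* h_src, float* h_dst, int width, int height)
{
    size_t bytes = size_t(width) * height * sizeof(float);
    float *d_src, *d_dst;
    cudaMalloc(&d_src, bytes);
    cudaMalloc(&d_dst, bytes);
    cudaMemcpy(d_src, h_src, bytes, cudaMemcpyHostToDevice);

    // One thread per output pixel, arranged in 16x16 blocks.
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    boxFilter<<<grid, block>>>(d_src, d_dst, width, height);

    cudaMemcpy(h_dst, d_dst, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_src);
    cudaFree(d_dst);
}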
3. Prefix Sum
Prefix Sum Fundamentals
• Parallel prefix sum is one of the most popular data-parallel algorithms.
• Prefix Sum is often used in problems such as stream compaction, sorting, Eulerian tours of a graph,
computation of cumulative distribution functions, etc.
• The all-prefix-sums operation on an array of data is commonly known as scan.
• A simple and common parallel algorithm building block is the all-prefix-sums operation.
• Here we define and illustrate the operation, and we discuss in detail its efficient implementation
using NVIDIA CUDA.
• Blelloch (1990) describes all-prefix-sums as a good example of a computation that seems inherently
sequential, but for which there is an efficient parallel algorithm. He defines the all-prefix-sums
operation as follows:
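Following Blelloch (1990): the all-prefix-sums operation takes a binary associative operator ⊕ with identity I and an array of n elements
[a0, a1, ..., an-1],
and returns the array
[I, a0, (a0 ⊕ a1), ..., (a0 ⊕ a1 ⊕ ... ⊕ an-2)].
(This is the exclusive scan; the inclusive variant ends with a0 ⊕ ... ⊕ an-1.)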
A Sequential Computation and Its Parallel Equivalent:
Sequential:
out[0] = 0;
for j from 1 to n do
    out[j] = out[j-1] + f(in[j-1]);
Parallel:
forall j in parallel do
    temp[j] = f(in[j]);
all_prefix_sums(out, temp);
Sequential Scan:
• Implementing a sequential version of scan (that could be run in a single thread on a CPU, for
example) is trivial.
• We simply loop over all the elements in the input array and add the value of the previous element
of the input array to the sum computed for the previous element of the output array, and write the
sum to the current element of the output array.
out[0] := 0
for k := 1 to n do
    out[k] := in[k-1] + out[k-1]
• This code performs exactly n adds for an array of length n; this is the minimum number of adds
required to produce the scanned array.
• When we develop our parallel version of scan, we would like it to be work-efficient.
• A parallel computation is work-efficient if it does asymptotically no more work (add operations, in
this case) than the sequential version.
• In other words the two implementations should have the same work complexity, O(n).
• This naive parallel scan (referred to below as Algorithm 1) is based on the scan algorithm
presented by Hillis and Steele (1986) and demonstrated for GPUs by Horn (2005).
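A minimal single-block sketch of this naive scan, assuming the input fits within one block and using double buffering in shared memory to avoid read/write races (illustrative names):

// Naive (Hillis-Steele) inclusive scan for one block of n <= blockDim.x elements.
// Performs O(n log2 n) additions, hence not work-efficient.
__global__ void naiveScan(const float* in, float* out, int n)
{
    extern __shared__ float temp[];          // 2 * n floats: a double buffer
    int tid = threadIdx.x;
    int pout = 0, pin = 1;

    if (tid < n) temp[tid] = in[tid];        // load the input into shared memory
    __syncthreads();

    for (int offset = 1; offset < n; offset *= 2) {
        pout = 1 - pout;                     // swap the two buffers
        pin = 1 - pin;
        if (tid < n) {
            if (tid >= offset)
                temp[pout * n + tid] = temp[pin * n + tid]
                                     + temp[pin * n + tid - offset];
            else
                temp[pout * n + tid] = temp[pin * n + tid];
        }
        __syncthreads();
    }
    if (tid < n) out[tid] = temp[pout * n + tid];
}
// Launch: naiveScan<<<1, n, 2 * n * sizeof(float)>>>(d_in, d_out, n);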
• The problem with Algorithm 1 is apparent if we examine its work complexity. The algorithm
performs O(n log2 n) addition operations.
• Remember that a sequential scan performs O(n) adds. Therefore, this naive implementation is
not work-efficient. The factor of log2 n can have a large effect on performance.
There are many uses for scan, including:
1. Lexical analysis.
2. String comparison.
3. Polynomial evaluation.
4. Stream compaction.
5. Building histograms.
6. Building data structures (such as graphs and trees) and performing operations on them in
parallel.
7. Solving recurrences.
4. Sparse Matrix-Vector Multiplication
• In a sparse matrix, the vast majority of the elements are zeros. Storing and processing these zero
elements are wasteful in terms of memory, time, and energy.
• Sparse matrices arise in many science, engineering, and financial modeling problems. Matrices are
often used to represent the coefficients in a linear system of equations.
• Each row of the matrix represents one equation of the linear system.
• Sparse matrix multiplication is an important algorithm in a wide variety of problems, including
graph algorithms, simulations and linear solving to name a few.
• Yet, there are but a few works related to acceleration of sparse matrix multiplication on a GPU.
• Many algorithms in machine learning, data analysis, and graph analysis can be organized such that
the bulk of the computation is structured as sparse matrix-dense matrix multiplication (SpMM).
• GPU programs are called kernels, which run a large number of threads in parallel in a single-
program, multiple-data (SPMD) fashion.
• The underlying hardware runs an instruction on each SM on each clock cycle on a warp of 32 threads
in lockstep.
• The largest parallel unit that can be synchronized within a GPU kernel is called a cooperative thread
array (CTA), which is composed of warps.
• The compressed sparse row (CSR) format stores only the column indices and values of non-zeroes
within a row.
• The start and end of each row are then stored, as offsets into the column-index and value arrays, in
a row offsets (or row pointers) array.
• Hence, CSR requires only O(m + nnz) memory for storage: the row offsets array has m + 1 entries,
and the column indices and values arrays each have nnz entries, where m is the number of rows and
nnz the number of non-zeroes.
• A dense matrix is in row-major order when successive elements in the same row are contiguous in
memory. Similarly, it is in column-major order when successive elements in the same column are
contiguous in memory.
• Sparse matrices are stored in a format that avoids storing zero elements. The discussion starts with
the Compressed Sparse Row (CSR) storage format.
• The Compressed Sparse Row (CSR) format is a popular, general-purpose sparse matrix
representation. CSR stores a sparse matrix A via three arrays:
(1) the array AA contains the nonzero entries of A, stored row by row;
(2) the array JA contains the column indices of the nonzero entries stored in AA;
(3) entries of the array IA point to the first entry of subsequent rows of A in the arrays AA
and JA.
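As a small worked example (constructed here for illustration, using 0-based indexing), consider the 4 × 4 matrix

A = | 1 0 0 2 |
    | 0 3 0 0 |
    | 0 0 0 0 |
    | 4 0 5 0 |

Its CSR arrays are

AA = [1, 2, 3, 4, 5]      (nonzero values, row by row)
JA = [0, 3, 1, 0, 2]      (column index of each value)
IA = [0, 2, 3, 3, 5]      (row i spans AA[IA[i]] .. AA[IA[i+1]-1])

Row 2 is empty, so IA[2] = IA[3] = 3.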
Parallelizations of SpMM:
(a) Row split - Assign an equal number of rows per processor.
(b) Nonzero split - Assign an equal number of nonzeroes per processor; a binary search over the
row offsets then determines at which row to start.
(c) Merge path - Assign an equal number of nonzeroes and rows per processor. This is
done by doing a 2-D binary search (i.e., on the diagonal line) over row offsets and
nonzero indices.
• While row split focuses primarily on ILP (Instruction-Level Parallelism) and TLP (Thread-
Level Parallelism), nonzero split and merge path focus on load-balancing as well.
• Consider nonzero split and merge path to be explicit load-balancing methods, as they rearrange
the distribution of work such that each thread must perform T independent instructions; if T >
1, then explicit load-balancing creates ILP where there was previously little or none.
• Thus load balance is closely linked with ILP, because if each thread is guaranteed T > 1 units of
independent work (ILP), then each thread is doing the same amount of work (i.e., is
load-balanced).
• Row split aims to assign each row to a different thread, warp, or CTA.
• The typical SpMV row split is the same scheme applied to only the left-most column of matrix B.
• This gives SpMM independent instructions but uncoalesced, random accesses into the vector.
• Although row-split is a well-known method for SpMM, there are three important design
decisions:
1. Granularity - Should each row be assigned to a thread, warp, or CTA?
• Here each row is assigned to a warp, rather than the alternatives of assigning a thread or a CTA
per row.
• This leads to the simplest design of the three options, since it gives us coalesced memory
accesses into B. For matrices with few non-zeroes per row, however, the thread-per-matrix-row
work assignment may be more efficient.
2. Memory access pattern - How should work be divided in fetching B? What is the impact on ILP and
TLP?
3. Shared memory - Can shared memory be used for performance gain? This design decision had the
greatest impact on performance.
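To make the row-split granularity concrete, here is a minimal thread-per-row CSR sparse matrix-vector (SpMV) kernel computing y = A·x; a warp-per-row variant would add an intra-warp reduction (all names are illustrative):

// CSR SpMV, row split with one thread per row.
// IA: row offsets (m + 1 entries), JA: column indices, AA: nonzero values.
__global__ void spmvCsrScalar(int m, const int* IA, const int* JA,
                              const float* AA, const float* x, float* y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < m) {
        float dot = 0.0f;
        for (int k = IA[row]; k < IA[row + 1]; ++k)
            dot += AA[k] * x[JA[k]];         // uncoalesced access into x
        y[row] = dot;
    }
}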
5. Imaging
• Image processing is a type of signal processing in which the input is an image, and the output can
be an image or anything else produced by some meaningful processing.
• Converting a colored image to its grayscale representation is an example of image processing.
• Enhancing a dull and worn off fingerprint image is another example of image processing.
• More often than not, image processing happens on the entire image, and the same steps are
repeatedly applied to every pixel of the image.
• This programming paradigm is a perfect candidate to fully leverage CUDA's massive compute
capabilities.
• Image processing is ideal for running on the GPU because each pixel can be directly mapped to a
separate thread.
• The experiment will involve a series of image convolution algorithms. Convolutions are commonly
used in a wide array of engineering and mathematical applications.
• A simple high-level explanation: convolution basically takes one matrix (the image) and passes it
through another matrix (the convolution matrix).
• The result is the convolved image.
• The matrix can also be called the filter.
Gaussian Blur :
• Image smoothing is a type of convolution most commonly used to reduce image noise and detail.
• This is generally done by passing the image through a low-pass filter.
• The filter will retain lower frequency values while reducing high frequency values.
• The image is smoothed by reducing the disparity between each pixel and its nearby pixels.
• An image is smoothed to reduce noise before an edge detection algorithm is applied.
• Smoothing can be applied to the same image over and over again until the desired effect is
achieved.
• A simple way to achieve smoothing is by using a mean filter.
• The idea is to replace each pixel with the average value of all neighboring pixels including itself.
One of the advantages of this approach is its simplicity and speed.
• Another way to smooth an image is to use the Gaussian Blur.
• The Gaussian Blur is a more sophisticated image smoothing technique because it reduces the
magnitude of high frequencies in proportion to their frequency.
• It gives less weight to pixels further from the center of the window.
• The Gaussian function is defined as:
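In two dimensions, the standard Gaussian is

G(x, y) = \frac{1}{2\pi\sigma^{2}} e^{-\frac{x^{2} + y^{2}}{2\sigma^{2}}}

where σ is the standard deviation; a larger σ widens the window and strengthens the blur.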
Sobel Edge Detection:
• Edge detection is a common image processing technique used in feature detection and extraction.
• Applying an edge detection on an image can significantly reduce the amount of data needed to be
processed at a later phase while maintaining the important structure of the image.
• The idea is to remove everything from the image except the pixels that are part of an edge.
• These edges have special properties, such as corners, lines, curves, etc.
• A collection of these properties or features can be used for a larger task, such as image
recognition.
• An edge can be identified by significant local changes of intensity in an image.
• An edge usually divides two different regions of an image.
• Most edge detection algorithms work best on an image that has the noise removal procedure already
applied.
• The main techniques existing today use differential operators and high-pass filtering.
• A simple edge detection algorithm is to apply the Sobel edge detection algorithm.
• It involves convolving the image with an integer-valued filter, which is both simple and
computationally inexpensive. The Sobel filter is defined as:
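The standard pair of 3 × 3 Sobel kernels (one common sign convention) is

G_x = \begin{pmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{pmatrix}, \qquad
G_y = \begin{pmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ +1 & +2 & +1 \end{pmatrix}

where G_x responds to horizontal intensity changes and G_y to vertical ones; the gradient magnitude at each pixel is approximated as \sqrt{G_x^{2} + G_y^{2}} (or |G_x| + |G_y| for speed).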