09 Communication models of Parallel platforms
• There are two primary forms of data exchange between parallel tasks:
• accessing a shared data space (Shared Memory Architecture)
• exchanging messages (Distributed Memory Architecture)
• Platforms that provide a shared data space are called shared-address-space machines or multiprocessors.
• Platforms that support messaging are also called message-passing platforms or multicomputers.
• NUMA: multiple processors access shared memory, but some regions are closer to a given processor and faster to access.
• Algorithms need to be designed for locality to perform well.
• Programming is easier, but access to shared data must be coordinated (see the sketch below).
• Cache coherence is a challenge.
• UMA: all processors access shared memory with equal (uniform) access time.
• In short, NUMA machines need special consideration for algorithm design and data coordination, while UMA machines provide equal access to shared memory.
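As a minimal sketch of the coordination needed for shared data in a shared-address-space program (an illustrative example, not taken from these slides; names such as worker and counter are hypothetical), the following pthreads fragment lets two threads update a shared counter, with a mutex providing the coordination:

```c
/* Illustrative sketch: two threads share one address space and must
 * coordinate their updates to shared data with a mutex. */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;                              /* shared data */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);                    /* coordinate access */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);               /* 2000000 when coordinated */
    return 0;
}
```

Without the mutex, the two threads would race on the shared counter and the final value would be unpredictable.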
Message-Passing Platforms
• This architecture is used to communicate data among a set of processors without the need for a global memory.
• These platforms comprise a set of processors, each with its own (exclusive) memory.
• In the message-passing model, an application runs as a collection of autonomous processes, each with its own local memory.
• In this model, processes communicate with other processes by sending and receiving messages.
• The message-passing model is used widely on parallel computers with distributed memory and on clusters of servers.
• Libraries such as MPI (Message Passing Interface) and PVM (Parallel Virtual Machine) provide the send and receive primitives for this model (see the sketch below).
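As an illustration of the primitives such libraries provide, here is a minimal MPI sketch in which process 0 sends a value to process 1 (assuming two processes, e.g. launched with mpirun -np 2; the variable names are illustrative):

```c
/* Minimal MPI send/receive sketch: each process has only its own local
 * memory, and data moves between processes through explicit messages. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                                          /* data in rank 0's local memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);  /* send to rank 1 */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```

Each process only reads and writes its own local memory; the only way data crosses process boundaries is through the send/receive calls.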
Message Passing vs. Shared Address Space Platforms
• The memory system, not processor speed, is often the bottleneck for many applications.
• Memory system performance is largely captured by two parameters:
• Latency and
• Bandwidth.
• Latency is the time from the issue of a memory request to the time the data is available
at the processor.
• Bandwidth is the rate at which data can be pumped to the processor by the memory
system.
Memory System Performance: Bandwidth and Latency
• It is very important to understand the difference between
latency and bandwidth.
• Consider the example of a fire-hose. If the water comes out
of the hose two seconds after the hydrant is turned on, the
latency of the system is two seconds.
• Once the water starts flowing, if the hydrant delivers water
at the rate of 5 gallons/second, the bandwidth of the system
is 5 gallons/second.
• If you want immediate response from the hydrant, it is
important to reduce latency.
• If you want to fight big fires, it is important to have high
bandwidth.
Memory Latency: An Example
• Consider a processor operating at 1 GHz (1 ns clock) connected to a DRAM with a
latency of 100 ns (no caches). Assume that the processor has two multiply-add units and
is capable of executing four instructions in each cycle of 1 ns. The following
observations follow:
• The peak processor rating is 4 GFLOPS (two multiply-add units, each performing 2 FLOPs per 1 ns cycle).
• Since the memory latency is equal to 100 cycles and the block size is one word, every
time a memory request is made, the processor must wait 100 cycles before it can
process the data.
Memory Latency: An Example
• On the above architecture, consider the problem of computing a dot-product of two
vectors.
• A dot-product computation performs one multiply-add on a single pair of vector
elements, i.e., each floating point operation requires one data fetch.
• It follows that the peak speed of this computation is limited to one floating point
operation every 100 ns, or a speed of 10 MFLOPS, a very small fraction of the
peak processor rating!
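For reference, the computation in question is just the standard dot-product loop below (a sketch, not code from these slides); the comments restate the example's assumptions of 100 ns DRAM latency, one-word accesses, and no cache:

```c
/* Dot product on the example machine: each iteration performs one
 * multiply-add (2 FLOPs) but needs two one-word fetches from DRAM.
 * At 100 ns per fetch with no cache, that is 200 ns per iteration,
 * i.e. about 10 MFLOPS, far below the 4 GFLOPS peak rating. */
double dot(const double *a, const double *b, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];   /* two memory fetches, one multiply-add */
    return sum;
}
```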
Improving Effective Memory Latency Using Caches
• Caches are small and fast memory elements between the processor and DRAM.
• This memory acts as a low-latency high-bandwidth storage.
• If a piece of data is repeatedly used, the effective latency of this memory system can be
reduced by the cache.
• The fraction of data references satisfied by the cache is called the cache hit ratio of the
computation on the system.
• The cache hit ratio achieved by a code on a memory system often determines its
performance.
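One common way to quantify this effect (a standard average-access-time estimate, not stated explicitly on these slides; the numbers below are illustrative) is to weight the cache and DRAM latencies by the hit ratio h:

```c
/* Effective memory latency as a function of cache hit ratio h,
 * cache latency t_cache, and DRAM latency t_dram. */
double effective_latency(double h, double t_cache, double t_dram) {
    return h * t_cache + (1.0 - h) * t_dram;
}

/* Example: h = 0.9, t_cache = 1 ns, t_dram = 100 ns
 *   effective_latency(0.9, 1.0, 100.0) = 0.9*1 + 0.1*100 = 10.9 ns */
```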
Impact of Memory Bandwidth
• Memory bandwidth is determined by the bandwidth of the memory bus as well as the
memory units.
• Memory bandwidth can be improved by increasing the size of memory blocks.
• The underlying system takes L time units (where L is the latency of the system) to
deliver B units of data (where B is the block size).
Impact of Memory Bandwidth: Example
• Consider the same setup as before, except in this case, the block size is 4 words instead
of 1 word. We repeat the dot-product computation in this scenario:
• Assuming that the vectors are laid out linearly in memory, eight FLOPs (four
multiply-adds) can be performed in 200 cycles.
• This is because a single memory access fetches four consecutive words in the
vector.
• Therefore, two accesses can fetch four elements of each of the vectors. This
corresponds to a FLOP every 25 ns, for a peak speed of 40 MFLOPS.
Impact of Memory Bandwidth
• It is important to note that increasing block size does not change latency of the system.
• Physically, the scenario illustrated here can be viewed as a wide data bus (4 words or
128 bits) connected to multiple memory banks.
• In practice, such wide buses are expensive to construct.
• In a more practical system, consecutive words are sent on the memory bus on
subsequent bus cycles after the first word is retrieved.
Impact of Memory Bandwidth
• The above examples clearly illustrate how increased bandwidth results in higher peak
computation rates.
• The data layouts were assumed to be such that consecutive data words in memory were
used by successive instructions (spatial locality of reference).
• If we take a data-layout centric view, computations must be reordered to enhance
spatial locality of reference.
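As a hypothetical illustration of such reordering, the two loops below compute the same sum over a row-major C array; the row-wise order uses every word of each fetched block, while the column-wise order strides through memory and exploits little spatial locality:

```c
/* Illustrative sketch of loop ordering and spatial locality for a
 * row-major N x N array. */
#define N 1024
double a[N][N];

double sum_row_major(void) {       /* good spatial locality */
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];          /* consecutive addresses in memory */
    return s;
}

double sum_col_major(void) {       /* poor spatial locality */
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];          /* stride of N words between accesses */
    return s;
}
```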
Impact of Memory Bandwidth: Example