PDC Lecture 01
Prerequisites
o Operating Systems or equivalent
o No parallel programming/systems background required
This class introduces parallel and distributed systems and programming, covering
three areas: parallel computer architectures, programming parallel and distributed
systems, and algorithms and systems issues in parallel and distributed systems.
o Architectures: architectural classes, Flynn's taxonomy, SIMD/vector architecture, shared-memory
architecture, distributed-memory architecture, GPU architecture,
interconnection networks
o Programming: optimizing single-thread performance, SIMD and vector extensions, shared-memory
programming, GPU programming, distributed-memory programming,
synchronization, concurrency, deadlock, race conditions, determinacy
o Algorithms and systems issues: PRAM, BSP, and LogP models; systems issues such as job scheduling,
power, performance, and security
Course Objectives
Attendance: required.
Late assignments: not accepted without a valid excuse.
Missed exam: following the university rules.
o Let me know as soon as possible if you need to miss an exam.
Incomplete grade:
o Miss the final with an accepted excuse
o Due to extraordinary circumstances with appropriate documentation.
Course policies
Academic Integrity
o No copying from anywhere
o Don’t ask others for solutions and don’t give solutions to others.
Violation
o The university requires all violations to be reported.
o First violation with level 1 agreement:
0 for the particular assignment/exam and the final course grade lowered by one letter
(A->B).
o Second violation: resolved through the Office of the Dean of the Faculties
Syllabus Changes
This syllabus is a guide for the course and is subject to change with
advance notice.
Parallel and Distributed Systems
The CPU clock frequency largely determines how many operations the
computer can perform per second for a sequential (or single-thread) program.
o For more than 10 years, the highest CPU clock frequency has stayed around 4 GHz
o For a sequential (single-thread) program, the time to perform 10^10 operations is on
the order of seconds
This is a physical limit: the CPU clock frequency is constrained by the size of
the CPU and the speed of light.
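The back-of-the-envelope arithmetic above can be checked directly; this is a rough sketch that assumes roughly one operation per clock cycle, which is a simplification of real superscalar CPUs:

```python
# Rough estimate: how long does a single-threaded program take to
# perform 10^10 operations at the ~4 GHz frequency plateau?
# Assumes ~1 operation per cycle (a simplification).
clock_hz = 4e9   # ~4 GHz
ops = 1e10       # 10^10 operations

seconds = ops / clock_hz
print(f"{ops:.0e} ops at {clock_hz:.0e} Hz ~ {seconds:.2f} s")  # ~2.5 s
```

At this rate the answer is on the order of seconds, matching the bullet above.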
The limit of clock frequency
One may think of shrinking the CPU to increase the clock frequency.
However, increasing the CPU frequency also increases the CPU power density.
The industry switched to multi-core around 2004 due to these physical limits.
Some large-scale applications can use any amount of computing power.
o Scientific computing applications
Weather simulation: more computing power means finer granularity and predictions
further into the future.
Japan’s K machine was built to provide enough computing power to better understand the
inner workings of the brain.
o Training of large machine learning models in many domains.
At a small scale, we would like our programs to run faster as the technology
advances – conventional sequential programs are not going to do that.
Why parallel/distributed computing?
Parallel and distributed systems and their programming are very broad; we try to achieve
a balance between breadth and depth.
Classification and Performance
Flynn’s Taxonomy (1966)
Performance, peak performance and sustained performance
Example of parallel computing
Computation graph, scheduling and execution time
Flynn’s Taxonomy
The maximum degree of parallelism depends on the structure of the arithmetic and logic unit. Higher degree of
parallelism indicates a highly parallel ALU or processing element. Average parallelism depends on both the
hardware and the software. Higher average parallelism can be achieved through concurrent programs.
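The distinction between maximum and average parallelism can be illustrated with a made-up execution trace; the numbers below are hypothetical, chosen only to show that the average depends on the program while the peak is bounded by the hardware:

```python
# Hypothetical trace: operations executed in each clock cycle of some
# program on a machine whose ALU can do at most 4 operations per cycle.
ops_per_cycle = [1, 4, 4, 2, 1, 4, 4, 4]

# Maximum degree of parallelism: limited by the hardware (the ALU width).
max_parallelism = max(ops_per_cycle)

# Average parallelism: depends on both the hardware and the software --
# a more concurrent program would keep more of the ALU busy each cycle.
avg_parallelism = sum(ops_per_cycle) / len(ops_per_cycle)

print(max_parallelism, avg_parallelism)  # 4 3.0
```

A program with more concurrency would raise the average toward the hardware maximum of 4.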
Feng Taxonomy
In 1972, Tse-yun Feng proposed a system for classifying parallel processing systems based
on word length (the number of bits in a word) and bit-slice length (the number of words
processed simultaneously). This classification focuses on the parallelism of bits and
words. Here are the four categories in Feng's classification:
1. Word Serial Bit Serial (WSBS): In this case, one bit of a selected word is processed at a time. It corresponds
to serial processing and requires maximum processing time.
2. Word Serial Bit Parallel (WSBP): All the bits of a selected word are processed simultaneously, but one word
at a time. It provides slightly more parallelism than WSBS.
3. Word Parallel Bit Serial (WPBS): One selected bit from all specified words is processed at a time. WPBS can
be thought of as column parallelism.
4. Word Parallel Bit Parallel (WPBP): All the bits of all specified words are operated on simultaneously. This
category offers maximum parallelism and minimum execution time.
Feng Taxonomy
Processors like the IBM 370, Cray-1, and PDP-11 process all the bits of a word in parallel, one
word at a time (with word sizes from 16 to 64 bits), falling under the WSBP category. On the other hand,
processors like STARAN and MPP execute one bit of a word at a time but multiple words
together, categorizing them as WPBS processors. Finally, processors like C.mmp and PEPE
execute multiple bits and multiple words simultaneously, fitting into the WPBP category.
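The four Feng categories follow mechanically from the two axes (bits per word, words processed at once); the helper below is a hypothetical sketch of that mapping, where the maximum degree of parallelism is taken as the product of the two:

```python
# Sketch of Feng's classification from its two axes:
#   word_bits     - bits of a word processed at once (1 = bit serial)
#   words_at_once - words processed simultaneously (1 = word serial)
def feng_category(word_bits: int, words_at_once: int) -> str:
    bit = "Bit Parallel" if word_bits > 1 else "Bit Serial"
    word = "Word Parallel" if words_at_once > 1 else "Word Serial"
    return f"{word} {bit}"

def max_parallelism(word_bits: int, words_at_once: int) -> int:
    # Maximum degree of parallelism: bits processed per step.
    return word_bits * words_at_once

print(feng_category(1, 1))     # Word Serial Bit Serial   (WSBS)
print(feng_category(64, 1))    # Word Serial Bit Parallel (WSBP)
print(feng_category(1, 256))   # Word Parallel Bit Serial (WPBS)
print(feng_category(64, 256))  # Word Parallel Bit Parallel (WPBP)
```

The word lengths and slice widths here (64, 256) are illustrative, not tied to any particular machine.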
Handler’s Taxonomy
In 1977, Wolfgang Handler proposed a computer architectural
classification scheme for determining the degree of parallelism and
pipelining built into the computer system hardware. His classification
focuses on pipeline processing systems and divides them into three
subsystems:
1. Processor Control Unit (PCU): Each PCU corresponds to one processor or one CPU.
2. Arithmetic Logic Unit (ALU): ALU is equivalent to the processing element (PE). It
performs arithmetic and logical calculations.
3. Bit Level Circuit (BLC): BLC corresponds to the combinational logic circuit required for
1-bit operations in ALU.
Handler’s Taxonomy
Handler’s classification uses three pairs of integers to describe the computer system:
Computer: K = number of processors (PCUs) within the computer, K’ = number of PCUs that can be pipelined.
ALU: D = number of ALUs (PEs) under the control of PCU, D’ = number of PEs that can be pipelined.
Word Length: W = word length of a PE, W’ = number of pipeline stages in all PEs.
For example:
Texas Instrument’s Advanced Scientific Computer (TI ASC) has one controller controlling 4 arithmetic pipelines,
each with a 64-bit word length and 8 pipeline stages. Representing TI ASC according to Handler’s classification:
TI ASC = (K=1, K′=1, D=4, D′=1, W=64, W′=8)
CDC 6600 has a single CPU with an ALU having 10 specialized hardware functions (each 60-bit word length), and
up to 10 of these functions can be linked into a longer pipeline. It also has 10 peripheral I/O processors
operating in parallel. Each I/O processor has 1 ALU with 12 bits of word length. The representation:
CDC 6600 central processor = (K=1, K′=1, D=1, D′=10, W=60, W′=1); the I/O subsystem can be
described separately as (K=10, K′=1, D=1, D′=1, W=12, W′=1)
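Handler's six-number descriptor can be captured as a small record; this is an illustrative sketch in which the maximum degree of parallelism is taken as the product of all six numbers:

```python
from dataclasses import dataclass

@dataclass
class HandlerDescriptor:
    K: int    # processors (PCUs) in the computer
    Kp: int   # K': PCUs that can be pipelined
    D: int    # ALUs (PEs) under the control of a PCU
    Dp: int   # D': PEs that can be pipelined
    W: int    # word length of a PE
    Wp: int   # W': pipeline stages in a PE

    def max_parallelism(self) -> int:
        # Product of all six numbers: bits that can be in flight at once.
        return self.K * self.Kp * self.D * self.Dp * self.W * self.Wp

ti_asc = HandlerDescriptor(K=1, Kp=1, D=4, Dp=1, W=64, Wp=8)
cdc_6600 = HandlerDescriptor(K=1, Kp=1, D=1, Dp=10, W=60, Wp=1)
print(ti_asc.max_parallelism())    # 2048
print(cdc_6600.max_parallelism())  # 600
```

The two instances reproduce the TI ASC and CDC 6600 (central processor) descriptors from the slide above.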
Summary
Flynn’s taxonomy: SISD, SIMD, MISD, MIMD
Parallelism = Work(G) / Span(G), an approximation of the number of processors that can be used effectively
Greedy scheduler assigns tasks to processors whenever a task is ready and a processor is available. The
execution time with a greedy scheduler is at most 2 times that of the optimal scheduler.
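The work/span quantities and the greedy-scheduler guarantee from the summary can be sketched on a small computation graph; the graph and node costs below are made-up, and the bound used is the standard greedy-scheduling result T_P <= Work/P + Span, which implies the factor-of-2 guarantee:

```python
from functools import lru_cache

# Made-up computation DAG: node -> list of successors, with per-node costs.
graph = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d"],
    "d": [],
}
cost = {"a": 1, "b": 3, "c": 2, "d": 1}

# Work(G): total cost of all nodes.
work = sum(cost.values())

# Span(G): cost of the longest (critical) path through the DAG.
@lru_cache(maxsize=None)
def span_from(node: str) -> int:
    succ = graph[node]
    return cost[node] + (max(span_from(s) for s in succ) if succ else 0)

span = max(span_from(n) for n in graph)

parallelism = work / span          # ~number of usable processors

P = 2
greedy_bound = work / P + span     # greedy scheduler: T_P <= Work/P + Span

print(work, span, parallelism, greedy_bound)  # 7 5 1.4 8.5
```

Since the optimal time is at least max(Work/P, Span), the bound Work/P + Span is at most twice optimal, which is the factor of 2 stated in the summary.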