PDC Lecture 01

The CS-402 Parallel and Distributed Systems course syllabus outlines the course structure, prerequisites, and objectives for Fall 2024, focusing on parallel computer architectures, programming, and algorithms. Students will learn to write efficient code for parallel systems and explore various programming paradigms while completing projects related to parallelization and performance comparison. The syllabus emphasizes attendance, academic integrity, and the possibility of changes with advance notice.

CS-402 Parallel and Distributed Systems

Syllabus, Fall 2024


Lecture No. 01
 Location: Computer Science Department
 Time:
TUESDAY (08:30 AM TO 11:30 AM, SECTION A)

WEDNESDAY (12:00 PM TO 3:00 PM, SECTION B and C)

 Instructor: Qamas Gul Khan Safi, Email: [email protected]


o Office hours: Monday and Thursday, 10:00 am to 12:30 pm; ad hoc times announced on MS-Teams,
and by appointment.
Course Requirements

 Prerequisites
o Operating Systems or equivalent
o No parallel programming/systems background required

 Course website: MS-Teams


Required textbook: There is no single required textbook for this course.
Please see the class lectures.
Course Requirements
 Reference materials:
o J. Hennessy and D. Patterson, "Computer Architecture: A Quantitative Approach," 6th Edition, 2017.
o Maurice Herlihy, et al, "The Art of Multiprocessor Programming," 2nd edition, 2020.
o William James Dally and Brian Patrick Towles, "Principles and Practices of Interconnection Networks,"
Morgan Kaufmann, 1st Edition, 2004.
o David B. Kirk and Wen-mei W. Hwu, "Programming Massively Parallel Processors: A Hands-on Approach,"
Morgan Kaufmann, 3rd edition, 2016.
o Kevin R. Wadleigh and Isom L Crawford, "Software Optimization for High Performance Computing: Creating
Faster Applications," Prentice Hall, 1st Edition, 2000.
o MPI: http://www-unix.mcs.anl.gov/mpi/
o OpenMP: http://www.openmp.org
o CUDA: NVIDIA CUDA Programming Guide
o Papers and tutorials in recent technical conferences and journals.
Course Description

 This class introduces parallel and distributed systems and programming, covering
three areas: parallel computer architectures, programming parallel and distributed
systems, and algorithms and systems issues in parallel and distributed systems.
o Architectures: architectural classes, Flynn's taxonomy, SIMD/vector architecture, shared memory
architecture, distributed memory architecture, GPU architecture, interconnection networks
o Programming: optimizing single-thread performance, SIMD and vector extensions, shared memory
programming, GPU programming, distributed memory programming, synchronization, concurrency,
deadlock, race condition, determinacy
o Algorithms and systems issues: PRAM, BSP, LogP models; systems issues: job scheduling,
power, performance, security
Course Objectives

Upon completion of the course, the student will be able to


o Explain the challenges and techniques in parallel computer architectures.
o Write efficient code to exploit parallelism in uniprocessor and multi-processor systems with different programming paradigms.
o Explain systems issues and techniques in contemporary parallel and distributed systems.
Term project

 Development projects, examples:
o Parallelize a kernel or an application (shared memory, GPU, MPI, Spark, Cloud).
o Implement a PDS-related technique or algorithm from a recent paper.
 Evaluation projects, examples:
o Comparing SIMD performance of Intel, AMD, and ARM processors.
o Comparing the performance of different All-reduce algorithms (heavily used in
distributed deep learning frameworks).
o Benchmarking the performance of unified memory between GPU and CPU.
 Research projects
o Survey an emerging area (e.g., emerging programming models for heterogeneous systems,
recent advances in interconnection networks for exa-scale computing systems).
o Develop a new technique related to PDS (e.g., a new algorithm to perform all-reduce for deep
learning, a new topology for interconnection networks).
Course policies

 Attendance: required.
 Late assignments: not accepted without a valid excuse.
 Missed exam: following the university rules.
o Let me know when you need to miss an exam ASAP.
 Incomplete grade:
o Miss the final with an accepted excuse
o Due to extraordinary circumstances with appropriate documentation.
Course policies

 Academic Integrity
o No copying from anywhere
o Don’t ask others for solutions and don’t give solutions to others.
 Violation
o The university requires all violations to be reported.
o First violation with level 1 agreement:
0 for the particular assignment/exam and the lowering of one letter (A->B) for course final
grade.
o Second violation: resolved through the office of the Dean and the Faculties
Syllabus Changes

 This syllabus is a guide for the course and is subject to change with
advance notice.
Parallel and Distributed Systems

 What is a parallel computer?


o A collection of processing elements that communicate and cooperate to solve
problems.

 What is a distributed system?


o A collection of independent computers that appears to its users as a single
coherent system.
Almost all Contemporary Computing Systems are
Parallel and Distributed Systems.

o Mobile devices, IoT devices – many have multi-core CPUs
 iPhone 13, A15 – 6 CPU cores, 16 Neural Engine cores, 4 GPU cores
 Apple iPhone 16 Pro Max – iOS 18, Apple A18 Pro chipset (3 nm), hexa-core CPU, Apple GPU
o Desktop or laptop – multi-core CPU (uniprocessor systems)
o A high-end gaming computer (CPU + GPU)
o Multi-core server (multi-processor systems)
o Cloud computing platforms (Amazon AWS, Google Cloud)
o Massive gaming platforms
o Internet of Things
o Fugaku supercomputer (No. 1 on November 11, 2021, 442 petaFLOPS peak performance)
o The supercomputer at Oak Ridge National Laboratory, Tennessee, U.S. – 1,194 petaFLOPS
(1.2 exaFLOPS), AMD EPYC 64-core CPUs and AMD Instinct MI250X GPUs, first online August 2022
The performance limit of a sequential program

 The CPU clock frequency implicitly determines how many operations per second the
computer can perform for a sequential (or single-thread) program.
o For more than 10 years, the highest CPU clock frequency has stayed around 4 GHz.
o For a sequential (single-thread) program, the time to perform 10^10 operations is on
the order of seconds.

 This is a physical limit: the CPU clock frequency is limited by the size of
the CPU and the speed of light.
The limit of clock frequency

 Speed of light = 3 × 10^8 m/s

 One cycle at 4 GHz frequency = 1 / (4 × 10^9) s = 0.25 × 10^-9 s

 The distance that light can travel in one cycle:
o 3 × 10^8 m/s × 0.25 × 10^-9 s = 0.75 × 10^-1 m = 7.5 cm

 Intel chip dimension = 1.47 in × 1.47 in = 3.73 cm × 3.73 cm

Not much room left for increasing the frequency!


Another physical limit: power

 One may think of reducing the size of the CPU to increase frequency.
 Increasing CPU frequency also increases CPU power density.
 We switched to multi-core around 2004 due to these physical limits.

 For a sequential (single-thread) program, the time to perform 10^10 operations is on
the order of seconds.
o If one needs more performance, making use of parallelism implicitly or explicitly in
the hardware is the only way to go.
Using the Multiple Computing Elements in
Contemporary Computing Systems
 In many cases, they support concurrent applications (multiple independent apps
running at the same time).
 They can also support individual parallel/distributed applications by pooling
more computing resources for one application. This requires a different type
of programming than conventional sequential programs.
o Partition the task among multiple computing threads, coordinating and communicating
among computing threads (a minimal sketch follows below).
 This course will look under the hood of such systems and examine their
architectures, how to write effective programs to exploit architectural features,
and issues and solutions at different levels to enable parallel and distributed
computing.
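
As a rough sketch of the task-partitioning and coordination just mentioned, the following OpenMP fragment splits a summation loop among threads (OpenMP is one of the shared memory programming systems listed in the reference materials; the array size and values are arbitrary choices for illustration, not from the lecture):

/* Minimal OpenMP sketch: partition a summation across threads.
   Compile with an OpenMP-capable compiler, e.g., gcc -fopenmp sum.c */
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double a[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++)          /* sequential initialization */
        a[i] = 1.0;

    /* The loop iterations are partitioned among the threads; the reduction
       clause coordinates the per-thread partial sums into one result. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %f, threads available = %d\n", sum, omp_get_max_threads());
    return 0;
}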
Programming Parallel and Distributed Systems
 Two focuses of programming paradigms for PDS:
o Productivity
 Computing systems are fast enough for most applications. Coding is often where the bottleneck and cost are.
 Many programming systems are designed for productivity, for example Python, MATLAB, etc.
o Performance
 Computing systems are not fast enough for some applications (e.g. the training of very large deep learning
models). As a result, performance is also a focus.

 Programming systems in practice all claim to support both productivity and


performance. As computing systems become more heterogeneous and complicated,
the balance between the two is still under heavy investigation at this time.
 This class focuses on performance.
Why parallel/distributed computing?

 Some large scale applications can use any amount of computing power.
o Scientific computing applications
 Weather simulation. More computing power means finer granularity and prediction
further into the future.
 Japan’s K machine was built to provide enough computing power to better understand the
inner workings of the brain.
o Training of large machine learning models in many domains.

 In small scale, we would like our programs to run faster as the technology
advances – conventional sequential programs are not going to do that.
Why parallel/distributed computing?

 Bigger: Solving larger problems in the same amount of time.


 Faster: Solving the same sized problem in a shorter time.
More about parallel/distributed computing

 Parallel/distributed computing allows more hardware resources to


be utilized for a single problem. Parallel/distributed programs,
however, do not always solve bigger problems or solve the same
sized problems faster.
 Exploiting parallelism introduces overheads: work that is not necessary in the
sequential program.
 Not all applications have enough parallelism.
Naïve parallel programs are easy to write, but may not give you what you want.
What we will do in this class?

 Examine architectural features of PDS


 Introduce how to exploit the features and write efficient code for PDS
o Sequential code is a fundamental part of parallel code, so we will briefly discuss
how to write efficient sequential code.

 Study systems issues

 PDS and their programming are very broad; we will try to achieve a balance
between breadth and depth.
Classification and Performance
 Flynn’s Taxonomy (1966)
 Performance, peak performance and sustained performance
 Example of parallel computing
 Computation graph, scheduling and execution time
Flynn’s Taxonomy

 Computing is basically executing instructions that operate on data.


 Flynn's taxonomy classifies systems based on the parallelism in the
instruction stream and the parallelism in the data stream.
o single instruction stream or multiple instruction streams.
o single data stream or multiple data streams.
Flynn’s taxonomy
 Single Instruction Single Data (SISD)
 Single Instruction Multiple Data (SIMD)
 Multiple Instructions Multiple Data (MIMD)
 Multiple Instructions Single Data (MISD)
SISD

 At any given time, one instruction operates on one data item


o Traditional sequential architecture, Von Neumann architecture.
SIMD
 Single control unit and multiple processing units. The control unit
fetches an instruction and broadcasts it to all processing units; each
processing unit executes the same instruction on different data.
o Can achieve massive processing power with minimum control logic
o SIMD instructions allow for sequential reasoning.
SIMD
 Exploit data-level parallelism
o Matrix-oriented scientific computing and deep learning applications
o Media (image and sound) processing

 Vector machines, MMX, SSE (Streaming SIMD Extensions), AVX


(Advanced Vector eXtensions), GPU
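
To make the SIMD idea concrete, here is a minimal sketch in C using the AVX extensions mentioned above. It assumes an AVX-capable x86 processor and an array length that is a multiple of 8; the data values are made up for illustration.

/* Minimal AVX sketch: one instruction operates on 8 floats at a time.
   Compile with, e.g., gcc -mavx vadd.c on an AVX-capable x86 processor. */
#include <immintrin.h>
#include <stdio.h>

void vector_add(const float *a, const float *b, float *c, int n) {
    /* n is assumed to be a multiple of 8 in this sketch */
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);   /* load 8 floats */
        __m256 vb = _mm256_loadu_ps(&b[i]);
        __m256 vc = _mm256_add_ps(va, vb);    /* one instruction, 8 additions */
        _mm256_storeu_ps(&c[i], vc);          /* store 8 results */
    }
}

int main(void) {
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    float c[8];
    vector_add(a, b, c, 8);
    for (int i = 0; i < 8; i++) printf("%.1f ", c[i]);
    printf("\n");
    return 0;
}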
MISD
 Not commonly seen, no general purpose MISD computer has been built.
 Systolic array is one example of an MISD architecture.
MIMD

 Multiple instruction streams operating on multiple data streams


o MIMD can be thought of as many copies of SISD machine.
o Distributed memory multi-computers, shared memory multi-processors,
multi-core computers.
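
A minimal MPI sketch (MPI is listed in the reference materials) gives a feel for the MIMD, distributed-memory style: every process runs its own instruction stream on its own data and cooperates through explicit messages. The per-rank computation and the message tag below are arbitrary choices for illustration.

/* Minimal MPI sketch: each rank runs independently (MIMD) and rank 0
   collects a value from every other rank via explicit messages.
   Build/run with, e.g., mpicc hello.c && mpirun -np 4 ./a.out */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int value = rank * rank;                 /* each process computes its own data */
    if (rank == 0) {
        int sum = value;
        for (int src = 1; src < size; src++) {
            int recv;
            MPI_Recv(&recv, 1, MPI_INT, src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            sum += recv;
        }
        printf("sum of squares of ranks 0..%d = %d\n", size - 1, sum);
    } else {
        MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}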
Flynn’s Taxonomy

Type   Instruction Streams   Data Streams   Examples
SISD   1                     1              Early computers, Von Neumann architecture, Turing machine
SIMD   1                     N              Vector architectures, MMX, SSE, AVX, GPU
MISD   N                     1              No general-purpose machine; systolic array
MIMD   N                     N              Multi-core, multi-processor, multi-computer, cluster
Degree of Parallelism
Maximum degree of parallelism
The maximum number of binary digits that can be processed within a unit of time by a
computer system is called the maximum degree of parallelism P: if a processor processes P
bits in unit time, then P is its maximum degree of parallelism.

The maximum degree of parallelism depends on the structure of the arithmetic and logic unit. Higher degree of
parallelism indicates a highly parallel ALU or processing element. Average parallelism depends on both the
hardware and the software. Higher average parallelism can be achieved through concurrent programs.
Feng Taxonomy
In 1972, Tse-yun Feng proposed a system for classifying parallel processing systems based
on the number of bits processed in parallel within a word (the word length) and the number
of words processed in parallel. This classification focuses on the parallelism of bits and
words. Here are the four categories according to Feng's classification:

1. Word Serial Bit Serial (WSBS): In this case, one bit of a selected word is processed at a time. It corresponds
to serial processing and requires maximum processing time.
2. Word Serial Bit Parallel (WSBP): All the bits of a selected word are processed simultaneously, but one word
at a time. It provides slightly more parallelism than WSBS.
3. Word Parallel Bit Serial (WPBS): One selected bit from all specified words is processed at a time. WPBS can
be thought of as column parallelism.
4. Word Parallel Bit Parallel (WPBP): All the bits of all specified words are operated on simultaneously. This
category offers maximum parallelism and minimum execution time.
Feng Taxonomy

Processors like the IBM 370, Cray-1, and PDP-11 process one word at a time with all bits of
the word handled in parallel, with word sizes ranging from 16 to 64 bits, falling under the
WSBP category. On the other hand, processors like STARAN and MPP execute one bit of a
word at a time but multiple words together, categorizing them as WPBS processors. Finally,
processors like C.mmp and PEPE execute multiple bits and multiple words simultaneously,
fitting into the WPBP category.
Handler’s Taxonomy
In 1977, Wolfgang Handler proposed a computer architectural
classification scheme for determining the degree of parallelism and
pipelining built into the computer system hardware. His classification
focuses on pipeline processing systems and divides them into three
subsystems:
1. Processor Control Unit (PCU): Each PCU corresponds to one processor or one CPU.
2. Arithmetic Logic Unit (ALU): ALU is equivalent to the processing element (PE). It
performs arithmetic and logical calculations.
3. Bit Level Circuit (BLC): BLC corresponds to the combinational logic circuit required for
1-bit operations in ALU.
Handler’s Taxonomy
Handler’s classification uses three pairs of integers to describe the computer system:
Computer: K = number of processors (PCUs) within the computer, K’ = number of PCUs that can be pipelined.
ALU: D = number of ALUs (PEs) under the control of PCU, D’ = number of PEs that can be pipelined.
Word Length: W = word length of a PE, W’ = number of pipeline stages in all PEs.
For example:
Texas Instrument’s Advanced Scientific Computer (TI ASC) has one controller controlling 4 arithmetic pipelines,
each with a 64-bit word length and 8 pipeline stages. Representing TI ASC according to Handler’s classification:
TI ASC=(K=1,K′=1,D=4,D′=1,W=64,W′=8)
CDC 6600 has a single CPU with an ALU having 10 specialized hardware functions (each with a 60-bit word
length), and up to 10 of these functions can be linked into a longer pipeline. It also has 10 peripheral I/O
processors operating in parallel, each with 1 ALU and a 12-bit word length. The representation of the central processor:
CDC 6600 = (K=1, K′=1, D=1, D′=10, W=60, W′=1)
The 10 I/O processors can be represented separately as (K=10, K′=1, D=1, D′=1, W=12, W′=1).
Summary
 Flynn's taxonomy: SISD, SIMD, MISD, MIMD.

 Performance metrics: MIPS, GFLOPS.

 Peak performance and sustained performance.

 Computation graph: describes the dependencies between tasks in a parallel computation.

 Parallelism = Work(G) / span(G), an approximation of the number of processors that can be used
effectively in the computation.

 A greedy scheduler assigns tasks to processors whenever a task is ready and a processor is available. The
execution time with a greedy scheduler is at most 2 times that of the optimal scheduler.
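
A small worked example of the last two bullets, with made-up numbers: suppose a computation graph G has Work(G) = 1000 operations and span(G) = 50 operations on the longest dependency path. Then Parallelism = 1000 / 50 = 20, so roughly 20 processors can be used effectively. A greedy scheduler on P processors finishes in time T_P <= Work(G)/P + span(G); with P = 20 this gives T_P <= 50 + 50 = 100, while no scheduler can do better than max(Work(G)/P, span(G)) = 50, which is where the "at most 2 times the optimal scheduler" bound comes from.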
