PDC Complete Course File
CS482
Computing History
• Introduction to Early Computers:
□ ENIAC and UNIVAC in the 1940s and 1950s.
□ Massive room-sized structures with vacuum tubes.
□ Limited processing capabilities compared to contemporary technology.
• Evolution to Mainframes:
• Transition in the mid-20th century.
• IBM System/360 as a milestone in the 1960s.
• Room-sized mainframe computers with enhanced processing power.
• Personal Computers Era:
• Late 20th-century rise of personal computers.
• Apple and Microsoft pivotal in popularizing personal computing.
• Introduction of user-friendly interfaces and affordable hardware.
• Apple II (1977) and IBM PC (1981) marked the beginning of this era.
• Moore's Law:
• Formulated by Gordon Moore in 1965.
• Predicts doubling of transistors on a microchip every two years.
• Drives exponential growth in processing power.
• Facilitates development of smaller, faster, and more efficient computers.
• Influences technological innovation across various devices.
• Revolutionizes modern life, impacting everything from smartphones to supercomputers.
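As a rough quantitative reading of the prediction (an illustrative gloss added here, not part of the original slides), doubling every two years means
N(t) \approx N_0 \cdot 2^{t/2},
so over one decade (t = 10 years) the transistor count grows by a factor of about 2^5 = 32.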
Serial Computation:
Cost of communication: Computational resources are used to package and transmit data. Frequent
synchronization is often required – some tasks will wait instead of doing work. Communication can also saturate network bandwidth.
Latency vs. Bandwidth: Latency is the time it takes to send a minimal message between two tasks. Bandwidth is the
amount of data that can be communicated per unit of time. Sending many small messages can cause latency to dominate
communication overhead.
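A standard cost model makes this concrete (added here as a hedged illustration, not slide content): the time to send one message of n bytes is roughly
T_{msg} \approx \alpha + n / B,
where \alpha is the latency and B the bandwidth. Splitting the same n bytes into k small messages costs roughly k\alpha + n/B, so as k grows the latency term k\alpha dominates the communication overhead.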
Comparison
□ Memory Architecture: parallel computing typically uses a shared memory architecture, whereas in distributed computing each node has its own memory.
Parallel Computer Memory Architectures:
Shared Memory:
Advantages:
❑ Global address space provides a user-friendly programming perspective to memory
❑ Fast and uniform data sharing due to proximity of memory to CPUs
Disadvantages:
❑ Lack of scalability between memory and CPUs. Adding more CPUs increases traffic on the shared
memory-CPU path
❑ Programmer responsibility for “correct” access to global memory
Parallel Computer Memory Architectures:
Distributed Memory:
❑ Requires a communication network to connect inter-processor memory
❑ Processors have their own local memory. Changes made by one CPU
have no effect on others
❑ Requires communication to exchange data among processors
Advantages:
❑ Memory is scalable with the number of CPUs
❑ Each CPU can rapidly access its own memory without overhead incurred with trying to maintain global cache
coherency
Disadvantages:
❑ Programmer is responsible for many of the details associated with data communication between processors
❑ It is usually difficult to map existing data structures to this memory organization, based on global memory
Parallel Computer Memory Architectures:
Hybrid Distributed-Shared Memory:
The largest and fastest computers in the world today employ both shared and distributed memory
architectures.
CS482
Parallelism in microprocessor
□ Definition: Parallelism in microprocessors refers to the
simultaneous execution of multiple tasks to enhance overall
processing speed and efficiency.
□ Types of Parallelism:
1. Instruction-Level Parallelism (ILP):
Description: Executes multiple instructions in parallel within a single instruction
stream.
Example: Pipelining allows the CPU to process different stages of multiple
instructions simultaneously.
2. Data-Level Parallelism (DLP):
Description: Processes multiple data elements simultaneously.
Example: SIMD (Single Instruction, Multiple Data) instructions enable operations on
multiple data items concurrently.
3. Task-Level Parallelism (TLP):
Description: Involves the parallel execution of multiple independent tasks.
Example: Multicore processors, where each core handles a distinct task concurrently.
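The following OpenMP sketch (an illustrative addition, not from the slides; the array size and the two bookkeeping tasks are arbitrary choices) shows data-level parallelism as a parallel loop and task-level parallelism as independent sections executed by different threads:
#include <iostream>
#include <vector>
#include <omp.h>

int main() {
    const int n = 1000000;
    std::vector<double> a(n, 1.0), b(n, 2.0), c(n);

    // Data-level parallelism: the same operation applied to many elements at once
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];

    // Task-level parallelism: two independent tasks executed by different threads
    double sum = 0.0, maxval = 0.0;
    #pragma omp parallel sections
    {
        #pragma omp section
        { for (int i = 0; i < n; ++i) sum += c[i]; }                      // task 1: total
        #pragma omp section
        { for (int i = 0; i < n; ++i) if (c[i] > maxval) maxval = c[i]; } // task 2: maximum
    }

    std::cout << "sum = " << sum << ", max = " << maxval << std::endl;
    return 0;
}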
Parallelism in microprocessor (Cont.)
□ Benefits of Parallelism:
1. Increased Throughput:
Parallel execution of tasks results in higher overall processing speed.
2. Improved Performance:
Reduces the time taken to complete complex computations and tasks.
3. Enhanced Efficiency:
Allows for optimal resource utilization, maximizing computational power.
Parallelism in microprocessor (Cont.)
□ Challenges and Considerations:
1. Synchronization:
Ensuring coordinated execution to maintain data integrity and prevent conflicts.
2. Dependency Management:
Handling dependencies between parallel tasks to avoid errors and maintain accuracy.
3. Scalability:
Ensuring effective parallelism with the addition of more processing units.
□ Examples in Modern Processors:
1. Multithreading:
Description: Simultaneous execution of multiple threads within a single processor.
Example: Hyper-Threading Technology in Intel processors.
2. Multicore Processors:
Description: Integration of multiple processing cores on a single chip.
Example: Intel Core i9, AMD Ryzen processors.
□ Conclusion:
Parallelism in microprocessors significantly enhances computing capabilities by
executing tasks concurrently, leading to improved performance, efficiency, and
throughput.
Architectural classification schemes
Architectural classification schemes in the context of computing refer to the ways in which
computer architectures are categorized based on their design principles, features, and
structures. Here are some common architectural classification schemes:
□ Memory Hierarchy:
Classifies architectures based on the organization and hierarchy of memory components.
Examples: Von Neumann architecture, Harvard architecture, and Cache Memory Hierarchy.
□ Pipelined Architecture:
Classifies architectures based on the use of pipelines for instruction execution.
Examples: Instruction Pipelining and Superscalar Architecture.
□ Parallelism:
Classifies architectures based on the degree of parallel processing employed.
Examples: SIMD (Single Instruction, Multiple Data) and MIMD (Multiple Instruction, Multiple Data)
architectures.
Architectural classification schemes
□ Data Flow Architecture:
Classifies architectures based on the flow of data through the system rather than the control flow.
Examples: Dataflow computers.
□ Von Neumann vs. Harvard Architecture:
Classifies architectures based on how they handle instructions and data storage.
Examples: Von Neumann (single memory space for data and instructions) and Harvard (separate spaces for data and
instructions) architectures.
□ Microarchitecture:
Classifies architectures based on the internal organization and design decisions within a processor.
Examples: Superscalar, VLIW (Very Long Instruction Word), and SIMD microarchitectures.
□ System Organization:
Classifies architectures based on the organization of components within a computing system.
Examples: Single processor systems, Multiprocessor systems, and Multicore systems.
□ Memory Addressing Modes:
Classifies architectures based on how they access and manipulate data in memory.
Examples: Register addressing, Immediate addressing, and Indirect addressing.
□ Fault Tolerance:
Classifies architectures based on their ability to handle faults and errors.
Examples: SISD (Single Instruction, Single Data) vs. MIMD (Multiple Instruction, Multiple Data) fault-tolerant
architectures.
These classification schemes help in understanding the design principles, capabilities, and characteristics of
different computer architectures, aiding in the selection and analysis of appropriate systems for specific
applications.
Principles of pipelining and vector processing
□ Dividing Tasks into Stages:
Principle: Pipelining involves breaking down the execution of instructions into discrete stages. Each stage
represents a specific task in the instruction execution process.
□ Parallel Processing of Instructions:
Principle: Different stages of the pipeline operate in parallel, allowing multiple instructions to be in
various stages of execution simultaneously. This increases throughput and overall processing speed.
□ Continuous Flow of Instructions:
Principle: Instructions move through the pipeline continuously. As one instruction completes a stage, the
next instruction enters the pipeline, ensuring a steady flow of operations.
□ Overlap of Execution:
Principle: Pipelining aims to overlap the execution of multiple instructions. While one instruction is in
the execution stage, another can be in the decoding stage, maximizing processor utilization.
□ Stall and Hazard Handling:
Principle: Pipelining may face hazards such as data dependencies or branch instructions. Techniques like
instruction forwarding and branch prediction are employed to handle these hazards and prevent pipeline
stalls.
□ Optimizing Resource Utilization:
Principle: Pipelining optimizes the use of processor resources by allowing different stages to work
concurrently. This reduces idle time and improves overall efficiency.
Principles of pipelining and vector processing
Principles of Vector Processing:
□ Simultaneous Processing of Data Elements:
Principle: Vector processing involves the simultaneous execution of the same operation on multiple data elements. This is achieved
through specialized vector instructions.
□ Vector Registers:
Principle: Vector processors use vector registers to store and manipulate multiple data elements. These registers allow efficient access to
and processing of contiguous data.
□ Vectorization of Code:
Principle: Vector processing requires code to be written or compiled in a way that exploits the capabilities of vector instructions. Loops
and operations are structured to take advantage of parallelism.
□ Parallelism with a Single Instruction:
Principle: Vector processors achieve parallelism by executing a single instruction on multiple data elements concurrently. This contrasts
with scalar processors that operate on individual data items.
□ Enhanced Throughput for Regular Data:
Principle: Vector processing is particularly effective for regular and repetitive data structures, where the same operation is performed on
a large set of data elements.
□ Reduced Instruction Overhead:
Principle: Vector processing minimizes instruction overhead by expressing operations on entire vectors with a single instruction, reducing
the need for individual instructions for each data element.
□ Efficient Memory Access:
Principle: Vector processors often implement techniques like vector prefetching and caching to optimize memory access patterns,
ensuring efficient retrieval of vector data from memory.
Both pipelining and vector processing aim to improve processing speed and efficiency by introducing parallelism. Pipelining
focuses on breaking down instruction execution into stages and overlapping them, while vector processing emphasizes the
simultaneous processing of multiple data elements using specialized vector instructions and registers.
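A small sketch of what vectorized code can look like (illustrative only; OpenMP's simd directive stands in here for machine-specific vector instructions, and the function and variable names are invented for the example):
#include <cstddef>
#include <vector>

// Apply the same operation to many contiguous elements; the simd directive asks
// the compiler to use vector instructions and vector registers for the loop.
void scale_and_add(std::vector<float>& y, const std::vector<float>& x, float a) {
    #pragma omp simd
    for (std::size_t i = 0; i < y.size(); ++i)
        y[i] = a * x[i] + y[i];   // one vector instruction processes several elements
}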
Array Processors
□ Definition:
Array processors are specialized computing units designed for efficiently processing arrays or matrices of data. These processors excel at
performing parallel computations on large sets of data elements simultaneously.
□ Key Characteristics:
1. Parallel Processing:
Description: Array processors leverage parallelism to process multiple elements of an array simultaneously. Each processing element in the array
processor handles a different element of the array concurrently.
2. Specialized Instructions:
Description: Array processors typically come with a set of specialized instructions optimized for array operations. These instructions facilitate efficient
parallel computation on data arrays.
3. Vector and Matrix Operations:
Description: Array processors excel at performing vector and matrix operations. Common operations include addition, multiplication, and other
mathematical transformations applied concurrently to multiple elements.
4. Memory Architecture:
Description: The memory architecture of array processors is designed to support rapid access to array elements. This may involve vector registers or
specialized memory banks to facilitate efficient data retrieval.
5. High Throughput:
Description: Array processors are known for their high throughput when dealing with regular and repetitive data structures. This makes them suitable
for scientific and engineering applications involving large datasets.
6. Scientific and Engineering Applications:
Description: Array processors find extensive use in scientific and engineering computations, such as simulations, signal processing, and numerical
simulations where large arrays of data need to be processed simultaneously.
7. Data Parallelism:
Description: The architecture of array processors emphasizes data parallelism, where the same operation is performed on multiple data elements
concurrently. This aligns with the nature of array based computations.
□ Examples:
Description: Notable examples of array processors include the Connection Machine and the Cray T90 series. Graphics processing units
(GPUs) can also function as array processors, particularly in the context of parallel processing for graphics rendering and general purpose
computing (GPGPU).
Array Processors
□ Advantages:
• Efficiency in Parallel Operations: Array processors are highly efficient in
parallelizing operations on arrays, leading to faster computation times.
• Optimized for Mathematical Operations: The specialized instructions and
architecture make array processors well suited for mathematical computations
common in scientific and engineering applications.
• High Throughput: The parallel processing capabilities contribute to high
throughput, making array processors suitable for handling large datasets.
□ Challenges:
• Limited Applicability: Array processors are specialized and may not be as
versatile as general purpose processors for all types of computations.
• Programming Complexity: Developing software for array processors can be
more complex than traditional programming due to the need to explicitly handle
parallelism.
□ Conclusion: Array processors play a crucial role in accelerating
computations involving large datasets, particularly in scientific and
engineering domains. Their architecture is tailored for efficient parallel
processing of arrays, making them valuable in specific applications where
such parallelism is essential.
Multiprocessor Architecture and Parallel algorithms
□ Multiprocessor Architecture:
□ Definition:
Multiprocessor architecture involves the use of multiple processors or central processing units (CPUs) working
together to execute tasks concurrently. It aims to improve overall system performance and throughput by
parallelizing computations.
□ Types of Multiprocessor Architectures:
1. Shared Memory Multiprocessor (SMP):
□ Multiple processors share a common memory space.
□ Communication occurs through shared memory.
2. Distributed Memory Multiprocessor:
□ Processors have their own local memory.
□ Communication happens via message passing.
□ Advantages:
• Increased Throughput: Multiprocessor systems can execute multiple tasks simultaneously, improving
overall throughput.
• Scalability: Additional processors can be added to enhance system performance as workloads
increase.
• Fault Tolerance: Redundancy in processors allows for continued operation in the presence of
failures.
□ Challenges:
• Synchronization: Coordinating tasks among processors without conflicts.
• Data Sharing and Consistency: Ensuring consistency in shared data across processors.
• Programming Complexity: Developing parallel algorithms for multiprocessor systems can be
complex.
Multiprocessor Architecture and Parallel algorithms
Parallel Algorithms:
□ Definition:
Parallel algorithms are designed to solve computational problems by dividing them into smaller tasks that can be executed simultaneously.
They exploit the parallel processing capabilities of multiprocessor architectures.
□ Types of Parallelism in Algorithms:
Task Parallelism: Divides the overall task into subtasks, each processed concurrently by different processors.
Data Parallelism: Involves processing multiple data elements simultaneously using parallel operations.
Pipeline Parallelism: Breaks down a task into stages, allowing different stages to be executed concurrently.
□ Examples of Parallel Algorithms:
Matrix Multiplication: Divide matrices into submatrices, and perform multiplications concurrently.
Sorting Algorithms: Divide the data into subsets for parallel sorting.
Graph Algorithms: Parallelize graph traversal or search algorithms for faster processing.
□ Advantages:
• Improved Speedup: Parallel algorithms can significantly reduce the time needed to solve problems.
• Efficient Resource Utilization: Multiprocessor systems can concurrently execute different parts of an algorithm, optimizing resource
usage.
• Scalability: Parallel algorithms can scale with the addition of more processors.
□ Challenges:
• Load Balancing: Distributing the workload evenly among processors.
• Communication Overhead: Efficient communication between processors is crucial for parallel algorithm performance.
• Dependency Management: Handling dependencies between parallel tasks.
□ Conclusion: Multiprocessor architecture and parallel algorithms together form a powerful combination to address the
increasing demand for computational power. The efficient utilization of multiple processors in solving complex problems
provides a scalable and high performance computing solution. However, effective design and implementation require
addressing synchronization, data sharing, and communication challenges inherent in parallel systems.
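As a hedged sketch of the matrix-multiplication example listed above (assuming square n×n matrices stored row-major; the function name is invented for the illustration), each output row is independent and can be computed by a different thread:
#include <vector>
#include <omp.h>

// C = A * B for n x n matrices stored row-major in flat vectors.
void matmul(const std::vector<double>& A, const std::vector<double>& B,
            std::vector<double>& C, int n) {
    #pragma omp parallel for            // data parallelism over output rows
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            double sum = 0.0;
            for (int k = 0; k < n; ++k)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}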
RISC (Reduced Instruction Set Computing) and CISC (Complex
Instruction Set Computing)
□ RISC Architecture:
□ RISC processors are characterized by a simplified set of instructions,
aiming for streamlined execution.
□ The focus is on a smaller, more optimized instruction set, each taking a
single clock cycle to execute.
□ Examples of RISC architectures include ARM processors, widely used in
mobile devices and embedded systems.
□ CISC Architecture:
□ CISC processors have a more extensive and complex set of instructions,
allowing for more powerful and versatile operations.
□ Instructions can vary in length, and a single instruction can perform
multiple low-level operations.
□ x86 processors, found in many desktop and server environments, are
prominent examples of CISC architecture.
□ Both RISC and CISC architectures have distinct advantages and use
cases, and the choice between them depends on the specific
requirements of the computing tasks at hand.
RISC Architecture
□ Principles of RISC:
□ RISC, which stands for Reduced Instruction Set Computing, is a processor architecture
that emphasizes simplicity and efficiency in its design.
□ The core principle of RISC is to use a small, highly optimized set of instructions, each
taking a single clock cycle to execute.
□ The goal is to streamline instruction execution, making it faster and more predictable.
□ Advantages of RISC Architecture:
□ Simplicity: A reduced instruction set leads to simpler processor design, making it easier
to optimize and manufacture. This simplicity also facilitates faster instruction execution.
□ Efficiency: With a focus on basic instructions that execute quickly, RISC architectures
are often more efficient in terms of power consumption and overall performance.
□ Compiler-Friendly: RISC architectures are typically more compiler-friendly, allowing
compilers to optimize code more effectively.
□ Real-World Applications and Use Cases:
□ Mobile Devices: RISC architectures, such as ARM processors, are prevalent in mobile
devices due to their efficiency and low power consumption.
□ Embedded Systems: RISC architectures are commonly used in embedded systems,
where compact size and power efficiency are crucial.
□ Networking Equipment: RISC processors find applications in networking equipment,
where fast and predictable execution is essential for routing and packet processing.
CISC Architecture
□ Complex Instruction Set Computing (CISC) architecture is characterized by a diverse and
extensive set of instructions, each capable of performing complex operations.
□ Principles of CISC:
□ CISC processors aim to reduce the number of instructions per program by providing instructions
that can perform multiple low-level operations in a single instruction.
□ This design philosophy is based on the idea that more complex instructions can lead to more
efficient programs.
□ Advantages of CISC:
□ Versatility: CISC instructions can perform intricate tasks, reducing the number of instructions
needed for a given operation.
□ Efficiency for Complex Operations: Well-suited for tasks that require multiple operations, as a
single CISC instruction can handle them.
□ Disadvantages of CISC:
□ Complexity: The extensive instruction set can make the processor architecture more complex,
potentially leading to longer development cycles.
□ Power Consumption: In some cases, CISC architectures may consume more power compared to
RISC due to the complexity of instructions.
□ Examples of CISC Instructions:
□ x86 processors, such as those manufactured by Intel and AMD, are classic examples of CISC
architecture.
Lecture # 3
Concurrency Controls
CS-482
Concurrency Control
Concurrency refers to the ability of a system to handle
multiple tasks or processes simultaneously.
Concurrency can be achieved in various ways:
□ Multithreading
□ Multiprocessing
□ Asynchronous Programming
Concurrency introduces challenges such as race conditions,
deadlocks, and resource sharing, which need to be carefully
managed to ensure the correctness and reliability of
concurrent software.
Conflicts of serializability of transactions
Concurrency in databases refers to the ability of multiple transactions
or operations to be executed simultaneously without causing conflicts
or data inconsistency.
In the context of concurrency control in databases, conflicts can arise
when multiple transactions concurrently access and modify the same
data. There are three main types of conflicts that can occur:
□ Reading uncommitted data (Write-Read (WR) Conflict):
Occurs when one transaction reads uncommitted data that another
transaction writes.
□ Unrepeatable read (Read-Write (RW) Conflict)
Occurs when one transaction reads a data item that another transaction then modifies, so re-reading it returns a different value.
□ Lost update (Write-Write (WW) Conflict)
Occurs when two transactions both write to the same data item without reading it first (a blind
write), so one update overwrites the other.
Why concurrency control in database?
□ Isolation of Transactions:
□ Preventing Lost Updates:
□ Avoiding Dirty Reads:
□ Preventing Inconsistent Reads:
□ Deadlock Prevention:
□ Improving Concurrency:
Synchronization mechanism
These four conditions are crucial for preventing
race conditions and ensuring the correctness of concurrent
programs.
□ Primary conditions: Mutual Exclusion and Progress
□ Secondary conditions: Bounded Waiting and no assumptions
about relative hardware speed
Process synchronization
□ Process synchronization refers to the coordination of
activities or ordering of operations among multiple
concurrent processes or threads to ensure correct and
predictable behavior. Synchronization mechanisms are
used to prevent race conditions, deadlock, and other
concurrency-related issues.
(Slide example: two concurrent processes each operate on a shared variable – one decrements y, the other increments x – then sleep and abort/return.)
Process Types
In the context of process synchronization,
processes can be categorized into various
types based on their interaction and
synchronization requirements. Here are
some common types of processes:
□ Independent Processes:
□ Cooperating Processes:
□ Producer-Consumer Processes:
□ Readers-Writers Processes:
□ Client-Server Processes:
□ Real-Time Processes:
Cooperating Processes:
Cooperative processes can share various resources such as variables, memory, code, and other system
resources in a coordinated manner. Let's discuss how each of these resources can be shared among
cooperative processes:
□ Variables
□ Memory
□ Code
□ Other system resources
Race condition
A race condition is a situation that occurs in a concurrent system when the outcome of the system depends on
the timing or interleaving of multiple threads or processes. Race conditions typically occur when multiple
threads or processes access shared resources concurrently and at least one of them performs a write
operation. Without proper synchronization mechanisms in place, the order of execution of these
threads/processes becomes unpredictable, leading to unexpected and incorrect behavior. Common
manifestations of race conditions include:
□ Lost Updates
□ Inconsistent Reads
□ Deadlocks
□ Livelocks
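A minimal sketch of the lost-update case (an illustrative addition; the thread count and iteration count are arbitrary): without synchronization the increments race and the final value is usually less than expected.
#include <iostream>
#include <omp.h>

int main() {
    int counter = 0;
    // Eight threads increment the same variable concurrently.
    // counter++ is a read-modify-write, so without synchronization updates are lost.
    #pragma omp parallel num_threads(8)
    for (int i = 0; i < 100000; ++i) {
        // Placing "#pragma omp atomic" immediately before the increment removes the race.
        counter++;   // unsynchronized write to shared data: race condition
    }
    std::cout << "Expected 800000, got " << counter << std::endl;
    return 0;
}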
Peterson’s Solution algorithm
Peterson's Solution is a classic algorithm for solving the critical section problem, which ensures mutual exclusion
between two processes without requiring hardware support for synchronization. The algorithm was proposed
by Gary L. Peterson in 1981. Here's a simplified version of Peterson's Solution for two processes:
In this algorithm:
□ Each process has its flag indicating its intent to enter the critical section.
□ The `turn` variable indicates whose turn it is to enter the critical section. If `turn` is 0, it's Process P0's turn;
if `turn` is 1, it's Process P1's turn.
□ When a process wants to enter the critical section, it sets its flag to true and gives priority to the other
process by setting `turn` accordingly.
□ Processes busy wait until it's their turn to enter the critical section. They spin in a loop until the other
process has finished its critical section and set its `flag` to false or changed `turn` to indicate that it's now
their turn.
□ Once a process exits the critical section, it sets its flag to false.
Shared variables:
int turn; // Indicates whose turn it is to enter the critical section
bool flag[2]; // Indicates whether a process wants to enter the critical section
Process P0:
flag[0] = true; // P0 wants to enter the critical section
turn = 1; // P0 gives priority to P1
while (flag[1] && turn == 1) {} // Busy waiting until it's P0's turn
// Critical section
flag[0] = false; // P0 exits the critical section
// Remainder section
Process P1:
flag[1] = true; // P1 wants to enter the critical section
turn = 0; // P1 gives priority to P0
while (flag[0] && turn == 0) {} // Busy waiting until it's P1's turn
// Critical section
flag[1] = false; // P1 exits the critical section
// Remainder section
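For completeness, a minimal runnable sketch of the same algorithm (an addition beyond the slide pseudocode), using C++11 atomics with the default sequentially consistent ordering to stand in for the pseudocode's implicit memory assumptions:
#include <atomic>
#include <iostream>
#include <thread>

std::atomic<bool> flag[2] = {{false}, {false}}; // intent to enter the critical section
std::atomic<int> turn{0};                       // whose turn it is
int shared_counter = 0;                         // data protected by Peterson's algorithm

void worker(int me) {
    int other = 1 - me;
    for (int i = 0; i < 100000; ++i) {
        flag[me] = true;                          // I want to enter
        turn = other;                             // give priority to the other process
        while (flag[other] && turn == other) {}   // busy wait
        ++shared_counter;                         // critical section
        flag[me] = false;                         // exit critical section
    }
}

int main() {
    std::thread t0(worker, 0), t1(worker, 1);
    t0.join();
    t1.join();
    std::cout << "shared_counter = " << shared_counter << std::endl; // expect 200000
    return 0;
}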
Lecture # 4
System APIs for concurrency control
CS-482
System APIs for concurrency control
□ System APIs typically refer to platform-specific
mechanisms provided by the operating system
for managing concurrency, such as
□ Thread creation,
□ Synchronization primitives (e.g., mutexes,
semaphores), and
□ Inter-process communication facilities.
Threads
A thread is a basic unit of execution within a process, representing a single sequence of instructions that can
be independently scheduled and executed by the operating system's scheduler.
#include <iostream>
#include <omp.h>

int main() {
    // OpenMP directive to create a parallel region with 4 threads
    #pragma omp parallel num_threads(4)
    {
        // Get the unique identifier of the current thread
        int thread_id = omp_get_thread_num();
        std::cout << "Hello from thread " << thread_id << std::endl;
    }
    return 0;
}
What is mutex?
□ A mutex, short for "mutual exclusion," is a
synchronization primitive used to control access to
shared resources in concurrent programming. It ensures
that only one thread can access a shared resource at a
time, preventing data races and ensuring data integrity.
#include <iostream>
#include <omp.h>

int main() {
    int shared_variable = 0;
    omp_lock_t lock;
    omp_init_lock(&lock);
    #pragma omp parallel num_threads(4)   // four threads increment the shared variable under the lock
    {
        omp_set_lock(&lock);     // Acquire the mutex
        shared_variable++;       // Critical section: only one thread at a time
        omp_unset_lock(&lock);   // Release the mutex
    }
    omp_destroy_lock(&lock);
    std::cout << "Final value of shared_variable: " << shared_variable << std::endl;
    return 0;
}
Semaphore
A semaphore is a synchronization primitive used in concurrent programming to control access to a shared resource by multiple threads or
processes. Semaphores maintain a count or value, which can be incremented or decremented by threads. Depending on the value of the
semaphore, threads may either be allowed to proceed (if the count is positive) or be blocked until the count becomes positive.
Here's a conceptual overview of how semaphores work:
□ Initialization: A semaphore is initialized with an integer value, often referred to as the semaphore's "count" or "resource count."
□ Acquiring (Wait): When a thread wants to access the shared resource, it attempts to acquire the semaphore. If the semaphore's count is
greater than zero, indicating that resources are available, the thread decrements the count and continues execution. If the count is zero,
the thread is blocked until the count becomes positive.
□ Releasing (Signal): When a thread finishes using the shared resource, it releases the semaphore by incrementing its count. This allows
other threads waiting on the semaphore to proceed if resources become available.
There are two main types of semaphores:
□ Binary Semaphore: Also known as mutexes, binary semaphores have a count of either 0 or 1. They are typically used to control access
to a single resource, ensuring that only one thread can access it at a time.
□ Counting Semaphore: Counting semaphores can have a count greater than 1, allowing multiple threads to access a finite pool of
resources concurrently. They are useful for scenarios where multiple instances of a resource can be accessed simultaneously, up to a
certain limit.
Binary Semaphore
#include <iostream>
#include <omp.h>
int main() {
int shared_resource = 0;
omp_lock_t semaphore;
omp_init_lock(&semaphore);
#pragma omp parallel num_threads(2)
{
int thread_id = omp_get_thread_num();
if (thread_id == 0) { // Thread 0
omp_set_lock(&semaphore); // Acquire the semaphore
std::cout << "Thread " << thread_id << " has acquired the semaphore" << std::endl;
shared_resource = 1; // Modify the shared resource
std::cout << "Thread " << thread_id << " has modified the shared resource to: " << shared_resource << std::endl;
omp_unset_lock(&semaphore); // Release the semaphore
std::cout << "Thread " << thread_id << " has released the semaphore" << std::endl;
} else { // Thread 1
omp_set_lock(&semaphore); // Acquire the semaphore
std::cout << "Thread " << thread_id << " has acquired the semaphore" << std::endl;
shared_resource = 2; // Modify the shared resource
std::cout << "Thread " << thread_id << " has modified the shared resource to: " << shared_resource << std::endl;
omp_unset_lock(&semaphore); // Release the semaphore
std::cout << "Thread " << thread_id << " has released the semaphore" << std::endl;
}
}
omp_destroy_lock(&semaphore);
return 0;
}
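The code above shows the binary case. A hedged sketch of a counting semaphore follows (this assumes a C++20 compiler with std::counting_semaphore, which is not part of the slides' OpenMP-based examples): at most three of the six threads hold a resource at any moment.
#include <chrono>
#include <iostream>
#include <semaphore>
#include <thread>
#include <vector>

// A pool of three resources: at most three threads may be inside use_resource() at once.
std::counting_semaphore<3> pool(3);

void use_resource(int id) {
    pool.acquire();   // wait: decrement the count, block if it is zero
    std::cout << "Thread " << id << " is using one of the 3 resources\n";
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    pool.release();   // signal: increment the count, possibly waking a waiting thread
}

int main() {
    std::vector<std::thread> workers;
    for (int i = 0; i < 6; ++i)
        workers.emplace_back(use_resource, i);
    for (auto& t : workers)
        t.join();
    return 0;
}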
Amdahl's Law
□
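(The slide body was not captured; for reference, the standard statement of Amdahl's Law is
S(N) = \frac{1}{(1 - P) + P/N},
where P is the fraction of the program that can be parallelized and N is the number of processors. As N \to \infty, the speedup is bounded above by 1/(1 - P).)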
Example
□
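(A typical worked example, added here because the original slide content is missing: if 90% of a program is parallelizable (P = 0.9) and N = 8 processors are used,
S(8) = \frac{1}{0.1 + 0.9/8} = \frac{1}{0.2125} \approx 4.7,
and even with unlimited processors the speedup cannot exceed 1/0.1 = 10.)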
Practice
□
Distributed computing
Lecture # 7
Types of Computer System
□ Multiprocessors
□ A computer system in which two or more CPUs share full access to a
common RAM.
□ Characterized by tight coupling of CPUs
□ Multicomputers
□ An interconnected collection of nodes such that each node
generally has a CPU, RAM, a network interface and perhaps a hard
disk for paging.
□ Characterized by loose coupling of CPUs that do not share
memory.
□ All nodes are in a single room and communicate by a high-speed
dedicated network.
□ All nodes run the same OS, share a single file system and are under
a common administration.
□ A typical example of a multicomputer is 512 nodes in a single room
at a company, working on, say, pharmaceutical modeling.
Distributed systems
A collection of independent computers appearing to its users as a single
coherent system in which hardware or software components communicate
and coordinate their actions only by passing messages.
□ Each node is a complete computer with a full complement of peripherals.
□ Nodes of a distributed system may each run a different OS, each has its
own file system and be under a different administration.
□ A typical distributed system consists of thousands of machines loosely
cooperating over the internet.
□ Distributed systems are even more loosely coupled than multicomputers.
□ Loose coupling of distributed systems is both
□ A strength
□ And a weakness
□ Strength: Computers can be used for a variety of different applications
□ Weakness: Programming these applications is difficult due to lack of any
common underlying model.
Significant consequences for definition of
distributed systems
□ Concurrency
□ Different computers in a network can concurrently execute programs sharing
resources such as web pages or files when necessary.
□ Coordination of concurrently executing programs is an important and recurring
topic.
□ No global clock
□ Computers in a network can’t synchronize their clocks accurately.
□ Programs coordinate actions by exchanging messages.
□ Independent Failures
□ Distributed systems can fail in new ways:-
□ Faults in the network result in isolation of computers, but the latter do not stop working.
◻ Programs may not be able to detect whether network failed or becomes unusually slow.
□ Failures of a computer or unexpected termination of a program (a crash) is not
immediately made known to other components.
◻ Each component can fail independently.
□ Motivation
□ Motivation for constructing and using distributed systems stems from desire to
share resources.
Two types of distributed systems
Two opposing extreme positions provide a pair of models:
□ The first makes a strong assumption about time, and
□ The second makes no assumptions about time.
Synchronous distributed systems:
□ Synchronous distributed systems: Hadzilacos and Toueg define
such a system as one in which the following bounds are defined:-
□ The time to execute each step of a process has known lower and
upper bounds.
□ Each message transmitted over a channel is received within a known
bounded time.
□ Each process has a local clock whose drift rate from real time has a
known bound.
□ Advantage: It is possible to use timeouts to detect the failure
of a process.
□ Synchronous distributed systems can be built.
□ provided that processes’ resource requirements are known
□ so that sufficient processor cycles and network capacity can be
guaranteed, and
□ clocks provided with bounded drift rates.
Synchronous distributed computing
#include <iostream>
#include <vector>
#include <omp.h>
#include <chrono>
#include <thread>

int main() {
    // Define the size of the data array
    const int size = 10;
    // Upper and lower bounds for step execution time and message transmission delay,
    // modelling the timing assumptions of a synchronous distributed system
    const int lower_bound = 100; // in milliseconds
    const int upper_bound = 500; // in milliseconds

    std::vector<int> data(size, 0);
    #pragma omp parallel num_threads(2)   // two "processes"
    {
        int id = omp_get_thread_num();
        data[id] = id;                    // each process records a result
        // Each step is assumed to complete within [lower_bound, upper_bound]
        std::this_thread::sleep_for(std::chrono::milliseconds(lower_bound));
        #pragma omp critical
        std::cout << "Process " << id << " completed its step within "
                  << upper_bound << " ms" << std::endl;
    }
    return 0;
}
The Network Time Protocol
▪ The NTP service is provided by a network of servers located
across the Internet.
▪ Primary servers are connected directly to a time source such as
a radio clock receiving UTC;
▪ Secondary servers are synchronized, ultimately, with primary
servers.
▪ The servers are connected in a logical hierarchy called a
synchronization subnet, whose levels are called strata.
(Fig 2: example synchronization subnet, with servers labelled by their stratum levels 1, 2 and 3.)
▪ LC1: Li is incremented before each event is issued at process pi: Li := Li + 1.
▪ LC2: a) When a process pi sends a message m, it piggybacks on m the value t = Li.
b) On receiving (m, t), a process pj computes Lj := max(Lj, t) and
then applies LC1 before time-stamping the event receive(m).
▪ Although we increment clocks by 1, we could have chosen any
positive value.
▪ It can easily be shown, by induction on the length of any sequence
of events relating two events e and e’ , that e → e’
⇒ L(e) < L(e’).
Logical time and logical clocks
▪ Note that the converse is not true.
▪ If L(e) < L(e’), then we cannot infer that e → e’ .
▪ Each of the processes p1 , p2 and p3 has its logical clock initialized to
0.
▪ The clock values given are those immediately after the event
to which they are adjacent.
▪ Note that, for example, L(b) > L(e) but b ║ e .
Totally ordered logical clocks
▪ Distinct events, generated by different processes may have
numerically identical Lamport timestamps.
▪ However, we can create a total order on the set of events.
▪ If e is an event occurring at pi with local timestamp Ti, and
e’ is an event occurring at pj with local timestamp Tj,
o we define the global logical timestamps for these events
to be (Ti, i) and (Tj, j), respectively.
▪ And we define (Ti, i) < (Tj, j) if and only if either Ti < Tj, or
Ti = Tj and i < j.
▪ This ordering has no general physical significance (because process
identifiers are arbitrary), but it is sometimes useful.
▪ Lamport used it, for example, to order the entry of processes to a
critical section.
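A small sketch of this comparison in code (the pair type and function name are invented for the illustration):
#include <utility>

// A global logical timestamp: (Lamport timestamp T, process identifier i).
using Stamp = std::pair<int, int>;

// (Ti, i) < (Tj, j)  iff  Ti < Tj, or Ti == Tj and i < j.
// std::pair's built-in operator< is exactly this lexicographic rule; it is
// spelled out here to mirror the definition above.
bool happens_earlier(const Stamp& a, const Stamp& b) {
    return a.first < b.first || (a.first == b.first && a.second < b.second);
}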
Vector clocks
▪ Mattern [1989] and Fidge [1991] developed vector clocks to
overcome the shortcoming of Lamport’s clocks:
o the fact that from L(e) < L(e’) we cannot conclude e → e’.
▪ A vector clock for a system of N processes is an array of N
integers.
▪ Each process keeps its own vector clock, Vi , which it uses to
timestamp local events.
▪ Like Lamport timestamps, processes piggyback vector timestamps on
the messages they send to one another, and there are simple rules for
updating the clocks:
o VC1: Initially, Vi[j] = 0 , for i, j = 1, 2,… N.
o VC2: Just before pi timestamps an event, it sets Vi[i] :=Vi[i] + 1.
o VC3: pi includes the value t = Vi in every message it sends.
o VC4: When pi receives a timestamp t in a message, it sets Vi[j] :=
max(Vi[j], t[j]) , for j = 1, 2, … N .
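A minimal sketch of rules VC1–VC4 in code (an illustrative addition; the class name and the choice to return the whole vector as the piggybacked timestamp are assumptions of the sketch):
#include <algorithm>
#include <cstddef>
#include <vector>

struct VectorClock {
    std::vector<int> V;   // VC1: all N entries start at 0
    int i;                // index of the owning process
    VectorClock(int N, int self) : V(N, 0), i(self) {}

    void tick() { ++V[i]; }                 // VC2: increment just before timestamping an event

    std::vector<int> on_send() {            // VC3: piggyback t = Vi on the outgoing message
        tick();
        return V;
    }

    void on_receive(const std::vector<int>& t) {   // VC4: component-wise maximum (merge)
        for (std::size_t j = 0; j < V.size(); ++j)
            V[j] = std::max(V[j], t[j]);
        tick();                             // then apply VC2 to timestamp the receive event
    }
};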
Vector clocks
▪ Taking the component-wise maximum of two vector timestamps in
this way is known as a merge operation.
▪ For a vector clock Vi, Vi[i] is the number of events that pi has
timestamped, and Vi[j] (j ≠ i) is the number of events that have
occurred at pj that have potentially affected pi.
▪ The algorithm can be initiated by any process by executing the marker sending
rule.
Snapshot algorithm
▪ The algorithm terminates after each process has received a
marker on all of its incoming channels.
▪ The recorded local snapshots can be put together to create
the global snapshot in several ways.
▪ One policy is to have each process send its local snapshot to
the initiator of the algorithm.
▪ Another policy is to have each process send the information it
records along all outgoing channels.
o so that all the processes can determine the global state.
▪ Multiple processes can initiate the algorithm concurrently.
o Each initiation needs to be distinguished by using unique
markers.
o Different initiations by a process are identified by a
sequence number.
Properties of the recorded global state
▪ Two possible executions of the snapshot algorithm for the
money transfer example (Fig 7):
1. (Markers shown using dashed-and-dotted
arrows.)
▪ Let site S1 initiate the algorithm just after t1.
▪ S1 records its local state (account A=$550)
and sends a marker to site S2.
▪ The marker is received by site S2 after t4.
▪ Then it records its local state (account
B=$170), the state of channel C12 as $0, and
sends a marker along C21.
▪ When site S1 receives this marker, it records the state of C21 as $80.
▪ The $800 in the system ($550 + $170 + $0 + $80) is thus fully accounted for.
(Fig 7: Timing diagram of two possible executions of the banking example.)
Lecture # 8
Challenges facing Distributed Systems
Significant challenges are encountered in the design and use of distributed systems.
1. Heterogeneity
Heterogeneity (i.e., variety and difference) applies to all of the following:-
□ networks;
□ computer hardware;
□ operating systems;
□ programming languages;
Although the Internet consists of many different sorts of network, their differences are masked as all computers attached
to them use the Internet protocols (IPs) for communication. e.g., a computer attached to an Ethernet has an
implementation of the IPs over the Ethernet, whereas a computer on a different sort of network will need an
implementation of the IPs for that network.
□ Data types such as integers may be represented in different ways on different hardware e.g., two
alternatives for the byte ordering of integers.
□ These differences must be dealt with if messages are to be exchanged between programs running on
different hardware.
□ Although the operating systems of all computers on the Internet need to include an implementation of
the Internet Protocols, they do not necessarily all provide the same API to these protocols. e.g., the calls
for exchanging messages in UNIX are different from the calls in Windows.
□ Different programming languages use different representations for characters and data structures such as
arrays and records.
□ These differences must be addressed if programs in different languages need to communicate with one
another.
Middleware
□ Heterogeneity and mobile code: mobile code is code that can be
transferred from one computer to another and run at the destination –
Java applets are an example.
□ Code suitable for running on one computer is not necessarily suitable for running
on another because
□ executable programs are normally specific both to the instruction set and to the host OS.
□ e.g., executable files sent as e-mail attachments by Windows/x86 users will not
run on Macintosh computer running Mac OS X.
□ The virtual machine (VM) approach provides a way of making code executable on
a variety of host computers:
□ e.g., the Java compiler produces code for a Java VM, which executes it by interpretation.
□ The Java VM needs to be implemented once for each type of computer to
enable Java programs to run.
□ Today, the most commonly used form of mobile code is the inclusion of
JavaScript programs in some web pages loaded into client browsers.
Openness
□ An open distributed system is a system that may be extended
□ by the introduction of new services and
□ the reimplementation of old ones,
□ enabling application programs to share resources.
□ For example, in an extensible system, it should be relatively easy to add parts that run on a different OS or even to
replace an entire file system.
□ It also allows two independent parties to build completely different implementations of those interfaces, leading to two
separate distributed systems that operate in exactly the same way.
□ Openness cannot be achieved unless the specification of key software interfaces of the components of a system are
published so that they are available to software developers.
□ In distributed systems, services are generally specified through interfaces in an Interface Definition Language (IDL).
□ Interface definitions written in an IDL always capture only the syntax of services.
□ i.e. they specify precisely the names of the functions together with types of parameters, return values, possible exceptions that can
be raised, and so on.
□ The hard part is specifying precisely what those services do, that is, the semantics of interfaces.
□ In practice, such specifications are always given in an informal way by means of natural language.
□ Open distributed systems can be constructed from heterogeneous hardware and software, possibly from different
vendors.
□ In the case of Web caching, for example,
□ a browser should ideally provide facilities for only storing documents, and
□ at the same time allow users to decide about
□ the size of the cache,
□ about which documents are stored and for how long,
□ whether a cached document should always be checked for consistency.
□ In practice, a user can implement his own policy in the form of a component that can be plugged into the browser.
Security
□ Security for information resources has three components:-
□ Confidentiality (protection against disclosure to unauthorized individuals),
□ Integrity (protection against alteration or corruption), and
□ Availability (protection against interference with the means to access the resources).
□ Although firewall can be used to form barrier around an intranet,
□ this does not deal with ensuring the appropriate use of resources by users within an
intranet, or in the Internet.
□ In a distributed system, clients send requests to access data managed by servers.
□ For example:
□ A doctor might request access to hospital patient data or send additions to that data.
□ In electronic commerce and banking, users send their credit card numbers across the
Internet.
□ In both examples, the challenge is to send sensitive information in a message over a network
in a secure manner.
□ Solution: the use of encryption techniques to send sensitive information in a message.
□ But security is not just a matter of concealing contents of messages
□ it also involves knowing for sure the identity of the user on whose behalf a message was
sent.
□ Solution: use of biometric techniques or verification code on the cell phone to authenticate
the user.
Scalability
□ A system is scalable if it remains effective even with significant
increase in the number of resources and users.
□ Distributed systems operate effectively and efficiently at many
different scales, ranging from a small intranet to Internet.
□ System scalability measured along at least 3 different
dimensions.
□ With respect to size, - we can easily add more users and
resources to the system.
□ A geographically scalable system - the users and resources
may lie far apart.
□ An administratively scalable system - still be easy to manage
even if it spans many independent administrative organizations.
Scalability Problems
□ Scalability problems in DSs appear as performance problems caused by limited capacity of servers and network.
□ With respect to size
□ Obvious problem with centralized services: the server can become a bottleneck as number of users and applications
grows.
□ Using only a single server is sometimes unavoidable.
□ Imagine a service for managing highly confidential information such as medical records, bank accounts and so on.
□ Copying the server to several locations to enhance performance would otherwise make the service less secure.
□ With geographical scalability
□ Earlier distributed systems were designed for LANs that are based on synchronous communication.
□ In LANs communication between two machines is generally at worst a few hundred microseconds.
□ However, in a WAN, IPC may be hundreds of millisecs, three orders of magnitude slower.
□ Building interactive applications using synchronous communication in WAN systems requires a great deal of care.
□ Communication in wide-area networks is inherently unreliable, and virtually always point-to-point.
□ In contrast, LANs generally provide highly reliable communication facilities based on broadcasting, making it much easier
to develop distributed systems.
□ For example, consider the problem of locating a service: in a LAN, a process can simply broadcast a request to all machines.
□ Only those machines that have that service respond, each providing its network address in the reply message.
□ Such a location scheme is unthinkable in a WAN system.
□ Special location services needed which may scale worldwide.
□ In a system with many centralized components, geographical scalability (like size one) is limited due to performance and
reliability problems from wide-area communication.
Problems with administrative scalability
□ Conflicting policies with respect to resource usage, management,
and security.
□ If a distributed system expands into another domain, two types of
security measures need to be taken.
□ First of all, the distributed system has to protect itself against
malicious attacks from the new domain.
□ e.g., users from the new domain may have only read access to the
file system in its original domain.
□ Second, the new domain has to protect itself against malicious
attacks from the distributed system.
□ A typical example is that of downloading programs such as applets
in Web browsers.
□ Administrative scalability seems to be the most difficult one, partly
also because we need to solve nontechnical problems (e.g., politics
of organizations and human collaboration).
Scaling Techniques
Basically three techniques for scaling: hiding communication latencies, distribution, and replication.
□ Hiding communication latencies: important to achieve geographical scalability.
□ The basic idea: when a service has been requested at a remote machine, an alternative to waiting for a reply is to do other useful
work at the requester's side.
□ constructing the requesting application for using only asynchronous communication.
□ When a reply comes in, the application interrupted and a special handler called to complete previously-issued request.
□ Alternatively, a new thread of control can be started to perform the request.
□ Distribution: involves taking a component, splitting it into smaller parts, and subsequently spreading those parts across the
system.
□ An excellent example of distribution is the Internet DNS.
□ DNS – the Domain Name System – consists of a table with the correspondence between the domain names of computers (e.g.
www.amazon.com) and their Internet addresses.
□ Algorithms that use hierarchic structures scale better than those that use linear structures.
□ The time taken to access hierarchically structured data is O(log n), where n is the size of the set of data.
□ The DNS name space is hierarchically organized into a tree of domains divided into nonoverlapping zones (Fig 1).
□ The names in each zone are handled by a single name server.
□ One can think of each path name, being the name of a host in the Internet, and thus associated with a network address of that host.
□ Basically, resolving a name means returning the network address of the associated host.
□ Consider, for example, the name nl.vu.cs.flits.
□ To resolve this name, it is first passed to the server of zone z1 (Fig. 1) which returns the address of the server for zone
z2, to which the rest of name, vu.cs.flits, can be handed.
□ The server for z2 will return the address of the server for zone z3, which is capable of handling the last part of the name
and will return the address of the associated host.
Replication
□ Replication of components across a distributed system.
□ It not only increases availability, but also helps to balance the load between components leading to better
performance.
□ Also, in geographically-dispersed systems, having a copy nearby can hide much of the communication
latency problems mentioned before.
□ In general, for a system with n users to be scalable, the quantity of physical resources required to
support them should be at most O(n).
□ For example, if a single file server can support 20 users, then two such servers should be able to support
40 users.
□ One serious drawback to replication may badly affect scalability.
□ modifying one copy makes that copy different from others.
Lecture # 9
Middleware Layers
□ Consider the components shown in the shaded layer in fig 1.
Lecture # 10
Interprocess Communication (IPC)
□ Interprocess communication (IPC) mechanisms provide
ways for processes to communicate and synchronize with
each other in a computing system.
□ Shared memory model
□ Message passing
Message Passing
□ Message Passing Interprocess Communication (IPC) is a
mechanism that allows different processes to
communicate and exchange data with each other. In
message passing IPC, processes communicate by sending
and receiving messages rather than sharing a common
address space.
Operation in message passing
□ There are typically two main components involved in
message passing IPC:
□ Sender: The process that sends the message.
□ Receiver: The process that receives the message.
Operation in message passing
#include <iostream>
#include <mpi.h>
#include <omp.h>

int main(int argc, char** argv) {
    // Request funnelled thread support, since MPI calls are made from inside an OpenMP region
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    if (world_size != 2) {
        std::cerr << "This example requires exactly 2 MPI processes." << std::endl;
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    // OpenMP parallel region
    #pragma omp parallel num_threads(2)
    {
        // Get thread number
        int thread_id = omp_get_thread_num();

        // Only the first thread in each process performs the send or receive
        if (thread_id == 0) {
            if (world_rank == 0) {            // Process P
                int data = 42;
                MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
                std::cout << "Process P sent data: " << data << std::endl;
            } else if (world_rank == 1) {     // Process Q
                int received_data;
                MPI_Recv(&received_data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                std::cout << "Process Q received data: " << received_data << std::endl;
            }
        }
    }

    MPI_Finalize();
    return 0;
}
Concepts in the MPI standard for building
distributed-memory parallel applications
□ MPI_Init: Initializes MPI environment.
□ MPI_Finalize: Finalizes MPI environment.
□ MPI_Comm_rank: Retrieves the rank of the calling process within the
communicator.
□ MPI_Comm_size: Retrieves the size of the communicator.
□ MPI_Send: Sends a message from one process to another.
□ MPI_Recv: Receives a message sent by another process.
□ MPI_Bcast: Broadcasts a message from one process to all other
processes in the communicator.
□ MPI_Reduce: Combines values from all processes and returns a result to
one process.
□ MPI_Wait: Waits for an MPI request to complete.
□ MPI_Isend: Starts a non-blocking send operation.
□ MPI_Irecv: Starts a non-blocking receive operation.
□ MPI_Test: Tests if a request has completed.
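A short sketch tying a few of these calls together (illustrative; the numeric values are arbitrary): rank 0 broadcasts a value, every rank adds its own rank to it, and MPI_Reduce sums the results back on rank 0.
#include <iostream>
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);                            // initialize the MPI environment
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);              // rank of this process
    MPI_Comm_size(MPI_COMM_WORLD, &size);              // number of processes

    int value = (rank == 0) ? 100 : 0;
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);  // broadcast from rank 0 to all ranks

    int local = value + rank;                          // each rank computes a local result
    int total = 0;
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD); // sum on rank 0

    if (rank == 0)
        std::cout << "Sum over " << size << " ranks: " << total << std::endl;

    MPI_Finalize();                                    // shut down the MPI environment
    return 0;
}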
Blocking and non-blocking communication
□ Blocking Communication:
In blocking communication, a process that initiates a
communication operation (send or receive) is blocked until the
operation completes. The blocking nature of this communication
means that the sender will wait until the receiver has received
the message, and the receiver will wait until the sender has sent
the message.
□ Characteristics of blocking Communication:
□ Synchronization:
□ Simple Programming Model:
□ Potential Deadlocks:
□ Resource Utilization:
Blocking Communication:
# Blocking Send and Receive in MPI (Python)
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    data = {'message': 'Hello, World!'}
    comm.send(data, dest=1)        # Blocking send
else:
    data = comm.recv(source=0)     # Blocking receive
    print(f"Process {rank} received: {data}")
Blocking and non-blocking communication
□ Non-blocking Communication:
In non-blocking communication, a process initiates a
communication operation and proceeds with its execution
without waiting for the operation to complete. This allows the
sender or receiver to perform other tasks concurrently with the
communication operation.
□ Characteristics of Non-blocking Communication:
□ Asynchronous:
□ Overlap of Computation and Communication:
□ Complex Programming Model:
□ Avoidance of Deadlocks:
Non-blocking Communication:
# Non-blocking Send and Receive in MPI (Python)
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    data = {'message': 'Hello, World!'}
    req = comm.isend(data, dest=1)   # Non-blocking send
    # ... continue with other computation ...
    req.wait()                       # Wait for completion if necessary
else:
    req = comm.irecv(source=0)       # Non-blocking receive
    # ... continue with other computation ...
    data = req.wait()                # wait() returns the received object
    print(f"Process {rank} received: {data}")
// For comparison, the same exchange written in C++ with blocking MPI_Send / MPI_Recv:
#include <iostream>
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        // Process 0 sends a message to Process 1
        int data = 42;
        MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        // Process 1 receives the message from Process 0
        int received_data;
        MPI_Recv(&received_data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::cout << "Process 1 received data: " << received_data << std::endl;
    }

    MPI_Finalize();
    return 0;
}
Dynamic scheduling in MPI
#include <iostream>
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Placeholder condition; in a real program this is decided at run time,
    // and sender and receiver must agree on it to avoid a hang.
    bool some_condition = true;

    if (rank == 0) {
        // Process 0 dynamically determines whether to send a message
        if (some_condition) {
            int data = 42;
            MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        }
    } else if (rank == 1) {
        // Process 1 dynamically determines whether to receive a message
        if (some_condition) {
            int received_data;
            MPI_Recv(&received_data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            std::cout << "Process 1 received data: " << received_data << std::endl;
        }
    }

    MPI_Finalize();
    return 0;
}
Advantages of IPC
□ Message passing IPC does introduce overhead from message copying and context switching between
processes, so careful design and optimization are needed to minimize this overhead and maximize
performance. Nevertheless, message passing IPC offers several advantages, including:
□ Isolation: Processes have separate address spaces, providing
better isolation and security.
□ Modularity: Processes can be developed and maintained
independently, promoting modularity and code reuse.
□ Scalability: Message passing can be scaled across multiple
systems, making it suitable for distributed computing
environments.
□ Flexibility: Different communication patterns can be
implemented, such as one-to-one, one-to-many, or
many-to-many communication.
Remote Procedure call
Lecture # 11
Remote Procedure call
□ A remote procedure call (RPC) is a protocol that allows
a computer program to cause a subroutine or procedure
to execute in another address space (commonly on
another computer on a shared network) without the
programmer explicitly coding the details for this remote
interaction.
RPC Architecture
□ RPC architecture has mainly five components of the
program:
□ Client
□ Client Stub
□ RPC Runtime
□ Server Stub
□ Server
How RPC Works?
□ The following steps take place during the RPC process:
□ Step 1) The client, the client stub, and one instance of the RPC Runtime execute on
the client machine.
□ Step 2) The client calls the client stub by passing parameters in the usual way. The
client stub resides within the client's own address space. It packs (marshals) the
parameters and asks the local RPC Runtime to send the resulting message to the server stub.
□ Step 3) In this stage, RPC is accessed by the user through what looks like a regular local
procedure call. The RPC Runtime manages the transmission of messages across the
network between client and server. It also performs retransmission (if a
message is lost), acknowledgment, routing, and encryption.
□ Step 4) After the server procedure completes, control returns to the server stub,
which packs (marshals) the return values into a message. The server stub then
hands the message to the transport layer.
□ Step 5) In this step, the transport layer sends the result message back to the
client-side transport layer, which passes the message to the client stub.
□ Step 6) In this stage, the client stub unmarshals (unpacks) the return parameters
from the result message, and execution returns to the caller.
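The steps above can be seen in miniature with Python's standard xmlrpc library, which generates the client stub (ServerProxy) and the server-side dispatching automatically; the host, port and the add() procedure below are assumptions for illustration only.

# --- server side (illustrative sketch) ---
from xmlrpc.server import SimpleXMLRPCServer

def add(x, y):
    return x + y                        # the "service procedure"

server = SimpleXMLRPCServer(("localhost", 8000), allow_none=True)
server.register_function(add, "add")    # the library plays the role of server stub + dispatcher
server.serve_forever()

# --- client side (run in another process) ---
# import xmlrpc.client
# proxy = xmlrpc.client.ServerProxy("http://localhost:8000/")  # client stub
# print(proxy.add(2, 3))  # looks like a local call; marshalling and transport are hidden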
Parallel MPI and OpenMP Program with Reduction
□ #include <iostream>
□ #include <mpi.h>
□ #include <omp.h>
□ using namespace std;
□ int main(int argc, char *argv[]) {
□     int rank, size;
□
□     // Request thread support, since OpenMP threads run inside each MPI process
□     int provided;
□     MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
□     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
□     MPI_Comm_size(MPI_COMM_WORLD, &size);
□
□     int total_threads = 4; // Total threads per MPI process
□     int local_sum = 0;     // Per-process sum of the thread results
□
□     // Parallel region with OpenMP; the reduction clause combines the thread results
□     #pragma omp parallel num_threads(total_threads) reduction(+:local_sum)
□     {
□         int thread_id = omp_get_thread_num();
□         int num_threads = omp_get_num_threads();
□
□         // Compute some parallel task
□         local_sum += thread_id + rank * num_threads;
□     }
□
□     // Reduce the per-process sums to a global result (one MPI call per process,
□     // outside the OpenMP region, so the collective is not invoked by every thread)
□     int global_result = 0;
□     MPI_Reduce(&local_sum, &global_result, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
□
□     // Print global result from root process
□     if (rank == 0) {
□         cout << "Global result: " << global_result << endl;
□     }
□
□     MPI_Finalize();
□
□     return 0;
□ }
Characteristics of RPC
□ Here are the essential characteristics of RPC:
□ The called procedure is in another process, which is likely
to reside in another machine.
□ The processes do not share address space.
□ Parameters are passed by value only.
□ RPC executes within the environment of the server
process.
□ It doesn’t offer access to the calling procedure’s
environment.
Features of RPC
□ Here are the important features of RPC:
□ Simple call syntax
□ Offers known semantics
□ Provide a well-defined interface
□ It can communicate between processes on the same or
different machines
Communication Protocols For RPCs
□ The following are the communication protocols that are
used:
□ Request Protocol
□ Request/Reply Protocol
□ The Request/Reply/Acknowledgement-Reply Protocol
Request Protocol:
□ The Request Protocol is also known as the R protocol.
□ It is used in Remote Procedure Call (RPC) when a request is made from the calling procedure to the called procedure.
After execution of the request, a called procedure has nothing to return and there is no confirmation required of the
execution of a procedure.
□ Because there is no acknowledgement or reply message, only one message is sent from client to server.
□ A reply is not required so after sending the request message the client can further proceed with the next request.
□ May-be call semantics are provided by this protocol, which eliminates the requirement for retransmission of request
packets.
□ Asynchronous Remote Procedure Call (RPC) employs the R protocol to enhance the combined performance of the
client and server. With this protocol, the client need not wait for a reply from the server and the server does not
need to send one.
□ In an asynchronous RPC, if communication fails, the RPC Runtime does not retry the request. TCP is therefore a
better option than UDP, since it is connection-oriented and handles retransmission itself.
□ In most cases, asynchronous RPC over an unreliable transport protocol is used to implement periodic update services.
One of its applications is the distributed window system.
Request/Reply Protocol:
□ The Request-Reply Protocol is also known as the RR protocol.
□ It works well for systems that involve simple RPCs.
□ In simple RPCs, the parameters and result values fit in a single packet
buffer, and the duration of the call and the time
between calls are both brief.
□ This protocol is based on the idea of implicit acknowledgements:
□ Here, a reply from the server is treated as the acknowledgement
(ACK) for the client's request message, and a client's following call
is considered an acknowledgement (ACK) of the server's reply
message to the previous call made by the client.
□ To deal with failure handling e.g. lost messages, the timeout
transmission technique is used with RR protocol.
□ If a client does not get a response message within the
predetermined timeout period, it retransmits the request message.
□ Exactly-once semantics is provided by servers as responses get
held in reply cache that helps in filtering the duplicated request
messages and reply messages are retransmitted without processing
the request again.
□ If there is no mechanism for filtering duplicate messages, then
at-least-once call semantics is provided by the RR protocol in
combination with timeout-based retransmission.
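A rough sketch of the RR protocol over UDP is shown below (the message format, port and timeout values are assumptions): the client retransmits the request on timeout, the reply doubles as the acknowledgement, and the server's reply cache filters duplicate requests so the operation is not executed again.

# RR protocol sketch over UDP (illustrative only).
import socket, json

HOST, PORT, TIMEOUT = "localhost", 9000, 1.0

def server():
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((HOST, PORT))
    reply_cache = {}                                   # request_id -> cached reply bytes
    while True:
        data, addr = sock.recvfrom(4096)
        req = json.loads(data.decode())
        rid = req["id"]
        if rid in reply_cache:                         # duplicate request: resend cached reply,
            sock.sendto(reply_cache[rid], addr)        # do NOT execute the operation again
            continue
        result = req["x"] + req["y"]                   # execute the requested operation once
        reply = json.dumps({"id": rid, "result": result}).encode()
        reply_cache[rid] = reply
        sock.sendto(reply, addr)

def do_operation(request_id, x, y, retries=3):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(TIMEOUT)
    msg = json.dumps({"id": request_id, "x": x, "y": y}).encode()
    for _ in range(retries):
        sock.sendto(msg, (HOST, PORT))                 # (re)transmit the request
        try:
            data, _ = sock.recvfrom(4096)              # the reply doubles as the ACK
            return json.loads(data.decode())["result"]
        except socket.timeout:
            continue                                   # timeout: retransmit
    raise TimeoutError("no reply from server")

# Usage: run server() in one process; do_operation("req-1", 2, 3) from another returns 5.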
The Request/Reply/Acknowledgement-Reply
Protocol:
□ This protocol is also known as the RRA protocol
(request/reply/acknowledge-reply).
□ In the RR protocol, exactly-once semantics relies on responses being
held in the servers' reply cache; since the server never learns whether
a reply was delivered, it does not know when cached replies can safely
be discarded.
□ The RRA (Request/Reply/Acknowledgement-Reply)
Protocol is used to remove this drawback of the RR
(Request/Reply) Protocol.
□ In this protocol, the client acknowledges the receipt of
reply messages, and only when the server gets back the
acknowledgement from the client does it delete the
information from its cache.
□ Because the reply acknowledgement message may be
lost at times, the RRA protocol requires unique, ordered
message identifiers. These keep track of the
series of acknowledgements that has been sent.
Complicated RPCs
□ RPCs that involve long-duration calls or large gaps
between calls.
□ RPCs that involve parameters (arguments) and/or result
values that are too large to fit in a single datagram packet.
RPCs that involve long-duration calls or large gaps
between calls:
□ The client probes the server regularly: After submitting
a request message to the server, the client
periodically sends a probe packet, which the server must
acknowledge. If a communication failure occurs, the exception status
is communicated to the corresponding user. Each probe
packet contains the message identifier from the initial request
message.
□ The server generates an acknowledgement regularly: If the
generation of the next packet (the reply) by the server is delayed beyond
the expected retransmission time interval, the server generates an
acknowledgement on its own. Hence, during a long-duration call,
many acknowledgements may be generated by the server, since the
number of acknowledgements grows with the call duration.
If the client has not received a response or an acknowledgement
from the server within the set interval of time, it concludes
that either the server has crashed or a failure has occurred on
the client side; in case of a communication failure, the user is
alerted about the exception condition.
RPCs that involve parameters/arguments and/or result
values that are too large to fit in a single datagram packet:
□ RPCs with Long Messages: To handle such an RPC, employ
many physical RPCs for a single logical RPC. The sending of
data in each physical RPC is made in the size of a single
datagram packet. This technique is inefficient since each RPC
incurs a set amount of overhead regardless of the quantity of
data transmitted.
□ Multidatagram Messages: Multidatagram messages are
another approach for dealing with complicated RPCs in this
category. The long RPC parameters (arguments) or result
values are divided into many packets, which are then
sent as a multidatagram message. All the packets of a
multidatagram message share a single acknowledgement packet,
which enhances communication performance.
Advantages of RPC
□ Here are the pros/benefits of RPC:
□ RPC helps clients communicate with servers through the conventional use of
procedure calls in high-level languages.
□ RPC is modeled on the local procedure call, but the called procedure is
most likely executed in a different process and usually on a different computer.
□ RPC supports both process-oriented and thread-oriented models.
□ RPC hides the internal message-passing mechanism from the user.
□ The effort needed to re-write and re-develop code is minimal.
□ Remote procedure calls can be used in both distributed and local
environments.
□ It omits many of the protocol layers to improve performance.
□ RPC provides abstraction. For example, the message-passing nature of network
communication remains hidden from the user.
Disadvantages of RPC
□ Here are the cons/drawbacks of using RPC:
□ Remote Procedure Call passes parameters by value only;
passing pointer values is not allowed.
□ Remote procedure call (and return) time (i.e., the overhead)
can be significantly higher than that of a local procedure.
□ This mechanism is highly vulnerable to failure, as it involves a
communication system, another machine, and another
process.
□ The RPC concept can be implemented in different ways; there is
no single standard.
□ RPC does not offer much flexibility with respect to hardware
architecture, as it is mostly interaction-based.
□ The cost of the process is increased because of remote
procedure calls.
External Data Representation (IPC)
CORBA’s CDR
Lecture # 12
External data representation
▪ Information stored in running programs
o represented as data structures – for example, by sets of
interconnected objects.
▪ The information in messages consists of sequences of
bytes.
▪ The data structures must be flattened (converted to a
sequence of bytes) before transmission and rebuilt on
arrival.
External data representation
Why?
1. Not all computers store primitive values such as
integers in the same order.
□ There are two variants for the ordering of integers:
o the big-endian order, in which the most significant byte comes
first;
o and little-endian order, in which it comes last.
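A quick way to see the byte-ordering difference is with Python's struct module (an illustrative sketch, not part of the slides):

# Big-endian vs. little-endian representation of the same integer.
import struct

value = 1
print(struct.pack(">i", value).hex())  # big-endian 32-bit int:    00000001
print(struct.pack("<i", value).hex())  # little-endian 32-bit int: 01000000
# An agreed external representation (e.g. CORBA CDR) fixes the order on the wire
# so both sides can rebuild the value correctly.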
Fig 3: Request-reply
message structure
CS-482
Failure Model of Request-Reply Protocols
▪ If the three primitives doOperation, getRequest and
sendReply are implemented over UDP datagrams, then
they suffer from the same communication failures.
▪ That is:
a) They suffer from omission failures.
b) Messages are not guaranteed to be delivered in sender
order.
▪ In addition, the protocol can suffer from the failure of
processes.
Failure Model of Request-Reply Protocols
▪ We assume that processes have crash failures.
▪ That is, when they halt, they remain halted – they do not
produce Byzantine behavior.
▪ To allow for occasions when
o a server has failed or
o a request or reply message is dropped,
o doOperation uses a timeout when it is waiting to get the
server’s reply message.
▪ The action taken when a timeout occurs depends upon the
delivery guarantees being offered.
Timeouts
▪ There are various options as to what doOperation can do after a
timeout.
▪ The simplest option is
o to return immediately from doOperation
o with an indication to the client that the doOperation has
failed.
▪ Not the usual approach
o the timeout may have been due to the request or reply
message getting lost and
o in the latter case, the operation will have been performed.
Timeouts
▪ To compensate for the possibility of lost messages,
o doOperation sends the request message repeatedly until either it
gets a reply or
o it is reasonably sure that the delay is due to lack of response
from the server rather than to lost messages.
At-least-once semantics
□ Can be achieved by the retransmission of request messages,
which masks the omission failures of the request or result
message.
▪ the invoker receives either a result, in which case the
procedure was executed at least once, or an exception
informing it that no result was received.
▪ At-least-once semantics can suffer from the following types
of failure:
o crash failures when the server containing the remote
procedure fails;
At-least-once semantics
o arbitrary failures – in cases when the request message
is retransmitted, the remote server may receive it and
execute the procedure more than once, possibly causing
wrong values to be stored or returned.
o e.g., an operation to increase a bank balance by Rs
5000/- should be performed only once; if it were
repeated, the balance would grow and grow!
▪ At-least-once call semantics may be acceptable if the
operations in a server are idempotent.
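A tiny illustrative sketch of why idempotence matters under at-least-once semantics (the deposit/set_balance operations are invented for illustration):

# Idempotent vs. non-idempotent operations under duplicated (retransmitted) requests.
balance = 10000

def deposit(amount):          # NOT idempotent: replaying the request changes the result
    global balance
    balance += amount

def set_balance(amount):      # idempotent: executing it twice gives the same final state
    global balance
    balance = amount

deposit(5000); deposit(5000)              # duplicated request executed twice -> 20000, wrong
set_balance(15000); set_balance(15000)    # duplicate execution is harmless   -> 15000
print(balance)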
At-most-once semantics
□ This semantics can be achieved by using all of the
fault-tolerance measures outlined in Fig 2.
o the caller receives either a result, in which case the procedure
will have been executed once or
o an exception informing it that no result was received.
▪ As in the previous case, the use of retries masks any omission
failures of the request or result messages.
▪ This set of fault tolerance measures prevents arbitrary failures
by ensuring that for each RPC a procedure is never executed
more than once.
▪ Sun RPC provides at-least-once call semantics.
Transparency
▪ The originators of RPC, Birrell and Nelson [1984], aimed to make
RPCs as much like local procedure calls as possible, with no
distinction in syntax between a local and a remote procedure call.
▪ All the necessary calls to marshalling and message-passing
procedures were hidden from the programmer making the call.
▪ Although request messages are retransmitted after a timeout,
o this is transparent to the caller
o to make the semantics of remote procedure calls like that of
local procedure calls.
Transparency
▪ More precisely, RPC strives to offer at least location and
access transparency,
o hiding the physical location of the (potentially remote)
procedure and
o also accessing local and remote procedures in the same way.
▪ Middleware can also offer additional levels of transparency to
RPC.
▪ However, remote procedure calls are more vulnerable to failure
than local ones, since they involve a network, another computer
and another process.
▪ This requires that clients making remote calls are able to recover
from such situations.
Transparency
▪ The latency of RPC is several orders of magnitude greater
than that of a local one.
▪ This suggests that programs making remote calls should
minimize remote interactions.
▪ The designers of Argus suggested that a caller should be
able to abort an RPC that is taking too long in such a way
that it has no effect on the server.
o To allow this, the server would need to be able to restore
things to how they were before the procedure was called.
Transparency
▪ RPCs also require a different style of parameter passing, as
discussed above.
▪ In particular, RPC does not offer call by reference.
▪ Waldo et al. [1994] say that the difference between local and
remote operations should be expressed at the service interface.
▪ Other systems went further by arguing that the syntax of a
remote call should be different from that of a local call:
o in the case of Argus, the language was extended to make
remote operations explicit to the programmer.
Transparency
▪ The choice as to whether RPC should be transparent is also
available to the designers of IDLs.
o For example, in some IDLs, a remote invocation may throw
an exception when the client is unable to communicate with
a remote procedure.
o This requires that the client program handle such exceptions,
allowing it to deal with such failures.
o An IDL can also provide a facility for specifying the call
semantics of a procedure.
o This can help the designer of the service – for example, if at-
least-once call semantics is chosen to avoid the overheads of
at-most-once, the operations must be designed to be
idempotent.
Implementation of RPC
▪ The server process contains a dispatcher together with one server stub procedure and
one service procedure for each procedure in the service interface.
▪ The dispatcher selects one of the server stub procedures according to the procedure
identifier in the request message.
▪ The server stub procedure then unmarshals the arguments in the request message, calls
the corresponding service procedure and marshals the return values for the reply
message.
▪ The service procedures implement the procedures in the service interface.
Implementation of RPC
▪ The client and server stub procedures and the dispatcher can be generated
automatically by an interface compiler from the interface definition of the
service.
▪ RPC is generally implemented over a request-reply protocol.
▪ Choices of invocation semantics – at-least-once or at-most-once.
▪ To achieve this, the communication module will implement the choices in terms
of retransmission of requests, dealing with duplicates and retransmission of
results.
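A compact sketch of the dispatcher-and-stubs structure described above, written in Python with JSON as a stand-in marshalling format; the procedure names and message layout are assumptions, not a real RPC framework.

import json

# Service procedures (implement the service interface)
def get_balance(account):
    return 10000

def deposit(account, amount):
    return amount

# Server stubs: unmarshal the arguments, call the service procedure, marshal the result
def get_balance_stub(args):
    return json.dumps(get_balance(*json.loads(args)))

def deposit_stub(args):
    return json.dumps(deposit(*json.loads(args)))

# Dispatcher: selects a server stub from the procedure identifier in the request message
STUBS = {"get_balance": get_balance_stub, "deposit": deposit_stub}

def dispatch(request_message):
    req = json.loads(request_message)
    return STUBS[req["procedure"]](req["args"])

# Example request message as it might arrive from the client stub
print(dispatch(json.dumps({"procedure": "deposit", "args": json.dumps(["acct-1", 500])})))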
Remote Method Invocation (RMI)
▪ RMI: Method invocations between objects in
different processes, whether in the same computer or
not.
o Closely related to RPC but extended into the world of distributed
objects.
▪ LMI: Method invocations between objects in the
same process.
Remote Method Invocation (RMI)
▪ The commonalities between RMI and RPC:-
1. Both support programming with interfaces.
2. They are both typically constructed on top of request-reply
protocols and can offer a range of call semantics.
3. Both offer a similar level of transparency – i.e., local and
remote calls employ the same syntax
❖ but remote interfaces typically expose the
distributed nature of the underlying call, e.g. by
supporting remote exceptions.
Differences between RMI and RPC
1. The programmer is able to use the full expressive power of
OOP in the development of distributed systems applications.
2. Building on the concept of object identity in OO systems,
o all objects in an RMI-based system have unique object
references (whether they are local or remote),
3. RMI allows the programmer to pass parameters not only by
value but also by object reference.
o The remote end can then access this object using RMI.
o RMI thus offers significantly richer parameter-passing
semantics than in RPC.
Design issues for RMI
▪ RMI shares the same design issues as RPC in terms of
o programming with interfaces,
o call semantics and
o level of transparency.
▪ The key added design issue relates to achieving the
transition from objects to distributed objects.
□ Distributed objects are objects physically distributed across
different processes or computers in a distributed system.
Design issues for RMI
▪ Distributed object systems adopt client-server architecture.
o In RMI, the client’s request to invoke a method of an object is
sent in a message to the server managing the object.
o The method of the object is executed at the server and the result is
returned to the client in another message.
o To allow for chains of related invocations, objects in servers may
become clients of objects in other servers.
▪ Distributed objects can assume other architectural models.
o e.g., objects can be replicated for the usual benefits of fault
tolerance and enhanced performance, and
o objects can be migrated for enhancing performance and
availability.
Design issues for RMI
▪ The possibility of concurrent RMIs from objects in
different computers.
o Therefore the possibility of conflicting accesses arises.
o e.g., objects may use synchronization primitives such as
condition variables to protect access to their instance
variables.
▪ Another advantage:
o an object may be accessed via RMI, or
o it may be copied into a local cache and accessed directly.
The distributed object model
▪ Each process contains a collection of objects,
o some of which can receive both local and remote invocations,
o whereas the other objects can receive only local invocations, as shown in
Fig 2.
o e.g., the objects B and F in Fig 2 must have remote interfaces.
o Objects in other processes can invoke only the methods that
belong to its remote interface, as shown in Fig 3.
▪ Local objects can invoke the methods in the remote interface as well
as other methods implemented by a remote object.
▪ Note that remote interfaces, like all interfaces, do not have constructors.
▪ The CORBA system provides an interface definition language (IDL),
which is used for defining remote interfaces.
▪ The classes of remote objects and the client programs may be
implemented in any language for which an IDL compiler is available,
such as C++, Java or Python.
Fig 1b
Fig 2
▪ The top line shows the time, Told, required to execute some program P
on the system before any changes are made.
▪ Now assume that some change is made to the system that reduces
execution time for some operations by a factor of q.
▪ The program now runs in time Tnew, where Tnew < Told, as shown in
the bottom line.
Amdahl’s law
▪ Assume, however, that there are many other operations in the
program that are unaffected by this change.
Fig 2
▪ The first component, αTold, is the execution time of that fraction
of the program that is unaffected by the change.
▪ The second component of Tnew, which is the remaining fraction
1-α of the original execution time, has its performance improved
by the factor q.
▪ Thus, the time required for this component is (1-α) Told/q.
▪ The overall speedup caused by this improvement is then found
to be
□ S = Told / Tnew = Told / (αTold + (1-α)Told/q) = 1 / (α + (1-α)/q).
Amdahl’s law
▪ This equation can be used to calculate the overall speedup
obtained due to some improvement in the system, assuming
that q and α can be determined.
▪ However, it is interesting to ask what happens as the impact
on performance of the improvement becomes large, that is,
as q → ∞.
▪ It is easy to show that, in the limit as q → ∞, (1-α) Told/q
→ 0.
▪ Thus, the overall speedup, S, is bounded by 1/α.
▪ That is,
▪ S = 1 / (α + (1-α)/q) → 1/α as q → ∞.
Amdahl’s law
▪ This result says that, no matter how much one type of
operation in a system is improved,
▪ the overall performance is inherently limited by the
operations that are unaffected by the improvement.
▪ For example, the best speedup that could be obtained in a
parallel computing system with p processors is p.
▪ However, if 10% of a program cannot be executed in
parallel, the overall speedup when using the parallel machine
is at most 1/α = 1/0.1=10, even if an infinite number of
processors were available.
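Amdahl's law is easy to check numerically; the short Python snippet below evaluates S = 1 / (α + (1-α)/q) and shows the 1/α bound for the 10% example above.

# Amdahl's law as a quick calculation (alpha is the unaffected/serial fraction).
def amdahl_speedup(alpha, q):
    return 1.0 / (alpha + (1.0 - alpha) / q)

print(amdahl_speedup(0.1, 10))    # 10% serial, 10x improvement -> about 5.26
print(amdahl_speedup(0.1, 1e9))   # q -> infinity               -> approaches 1/0.1 = 10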
Amdahl’s law
▪ An obvious corollary to Amdahl's law
o any system designer or programmer should concentrate
on making the common case fast.
▪ That is, operations that occur most often will have the
largest value of α.
▪ Thus, improving these operations will have the biggest
impact on overall performance.
▪ Interestingly, the common cases also tend to be the
simplest cases.
▪ As a result, optimizing these cases first tends to be easier
than optimizing the more complex, but rarely used, cases.
Scaling Amdahl’s law
▪ One of the major criticisms concerning Amdahl’s law has been
that it emphasizes the wrong aspect of the performance
potential of parallel-computing systems.
▪ The argument is that purchasers of parallel systems want to solve
larger problems within the available time.
▪ Following this line of argument leads to the following “scaled” or
“fixed-time” version of Amdahl's law.
▪ It is common to judge the performance of an application
executing on a parallel system by
o comparing the parallel execution time with p processors, Tp,
o with the time required to execute the equivalent sequential
version of the application program, T1,
o using the speedup Sp = T1/Tp.
Scaling Amdahl’s law
▪ With the fixed-time interpretation, however, the
assumption is that
o there is no single-processor system that is capable of
executing an equivalent sequential version of the
parallel application.
o The single-processor may not have a large enough
memory, for example, or
o the time required to execute the sequential version
would be unreasonably long.
Scaling Amdahl’s law
▪ In this case, the parallel-execution time is divided into
o the parallel component, 1-α, and
o the inherently sequential component, α, giving
o Tp = αT1 + (1-α)T1 as shown in fig 3 below.
Fig 3
▪ T1 is the time in which user wants to run application.
▪ Since no single-processor system exists that is capable of executing an
equivalent problem of this size,
o it is assumed that the parallel portion of the execution time would increase
by a factor of p
o if it were executed on a hypothetical single-processor system.
Scaling Amdahl’s law
□ Scaled speedup = (hypothetical single-processor time) / T1
□ Ss = (αT1 + (1-α)pT1) / T1 = α + (1-α)p
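The scaled (fixed-time) form above can likewise be evaluated directly; the example values below (α = 0.2, p = 8) are illustrative only.

# Scaled (fixed-time) speedup: Ss = alpha + (1 - alpha) * p.
def scaled_speedup(alpha, p):
    return alpha + (1.0 - alpha) * p

print(scaled_speedup(0.2, 8))   # 20% sequential, parallel part would grow 8x serially -> 6.6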
Homework 1
□ An industrial process simulation involves five steps
which are performed sequentially on system A. The
steps 1 and 2 take 1 and 2 minutes respectively
whereas steps 3 – 5 take 3 minutes each. Then system
A was enhanced by introducing some parallelism to
get a system B. The steps 3 to 5 on B now can be
executed in parallel so that they take an overall time
of 4 minutes whereas steps 1 and 2 are still to be
executed in sequence. Calculate speedup using
appropriate form of Amdahl’s law.
Homework 2
□ An industrial process simulation is to be executed on a
system in the available time, ta, of 10 min which
includes parallel execution time (on 10 processors) of
7.5 min. It has been estimated that the parallel portion
of the execution time would increase by a factor of 12
if it were executed on a hypothetical single-processor
system. Calculate the parallel speedup.
Homework 3
In designing a new computer system, we make an
enhancement that improves some mode of execution by a
factor of 10. This enhancement takes 50% of the time when the
enhanced mode is in use. (Recall that Amdahl’s law uses the
fraction of the original, unenhanced execution time to find
speedup)
a) What is the speedup that we have obtained by using this
fast mode?
b) What percentage of the original execution time has been
converted to fast mode?
Multi threading
Lecture # 19
Thread:
□ A thread, also known as a lightweight process, is the
smallest unit of processing that can be scheduled and
executed by an operating system. Threads are a
fundamental part of multithreading, where a single
process can have multiple threads running concurrently,
allowing for parallel execution of tasks within the same
application.
□ Key Characteristics of Threads
□ Shared Resources: threads of the same process share its address space, open files, and other resources.
□ Independent Execution: each thread has its own program counter, registers, and stack, and is scheduled independently.
□ Lightweight: creating and switching between threads is cheaper than creating and switching between processes.
Common Uses of Threads
□ User Interfaces: Keeping the UI responsive while
performing background operations.
□ Servers: Handling multiple client requests concurrently.
□ Real-time Systems: Executing multiple real-time tasks
in parallel.
□ Simulations: Running different parts of a simulation
simultaneously.
□ How Threads are Used in RPC Systems
□ Concurrent Request Handling: the server dispatches each incoming call to a separate thread, so many clients can be served at once (see the sketch below).
□ Improved Responsiveness: a client can issue a call in one thread while other threads keep the application responsive.
□ Resource Utilization: while one thread blocks on I/O or the network, other threads keep the CPU busy.
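A sketch of concurrent request handling in an RPC server: combining Python's ThreadingMixIn with SimpleXMLRPCServer serves each incoming call on its own thread; the port and the slow_lookup() procedure are assumptions for illustration.

from socketserver import ThreadingMixIn
from xmlrpc.server import SimpleXMLRPCServer
import threading, time

class ThreadedXMLRPCServer(ThreadingMixIn, SimpleXMLRPCServer):
    pass                                    # one thread per client request

def slow_lookup(key):
    time.sleep(1)                           # simulate a slow disk/database operation
    return f"value-for-{key} (served by {threading.current_thread().name})"

server = ThreadedXMLRPCServer(("localhost", 8001), allow_none=True)
server.register_function(slow_lookup, "slow_lookup")
server.serve_forever()                      # multiple clients can now be served concurrently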
Multithreading
□ Multithreading is a programming and execution model that allows multiple threads to be
created within a single process, sharing the same memory space but executing
independently. This can improve the performance of applications by enabling parallelism and
better resource utilization.
□ Key Concepts in Multithreading
□ Thread: A thread is the smallest unit of processing that can be performed in an operating
system. It has its own execution context, including its own stack, register set, and program
counter.
□ Process: A process is an instance of a program in execution. It contains one or more
threads, as well as its own memory space, file handles, and other resources.
□ Concurrency vs. Parallelism:
□ Concurrency: Multiple threads make progress within overlapping time periods. It can be achieved
on a single-core CPU by interleaving thread execution.
□ Parallelism: Multiple threads execute simultaneously, which requires multiple cores or processors.
□ Context Switching: The process of storing the state of a thread or process so that it can
be resumed from the same point later. This is managed by the operating system.
□ Synchronization: Techniques to control the access of multiple threads to shared
resources to avoid conflicts and ensure data consistency. Common synchronization
mechanisms include locks, semaphores, and monitors.
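The synchronization point above can be illustrated with a short Python sketch in which a lock protects a shared counter updated by several threads (the thread and iteration counts are arbitrary).

# A lock ensures that only one thread at a time updates the shared counter.
import threading

counter = 0
lock = threading.Lock()

def work(n):
    global counter
    for _ in range(n):
        with lock:              # mutual exclusion around the shared update
            counter += 1

threads = [threading.Thread(target=work, args=(100_000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)                  # always 400000 with the lock; may be less without it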
Multithreading models
□ Multithreading models refer to different strategies for
implementing and managing threads within a process. The
choice of model impacts the efficiency and behavior of
thread management, including how threads are created,
synchronized, and scheduled. The primary multithreading
models are:
□ Many-to-One Model
□ One-to-One Model
□ Many-to-Many Model
Many-to-One Model
□ In the many-to-one model, many user-level threads are
mapped to a single kernel thread. Thread management is
performed by the thread library in user space, which is
efficient but has significant limitations.
□ Advantages:
□ Efficient thread creation and management since they are done in
user space.
□ Low overhead for context switching between user-level threads.
□ Disadvantages:
□ If one thread makes a blocking system call, the entire process is
blocked because the kernel thread is blocked.
□ Only one thread can execute at a time, even on multiprocessor
systems.
□ Example:
□ Green threads in early versions of Java.
One-to-One Model
□ In the one-to-one model, each user-level thread maps to a
separate kernel thread. This model provides more
concurrency than the many-to-one model and is used by most
modern operating systems.
□ Advantages:
□ True parallelism on multiprocessor systems because each thread
can run on a different processor.
□ If one thread blocks, other threads can continue to run.
□ Disadvantages:
□ Creating a kernel thread for each user thread incurs a higher
overhead.
□ The number of threads per process may be limited by the operating
system.
□ Example:
□ POSIX threads (Pthreads), Windows threads.
Many-to-Many Model
□ In the many-to-many model, many user-level threads are mapped to
a smaller or equal number of kernel threads. The model allows the
operating system to create sufficient kernel threads to handle
multiple user threads efficiently.
□ Advantages:
□ Greater flexibility and efficiency compared to the other models.
□ User-level threads can be created and managed with less overhead.
□ If one thread blocks, the kernel can schedule another thread.
□ Disadvantages:
□ More complex to implement compared to the many-to-one and
one-to-one models.
□ Performance overhead due to the need to manage the mapping between
user-level and kernel-level threads.
□ Example:
□ Solaris, Windows with the Fibers library.
Support for multithreading
□ Support for multithreading can be provided at both the user level and the kernel
level. Each level has its own mechanisms for creating, managing, and synchronizing
threads, with different advantages and trade-offs.
□ User-Level Threads
□ User-level threads are managed by a user-level library or runtime, not the operating
system kernel. All thread operations, such as creation, scheduling, and synchronization,
are performed in user space.
□ Advantages:
□ Efficiency: User-level thread operations are fast because they do not involve system calls,
which can be slow.
□ Portability: User-level threading libraries can be implemented on any operating system,
as they do not rely on kernel support.
□ Customization: Developers have fine control over the scheduling and management
policies of threads.
□ Disadvantages:
□ Blocking System Calls: If a thread makes a blocking system call, the entire process is
blocked because the kernel is unaware of the user-level threads.
□ No True Parallelism: On multiprocessor systems, user-level threads cannot achieve
true parallelism because the kernel sees only one thread per process.
Sample problem
□ When comparing the performance of a single-threaded
and a multi-threaded file server, the following
assumptions are made. It takes 10ms to get a request,
dispatch it and do the rest of the necessary processing
involved in serving the file, assuming the file is cached in
main memory. If the file is not cached, a disk operation is
needed in which case an additional 50ms is required,
during which the thread sleeps. Assume that for one third
of all requests, the file can be served from the cache.
How many requests per second can the single-threaded
server handle?
Solution:
□ To determine how many requests per second a single-threaded
server can handle, let's break down the processing time for each
request based on whether the file is cached or not.
□ Cached requests (one third of all requests): 10 ms each.
□ Non-cached requests (two thirds of all requests): 10 ms + 50 ms disk wait = 60 ms each.
Solution:
□ Average time per request = (1/3)(10 ms) + (2/3)(60 ms) ≈ 43.33 ms.
□ Requests per second = 1000 ms / 43.33 ms ≈ 23.08.
Conclusion
□ The single-threaded server can handle approximately 23.08 requests per second.
CS-482 Parallel And Distributed Computing Test 1
Name: _____________________________________ Roll No.: ___________________________________
Question # 01: Examine array processors and their characteristic features, highlighting their role in
parallel computing and data processing tasks. (CLO-01)
Question # 02: Can you analyze architectural classification schemes in computer architecture, detailing
their significance and various categories? (CLO-02)
Paper B
Question # 01: Examine the principles of vector processing and how they underpin computational
efficiency and performance enhancements? (CLO-01)
Question # 02: How do the principles of pipelining and vector processing contribute to enhancing the
performance of modern processors? Analyze the key concepts and mechanisms involved. (CLO-02)
Paper C
Question # 01: Examine multi-processor architecture and discuss its both types? (CLO-01)
Question # 02: Can you examine the design considerations and efficiency of parallel algorithms? Analyze
their advantages, challenges, and applications in solving complex computational problems. (CLO-02)
Quiz 3
Paper A
Q1. Demonstrate reasons for using external data representation in distributed systems?(CLO-1) (2.5
marks)
Q2. Detect which of the following is a correct statement about Java object serialization? (CLO-2) (1
marks)
A. Serialization always results in smaller object sizes compared to the original.
B. Strings and characters are always serialized using UTF-16.
C. Deserialization assumes prior knowledge of object types.
D. Serialization doesn't support handling object references.
Q3. Explain which remote invocation paradigm extends the conventional procedure call model to
distributed systems? (CLO-2) (1.5 marks)
Paper B
Q1. Define the purpose of marshalling in external data representation? (CLO1) (2.5 marks)
Q2. Explain what does deserialization involve in Java object serialization? (CLO2) (1 marks)
A. Converting a byte stream to an object.
B. Encrypting the serialized object.
C. Serializing the object.
D. Converting an object to a byte stream.
Q3. Explain in request-reply protocols, why is asynchronous communication useful? (CLO-2) (1.5 marks)
Paper C
Q1. Demonstrate what distinguishes Remote Method Invocation (RMI) from Remote Procedure Call
(RPC)? (CLO-1) (2.5 marks)
Q2. Confirm in CORBA's Common Data Representation (CDR), how are primitive values transmitted?
(CLO-2) (1 marks)
A. Always in little-endian order.
B. Always in big-endian order.
C. Depending on the recipient's preference.
D. As ASCII characters.
Q3. Explain in Java, which interface must a class implement to enable its objects to be serialized? (CLO-
2) (1.5 marks)
assignment 1
Question 1 (a): A client attempts to synchronize with a time server. It records the following round-trip times
and timestamps returned by the server:
25 11:27:14.321
21 11:27:16.589
32 11:27:19.247
i) Which of these times should the client use to set its clock?
ii) Estimate the relative accuracy of the setting with respect to the server's clock.
iii) To what time should the client set its clock, considering the calculated server times and potential
averaging?
iv) If it is known that the minimum message transmission time is 6 milliseconds, recalculate the values
in (b) and (c) above, considering if it changes the answer.
Examine how the minimum message transmission time influences the accuracy of clock synchronization
and the choice of the reference time? (CLO -1)
Question 2 (a):
Consider the space-time diagram of the distributed system below:-
a) Redraw the above diagram and assign lamport time-stamps to different events
b) Again redraw the above diagram and assign vector time-stamps to different events
Based on the causal relationships between these events using Lamport timestamps and vector timestamps.
Analyze both time- stamping techniques and write pros and cons. (CLO -2)
Assignment 2
Question 01: Given below is the definition of the class Project and two objects of the same class.
Draw the serialized form of the object. Apply Java Object Serialization procedure. Use 8 byte
version. (CLO-1)
Question 02: Explain how the following code snippet can potentially cause a deadlock, and
propose a solution to prevent this problem. (CLO-2)
#include <iostream>
#include <thread>
#include <chrono>
#include <mutex>
using namespace std;
int main() {
mutex mut1, mut2;
thread t1([&] { deadlock(mut1, mut2); });
thread t2([&] { deadlock(mut2, mut1); });
t1.join();
t2.join();
return 0;
}
Mid term A
Q1) [CLO-1] [5 Marks]
Explain Flynn's classification in computer architecture and demonstrate its types with examples?
Mid term B
Q1) [CLO-1] [5 Marks]
Can you define parallel and distributed computing and demonstrate their applications by highlighting at least
five different aspects where they are utilized? Additionally, estimate how these computing paradigms contribute
to improving computational efficiency and scalability in various domains?
Q2) [CLO-2] [5 Marks]
Explain Peterson’s Solution algorithm, its approach in ensuring mutual exclusion and compare its effectiveness
in preventing race conditions with other methods. Lastly, evaluate the advantages of Peterson’s Solution in
solving the requirements of the critical section problem.
Q3) [CLO-1] [5 Marks]
Define mutex and semaphore in the context of concurrent programming and demonstrate their algorithm?
Additionally, could you explain their types and how they are applied to synchronize access to shared resources?
Lastly, estimate the effectiveness of mutex and semaphore in ensuring thread safety and preventing race
conditions in concurrent systems.
Q4) [CLO-2] [5 Marks]
Can you elaborate on conflicts of serializability of transactions, comparing them to other types of transaction
conflicts? Differentiate the types of serializability conflicts that arise and how they differ from conflicts like
deadlocks or livelocks. Additionally, evaluate the impact of conflicts of serializability on database performance
and integrity.
NED UNIVERSITY OF ENGINEERING & TECHNOLOGY
FINAL YEAR (BACHELOR OF SCIENCE IN COMPUTER
SCIENCE)
SPRING SEMESTER EXAMINATIONS 2024
BATCH 2020
Dated:23-JUL-2024
Time: 3 Hours
Max.Marks:60
Parallel & Distributed Computing - CS-482
1. Classify the types of parallel memory architecture, defining any two types and explain advantages and
disadvantages associated with each? (CLO-1, 5 marks)
2. Define the following networking architectures based on their key characteristics:
a) Client-Server Architecture
b) Peer-to-Peer Architecture
For each architecture, provide a concise definition and classify it by identifying its primary
features. (CLO-1, 4 marks)
3. Provide a concise definition of array processors and mention its key characteristics? (CLO-1, 4 marks)
4. An industrial process simulation is to be executed on a system in the available time, ta, of 10 min which
includes parallel execution time (on 10 processors) of 7.5 min. It has been estimated that the parallel
portion of the execution time would increase by a factor of 12 if it were executed on a hypothetical
single-processor system. Calculate the parallel speedup. (CLO-1, 5 marks)
5. When comparing the performance of a single-threaded and a multi-threaded file server, the following
assumptions are made. It takes 10ms to get a request, dispatch it and do the rest of the necessary
processing involved in serving the file, assuming the file is cached in main memory. If the file is not
cached, a disk operation is needed in which case an additional 50ms is required, during which the thread
sleeps. Assume that for one third of all requests, the file can be served from the cache. Solve how many
requests per second can the single-threaded server handle? (CLO-1, 6 marks)
6. Examine how the following code snippet can potentially cause a conflict, mention the conflict and
propose a solution to prevent this problem. (CLO-1, 6 marks)
#include <iostream>
#include <thread>
#include <chrono>
#include <mutex>
using namespace std;
void determineConflict(mutex& mA, mutex& mB) {
mA.lock();
cout << "Thread acquired mA" << endl;
this_thread::sleep_for(chrono::milliseconds(100)); // Simulate some work
mB.lock();
cout << "Thread acquired mB" << endl;
mA.unlock();
mB.unlock();
}
int main() {
mutex mutexA, mutexB;
thread t1([&] { determineConflict(mutexA, mutexB); });
thread t2([&] { determineConflict(mutexB, mutexA); });
t1.join();
t2.join();
return 0;
}
7. Analyze the concept of RPC and RMI and explain the term "marshalling" in the context of Remote
Method Invocation (RMI) or Remote Procedure Call (RPC), and illustrate its role in data
transmission between client and server. (CLO-2, 4 marks)
8. Consider distributing a file of 𝐹 = 15 𝐺𝑏𝑖𝑡𝑠 to 𝑁 peers. The server has an upload rate of 𝑢𝑠 =
15 𝑀𝑏𝑝𝑠, and each peer i has a download rate of 𝑑𝑖 = 1.5 𝑀𝑏𝑝𝑠 and an upload rate of 𝑢𝑖 . Complete
the chart giving the minimum distribution time (in hours)
a) For different values of N for client-server distribution.
b) For each of the combinations of N and u for P2P distribution.
After the calculations, analyze the difference between the performance of C-S and P2P systems
for the same number of nodes. (CLO-2, 12 mark)
Chart (fill in the minimum distribution time for each case):
Client-server: N = 20 and N = 200
P2P: (N = 20, ui = 600 kbps) and (N = 200, ui = 1.5 Mbps)
9. A master computer is coordinating the internal synchronization of five slave computers using the
Berkeley algorithm. At a specific instance, the master polls the slaves 1-5 for their current clock
values. Suppose the slaves respond with the values of 210.3, 212.6, 207.8, 209.5, and 208.2 units
respectively. The master finds its own clock value as 211.0 units. Assuming all clocks are correct:
a) Ignoring round-trip time: Analyze the Berkeley algorithm to synchronize all clocks in the
system.
b) Considering round-trip times: Given the round-trip times for the slaves are 6.5, 5.8, 7.2, 5.3,
and 6.8 units respectively, Analyze the Berkeley algorithm for internal clock synchronization.
Solve the calculations for both parts (a) and (b), showing the steps involved in adjusting the slave
clocks based on the master's clock value. (CLO-2, 10 marks)
10. Outline the failure model associated with Request-Reply Protocols. (CLO-2, 4 marks)
Model solution
1. Classify the types of parallel memory architecture, defining any two types and explain one advantage and
disadvantage associated with each? (CLO-1) 5 marks
Shared Memory: (0.5 + 1 + 0.5 + 0.5 = 2.5 marks)
Multiple processors can operate independently, but share the same memory resources.
Changes in a memory location caused by one CPU are visible to all processors.
Advantages:
Global address space provides a user-friendly programming perspective to memory
Fast and uniform data sharing due to proximity of memory to CPUs
Disadvantages:
Lack of scalability between memory and CPUs. Adding more CPUs increases traffic on the
shared memory CPU path
Programmer responsibility for “correct” access to global memory
Distributed Memory: (0.5 + 1 + 0.5 + 0.5 = 2.5 marks)
Requires a communication network to connect interprocessor memory.
Processors have their own local memory. Changes made by one CPU have no effect on others.
Requires communication to exchange data among processors.
Advantages:
Memory is scalable with the number of CPUs
Each CPU can rapidly access its own memory without overhead incurred with trying to maintain
global cache coherency
Disadvantages:
Programmer is responsible for many of the details associated with data communication between
processors
It is usually difficult to map existing data structures to this memory organization, based on global
memory
Hybrid Distributed Shared Memory:
The largest and fastest computers in the world today employ both shared and distributed memory
architectures.
Advantages and Disadvantages:
4. Demonstrate the failure model associated with Request-Reply Protocols. (CLO-1) 4 marks
I. Timeouts (1 mark)
II. Discarding duplicate request messages (1 mark)
III. Lost reply messages (1 mark)
IV. History (1 mark)
5. Consider distributing a file of F = 15 Gbits to N peers. The server has an upload rate of us = 15 Mbps, and each
peer i has a download rate of di = 1.5 Mbps and an upload rate of ui. Complete the chart giving the minimum
distribution time (in hours)
a) For different values of N for client-server distribution.
20 200
20 600kbps
6. A master computer is coordinating the internal synchronization of five slave computers using the Berkeley algorithm.
At a specific instance, the master polls the slaves 1-5 for their current clock values. Suppose the slaves respond with
the values of 210.3, 212.6, 207.8, 209.5, and 208.2 units respectively. The master finds its own clock value as 211.0
units. Assuming all clocks are correct:
a) Ignoring round-trip time: Apply the Berkeley algorithm to synchronize all clocks in the system.
b) Considering round-trip times: Given the round-trip times for the slaves are 6.5, 5.8, 7.2, 5.3, and 6.8 units respectively,
apply the Berkeley algorithm for internal clock synchronization.
Solve the calculations for both parts (a) and (b), showing the steps involved in adjusting the slave clocks based on the
master's clock value. (CLO-1) 10 marks
7. When comparing the performance of a single-threaded and a multi-threaded file server, the following assumptions
are made. It takes 10ms to get a request, dispatch it and do the rest of the necessary processing involved in serving
the file, assuming the file is cached in main memory. If the file is not cached, a disk operation is needed in which case
an additional 50ms is required, during which the thread sleeps. Assume that for one third of all requests, the file can
be served from the cache. How many requests per second can the single-threaded server handle? (CLO-1) 6 marks
8. Explain how the following code snippet can potentially cause a conflict, mention the conflict and propose a solution
to prevent this problem. (CLO-1) (6 marks)
#include <iostream>
#include <thread>
#include <chrono>
#include <mutex>
using namespace std;
// Note: the deadlock() function (which locks the two mutexes in the given order) is assumed
// to be defined as in the corresponding question.
int main() {
mutex mut1, mut2;
thread t1([&] { deadlock(mut1, mut2); });
thread t2([&] { deadlock(mut2, mut1); });
t1.join();
t2.join();
return 0;
}
9. Demonstrate the concept of RPC and RMI and explain the term "marshalling" in the context of Remote Method
Invocation (RMI) or Remote Procedure Call (RPC), and illustrate its role in data transmission between client and
server. (CLO-1) (5 marks)
RPC allows a program to execute a procedure (or function) in another address space (commonly on another
computer) as if it were a local procedure call, hiding the details of the network communication. (1 mark)
RMI is a Java-specific implementation of RPC that allows an object to invoke methods on an object running in
another JVM (Java Virtual Machine). RMI extends the concept of Java interfaces to invoke methods remotely,
making distributed computing simpler within the Java ecosystem. (1 mark)
Marshalling (also known as serialization) is the process of converting the memory representation of an object
or data structure into a format suitable for storage or transmission over a network. (1 mark)
Role in Data Transmission: (2 marks)
1. Object Serialization: In Java RMI, when an object is passed as a parameter or returned value
in a remote method call, it needs to be serialized (marshalled) into a byte stream before being
transmitted over the network.
2. Network Transmission: The marshalled data, now in the form of a byte stream, is transmitted
from the client to the server (or vice versa).
3. Object Deserialization: On the receiving end, the byte stream is deserialized (unmarshalled)
back into its original object form, allowing the server to process the method call with the
correct parameters.
10. An industrial process simulation is to be executed on a system in the available time, ta, of 10 min which includes
parallel execution time (on 10 processors) of 7.5 min. It has been estimated that the parallel portion of the execution
time would increase by a factor of 12 if it were executed on a hypothetical single-processor system. Calculate the
parallel speedup. (CLO-1) (5 marks)