
Distributed Systems Simulation and Modeling (Group 19)

The document presents an overview of Distributed Systems Simulation and Modeling, covering key areas such as simulation models, workload generation, performance metrics, and Monte Carlo simulation. It emphasizes the importance of simulation in predicting system performance, optimizing resource allocation, and testing fault tolerance. Additionally, it discusses various tools and techniques used in simulating cloud computing, IoT, and edge computing environments, as well as the significance of profiling, tracing, and failure simulation in enhancing system resilience.

GROUP 19 COE 453 - Distributed Computing

Distributed Systems Simulation and Modeling
Presented by Group 19

Table of Contents
01 Introduction to Distributed Systems Simulation and Modeling
02 Simulation Models for Distributed Systems
03 Distributed System Workload Generation and Modeling
04 Performance Metrics and Evaluation Techniques in Distributed Systems
05 Monte Carlo Simulation for Distributed Systems
06 Distributed System Profiling and Tracing
07 Simulators for Cloud Computing and Data Centers
08 Trace Analysis and Visualization Tools
09 Simulating IoT and Edge Computing Environments
10 Distributed System Failure and Anomaly Simulation
11 Validation and Verification of Simulation Models
12 Conclusion & Future Trends in Simulation

Introduction to Distributed Systems Simulation and Modeling

Distributed Systems Simulation and Modeling is the use of computational methods to analyze, test, and evaluate distributed systems without physically implementing them.

Importance
• Predict system performance under different conditions.
• Assist in resource allocation and fault tolerance testing.
• Reduce deployment risks and costs.

Key Areas Covered
• Simulation models
• Workload generation & performance metrics
• Statistical techniques (Monte Carlo, profiling)
• Trace analysis & visualization
• Fault and anomaly simulation
• Specialized simulators (Cloud, IoT, Edge)
• Validation & future trends

Simulation Models for Distributed Systems

Components
• Entities
• Events
• Simulation Clock
• Simulation Engine

Types of Models
• Discrete Event Simulation
• Agent-Based Simulation
• Network Simulation
• Parallel and Distributed Simulation
Other models include Continuous, Monte Carlo, and Stochastic Simulation Models.

Purpose
• Performance Evaluation
• Fault Tolerance Testing
• Capacity Planning

Software for Simulation Modeling
• NS-3 (Network Simulator 3)
• OMNeT++
• MATLAB/Simulink
• AnyLogic
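
The four components listed for this slide (entities, events, a simulation clock, and a simulation engine) can be sketched as a tiny discrete event simulation. This is a minimal illustration, not how NS-3 or OMNeT++ are implemented; the `Simulator` and `Server` classes, the 2-second service time, and the arrival times are all hypothetical.

```python
import heapq
import itertools


class Simulator:
    """Simulation engine: owns the simulation clock and the pending events."""

    def __init__(self):
        self.now = 0.0                 # simulation clock
        self._queue = []               # heap of (time, sequence, action)
        self._seq = itertools.count()  # tie-breaker for events at equal times

    def schedule(self, delay, action):
        """Register an event that fires `delay` time units from now."""
        heapq.heappush(self._queue, (self.now + delay, next(self._seq), action))

    def run(self):
        while self._queue:
            # Jump the clock straight to the next event instead of ticking.
            self.now, _, action = heapq.heappop(self._queue)
            action()


class Server:
    """Entity: a single server that handles one request at a time."""

    def __init__(self, sim, service_time):
        self.sim = sim
        self.service_time = service_time
        self.free_at = 0.0       # time at which the server becomes idle
        self.completions = []    # (request_id, completion_time)

    def on_arrival(self, request_id):
        start = max(self.sim.now, self.free_at)  # wait if the server is busy
        self.free_at = start + self.service_time
        self.sim.schedule(self.free_at - self.sim.now,
                          lambda: self.on_done(request_id))

    def on_done(self, request_id):
        self.completions.append((request_id, self.sim.now))


sim = Simulator()
server = Server(sim, service_time=2.0)
for i, arrival in enumerate([0.0, 1.0, 5.0]):    # hypothetical arrival times
    sim.schedule(arrival, lambda i=i: server.on_arrival(i))
sim.run()
print(server.completions)  # prints [(0, 2.0), (1, 4.0), (2, 7.0)]
```

Request 1 arrives at time 1.0 while the server is still busy, so it completes at 4.0 rather than 3.0. Queueing effects like this are what a discrete event model is meant to expose.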

Workload Generation in Distributed Systems

Workload Generation refers to creating realistic workloads to mimic user interactions and system usage.

Types
• Interactive
• Batch
• IoT
• Cloud

Purpose
• Performance testing
• Bottleneck detection
• Scalability
• Fault tolerance
• Optimization

Techniques
• Replay-based
• Synthetic
• Benchmarking tools (e.g., JMeter, Locust)

Workload Modeling in Distributed Systems

Workload Modeling refers to mathematical models predicting workload behaviors.

Models Used
• Poisson processes
• Markov chains
• Gaussian distributions

Characteristics
• Request arrival rate
• Service time
• Concurrency
• Resource usage
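
As an illustration of a synthetic workload driven by a Poisson process, the sketch below generates request arrival times and service times. The function name `generate_workload`, the rate of 50 requests per second, and the choice of exponentially distributed service times are assumptions made for the example; a real study would fit these to measured data.

```python
import random


def generate_workload(arrival_rate, mean_service_time, duration, seed=0):
    """Return (arrival_time, service_time) pairs for a synthetic workload.

    Arrivals follow a Poisson process, so the gaps between requests are
    exponentially distributed with mean 1 / arrival_rate.
    Service times are assumed exponential as well (an illustrative choice).
    """
    rng = random.Random(seed)
    t = 0.0
    requests = []
    while True:
        t += rng.expovariate(arrival_rate)      # next inter-arrival gap
        if t >= duration:
            break
        service = rng.expovariate(1.0 / mean_service_time)
        requests.append((t, service))
    return requests


# Hypothetical parameters: 50 requests/s, 10 ms mean service time, 100 s run.
workload = generate_workload(arrival_rate=50.0, mean_service_time=0.01,
                             duration=100.0)
observed_rate = len(workload) / 100.0
print(f"{len(workload)} requests, observed rate {observed_rate:.1f}/s")
```

The observed rate should land close to the configured arrival rate, which is a quick sanity check that the generator matches the intended model.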

Performance Metrics and Evaluation Techniques

Why Evaluate?
- Ensure systems handle real-world conditions.

Key Metrics to Evaluate Performance:
• Throughput: Number of tasks completed per unit time.
• Latency: Time taken to complete a request.
• Scalability: Ability to handle increasing workloads.
• Reliability: How often the system remains operational.
• Energy Efficiency: Power consumption vs. performance.

Techniques
1. Analytical Models:
- Mathematical tools (e.g., queuing theory).
- Early predictions but may lack real-world details.
2. Experimental Testing:
- Real condition deployment (e.g., AWS testing).
- Accurate but costly and time-consuming.
3. Simulation-Based Evaluation:
- Safe, flexible virtual testing.
- Cost-effective but depends on model quality.
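
Queuing theory is named on this slide as an analytical tool. As a hedged example, the sketch below applies the standard M/M/1 formulas (one server, Poisson arrivals, exponential service times) to estimate utilization, latency, and queue occupancy. The function name `mm1_metrics` and the rates used are hypothetical, and real systems often violate the M/M/1 assumptions.

```python
def mm1_metrics(arrival_rate, service_rate):
    """Steady-state metrics of an M/M/1 queue (rates in requests per second)."""
    if arrival_rate >= service_rate:
        raise ValueError("unstable queue: arrival_rate must be below service_rate")
    utilization = arrival_rate / service_rate
    mean_latency = 1.0 / (service_rate - arrival_rate)   # mean time in system
    mean_in_system = utilization / (1.0 - utilization)   # mean requests in system
    return {
        "utilization": utilization,
        "throughput": arrival_rate,       # in steady state, output rate = input rate
        "mean_latency": mean_latency,
        "mean_in_system": mean_in_system,
    }


# Hypothetical server: handles 100 req/s and receives 80 req/s.
metrics = mm1_metrics(arrival_rate=80.0, service_rate=100.0)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these inputs the model gives 80% utilization, a mean latency of 0.05 s, and about 4 requests in the system on average. This is the kind of early prediction an analytical model offers before any simulation or deployment.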

Monte Carlo Simulation for Distributed Systems

Monte Carlo Simulation is a computational method that uses random sampling to estimate results for problems that are too complex to solve exactly.
It is widely used in computer science, networking, AI, and finance.
It works well in distributed computing because:
• The problem can be broken into smaller, independent tasks.
• Each task runs on a separate computer.
• More computers = faster results, and running more samples in the same time gives more accurate estimates.

Why Monte Carlo in Distributed Systems?
Imagine you are a cybersecurity analyst trying to estimate: How long before a hacker breaks into your system?
There are too many factors to calculate exactly:
• Strength of passwords
• Computing power of the hacker's machine
• Number of login attempts per second

Monte Carlo Simulation for Distributed Systems (Cont’d)

Instead of solving it with a formula, we simulate millions of attacks using random values:
• Some attacks succeed early.
• Some take a long time.
We estimate the average time to breach using Monte Carlo!
Distributed computing makes this fast by running many attack simulations on multiple machines in parallel!

Example: Monte Carlo for Password Cracking
Let’s say we want to estimate how long it takes to brute force an 8-character password.
A hacker tries random passwords until they get the correct one.
The time depends on:
• Password complexity (e.g., numbers, letters, symbols)
• How fast the hacker’s computer guesses passwords

Monte Carlo Simulation for Distributed Systems (Cont’d)

Instead of solving a complicated math formula, we run a Monte Carlo simulation:
1. Randomly generate millions of fake hacking attempts.
2. Measure how long it takes to crack the password in each case.
3. Compute the average time across all simulations.

Formula:
Average time to crack = (total time taken across all simulations) / (total number of simulations)

Example Simulation:

| Simulation # | Password Length | Speed (attempts/sec) | Time to Crack |
|---|---|---|---|
| 1 | 8 chars | 1M attempts/sec | 5 hours |
| 2 | 8 chars | 5M attempts/sec | 1 hour |
| 3 | 8 chars | 10M attempts/sec | 30 mins |

Monte Carlo lets us estimate how long real-world attacks take!
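
The three steps and the averaging formula on this slide can be sketched as follows. This is a simplified model under stated assumptions: the password is 8 lowercase letters (so the keyspace is 26^8), the attacker tries each candidate once in random order, and the guessing speed is 1M attempts/sec. The function names are hypothetical, and the resulting figure will not match the illustrative table, which does not state its keyspace.

```python
import random


def simulate_crack_time(keyspace, guesses_per_sec, rng):
    """One simulated attack: seconds until the correct password is tried.

    Assumes each candidate is tried once in random order, so the number of
    attempts needed is uniformly distributed on 1..keyspace.
    """
    attempts = rng.randint(1, keyspace)
    return attempts / guesses_per_sec


def average_crack_time(keyspace, guesses_per_sec, n_simulations, seed=0):
    """Average time to crack = total time across runs / number of runs."""
    rng = random.Random(seed)
    total = sum(simulate_crack_time(keyspace, guesses_per_sec, rng)
                for _ in range(n_simulations))
    return total / n_simulations


KEYSPACE = 26 ** 8          # assumption: 8 lowercase letters
SPEED = 1_000_000           # assumption: 1M attempts/sec

estimate = average_crack_time(KEYSPACE, SPEED, n_simulations=100_000)
exact = (KEYSPACE + 1) / 2 / SPEED   # closed form for this simple model
print(f"Monte Carlo estimate: {estimate / 3600:.2f} hours")
print(f"Closed-form mean:     {exact / 3600:.2f} hours")
```

In this toy model a closed form exists, which makes it easy to check that the estimate converges. Monte Carlo earns its keep once factors such as lockouts or variable password strength make the exact answer impractical to derive.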

Monte Carlo Simulation for Distributed Systems (Cont’d)

4. How Distributed Computing Helps
Instead of running this simulation on one computer, we can:
1. Split the work into multiple independent tasks.
2. Assign each task to a different computer.
3. Have each machine simulate random hacking attempts.
4. Combine the results at the end to estimate the final breach time.

Example
Single machine: runs all 1 million simulated attempts by itself → takes too long.
10 distributed machines: each runs 100,000 of the attempts → up to 10x faster, ignoring coordination overhead.

5. Monte Carlo in Other Computer Applications
Monte Carlo is widely used in distributed computing systems, including:
• Network Security – Simulating cyberattacks to test defense systems.
• Load Balancing – Predicting web server loads by simulating user traffic.
• AI & Machine Learning – Simulating different training scenarios.
• Cloud Computing – Optimizing resource allocation in data centers.

6. Conclusion
Monte Carlo is a powerful tool for solving complex problems using random sampling.
Distributed computing makes it even better by running simulations in parallel!
More computers = faster results, and more samples in the same time give better accuracy.
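
The split, assign, simulate, and combine steps can be sketched on a single machine, with a thread pool standing in for the 10 separate computers. This is an illustration only: CPython threads do not speed up CPU-bound work, and a real deployment would use separate processes or hosts. The task layout, function names, keyspace of 26^8, and guessing speed are assumptions.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def run_task(task):
    """One independent task: simulate n attacks, return (total_time, n)."""
    seed, n, keyspace, guesses_per_sec = task
    rng = random.Random(seed)   # a distinct seed per worker keeps samples independent
    total = sum(rng.randint(1, keyspace) / guesses_per_sec for _ in range(n))
    return total, n


def distributed_estimate(n_workers, n_per_worker, keyspace, guesses_per_sec):
    # Step 1: split the work into independent tasks.
    tasks = [(seed, n_per_worker, keyspace, guesses_per_sec)
             for seed in range(n_workers)]
    # Steps 2 and 3: hand each task to a worker, which runs its own simulations.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(run_task, tasks))
    # Step 4: combine partial sums into one overall average.
    total_time = sum(t for t, _ in partials)
    total_runs = sum(n for _, n in partials)
    return total_time / total_runs


# 10 workers x 100,000 simulated attacks each = 1 million in total.
estimate = distributed_estimate(n_workers=10, n_per_worker=100_000,
                                keyspace=26 ** 8, guesses_per_sec=1_000_000)
print(f"Combined estimate: {estimate / 3600:.2f} hours")
```

Each worker returns a partial sum and a count rather than its own average, so the combined result stays correct even if workers run different numbers of simulations.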

Distributed System Profiling and Tracing

Profiling is monitoring system performance, while tracing records events as they occur.

Why It’s Important
• Helps in debugging performance bottlenecks.
• Optimizes resource allocation and load balancing.

Tools Used
• Perf, eBPF (for Linux systems)
• Google’s Dapper (used for tracing in large-scale distributed systems)

Distributed System Profiling and Tracing (Cont’d)

| Feature | Perf (Linux Performance Tools) | eBPF (Extended Berkeley Packet Filter) | Dapper (Google's Distributed Tracing) |
|---|---|---|---|
| Purpose | CPU & system profiling | Kernel & user-space tracing | Distributed request tracing |
| Scope | Process & thread-level | System-wide observability | End-to-end tracing across services |
| Overhead | Moderate (sampling-based) | Low (event-driven) | Low to moderate |
| Instrumentation | Manual (command-line) | Dynamic (programmable in kernel) | Requires app-level integration |
| Best For | Performance tuning | Security, monitoring, and observability | Microservices tracing |

Trace Analysis and Visualization Tools

Definition:
Distributed tracing is a technique used to track and profile the execution of requests as they travel across multiple services in a distributed architecture.
It provides a detailed view of the path of a request through the various microservices, which in turn allows developers and operations teams to pinpoint performance bottlenecks, latency, and errors across a system.

Types of Distributed Tracing
• Code tracing: inspecting the flow of source code in an application as it performs a specific function.
• Program tracing: developers examine the addresses of instructions and variables called by an active application.
• End-to-end tracing (main focus): developers track data along the service request path.

Trace Analysis and Visualization Tools (Cont’d)

How it works
• When a request is initiated, data is collected and a unique trace ID and an initial (parent) span ID are created.
• A trace is the entire execution path of a request.
• A span is a single unit of work along that path.
• When the request enters a service, a top-level child span is created.
• If multiple operations are performed within the same service, the top-level child span becomes the parent of multiple child spans beneath it.
• The platform encodes each child span with the original trace ID, a unique span ID, duration, error data, and other relevant metadata.
• All spans are then visualized in a flame graph with the parent span on top and child spans beneath.

Importance
• End-to-end visibility
• Performance monitoring
• Error diagnosis
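
The relationship between a trace, its parent span, and nested child spans can be sketched with a small data structure. This is not the data model of any particular tracing tool; the `Span` class, the `child_of` and `render` helpers, and the service names and durations are hypothetical.

```python
import uuid
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    """A single unit of work within a trace."""
    trace_id: str                      # shared by every span in the trace
    name: str
    duration_ms: float
    parent_id: Optional[str] = None    # None marks the root (parent) span
    error: Optional[str] = None
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])


def child_of(parent: Span, name: str, duration_ms: float) -> Span:
    """Create a child span that inherits the parent's trace ID."""
    return Span(trace_id=parent.trace_id, name=name,
                duration_ms=duration_ms, parent_id=parent.span_id)


def render(spans: List[Span], parent_id: Optional[str] = None,
           depth: int = 0) -> List[str]:
    """Text stand-in for a flame graph: parents on top, children indented."""
    lines = []
    for span in spans:
        if span.parent_id == parent_id:
            lines.append(f"{'  ' * depth}{span.name} ({span.duration_ms} ms)")
            lines.extend(render(spans, span.span_id, depth + 1))
    return lines


root = Span(trace_id=uuid.uuid4().hex, name="GET /checkout", duration_ms=120.0)
service = child_of(root, "payment-service", 80.0)
db = child_of(service, "db.query", 30.0)
cache = child_of(service, "cache.lookup", 5.0)

print("\n".join(render([root, service, db, cache])))
```

Because every span carries the same trace ID and a pointer to its parent, the spans can be collected independently by each service and reassembled into one tree later.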

Trace Analysis and Visualization Tools (Cont’d)

Tools
Distributed tracing tools support three phases of request tracing:
• Instrumentation: modifying code so requests can be recorded as they pass through your stack.
• Data collection: collecting span data for each request.
• Analysis and visualization: encoding and tagging the spans for analysis and displaying them as flame graphs.

Examples of Tools
• OpenTelemetry: an industry-standard open-source platform for data instrumentation and collection. Offers vendor-neutral auto-instrumentation libraries and APIs that allow you to trace the end-to-end pathways and duration of requests.
• Jaeger: an open-source tool with a UI that visualizes distributed traces. It relies on sampling, so some problems are likely to be missed.
• Datadog: offers complete Application Performance Monitoring and distributed tracing for organizations operating at any scale.

Simulators for Cloud Computing and Data Centers

Definition:
Cloud simulators create a virtual environment that replicates cloud computing and data center operations. They allow researchers and engineers to experiment with different configurations, workloads, and resource management strategies without the high costs and risks associated with real-world testing.

Why Use Cloud Simulators?
• Cost-Effective Testing
• Time Efficiency
• Resource Optimization
• Risk Mitigation

Key Features of Cloud Simulators
• Resource Allocation Modeling
• Workload and Traffic Simulation
• Cost Prediction
• Energy Consumption Analysis
• Scalability Testing

How Cloud Simulators Work Together
• Scenario Testing
• Comparative Analysis
• Optimization

Simulators for Cloud Computing and Data Centers (Cont’d)

Popular Cloud Simulators

| Simulator | Purpose | Key Features | Common Use Cases |
|---|---|---|---|
| CloudSim | Simulates cloud resource allocation and VM migrations | Models resource provisioning, scheduling, and migration | Academic research, performance evaluation |
| iCanCloud | Predicts cloud service costs for different configurations | Cost modeling, large-scale cloud infrastructure simulation | Cost estimation, infrastructure planning |
| GreenCloud | Focuses on energy-efficient cloud computing | Energy consumption modeling, network simulation | Analyzing energy efficiency, designing green data centers |

Simulating IoT and Edge Computing Environments

Definition:
Simulation tools help model IoT networks and edge computing scenarios.

Challenges in IoT Simulation
• Large-scale device connectivity
• Real-time data processing constraints

Simulation Tools for IoT & Edge Computing:
• IoTSim: Simulates IoT applications on cloud platforms.
• EdgeCloudSim: Models edge computing architectures.
• NS-3 (Network Simulator 3): Used for simulating IoT network traffic.

Simulating IoT and Edge Computing Environments (Cont’d)

Popular Simulation Tools for IoT & Edge Computing

| Tool | Purpose | Key Features | Common Use Cases |
|---|---|---|---|
| IoTSim | Simulates IoT applications on cloud platforms | Models IoT data processing in cloud environments | Research on IoT-cloud integration |
| EdgeCloudSim | Models edge computing architectures | Simulates edge and cloud resource distribution | Testing edge offloading strategies |
| NS-3 | Simulates IoT network traffic | Models wireless networks and IoT communications | Studying IoT protocol performance and scalability |

Distributed System Failure and Anomaly Simulation

Definition:
This involves intentionally causing failures in a distributed system to test how it responds. The goal is to make the system more robust and resilient to real-world disruptions.

Example: Imagine an online banking system where sudden server crashes could cause customers to lose access to their accounts. If the bank's IT team tests failure scenarios in advance, they can ensure the system automatically switches to backup servers without affecting users.

Why simulate failures?
1. Enhancing fault tolerance, e.g., a social media platform like Facebook tests its ability to keep running even when one of its data centers goes down.
2. Preparing for real-world disruptions, e.g., e-commerce websites like Amazon simulate Black Friday traffic spikes to ensure their servers don’t crash during high demand.
3. Optimizing system recovery, e.g., cloud storage services like Google Drive can test data recovery by simulating a disk failure and measuring how fast lost files are restored.
4. Improving load balancing, e.g., a video streaming service like Netflix tests its servers to handle millions of users watching movies at the same time.

Distributed System Failure and Anomaly Simulation (Cont’d)

Types of Failures Simulated

1. Network partitions
This occurs when different parts of a distributed system lose communication. Common causes are software bugs (e.g., misconfigured network settings) and hardware failures (e.g., a router or switch not working).
Example: If WhatsApp's servers in different countries stop communicating, messages may not get delivered immediately.

2. Hardware crashes
This models failures such as server crashes, disk failures, and memory corruption.
Example: If a bank’s database server crashes, its backup system should automatically take over.

3. Load spikes and system overload
This simulates high traffic conditions and heavy resource consumption. Common causes are high user demand, DDoS attacks, and inefficient resource allocation.
Example: Before launching a new product, an online store may test what happens if 100,000 customers try to buy the same item at once.
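
The hardware crash scenario, where a backup should take over when the primary server fails, can be explored with a small fault-injection model before touching a real system. The sketch below is a simplification under loud assumptions: each request independently finds a server down with a fixed probability, the backup fails independently of the primary, and the 5% crash probability is invented for the example. The function name `simulate_failover` is hypothetical.

```python
import random


def simulate_failover(n_requests, crash_prob, has_backup, seed=0):
    """Return the fraction of requests that are served.

    Each request finds the primary down with probability crash_prob.
    With a backup, the request fails only if the backup is down too
    (assumed independent, with the same crash probability).
    """
    rng = random.Random(seed)
    served = 0
    for _ in range(n_requests):
        primary_up = rng.random() >= crash_prob
        backup_up = has_backup and rng.random() >= crash_prob
        if primary_up or backup_up:
            served += 1
    return served / n_requests


CRASH_PROB = 0.05   # assumption: 5% chance a server is down for a given request

without_backup = simulate_failover(100_000, CRASH_PROB, has_backup=False)
with_backup = simulate_failover(100_000, CRASH_PROB, has_backup=True)
print(f"Availability without backup: {without_backup:.4f}")
print(f"Availability with backup:    {with_backup:.4f}")
```

Under these assumptions availability should be near 1 - p without a backup and near 1 - p² with one. Correlated failures, such as a shared power supply, would break the independence assumption and are a common reason real systems fall short of such estimates.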

Distributed System Failure and Anomaly Simulation (Cont’d)

Summary Table

| Type of Failure | What Happens? | Real-Life Example | How Systems Handle It |
|---|---|---|---|
| Network Partition | Some parts of the system can't communicate | WhatsApp message delays | Replication, retries, partition-tolerant databases |
| Hardware Crash | A server, disk, or memory fails | Bank ATM server crash | Backups, failover mechanisms, redundant power |
| Load Spike | System overload due to high traffic | Amazon Black Friday outage | Auto-scaling, load balancing, caching |

Failure Simulation Tools (Cont’d)

| Tool | Purpose | Key Features | Common Use Cases |
|---|---|---|---|
| Chaos Monkey | Randomly shuts down services | Microservices & cloud resilience | Shuts down transaction services to test failover mechanisms |
| Gremlin | Simulates network failures (latency, dropped connections) | Network stability testing | Delays customer transactions to check retry mechanisms |
| AWS FIS | Simulates AWS outages & resource failures | AWS cloud applications | Tests if the system recovers from an AWS S3 storage failure |
| LitmusChaos | Simulates container & Kubernetes failures | Kubernetes-based apps | Terminates transaction-processing containers to test auto-scaling |

Distributed System Failure and Anomaly Simulation (Cont’d)

Example Use Case of Tools

Let’s consider an online banking system that handles transactions, account management, and fraud detection. The bank’s infrastructure is cloud-based and runs on AWS and Kubernetes with multiple microservices.
To test its fault tolerance, the bank can combine Chaos Monkey, Gremlin, AWS FIS, and LitmusChaos in the following ways:
Step 1: Test Microservices Resilience with Chaos Monkey
Step 2: Simulate Network Failures Using Gremlin
Step 3: Test Cloud Service Failures with AWS FIS
Step 4: Test Kubernetes-Based Services with LitmusChaos

Final Outcome: Banking System Resilience Testing

| Failure Scenario | Simulation Tool Used | Expected System Response |
|---|---|---|
| Microservices failure | Chaos Monkey | Backup services take over |
| Network delays | Gremlin | Requests retry automatically |
| Cloud service outage | AWS FIS | Failover to another AWS region |
| Kubernetes pod failure | LitmusChaos | Auto-restart failed containers |

By combining these tools, the bank ensures that its system can handle failures from multiple angles, from random shutdowns to cloud outages, keeping transactions secure and minimizing downtime.

Distributed System Failure and Anomaly Simulation (Cont’d)

Examples of Failure Simulation Tools

| Tool | Purpose | Key Features | Common Use Cases |
|---|---|---|---|
| Chaos Monkey | Randomly shuts down services to test resilience | Injects controlled failures in production | Testing system robustness and auto-recovery |
| FAILURES.cloud | Simulates cloud service failures | Models outages in cloud-based infrastructures | Evaluating cloud application reliability |

Validation and Verification of Simulation Models

Definitions
• Verification means checking if the simulation model is built correctly according to its design and specifications.
• Validation means checking if the simulation model accurately represents the real-world system it is supposed to simulate.

Why is Validation Important?
• Ensures simulation results match real-world behavior
• Prevents incorrect conclusions from flawed models

Validation Methods
• Compare simulation output with real system logs
• Perform statistical accuracy testing
• Use benchmarks from existing distributed systems

Tools Used:
• Model checkers (e.g., UPPAAL, SPIN)
• Empirical comparison with real-world data

Validation and Verification of Simulation Models (Cont’d)

How Validation and Verification Work Together
• Verification ensures the model is correctly built ("Did we build the model right?").
• Validation ensures the model reflects real-world behavior ("Did we build the right model?").

Example Use Case
A cloud resource allocation simulator can be:
• Verified by checking if it correctly implements scheduling algorithms.
• Validated by running it against AWS cloud service logs and ensuring predicted resource usage matches real-world observations.
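
One simple way to check whether predicted resource usage matches real-world observations is to compute an error metric between the two series. The sketch below uses mean absolute percentage error (MAPE). The CPU figures and the 10% acceptance threshold are invented for illustration; an acceptable threshold depends on what the simulator is used for.

```python
def mean_absolute_percentage_error(predicted, observed):
    """MAPE in percent between simulator output and logged measurements."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("series must be non-empty and of equal length")
    if any(o == 0 for o in observed):
        raise ValueError("MAPE is undefined when an observed value is zero")
    errors = [abs(p - o) / abs(o) for p, o in zip(predicted, observed)]
    return 100.0 * sum(errors) / len(errors)


def is_validated(predicted, observed, tolerance_pct=10.0):
    """Accept the model if its average error is within the tolerance."""
    return mean_absolute_percentage_error(predicted, observed) <= tolerance_pct


# Hypothetical data: CPU utilization (%) per interval.
simulated_cpu = [42.0, 55.0, 61.0, 70.0]   # simulator predictions
logged_cpu = [40.0, 50.0, 60.0, 75.0]      # values taken from real logs

mape = mean_absolute_percentage_error(simulated_cpu, logged_cpu)
print(f"MAPE: {mape:.2f}%")
print("Model accepted" if is_validated(simulated_cpu, logged_cpu) else "Model rejected")
```

A single aggregate number can hide systematic bias, so it is worth also inspecting where the largest individual errors occur.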

Validation and Verification of Simulation Models (Cont’d)

| Tool/Method | Purpose | Key Features | Common Use Cases |
|---|---|---|---|
| Model Checkers (UPPAAL, SPIN) | Verifies correctness of system models | Formal verification, detects logical errors | Verifying distributed algorithms and protocols |
| Empirical Comparison with Real-World Data | Validates simulation accuracy by comparing with actual logs | Uses historical data for validation | Ensuring simulations reflect real-world performance |

Conclusion & Future Trends in Simulation

Summary of Key Takeaways
• Simulation helps optimize, test, and analyze distributed systems before deployment.
• Various models and tools exist for cloud computing, IoT, failure handling, and performance optimization.

Future Trends:
• AI-driven simulations for predictive analytics
• Improved real-time monitoring and visualization tools
• Integration of digital twins for real-world system testing

Thank You