Distributed computing frameworks
Distributed computing frameworks are software
systems that manage and coordinate the processing
of large datasets across multiple computers or
"nodes" to handle complex computational problems.
Examples: Apache Hadoop, Apache Spark.
These frameworks provide tools for tasks such as data
storage, processing, task scheduling, and fault
tolerance, allowing developers to scale applications
more easily.
They are fundamental to big data processing, cloud
computing, and parallel processing because they abstract
away the complexities of managing a distributed system.
Distributed computing frameworks - Working
Divide and conquer: The framework divides large
tasks and data into smaller pieces that are
processed in parallel across multiple machines.
Coordination: It handles the complex coordination,
such as scheduling jobs, communicating between
nodes, and managing resource allocation.
Fault tolerance: Frameworks are designed to be
fault-tolerant, meaning they can continue to operate
even if some nodes fail.
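As a toy illustration of the divide-and-conquer idea above, the following Python sketch splits a dataset into chunks and processes them in parallel with the standard multiprocessing module on one machine; a real framework performs the same splitting across many nodes.

    from multiprocessing import Pool

    def process_chunk(chunk):
        # The work done on one piece of the data (here, a simple sum).
        return sum(chunk)

    if __name__ == "__main__":
        data = list(range(1_000_000))
        # Divide: split the data into smaller pieces.
        chunks = [data[i:i + 100_000] for i in range(0, len(data), 100_000)]
        # Conquer: process the pieces in parallel.
        with Pool(processes=4) as pool:
            partial_results = pool.map(process_chunk, chunks)
        # Combine: aggregate the partial results.
        print(sum(partial_results))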
Distributed computing frameworks
Apache Hadoop: A foundational framework that provides a
distributed file system (HDFS) for storing data and a processing
model (MapReduce) for parallel processing.
Apache Spark: A faster, in-memory processing framework that
works on top of Hadoop or independently. It is used for a wide
range of big data tasks, including batch processing, real-time
streaming, and machine learning.
Apache Flink: A framework designed for high-performance,
stream-based data processing, which also handles batch
processing.
Dask: A flexible library for parallel computing in Python, enabling
users to scale their Python workflows from single machines to
large distributed clusters. It provides parallel collections such as
Dask Arrays, Dask DataFrames, and Dask Bags (see the sketch after this list).
Ray: An open-source framework for scaling AI and machine
learning workloads, such as hyperparameter tuning and model serving,
by making it easier to write distributed Python applications.
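As a minimal sketch of the Dask item above (assuming the dask package is installed), the following snippet builds a chunked Dask Array; each chunk can be computed in parallel, on one machine or across a distributed cluster.

    import dask.array as da

    # A 10000 x 10000 array split into 1000 x 1000 chunks (partitions).
    x = da.random.random((10000, 10000), chunks=(1000, 1000))

    # Operations build a task graph; compute() runs the chunks in parallel.
    column_means = x.mean(axis=0).compute()
    print(column_means.shape)   # (10000,)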
Advantages
Scalability: Easily handle growing datasets and
computational demands by adding more nodes to
the cluster.
Performance: Dramatically speeds up processing
by using multiple machines to work on a problem
simultaneously.
Fault tolerance: Increases reliability by
distributing data and computation, so the failure of
a single node doesn't stop the entire process.
Apache Hadoop
Apache Hadoop is an open-source software
framework designed for the distributed storage and
processing of very large datasets across clusters of
computers.
It enables applications to work with petabytes of
data and scale from a single server to thousands of
machines, each offering local computation and
storage.
Components of Apache Hadoop
The main components of Hadoop are:
Hadoop Common - provides utility functions
Hadoop Distributed File System (HDFS) - handles
distributed storage across a cluster
MapReduce - processes data in parallel
Hadoop YARN (Yet Another Resource
Negotiator) - manages cluster resources and job
scheduling
Components of Apache Hadoop
•Hadoop Common:
A set of shared utilities and libraries that provides
the foundational services for all other Hadoop
components.
•Hadoop Distributed File System (HDFS):
The storage layer of Hadoop, designed to store very
large datasets across many commodity servers. It is
a distributed file system that stores data in blocks
across multiple machines.
Components of Apache Hadoop
•MapReduce:
A programming model for processing large
datasets in parallel.
It divides a job into two main steps: a map step
that processes data and a reduce step that
aggregates the results.
•Hadoop YARN (Yet Another Resource Negotiator):
The resource management and job scheduling
component of Hadoop.
It is responsible for allocating cluster resources to
various applications and managing the lifecycle of
applications running on the cluster.
Hadoop Distributed File System (HDFS)
HDFS is Hadoop's primary storage system,
designed to store data reliably across a cluster of
machines.
HDFS breaks large files into smaller blocks and
distributes them across multiple nodes, ensuring
fault tolerance through data replication.
Hadoop Distributed File System (HDFS)
HDFS is designed to be highly scalable, reliable, and efficient, enabling
the storage and processing of massive datasets. Its architecture
consists of several key components:
NameNode: the master server that manages the filesystem
namespace and controls access to files by clients.
DataNode: responsible for storing and retrieving actual data
blocks as instructed by the NameNode.
Secondary NameNode: a helper to the primary NameNode,
responsible for merging the EditLog with the current
filesystem image (FsImage) to reduce the potential load on the
NameNode.
HDFS Client: the interface through which users and applications
interact with HDFS. It allows for file creation, deletion,
reading, and writing operations.
Block Structure: HDFS stores files by dividing them into large
blocks, typically 128 MB or 256 MB in size (see the sketch below).
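The following back-of-the-envelope sketch shows how the block structure and replication determine how a file is laid out on the cluster; the file size, block size, and replication factor are assumptions chosen for illustration.

    import math

    block_size_mb = 128        # typical HDFS block size
    replication_factor = 3     # default HDFS replication
    file_size_mb = 1000        # a hypothetical ~1 GB file

    # The file is split into fixed-size blocks...
    num_blocks = math.ceil(file_size_mb / block_size_mb)            # 8 blocks
    # ...and each block is replicated across multiple DataNodes.
    total_block_replicas = num_blocks * replication_factor          # 24 replicas
    physical_storage_mb = file_size_mb * replication_factor         # ~3000 MB on disk

    print(num_blocks, total_block_replicas, physical_storage_mb)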
Hadoop Distributed File System (HDFS):
HDFS Architecture and Components
The master node, the NameNode, is responsible for
accepting jobs from clients. Its task is to ensure
that the data required for the operation is loaded and
segregated into chunks of data blocks.
HDFS exposes a file system namespace and allows
user data to be stored in files. A file is split into one or
more blocks, which are stored and replicated on the slave nodes,
known as DataNodes, as described in the sections below.
The Secondary NameNode server maintains the edit
log and namespace image information in sync with
the NameNode server.
fsimage
An fsimage in HDFS is a file containing a point-in-
time snapshot of the entire file system namespace,
including the directory structure, file metadata, and
the mapping of files to data blocks.
The NameNode loads this file on startup and merges
it with the EditLog (a log of recent transactions) to
reconstruct the most up-to-date file system state in
memory.
Fsimage
Namespace: The complete directory tree of the
cluster.
Metadata: Details about each file, such as its
owner, permissions, and size.
Block mapping: Information on which data blocks
belong to which files.
How it works with the EditLog
Initialization: When the NameNode starts, it loads the fsimage and
then applies the changes from the EditLog to create the current, in-
memory representation of the file system.
Persistent state: The fsimage is a persistent, on-disk copy of the
file system state, while the EditLog records the changes that have
occurred since the last fsimage was created.
Checkpointing: The fsimage and EditLog are periodically merged to
create a new fsimage. This process is often handled by a Secondary
NameNode or the Standby NameNode to prevent the EditLog from
becoming too large and to reduce the NameNode startup time.
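The checkpointing idea described above can be illustrated with a conceptual Python sketch (not the real HDFS implementation): the current namespace is reconstructed by replaying the EditLog on top of the last fsimage snapshot. The paths and metadata below are hypothetical.

    # Last checkpoint: a point-in-time snapshot of the namespace.
    fsimage = {"/data/file1.txt": {"blocks": ["blk_1"], "owner": "hdfs"}}

    # Transactions recorded since the last checkpoint.
    edit_log = [
        ("create", "/data/file2.txt", {"blocks": ["blk_2"], "owner": "alice"}),
        ("delete", "/data/file1.txt", None),
    ]

    def apply_edits(image, edits):
        # Replay the logged transactions on top of the snapshot.
        state = dict(image)
        for op, path, meta in edits:
            if op == "create":
                state[path] = meta
            elif op == "delete":
                state.pop(path, None)
        return state

    # The merged result is what a new fsimage would contain after checkpointing.
    current_namespace = apply_edits(fsimage, edit_log)
    print(current_namespace)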
Why it is important
Recovery: The fsimage is crucial for recovering the
file system state after a NameNode restart or
failure.
Data consistency: By merging the EditLog with
the fsimage, the system ensures that all recent
changes are incorporated into the persistent state.
fsimage - Working
Block Replication Architecture
HDFS Advantages
Scalability
HDFS is highly scalable, allowing for the storage and
processing of petabytes of data across thousands of
machines. It is designed to handle an increasing
number of nodes and storage without significant
performance degradation.
Key Aspects:
Linear scalability allows the addition of new nodes
without reconfiguring the entire system.
Supports horizontal scaling by adding more
DataNodes.
HDFS Advantages
Fault Tolerance
HDFS ensures high availability and fault tolerance
through data replication. Each block of data is
replicated across multiple DataNodes, ensuring that
data remains accessible even if some nodes fail.
Key Features:
Automatic block replication ensures data redundancy.
Configurable replication factor allows administrators
to balance storage efficiency and fault tolerance.
HDFS Advantages
High Throughput
HDFS is optimized for high-throughput access to
large datasets, making it suitable for data-intensive
applications. It allows for parallel processing of data
across multiple nodes, significantly speeding up
data read and write operations.
Key Features:
Supports large data transfers and batch processing.
Optimized for sequential data access, reducing seek
times and increasing throughput.
HDFS Advantages
Cost-Effective
HDFS is designed to run on commodity hardware,
significantly reducing the cost of setting up and
maintaining a large-scale storage infrastructure. Its
open-source nature further reduces the total cost of
ownership.
Key Features:
Utilizes inexpensive hardware, reducing capital
expenditure.
Open-source software eliminates licensing costs.
HDFS Advantages
Data Locality
HDFS takes advantage of data locality by moving
computation closer to where the data is stored. This
minimizes data transfer over the network, reducing
latency and improving overall system performance.
Key Features:
Data-aware scheduling ensures that tasks are
assigned to nodes where the data resides.
Reduces network congestion and improves processing
speed.
HDFS Advantages
Reliability and Robustness
HDFS is built to handle hardware failures gracefully.
The NameNode and DataNodes are designed to
recover from failures without losing data, and the
system continually monitors the health of nodes to
prevent data loss.
Key Features:
Automatic detection and recovery from node
failures.
Regular health checks and data integrity
verification.
MapReduce
MapReduce is a programming model for processing huge volumes of
data. Hadoop is capable of running MapReduce programs written in
various languages: Java, Ruby, Python, and C++. MapReduce
programs are parallel in nature and are therefore very useful for
performing large-scale data analysis using multiple machines in the cluster.
MapReduce programs work in two phases:
Map phase
Reduce phase.
The input to each phase is key-value pairs. In addition, the
programmer needs to specify two functions: a map
function and a reduce function.
How MapReduce Works
The whole process goes through four phases of
execution, namely splitting, mapping, shuffling, and
reducing.
Suppose you have the following input data for your
MapReduce program:
Welcome to Hadoop Class
Hadoop is good
Hadoop is bad
The final output of the MapReduce task is
bad 1
Class 1
good 1
Hadoop 3
is 2
to 1
Welcome 1
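The word count above can be reproduced with a short plain-Python sketch that mirrors the map, shuffle, and reduce phases; this is an illustration of the programming model, not Hadoop code.

    from collections import defaultdict

    lines = ["Welcome to Hadoop Class", "Hadoop is good", "Hadoop is bad"]

    # Map phase: emit a (word, 1) pair for every word in every input split.
    mapped = [(word, 1) for line in lines for word in line.split()]

    # Shuffle phase: group the intermediate pairs by key (the word).
    groups = defaultdict(list)
    for word, count in mapped:
        groups[word].append(count)

    # Reduce phase: aggregate the counts for each word.
    result = {word: sum(counts) for word, counts in groups.items()}

    for word in sorted(result, key=str.lower):
        print(word, result[word])   # bad 1, Class 1, good 1, Hadoop 3, is 2, to 1, Welcome 1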
YARN Architecture
The three important elements of the YARN architecture are:
Resource Manager
Application Master
Node Managers
YARN Architecture
Resource Manager
The Resource Manager, or RM, which is usually one per
cluster, is the master server.
The Resource Manager knows the location of the DataNodes and
how many resources they have.
This information is referred to as Rack Awareness.
The RM runs several services, the most important of which
is the Resource Scheduler that decides how to assign the
resources.
Application Master
The Application Master is a framework-specific
process that negotiates resources for a single
application, that is, a single job or a directed acyclic
graph of jobs, which runs in the first container
allocated for the purpose.
Each Application Master requests resources from
the Resource Manager and then works with
containers provided by Node Managers.
Node Managers
There can be many Node Managers in one cluster. They
are the slaves of the infrastructure. When a Node Manager
starts, it announces itself to the RM and periodically sends a
heartbeat to it.
Each Node Manager offers resources to the cluster.
The resource capacity is the amount of memory and the
number of vcores, short for virtual cores. At run time,
the Resource Scheduler decides how to use this capacity.
A container is a fraction of the NodeManager capacity,
and it is used by the client to run a program.
Each Node Manager takes instructions from the
ResourceManager and reports and handles containers on
a single node.
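As a rough illustration of how containers relate to NodeManager capacity (all numbers below are assumptions, not defaults mandated by YARN), the scheduler can only place as many containers on a node as both its memory and its vcore limits allow.

    node_memory_mb = 16384       # memory advertised by one NodeManager
    node_vcores = 8              # vcores advertised by the same NodeManager

    container_memory_mb = 2048   # memory requested per container
    container_vcores = 1         # vcores requested per container

    # A container is a fraction of the NodeManager's capacity, so both limits apply.
    max_by_memory = node_memory_mb // container_memory_mb   # 8
    max_by_vcores = node_vcores // container_vcores         # 8
    max_containers = min(max_by_memory, max_by_vcores)

    print(max_containers)   # 8 containers fit on this node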
SPARK
Apache Spark is an open-source, unified analytics engine
designed for large-scale data processing and analytics.
It provides a comprehensive framework for handling big data
workloads, including batch processing, interactive queries,
real-time analytics, machine learning, and graph processing.
Spark optimizes performance by utilizing in-memory caching
and efficient query execution, making it significantly faster
than disk-based processing systems like Hadoop MapReduce
for many workloads.
Spark - Introduction
Apache Hadoop, which is also open source, is a data
processing engine for big data sets.
Like Hadoop, Spark splits up large tasks across different
nodes.
However, Spark tends to perform faster than Hadoop because it uses
random access memory (RAM) to cache and process data
instead of a file system.
This enables Spark to handle use cases that Hadoop cannot.
Benefits of the Spark framework
A unified engine that supports SQL queries, streaming
data, machine learning (ML) and graph processing
Can be 100x faster than Hadoop for smaller
workloads via in-memory processing, efficient
disk data storage, and related optimizations
APIs designed for ease of use when manipulating semi-
structured data and transforming data
The Spark ecosystem
Apache Spark, the largest open-source project in
data processing, is the only processing framework
that combines data and artificial intelligence (AI).
This enables users to perform large-scale data
transformations and analyses, and then run state-
of-the-art machine learning (ML) and AI algorithms.
The Spark ecosystem
Spark integrates various higher-level libraries and functionalities
within a single engine, enabling seamless transitions between
different types of data processing tasks.
These libraries include:
Spark SQL: For structured data processing using SQL queries.
Spark Streaming: For real-time processing of streaming data.
MLlib: A machine learning library for building and deploying
machine learning models.
GraphX: For graph processing and computation.
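A minimal PySpark sketch of the Spark SQL library listed above, assuming the pyspark package is installed and a local Spark session can be started; the view and column names are made up for the example.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("spark-sql-example").getOrCreate()

    # A tiny DataFrame registered as a temporary SQL view.
    df = spark.createDataFrame([("Hadoop", 3), ("Spark", 5)], ["name", "score"])
    df.createOrReplaceTempView("tools")

    # Structured data processing with a SQL query.
    spark.sql("SELECT name, score FROM tools WHERE score > 3").show()

    spark.stop()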
Spark - Architecture
Apache Spark utilizes a master-slave architecture
for distributed data processing, primarily composed
of the following key components:
Driver Program
Cluster Manager
Worker Nodes
Executors
Resilient Distributed Datasets (RDDs)
Directed Acyclic Graph (DAG) Scheduler
Spark Components - Driver Program
Acts as the master node of the Spark application
Contains the SparkContext, which is the entry point for
all Spark functionality and establishes a connection to
the Spark cluster.
Converts the user's Spark application code into a Directed
Acyclic Graph (DAG) of RDD transformations and actions.
Communicates with the cluster manager to request
resources and schedule tasks on worker nodes.
Monitors the execution of tasks and recovers from failures.
Spark Components - Cluster Manager
Manages the resources across the cluster, allocating
them to Spark applications.
Examples include YARN, Mesos, Kubernetes, and
Spark's standalone cluster manager.
Receives resource requests from the driver and
allocates worker nodes for execution.
Spark Components - Worker Nodes
Slave nodes in the Spark cluster where actual data
processing occurs.
Each worker node runs one or more Executors.
Spark Components - Executors
Processes that run on worker nodes and are
responsible for executing tasks assigned by the
driver.
Perform computations on data partitions and store
intermediate results in memory or on disk.
Report their status and results back to the driver
program.
Resilient Distributed Datasets (RDDs)
Fundamental data structure in Spark, representing an
immutable, fault-tolerant, distributed collection of
elements that can be operated on in parallel.
RDDs are divided into partitions, which are processed
independently by executors.
RDD - Resilient Distributed Dataset
Immutability
Once an RDD is created, it cannot be altered.
Any operation you perform, like a map or filter, does not change
the original RDD but instead returns a new RDD.
This makes RDDs predictable and easier to manage, as their state
is fixed at any given point.
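A small PySpark sketch of this immutability, assuming a local SparkContext can be created: each transformation returns a new RDD and leaves the original untouched.

    from pyspark import SparkContext

    sc = SparkContext("local", "rdd-immutability")

    numbers = sc.parallelize([1, 2, 3, 4, 5])      # the original RDD
    evens = numbers.filter(lambda x: x % 2 == 0)   # a new RDD; `numbers` is unchanged
    doubled = evens.map(lambda x: x * 2)           # another new RDD

    print(numbers.collect())   # [1, 2, 3, 4, 5]
    print(doubled.collect())   # [4, 8]

    sc.stop()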
RDD - Resilient Distributed Dataset
Fault tolerance
Instead of replicating the data itself, RDDs store a log of the
transformations used to create them, called the lineage.
If a partition of the data on a worker node is lost due to a crash,
Spark uses the lineage to recompute that specific partition from
its parent RDDs.
This allows Spark to recover automatically from failures without
needing to manually intervene or rely on redundant data storage
for every piece of data.
This mechanism makes processing large datasets in a distributed
environment robust and reliable.
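A hedged PySpark sketch of inspecting lineage (assuming a local SparkContext): toDebugString() shows the chain of transformations Spark would replay to rebuild a lost partition.

    from pyspark import SparkContext

    sc = SparkContext("local", "rdd-lineage")

    words = sc.parallelize(["hadoop", "spark", "hadoop"])
    pairs = words.map(lambda w: (w, 1))
    counts = pairs.reduceByKey(lambda a, b: a + b)

    # The lineage records parallelize -> map -> reduceByKey; a lost partition of
    # `counts` can be recomputed by replaying these steps on its parent RDDs.
    lineage = counts.toDebugString()
    print(lineage.decode() if isinstance(lineage, bytes) else lineage)

    sc.stop()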
Directed Acyclic Graph (DAG) Scheduler
The DAG (Directed Acyclic Graph) is a core
component of Spark's execution engine, representing
the logical execution plan of a Spark job.
Optimizes the execution plan of Spark
applications by creating a DAG of RDD
transformations and actions.
Identifies dependencies between operations and
stages, allowing for efficient task scheduling and
fault recovery.
Directed Acyclic Graph (DAG) Scheduler
Directed:
The operations in the graph have a specific order of execution,
represented by edges pointing from one operation to the next.
Acyclic:
There are no loops or cycles in the execution plan, meaning an
operation cannot lead back to a previously executed operation
within the same job.
Graph:
It's a collection of nodes (vertices) and edges, where vertices
represent RDDs (Resilient Distributed Datasets) or data
transformations, and edges represent the operations applied to
these RDDs.
Spark - Workflow
•A user submits a Spark application, which is executed by the Driver
Program.
•The Driver Program initializes the SparkContext and requests resources
from the Cluster Manager.
•The Cluster Manager allocates Worker Nodes and launches Executors on
them.
•The Driver Program converts the application logic into a DAG of RDD
operations and distributes tasks to the Executors.
•Executors process the data partitions and report their progress and
results back to the Driver.
•The Driver monitors the execution and handles fault tolerance, ensuring
the application completes successfully.
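The workflow above can be traced in a small PySpark application; this is a sketch assuming local mode as the cluster manager, with made-up data rather than a real input path.

    from pyspark.sql import SparkSession

    # Driver Program: creating the session initializes the SparkContext and
    # asks the cluster manager (here, local mode) for resources.
    spark = SparkSession.builder.master("local[2]").appName("workflow-demo").getOrCreate()
    sc = spark.sparkContext

    # The driver turns these transformations into a DAG of RDD operations.
    lines = sc.parallelize(["Welcome to Hadoop Class", "Hadoop is good", "Hadoop is bad"])
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))

    # The action triggers execution: tasks run on executors, which process
    # the data partitions and return their results to the driver.
    print(counts.collect())

    spark.stop()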