Distributed computing frameworks
Distributed computing frameworks are software
systems that manage and coordinate the processing
of large datasets across multiple computers or
"nodes" to handle complex computational problems.
Examples: Apache Hadoop, Apache Spark.
These frameworks provide tools for tasks such as data
storage, processing, task scheduling, and fault
tolerance, allowing developers to scale applications
more easily.
They are fundamental to big data processing, cloud
computing, and parallel processing because they abstract
away the complexities of managing a distributed system.
Distributed computing frameworks - Working
Divide and conquer: The framework divides large
tasks and data into smaller pieces that are
processed in parallel across multiple machines.
Coordination: It handles the complex coordination,
such as scheduling jobs, communicating between
nodes, and managing resource allocation.
Fault tolerance: Frameworks are designed to be
fault-tolerant, meaning they can continue to operate
even if some nodes fail.
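As a toy illustration of the divide-and-conquer idea above, the following Python sketch splits a dataset into chunks and processes them in parallel with the standard multiprocessing module on one machine; a real framework performs the same splitting across many nodes.

    from multiprocessing import Pool

    def process_chunk(chunk):
        # The work done on one piece of the data (here, a simple sum).
        return sum(chunk)

    if __name__ == "__main__":
        data = list(range(1_000_000))
        # Divide: split the data into smaller pieces.
        chunks = [data[i:i + 100_000] for i in range(0, len(data), 100_000)]
        # Conquer: process the pieces in parallel.
        with Pool(processes=4) as pool:
            partial_results = pool.map(process_chunk, chunks)
        # Combine: aggregate the partial results.
        print(sum(partial_results))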
Distributed computing frameworks
Apache Hadoop: A foundational framework that provides a
distributed file system (HDFS) for storing data and a processing
model (MapReduce) for parallel processing.
Apache Spark: A faster, in-memory processing framework that
works on top of Hadoop or independently. It is used for a wide
range of big data tasks, including batch processing, real-time
streaming, and machine learning.
Apache Flink: A framework designed for high-performance,
stream-based data processing, which also handles batch
processing.
Dask: A flexible library for parallel computing in Python, enabling
users to scale their Python workflows from single machines to
large distributed clusters. It provides parallel collections such as
Dask Arrays, Dask DataFrames, and Dask Bags (see the sketch after this list).
Ray: An open-source framework for scaling AI and machine
learning workloads, such as hyperparameter tuning and model serving,
by making it easier to write distributed Python applications.
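As a minimal sketch of the Dask item above (assuming the dask package is installed), the following snippet builds a chunked Dask Array; each chunk can be computed in parallel, on one machine or across a distributed cluster.

    import dask.array as da

    # A 10000 x 10000 array split into 1000 x 1000 chunks (partitions).
    x = da.random.random((10000, 10000), chunks=(1000, 1000))

    # Operations build a task graph; compute() runs the chunks in parallel.
    column_means = x.mean(axis=0).compute()
    print(column_means.shape)   # (10000,)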
Advantages
Scalability: Easily handle growing datasets and
computational demands by adding more nodes to
the cluster.
Performance: Dramatically speeds up processing
by using multiple machines to work on a problem
simultaneously.
Fault tolerance: Increases reliability by
distributing data and computation, so the failure of
a single node doesn't stop the entire process.
Apache Hadoop
Apache Hadoop is an open-source software
framework designed for the distributed storage and
processing of very large datasets across clusters of
computers.
It enables applications to work with petabytes of
data and scale from a single server to thousands of
machines, each offering local computation and
storage.
Components of Apache Hadoop
The main components of Hadoop are:
Hadoop Common - provides utility functions
Hadoop Distributed File System (HDFS) - handles
distributed storage across a cluster
MapReduce - processes data in parallel
Hadoop YARN (Yet Another Resource
Negotiator) - manages cluster resources and job
scheduling
Components of Apache Hadoop
•Hadoop Common:
A set of shared utilities and libraries that provides
the foundational services for all other Hadoop
components.
•Hadoop Distributed File System (HDFS):
The storage layer of Hadoop, designed to store very
large datasets across many commodity servers. It is
a distributed file system that stores data in blocks
across multiple machines.
Components of Apache Hadoop
•MapReduce:
A programming model for processing large
datasets in parallel.
It divides a job into two main steps: a map step
that processes data and a reduce step that
aggregates the results.
•Hadoop YARN (Yet Another Resource Negotiator):
The resource management and job scheduling
component of Hadoop.
It is responsible for allocating cluster resources to
various applications and managing the lifecycle of
applications running on the cluster.
Hadoop Distributed File System (HDFS)
HDFS is Hadoop's primary storage system,
designed to store data reliably across a cluster of
machines.
HDFS breaks large files into smaller blocks and
distributes them across multiple nodes, ensuring
fault tolerance through data replication.
Hadoop Distributed File System (HDFS)
HDFS is designed to be highly scalable, reliable, and efficient, enabling
the storage and processing of massive datasets. Its architecture
consists of several key components:
NameNode: the master server that manages the filesystem
namespace and controls access to files by clients.
DataNode: responsible for storing and retrieving actual data
blocks as instructed by the NameNode.
Secondary NameNode: a helper to the primary NameNode,
responsible for merging the EditLog with the current
filesystem image (FsImage) to reduce the potential load on the
NameNode.
HDFS Client: the interface through which users and applications
interact with HDFS. It allows for file creation, deletion,
reading, and writing operations.
Block Structure: HDFS stores files by dividing them into large
blocks, typically 128 MB or 256 MB in size (see the sketch below).
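The following back-of-the-envelope sketch shows how the block structure and replication determine how a file is laid out on the cluster; the file size, block size, and replication factor are assumptions chosen for illustration.

    import math

    block_size_mb = 128        # typical HDFS block size
    replication_factor = 3     # default HDFS replication
    file_size_mb = 1000        # a hypothetical ~1 GB file

    # The file is split into fixed-size blocks...
    num_blocks = math.ceil(file_size_mb / block_size_mb)            # 8 blocks
    # ...and each block is replicated across multiple DataNodes.
    total_block_replicas = num_blocks * replication_factor          # 24 replicas
    physical_storage_mb = file_size_mb * replication_factor         # ~3000 MB on disk

    print(num_blocks, total_block_replicas, physical_storage_mb)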
Hadoop Distributed File System (HDFS):
HDFS Architecture and Components
The master node, the NameNode, is responsible for
accepting jobs from clients. Its task is to ensure
that the data required for the operation is loaded and
segregated into chunks of data blocks.
HDFS exposes a file system namespace and allows
user data to be stored in files. A file is split into one or
more blocks, which are stored and replicated on the slave nodes,
known as DataNodes, as described in the sections below.
The Secondary NameNode server maintains the edit
log and namespace image information in sync with
the NameNode server.
fsimage
An fsimage in HDFS is a file containing a point-in-
time snapshot of the entire file system namespace,
including the directory structure, file metadata, and
the mapping of files to data blocks.
The NameNode loads this file on startup and merges
it with the EditLog (a log of recent transactions) to
reconstruct the most up-to-date file system state in
memory.
Fsimage
Namespace: The complete directory tree of the
cluster.
Metadata: Details about each file, such as its
owner, permissions, and size.
Block mapping: Information on which data blocks
belong to which files.
How it works with the EditLog
Initialization: When the NameNode starts, it loads the fsimage and
then applies the changes from the EditLog to create the current, in-
memory representation of the file system.
Persistent state: The fsimage is a persistent, on-disk copy of the
file system state, while the EditLog records the changes that have
occurred since the last fsimage was created.
Checkpointing: The fsimage and EditLog are periodically merged to
create a new fsimage. This process is often handled by a Secondary
NameNode or the Standby NameNode to prevent the EditLog from
becoming too large and to reduce the NameNode startup time.
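The checkpointing idea described above can be illustrated with a conceptual Python sketch (not the real HDFS implementation): the current namespace is reconstructed by replaying the EditLog on top of the last fsimage snapshot. The paths and metadata below are hypothetical.

    # Last checkpoint: a point-in-time snapshot of the namespace.
    fsimage = {"/data/file1.txt": {"blocks": ["blk_1"], "owner": "hdfs"}}

    # Transactions recorded since the last checkpoint.
    edit_log = [
        ("create", "/data/file2.txt", {"blocks": ["blk_2"], "owner": "alice"}),
        ("delete", "/data/file1.txt", None),
    ]

    def apply_edits(image, edits):
        # Replay the logged transactions on top of the snapshot.
        state = dict(image)
        for op, path, meta in edits:
            if op == "create":
                state[path] = meta
            elif op == "delete":
                state.pop(path, None)
        return state

    # The merged result is what a new fsimage would contain after checkpointing.
    current_namespace = apply_edits(fsimage, edit_log)
    print(current_namespace)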
Why it is important
Recovery: The fsimage is crucial for recovering the
file system state after a NameNode restart or
failure.
Data consistency: By merging the EditLog with
the fsimage, the system ensures that all recent
changes are incorporated into the persistent state.
fsimage - Working
Block Replication Architecture
HDFS Advantages
Scalability
HDFS is highly scalable, allowing for the storage and
processing of petabytes of data across thousands of
machines. It is designed to handle an increasing
number of nodes and storage without significant
performance degradation.
Key Aspects:
Linear scalability allows the addition of new nodes
without reconfiguring the entire system.
Supports horizontal scaling by adding more
DataNodes.
HDFS Advantages
Fault Tolerance
HDFS ensures high availability and fault tolerance
through data replication. Each block of data is
replicated across multiple DataNodes, ensuring that
data remains accessible even if some nodes fail.
Key Features:
Automatic block replication ensures data redundancy.
Configurable replication factor allows administrators
to balance storage efficiency and fault tolerance.
HDFS Advantages
High Throughput
HDFS is optimized for high-throughput access to
large datasets, making it suitable for data-intensive
applications. It allows for parallel processing of data
across multiple nodes, significantly speeding up
data read and write operations.
Key Features:
Supports large data transfers and batch processing.
Optimized for sequential data access, reducing seek
times and increasing throughput.
HDFS Advantages
Cost-Effective
HDFS is designed to run on commodity hardware,
significantly reducing the cost of setting up and
maintaining a large-scale storage infrastructure. Its
open-source nature further reduces the total cost of
ownership.
Key Features:
Utilizes inexpensive hardware, reducing capital
expenditure.
Open-source software eliminates licensing costs.
HDFS Advantages
Data Locality
HDFS takes advantage of data locality by moving
computation closer to where the data is stored. This
minimizes data transfer over the network, reducing
latency and improving overall system performance.
Key Features:
Data-aware scheduling ensures that tasks are
assigned to nodes where the data resides.
Reduces network congestion and improves processing
speed.
HDFS Advantages
Reliability and Robustness
HDFS is built to handle hardware failures gracefully.
The NameNode and DataNodes are designed to
recover from failures without losing data, and the
system continually monitors the health of nodes to
prevent data loss.
Key Features:
Automatic detection and recovery from node
failures.
Regular health checks and data integrity
verification.
MapReduce
MapReduce is a programming model for processing huge volumes of
data. Hadoop is capable of running MapReduce programs written in
various languages: Java, Ruby, Python, and C++. MapReduce
programs are parallel in nature and are therefore very useful for
performing large-scale data analysis using multiple machines in the cluster.
MapReduce programs work in two phases:
Map phase
Reduce phase.
The input to each phase is key-value pairs. In addition, the
programmer needs to specify two functions: a map
function and a reduce function.
How MapReduce Works
The whole process goes through four phases of
execution, namely splitting, mapping, shuffling, and
reducing.
Suppose you have the following input data for your
MapReduce program:
Welcome to Hadoop Class
Hadoop is good
Hadoop is bad
The final output of the MapReduce task is
bad 1
Class 1
good 1
Hadoop 3
is 2
to 1
Welcome 1
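The word count above can be reproduced with a short plain-Python sketch that mirrors the map, shuffle, and reduce phases; this is an illustration of the programming model, not Hadoop code.

    from collections import defaultdict

    lines = ["Welcome to Hadoop Class", "Hadoop is good", "Hadoop is bad"]

    # Map phase: emit a (word, 1) pair for every word in every input split.
    mapped = [(word, 1) for line in lines for word in line.split()]

    # Shuffle phase: group the intermediate pairs by key (the word).
    groups = defaultdict(list)
    for word, count in mapped:
        groups[word].append(count)

    # Reduce phase: aggregate the counts for each word.
    result = {word: sum(counts) for word, counts in groups.items()}

    for word in sorted(result, key=str.lower):
        print(word, result[word])   # bad 1, Class 1, good 1, Hadoop 3, is 2, to 1, Welcome 1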
YARN Architecture
The three important elements of the YARN architecture are:
Resource Manager
Application Master
Node Managers
YARN Architecture
Resource Manager
The Resource Manager, or RM, which is usually one per
cluster, is the master server.
The Resource Manager knows the location of the DataNodes and
how many resources they have.
This information is referred to as Rack Awareness.
The RM runs several services, the most important of which
is the Resource Scheduler that decides how to assign the
resources.
Application Master
The Application Master is a framework-specific
process that negotiates resources for a single
application, that is, a single job or a directed acyclic
graph of jobs, which runs in the first container
allocated for the purpose.
Each Application Master requests resources from
the Resource Manager and then works with
containers provided by Node Managers.
Node Managers
There can be many Node Managers in one cluster. They
are the slaves of the infrastructure. When a Node Manager
starts, it announces itself to the RM and periodically sends a
heartbeat to it.
Each Node Manager offers resources to the cluster.
The resource capacity is the amount of memory and the
number of vcores, short for virtual cores. At run time,
the Resource Scheduler decides how to use this capacity.
A container is a fraction of the NodeManager capacity,
and it is used by the client to run a program.
Each Node Manager takes instructions from the
ResourceManager and reports and handles containers on
a single node.
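As a rough illustration of how containers relate to NodeManager capacity (all numbers below are assumptions, not defaults mandated by YARN), the scheduler can only place as many containers on a node as both its memory and its vcore limits allow.

    node_memory_mb = 16384       # memory advertised by one NodeManager
    node_vcores = 8              # vcores advertised by the same NodeManager

    container_memory_mb = 2048   # memory requested per container
    container_vcores = 1         # vcores requested per container

    # A container is a fraction of the NodeManager's capacity, so both limits apply.
    max_by_memory = node_memory_mb // container_memory_mb   # 8
    max_by_vcores = node_vcores // container_vcores         # 8
    max_containers = min(max_by_memory, max_by_vcores)

    print(max_containers)   # 8 containers fit on this node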
SPARK
Apache Spark is an open-source, unified analytics engine
designed for large-scale data processing and analytics.
It provides a comprehensive framework for handling big data
workloads, including batch processing, interactive queries,
real-time analytics, machine learning, and graph processing.
Spark optimizes performance by utilizing in-memory caching
and efficient query execution, making it significantly faster
than disk-based processing systems like Hadoop MapReduce
for many workloads.
Spark - Introduction
Apache Hadoop, which is also open source, is a data
processing engine for big data sets.
Like Hadoop, Spark splits up large tasks across different
nodes.
However, Spark tends to perform faster than Hadoop because it uses
random access memory (RAM) to cache and process data
instead of a file system.
This enables Spark to handle use cases that Hadoop cannot.
Benefits of the Spark framework
A unified engine that supports SQL queries, streaming
data, machine learning (ML) and graph processing
Can be 100x faster than Hadoop for smaller
workloads via in-memory processing, efficient
disk data storage, and related optimizations
APIs designed for ease of use when manipulating semi-
structured data and transforming data
The Spark ecosystem
Apache Spark, the largest open-source project in
data processing, is the only processing framework
that combines data and artificial intelligence (AI).
This enables users to perform large-scale data
transformations and analyses, and then run state-
of-the-art machine learning (ML) and AI algorithms.
The Spark ecosystem
Spark integrates various higher-level libraries and functionalities
within a single engine, enabling seamless transitions between
different types of data processing tasks.
These libraries include:
Spark SQL: For structured data processing using SQL queries.
Spark Streaming: For real-time processing of streaming data.
MLlib: A machine learning library for building and deploying
machine learning models.
GraphX: For graph processing and computation.
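A minimal PySpark sketch of the Spark SQL library listed above, assuming the pyspark package is installed and a local Spark session can be started; the view and column names are made up for the example.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("spark-sql-example").getOrCreate()

    # A tiny DataFrame registered as a temporary SQL view.
    df = spark.createDataFrame([("Hadoop", 3), ("Spark", 5)], ["name", "score"])
    df.createOrReplaceTempView("tools")

    # Structured data processing with a SQL query.
    spark.sql("SELECT name, score FROM tools WHERE score > 3").show()

    spark.stop()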
Spark - Architecture
Apache Spark utilizes a master-slave architecture
for distributed data processing, primarily composed
of the following key components:
Driver Program
Cluster Manager
Worker Nodes
Executors
Resilient Distributed Datasets (RDDs)
Directed Acyclic Graph (DAG) Scheduler
Spark Components - Driver Program
Acts as the master node of the Spark application
Contains the SparkContext, which is the entry point for
all Spark functionality and establishes a connection to
the Spark cluster.
Converts the user's Spark application code into a Directed
Acyclic Graph (DAG) of RDD transformations and actions.
Communicates with the cluster manager to request
resources and schedule tasks on worker nodes.
Monitors the execution of tasks and recovers from failures.
Spark Components - Cluster Manager
Manages the resources across the cluster, allocating
them to Spark applications.
Examples include YARN, Mesos, Kubernetes, and
Spark's standalone cluster manager.
Receives resource requests from the driver and
allocates worker nodes for execution.
Spark Components - Worker Nodes
Slave nodes in the Spark cluster where actual data
processing occurs.
Each worker node runs one or more Executors.
Spark Components - Executors
Processes that run on worker nodes and are
responsible for executing tasks assigned by the
driver.
Perform computations on data partitions and store
intermediate results in memory or on disk.
Report their status and results back to the driver
program.
Resilient Distributed Datasets (RDDs)
Fundamental data structure in Spark, representing an
immutable, fault-tolerant, distributed collection of
elements that can be operated on in parallel.
RDDs are divided into partitions, which are processed
independently by executors.
RDD - Resilient Distributed Dataset
Immutability
Once an RDD is created, it cannot be altered.
Any operation you perform, like a map or filter, does not change
the original RDD but instead returns a new RDD.
This makes RDDs predictable and easier to manage, as their state
is fixed at any given point.
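A small PySpark sketch of this immutability, assuming a local SparkContext can be created: each transformation returns a new RDD and leaves the original untouched.

    from pyspark import SparkContext

    sc = SparkContext("local", "rdd-immutability")

    numbers = sc.parallelize([1, 2, 3, 4, 5])      # the original RDD
    evens = numbers.filter(lambda x: x % 2 == 0)   # a new RDD; `numbers` is unchanged
    doubled = evens.map(lambda x: x * 2)           # another new RDD

    print(numbers.collect())   # [1, 2, 3, 4, 5]
    print(doubled.collect())   # [4, 8]

    sc.stop()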
RDD - Resilient Distributed Dataset
Fault tolerance
Instead of replicating the data itself, RDDs store a log of the
transformations used to create them, called the lineage.
If a partition of the data on a worker node is lost due to a crash,
Spark uses the lineage to recompute that specific partition from
its parent RDDs.
This allows Spark to recover automatically from failures without
needing to manually intervene or rely on redundant data storage
for every piece of data.
This mechanism makes processing large datasets in a distributed
environment robust and reliable.
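A hedged PySpark sketch of inspecting lineage (assuming a local SparkContext): toDebugString() shows the chain of transformations Spark would replay to rebuild a lost partition.

    from pyspark import SparkContext

    sc = SparkContext("local", "rdd-lineage")

    words = sc.parallelize(["hadoop", "spark", "hadoop"])
    pairs = words.map(lambda w: (w, 1))
    counts = pairs.reduceByKey(lambda a, b: a + b)

    # The lineage records parallelize -> map -> reduceByKey; a lost partition of
    # `counts` can be recomputed by replaying these steps on its parent RDDs.
    lineage = counts.toDebugString()
    print(lineage.decode() if isinstance(lineage, bytes) else lineage)

    sc.stop()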
Directed Acyclic Graph (DAG) Scheduler
The DAG (Directed Acyclic Graph) is a core
component of Spark's execution engine, representing
the logical execution plan of a Spark job.
Optimizes the execution plan of Spark
applications by creating a DAG of RDD
transformations and actions.
Identifies dependencies between operations and
stages, allowing for efficient task scheduling and
fault recovery.
Directed Acyclic Graph (DAG) Scheduler
Directed:
The operations in the graph have a specific order of execution,
represented by edges pointing from one operation to the next.
Acyclic:
There are no loops or cycles in the execution plan, meaning an
operation cannot lead back to a previously executed operation
within the same job.
Graph:
It's a collection of nodes (vertices) and edges, where vertices
represent RDDs (Resilient Distributed Datasets) or data
transformations, and edges represent the operations applied to
these RDDs.
Spark - Workflow
•A user submits a Spark application, which is executed by the Driver
Program.
•The Driver Program initializes the SparkContext and requests resources
from the Cluster Manager.
•The Cluster Manager allocates Worker Nodes and launches Executors on
them.
•The Driver Program converts the application logic into a DAG of RDD
operations and distributes tasks to the Executors.
•Executors process the data partitions and report their progress and
results back to the Driver.
•The Driver monitors the execution and handles fault tolerance, ensuring
the application completes successfully.
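The workflow above can be traced in a small PySpark application; this is a sketch assuming local mode as the cluster manager, with made-up data rather than a real input path.

    from pyspark.sql import SparkSession

    # Driver Program: creating the session initializes the SparkContext and
    # asks the cluster manager (here, local mode) for resources.
    spark = SparkSession.builder.master("local[2]").appName("workflow-demo").getOrCreate()
    sc = spark.sparkContext

    # The driver turns these transformations into a DAG of RDD operations.
    lines = sc.parallelize(["Welcome to Hadoop Class", "Hadoop is good", "Hadoop is bad"])
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))

    # The action triggers execution: tasks run on executors, which process
    # the data partitions and return their results to the driver.
    print(counts.collect())

    spark.stop()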