Apache Spark Architecture
Distributed System Architecture Explained
Apache Spark is an open-source cluster computing
framework that is setting the world of Big Data on fire.
When compared to Hadoop MapReduce, Spark's performance is up to
100 times faster in memory and 10 times faster on disk. In
this article, I will give you a brief insight into Spark
Architecture and the fundamentals that underlie it.
In this Spark Architecture article, I will be covering the
following topics:
Spark & its Features
Spark Architecture Overview
Spark Eco-System
Resilient Distributed Datasets (RDDs)
Working of Spark Architecture
Example using Scala in Spark Shell
Spark & its Features
Apache Spark is an open-source cluster computing
framework for real-time data processing. The main feature
of Apache Spark is its in-memory cluster
computing that increases the processing speed of an
application. Spark provides an interface for programming
entire clusters with implicit data parallelism and fault
tolerance. It is designed to cover a wide range of
workloads such as batch applications, iterative algorithms,
interactive queries, and streaming.
Features of Apache Spark:
Speed: Spark runs up to 100 times faster than Hadoop
MapReduce for large-scale data processing. It achieves
this speed through controlled partitioning.
Powerful Caching
A simple programming layer provides powerful caching
and disk persistence capabilities.
Deployment
It can be deployed through Mesos, Hadoop via
YARN, or Spark’s own cluster manager.
Real-Time
It offers Real-time computation & low latency because
of in-memory computation.
Polyglot
Spark provides high-level APIs in Java, Scala, Python,
and R. Spark code can be written in any of these four
languages. It also provides a shell in Scala and Python.
Spark Architecture Overview
Apache Spark has a well-defined layered architecture
where all the Spark components and layers are loosely
coupled. This architecture is further integrated with
various extensions and libraries. Apache Spark
Architecture is based on two main abstractions:
Resilient Distributed Dataset (RDD)
Directed Acyclic Graph (DAG)
But before diving any deeper into the Spark architecture,
let me explain a few fundamental concepts of Spark, such as the
Spark Eco-System and RDDs. This will help you gain
better insights.
Let me first explain what the Spark Eco-System is.
Spark Eco-System
As you can see from the below image, the Spark ecosystem
is composed of various components like Spark SQL, Spark
Streaming, MLlib, GraphX, and the Core API component.
Spark Core
Spark Core is the base engine for large-scale parallel and
distributed data processing. Further, additional libraries
that are built on top of the core allow diverse
workloads such as streaming, SQL, and machine learning. It is
responsible for memory management and fault recovery,
scheduling, distributing and monitoring jobs on a cluster &
interacting with storage systems.
Spark Streaming
Spark Streaming is the component of Spark which is used
to process real-time streaming data. Thus, it is a useful
addition to the core Spark API. It enables high-throughput
and fault-tolerant stream processing of live data streams.
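To make this concrete, here is a minimal sketch of a streaming word count using the DStream API, assuming a running SparkContext sc and a text source on a local socket (the host and port below are illustrative, not taken from this article):

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Reuse the existing SparkContext and process data in 1-second micro-batches.
val ssc = new StreamingContext(sc, Seconds(1))

// Illustrative source: lines of text arriving on a local socket.
val lines = ssc.socketTextStream("localhost", 9999)

// Word count over each micro-batch of the live stream.
val wordCounts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
wordCounts.print()

ssc.start()             // start receiving and processing data
ssc.awaitTermination()  // block until the streaming job is stopped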
Spark SQL
Spark SQL is a new module in Spark which integrates
relational processing with Spark’s functional programming
API. It supports querying data either via SQL or via the
Hive Query Language. For those of you familiar with
RDBMS, Spark SQL will be an easy transition from your
earlier tools where you can extend the boundaries of
traditional relational data processing.
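As a small hedged sketch, assuming a Spark 2.x shell where the SparkSession is exposed as spark (the table name and rows below are made up for illustration), relational processing looks like this:

// toDF comes from spark.implicits._, which the Spark 2.x shell imports automatically.
val people = Seq(("Alice", 34), ("Bob", 23), ("Carol", 41)).toDF("name", "age")

// Register the DataFrame as a temporary view so it can be queried with SQL.
people.createOrReplaceTempView("people")

// Plain SQL running on top of Spark's engine.
spark.sql("SELECT name FROM people WHERE age > 30").show()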
GraphX
GraphX is the Spark API for graphs and graph-parallel
computation. At a high level, it extends the Spark RDD
abstraction by introducing the Resilient Distributed
Property Graph: a directed multigraph with properties
attached to each vertex and edge.
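A minimal sketch of building such a property graph in the shell (the vertices, edges, and their properties below are made up for illustration):

import org.apache.spark.graphx.{Edge, Graph}

// Vertices are (id, property) pairs; edges carry a property as well.
val vertices = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(3L, 1L, "follows")))

// The Resilient Distributed Property Graph built from the two RDDs.
val graph = Graph(vertices, edges)

// A simple graph-parallel query: how many incoming edges does each vertex have?
graph.inDegrees.collect().foreach(println)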
MLlib (Machine Learning)
MLlib stands for Machine Learning Library. Spark MLlib is
used to perform machine learning in Apache Spark.
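As a small taste, here is a hedged sketch that clusters a handful of points with MLlib's K-means (the data points and parameters are made up for illustration):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Tiny illustrative dataset with two obvious clusters.
val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(1.0, 1.0),
  Vectors.dense(8.0, 9.0), Vectors.dense(9.0, 8.0)
))

// Train K-means with k = 2 clusters and 20 iterations.
val model = KMeans.train(points, 2, 20)
println(model.clusterCenters.mkString(", "))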
SparkR
It is an R package that provides a distributed data frame
implementation. It also supports operations like selection,
filtering, and aggregation, but on large datasets.
As you can see, Spark comes packed with high-level
libraries, including support for R, SQL, Python, Scala, Java,
etc. These standard libraries make it easier to slot Spark
into complex workflows. On top of this, components such as
MLlib, GraphX, SQL + DataFrames, and Streaming can be
combined within the same application to increase its capabilities.
Now, let’s discuss the fundamental Data Structure of
Spark, i.e. RDD.
Resilient Distributed Dataset (RDD)
RDDs are the building blocks of any Spark application.
RDD stands for:
Resilient: Fault-tolerant and capable of rebuilding
data on failure
Distributed: Data is distributed among the multiple
nodes in a cluster
Dataset: A collection of partitioned data with values
An RDD is a layer of abstraction over a distributed
collection of data. It is immutable in nature and its
transformations are evaluated lazily.
Now you might be wondering about its working. Well, the
data in an RDD is split into chunks called partitions. RDDs
are highly resilient: Spark keeps track of the lineage of
transformations used to build each partition, so if an
executor node fails, the lost partitions can be recomputed
on another node and processing continues. This allows
you to perform your functional calculations against your
dataset very quickly by harnessing the power of multiple
nodes.
Moreover, once you create an RDD it
becomes immutable. By immutable I mean an object
whose state cannot be modified after it is created; it can,
however, be transformed into new RDDs.
Talking about the distributed environment, each dataset in
RDD is divided into logical partitions, which may be
computed on different nodes of the cluster. Due to this,
you can perform transformations or actions on the
complete data parallelly. Also, you don’t have to worry
about the distribution, because Spark takes care of that.
There are two ways to create RDDs − parallelizing an
existing collection in your driver program, or by
referencing a dataset in an external storage system, such
as a shared file system, HDFS, HBase, etc.
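Both ways look like this in the Spark shell (the collection contents are illustrative; the HDFS path is the one used in the word count example later in this article):

// 1. Parallelize an existing collection in the driver program.
val numbersRdd = sc.parallelize(Seq(1, 2, 3, 4, 5))

// 2. Reference a dataset in an external storage system, e.g. a file in HDFS.
val linesRdd = sc.textFile("hdfs://localhost:9000/Example/sample.txt")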
With RDDs, you can perform two types of operations:
1. Transformations: Operations that are applied on an RDD
to create a new RDD.
2. Actions: Operations that instruct Apache Spark to apply
the computation and return the result to the driver.
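Here is a tiny hedged sketch of the difference (the numbers are arbitrary): the transformation is only recorded, and nothing is computed until the action runs.

val numbers = sc.parallelize(1 to 10)

// Transformation: returns a new RDD; evaluated lazily, nothing runs yet.
val doubled = numbers.map(_ * 2)

// Action: triggers the computation and returns the result to the driver.
val total = doubled.reduce(_ + _)
println(total)  // 110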
I hope you got a thorough understanding of RDD concepts.
Now let’s move further and see the working of Spark
Architecture.
Working of Spark Architecture
As you have already seen the basic architectural overview
of Apache Spark, now let’s dive deeper into its working.
In your master node, you have the driver program, which
drives your application. The code you are writing behaves
as the driver program, or, if you are using the interactive
shell, the shell acts as the driver program.
Inside the driver program, the first thing you do is
create a Spark Context. Think of the Spark Context as a
gateway to all the Spark functionalities. It is
similar to your database connection. Any command you
execute in your database goes through the database
connection. Likewise, anything you do on Spark goes
through Spark context.
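In the interactive shell, the Spark Context is created for you and exposed as sc. In a standalone application you create it yourself; here is a minimal sketch (the application name and local master URL below are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

// Basic configuration; "local[*]" runs Spark locally using all available cores.
val conf = new SparkConf().setAppName("SparkArchitectureDemo").setMaster("local[*]")

// The Spark Context: the gateway to all Spark functionality,
// much like a database connection is the gateway to a database.
val sc = new SparkContext(conf)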
Now, this Spark context works with the cluster
manager to manage various jobs. The driver program &
Spark context takes care of the job execution within the
cluster. A job is split into multiple tasks, which are
distributed over the worker nodes. Whenever an RDD is
created in the Spark context, it can be distributed across
various nodes and can be cached there.
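For example, you can keep an RDD cached in memory (optionally spilling to disk) so that repeated computations over it are fast; the file path here is the sample file used later in this article:

import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("hdfs://localhost:9000/Example/sample.txt")

// Keep the RDD in memory, spilling to disk if it does not fit.
lines.persist(StorageLevel.MEMORY_AND_DISK)

// The first action computes and caches the data; later actions reuse the cache.
println(lines.count())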
Worker nodes are the slave nodes whose job is to
execute the tasks. The tasks are executed on the
partitioned RDDs in the worker nodes, and the results are
returned to the Spark Context.
The Spark Context takes the job, breaks it into tasks, and
distributes them to the worker nodes. These tasks work on
the partitioned RDDs, perform operations, collect the
results, and return them to the main Spark Context.
If you increase the number of workers, you can divide
jobs into more partitions and execute them in parallel over
multiple systems, which makes the job a lot faster.
With more workers, the total memory available also
increases, so you can cache more data and execute jobs
faster.
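If your data arrived with too few partitions to keep all the workers busy, you can reshuffle it into more partitions; here is a hedged sketch (the target of 8 partitions is arbitrary):

val input = sc.textFile("hdfs://localhost:9000/Example/sample.txt")

// Reshuffle the data into 8 partitions so more tasks can run in parallel.
val repartitioned = input.repartition(8)
println(repartitioned.partitions.length)  // 8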
To know about the workflow of Spark Architecture, you
can have a look at the infographic below:
STEP 1:
The client submits the Spark user application code. When
the application code is submitted, the driver implicitly
converts the user code that contains transformations and
actions into a logical directed acyclic
graph (DAG). At this stage, it also performs
optimizations such as pipelining transformations.
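You can peek at the logical lineage that the driver builds by calling toDebugString on an RDD (a standard RDD method); for a word-count pipeline like the one used later in this article, it prints the chain of RDDs behind the result:

val wordCounts = sc.textFile("hdfs://localhost:9000/Example/sample.txt").flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

// Prints the lineage (the RDDs and transformations the DAG is built from).
println(wordCounts.toDebugString)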
STEP 2:
After that, it converts the logical graph (DAG) into a
physical execution plan with many stages. After converting
it into a physical execution plan, it creates physical execution
units called tasks under each stage. The tasks are then
bundled and sent to the cluster.
STEP 3:
Now the driver talks to the cluster manager and negotiates
the resources. The cluster manager launches executors on the
worker nodes on behalf of the driver. At this point, the
driver sends the tasks to the executors based on data
placement. When the executors start, they register themselves
with the driver, so the driver has a complete view of the
executors that are executing the tasks.
STEP 4:
During the course of execution of the tasks, the driver
program monitors the set of executors that are running. The
driver node also schedules future tasks based on data placement.
This was all about Spark Architecture. Now, let’s get
hands-on with the working of the Spark shell.
Example using Scala in Spark shell
First, let’s start the Spark shell, assuming that the
Hadoop and Spark daemons are up and running. The web
UI port for Spark is localhost:4040.
Once you have started the Spark shell, now let’s see how
to execute a word count example:
1. In this case, I have created a simple text file and stored
it in the HDFS directory. You can also use other, larger
data files.
2. Once the spark shell has started, let’s create an RDD.
For this, you have to specify the input file path and apply
the transformation flatMap(). The code below illustrates
this:
scala> val map = sc.textFile("hdfs://localhost:9000/Example/sample.txt").flatMap(line => line.split(" ")).map(word => (word, 1))
3. On executing this code, an RDD will be created as
shown in the figure.
4. After that, you need to apply the
transformation reduceByKey() to the created RDD.
scala> val counts = map.reduceByKey(_ + _)
Being a transformation, reduceByKey() is also evaluated lazily, as shown below; the actual execution starts only when an action is applied in the next step.
5. The next step is to save the output to a text file,
specifying the path where the output should be stored.
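For instance, using the counts RDD from the previous step, the action saveAsTextFile() writes the result and thereby triggers the execution (the output directory below is a hypothetical path, not one given in this article):

scala> counts.saveAsTextFile("hdfs://localhost:9000/Example/wordcount_output")  // hypothetical output path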
6. After specifying the output path, go to the HDFS web
UI at localhost:50040. Here you can see the output
text in the ‘part’ file as shown below.
7. The figure below shows the output text present in the ‘part’
file.
I hope that you have understood how to create a Spark
Application and arrive at the output.
Now, let me take you through the web UI of Spark to
understand the DAG visualizations and partitions of the
executed task.
On clicking the task that you have submitted, you can
view the Directed Acyclic Graph (DAG) of the completed
job.
Also, you can view the summary metrics of the executed
task, such as the time taken to execute the task, the job ID,
completed stages, the host IP address, etc.
Now, let’s understand about partitions and parallelism in
RDDs.
A partition is a logical chunk of
a large distributed data set.
By default, Spark tries to read data into an RDD from
the nodes that are close to it.
Now, let’s see how to execute a parallel task in the shell.
Below figure shows the total number of partitions on
the created RDD.
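If you prefer to check the partition count from the shell instead of the web UI, you can ask the RDD directly (partitions is a standard field of every RDD):

scala> map.partitions.length  // number of partitions of the RDD created above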
Now, let me show you how parallel execution of 5
different tasks appears.
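A minimal way to reproduce this is to create an RDD with 5 partitions and run an action on it; each partition becomes one task in the stage (the numbers below are arbitrary):

scala> val parallelRdd = sc.parallelize(1 to 100, 5)  // 5 partitions => 5 tasks per stage
scala> parallelRdd.count()                            // action: schedules the 5 tasks on the executors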