Complete Playlist: Learning Journal
https://www.youtube.com/playlist?list=PLkz1SCf5iB4dXiPdFD4hXwheRGRwhmd6K
Notes By Learning Journal:
https://www.learningjournal.guru/courses/spark/spark-foundation-training/
History & Evolution of Distributed System
https://www.youtube.com/watch?v=TIU5x884Aro
Data in most applications is growing exponentially (for example, the Google search engine). Storing this data on a single machine became complicated; previously this was handled by increasing the size of a single machine's hardware.
Google faced the following problems due to the massive volume of data:
1. Data Collection and Ingestion
2. Data Storage and Management
3. Data Processing and Transformation
4. Data Access and retrieval
Google then published its solutions:
1. Google File System (2003): a distributed file system
2. MapReduce (2004): a distributed processing model
The open-source community then implemented these ideas as Hadoop. Hadoop offered a revolutionary solution to this problem, in two parts:
1. HDFS – Hadoop Distributed File System: offering distributed storage.
2. Hadoop MapReduce: offering a distributed computing engine.
Apache Spark
1. Apache Spark later came out of UC Berkeley. The community started looking at Spark as a compelling alternative or a replacement for Hadoop's MapReduce. With time, Apache Spark has become the de facto standard for big data computing. We can describe Apache Spark as follows.
2. Apache Spark is a fast and general purpose engine for large-scale data processing.
Under the hood, it works on a cluster of computers.
3. If you come from a Hadoop MapReduce background, Spark is 10 to 100 times faster than Hadoop's MapReduce. If you know nothing about Hadoop, that's okay; Hadoop is not a prerequisite for learning Spark. However, you can think of Spark as a successor of Hadoop.
Concept of Data Lake:
https://www.youtube.com/watch?v=B6RDjs7D-qY
1. Ingest:- Data Collection and Ingestion
2. Store:- Data Storage and Management
3. Process:- Data Processing and Transformation
4. Consume:- Data Access and retrieval
Spark Vs Hadoop
Performance:
Hadoop – Slower performance; uses disks for storage and depends on disk read and write speed.
Spark – Fast in-memory performance with reduced disk reading and writing operations.

Cost:
Hadoop – An open-source platform, less expensive to run. Uses affordable consumer hardware. Easier to find trained Hadoop professionals.
Spark – An open-source platform, but relies on memory for computation, which considerably increases running costs.

Data Processing:
Hadoop – Best for batch processing. Uses MapReduce to split a large dataset across a cluster for parallel analysis.
Spark – Suitable for iterative and live-stream data analysis. Works with RDDs and DAGs to run operations.

Fault Tolerance:
Hadoop – A highly fault-tolerant system. Replicates the data across the nodes and uses the replicas in case of an issue.
Spark – Tracks the RDD block creation process and can rebuild a dataset when a partition fails. Spark can also use a DAG to rebuild data across nodes.

Scalability:
Hadoop – Easily scalable by adding nodes and disks for storage. Supports tens of thousands of nodes without a known limit.
Spark – A bit more challenging to scale because it relies on RAM for computations. Supports thousands of nodes in a cluster.

Security:
Hadoop – Extremely secure. Supports LDAP, ACLs, Kerberos, SLAs, etc.
Spark – Not secure; by default, security is turned off. Relies on integration with Hadoop to achieve the necessary security level.

Ease of Use and Language Support:
Hadoop – More difficult to use, with fewer supported languages. Uses Java or Python for MapReduce apps.
Spark – More user friendly. Allows interactive shell mode. APIs can be written in Java, Scala, R, Python, and Spark SQL.

Machine Learning:
Hadoop – Slower than Spark. Data fragments can be too large and create bottlenecks. Mahout is the main library.
Spark – Much faster with in-memory processing. Uses MLlib for computations.

Scheduling and Resource Management:
Hadoop – Uses external solutions. YARN is the most common option for resource management. Oozie is available for workflow scheduling.
Spark – Has built-in tools for resource allocation, scheduling, and monitoring.
What is Apache Spark?
https://www.youtube.com/watch?v=Hciruu3Gb3E
Apache Spark
Apache Spark was first introduced in 2009 in the UC Berkeley R&D Lab, now known as AMPLab. In 2010 it became open source under a BSD license. Spark was donated to the Apache Software Foundation in 2013, and in 2014 it became a top-level Apache project. Apache Spark is a fast and general-purpose engine for large-scale data processing. Under the hood, it works on a cluster of computers.
Let's start with a diagram that represents the Apache Spark Ecosystem.
Fig. Apache Spark Ecosystem
Based on the figure shown above, we can break the Apache Spark ecosystem into three layers.
1. Storage and Cluster Manager
2. Spark Core
3. Libraries and DSL
Storage and Cluster Manager
Apache Spark is a distributed processing engine. However, it doesn't come with an inbuilt
cluster resource manager and a distributed storage system. There is a good reason behind that
design decision. Apache Spark tried to decouple the functionality of a cluster resource
manager, distributed storage and a distributed computing engine from the beginning. This
design allows us to use Apache Spark with any compatible cluster manager and storage
solution. Hence, the storage and the cluster manager are part of the ecosystem; however, they are not part of Apache Spark itself. You can plug in a cluster manager and a storage system of your
choice. There are multiple alternatives. You can use Apache YARN, Mesos, and even
Kubernetes as a cluster manager for Apache Spark. Similarly, for the storage system, you can
use HDFS, Amazon S3, Google Cloud storage, Cassandra File system and many others.
Spark Core
Apache Spark core contains two main components.
1. Spark Compute engine
2. Spark Core APIs
The earlier discussion makes one thing clear: Apache Spark does not offer cluster management or storage management services. However, it has a compute engine as part of
the Spark Core. The compute engine provides some basic functionalities like memory
management, task scheduling, fault recovery and most importantly interacting with the
cluster manager and storage system. So, it is the Spark compute engine that executes and
manages our Spark jobs and provides a seamless experience to the end user. You just submit
your Job to Spark, and the Spark core takes care of everything else.
The second part of Spark Core is the set of core APIs. Spark Core offers two types of APIs.
1. Structured API
2. Unstructured API
The Structured APIs consist of DataFrames and Datasets. They are designed and optimized to work with structured data. The Unstructured APIs are the lower-level APIs, including RDDs, Accumulators, and Broadcast variables. These core APIs are available in Scala, Python, Java, and R.
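As a rough illustration of the difference, here is a minimal PySpark sketch, assuming a SparkSession named spark is already available; the data and column names are made up for the example. The first snippet uses the structured DataFrame API, and the second computes the same average with the lower-level RDD API.

# Structured API: a DataFrame with named columns, optimized by Spark's query planner
df = spark.createDataFrame([("alice", 34), ("bob", 41)], ["name", "age"])
df.groupBy().avg("age").show()

# Unstructured API: the same average computed on a raw RDD of tuples
rdd = spark.sparkContext.parallelize([("alice", 34), ("bob", 41)])
total, count = rdd.map(lambda x: (x[1], 1)).reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]))
print(total / count)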
Libraries and DSL
Outside the Spark Core, we have four different sets of libraries and packages.
1. Spark SQL - It allows you to use SQL queries for structured data processing.
2. Spark Streaming - It helps you to consume and process continuous data streams.
3. MLlib - It is a machine learning library that delivers high-quality algorithms.
4. GraphX - It comes with a library of typical graph algorithms.
These are nothing but a set of packages and libraries. They offer you APIs, DSLs, and
algorithms in multiple languages. They directly depend on Spark Core APIs to achieve
distributed processing.
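For example, here is a minimal sketch of the Spark SQL library in PySpark; it assumes a SparkSession named spark, and the view name and data are illustrative only.

# Spark SQL: register a DataFrame as a temporary view and query it with plain SQL
df = spark.createDataFrame([("alice", 34), ("bob", 41)], ["name", "age"])
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 35").show()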
Why is Spark so popular?
At very high level, there are three main reasons for its popularity and rapid adoption.
1. It abstracts away the fact that you are coding to execute on a cluster of computers. In
the best case scenario, you will be working with tables like in any other database and
using SQL queries. In the worst case scenario, you will be working with collections.
You will feel like working with a local Scala or a Python collection. Everything else,
all the complexity of the distributed storage, computation, and parallel programming
is abstracted away by the Spark Core.
2. Spark is a unified platform that combines the capabilities for batch processing,
structured data handling with SQL like language, near real-time stream processing,
graph processing, and machine learning, all in a single framework, using your favorite programming language. You can mix and match them to solve many
sophisticated requirements.
3. Ease of use. If you compare it with MapReduce code, Spark code is much shorter, simpler, and easier to read and understand. The growing ecosystem of libraries offers ready-to-use algorithms and tools. The Spark community is continuously
working towards making it more straightforward with every new release.
Now we understand the Spark Ecosystem. Continue reading to uncover the internals.
Spark Architecture:
https://www.youtube.com/watch?v=vJ0eUZxF80s
https://www.youtube.com/watch?v=fyTiJLKEzME
Spark Execution Model
Spark is a distributed processing engine, and it follows the master-slave architecture. So, for
every application, Spark will create one master process and multiple slave processes. In
Spark terminology, the master is the driver, and the slaves are the executors. Let's try to
understand it with a simple example.
Suppose you are using the spark-submit utility. You execute an application A1 using spark-
submit, and Spark will create one driver process and some executor processes for A1. This
entire set is exclusive for the application A1.
Now, you submit another application A2, and Spark will create one more driver process and
some executor processes for A2. So, for every application, Spark creates one driver and a bunch
of executors.
Spark Driver
The driver is the master. It is responsible for analyzing, distributing, scheduling and
monitoring work across the executors. The driver is also responsible for maintaining all the
necessary information during the lifetime of the application.
Spark Executors (Worker node)
Spark executors are only responsible for executing the code assigned to them by the driver
and reporting the status back to the driver. The Spark driver will assign a part of the data and
a set of code to executors. The executor is responsible for executing the assigned code on the
given data. They keep the output with them and report the status back to the driver.
Spark Execution Modes
Now we know that every Spark application has a set of executors and one dedicated driver.
The next question is - Who executes where? I mean, we have a cluster, and we also have a
local client machine. What executes where?
The executors are always going to run on the cluster machines. There is no exception for
executors. However, you have the flexibility to start the driver on your local machine or as a
process on the cluster. When you start an application, you have a choice to specify the
execution mode, and there are three options.
1. Client Mode - Start the driver on your local machine
2. Cluster Mode - Start the driver on the cluster
3. Local Mode - Start everything in a single local JVM.
The Client Mode will start the driver on your local machine, and the Cluster Mode will start the driver on the cluster. Local mode is meant for development and debugging; it doesn't use the cluster at all, and everything runs in a single JVM on your local machine.
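As a rough sketch of how the mode is chosen with the spark-submit utility (the application file name and the master setting are placeholders; your cluster may use a different master):

# Client mode: the driver runs on the machine where spark-submit is executed
spark-submit --master yarn --deploy-mode client my_app.py

# Cluster mode: the driver runs inside the cluster (e.g. in a YARN container)
spark-submit --master yarn --deploy-mode cluster my_app.py

# Local mode: everything runs in a single local JVM, here with 2 threads
spark-submit --master "local[2]" my_app.py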
Which mode should you use?
You already know that the driver is responsible for the whole application. If anything goes
wrong with the driver, your application state is gone. So, if you start the driver on your local
machine, your application is directly dependent on your local computer. You might not need
that kind of dependency in a production application. After all, you have a dedicated cluster to
run the job. Hence, the Cluster mode makes perfect sense for production deployment.
Because after spark-submit, you can switch off your local computer and the application
executes independently within the cluster.
On the other side, when you are exploring things or debugging an application, you want the
driver to be running locally. If the driver is running locally, you can easily debug it, or at least
it can throw back the output on your terminal. That's where the client-mode makes more
sense over the cluster mode. And hence, if you are using an interactive client, your client tool
itself is a driver, and you will have some executors on the cluster. If you are using spark-
submit, you have both the choices.
Spark Cluster
The next key concept is to understand the resource allocation process within a Spark cluster.
How does Spark get the resources for the driver and the executors?
That's where Apache Spark needs a cluster manager. Spark doesn't offer an inbuilt cluster
manager. It relies on a third party cluster manager, and that's a powerful thing because it
gives you multiple options. As of this writing, Apache Spark supports four different cluster managers.
1. Apache YARN
2. Apache Mesos
3. Kubernetes
4. Standalone
YARN is the cluster manager for Hadoop. As of date, YARN is the most widely used cluster
manager for Apache Spark.
Apache Mesos is another general-purpose cluster manager. If you are not using Hadoop, you
might be using Mesos for your Spark cluster.
The next option is Kubernetes. I won't consider Kubernetes a cluster manager in the traditional sense; in fact, it's a general-purpose container orchestration platform from Google. At the time of writing, Spark on Kubernetes was not yet production ready. However, the community is working hard to bring it to production.
Finally, the standalone. The Standalone is a simple and basic cluster manager that comes with
Apache Spark and makes it easy to set up a Spark cluster very quickly. I don't think you
would be using it in a production environment.
No matter which cluster manager we use, all of them serve the same purpose.
Spark on YARN
Let's take YARN as an example to understand the resource allocation process.
A Spark application begins by creating a Spark Session. That's the first thing in any Spark 2.x
application. If you are building an application, you will be establishing a Spark Session. If
you are using a Spark client tool, for example spark-shell, it automatically creates a Spark Session for you. You can think of the Spark Session as a data structure where the driver
maintains all the information including the executor location and their status.
Now, assume you are starting an application in client mode, or you are starting a spark-shell
(refer to the diagram below). In this case, your driver starts on the local machine, and as soon as the driver creates a Spark Session, a request (1) goes to the YARN resource manager to create a
YARN application. The YARN resource manager starts (2) an Application Master. For the
client mode, the AM acts as an Executor Launcher. So, the YARN application master will
reach out (3) to the YARN resource manager and request further containers. The resource
manager will allocate (4) new containers, and the Application Master starts (5) an executor in
each container. After the initial setup, these executors directly communicate (6) with the
driver.
The process for a cluster mode application is slightly different (refer to the diagram below). In the
cluster mode, you submit your packaged application using the spark-submit tool. The spark-
submit utility will send (1) a YARN application request to the YARN resource manager. The
YARN resource manager starts (2) an application master. And then, the driver starts in the
AM container. That's where the client mode and the cluster mode differ.
In the client mode, the YARN AM acts as an executor launcher, and the driver resides on
your local machine, but in the cluster mode, the YARN AM starts the driver, and you don't
have any dependency on your local computer. Once started, the driver will reach out (3) to the resource manager with a request for more containers. The rest of the process is the same: the resource manager will allocate (4) new containers, and the driver starts (5) an executor in each container.
Parallel Processing in Apache Spark
We already learned about the application driver and the executors. We know that Apache Spark breaks our application into many smaller tasks and assigns them to executors. So Spark executes the application in parallel. But do you understand the internal mechanics? How does Spark break our code into a set of tasks and run them in parallel?
This article aims to answer the above question.
Spark application flow
All that you are going to do in Apache Spark is to read some data from a source and load it
into Spark. You will then process the data and hold the intermediate results, and finally write
the results back to a destination. But in this process, you need a data structure to hold the data
in Spark.
We have three alternatives to hold data in Spark.
1. Data Frame
2. Dataset
3. RDD
Apache Spark 2.x recommends using the first two and avoiding RDDs. However, there is a critical fact to note about RDDs: DataFrames and Datasets are both ultimately compiled down to RDDs. So, under the hood, everything in Spark is an RDD. And for that
reason, I will start with RDDs and try to explain the mechanics of parallel processing.
Spark RDD
Let's define the RDD. The name stands for Resilient Distributed Dataset. However, I can describe an RDD as follows.
Spark RDD is a resilient, partitioned, distributed and immutable collection of data.
Let's quickly review this description.
Collection of data - RDDs hold data and appear to be a Scala collection.
Resilient - RDDs can recover from a failure, so they are fault tolerant.
Partitioned - Spark breaks the RDD into smaller chunks of data. These pieces are called
partitions.
Distributed - Instead of keeping those partitions on a single machine, Spark spreads them
across the cluster. So they are a distributed collection of data.
Immutable - Once defined, you can't change an RDD. So a Spark RDD is a read-only data structure.
You can create an RDD using two methods.
1. Load some data from a source.
2. Create an RDD by transforming another RDD.
The code below shows an example RDD.
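Here is a minimal PySpark sketch of such an example, assuming sc is the SparkContext and using a placeholder file path:

# first line: load a file to create an RDD; Spark picks a default number of partitions
flistRDD = sc.textFile("/FileStore/tables/flist.txt")
# second line: display the default number of partitions
print(flistRDD.getNumPartitions())
# override the default: the second textFile parameter sets the (minimum) number of partitions
flistRDD = sc.textFile("/FileStore/tables/flist.txt", 5)
# last line: iterate over the partitions and count the number of elements in each one
print(flistRDD.glom().map(len).collect())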
In the first line, we load some data from a file to create a RDD. When you create a RDD by
loading some data from a source, Spark creates some default partitions. The second line
displays the default number of partitions. If you want, you can override the defaults and create as many partitions as you want; the second parameter of the textFile API is the (minimum) number of partitions. The last line iterates over all partitions and counts the number of elements in each
partition.
The above example shows that an RDD is a partitioned collection, and we can control the
number of partitions.
Spark RDDs offer two types of operations.
1. Transformations
2. Actions
The transformation operations create a new distributed dataset from an existing distributed
dataset. So, they create a new RDD from an existing RDD.
The Actions are mainly performed to send results back to the driver, and hence they produce
a non-distributed dataset.
All transformations in Spark are lazy, and the actions are strict. That means transformations don't compute their results until an action requires them to produce results.
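A small sketch to make the laziness concrete (assuming sc is the SparkContext; the file path and filter condition are just illustrative):

lines = sc.textFile("/FileStore/tables/flist.txt")      # transformation: nothing is read yet
errors = lines.filter(lambda line: "ERROR" in line)     # transformation: still lazy, only the lineage is recorded
print(errors.count())                                   # action: Spark now reads the file and runs the filter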
Parallel Processing of RDD
Let me ask you a simple question. Given the above RDD, if I want to count the number of lines in that RDD, can we do it in parallel? No brainer, right?
We already have five partitions. I will give one partition to each executor and ask them to
count the lines in the given partition. Then I will take the counts back from these executors
and sum them. Simple, isn't it? That's exactly what Spark does.
Calculating count is a simple thing. However, the mechanism of parallelism in Spark is the
same. There are two main variables to control the degree of parallelism in Apache Spark.
1. The number of partitions
2. The number of executors
If you have ten partitions, you can achieve ten parallel processes at the most. However, if you
have just two executors, all those ten partitions will be queued to those two executors.
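Here is a rough sketch of that counting mechanism, reusing the five-partition flistRDD from the earlier example; it is an illustration of the idea, not Spark's literal internal code:

# each executor counts the lines in its own partition(s); the driver only sums the partial counts
partial_counts = flistRDD.mapPartitions(lambda part: [sum(1 for _ in part)]).collect()
print(partial_counts)                           # one count per partition
print(sum(partial_counts) == flistRDD.count())  # the sum of partial counts equals the total line count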
Let's do something simple and take our understanding to the next level.
Demo-
We can run Spark programs in several ways:
1) Local Machine:
Download and install PySpark on your machine and execute your programs locally.
Steps:
1) Install JAVA
2) Install Python
3) Download Apache Spark
4) Verify Spark Software file
5) Install Apache Spark
6) Add winutils.exe File
7) Configure Environment Variables
8) Launch Spark
Link to install PySpark on Local Machine:
https://phoenixnap.com/kb/install-spark-on-windows-10
2) Install PySpark in Anaconda & Jupyter Notebook:
Download & Install Anaconda Distribution.
Install Java.
Install PySpark.
Install FindSpark.
Validate PySpark Installation from pyspark shell.
PySpark in Jupyter notebook.
Run PySpark from IDE.
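A hedged sketch of the typical commands for these steps (exact package versions and paths depend on your setup):

# install the Python packages (run in an Anaconda prompt): pip install pyspark findspark

# inside a Jupyter notebook, findspark locates the Spark installation before pyspark is imported
import findspark
findspark.init()                                  # optionally pass your SPARK_HOME path here

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("JupyterTest").getOrCreate()
print(spark.version)                              # validates that the installation works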
3) By using Google Colab:
Colab by Google is an incredibly powerful tool based on Jupyter Notebook. Since it runs on Google's servers, we don't need to install anything on our local system, be it Spark or any deep learning framework.
Steps to work with Google Colab:
Link: https://towardsdatascience.com/pyspark-on-google-colab-101-d31830b238be
4) By using Databricks Community Version:
Databricks is an American enterprise software company founded by the creators of Apache
Spark. Databricks develops a web-based platform for working with Spark, which provides
automated cluster management and IPython-style notebooks.
To work with Databricks, we have to create a Databricks Community Edition account, which is free.
Link: https://community.cloud.databricks.com/login.html
How to create Databricks Account:
1) Go to https://community.cloud.databricks.com/login.html
2) Click on Sign Up
3) Fill in your personal information (name, email ID, phone number, etc.)
4) Click on GET STARTED FOR FREE
5) Select Get started with community edition
6) Solve the puzzle, confirm your account from the email you receive, and set a password.
Working with Databricks:
1) Upload a file into the Databricks File System (DBFS):
Sign in to your Databricks account. On the left side, click the Data option, then click Create Table. In the upload file option, click "Drop files to upload, or click to browse" and select the file you want to upload.
2) Get the path of a specific file from DBFS:
Go to Data -> Create Table -> DBFS -> FileStore -> Tables, then select the file and copy its path.
3) How to create a cluster:
Click Create -> Cluster. Set the cluster name to DemoCluster. For the Databricks runtime version, select Runtime 11.3 LTS (Spark 3.3), the latest version at the time of writing. Keep the instance settings as they are and click Create Cluster.
4) How to create a notebook:
Click Create -> Notebook.
Name: demo 1
Default language: Python
Cluster: select the cluster created in step 3
Getting started with PySpark:
Create a Spark Session:
The SparkSession is the entry point to the underlying PySpark functionality; it is used to programmatically create PySpark RDDs and DataFrames.
A Spark session can be created by using the builder pattern (SparkSession.builder), as shown below.
# Create a SparkSession using the builder pattern
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[1]") \
    .appName('SparkByExamples.com') \
    .getOrCreate()
master() – If you are running on a cluster, you need to pass your cluster manager's master URL as an argument to master(); usually it would be either yarn or mesos, depending on your cluster setup. Use local[x] when running in local mode, where x is the number of worker threads to use (this also becomes the default number of partitions).
appName() – Used to set your application name.
getOrCreate() – This returns the existing SparkSession object if one already exists, and creates a new one otherwise.
SparkContext:
It is an entry point to the PySpark functionality that is used to communicate with the cluster and to create RDDs, accumulators, and broadcast variables.
Each Spark application has a single SparkContext, but we can create multiple SparkSessions.
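A minimal sketch of getting and using the SparkContext from an existing SparkSession (the accumulator and broadcast values are just illustrations):

# the SparkContext is available from the SparkSession
sc = spark.sparkContext
print(sc.appName, sc.master)

# it is used to create RDDs, accumulators, and broadcast variables
acc = sc.accumulator(0)                        # a counter shared across tasks
lookup = sc.broadcast({"a": 1, "b": 2})        # a read-only variable shipped to executors
print(lookup.value["a"])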
How to create an RDD in different ways:
1) Create an RDD from a dataset (like a list) using the parallelize method.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]       # a Python list
rdd = sc.parallelize(data)                # create an RDD from the list
rdd.collect()                             # collect is an action; it triggers execution and returns the data
2) Create an RDD using the textFile method:
# create an RDD from an external data source (sc is the SparkContext)
rdd2 = sc.textFile("/FileStore/tables/Network.txt")
rdd2.collect()