Introduction to Spark
Sajan Kedia
Agenda
• What is Spark
• Why Spark
• Spark Framework
• RDD
• Immutability
• Lazy Evaluation
• Dataframe
• Dataset
• Spark SQL
• Architecture
• Cluster Manager
What is Spark?
• Apache Spark is a fast, in-memory data processing engine
• Its development APIs allow you to execute streaming, machine learning, or SQL workloads.
• Fast, expressive cluster computing system compatible with Apache Hadoop
• Improves efficiency through:
• In-memory computing primitives
• General computation graphs (DAG)
• Up to 100× faster
• Improves usability through:
• Rich APIs in Java, Scala, Python
• Interactive shell
• Often 2-10× less code
• Open-source parallel processing framework primarily used for data engineering and analytics.
About Apache Spark
• Initially started at UC Berkeley in 2009
• Open source cluster computing framework
• Written in Scala (gives power of functional Programming)
• Provides high level APIs in
• Java
• Scala
• Python
• R
• Integrates with Hadoop and its ecosystem and can read existing Hadoop data.
• Designed to be fast for iterative algorithms and interactive queries, for which MapReduce is inefficient.
• Most popular for running iterative machine learning algorithms.
• Supports in-memory storage and efficient fault recovery.
• 10x (on-disk) to 100x (in-memory) faster
Why Spark ?
• Most machine learning algorithms are iterative, because each iteration can improve the results
• With a disk-based approach, the output of each iteration is written to disk, making it slow
Hadoop execution flow
Spark execution flow
Spark Core Engine
Spark Framework
RDD (Resilient Distributed Dataset)
• Key Spark Construct
• A distributed collection of elements
• Each RDD is split into multiple partitions which may be computed on different nodes of the cluster
• Spark automatically distributes the data in an RDD across the cluster and parallelizes the operations
• RDDs have the following properties
○ Immutable
○ Lazy evaluated
○ Cacheable
○ Type inferred
RDD Operations
• How to create RDD:
• Loading external data sources
• lines = sc.textFile("readme.txt")
• Parallelizing a collection in a driver program
• lines = sc.parallelize(["pandas", "I like pandas"])
• Transformation
• Transforms an RDD into another RDD by applying a function
• Lineage graph (DAG): keeps track of dependencies between transformed RDDs, so that new RDDs can be computed on demand and lost partitions of a persisted RDD can be recovered after a failure
• Examples: map, filter, flatMap, distinct, sample, union, intersection, subtract, cartesian, etc.
• Action
• Actual output is generated from the transformed RDDs only once an action is applied
• Actions return values to the driver program or write data to external storage
• The entire RDD gets computed from scratch on each new action call if intermediate results are not persisted
• Examples: reduce, collect, count, countByValue, take, top, takeSample, aggregate, foreach, etc. (see the sketch below)
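A minimal PySpark sketch of this create → transform → act cycle (the file name and its contents are hypothetical; sc is the SparkContext, as in the slides' own examples):
lines = sc.textFile("readme.txt")                      # create an RDD from external storage
pandas_lines = lines.filter(lambda l: "pandas" in l)   # transformation: nothing computed yet
print(pandas_lines.count())                            # action: triggers the actual computation
print(pandas_lines.take(2))                            # action: returns the first 2 matching lines to the driver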
Immutability
● Immutability means that once data is created, it never changes
● Big data is by default immutable in nature
● Immutability helps with
○ Parallelism
○ Caching
Immutability in action
const int a = 0;   // immutable
int b = 0;         // mutable
Update:
b++;               // mutable: b is changed in place
int c = a + 1;     // immutable: a is unchanged; a new value c is created
Immutability is about the value, not the reference
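The same idea in PySpark, as a small sketch (assumes an existing SparkContext sc): transformations never modify an RDD in place, they return a new one.
nums = sc.parallelize([1, 2, 3, 4])     # an immutable RDD
doubled = nums.map(lambda x: x * 2)     # returns a NEW RDD; nums itself is unchanged
print(nums.collect())                   # [1, 2, 3, 4]
print(doubled.collect())                # [2, 4, 6, 8]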
Challenges of Immutability
● Immutability is great for parallelism but not good for space
● Doing multiple transformations results in
○ Multiple copies of data
○ Multiple passes over data
● In big data, multiple copies and multiple passes will have poor performance characteristics.
Lazy Evaluation
• Laziness means not computing a transformation until it is needed
• Once an action is performed, the actual computation starts
• A DAG (Directed acyclic graph) will be created for the tasks
• Catalyst Engine is used to optimize the tasks & queries
• It helps reduce the number of passes
• Laziness in action
val c1 = collection.map(value => value + 1)   // nothing is computed yet
val c2 = c1.map(value => value + 2)           // still nothing is computed
print(c2)                                     // evaluation is forced here
// Multiple transformations are combined into one:
val c2 = collection.map(value => {
  var result = value + 1
  result = result + 2
  result
})
Challenges of Laziness
• Laziness poses challenges in terms of data type
• If laziness defers execution, determining the type of the variable becomes challenging
• If we cannot determine the right type, semantic issues can slip through
• Running big data programs and getting semantic errors is not fun
Type inference
• Type inference is the part of the compiler that determines the type of an expression from its value
• As all the transformations are side-effect free, we can determine the type from the operation
• Every transformation has a specific return type
• Type inference relieves you from thinking about the representation for many transformations.
• Example:
• c3 = c2.count()            // inferred as Int
• collection = [1, 2, 4, 5]  // explicit type: Array
Caching
• Immutable data allows you to cache data for long time
• Lazy transformations allow data to be recreated on failure
• Transformations can also be saved
• Caching data improves execution engine performance
• Reduces a lot of I/O operations for reading/writing data from HDFS
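A small caching sketch in PySpark (the HDFS path is hypothetical; assumes an existing SparkContext sc):
logs = sc.textFile("hdfs:///data/app/logs")        # hypothetical input path
errors = logs.filter(lambda line: "ERROR" in line)
errors.cache()                                     # ask Spark to keep this RDD in memory
print(errors.count())                              # first action: reads from HDFS and populates the cache
print(errors.take(5))                              # later actions reuse the cached partitions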
What Spark gives Hadoop?
• Spark's machine learning module delivers capabilities not easily exploited in Hadoop alone.
• In-memory processing of sizeable data volumes remains an important contribution to the capabilities of a Hadoop cluster.
• Valuable for enterprise use cases
• Spark’s SQL capabilities for interactive analysis over big data
• Streaming capabilities (Spark streaming)
• Graph processing capabilities (GraphX)
What Hadoop gives Spark?
• YARN resource manager
• Distributed file system (HDFS)
• Disaster Recovery capabilities
• Data Security
• A distributed data platform
• Leverage existing clusters
• Data locality
• Manage workloads using advanced policies
• Allocate shares to different teams and users
• Hierarchical queues
• Queue placement policies
• Take advantage of Hadoop’s security
• Run on Kerberized clusters
Data Frames in Spark
• Unlike an RDD, data is organized into named columns.
• Allows developers to impose a structure onto a distributed collection of data.
• Enables wider audiences beyond “Big Data” engineers to leverage the power of distributed
processing
• Allows Spark to manage the schema and pass only the data between nodes, in a much more efficient way than Java serialization.
• Spark can serialize the data into off-heap storage in a binary format and then perform many transformations directly on this off-heap memory.
• Custom memory management
• Data is stored in off-heap memory in a binary format.
• No garbage collection is involved, since data is kept off-heap rather than as Java objects on the heap.
• Query optimization plan
• Catalyst Query optimizer
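A minimal DataFrame sketch in PySpark (the column names and rows are made up; assumes an existing SparkSession spark):
from pyspark.sql import Row

people = spark.createDataFrame([Row(name="alice", age=30), Row(name="bob", age=25)])
people.printSchema()                                   # Spark knows the named columns and their types
people.filter(people.age > 26).select("name").show()   # the query plan is optimized by Catalyst before execution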
Datasets in Spark
• Aims to provide the best of both worlds
• RDDs – object-oriented programming and compile-time type safety
• Data frames - Catalyst query optimizer, custom memory management
• Where Datasets score over DataFrames is one additional feature: encoders.
• Encoders act as interface between JVM objects and off-heap custom memory binary format data.
• Encoders generate byte code to interact with off-heap data and provide on-demand access to
individual attributes without having to deserialize an entire object.
Spark SQL
• Spark SQL is a Spark module for structured data processing
• It lets you query structured data inside Spark programs, using SQL or a familiar
DataFrame API.
• Connect to any data source the same way
• Hive, Avro, Parquet, ORC, JSON, and JDBC.
• You can even join data across these sources.
• Run SQL or HiveQL queries on existing warehouses.
• A server mode provides industry standard JDBC and ODBC connectivity for business
intelligence tools.
• Writing code with the RDD API in Scala can be difficult; with Spark SQL, easy SQL-style code can be written, which is internally converted into Spark operations and optimized using the DAG and the Catalyst engine.
• There is no reduction in performance.
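A short Spark SQL sketch (the JSON source and its columns are hypothetical; assumes an existing SparkSession spark):
df = spark.read.json("people.json")        # load structured data into a DataFrame
df.createOrReplaceTempView("people")       # register it as a SQL view
adults = spark.sql("SELECT name FROM people WHERE age >= 18")
adults.show()                              # SQL and the DataFrame API share the same Catalyst-optimized plan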
PySpark Code - Hands On
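As a stand-in for the live demo, a classic word-count sketch in PySpark (the input path is hypothetical):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()
sc = spark.sparkContext

counts = (sc.textFile("readme.txt")                 # hypothetical input file
            .flatMap(lambda line: line.split())     # one record per word
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))       # sum the counts per word
print(counts.take(10))
spark.stop()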
Spark Runtime Architecture
Spark Runtime Architecture - Driver
• Master node
• Process where “main” method runs
• Runs user code that
• creates a SparkContext
• Performs RDD operations
• When run, it performs two main duties
• Converting the user program into tasks
• Logical DAG of operations -> physical execution plan
• Optimization example: pipelining map transformations together to merge them, and converting the execution graph into a set of stages
• Tasks are bundled up and sent to the cluster
• Scheduling tasks on executors
• Executors register themselves with the driver
• The driver looks at the current set of executors and tries to schedule each task in an appropriate location, based on data placement
• It tracks the location of cached data (which tasks may store as a side effect of running) and uses it to schedule future tasks that access that data
Spark Runtime Architecture - Spark Context
• Driver accesses Spark functionality through SC object
• Represents a connection to the computing cluster
• Used to build RDDs
• Works with the cluster manager
• Manages executors running on worker nodes
• Splits jobs into parallel tasks and executes them on worker nodes
• Partitions RDDs and distributes them on the cluster
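A minimal sketch of creating and using a SparkContext (the app name and master URL are placeholder settings):
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("my_app").setMaster("local[*]")   # hypothetical settings
sc = SparkContext(conf=conf)             # the connection to the cluster; entry point for building RDDs
rdd = sc.parallelize(range(100), 4)      # sc splits the data into 4 partitions
print(rdd.getNumPartitions())
sc.stop()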
Spark Runtime Architecture - Executor
• Runs individual tasks in a given Spark job
• Launched once at the beginning of a Spark application
• Two main roles
• Runs the tasks and returns results to the driver
• Provides in-memory storage for RDDs that are cached by user programs, through a service called the Block Manager that lives within each executor
Spark Runtime Architecture - Cluster Manager
• Launches executors and, in some deployments, the driver
• Allows Spark to run on top of different external cluster managers
• YARN
• Mesos
• Spark's built-in standalone cluster manager
• Deploy modes
• Client mode
• Cluster mode
Spark + YARN (Cluster Deployment mode)
• Driver runs inside a YARN container
Spark + YARN (Client Deployment Mode)
• The driver runs on the machine from which you submitted the application,
e.g. locally on your laptop
Steps: running spark application on cluster
• Submit an application using spark-submit (see the example command after these steps)
• spark-submit launches the driver program and invokes its main() method
• The driver program contacts the cluster manager to ask for resources to launch executors
• Cluster manager launches executors on behalf of the driver program.
• The driver process runs through the user application and sends work to the executors in the form of tasks.
• Executors run the tasks and save the results.
• When the driver's main() method exits or SparkContext.stop() is called, the executors are terminated and the cluster resources are released.
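For example, a typical submission might look like the following (the application file and executor count are hypothetical):
spark-submit --master yarn --deploy-mode cluster --num-executors 4 my_app.py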
Thanks :)