Introduction to Spark
Sajan Kedia
Agenda
• What is Spark
• Why Spark
• Spark Framework
• RDD
• Immutability
• Lazy Evaluation
• Dataframe
• Dataset
• Spark SQL
• Architecture
• Cluster Manager
What is Spark?
• Apache Spark is a fast, in-memory data processing engine
• Its development APIs allow you to execute streaming, machine learning, or SQL workloads.
• Fast, expressive cluster computing system compatible with Apache Hadoop
• Improves efficiency through:
• In-memory computing primitives
• General computation graphs (DAG)
• Up to 100× faster
• Improves usability through:
• Rich APIs in Java, Scala, Python
• Interactive shell
• Often 2-10× less code
• Open-source parallel processing framework primarily used for data engineering and analytics.
About Apache Spark
• Initially started at UC Berkeley in 2009
• Open source cluster computing framework
• Written in Scala (gives power of functional Programming)
• Provides high level APIs in
• Java
• Scala
• Python
• R
• Integrates with Hadoop and its ecosystem and can read existing Hadoop data.
• Designed to be fast for iterative algorithms and interactive queries, for which MapReduce is inefficient.
• Most popular for running iterative machine learning algorithms.
• Supports in-memory storage and efficient fault recovery.
• 10x (on-disk) to 100x (in-memory) faster
Why Spark ?
• Most machine learning algorithms are iterative, because each iteration can improve the results
• With a disk-based approach, the output of each iteration is written to disk, making it slow
Hadoop execution flow
Spark execution flow
Spark Core Engine
Spark Framework
RDD (Resilient Distributed Dataset)
• Key Spark Construct
• A distributed collection of elements
• Each RDD is split into multiple partitions which may be computed on different nodes of the cluster
• Spark automatically distributes the data in an RDD across the cluster and parallelizes the operations
• RDDs have the following properties
○ Immutable
○ Lazy evaluated
○ Cacheable
○ Type inferred
RDD Operations
• How to create RDD:
• Loading external data sources
• lines = sc.textFile("readme.txt")
• Parallelizing a collection in a driver program
• lines = sc.parallelize(["pandas", "I like pandas"])
• Transformation
• Transforms an RDD into another RDD by applying a function
• Lineage graph (DAG): keeps track of dependencies between transformed RDDs, so that new RDDs can be computed on demand and lost partitions of a persisted RDD can be recovered after a failure
• Examples: map, filter, flatMap, distinct, sample, union, intersection, subtract, cartesian, etc.
• Action
• Actual output is generated from the transformed RDDs only once an action is applied
• Actions return values to the driver program or write data to external storage
• The entire RDD gets computed from scratch on each new action call if intermediate results are not persisted
• Examples: reduce, collect, count, countByValue, take, top, takeSample, aggregate, foreach, etc. (see the sketch below)
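A minimal PySpark sketch of this create → transform → act cycle (the file name and its contents are hypothetical; sc is the SparkContext, as in the slides' own examples):
lines = sc.textFile("readme.txt")                      # create an RDD from external storage
pandas_lines = lines.filter(lambda l: "pandas" in l)   # transformation: nothing computed yet
print(pandas_lines.count())                            # action: triggers the actual computation
print(pandas_lines.take(2))                            # action: returns the first 2 matching lines to the driver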
Immutability
● Immutability means that once data is created, it never changes
● Big data is by default immutable in nature
● Immutability helps with
○ Parallelism
○ Caching
Immutability in action
const int a = 0;   // immutable
int b = 0;         // mutable
Update:
b++;               // mutable: b is changed in place
int c = a + 1;     // immutable: a is unchanged; a new value c is created
Immutability is about the value, not the reference
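The same idea in PySpark, as a small sketch (assumes an existing SparkContext sc): transformations never modify an RDD in place, they return a new one.
nums = sc.parallelize([1, 2, 3, 4])     # an immutable RDD
doubled = nums.map(lambda x: x * 2)     # returns a NEW RDD; nums itself is unchanged
print(nums.collect())                   # [1, 2, 3, 4]
print(doubled.collect())                # [2, 4, 6, 8]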
Challenges of Immutability
● Immutability is great for parallelism but not good for space
● Doing multiple transformations results in
○ Multiple copies of data
○ Multiple passes over data
● In big data, multiple copies and multiple passes will have poor performance characteristics.
Lazy Evaluation
• Laziness means not computing a transformation until it is needed
• Once an action is performed, the actual computation starts
• A DAG (Directed acyclic graph) will be created for the tasks
• Catalyst Engine is used to optimize the tasks & queries
• It helps reduce the number of passes
• Laziness in action
val c1 = collection.map(value => value + 1)   // nothing is computed yet
val c2 = c1.map(value => value + 2)           // still nothing is computed
print(c2)                                     // evaluation is forced here
// Multiple transformations are combined into one:
val c2 = collection.map(value => {
  var result = value + 1
  result = result + 2
  result
})
Challenges of Laziness
• Laziness poses challenges in terms of data type
• If laziness defers execution, determining the type of the variable becomes challenging
• If we cannot determine the right type, semantic issues can slip through
• Running big data programs and getting semantic errors is not fun
Type inference
• Type inference is the part of the compiler that determines the type of an expression from its value
• As all the transformations are side-effect free, we can determine the type from the operation
• Every transformation has a specific return type
• Type inference relieves you from thinking about the representation for many transformations.
• Example:
• c3 = c2.count()            // inferred as Int
• collection = [1, 2, 4, 5]  // explicit type: Array
Caching
• Immutable data allows you to cache data for long time
• Lazy transformations allow data to be recreated on failure
• Transformations can also be saved
• Caching data improves execution engine performance
• Reduces a lot of I/O operations for reading/writing data from HDFS
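A small caching sketch in PySpark (the HDFS path is hypothetical; assumes an existing SparkContext sc):
logs = sc.textFile("hdfs:///data/app/logs")        # hypothetical input path
errors = logs.filter(lambda line: "ERROR" in line)
errors.cache()                                     # ask Spark to keep this RDD in memory
print(errors.count())                              # first action: reads from HDFS and populates the cache
print(errors.take(5))                              # later actions reuse the cached partitions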
What Spark gives Hadoop?
• Spark's machine learning module delivers capabilities not easily exploited in Hadoop alone.
• In-memory processing of sizeable data volumes remains an important contribution to the capabilities of a Hadoop cluster.
• Valuable for enterprise use cases
• Spark’s SQL capabilities for interactive analysis over big data
• Streaming capabilities (Spark streaming)
• Graph processing capabilities (GraphX)
What Hadoop gives Spark?
• YARN resource manager
• Distributed file system (HDFS)
• Disaster Recovery capabilities
• Data Security
• A distributed data platform
• Leverage existing clusters
• Data locality
• Manage workloads using advanced policies
• Allocate shares to different teams and users
• Hierarchical queues
• Queue placement policies
• Take advantage of Hadoop’s security
• Run on Kerberized clusters
Data Frames in Spark
• Unlike an RDD, data is organized into named columns.
• Allows developers to impose a structure onto a distributed collection of data.
• Enables wider audiences beyond “Big Data” engineers to leverage the power of distributed
processing
• Allows Spark to manage the schema and pass only the data between nodes, in a much more efficient way than Java serialization.
• Spark can serialize the data into off-heap storage in a binary format and then perform many transformations directly on this off-heap memory.
• Custom memory management
• Data is stored in off-heap memory in a binary format.
• No garbage collection is involved, since data is kept off-heap rather than as Java objects on the heap.
• Query optimization plan
• Catalyst Query optimizer
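A minimal DataFrame sketch in PySpark (the column names and rows are made up; assumes an existing SparkSession spark):
from pyspark.sql import Row

people = spark.createDataFrame([Row(name="alice", age=30), Row(name="bob", age=25)])
people.printSchema()                                   # Spark knows the named columns and their types
people.filter(people.age > 26).select("name").show()   # the query plan is optimized by Catalyst before execution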
Datasets in Spark
• Aims to provide the best of both worlds
• RDDs – object-oriented programming and compile-time type safety
• Data frames - Catalyst query optimizer, custom memory management
• Where Datasets score over DataFrames is one additional feature: encoders.
• Encoders act as interface between JVM objects and off-heap custom memory binary format data.
• Encoders generate byte code to interact with off-heap data and provide on-demand access to
individual attributes without having to deserialize an entire object.
Spark SQL
• Spark SQL is a Spark module for structured data processing
• It lets you query structured data inside Spark programs, using SQL or a familiar
DataFrame API.
• Connect to any data source the same way
• Hive, Avro, Parquet, ORC, JSON, and JDBC.
• You can even join data across these sources.
• Run SQL or HiveQL queries on existing warehouses.
• A server mode provides industry standard JDBC and ODBC connectivity for business
intelligence tools.
• Writing code with the RDD API in Scala can be difficult; with Spark SQL, easy SQL-style code can be written, which is internally converted into Spark operations and optimized using the DAG and the Catalyst engine.
• There is no reduction in performance.
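A short Spark SQL sketch (the JSON source and its columns are hypothetical; assumes an existing SparkSession spark):
df = spark.read.json("people.json")        # load structured data into a DataFrame
df.createOrReplaceTempView("people")       # register it as a SQL view
adults = spark.sql("SELECT name FROM people WHERE age >= 18")
adults.show()                              # SQL and the DataFrame API share the same Catalyst-optimized plan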
PySpark Code - Hands On
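As a stand-in for the live demo, a classic word-count sketch in PySpark (the input path is hypothetical):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()
sc = spark.sparkContext

counts = (sc.textFile("readme.txt")                 # hypothetical input file
            .flatMap(lambda line: line.split())     # one record per word
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))       # sum the counts per word
print(counts.take(10))
spark.stop()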
Spark Runtime Architecture
Spark Runtime Architecture - Driver
• Master node
• Process where “main” method runs
• Runs user code that
• creates a SparkContext
• Performs RDD operations
• When run, it performs two main duties
• Converting the user program into tasks
• Logical DAG of operations -> physical execution plan
• Optimization example: pipelining map transformations together to merge them, and converting the execution graph into a set of stages
• Tasks are bundled up and sent to the cluster
• Scheduling tasks on executors
• Executors register themselves with the driver
• The driver looks at the current set of executors and tries to schedule each task in an appropriate location, based on data placement
• It tracks the location of cached data (which tasks may store as a side effect of running) and uses it to schedule future tasks that access that data
Spark Runtime Architecture - Spark Context
• Driver accesses Spark functionality through SC object
• Represents a connection to the computing cluster
• Used to build RDDs
• Works with the cluster manager
• Manages executors running on worker nodes
• Splits jobs into parallel tasks and executes them on worker nodes
• Partitions RDDs and distributes them on the cluster
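A minimal sketch of creating and using a SparkContext (the app name and master URL are placeholder settings):
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("my_app").setMaster("local[*]")   # hypothetical settings
sc = SparkContext(conf=conf)             # the connection to the cluster; entry point for building RDDs
rdd = sc.parallelize(range(100), 4)      # sc splits the data into 4 partitions
print(rdd.getNumPartitions())
sc.stop()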
Spark Runtime Architecture - Executor
• Runs individual tasks in a given Spark job
• Launched once at the beginning of a Spark application
• Two main roles
• Runs the tasks and returns results to the driver
• Provides in-memory storage for RDDs that are cached by user programs, through a service called the Block Manager that lives within each executor
Spark Runtime Architecture - Cluster Manager
• Launches executors and, in some deployments, the driver
• Allows Spark to run on top of different external cluster managers
• YARN
• Mesos
• Spark's built-in standalone cluster manager
• Deploy modes
• Client mode
• Cluster mode
Spark + YARN (Cluster Deployment mode)
• Driver runs inside a YARN container
Spark + YARN (Client Deployment Mode)
• The driver runs on the machine from which you submitted the application,
e.g. locally on your laptop
Steps: running spark application on cluster
• Submit an application using spark-submit (see the example command after these steps)
• spark-submit launches the driver program and invokes its main() method
• The driver program contacts the cluster manager to ask for resources to launch executors
• Cluster manager launches executors on behalf of the driver program.
• The driver process runs through the user application and sends work to the executors in the form of tasks.
• Executors run the tasks and save the results.
• When the driver's main() method exits or SparkContext.stop() is called, the executors are terminated and the cluster resources are released.
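For example, a typical submission might look like the following (the application file and executor count are hypothetical):
spark-submit --master yarn --deploy-mode cluster --num-executors 4 my_app.py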
Thanks :)