Introduction to Spark
Sajan Kedia
Agenda
• What is Spark
• Why Spark
• Spark Framework
• RDD
• Immutability
• Lazy Evaluation
• Dataframe
• Dataset
• Spark SQL
• Architecture
• Cluster Manager
What is Spark?
• Apache Spark is a fast, in-memory data processing engine
• With its development APIs, it lets you run streaming, machine learning, or SQL workloads
• Fast, expressive cluster computing system compatible with Apache Hadoop
• Improves efficiency through:
• In-memory computing primitives
• General computation graphs (DAG)
• Up to 100× faster
• Improves usability through:
• Rich APIs in Java, Scala, Python
• Interactive shell
• Often 2-10× less code
• An open-source parallel processing framework primarily used for data engineering and analytics
About Apache Spark
• Initially started at UC Berkeley in 2009
• Open-source cluster computing framework
• Written in Scala (gives the power of functional programming)
• Provides high-level APIs in
• Java
• Scala
• Python
• R
• Integrates with Hadoop and its ecosystem and can read existing data
• Designed to be fast for iterative algorithms and interactive queries, for which MapReduce is inefficient
• Most popular for running iterative machine learning algorithms
• Supports in-memory storage and efficient fault recovery
• 10× faster (on disk) to 100× faster (in memory)
Why Spark ?
Hadoop execution flow
Spark execution flow
• Most machine learning algorithms are iterative, because each iteration can improve the results
• With a disk-based approach, each iteration's output is written to disk, making it slow
Spark Core Engine
Spark Framework
RDD (Resilient Distributed Dataset)
• Key Spark construct
• A distributed collection of elements
• Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster
• Spark automatically distributes the data in an RDD across the cluster and parallelizes the operations
• RDDs have the following properties
○ Immutable
○ Lazy evaluated
○ Cacheable
○ Type inferred
RDD Operations
• How to create an RDD:
• Loading external data sources
• lines = sc.textFile("readme.txt")
• Parallelizing a collection in a driver program
• lines = sc.parallelize(["pandas", "I like pandas"])
• Transformation
• Transforms an RDD into another RDD by applying some function
• Lineage graph (DAG): keeps track of the dependencies between transformed RDDs, from which new RDDs can be created on demand or part of a persisted RDD can be recovered in case of failure
• Examples: map, filter, flatMap, distinct, sample, union, intersection, subtract, cartesian, etc.
• Action
• Actual output is generated from the transformed RDD only once an action is applied
• Returns values to the driver program or writes data to external storage
• The entire RDD gets recomputed from scratch on each new action call if intermediate results are not persisted
• Examples: reduce, collect, count, countByValue, take, top, takeSample, aggregate, foreach, etc. (a small PySpark sketch follows below)
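A minimal PySpark sketch of the create → transform → action flow described above. It assumes a SparkContext named sc (e.g. from the pyspark shell); the file path readme.txt is only illustrative.

# Create RDDs: from an external file and from an in-memory collection
lines = sc.textFile("readme.txt")                     # external data source
words = sc.parallelize(["pandas", "I like pandas"])   # parallelized collection

# Transformations: return new RDDs, nothing is computed yet
pandas_lines = lines.filter(lambda line: "pandas" in line)
tokens = pandas_lines.flatMap(lambda line: line.split(" "))

# Actions: trigger the computation described by the lineage graph
print(pandas_lines.count())   # number of matching lines
print(tokens.take(5))         # first five tokens returned to the driver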
Immutability
โ— Immutability means once created it never changes
โ— Big data by default immutable in nature
โ— Immutability helps to
โ—‹ Parallelize
โ—‹ Caching
Immutability in actio
const int a = 0 //immutable
int b = 0; // mutable
Updation
b ++ // in place
c = a + 1
Immutability is about value not about reference
Challenges of Immutability
โ— Immutability is great for parallelism but not good for space
โ— Doing multiple transformations result in
โ—‹ Multiple copies of data
โ—‹ Multiple passes over data
โ— In big data, multiple copies and multiple passes will have poor performance characteristics.
Lazy Evaluation
• Laziness means not computing a transformation until it is needed
• Once any action is performed, the actual computation starts
• A DAG (directed acyclic graph) is created for the tasks
• The Catalyst engine is used to optimize the tasks & queries
• It helps reduce the number of passes over the data
• Laziness in action
val c1 = collection.map(value => value + 1)  // does not compute anything
val c2 = c1.map(value => value + 2)          // still not computed
print(c2)                                    // now the computation runs
Multiple transformations are combined into one:
val c2 = collection.map { value =>
  var result = value + 1
  result = result + 2
  result
}
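The same behaviour can be seen in PySpark; a minimal sketch assuming a SparkContext sc is available. toDebugString() shows the lineage Spark has recorded before anything actually runs.

rdd = sc.parallelize(range(1, 101))
c1 = rdd.map(lambda v: v + 1)      # returns immediately: only the DAG is recorded
c2 = c1.map(lambda v: v + 2)       # still nothing has been computed
print(c2.toDebugString())          # inspect the lineage (DAG) built so far
print(c2.collect()[:5])            # action: the whole pipeline runs now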
Challenges of Laziness
• Laziness poses challenges in terms of data types
• If laziness defers execution, determining the type of a variable becomes challenging
• If we can't determine the right type, semantic issues can slip through
• Running big data programs and getting semantic errors is not fun
Type inference
• Type inference is the part of the compiler that determines the type from the value
• As all the transformations are side-effect free, we can determine the type from the operation
• Every transformation has a specific return type
• Having type inference relieves you from thinking about the representation for many transforms
• Example:
• c3 = c2.count()            // inferred as Int
• collection = [1, 2, 4, 5]  // explicit type Array
Caching
• Immutable data allows you to cache data for a long time
• Lazy transformations allow recreating data on failure
• Transformations can also be saved
• Caching data improves execution engine performance
• Reduces lots of I/O operations for reading/writing data from HDFS
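A small PySpark sketch of caching an RDD that is reused across several actions; it assumes a SparkContext sc, and the file path events.log is only illustrative.

from pyspark import StorageLevel

events = sc.textFile("events.log")
errors = events.filter(lambda line: "ERROR" in line)

errors.persist(StorageLevel.MEMORY_ONLY)   # or simply errors.cache()

print(errors.count())    # first action: reads from storage and fills the cache
print(errors.take(10))   # second action: served from memory, no re-read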
What Spark gives Hadoop?
• The machine learning module (MLlib) delivers capabilities not easily exploited in Hadoop
• In-memory processing of sizeable data volumes remains an important contribution to the capabilities of a Hadoop cluster
• Valuable for enterprise use cases
• Spark's SQL capabilities for interactive analysis over big data
• Streaming capabilities (Spark Streaming)
• Graph processing capabilities (GraphX)
What Hadoop gives Spark?
• YARN resource manager
• DFS (HDFS)
• Disaster recovery capabilities
• Data security
• A distributed data platform
• Leverage existing clusters
• Data locality
• Manage workloads using advanced policies
• Allocate shares to different teams and users
• Hierarchical queues
• Queue placement policies
• Take advantage of Hadoop's security
• Run on Kerberized clusters
Data Frames in Spark
• Unlike an RDD, data is organized into named columns
• Allows developers to impose a structure onto a distributed collection of data
• Enables wider audiences beyond "big data" engineers to leverage the power of distributed processing
• Allows Spark to manage the schema and only pass the data between nodes, in a much more efficient way than using Java serialization
• Spark can serialize the data into off-heap storage in a binary format and then perform many transformations directly on this off-heap memory
• Custom memory management
• Data is stored in off-heap memory in binary format
• No garbage collection is involved, due to the avoidance of serialization
• Query optimization plan
• Catalyst query optimizer
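A minimal PySpark DataFrame sketch; it assumes a SparkSession named spark and an illustrative JSON file people.json with name and age columns.

from pyspark.sql import functions as F

df = spark.read.json("people.json")   # schema is inferred and managed by Spark
df.printSchema()                      # named columns instead of opaque objects

adults = (df.filter(F.col("age") >= 18)
            .groupBy("name")
            .agg(F.avg("age").alias("avg_age")))

adults.show()                         # Catalyst optimizes the whole plan before execution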
Datasets in Spark
• Aims to provide the best of both worlds
• RDDs: OOP style and compile-time safety
• Data frames: Catalyst query optimizer, custom memory management
• Where the Dataset scores over the DataFrame is an additional feature it has: encoders
• Encoders act as the interface between JVM objects and the off-heap custom binary memory format
• Encoders generate bytecode to interact with off-heap data and provide on-demand access to individual attributes without having to deserialize an entire object
Spark SQL
• Spark SQL is a Spark module for structured data processing
• It lets you query structured data inside Spark programs, using SQL or the familiar DataFrame API
• Connect to any data source the same way
• Hive, Avro, Parquet, ORC, JSON, and JDBC
• You can even join data across these sources
• Run SQL or HiveQL queries on existing warehouses
• A server mode provides industry-standard JDBC and ODBC connectivity for business intelligence tools
• Writing code against the RDD API in Scala can be difficult; with Spark SQL, easy SQL-style code can be written, which is internally converted to the Spark API and optimized using the DAG & Catalyst engine
• There is no reduction in performance
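A short PySpark sketch of mixing SQL with the DataFrame API; it assumes a SparkSession spark and an illustrative Parquet file sales.parquet with region and amount columns.

sales = spark.read.parquet("sales.parquet")
sales.createOrReplaceTempView("sales")   # expose the DataFrame to SQL

top_regions = spark.sql("""
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
    ORDER BY total DESC
    LIMIT 10
""")

top_regions.show()   # same Catalyst-optimized plan as the equivalent DataFrame code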
PySpark Code - Hands On
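A classic hands-on starting point is word count; the sketch below is only illustrative and assumes a local PySpark installation and an input file readme.txt.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()
sc = spark.sparkContext

counts = (sc.textFile("readme.txt")
            .flatMap(lambda line: line.split(" "))
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

# Print the ten most frequent words
for word, count in counts.takeOrdered(10, key=lambda wc: -wc[1]):
    print(word, count)

spark.stop()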
Spark Runtime Architecture
Spark Runtime Architecture - Driver
• Master node
• The process where the "main" method runs
• Runs user code that
• creates a SparkContext
• performs RDD operations
• When running, it performs two main duties
• Converting the user program into tasks
• Logical DAG of operations -> physical execution plan
• Optimization example: pipelining map transforms together to merge them and converting the execution graph into a set of stages
• Tasks are bundled up and sent to the cluster
• Scheduling tasks on executors
• Executors register themselves with the driver
• The driver looks at the current set of executors and tries to schedule each task in an appropriate location, based on data placement
• It tracks the location of cached data (cached as a side effect of running tasks) and uses it to schedule future tasks that access that data
Spark Runtime Architecture - Spark Context
• The driver accesses Spark functionality through the SparkContext (sc) object
• Represents a connection to the computing cluster
• Used to build RDDs
• Works with the cluster manager
• Manages executors running on worker nodes
• Splits jobs into parallel tasks and executes them on worker nodes
• Partitions RDDs and distributes them over the cluster (see the sketch below)
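A minimal sketch of creating a SparkContext in PySpark; the app name and master URL are illustrative (in the pyspark shell, an sc is already created for you).

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("MyApp")
        .setMaster("local[4]"))   # e.g. "yarn" when running on a cluster

sc = SparkContext(conf=conf)      # connection to the cluster; entry point for RDDs

rdd = sc.parallelize([1, 2, 3, 4])
print(rdd.sum())

sc.stop()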
Spark Runtime Architecture - Executor
• Runs the individual tasks in a given Spark job
• Launched once at the beginning of a Spark application
• Two main roles
• Runs the tasks and returns results to the driver
• Provides in-memory storage for RDDs that are cached by user programs, through a service called the Block Manager that lives within each executor
Spark Runtime Architecture - Cluster Manager
• Launches executors and sometimes the driver
• Allows Spark to run on top of different external managers
• YARN
• Mesos
• Spark's built-in standalone cluster manager
• Deploy modes
• Client mode
• Cluster mode
Spark + YARN (Cluster Deployment mode)
• Driver runs inside a YARN container
Spark + YARN (Client Deployment Mode)
• Driver runs on the machine from which you submitted the application, e.g. locally on your laptop
Steps: running spark application on cluster
• Submit an application using spark-submit
• spark-submit launches the driver program and invokes its main() method
• The driver program contacts the cluster manager to ask for resources to launch executors
• The cluster manager launches executors on behalf of the driver program
• The driver process runs through the user application and sends work to the executors in the form of tasks
• Executors run the tasks and save the results
• When the driver's main() method exits or SparkContext.stop() is called, the executors are terminated and the resources are released
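A minimal application skeleton that follows these steps; this is only a sketch, and the file name my_app.py and the input path are illustrative. It would be submitted with something like: spark-submit --master yarn --deploy-mode cluster my_app.py

# my_app.py - driver program launched by spark-submit
from pyspark.sql import SparkSession

def main():
    spark = SparkSession.builder.appName("MyApp").getOrCreate()
    sc = spark.sparkContext                       # driver asks the cluster manager for executors

    data = sc.textFile("hdfs:///data/input.txt")  # work is sent to executors as tasks
    print(data.count())                           # action: executors run tasks, results return to the driver

    spark.stop()                                  # terminates the executors and releases the resources

if __name__ == "__main__":
    main()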
Thanks :)
