Spark and Spark Streaming Internals
Goals for the Spark and Spark Streaming Project
• Generalise the framework for diverse workloads.
• Low latency: for small jobs, the expected latency is sub-second, rather than waiting a few seconds for the job to start.
• Fault tolerance: Spark should handle faults internally rather than depending on users to treat them as a special case.
Need to Understand the Internals of Spark
Understanding the internals matters from a performance perspective.
Example:
Consider a single-core machine where we need to find the position of an integer in an array of
integers. The first intuition is to traverse the array sequentially rather than iterating through it
in random order.
This is obvious only because we know how the CPU cache works: sequential access is faster
than random access.
The equivalent choice may not be as obvious in Spark, because Spark's internals work a little
differently.
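The array example can be sketched in plain Scala. This is a minimal illustration, not a benchmark: the object and method names are hypothetical, and any speed difference depends on the array size and the hardware. Both functions return the same position; they differ only in the order in which memory is visited.

```scala
import scala.util.Random

object AccessOrder {
  // Visit indices 0, 1, 2, ... in order; returns the position of target, or -1.
  def findSequential(data: Array[Int], target: Int): Int = {
    var i = 0
    while (i < data.length) {
      if (data(i) == target) return i
      i += 1
    }
    -1
  }

  // Visit every index exactly once, but in a shuffled order.
  // The seed only makes the visiting order reproducible.
  def findRandomOrder(data: Array[Int], target: Int, seed: Long): Int = {
    val order = new Random(seed).shuffle(data.indices.toVector)
    order.find(i => data(i) == target).getOrElse(-1)
  }
}
```

On a large array the sequential version tends to benefit from cache lines and prefetching, while the shuffled version jumps between unrelated memory locations.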
Spark Internals
Execution model of a Job
Example Job
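val sc = new SparkContext(...)
val file = sc.textFile(…)        // RDD
val errors = file.filter(…)      // RDD, derived by a transformation
errors.cache()                   // keep the filtered RDD in memory once computed
errors.count()                   // action: triggers the job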
Resilient Distributed Dataset
An RDD is a read-only (immutable), partitioned collection of records. It can
be held in volatile memory or in persistent storage (HDFS, HBase, etc.) and
can be converted into another RDD through transformations. Actions, such as
count, can also be applied to an RDD.
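The split between transformations and actions can be modelled in a few lines of plain Scala. This is a toy sketch, not Spark's RDD implementation: MiniRDD, fromSeq and the partitioning rule are hypothetical names and choices made for illustration. Transformations only record what to do and return a new immutable object; an action runs the recorded work.

```scala
// Toy model only: NOT Spark's RDD. All names here are hypothetical.
final class MiniRDD[A](compute: () => Vector[Vector[A]]) {
  // Transformations return a new MiniRDD and do no work yet.
  def map[B](f: A => B): MiniRDD[B] =
    new MiniRDD[B](() => compute().map(_.map(f)))

  def filter(p: A => Boolean): MiniRDD[A] =
    new MiniRDD[A](() => compute().map(_.filter(p)))

  // Actions run the recorded computation and return a plain value.
  def count(): Long = compute().map(_.size.toLong).sum

  def collect(): Vector[A] = compute().flatten
}

object MiniRDD {
  // Split the input into roughly equal partitions.
  def fromSeq[A](data: Seq[A], numPartitions: Int): MiniRDD[A] = {
    val size = math.max(1, math.ceil(data.size.toDouble / numPartitions).toInt)
    val parts = data.grouped(size).map(_.toVector).toVector
    new MiniRDD[A](() => parts)
  }
}
```

For example, MiniRDD.fromSeq(logLines, 2).filter(_.startsWith("ERROR")).count() does no filtering until count() is called. In this toy, every action re-runs the whole recorded chain, which is the repeated cost that caching an RDD is meant to avoid.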
Components
Program
Spark Master
Spark Worker
Cluster Manager
HDFS, HBase
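Scheduling pipeline (from the diagram):
• RDD Objects: build the operator DAG, e.g. rdd1.join(rdd2).groupBy(..).filter(..)
• DAG Scheduler: receives the DAG, splits the graph into stages of tasks, and submits each stage as it becomes ready
• Each stage is handed on as a TaskSet, which is made up of individual Tasks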
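The stage-splitting step can be illustrated with a toy model. Spark's scheduler cuts stages at shuffle boundaries; the sketch below assumes a simple linear chain of operations instead of a general DAG, and the Op and StageSplitter names, as well as the needsShuffle flags, are hypothetical and set by hand for illustration.

```scala
// Toy model only: hypothetical names, and a linear chain instead of a general DAG.
final case class Op(name: String, needsShuffle: Boolean)

object StageSplitter {
  // Start a new stage at every operation that needs a shuffle;
  // all other operations are appended to the current stage.
  def split(ops: List[Op]): List[List[Op]] =
    ops.foldLeft(List.empty[List[Op]]) {
      case (Nil, op)                       => List(List(op))
      case (stages, op) if op.needsShuffle => List(op) :: stages
      case (current :: done, op)           => (op :: current) :: done
    }.map(_.reverse).reverse
}
```

With the chain read, join, groupBy, filter, where join and groupBy are flagged as shuffling, the result is three stages: (read), (join), and (groupBy, filter). The filter stays in the same stage as groupBy because it needs no data movement of its own.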
What is Spark Streaming?
Overview
Run a streaming computation as a series of very small, deterministic batch jobs
- Spark Streaming chops up the live stream into batches of X seconds
- Spark treats each batch of data as RDDs and processes them using RDD operations
- Finally, the processed results of the RDD operations are returned in batches
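The micro-batch idea can be sketched without Spark, using plain Scala collections. This is a toy model under stated assumptions: Event, MicroBatch and the word-count job are hypothetical, timestamps are assumed to be non-negative milliseconds, and the "stream" is a finite sequence already in memory.

```scala
// Toy model only: plain Scala collections, no Spark. All names are hypothetical.
final case class Event(timestampMs: Long, text: String)

object MicroBatch {
  // Chop the stream into consecutive windows of batchMs milliseconds,
  // keyed by batch number and ordered by time.
  def batches(events: Seq[Event], batchMs: Long): Seq[(Long, Seq[Event])] =
    events
      .groupBy(e => e.timestampMs / batchMs)
      .toSeq
      .sortBy(_._1)

  // Run the same deterministic batch job (a word count) on every batch.
  def wordCountsPerBatch(events: Seq[Event], batchMs: Long): Seq[(Long, Int)] =
    batches(events, batchMs).map { case (batchId, batch) =>
      batchId -> batch.flatMap(_.text.split(" ")).count(_.nonEmpty)
    }
}
```

Because each batch is processed by an ordinary, deterministic function, re-running a batch on the same input gives the same result, which is what makes this model convenient for recovery after a failure.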
Example: Get hashtags from Twitter
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => status.getText.split(" ").filter(_.startsWith("#")))  // transformation
hashTags.saveAsHadoopFiles("hdfs://...")
Resulting hashtags: #Ebola, #India, #Mars ...
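The split-and-filter logic of the hashtag example can be tried locally on plain strings, with no streaming context or Twitter credentials. The HashTags object is a hypothetical helper, and the tweet texts used to exercise it would be made-up sample data.

```scala
object HashTags {
  // Same split/filter logic as the streaming example, applied to plain strings.
  def extract(texts: Seq[String]): Seq[String] =
    texts.flatMap(text => text.split(" ").filter(_.startsWith("#")))
}
```

Splitting on a single space is a simplification: a hashtag that follows a newline or a tab would be missed, and trailing punctuation stays attached to the tag.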
Questions
