Apache Hadoop Day 2015
Intro to Apache Spark
LIGHTNING FAST CLUSTER COMPUTING
MapReduce Limitations
• Lots of boilerplate makes it complex to program in MapReduce.
• The disk-based approach is not a good fit for iterative use cases.
• Batch processing is not a fit for real-time workloads.
In short, there is no single solution; people build specialized systems as workarounds.
Spark Goal
Batch, interactive, streaming: a single framework!
Support batch, streaming, and interactive computations in a unified framework.
Easy to develop sophisticated algorithms (e.g., graph and ML algorithms).
Spark Stack
A unified engine across diverse workloads and environments:
• Spark Core – supports Scala, Java, Python, and R
• Spark SQL – interactive queries
• Spark Streaming – real-time processing
• MLlib / spark.ml – machine learning
• GraphX – graph processing
Data processing landscape
Graph processing: Pregel (Google), Giraph (Apache), GraphLab (Dato), …
Data processing landscape
Graph processing: Pregel (Google), Giraph (Apache), GraphLab (Dato), …
SQL: Dremel (Google), Drill (Apache), Impala (Cloudera)
Data processing landscape
Graph processing: Pregel (Google), Giraph (Apache), GraphLab (Dato), …
SQL: Dremel (Google), Drill (Apache), Impala (Cloudera)
DAG: Tez (Apache)
Stream: Storm (Apache)
Data processing landscape
Pregel, Giraph, GraphLab, Dremel, Drill, Impala, Tez, Storm, … Stop the arms race now!
Spark
• Unifies batch, streaming, and interactive computation
• Easy to build sophisticated applications
• Supports iterative, graph-parallel algorithms
• Powerful APIs in Scala, Python, Java
Stack: Spark core (batch, interactive) with Spark Streaming (streaming), Shark SQL (batch, interactive), BlinkDB (interactive), GraphX (data-parallel, iterative), and MLlib (sophisticated algorithms) on top.
MapReduce vs Spark
• MapReduce runs each task in its own process; when the task completes, the process dies (cf. MultithreadedMapper).
• In Spark, by default, many tasks run concurrently as threads on a single executor.
• An MR executor is short-lived and runs one large task.
• A Spark executor is long-lived and runs many small tasks.
• Process creation vs thread creation cost.
Problems in Spark
• Applications cannot share data (mostly RDDs in a SparkContext) without writing to external storage.
• Resource allocation inefficiency [spark.dynamicAllocation.enabled].
• Not exactly designed for interactive applications.
Spark Internals – RDD
• RDD (Resilient Distributed Dataset)
• Lazy & immutable
• Iterative operations before RDD
• Fault tolerant
• The traditional way of achieving fault tolerance
• How does RDD achieve fault tolerance?
• Partition
Spark Internals – RDDs

sc.textFile("hdfs://<input>")
  .filter(_.startsWith("ERROR"))
  .map(_.split(" ")(1))
  .saveAsTextFile("hdfs://<output>")

Stage 1: HDFS → HadoopRDD → FilteredRDD → MappedRDD → HDFS
Spark Internals – RDDs
Narrow vs Wide Dependency
• Narrow dependency – each partition of the parent is used by at most one partition of the child.
• Wide dependency – multiple child partitions may depend on one parent partition.
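A quick way to see the difference is to inspect an RDD's dependencies; a minimal sketch, assuming a running SparkContext sc (e.g., in spark-shell):

// map keeps a narrow (one-to-one) dependency on its parent
val nums = sc.parallelize(1 to 100, 4)
val doubled = nums.map(_ * 2)
println(doubled.dependencies)   // List(org.apache.spark.OneToOneDependency@...)

// groupByKey introduces a wide (shuffle) dependency
val pairs = nums.map(n => (n % 10, n))
val grouped = pairs.groupByKey()
println(grouped.dependencies)   // List(org.apache.spark.ShuffleDependency@...)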
Narrow/Shuffle Dependency – class diagram
Spark Internals – Job Scheduling

rdd1.join(rdd2)
  .groupBy(…)
  .filter(…)

RDD Objects → DAG Scheduler → Task Scheduler → Executor
• DAG Scheduler – splits the DAG into stages and tasks; submits each stage as it becomes ready.
• Task Scheduler – launches individual tasks.
• Executor – executes tasks on task threads; its Block Manager stores & serves blocks.
Resource Allocation
• Dynamic Resource Allocation
• Resource Allocation Policy
  − Request Policy
  − Remove Policy
Request/Remove Policy
Request
• Pending tasks are waiting to be scheduled.
• Spark requests executors in rounds.
• Controlled by spark.dynamicAllocation.schedulerBacklogTimeout & spark.dynamicAllocation.sustainedSchedulerBacklogTimeout.
Remove
• An executor is removed when it has been idle for more than spark.dynamicAllocation.executorIdleTimeout seconds.
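These knobs are set on the SparkConf; a minimal sketch of enabling dynamic allocation (the property names are real Spark settings, the values here are illustrative only):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")  // external shuffle service, needed so executors can be removed safely
  .set("spark.dynamicAllocation.schedulerBacklogTimeout", "1s")
  .set("spark.dynamicAllocation.sustainedSchedulerBacklogTimeout", "1s")
  .set("spark.dynamicAllocation.executorIdleTimeout", "60s")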
Graceful Decommission of Executors
• State before Dynamic Allocation
• With Dynamic Allocation
• Complexity increases with shuffle
• External Shuffle Service
• State of cached data, either on disk or in memory
Fair Scheduler
• What is Fair Scheduling?
• How to enable the Fair Scheduler:

val conf = new SparkConf().setMaster(...).setAppName(...)
conf.set("spark.scheduler.mode", "FAIR")
val sc = new SparkContext(conf)

• Fair Scheduler Pools
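Pools are selected per thread through a local property; a minimal sketch (the pool name "pool1" is an assumption, normally declared in the fair scheduler allocation XML):

// jobs submitted from this thread go to the "pool1" scheduling pool
sc.setLocalProperty("spark.scheduler.pool", "pool1")
// ... run jobs ...
sc.setLocalProperty("spark.scheduler.pool", null)  // revert to the default pool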
RDD Deep Dive
• RDD Basics
• How to create
• RDD Operations
• Lineage
• Partitions
• Shuffle
• Types of RDDs
• Extending RDD
• Caching in RDD
RDD Basics
• RDD (Resilient Distributed Dataset)
• A distributed collection of objects
• Resilient – ability to re-compute missing partitions (node failure)
• Distributed – split across multiple partitions
• Dataset – can contain any type: Python/Java/Scala objects or user-defined objects
• The fundamental unit of data in Spark
RDD Basics – How to create
Two ways:
• Loading external datasets
  − Spark supports a wide range of sources.
  − Access HDFS data through Hadoop's InputFormat & OutputFormat.
  − Supports custom input/output formats.
• Parallelizing a collection in the driver program

val lineRDD = sc.textFile("hdfs:///path/to/Readme.md")
textFile("/my/directory/*") or textFile("/my/directory/*.gz")
SparkContext.wholeTextFiles returns (filename, content) pairs
val listRDD = sc.parallelize(List("spark", "meetup", "deepdive"))
RDD Operations
• Two types of operations: transformations and actions.
• Transformations are lazy; nothing actually happens until an action is called.
• An action triggers the computation.
• An action returns values to the driver or writes data to external storage.
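A minimal sketch of the transformation/action split, assuming a SparkContext sc:

val lines = sc.textFile("hdfs:///path/to/Readme.md")   // transformation: nothing runs yet
val errors = lines.filter(_.contains("ERROR"))         // transformation: still lazy
val n = errors.count()                                 // action: the job actually executes
errors.saveAsTextFile("hdfs:///path/to/errors")        // action: writes to external storage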
Lazy Evaluation
− Transformations on an RDD don't get performed immediately.
− Spark internally records metadata to track the operations.
− Loading data into an RDD is also lazily evaluated.
− Lazy evaluation reduces the number of passes over the data by grouping operations.
− In MapReduce, the burden is on the developer to merge operations, leading to complex map functions.
− Failing to persist an RDD means the complete lineage is re-computed every time.
RDD In Action

sc.textFile("hdfs://file.txt")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .collect()

Input: "I scream you scream lets all scream for icecream!" / "I wish I were what I was when I wished I were what I am."
After flatMap: I, scream, you, scream, lets, all, scream, for, icecream, …
After map: (I,1), (scream,1), (you,1), (scream,1), (lets,1), (all,1), (scream,1), (icecream,1), …
After reduceByKey and collect: (icecream,1), (scream,3), (you,1), (lets,1), (I,1), (all,1), …
Lineage Demo
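The demo itself is not in the deck; a minimal sketch of inspecting lineage with toDebugString, reusing the word-count pipeline above:

val counts = sc.textFile("hdfs://file.txt")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
println(counts.toDebugString)   // prints the lineage: ShuffledRDD <- MapPartitionsRDD <- ... <- HadoopRDD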
RDD Partition
• Partition definition
  − Fragments of an RDD.
  − Fragmentation allows Spark to execute in parallel.
  − Partitions are distributed across the cluster (Spark workers).
• Partitioning
  − Impacts parallelism.
  − Impacts performance.
Importance of Partition Tuning
• Too few partitions
  − Less concurrency, unused cores.
  − More susceptible to data skew.
  − Increased memory pressure for groupBy, reduceByKey, sortByKey, etc.
• Too many partitions
  − Framework overhead (more scheduling latency than the time needed for the actual task).
  − Many CPU context switches.
• Need a "reasonable number" of partitions
  − Commonly between 100 and 10,000 partitions.
  − Lower bound: at least ~2x the number of cores in the cluster.
  − Upper bound: ensure tasks take at least 100 ms.
How Spark Partitions Data
• Input data partitioning
• Shuffle transformations
• Custom Partitioner
Partition – Input Data
• Spark uses the same classes as Hadoop to perform input/output.
• sc.textFile("hdfs://…") invokes Hadoop's TextInputFormat.
• Knobs that define the number of partitions:
  − dfs.block.size – default 128 MB (Hadoop 2.0)
  − numPartitions – can be used to increase the number of partitions; default is 0, which means 1 partition
  − mapreduce.input.fileinputformat.split.minsize – default 1 KB
• Partition size = Max(minSize, Min(goalSize, blockSize)), where goalSize = totalInputSize / numPartitions.
• Examples (blockSize, numPartitions, minSize; total input size 640 MB):
  − 32 MB, 0, 1 KB (defaults): Max(1 KB, Min(640 MB, 32 MB)) → 20 partitions
  − 32 MB, 30, 1 KB (want more partitions): Max(1 KB, Min(32 MB, 32 MB)) → 32 partitions
  − 32 MB, 5, 1 KB: Max(1 KB, Min(120 MB, 32 MB)) → 20 bigger partitions
  − 32 MB, 0, 64 MB: Max(64 MB, Min(640 MB, 32 MB)) → 10 bigger partitions
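In the RDD API this knob is the minPartitions argument of textFile (a hint, not an exact count); a minimal sketch, with an illustrative path:

// ask for at least 30 input partitions; the actual count still depends on block/split sizes
val logs = sc.textFile("hdfs:///data/logs/*", minPartitions = 30)
println(logs.partitions.length)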
Partition – Shuffle Transformations
• All shuffle transformations provide a parameter for the desired number of partitions.
• Default behavior – Spark uses a HashPartitioner.
  − If spark.default.parallelism is set, takes that as the number of partitions.
  − If spark.default.parallelism is not set, uses the largest upstream RDD's number of partitions.
  − Reduces the chances of out-of-memory errors.
Shuffle transformations:
1. groupByKey
2. reduceByKey
3. aggregateByKey
4. sortByKey
5. join
6. cogroup
7. cartesian
8. coalesce
9. repartition
10. repartitionAndSortWithinPartitions
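Each of these accepts the desired partition count directly; a minimal sketch:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val summed = pairs.reduceByKey(_ + _, 8)   // explicit: 8 output partitions
println(summed.partitions.length)          // 8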
Partition – Repartitioning
RDD provides two operators:
• repartition(numPartitions)
  − Can increase/decrease the number of partitions.
  − Internally does a shuffle, so it is expensive.
  − For decreasing partitions, use coalesce.
• coalesce(numPartitions, shuffle: [true/false])
  − Decreases partitions.
  − Goes for narrow dependencies, avoiding a shuffle.
  − In case of a drastic reduction, may trigger a shuffle.
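A minimal sketch of the two operators:

val data = sc.parallelize(1 to 1000, 100)
val fewer = data.coalesce(10)        // narrow dependencies, no shuffle
val more  = data.repartition(200)    // full shuffle up to 200 partitions
println(fewer.partitions.length)     // 10
println(more.partitions.length)      // 200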
Custom Partitioner
• Partition the data according to your use case & data structure.
• Provides control over the number of partitions and the distribution of data.
• Extend the Partitioner class; need to implement getPartition & numPartitions.
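A minimal sketch of the API (the routing logic is illustrative only):

import org.apache.spark.Partitioner

// route keys starting with 'a'..'m' to partition 0, everything else to partition 1
class FirstLetterPartitioner extends Partitioner {
  override def numPartitions: Int = 2
  override def getPartition(key: Any): Int =
    if (key.toString.headOption.exists(_ <= 'm')) 0 else 1
}

val pairs = sc.parallelize(Seq(("apple", 1), ("zebra", 2)))
val parted = pairs.partitionBy(new FirstLetterPartitioner)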
Partitioning Demo
Shuffle – GroupByKey vs ReduceByKey

val wordCountsWithGroup = rdd
  .groupByKey()
  .map(t => (t._1, t._2.sum))
  .collect()
Shuffle – GroupByKey vs ReduceByKey

val wordPairsRDD = rdd.map(word => (word, 1))
val wordCountsWithReduce = wordPairsRDD
  .reduceByKey(_ + _)
  .collect()
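The diagrams that accompanied these two slides are not reproduced, but the takeaway they illustrate: reduceByKey combines values on the map side before shuffling, so far less data crosses the network, whereas groupByKey shuffles every (word, 1) pair and only then sums them. Prefer reduceByKey (or aggregateByKey) when results can be combined incrementally.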
The Shuffle
• Redistribution of data among partitions between stages.
• Most of the performance, reliability, and scalability issues in Spark occur within the shuffle.
• Like MapReduce, the Spark shuffle uses a pull model.
• Has consistently evolved and is still an area of research in Spark.
Shuffle Overview
• Spark runs jobs stage by stage.
• Stages are built up by the DAGScheduler according to each RDD's ShuffleDependency.
  − e.g., ShuffleRDD / CoGroupedRDD will have a ShuffleDependency.
• Many operators create ShuffleRDD / CoGroupedRDD under the hood:
  − repartition / combineByKey / groupBy / reduceByKey / cogroup.
• Many other operators further call into the above operators.
  − e.g., the various join operators call cogroup.
You have seen this
DAG example: RDDs A–G connected by map, union, groupBy, and join operators, split into Stage 1, Stage 2, and Stage 3 at shuffle boundaries.
Shuffle is Expensive
• When doing a shuffle, data no longer stays in memory only; it gets written to disk.
• For Spark, the shuffle process might involve:
  − Data partitioning, which might involve very expensive data sorting work, etc.
  − Data ser/deser, to enable data to be transferred through the network or across processes.
  − Data compression, to reduce IO bandwidth, etc.
  − Disk IO, probably multiple times on one single data block (e.g., shuffle spill, merge combine).
Shuffle History
• The shuffle module in Spark has evolved over time.
• Spark 0.6–0.7 – same code path as RDD's persist method; MEMORY_ONLY and DISK_ONLY options available.
• Spark 0.8–0.9
  − Separate code path for shuffle: ShuffleBlockManager & BlockObjectWriter for shuffle only.
  − Shuffle optimization: consolidated shuffle writes.
• Spark 1.0 – introduced a pluggable shuffle framework.
• Spark 1.1 – sort-based shuffle implementation.
• Spark 1.2 – Netty transfer implementation; sort-based shuffle is the default now.
• Spark 1.2+ – external shuffle service, etc.
Understanding Shuffle
• Input aggregation
• Types of shuffle
  − Hash-based
    − Basic hash shuffle
    − Consolidated hash shuffle
  − Sort-based shuffle
Input Aggregation
• Like MapReduce, Spark performs map-side aggregation (a combiner).
• Aggregation is done in ShuffleMapTask using:
  − AppendOnlyMap (in-memory hash table combiner)
    − Keys are never removed; values get updated.
  − ExternalAppendOnlyMap (in-memory and on-disk hash table combiner)
    − A hash map that can spill to disk.
    − An append-only map that spills data to disk if memory is insufficient.
• Shuffle file in-memory buffer – the shuffle writes to an in-memory buffer before writing to a shuffle file.
Shuffle Types – Basic Hash Shuffle
• Hash-based shuffle (spark.shuffle.manager); hash-partitions the data for the reducers.
• Each map task writes each bucket to a file.
• #Map tasks = M, #Reduce tasks = R
• #Shuffle files = M*R, #In-memory buffers = M*R
Shuffle Types – Basic Hash Shuffle
• Problem
  − Let's use 100 KB as the buffer size.
  − We have 10,000 reducers and 10 mapper tasks per executor.
  − The in-memory buffer size will be 100 KB * 10,000 * 10.
  − The buffer needed will be 10 GB per executor.
  − This huge amount of buffer is not acceptable, and this implementation can't support 10,000 reducers.
Shuffle Types – Consolidated Hash Shuffle
• Solution to decrease the in-memory buffer size and the number of files.
• Within an executor, map tasks write each bucket to a segment of the file.
• #Shuffle files per executor = #Reducers; #In-memory buffers per executor = #Reducers (R).
Shuffle Types – Sort-Based Shuffle
• Consolidated hash shuffle still needs one file per reducer.
  − Total C*R intermediate files, where C = number of executors running map tasks.
• Still too many files (e.g., ~10k reducers):
  − Needs significant memory for compression & serialization buffers.
  − Leads to the "too many open files" issue.
• Sort-based shuffle is similar to the map-side shuffle of MapReduce.
• Introduced in Spark 1.1; now the default shuffle.
Shuffle Types – Sort-Based Shuffle
• Map output records from each task are kept in memory as long as they fit.
• Once full, the data gets sorted by partition and spilled to a single file.
• Each map task generates one data file and one index file.
• Utilizes an external sorter to do the sort work.
• If a map-side combiner is required, data will be sorted by key and partition; otherwise only by partition.
• With #reducers <= 200 and no sorting needed, it uses the hash approach, generating a file per reducer and merging them into a single file.
Shuffle Reader
• On the read side, both sort and hash shuffle use the hash shuffle reader.
• On the reducer side, a set of threads fetches remote map output blocks.
• Once a block arrives, its records are deserialized and passed into a result queue.
• Records are passed to ExternalAppendOnlyMap; for ordering operations like sortByKey, records are passed to an ExternalSorter.
(Diagram: map-side shuffle buckets fetched by Reduce Tasks, each feeding an Aggregator.)
Types of RDDs – RDD Interface
Base for all RDDs (RDD.scala); consists of:
• A set of partitions ("splits" in Hadoop)
• A list of dependencies on parent RDDs
• A function to compute each partition from its parents
• Optional preferred locations for each partition
• An optional Partitioner that defines the partitioning strategy (hash/range)
• Basic operations like map, filter, persist, etc.
Together these capture lineage and enable optimized execution.
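The same five pieces can be written down as a Scala trait; a simplified mirror of the interface, not the exact signatures in RDD.scala:

import org.apache.spark.{Dependency, Partition, Partitioner, TaskContext}

trait RDDShape[T] {
  def getPartitions: Array[Partition]                                // the set of partitions ("splits")
  def getDependencies: Seq[Dependency[_]]                            // lineage: parent RDDs
  def compute(split: Partition, context: TaskContext): Iterator[T]   // compute one partition from its parents
  def getPreferredLocations(split: Partition): Seq[String]           // optional locality hints
  def partitioner: Option[Partitioner]                               // optional hash/range strategy
}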
Example: HadoopRDD
• partitions = one per HDFS block
• dependencies = none
• compute(partition) = read the corresponding block
• preferredLocations(part) = HDFS block locations
• partitioner = none
Example: MapPartitionsRDD
• partitions = same as parent
• dependencies = "one-to-one" on the parent RDD
• compute(partition) = apply map to the parent partition
• preferredLocations(part) = none (ask parent)
• partitioner = none
Example: CoGroupedRDD
• partitions = one per reduce task
• dependencies = can be narrow or wide, per parent
• compute(partition) = read and join shuffled data
• preferredLocations(part) = none
• partitioner = HashPartitioner(numTasks)
Extending RDDs
Extend RDDs to:
• Add transformations/actions
  − Allows developers to express domain-specific calculations in a cleaner way.
  − Improves code readability.
  − Easy to maintain.
• Create a custom RDD for an input source or domain
  − A way to add a new input data source.
  − A better way to express domain-specific data.
  − Better control over partitioning and distribution.
How to Extend
• Add custom operators to an RDD
  − Use Scala implicits.
  − Feels and works like a built-in operator.
  − You can add an operator to a specific RDD type or to all RDDs.
• Custom RDD
  − Extend the RDD API to create our own RDD.
  − Implement the compute & getPartitions abstract methods.
Implicit Class
• Creates an extension method on an existing type.
• Introduced in Scala 2.10.
• Implicits are compile-time checked; an implicit class gets desugared into a class definition plus an implicit conversion.
• We will use an implicit class to add a new method to RDD.
Adding a New Operator to RDD
• We will use Scala's implicit feature to add a new operator to an existing RDD.
• This operator will show up only on our RDD type.
• The implicit conversion is handled by Scala.
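A minimal sketch of the pattern in the spark-shell (the operator and its logic are illustrative only):

import org.apache.spark.rdd.RDD

// adds a countDistinctWords operator to any RDD[String]
implicit class RichStringRDD(rdd: RDD[String]) {
  def countDistinctWords(): Long =
    rdd.flatMap(_.split(" ")).distinct().count()
}

// the implicit conversion kicks in automatically at the call site
val n = sc.textFile("hdfs://file.txt").countDistinctWords()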
Custom RDD Implementation
• Extending RDD allows you to create your own custom RDD structure.
• A custom RDD gives control over the computation and can change partitioning & locality information.
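A minimal sketch of a custom RDD, a toy range generator, assuming a running SparkContext sc; a real input source would read its data inside compute:

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

class RangePartition(val index: Int, val start: Int, val end: Int) extends Partition

// produces the numbers 0 until n, split into numSlices partitions
class SimpleRangeRDD(sc: SparkContext, n: Int, numSlices: Int)
    extends RDD[Int](sc, Nil) {   // Nil: no parent dependencies

  override protected def getPartitions: Array[Partition] =
    (0 until numSlices).map { i =>
      new RangePartition(i, i * n / numSlices, (i + 1) * n / numSlices): Partition
    }.toArray

  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val p = split.asInstanceOf[RangePartition]
    (p.start until p.end).iterator
  }
}

val r = new SimpleRangeRDD(sc, 100, 4)
println(r.count())   // 100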
Caching in RDD
• Spark allows caching/persisting an entire dataset in memory.
• Persisting an RDD in the cache:
  − The first time it is computed, it is kept in memory.
  − The cached partitions are reused in the next set of operations.
  − Fault-tolerant: recomputed in case of failure.
• Caching is a key tool for interactive and iterative algorithms.
• persist supports different storage levels:
  − Storage level – in memory, on disk, or both; Tachyon.
  − Serialized vs deserialized.
Caching in RDD
• The SparkContext tracks persistent RDDs.
• The Block Manager puts a partition in memory when it is first evaluated.
• The cache is lazily evaluated: no caching happens without an action.
• The shuffle also keeps its data in the cache after shuffle operations.
• We still need to cache shuffled RDDs ourselves.
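A minimal sketch of persisting an RDD (the storage levels named are real; the path is illustrative):

import org.apache.spark.storage.StorageLevel

val errors = sc.textFile("hdfs://file.txt").filter(_.contains("ERROR"))
errors.cache()   // shorthand for persist(StorageLevel.MEMORY_ONLY)
// errors.persist(StorageLevel.MEMORY_AND_DISK_SER)   // alternative level
println(errors.count())   // first action computes the RDD and caches its partitions
println(errors.count())   // second action is served from the cache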
Caching Demo

More Related Content

What's hot (20)

Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
Databricks
 
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Spark Summit
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark Internals
Pietro Michiardi
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Dremio Corporation
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Databricks
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Rahul Jain
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
Databricks
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
Chandler Huang
 
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
Spark Summit
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
Venkata Naga Ravi
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
Zahra Eskandari
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper Optimization
Databricks
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
datamantra
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
Ryan Blue
 
Data Source API in Spark
Data Source API in SparkData Source API in Spark
Data Source API in Spark
Databricks
 
Spark architecture
Spark architectureSpark architecture
Spark architecture
GauravBiswas9
 
Apache Flink and what it is used for
Apache Flink and what it is used forApache Flink and what it is used for
Apache Flink and what it is used for
Aljoscha Krettek
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical Optimization
Databricks
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
Databricks
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQL
Databricks
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
Databricks
 
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Spark Summit
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark Internals
Pietro Michiardi
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Dremio Corporation
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Databricks
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Rahul Jain
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
Databricks
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
Chandler Huang
 
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
Spark Summit
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
Venkata Naga Ravi
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
Zahra Eskandari
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper Optimization
Databricks
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
datamantra
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
Ryan Blue
 
Data Source API in Spark
Data Source API in SparkData Source API in Spark
Data Source API in Spark
Databricks
 
Apache Flink and what it is used for
Apache Flink and what it is used forApache Flink and what it is used for
Apache Flink and what it is used for
Aljoscha Krettek
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical Optimization
Databricks
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
Databricks
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQL
Databricks
 

Viewers also liked (20)

Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst OptimizerDeep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Sachin Aggarwal
 
Resilient Distributed DataSets - Apache SPARK
Resilient Distributed DataSets - Apache SPARKResilient Distributed DataSets - Apache SPARK
Resilient Distributed DataSets - Apache SPARK
Taposh Roy
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
colorant
 
Anatomy of Data Source API : A deep dive into Spark Data source API
Anatomy of Data Source API : A deep dive into Spark Data source APIAnatomy of Data Source API : A deep dive into Spark Data source API
Anatomy of Data Source API : A deep dive into Spark Data source API
datamantra
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDs
Dean Chen
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Databricks
 
Resilient Distributed Datasets
Resilient Distributed DatasetsResilient Distributed Datasets
Resilient Distributed Datasets
Alessandro Menabò
 
OCF.tw's talk about "Introduction to spark"
OCF.tw's talk about "Introduction to spark"OCF.tw's talk about "Introduction to spark"
OCF.tw's talk about "Introduction to spark"
Giivee The
 
Spark
SparkSpark
Spark
Nitish Upreti
 
Topfoison product catalog
Topfoison product catalogTopfoison product catalog
Topfoison product catalog
Lynapple1022
 
Transformations and actions a visual guide training
Transformations and actions a visual guide trainingTransformations and actions a visual guide training
Transformations and actions a visual guide training
Spark Summit
 
Fully fault tolerant real time data pipeline with docker and mesos
Fully fault tolerant real time data pipeline with docker and mesos Fully fault tolerant real time data pipeline with docker and mesos
Fully fault tolerant real time data pipeline with docker and mesos
Rahul Kumar
 
Remote temperature monitor (DHT11)
Remote temperature monitor (DHT11)Remote temperature monitor (DHT11)
Remote temperature monitor (DHT11)
Parshwadeep Lahane
 
Data Science with Spark by Saeed Aghabozorgi
Data Science with Spark by Saeed Aghabozorgi Data Science with Spark by Saeed Aghabozorgi
Data Science with Spark by Saeed Aghabozorgi
Sachin Aggarwal
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax Enablement
Vincent Poncet
 
Spark on Yarn
Spark on YarnSpark on Yarn
Spark on Yarn
Qubole
 
Spark Technology Center IBM
Spark Technology Center IBMSpark Technology Center IBM
Spark Technology Center IBM
DataWorks Summit/Hadoop Summit
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
Databricks
 
Spark fundamentals i (bd095 en) version #1: updated: april 2015
Spark fundamentals i (bd095 en) version #1: updated: april 2015Spark fundamentals i (bd095 en) version #1: updated: april 2015
Spark fundamentals i (bd095 en) version #1: updated: april 2015
Ashutosh Sonaliya
 
Unikernels: in search of a killer app and a killer ecosystem
Unikernels: in search of a killer app and a killer ecosystemUnikernels: in search of a killer app and a killer ecosystem
Unikernels: in search of a killer app and a killer ecosystem
rhatr
 
Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst OptimizerDeep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Sachin Aggarwal
 
Resilient Distributed DataSets - Apache SPARK
Resilient Distributed DataSets - Apache SPARKResilient Distributed DataSets - Apache SPARK
Resilient Distributed DataSets - Apache SPARK
Taposh Roy
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
colorant
 
Anatomy of Data Source API : A deep dive into Spark Data source API
Anatomy of Data Source API : A deep dive into Spark Data source APIAnatomy of Data Source API : A deep dive into Spark Data source API
Anatomy of Data Source API : A deep dive into Spark Data source API
datamantra
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDs
Dean Chen
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Databricks
 
Resilient Distributed Datasets
Resilient Distributed DatasetsResilient Distributed Datasets
Resilient Distributed Datasets
Alessandro Menabò
 
OCF.tw's talk about "Introduction to spark"
OCF.tw's talk about "Introduction to spark"OCF.tw's talk about "Introduction to spark"
OCF.tw's talk about "Introduction to spark"
Giivee The
 
Topfoison product catalog
Topfoison product catalogTopfoison product catalog
Topfoison product catalog
Lynapple1022
 
Transformations and actions a visual guide training
Transformations and actions a visual guide trainingTransformations and actions a visual guide training
Transformations and actions a visual guide training
Spark Summit
 
Fully fault tolerant real time data pipeline with docker and mesos
Fully fault tolerant real time data pipeline with docker and mesos Fully fault tolerant real time data pipeline with docker and mesos
Fully fault tolerant real time data pipeline with docker and mesos
Rahul Kumar
 
Remote temperature monitor (DHT11)
Remote temperature monitor (DHT11)Remote temperature monitor (DHT11)
Remote temperature monitor (DHT11)
Parshwadeep Lahane
 
Data Science with Spark by Saeed Aghabozorgi
Data Science with Spark by Saeed Aghabozorgi Data Science with Spark by Saeed Aghabozorgi
Data Science with Spark by Saeed Aghabozorgi
Sachin Aggarwal
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax Enablement
Vincent Poncet
 
Spark on Yarn
Spark on YarnSpark on Yarn
Spark on Yarn
Qubole
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
Databricks
 
Spark fundamentals i (bd095 en) version #1: updated: april 2015
Spark fundamentals i (bd095 en) version #1: updated: april 2015Spark fundamentals i (bd095 en) version #1: updated: april 2015
Spark fundamentals i (bd095 en) version #1: updated: april 2015
Ashutosh Sonaliya
 
Unikernels: in search of a killer app and a killer ecosystem
Unikernels: in search of a killer app and a killer ecosystemUnikernels: in search of a killer app and a killer ecosystem
Unikernels: in search of a killer app and a killer ecosystem
rhatr
 
Ad

Similar to Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive (20)

TriHUG talk on Spark and Shark
TriHUG talk on Spark and SharkTriHUG talk on Spark and Shark
TriHUG talk on Spark and Shark
trihug
 
IBM Spark Meetup - RDD & Spark Basics
IBM Spark Meetup - RDD & Spark BasicsIBM Spark Meetup - RDD & Spark Basics
IBM Spark Meetup - RDD & Spark Basics
Satya Narayan
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Anastasios Skarlatidis
 
Ten tools for ten big data areas 03_Apache Spark
Ten tools for ten big data areas 03_Apache SparkTen tools for ten big data areas 03_Apache Spark
Ten tools for ten big data areas 03_Apache Spark
Will Du
 
Apache Spark Workshop
Apache Spark WorkshopApache Spark Workshop
Apache Spark Workshop
Michael Spector
 
Bigdata processing with Spark - part II
Bigdata processing with Spark - part IIBigdata processing with Spark - part II
Bigdata processing with Spark - part II
Arjen de Vries
 
Introduction to Spark Training
Introduction to Spark TrainingIntroduction to Spark Training
Introduction to Spark Training
Spark Summit
 
Intro to Spark development
 Intro to Spark development  Intro to Spark development
Intro to Spark development
Spark Summit
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Vincent Poncet
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student Slides
Databricks
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Mohamed hedi Abidi
 
Boston Spark Meetup event Slides Update
Boston Spark Meetup event Slides UpdateBoston Spark Meetup event Slides Update
Boston Spark Meetup event Slides Update
vithakur
 
Advanced spark training advanced spark internals and tuning reynold xin
Advanced spark training advanced spark internals and tuning reynold xinAdvanced spark training advanced spark internals and tuning reynold xin
Advanced spark training advanced spark internals and tuning reynold xin
caidezhi655
 
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015 Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Sameer Farooqui
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
Anton Kirillov
 
SparkNotes
SparkNotesSparkNotes
SparkNotes
Demet Aksoy
 
Stefano Baghino - From Big Data to Fast Data: Apache Spark
Stefano Baghino - From Big Data to Fast Data: Apache SparkStefano Baghino - From Big Data to Fast Data: Apache Spark
Stefano Baghino - From Big Data to Fast Data: Apache Spark
Codemotion
 
Scala Meetup Hamburg - Spark
Scala Meetup Hamburg - SparkScala Meetup Hamburg - Spark
Scala Meetup Hamburg - Spark
Ivan Morozov
 
Unit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxUnit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptx
Rahul Borate
 
Bigdata workshop february 2015
Bigdata workshop  february 2015 Bigdata workshop  february 2015
Bigdata workshop february 2015
clairvoyantllc
 
TriHUG talk on Spark and Shark
TriHUG talk on Spark and SharkTriHUG talk on Spark and Shark
TriHUG talk on Spark and Shark
trihug
 
IBM Spark Meetup - RDD & Spark Basics
IBM Spark Meetup - RDD & Spark BasicsIBM Spark Meetup - RDD & Spark Basics
IBM Spark Meetup - RDD & Spark Basics
Satya Narayan
 
Ten tools for ten big data areas 03_Apache Spark
Ten tools for ten big data areas 03_Apache SparkTen tools for ten big data areas 03_Apache Spark
Ten tools for ten big data areas 03_Apache Spark
Will Du
 
Bigdata processing with Spark - part II
Bigdata processing with Spark - part IIBigdata processing with Spark - part II
Bigdata processing with Spark - part II
Arjen de Vries
 
Introduction to Spark Training
Introduction to Spark TrainingIntroduction to Spark Training
Introduction to Spark Training
Spark Summit
 
Intro to Spark development
 Intro to Spark development  Intro to Spark development
Intro to Spark development
Spark Summit
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Vincent Poncet
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student Slides
Databricks
 
Boston Spark Meetup event Slides Update
Boston Spark Meetup event Slides UpdateBoston Spark Meetup event Slides Update
Boston Spark Meetup event Slides Update
vithakur
 
Advanced spark training advanced spark internals and tuning reynold xin
Advanced spark training advanced spark internals and tuning reynold xinAdvanced spark training advanced spark internals and tuning reynold xin
Advanced spark training advanced spark internals and tuning reynold xin
caidezhi655
 
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015 Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Sameer Farooqui
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
Anton Kirillov
 
Stefano Baghino - From Big Data to Fast Data: Apache Spark
Stefano Baghino - From Big Data to Fast Data: Apache SparkStefano Baghino - From Big Data to Fast Data: Apache Spark
Stefano Baghino - From Big Data to Fast Data: Apache Spark
Codemotion
 
Scala Meetup Hamburg - Spark
Scala Meetup Hamburg - SparkScala Meetup Hamburg - Spark
Scala Meetup Hamburg - Spark
Ivan Morozov
 
Unit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxUnit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptx
Rahul Borate
 
Bigdata workshop february 2015
Bigdata workshop  february 2015 Bigdata workshop  february 2015
Bigdata workshop february 2015
clairvoyantllc
 
Ad

Recently uploaded (20)

LDMMIA Bonus GUEST GRAD Student Check-in
LDMMIA Bonus GUEST GRAD Student Check-inLDMMIA Bonus GUEST GRAD Student Check-in
LDMMIA Bonus GUEST GRAD Student Check-in
LDM & Mia eStudios
 
LDMMIA About me 2025 Edition 3 College Volume
LDMMIA About me 2025 Edition 3 College VolumeLDMMIA About me 2025 Edition 3 College Volume
LDMMIA About me 2025 Edition 3 College Volume
LDM & Mia eStudios
 
PHYSIOLOGY & SPORTS INJURY by Diwakar Sir
PHYSIOLOGY & SPORTS INJURY by Diwakar SirPHYSIOLOGY & SPORTS INJURY by Diwakar Sir
PHYSIOLOGY & SPORTS INJURY by Diwakar Sir
Diwakar Kashyap
 
IDSP(INTEGRATED DISEASE SURVEILLANCE PROGRAMME...
IDSP(INTEGRATED DISEASE SURVEILLANCE PROGRAMME...IDSP(INTEGRATED DISEASE SURVEILLANCE PROGRAMME...
IDSP(INTEGRATED DISEASE SURVEILLANCE PROGRAMME...
SweetytamannaMohapat
 
LDMMIA Free Reiki Yoga S7 Weekly Workshops
LDMMIA Free Reiki Yoga S7 Weekly WorkshopsLDMMIA Free Reiki Yoga S7 Weekly Workshops
LDMMIA Free Reiki Yoga S7 Weekly Workshops
LDM & Mia eStudios
 
State institute of educational technology
State institute of educational technologyState institute of educational technology
State institute of educational technology
vp5806484
 
Writing Research Papers: Guidance for Research Community
Writing Research Papers: Guidance for Research CommunityWriting Research Papers: Guidance for Research Community
Writing Research Papers: Guidance for Research Community
Rishi Bankim Chandra Evening College, Naihati, North 24 Parganas, West Bengal, India
 
Stewart Butler - OECD - How to design and deliver higher technical education ...
Stewart Butler - OECD - How to design and deliver higher technical education ...Stewart Butler - OECD - How to design and deliver higher technical education ...
Stewart Butler - OECD - How to design and deliver higher technical education ...
EduSkills OECD
 
YSPH VMOC Special Report - Measles Outbreak Southwest US 5-30-2025.pptx
YSPH VMOC Special Report - Measles Outbreak  Southwest US 5-30-2025.pptxYSPH VMOC Special Report - Measles Outbreak  Southwest US 5-30-2025.pptx
YSPH VMOC Special Report - Measles Outbreak Southwest US 5-30-2025.pptx
Yale School of Public Health - The Virtual Medical Operations Center (VMOC)
 
A Brief Introduction About Jack Lutkus
A Brief Introduction About  Jack  LutkusA Brief Introduction About  Jack  Lutkus
A Brief Introduction About Jack Lutkus
Jack Lutkus
 
STUDENT LOAN TRUST FUND DEFAULTERS GHANA
STUDENT LOAN TRUST FUND DEFAULTERS GHANASTUDENT LOAN TRUST FUND DEFAULTERS GHANA
STUDENT LOAN TRUST FUND DEFAULTERS GHANA
Kweku Zurek
 
Uterine Prolapse, causes type and classification,its managment
Uterine Prolapse, causes type and classification,its managmentUterine Prolapse, causes type and classification,its managment
Uterine Prolapse, causes type and classification,its managment
Ritu480198
 
Introduction to Online CME for Nurse Practitioners.pdf
Introduction to Online CME for Nurse Practitioners.pdfIntroduction to Online CME for Nurse Practitioners.pdf
Introduction to Online CME for Nurse Practitioners.pdf
CME4Life
 
Order: Odonata Isoptera and Thysanoptera.pptx
Order: Odonata Isoptera and Thysanoptera.pptxOrder: Odonata Isoptera and Thysanoptera.pptx
Order: Odonata Isoptera and Thysanoptera.pptx
Arshad Shaikh
 
প্রত্যুৎপন্নমতিত্ব - Prottutponnomotittwa 2025.pdf
প্রত্যুৎপন্নমতিত্ব - Prottutponnomotittwa 2025.pdfপ্রত্যুৎপন্নমতিত্ব - Prottutponnomotittwa 2025.pdf
প্রত্যুৎপন্নমতিত্ব - Prottutponnomotittwa 2025.pdf
Pragya - UEM Kolkata Quiz Club
 
0b - THE ROMANTIC ERA: FEELINGS AND IDENTITY.pptx
0b - THE ROMANTIC ERA: FEELINGS AND IDENTITY.pptx0b - THE ROMANTIC ERA: FEELINGS AND IDENTITY.pptx
0b - THE ROMANTIC ERA: FEELINGS AND IDENTITY.pptx
Julián Jesús Pérez Fernández
 
How to Create a Stage or a Pipeline in Odoo 18 CRM
How to Create a Stage or a Pipeline in Odoo 18 CRMHow to Create a Stage or a Pipeline in Odoo 18 CRM
How to Create a Stage or a Pipeline in Odoo 18 CRM
Celine George
 
Types of Actions in Odoo 18 - Odoo Slides
Types of Actions in Odoo 18 - Odoo SlidesTypes of Actions in Odoo 18 - Odoo Slides
Types of Actions in Odoo 18 - Odoo Slides
Celine George
 
TechSoup Microsoft Copilot Nonprofit Use Cases and Live Demo - 2025.05.28.pdf
TechSoup Microsoft Copilot Nonprofit Use Cases and Live Demo - 2025.05.28.pdfTechSoup Microsoft Copilot Nonprofit Use Cases and Live Demo - 2025.05.28.pdf
TechSoup Microsoft Copilot Nonprofit Use Cases and Live Demo - 2025.05.28.pdf
TechSoup
 
Search Engine Optimization (SEO) for Website Success
Search Engine Optimization (SEO) for Website SuccessSearch Engine Optimization (SEO) for Website Success
Search Engine Optimization (SEO) for Website Success
muneebrana3215
 
LDMMIA Bonus GUEST GRAD Student Check-in
LDMMIA Bonus GUEST GRAD Student Check-inLDMMIA Bonus GUEST GRAD Student Check-in
LDMMIA Bonus GUEST GRAD Student Check-in
LDM & Mia eStudios
 
LDMMIA About me 2025 Edition 3 College Volume
LDMMIA About me 2025 Edition 3 College VolumeLDMMIA About me 2025 Edition 3 College Volume
LDMMIA About me 2025 Edition 3 College Volume
LDM & Mia eStudios
 
PHYSIOLOGY & SPORTS INJURY by Diwakar Sir
PHYSIOLOGY & SPORTS INJURY by Diwakar SirPHYSIOLOGY & SPORTS INJURY by Diwakar Sir
PHYSIOLOGY & SPORTS INJURY by Diwakar Sir
Diwakar Kashyap
 
IDSP(INTEGRATED DISEASE SURVEILLANCE PROGRAMME...
IDSP(INTEGRATED DISEASE SURVEILLANCE PROGRAMME...IDSP(INTEGRATED DISEASE SURVEILLANCE PROGRAMME...
IDSP(INTEGRATED DISEASE SURVEILLANCE PROGRAMME...
SweetytamannaMohapat
 
LDMMIA Free Reiki Yoga S7 Weekly Workshops
LDMMIA Free Reiki Yoga S7 Weekly WorkshopsLDMMIA Free Reiki Yoga S7 Weekly Workshops
LDMMIA Free Reiki Yoga S7 Weekly Workshops
LDM & Mia eStudios
 
State institute of educational technology
State institute of educational technologyState institute of educational technology
State institute of educational technology
vp5806484
 
Stewart Butler - OECD - How to design and deliver higher technical education ...
Stewart Butler - OECD - How to design and deliver higher technical education ...Stewart Butler - OECD - How to design and deliver higher technical education ...
Stewart Butler - OECD - How to design and deliver higher technical education ...
EduSkills OECD
 
A Brief Introduction About Jack Lutkus
A Brief Introduction About  Jack  LutkusA Brief Introduction About  Jack  Lutkus
A Brief Introduction About Jack Lutkus
Jack Lutkus
 
STUDENT LOAN TRUST FUND DEFAULTERS GHANA
STUDENT LOAN TRUST FUND DEFAULTERS GHANASTUDENT LOAN TRUST FUND DEFAULTERS GHANA
STUDENT LOAN TRUST FUND DEFAULTERS GHANA
Kweku Zurek
 
Uterine Prolapse, causes type and classification,its managment
Uterine Prolapse, causes type and classification,its managmentUterine Prolapse, causes type and classification,its managment
Uterine Prolapse, causes type and classification,its managment
Ritu480198
 
Introduction to Online CME for Nurse Practitioners.pdf
Introduction to Online CME for Nurse Practitioners.pdfIntroduction to Online CME for Nurse Practitioners.pdf
Introduction to Online CME for Nurse Practitioners.pdf
CME4Life
 
Order: Odonata Isoptera and Thysanoptera.pptx
Order: Odonata Isoptera and Thysanoptera.pptxOrder: Odonata Isoptera and Thysanoptera.pptx
Order: Odonata Isoptera and Thysanoptera.pptx
Arshad Shaikh
 
প্রত্যুৎপন্নমতিত্ব - Prottutponnomotittwa 2025.pdf
প্রত্যুৎপন্নমতিত্ব - Prottutponnomotittwa 2025.pdfপ্রত্যুৎপন্নমতিত্ব - Prottutponnomotittwa 2025.pdf
প্রত্যুৎপন্নমতিত্ব - Prottutponnomotittwa 2025.pdf
Pragya - UEM Kolkata Quiz Club
 
How to Create a Stage or a Pipeline in Odoo 18 CRM
How to Create a Stage or a Pipeline in Odoo 18 CRMHow to Create a Stage or a Pipeline in Odoo 18 CRM
How to Create a Stage or a Pipeline in Odoo 18 CRM
Celine George
 
Types of Actions in Odoo 18 - Odoo Slides
Types of Actions in Odoo 18 - Odoo SlidesTypes of Actions in Odoo 18 - Odoo Slides
Types of Actions in Odoo 18 - Odoo Slides
Celine George
 
TechSoup Microsoft Copilot Nonprofit Use Cases and Live Demo - 2025.05.28.pdf
TechSoup Microsoft Copilot Nonprofit Use Cases and Live Demo - 2025.05.28.pdfTechSoup Microsoft Copilot Nonprofit Use Cases and Live Demo - 2025.05.28.pdf
TechSoup Microsoft Copilot Nonprofit Use Cases and Live Demo - 2025.05.28.pdf
TechSoup
 
Search Engine Optimization (SEO) for Website Success
Search Engine Optimization (SEO) for Website SuccessSearch Engine Optimization (SEO) for Website Success
Search Engine Optimization (SEO) for Website Success
muneebrana3215
 

Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive

  • 1. © 2015 IBM Corporation Apache Hadoop Day 2015 Intro to Apache Spark LIGHTENING FAST CLUSTER COMPUTING
  • 2. © 2015 IBM Corporation
  • 3. © 2015 IBM Corporation Apache Hadoop Day 2015 Mapreduce Limitations • Lots of boilerplate , makes it complex to program in MR. • Disk based approach not good for iterative usecases. • Batch processing not fit for real time. In short no single solution, people build specialized systems as workarounds.
  • 4. © 2015 IBM Corporation Spark Goal Batch Interactiv e Streamin g Single Framework! Support batch, streaming, and interactive computations… … in a unified framework Easy to develop sophisticated algorithms (e.g., graph, ML algos)
  • 5. © 2015 IBM Corporation Spark Core Supports Scala,Java,Python,R Spark Core Supports Scala,Java,Python,R Spark SQL Interactive Spark SQL Interactive Spark Streaming realtime Spark Streaming realtime Mlib/Spark.ml Machine learning Mlib/Spark.ml Machine learning GraphX Graph processing GraphX Graph processing Spark Stack Unified engine across diverse workload and environments
  • 6. © 2015 IBM Corporation Data processing landscape GraphLa b Girap h … Graph Grap h Graph Pre gel Googl e Apache Dato
  • 7. © 2015 IBM Corporation Data processing landscape Dreme l GraphLa b Girap h Drill Impala … SQ L Graph Grap h SQ L Graph Pre gel Googl e Google SQL Apach e Apache Cloudera Dato
  • 8. © 2015 IBM Corporation Data processing landscape Dreme l GraphLa b Girap h Drill Impala … SQ L Graph Grap h SQ L Graph Pre gel Googl e Google SQL Apach e Apache DAG Tez Apache Cloudera Stream Stor m Apache Dato
  • 9. © 2015 IBM Corporation Data processing landscape Dr e aphLa b Girap h Graph Grap h G Pregel Go Apache DAG ez Apache raph ogle SQL Drill Apache SQL mel T Google SQL Impala Cloudera Storm Stream … Gr Apache Dato Stop surarmement now !
  • 10. © 2015 IBM Corporation Spark  Unifies batch, streaming, interactive comp.  Easy to build sophisticated applications  Support iterative, graph-parallel algorithms  Powerful APIs in Scala, Python, Java Spark Spark Streaming Shark SQL BlinkDB GraphX MLlib Streami ng Batch, Interactiv e Batch, Interactive Interacti ve Data-parallel, Iterative Sophisticated algos.
  • 11. © 2015 IBM Corporation MapReduce Vs Spark  Mapreduce run each task in its own process, when tasks completed the process dies (MultithreadedMapper)  In Spark by default many tasks are concurrently run in multi-threads on a single executor.  MR executor short lived and runs one large task  Spark executor is long live and runs many small tasks  Process creation Vs Thread creation cost.
  • 12. © 2015 IBM Corporation Problems in Spark  Applications cannot share data(mostly RDDs in Spark Context) without writing to external Storage.  Resource allocation inefficiency [spark.dynamicAllocation.enabled].  Not Exactly designed for interactive applications.
  • 13. © 2015 IBM Corporation Spark Internals -RDD • RDD (Resilient Distributed Dataset) • Lazy & Immutable • Iterative operations before RDD • Fault Tolerant • Traditional way for achieving Fault Tolerance • How does RDD achieve Fault Tolerance • Partition
  • 14. © 2015 IBM Corporation Apache Hadoop Day 2015 Spark Internals – RDDs sc.textFile(“hdfs://<input>”) .filter(_.startsWith(“ERROR”)) .map(_.split(“ “)(1)) .saveAsTextFile(“hdfs://<output>”) Stage-1 HDFS HDFSHadoopRDD FilteredRDD MappedRDD
  • 15. © 2015 IBM Corporation Apache Hadoop Day 2015 Spark Internals – RDDS Narrow Vs Wide Dependency •Narrow dependency – Each partition of parent is Used by at max one partition of child •Wide dependency – multiple child partition may depend on one parent.
  • 16. © 2015 IBM Corporation Apache Hadoop Day 2015 Narrow/Shuffle Dependency – class diagram
  • 17. © 2015 IBM Corporation Apache Hadoop Day 2015 Task Scheduler Task Thread Block Manager Spark Internal – Job Scheduling RDD Object DAG Scheduler Task Scheduler Executor Split DAG into Stages and Tasks Submit each Stage as ready Launches individual tasks Execute tasks Stores & serves blocks Rdd1.join(rdd2) .groupBy(…) .filter(…)
  • 18. © 2015 IBM Corporation Apache Hadoop Day 2015 Resource Allocation • Dynamic Resource Allocation. • Resource Allocation Policy.  Request Policy  Remove Policy
  • 19. © 2015 IBM Corporation Apache Hadoop Day 2015 Request/Remove Policy Request • Pending tasks to be scheduled. • Spark request executors in rounds. • spark.dynamicAllocation.schedulerBacklogTim eout & spark.dynamicAllocation.sustainedSchedulerB acklogTimeout Remove • Removes when its idle for more than spark.dynamicAllocation.executorIdleTimeo ut seconds
  • 20. © 2015 IBM Corporation Apache Hadoop Day 2015 Graceful Decommission of Executors • State before Dynamic Allocation • With Dynamic Allocation • Complexity increases with Shuffle • External Shuffle Service • State of Cached data either in disk or memory
  • 21. © 2015 IBM Corporation Apache Hadoop Day 2015 Fair Scheduler • What is Fair Scheduling? • How to enable Fair Scheduler val conf = new SparkConf().setMaster(...).setAppName(...) conf.set("spark.scheduler.mode", "FAIR") val sc = new SparkContext(conf) • Fair Scheduler Pools
  • 22. © 2015 IBM Corporation RDD Deep Dive • RDD Basics • How to create • RDD Operations • Lineage • Partitions • Shuffle • Type of RDDs • Extending RDD • Caching in RDD
  • 23. © 2015 IBM Corporation RDD Basics • RDD (Resilient Distributed Dataset) • Distributed collection of Object • Resilient - Ability to re-compute missing partitions (node failure) • Distributed – Split across multiple partitions • Dataset - Can contain any type, Python/Java/Scala Object or User defined Object • Fundamental unit of data in spark
  • 24. © 2015 IBM Corporation RDD Basics – How to create Two ways  Loading external datasets − Spark supports wide range of sources − Access HDFS data through InputFormat & OutputFormat of Hadoop. − Supports custom Input/Output format  Parallelizing collection in driver program val lineRDD = sc.textFile(“hdfs:///path/to/Readme.md”) textFile(“/my/directory/*”) or textFile(“/my/directory/*.gz”) SparkContext.wholeTextFiles returns (filename,content) pair val listRDD = sc.parallelize(List(“spark”,”meetup”,”deepdive”))
  • 25. © 2015 IBM Corporation RDD Operations  Two type of Operations  Transformation  Action  Transformations are lazy, nothing actually happens until an action is called.  Action triggers the computation  Action returns values to driver or writes data to external storage.
  • 26. © 2015 IBM Corporation Lazy Evaluation − Transformation on RDD, don’t get performed immediately − Spark Internally records metadata to track the operation − Loading data into RDD also gets lazy evaluated − Lazy evaluation reduce number of passes on the data by grouping operations − MapReduce – Burden on developer to merge the operation, complex map. − Failure in Persisting the RDD will re-compute complete lineage every time.
  • 27. © 2015 IBM Corporation RDD In Action sc.textFile(“hdfs://file.txt") .flatMap(line=>line.split(" ")) .map(word => (word,1)) .reduceByKey(_+_) .collect() I scream you scream lets all scream for icecream! I wish I were what I was when I wished I were what I am. I scream you scream lets all scream for icecream (I,1) (scream,1) (you,1) (scream,1) (lets,1) (all,1) (scream,1) (icecream,1) (icecream,1) (scream,3) (you,1) (lets,1) (I,1) (all,1)
  • 28. © 2015 IBM Corporation Lineage Demo
  • 29. © 2015 IBM Corporation RDD Partition  Partition Definition  Fragments of RDD  Fragmentation allows Spark to execute in Parallel.  Partitions are distributed across cluster(Spark worker)  Partitioning  Impacts parallelism  Impacts performance
  • 30. © 2015 IBM Corporation Importance of partition Tuning  Too few partitions  Less concurrency, unused cores.  More susceptible to data skew  Increased memory pressure for groupBy, reduceByKey, sortByKey, etc.  Too many partitions  Framework overhead (more scheduling latency than the time needed for actual task.)  Many CPU context-switching  Need “reasonable number” of partitions  Commonly between 100 and 10,000 partitions  Lower bound: At least ~2x number of cores in cluster  Upper bound: Ensure tasks take at least 100ms
  • 31. © 2015 IBM Corporation How Spark Partitions data  Input data partition  Shuffle transformations  Custom Partitioner
  • 32. © 2015 IBM Corporation Partition - Input Data  Spark uses the same classes as Hadoop to perform input/output  sc.textFile("hdfs://…") invokes Hadoop TextInputFormat  Below are the knobs which define #Partitions  dfs.block.size – default 128MB (Hadoop 2.0)  numPartitions – can be used to increase the number of partitions; default is 0, which means 1 partition  mapreduce.input.fileinputformat.split.minsize – default 1kb  Partition Size = Max(minSize, Min(goalSize, blockSize))  goalSize = totalInputSize/numPartitions  32MB, 0, 1KB, 640MB total size - defaults − Max(1KB, Min(640MB, 32MB)) = 32MB splits → 20 partitions  32MB, 30, 1KB, 640MB total size - want more partitions − goalSize = 640MB/30 ≈ 21MB, Max(1KB, Min(21MB, 32MB)) ≈ 21MB splits → 30 partitions  32MB, 5, 1KB, 640MB total size − goalSize = 128MB, Max(1KB, Min(128MB, 32MB)) = 32MB splits → 20 partitions (split size is capped at the block size, so this cannot yield fewer partitions than blocks)  32MB, 0, 64MB minsize − Max(64MB, Min(640MB, 32MB)) = 64MB splits → 10 bigger partitions
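  A sketch of steering input partitioning from the API side (paths illustrative); the minPartitions argument feeds the goalSize in the formula above:

    val byBlocks = sc.textFile("hdfs:///data/input")      // one partition per HDFS block by default
    val finer    = sc.textFile("hdfs:///data/input", 64)  // request at least 64 partitions
    println(finer.partitions.length)                      // actual count per the Max/Min formula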
  • 33. © 2015 IBM Corporation Partition - Shuffle transformations  All shuffle transformations provide a parameter for the desired number of partitions  Default behavior - Spark uses HashPartitioner − If spark.default.parallelism is set, takes that as the # of partitions − If spark.default.parallelism is not set, uses the largest upstream RDD’s number of partitions − Reduces chances of out-of-memory errors Shuffle transformations: 1. groupByKey 2. reduceByKey 3. aggregateByKey 4. sortByKey 5. join 6. cogroup 7. cartesian 8. coalesce 9. repartition 10. repartitionAndSortWithinPartitions
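  For example, most of the operators above take an explicit partition count, which sidesteps the defaults entirely (data is illustrative):

    val pairs   = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    val reduced = pairs.reduceByKey(_ + _, 8) // 8 output partitions via HashPartitioner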
  • 34. © 2015 IBM Corporation Partition - Repartitioning  RDDs provide two operators  repartition(numPartitions) − Can increase/decrease the number of partitions − Internally does a shuffle − Expensive due to the shuffle − For decreasing partitions use coalesce  coalesce(numPartitions, shuffle: [true/false]) − Decreases partitions − Goes for narrow dependencies − Avoids a shuffle − For a drastic reduction, shuffle = true may be needed
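  A minimal sketch of the trade-off (sizes illustrative):

    val rdd = sc.parallelize(1 to 1000, 100)
    val wider    = rdd.repartition(200)            // full shuffle; can grow or shrink
    val narrower = rdd.coalesce(10)                // narrow dependency, no shuffle
    val drastic  = rdd.coalesce(2, shuffle = true) // forced shuffle keeps upstream parallelism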
  • 35. © 2015 IBM Corporation Custom Partitioner  Partition the data according to the use case & data structure  Provides control over the number of partitions and the distribution of data  Extends the Partitioner class; need to implement getPartition & numPartitions (see the sketch below)
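  A toy sketch of such a partitioner (the first-letter scheme is purely illustrative):

    import org.apache.spark.Partitioner

    class FirstLetterPartitioner(partitions: Int) extends Partitioner {
      override def numPartitions: Int = partitions
      override def getPartition(key: Any): Int =
        math.abs(key.toString.headOption.getOrElse(' ').hashCode) % numPartitions
    }

    val users    = sc.parallelize(Seq(("alice", 1), ("bob", 2), ("anna", 3)))
    val byLetter = users.partitionBy(new FirstLetterPartitioner(4))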
  • 36. © 2015 IBM Corporation Partitioning Demo
  • 37. © 2015 IBM Corporation Shuffle - GroupByKey Vs ReduceByKey val wordCountsWithGroup = rdd .groupByKey() .map(t => (t._1, t._2.sum)) .collect()
  • 38. © 2015 IBM Corporation Shuffle - GroupByKey Vs ReduceByKey val wordPairsRDD = rdd.map(word => (word, 1)) val wordCountsWithReduce = wordPairsRDD .reduceByKey(_ + _) .collect()
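  The difference between the two versions matters at scale: reduceByKey combines values map-side before the shuffle, so at most one pair per key per partition crosses the network, while groupByKey ships every single pair to the reducers, costing far more bandwidth and memory.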
  • 39. © 2015 IBM Corporation The Shuffle  Redistribution of data among partitions between stages.  Most of the performance, reliability and scalability issues in Spark occur within the shuffle.  Like MapReduce, the Spark shuffle uses a pull model.  Has consistently evolved and is still an area of research in Spark
  • 40. © 2015 IBM Corporation Shuffle Overview • Spark runs jobs stage by stage. • Stages are built up by the DAGScheduler according to the RDDs’ ShuffleDependency • e.g. ShuffleRDD / CoGroupedRDD will have a ShuffleDependency • Many operators create ShuffleRDD / CoGroupedRDD under the hood • repartition / combineByKey / groupBy / reduceByKey / cogroup • Many other operators further call into the above • e.g. the various join operators call cogroup.
  • 41. © 2015 IBM Corporation You have seen this [Diagram: a DAG of RDDs A–G split into Stage 1, Stage 2 and Stage 3, connected by map, union, groupBy and join]
  • 42. © 2015 IBM Corporation Shuffle is Expensive • When doing a shuffle, data no longer stays in memory only; it gets written to disk. • For Spark, the shuffle process might involve: • Data partitioning: which might involve very expensive data sorting work etc. • Data ser/deser: to enable data to be transferred through the network or across processes. • Data compression: to reduce IO bandwidth etc. • Disk IO: probably multiple times on one single data block • e.g. shuffle spill, merge combine
  • 43. © 2015 IBM Corporation Shuffle History  The shuffle module in Spark has evolved over time.  Spark (0.6-0.7) – same code path as RDD’s persist method; MEMORY_ONLY and DISK_ONLY options available.  Spark (0.8-0.9) - separate code for shuffle: ShuffleBlockManager & BlockObjectWriter for shuffle only. - Shuffle optimization - consolidated shuffle writes.  Spark 1.0 – introduced a pluggable shuffle framework.  Spark 1.1 – sort-based shuffle implementation.  Spark 1.2 - Netty transfer implementation; sort-based shuffle is the default now.  Spark 1.2+ - external shuffle service etc.
  • 44. © 2015 IBM Corporation Understanding Shuffle  Input Aggregation  Types of Shuffle  Hash based − Basic Hash Shuffle − Consolidate Hash Shuffle  Sort Based Shuffle
  • 45. © 2015 IBM Corporation Input Aggregation  Like MapReduce, Spark performs aggregation (combiner) on the map side.  Aggregation is done in ShuffleMapTask using  AppendOnlyMap (in-memory hash table combiner) − Keys are never removed; values get updated  ExternalAppendOnlyMap (in-memory and on-disk hash table combiner) − A hash map which can spill to disk − An append-only map that spills data to disk when memory is insufficient  Shuffle file in-memory buffer – shuffle writes to an in-memory buffer before writing to a shuffle file.
  • 46. © 2015 IBM Corporation Shuffle Types – Basic Hash Shuffle  Hash-based shuffle (spark.shuffle.manager) hash-partitions the data for reducers  Each map task writes each bucket to a file.  #Map Tasks = M  #Reduce Tasks = R  #Shuffle Files = M*R, #In-Memory Buffers = M*R
  • 47. © 2015 IBM Corporation Shuffle Types – Basic Hash Shuffle  Problem  Let’s use 100KB as the buffer size  We have 10,000 reducers  10 map tasks per executor  In-memory buffer size = 100KB * 10,000 * 10  Buffer needed: 10GB per executor  This huge amount of buffer memory is unacceptable, and this implementation can’t support 10,000 reducers.
  • 48. © 2015 IBM Corporation Shuffle Types – Consolidate Hash Shuffle  Solution to decrease the in-memory buffer size and the number of files.  Within an executor, map tasks write each bucket to a segment of the file.  #Shuffle files/executor = #Reducers,  #In-memory buffers/executor = #Reducers
  • 49. © 2015 IBM Corporation Shuffle Types – Sort Based Shuffle  Consolidate hash shuffle still needs one file for each reducer. - Total C*R intermediate files, C = # of executors running map tasks  Still too many files (e.g. ~10k reducers)  Needs significant memory for compression & serialization buffers.  Too many open files issue.  Sort-based shuffle is similar to the map-side shuffle of MapReduce  Introduced in Spark 1.1; now the default shuffle
  • 50. © 2015 IBM Corporation Shuffle Types – Sort Based Shuffle  Map output records from each task are kept in memory as long as they fit.  Once full, data gets sorted by partition and spilled to a single file.  Each map task generates one data file and one index file  Utilizes an external sorter to do the sort work  If a map-side combiner is required, data is sorted by key and partition; otherwise only by partition  With #reducers <= 200 (spark.shuffle.sort.bypassMergeThreshold) and no sorting required, it uses the hash approach: generates a file per reducer and merges them into a single file
  • 51. © 2015 IBM Corporation Shuffle Reader  On the reader side both sort & hash shuffle use the hash shuffle reader  On the reducer side a set of threads fetches remote map output blocks  Once a block arrives, its records are deserialized and passed into a result queue.  Records are passed to ExternalAppendOnlyMap; for ordering operations like sortByKey records are passed to ExternalSorter. [Diagram: map-side buckets fetched by reduce tasks, each feeding an aggregator]
  • 52. © 2015 IBM Corporation Type of RDDs - RDD Interface Base for all RDDs (RDD.scala), consists of  A set of partitions (“splits” in Hadoop)  A list of dependencies on parent RDDs  A function to compute a partition from its parents  Optional preferred locations for each partition  A Partitioner defining the strategy for partitioning (hash/range)  Basic operations like map, filter, persist etc. [Diagram: lineage (partitions, dependencies, compute, preferredLocations, partitioner) enables optimized execution of operations such as map, filter, persist]
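  A simplified sketch of that contract, condensed from RDD.scala (the real class carries ClassTags and more plumbing):

    abstract class RDD[T] {
      def compute(split: Partition, context: TaskContext): Iterator[T]         // materialize one partition
      protected def getPartitions: Array[Partition]                            // the set of partitions
      protected def getDependencies: Seq[Dependency[_]]                        // parent RDDs
      protected def getPreferredLocations(split: Partition): Seq[String] = Nil // locality hints
      val partitioner: Option[Partitioner] = None                              // hash/range strategy
    }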
  • 53. © 2015 IBM Corporation Example: HadoopRDD  partitions = one per HDFS block  dependencies = none  compute(partition) = read corresponding block  preferredLocations(part) = HDFS block location  partitioner = none
  • 54. © 2015 IBM Corporation Example: MapPartitionRDD  partitions = same as parent partitions  dependencies = “one-to-one” on the parent RDD  compute(partition) = apply map on parent  preferredLocations(part) = none (ask parent)  partitioner = none
  • 55. © 2015 IBM Corporation Example: CoGroupRDD  partitions = one per reduce task  dependencies = could be narrow or wide dependency  compute(partition) = read and join shuffled data  preferredLocations(part) = none  partitioner = HashPartitioner(numTasks)
  • 56. © 2015 IBM Corporation Extending RDDs Extend RDDs  To add transformations/actions  Allows developers to express domain-specific calculations in a cleaner way  Improves code readability  Easy to maintain  Custom RDDs for input sources and domains  A way to add new input data sources  A better way to express domain-specific data  Better control over partitioning and distribution
  • 57. © 2015 IBM Corporation How to Extend  Add custom operators to an RDD  Use Scala implicits  Feels and works like a built-in operator  You can add an operator to a specific RDD type or to all RDDs  Custom RDD  Extend the RDD API to create our own RDD  Implement the compute & getPartitions abstract methods
  • 58. © 2015 IBM Corporation Implicit Class  Creates extension methods on an existing type  Introduced in Scala 2.10  Implicits are compile-time checked; an implicit class gets resolved into a class definition with an implicit conversion  We will use implicits to add a new method to RDDs
  • 59. © 2015 IBM Corporation Adding a new Operator to RDD  We will use Scala’s implicit feature to add a new operator to an existing RDD (see the sketch below)  This operator will show up only on our RDD type  Implicit conversions are handled by Scala
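  A minimal sketch; topWords is our own illustrative operator, not a built-in Spark method:

    import org.apache.spark.rdd.RDD

    object RDDExtensions {
      implicit class CountOps(rdd: RDD[(String, Int)]) {
        def topWords(n: Int): Array[(String, Int)] =
          rdd.sortBy(_._2, ascending = false).take(n) // reads like a native method
      }
    }

    import RDDExtensions._
    // wordCounts.topWords(10) now compiles as if RDD defined it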
  • 60. © 2015 IBM Corporation Custom RDD Implementation  Extending RDD allows you to create your own custom RDD structure  A custom RDD allows control over the computation, and lets you change partitioning & locality information (a sketch follows)
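  A toy custom RDD under those rules: it synthesizes integer ranges with no input source, implementing only compute and getPartitions (everything here is illustrative):

    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    class RangePartition(val index: Int, val start: Int, val end: Int) extends Partition

    class RangeRDD(sc: SparkContext, max: Int, slices: Int)
      extends RDD[Int](sc, Nil) { // Nil: no parent dependencies

      override protected def getPartitions: Array[Partition] = {
        val step = math.ceil(max.toDouble / slices).toInt
        (0 until slices).map(i =>
          new RangePartition(i, i * step, math.min((i + 1) * step, max)): Partition
        ).toArray
      }

      override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
        val p = split.asInstanceOf[RangePartition]
        (p.start until p.end).iterator
      }
    }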
  • 61. © 2015 IBM Corporation Caching in RDD  Spark allows caching/persisting an entire dataset in memory  Persisting an RDD in the cache  The first time it is computed it will be kept in memory  Subsequent operations reuse the cached partitions  Fault-tolerant: recomputed in case of failure  Caching is a key tool for interactive and iterative algorithms  persist supports different storage levels  Storage level - in memory, disk or both, Tachyon  Serialized vs deserialized
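  A short persist sketch (path illustrative); note that a storage level cannot be changed once assigned to an RDD:

    import org.apache.spark.storage.StorageLevel

    val data = sc.textFile("hdfs:///data/ratings")
    data.persist(StorageLevel.MEMORY_AND_DISK_SER) // serialized, spills to disk
    data.count()     // first action materializes the cache
    data.count()     // served from the cache
    data.unpersist() // release when done; cache() is shorthand for MEMORY_ONLY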
  • 62. © 2015 IBM Corporation Caching In RDD  The Spark context tracks persistent RDDs  The block manager puts partitions in memory when first evaluated  Caching is lazy: no caching happens without an action.  Shuffle output is also kept after shuffle operations, but we still need to cache shuffled RDDs explicitly
  • 63. © 2015 IBM Corporation Caching Demo

Editor's Notes

  • #12: MapReduce has MultithreadedMapper
  • #13: MapReduce has MultithreadedMapper
  • #14: Writes are coarse-grained, not fine-grained. Intermediate results are written to memory, whereas between two MapReduce tasks data is written to disk. Alternatives: replicate data or log updates across machines. RDDs provide fault tolerance by logging the transformations used to build a dataset (its lineage) rather than the actual data.
  • #15: RDDs can hold primitives, sequences, Scala objects, mixed types. Special RDDs exist for special purposes – Pair RDD, Double RDD, Sequence File RDD.
  • #16: map leads to a narrow dependency, while join leads to a wide dependency. A wide dependency needs shuffling; the parent gets materialized.
  • #17: map leads to a narrow dependency, while join leads to a wide dependency. A wide dependency needs shuffling; the parent gets materialized.
  • #18: In the driver there is a component called the DAG Scheduler; it looks at the DAG, and all it understands is whether dependencies are wide or narrow. The DAG scheduler then submits the first stage to the Task Scheduler, which is also in the driver. A stage is split into tasks; a task is data + computation. The TS determines the number of tasks needed for the stage and allocates them to the executors. The executor heap gives 60% to cached RDDs, 20% to shuffle and 20% to the user program by default.
  • #25: (file systems & file formats – NFS, HDFS, S3, CSV, JSON, Sequence, Protocol Buffers)
  • #26: Transformations don’t mutate the original RDD; they always return a new RDD.
  • #30: Fragmentation is what enables Spark to execute in parallel, and the level of fragmentation is a function of the number of partitions of your RDD. The number of partitions is important because a stage in Spark operates on one partition at a time (and loads the data in that partition into memory).
  • #31: Since with fewer partitions there’s more data in each partition, you increase the memory pressure on your program, plus more network and disk IO.
  • #33: dfs.block.size - The default value in Hadoop 2.0 is 128MB. In the local mode the corresponding parameter is fs.local.block.size (Default value 32MB). It defines the default partition size.
  • #34: HashPartitioner extends the Partitioner class.
  • #40: A shuffle involves two sets of tasks: tasks from the stage producing the shuffle data and tasks from the stage consuming it. For historical reasons, the tasks writing out shuffle data are known as “map tasks” and the tasks reading the shuffle data are known as “reduce tasks”. Every map task writes out data to local disk, and then the reduce tasks make remote requests to fetch that data.
  • #46: Just the same as Hadoop MapReduce, the Spark shuffle involves an aggregation step (combiner) before writing map outputs (intermediate values) to buckets. Spark also writes to a small buffer (size configurable via spark.shuffle.file.buffer.kb) before writing to physical files, to increase disk I/O speed.
  • #49: Reduces the number of shuffle files per executor to the number of reducers.