Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica
Spark: Fast, Interactive, Language-Integrated Cluster Computing
UC BERKELEY
www.spark-project.org
2.
Project Goals
Extend the MapReduce model to better support two common classes of analytics apps:
» Iterative algorithms (machine learning, graphs)
» Interactive data mining
Enhance programmability:
» Integrate into Scala programming language
3.
Project Goals
Extend the MapReduce model to better support two common classes of analytics apps:
» Iterative algorithms (machine learning, graphs)
» Interactive data mining
Enhance programmability:
» Integrate into Scala programming language
Why does the original MapReduce model not support these use cases efficiently?
4.
Motivation
Most current cluster programming models are based on acyclic data flow from stable storage to stable storage
[Diagram: Input -> Map (x3) -> Reduce (x2) -> Output]
5.
Motivation
Most current cluster programming models are based on acyclic data flow from stable storage to stable storage
[Diagram: Input -> Map (x3) -> Reduce (x2) -> Output]
Benefits of data flow: runtime can decide where to run tasks and can automatically recover from failures
6.
Motivation
Acyclic data flow is inefficient for applications that repeatedly reuse a working set of data:
» Iterative algorithms (machine learning, graphs)
» Interactive data mining tools (R, Excel, Python)
With current frameworks, apps reload data from stable storage on each query
7.
Solution: Resilient Distributed Datasets (RDDs)
Allow apps to keep working sets in memory for efficient reuse
Retain the attractive properties of MapReduce:
» Fault tolerance, data locality, scalability
Support a wide range of applications
8.
Programming Model
Resilient distributed datasets (RDDs)
» Immutable, partitioned collections of objects
» Created through parallel transformations (map, filter, groupBy, join, …) on data in stable storage
» Can be cached for efficient reuse
Actions on RDDs
» count, reduce, collect, save, …
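A minimal sketch of this transformation/action split, using plain Scala collections as a local stand-in for RDDs (no Spark required): the lazy `view` mimics how transformations only describe a derived dataset, while an action such as `size` forces evaluation. The object name and sample data are made up for illustration.

```scala
// Local stand-in for the RDD model: transformations build a lazy
// description of a dataset; actions force evaluation and return a value.
object RddModelSketch {
  def main(args: Array[String]): Unit = {
    val lines = Seq("ERROR disk full", "INFO ok", "ERROR timeout")

    // "Transformation": nothing is computed yet (lazy view)
    val errors = lines.view.filter(_.startsWith("ERROR"))

    // "Action": evaluation happens here, result returns to the caller
    println(errors.size)  // 2
  }
}
```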
9.
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")              // base RDD
errors = lines.filter(_.startsWith("ERROR"))      // transformed RDD
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count        // action
cachedMsgs.filter(_.contains("bar")).count
. . .

[Diagram: the driver ships tasks to three workers; each worker reads one HDFS block (Block 1-3), caches its partition of messages (Cache 1-3), and sends results back to the driver]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
10.
RDD Fault Tolerance
RDDs maintain lineage information that can be used to reconstruct lost partitions
Ex: messages = textFile(...).filter(_.startsWith("ERROR"))
                            .map(_.split('\t')(2))
[Lineage graph: HDFS File -> filter (func = _.contains(...)) -> Filtered RDD -> map (func = _.split(...)) -> Mapped RDD]
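One way to picture lineage-based recovery is a toy sketch (not Spark's actual implementation): a partition is recorded as its source block plus the chain of functions applied to it, so a lost partition can be rebuilt by re-reading the block and reapplying the chain. `Lineage` and its fields are hypothetical names.

```scala
// Toy model of lineage: a source block plus the recorded chain of
// transformations; recompute() rebuilds a lost partition from scratch.
object LineageSketch {
  case class Lineage(source: Seq[String], ops: List[Seq[String] => Seq[String]]) {
    def recompute(): Seq[String] = ops.foldLeft(source)((data, op) => op(data))
  }

  def main(args: Array[String]): Unit = {
    val block = Seq("ERROR\tfs\tdisk full", "INFO\tnet\tok")
    val lineage = Lineage(block, List(
      (d: Seq[String]) => d.filter(_.startsWith("ERROR")),  // filter step
      (d: Seq[String]) => d.map(_.split('\t')(2))           // map step
    ))
    // Suppose the cached partition was lost: rebuild it from lineage
    println(lineage.recompute())  // List(disk full)
  }
}
```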
Example: Logistic Regression
val data = spark.textFile(...).map(readPoint).cache()
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
println("Final w: " + w)
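For readers without a cluster, the same loop can be run locally: this sketch replaces the slide's `Vector`, `readPoint`, and RDD with plain Scala arrays and a made-up two-point dataset (all names and data here are illustrative, not the Spark API).

```scala
// Spark-free version of the slide's logistic regression loop.
object LocalLogisticRegression {
  case class Point(x: Array[Double], y: Double)  // label y is +1 or -1

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (u, v) => u * v }.sum

  // Gradient-descent loop, same update rule as on the slide
  def train(data: Seq[Point], iterations: Int): Array[Double] = {
    var w = Array(0.0, 0.0)
    for (_ <- 1 to iterations) {
      val gradient = data.map { p =>
        val scale = (1.0 / (1.0 + math.exp(-p.y * dot(w, p.x))) - 1.0) * p.y
        p.x.map(_ * scale)
      }.reduce((a, b) => a.zip(b).map { case (u, v) => u + v })
      w = w.zip(gradient).map { case (wi, gi) => wi - gi }
    }
    w
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(Point(Array(1.0, 2.0), 1.0), Point(Array(-1.0, -1.5), -1.0))
    val w = train(data, 100)
    // Both toy points should end up on the correct side of the boundary
    println(data.forall(p => p.y * dot(w, p.x) > 0))  // true
  }
}
```

Note that `data` is reused on every iteration, which is exactly what caching an RDD avoids recomputing in the distributed version.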
13.
Logistic Regression Performance
[Chart: running time (s), 0-5000, vs. number of iterations (1, 5, 10, 20, 30), comparing Hadoop and Spark]
Hadoop: 127 s / iteration
Spark: 174 s first iteration, 6 s further iterations
14.
Spark Operations
Transformations (define a new RDD):
map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, mapValues
Actions (return a result to the driver program):
collect, reduce, count, save, lookupKey
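As one concrete example from the table, `reduceByKey` merges all values that share a key; its semantics can be approximated locally with Scala's `groupBy` on ordinary pair sequences (an illustration only, not the Spark API).

```scala
// Local approximation of reduceByKey: group pairs by key, then fold
// each key's values with the supplied merge function.
object ReduceByKeySketch {
  def reduceByKey[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }

  def main(args: Array[String]): Unit = {
    val counts = reduceByKey(Seq(("a", 1), ("b", 1), ("a", 1)))(_ + _)
    println(counts("a"))  // 2
  }
}
```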
#2 Point out that Scala is a modern PL etc
Mention DryadLINQ (but we go beyond it with RDDs)
Point out that interactive use and iterative use go hand in hand because both require small tasks and dataset reuse
#5 Also applies to Dryad, SQL, etc.
Benefits: easy to do fault tolerance and …
#7 RDDs = first-class way to manipulate and persist intermediate datasets
#8 You write a single program similar to DryadLINQ
Distributed data sets with parallel operations on them are pretty standard; the new thing is that they can be reused across ops
Variables in the driver program can be used in parallel ops; accumulators useful for sending information back, cached vars are an optimization
Mention cached vars useful for some workloads that won’t be shown here
Mention it’s all designed to be easy to distribute in a fault-tolerant fashion
#9 Key idea: add “variables” to the “functions” in functional programming
#11 Note that dataset is reused on each gradient computation
#12 Key idea: add “variables” to the “functions” in functional programming
#13 This is for a 29 GB dataset on 20 EC2 m1.xlarge machines (4 cores each)