Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma,
Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica

Spark: Fast, Interactive, Language-Integrated Cluster Computing

UC BERKELEY
www.spark-project.org
Project Goals

Extend the MapReduce model to better support two common classes of analytics apps:
»Iterative algorithms (machine learning, graphs)
»Interactive data mining
Enhance programmability:
»Integrate into the Scala programming language
Project Goals

Extend the MapReduce model to better support two common classes of analytics apps:
»Iterative algorithms (machine learning, graphs)
»Interactive data mining
Enhance programmability:
»Integrate into the Scala programming language

Question: why does the original MapReduce model not efficiently support these use cases?
Motivation

Most current cluster programming models are based on acyclic data flow from stable storage to stable storage.

[Diagram: Input → Map tasks → Reduce tasks → Output]
Motivation

Most current cluster programming models are based on acyclic data flow from stable storage to stable storage.

[Diagram: Input → Map tasks → Reduce tasks → Output]

Benefits of data flow: the runtime can decide where to run tasks and can automatically recover from failures.
Motivation

Acyclic data flow is inefficient for applications that repeatedly reuse a working set of data:
»Iterative algorithms (machine learning, graphs)
»Interactive data mining tools (R, Excel, Python)
With current frameworks, apps reload data from stable storage on each query.
Solution: Resilient Distributed Datasets (RDDs)

Allow apps to keep working sets in memory for efficient reuse
Retain the attractive properties of MapReduce:
» Fault tolerance, data locality, scalability
Support a wide range of applications
Programming Model

Resilient distributed datasets (RDDs):
» Immutable, partitioned collections of objects
» Created through parallel transformations (map, filter, groupBy, join, …) on data in stable storage
» Can be cached for efficient reuse
Actions on RDDs:
» count, reduce, collect, save, …
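RDD operations deliberately mirror Scala's collection combinators. The sketch below uses a plain Scala List (not the Spark API) to illustrate the transformation-then-action pattern the slide describes:

```scala
// "Transformations" (filter, map) describe a derived dataset;
// the "action" (reduce) collapses it to a result for the driver.
val nums = List(1, 2, 3, 4, 5)

// Transformations: build a new collection from an existing one.
val squaresOfEvens = nums.filter(_ % 2 == 0).map(n => n * n)

// Action: produce a single result.
val total = squaresOfEvens.reduce(_ + _)   // 4 + 16 = 20
```

On an RDD the same chain runs lazily across cluster partitions; on a List it runs eagerly on one machine, but the programming model reads identically.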
Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns.

lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()
cachedMsgs.filter(_.contains("foo")).count
cachedMsgs.filter(_.contains("bar")).count
. . .

[Diagram: the driver sends tasks to three workers; each worker reads one HDFS block (Block 1–3), caches its partition of cachedMsgs (Cache 1–3), and returns results. lines is the base RDD, messages a transformed RDD, and count an action.]
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data)
RDD Fault Tolerance

RDDs maintain lineage information that can be used to reconstruct lost partitions.

Ex: messages = textFile(...).filter(_.startsWith("ERROR"))
                            .map(_.split('\t')(2))

[Lineage graph: HDFS File → filter (func = _.startsWith(...)) → Filtered RDD → map (func = _.split(...)) → Mapped RDD]
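The lineage idea can be sketched in a few lines of plain Scala (this is illustrative, not Spark's implementation): record the base-data source and the chain of transformations, and replay them when a partition is lost.

```scala
// Illustrative sketch: lineage = how to recompute a dataset, recorded
// as a function from base data, rather than the data itself.
case class Lineage[A, B](base: () => Seq[A], transform: Seq[A] => Seq[B]) {
  // Recompute (e.g. after losing a cached partition) by replaying
  // the transformations against the base data.
  def compute(): Seq[B] = transform(base())
}

val logLines = Seq("ERROR disk failure", "INFO ok", "ERROR net timeout")

val messages = Lineage[String, String](
  base = () => logLines,  // stand-in for re-reading the HDFS file
  transform = _.filter(_.startsWith("ERROR")).map(_.split(" ")(1))
)

val recovered = messages.compute()  // Seq("disk", "net")
```

Because the transformations are deterministic, recomputation yields exactly the lost data, so no replication of the in-memory working set is needed.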
Example: Logistic Regression

Goal: find the best line separating two sets of points.

[Figure: scatter of + and – points; a random initial line converges toward the target separating line]
Example: Logistic Regression
val data = spark.textFile(...).map(readPoint).cache()
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
val gradient = data.map(p =>
(1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x
).reduce(_ + _)
w -= gradient
}
println("Final w: " + w)
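The slide's code assumes helpers (readPoint, Vector) that are not shown. A self-contained, single-machine sketch of the same loop, with Point and the vector operations defined inline on a tiny hypothetical dataset, might look like this:

```scala
import scala.math.exp

// Single-machine sketch of the slide's gradient loop. Point, dot, and
// the element-wise vector ops stand in for the slide's readPoint and
// Vector helpers; the two data points are made up for illustration.
case class Point(x: Array[Double], y: Double)

def dot(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (u, v) => u * v }.sum

val data = Seq(
  Point(Array(2.0, 1.0), +1.0),
  Point(Array(-2.0, -1.0), -1.0)
)
val D = 2
var w = Array.fill(D)(0.5)  // fixed start instead of Vector.random(D)

for (_ <- 1 to 10) {
  val gradient = data
    .map { p =>
      // Same per-point term as the slide, scaled onto the vector x.
      val scale = (1.0 / (1.0 + exp(-p.y * dot(w, p.x))) - 1.0) * p.y
      p.x.map(_ * scale)
    }
    .reduce((a, b) => a.zip(b).map { case (u, v) => u + v })
  w = w.zip(gradient).map { case (wi, gi) => wi - gi }
}
```

In Spark the only changes are that data is a cached RDD and map/reduce run in parallel; because data.cache() keeps the points in memory, each iteration avoids re-reading them from disk.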
Logistic Regression Performance

[Chart: running time (s, 0-5000) vs number of iterations (1, 5, 10, 20, 30), Hadoop vs Spark]

Hadoop: 127 s / iteration
Spark: 174 s for the first iteration, 6 s for further iterations
Spark Operations

Transformations (define a new RDD):
map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, mapValues

Actions (return a result to the driver program):
collect, reduce, count, save, lookupKey
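The per-key transformations in the list have straightforward collection analogues. The sketch below uses plain Scala collections (not the Spark API) to show the semantics of two of them:

```scala
// Illustrative sketch of groupByKey / reduceByKey semantics on a
// local Seq of key-value pairs (in Spark these run per partition,
// with a shuffle to bring equal keys together).
val pairs = Seq(("a", 1), ("b", 2), ("a", 3))

// groupByKey: collect all values for each key.
val grouped: Map[String, Seq[Int]] =
  pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }

// reduceByKey: combine the values for each key with a function (+).
val reduced: Map[String, Int] =
  grouped.map { case (k, vs) => k -> vs.reduce(_ + _) }
```

In Spark, reduceByKey is usually preferred over groupByKey followed by a reduce, since it can combine values within each partition before shuffling.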

Editor's Notes

  • #2 Point out that Scala is a modern PL etc Mention DryadLINQ (but we go beyond it with RDDs) Point out that interactive use and iterative use go hand in hand because both require small tasks and dataset reuse
  • #3 Point out that Scala is a modern PL etc Mention DryadLINQ (but we go beyond it with RDDs) Point out that interactive use and iterative use go hand in hand because both require small tasks and dataset reuse
  • #4 Acyclic
  • #5 Also applies to Dryad, SQL, etc Benefits: easy to do fault tolerance and
  • #7 RDDs = first-class way to manipulate and persist intermediate datasets
  • #8 You write a single program  similar to DryadLINQ Distributed data sets with parallel operations on them are pretty standard; the new thing is that they can be reused across ops Variables in the driver program can be used in parallel ops; accumulators useful for sending information back, cached vars are an optimization Mention cached vars useful for some workloads that won’t be shown here Mention it’s all designed to be easy to distribute in a fault-tolerant fashion
  • #9 Key idea: add “variables” to the “functions” in functional programming
  • #11 Note that dataset is reused on each gradient computation
  • #12 Key idea: add “variables” to the “functions” in functional programming
  • #13 This is for a 29 GB dataset on 20 EC2 m1.xlarge machines (4 cores each)