CONFIDENTIAL - RESTRICTED 
Introduction to Spark 
LV Big Data Monthly Meetup #2 
December 3rd 2014 
Maxime Dumas 
Systems Engineer, Cloudera
Thirty Seconds About Max 
• Systems Engineer 
• aka Sales Engineer 
• SoCal, AZ, NV 
• former coder of PHP 
• teaches meditation + yoga 
• from Montreal, Canada 
2
What Does Cloudera Do? 
• product 
• distribution of Hadoop components, Apache licensed 
• enterprise tooling 
• support 
• training 
• services (aka consulting) 
• community 
3
4 
But first… how did we get here?
What does Hadoop look like? 
5 
[Diagram: a MapReduce-era Hadoop cluster. An HDFS master (“NN”, the NameNode) with a standby master and an MR master (“JT”, the JobTracker) coordinate many worker nodes, each running an HDFS worker (“DN”, DataNode) alongside an MR worker (“TT”, TaskTracker).]
But I want MORE! 
6 
[Diagram: the same nodes redrawn as general-purpose infrastructure: HDFS workers on every machine, with MapReduce shown as just one framework layered on top, still coordinated by the HDFS master (“NN”), the MR master (“JT”), and a standby master.]
Hadoop as an Architecture 
The Old Way: compute (RDBMS, EDW) separated from data storage (SAN, NAS) by the network 
• Expensive, special-purpose, “reliable” servers 
• Expensive licensed software 
• Hard to scale 
• Network is a bottleneck 
• Only handles relational data 
• Difficult to add new fields & data types 
• Expensive & unattainable: $30,000+ per TB 
The Hadoop Way: compute (CPU), memory, and storage (disk) together on every node 
• Commodity “unreliable” servers 
• Hybrid open source software 
• Scales out forever 
• No bottlenecks 
• Easy to ingest any data 
• Agile data access 
• Affordable & attainable: $300–$1,000 per TB
CDH: the App Store for Hadoop 
8 
[Diagram: the CDH stack. Engines for batch processing (MapReduce), analytic MPP DBMS, search, in-memory processing, machine learning, NoSQL DBMS, and more sit side by side on shared resource management, storage, integration, and metadata layers, surrounded by support, data management, system management, and security.]
9 
Introduction to Apache Spark 
Credits: 
• Ben White 
• Todd Lipcon 
• Ted Malaska 
• Jairam Ranganathan 
• Jayant Shekhar 
• Sandy Ryza
Can we improve on MR? 
• Problems with MR: 
• Very low-level: requires a lot of code to do simple 
things 
• Very constrained: everything must be described as 
“map” and “reduce”. Powerful but sometimes 
difficult to think in these terms. 
10
Can we improve on MR? 
• Two approaches to improve on MapReduce: 
1. Special purpose systems to solve one problem domain 
well. 
• Giraph / Graphlab (graph processing) 
• Storm (stream processing) 
• Impala (real-time SQL) 
2. Generalize the capabilities of MapReduce to 
provide a richer foundation to solve problems. 
• Tez, MPI, Hama/Pregel (BSP), Dryad (arbitrary DAGs) 
Both are viable strategies depending on the problem! 
11
What is Apache Spark? 
Spark is a general purpose computational framework 
Retains the advantages of MapReduce: 
• Linear scalability 
• Fault-tolerance 
• Data Locality based computations 
…but offers so much more: 
• Leverages distributed memory for better performance 
• Supports iterative algorithms that are not feasible in MR 
• Improved developer experience 
• Full Directed Graph expressions for data parallel computations 
• Comes with libraries for machine learning, graph analysis, etc. 
12
What is Apache Spark? 
Run programs up to 100x faster than Hadoop 
MapReduce in memory, or 10x faster on disk. 
One of the largest open source projects in big data: 
• 170+ developers contributing 
• 30+ companies contributing 
• 400+ discussions per month on the mailing list 
13
Popular project 
14
Getting started with Spark 
• Java API 
• Interactive shells: 
• Scala (spark-shell) 
• Python (pyspark) 
15
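To make the shells concrete, here is a minimal first session; the file path is a placeholder, and pyspark works the same way from Python. 
$ spark-shell 
scala> val lines = sc.textFile("hdfs:///user/demo/data.txt")   // "sc" is pre-created by the shell 
scala> lines.filter(_.contains("ERROR")).count()               // runs a distributed job, returns a Long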
Execution modes 
16
Execution modes 
• Standalone Mode 
• Dedicated master and worker daemons 
• YARN Client Mode 
• Launches a YARN application with the 
driver program running locally 
• YARN Cluster Mode 
• Launches a YARN application with the 
driver program running in the YARN 
ApplicationMaster 
17 
[Callouts: Standalone Mode = a dedicated Spark runtime with static resource limits; the YARN modes = dynamic resource management shared between Spark, MR, Impala…]
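For reference, a hedged sketch of how each mode is selected when launching an application with spark-submit; the class, jar, and host names are placeholders, and the --master spellings are the Spark 1.x forms. 
spark-submit --master spark://master:7077 --class com.example.App app.jar   # Standalone Mode 
spark-submit --master yarn-client --class com.example.App app.jar           # YARN Client Mode 
spark-submit --master yarn-cluster --class com.example.App app.jar          # YARN Cluster Mode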
Spark Concepts 
18
RDD – Resilient Distributed Dataset 
• Collections of objects partitioned across a cluster 
• Stored in RAM or on Disk 
• You can control persistence and partitioning 
• Created by: 
• Distributing local collection objects 
• Transformation of data in storage 
• Transformation of RDDs 
• Automatically rebuilt on failure (resilient) 
• Contains lineage to compute from storage 
• Lazy materialization 
20
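A minimal spark-shell sketch of the creation routes and persistence controls above; the HDFS path is illustrative. 
val local  = sc.parallelize(1 to 1000)            // distribute a local collection 
val logs   = sc.textFile("hdfs:///logs/app.log")  // transformation of data in storage 
val errors = logs.filter(_.startsWith("ERROR"))   // transformation of another RDD 
errors.cache()                                    // ask Spark to keep this RDD in RAM after first use 
errors.count()                                    // lazy materialization: nothing runs until this action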
RDD transformations 
21
Operations on RDDs 
Transformations lazily transform an RDD 
into a new RDD
• map 
• flatMap 
• filter 
• sample 
• join 
• sort 
• reduceByKey 
• … 
Actions run computation to return a 
value 
• collect 
• reduce(func) 
• foreach(func) 
• count 
• first, take(n) 
• saveAs 
• … 
22
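A short Scala-shell sketch of the split above; the transformations only build a plan, and each action at the end actually executes it (file paths are placeholders). 
val words  = sc.textFile("hdfs:///books/moby.txt").flatMap(_.split(" "))   // lazy 
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)                     // still lazy 
counts.take(5)                        // action: runs the job and returns five (word, count) pairs 
counts.saveAsTextFile("hdfs:///out")  // action: runs the job again unless counts was cached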
Fault Tolerance 
• RDDs contain lineage. 
• Lineage – source location and list of transformations 
• Lost partitions can be re-computed from source data 
23 
msgs = textFile.filter(lambda s: s.startswith("ERROR")) \
               .map(lambda s: s.split("\t")[2])
[Lineage diagram: HDFS File → filter(func = startswith(…)) → Filtered RDD → map(func = split(…)) → Mapped RDD]
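The same pipeline in Scala, plus toDebugString, which prints the lineage Spark would replay to rebuild a lost partition; the log path and field index are illustrative. 
val msgs = sc.textFile("hdfs:///logs/app.log") 
             .filter(_.startsWith("ERROR")) 
             .map(_.split("\t")(2)) 
println(msgs.toDebugString)   // shows the chain: mapped RDD <- filtered RDD <- HDFS file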
24 
Examples
Word Count in MapReduce 
package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
}
25
Word Count in Spark 
sc.textFile("words") 
  .flatMap(line => line.split(" ")) 
  .map(word => (word, 1)) 
  .reduceByKey(_ + _) 
  .collect()
26
Logistic Regression 
• Read two sets of points 
• Looks for a plane W that separates them 
• Perform gradient descent: 
• Start with random W 
• On each iteration, sum a function of W over the data 
• Move W in a direction that improves it 
27
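A sketch of this loop in Spark's Scala API, patterned after the classic SparkLR example; the file path, the {-1, +1} label encoding, the iteration count, and the unit step size are all assumptions, not part of the deck. 
val points = sc.textFile("hdfs:///data/points.txt").map { line => 
  val nums = line.split("\\s+").map(_.toDouble) 
  (nums(0), nums.drop(1))                       // (label y in {-1, +1}, feature vector x) 
}.cache()                                       // reused on every iteration, so keep it in memory 

val d = points.first()._2.length 
var w = Array.fill(d)(math.random)              // start with a random W 

for (i <- 1 to 10) {                            // each iteration is one distributed pass 
  val grad = points.map { case (y, x) => 
    val dot = (0 until d).map(j => w(j) * x(j)).sum 
    val s   = (1.0 / (1.0 + math.exp(-y * dot)) - 1.0) * y 
    x.map(_ * s)                                // this point's contribution to the gradient 
  }.reduce((a, b) => Array.tabulate(d)(j => a(j) + b(j))) 
  w = Array.tabulate(d)(j => w(j) - grad(j))    // move W in a direction that improves it 
} 
The cached points RDD is what makes the repeated passes cheap, which is the gap the performance slide below illustrates.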
Intuition 
28
Logistic Regression 
29
Logistic Regression Performance 
30
31 
Spark and Hadoop: 
a Framework within a Framework
32
33 
[Diagram: the same CDH stack, now with Spark as a peer engine alongside MapReduce, HBase, Impala, and Solr, on shared resource management, storage, integration, and metadata layers, surrounded by support, data management, system management, and security.]
Spark Streaming 
• Takes the concept of RDDs and extends it to DStreams 
• Fault-tolerant like RDDs 
• Transformable like RDDs 
• Adds new “rolling window” operations (see the sketch after this slide) 
• Rolling averages, etc. 
• But keeps everything else! 
• Regular Spark code works in Spark Streaming 
• Can still access HDFS data, etc. 
• Example use cases: 
• “On-the-fly” ETL as data is ingested into Hadoop/HDFS. 
• Detecting anomalous behavior and triggering alerts. 
• Continuous reporting of summary metrics for incoming data. 
34
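A minimal sketch of a rolling windowed count with the DStream API; the host, port, batch interval, and window sizes are placeholders. 
import org.apache.spark.SparkConf 
import org.apache.spark.streaming.{Seconds, StreamingContext} 

val conf = new SparkConf().setAppName("RollingCounts") 
val ssc  = new StreamingContext(conf, Seconds(1))   // 1-second micro-batches 

val lines  = ssc.socketTextStream("localhost", 9999) 
val counts = lines.flatMap(_.split(" ")) 
                  .map(w => (w, 1))                 // regular Spark code, now over a DStream 
                  .reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))  // last 30s, recomputed every 10s 
counts.print() 

ssc.start() 
ssc.awaitTermination()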
Micro-batching for on-the-fly ETL 
35
What about SQL? 
36 
http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html 
http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
Fault Recovery Recap 
• RDDs store dependency graph 
• Because RDDs are deterministic: 
Missing RDDs are rebuilt in parallel on other nodes 
• Stateful RDDs can have infinite lineage 
• Periodic checkpoints to disk clear the lineage 
• Faster recovery times 
• Better handling of stragglers vs row-by-row streaming 
37
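A minimal shell sketch of truncating a long lineage with a checkpoint, per the recap above; the checkpoint directory is illustrative. 
sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints") 
var rdd = sc.parallelize(1 to 1000000) 
for (i <- 1 to 100) rdd = rdd.map(_ + 1)   // lineage grows by one step per transformation 
rdd.checkpoint()                           // the next action also writes the data to disk... 
rdd.count()                                // ...after which recovery starts from the checkpoint, not the source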
Why Spark? 
• Flexible like MapReduce 
• High performance 
• Machine learning, 
iterative algorithms 
• Interactive data 
explorations 
• Concise, easy API for 
developer productivity 
38
39 
Demo Time! 
• Log file Analysis 
• Machine Learning 
• Spark Streaming
What’s Next? 
• Download Hadoop! 
• CDH available at www.cloudera.com 
• Try it online: Cloudera Live 
• Cloudera provides pre-loaded VMs 
• http://tiny.cloudera.com/quickstartvm 
40
41 
Questions? 
Preferably related to the talk… or not.
42 
Thank You! 
Maxime Dumas 
mdumas@cloudera.com 
We’re hiring.
43


Editor's Notes

  • #4: Similar to the Red Hat model. Hadoop elephant logo licensed for public use via Apache license: Apache Software Foundation, http://www.apache.org/foundation/marks/
  • #6: What’s the best bang-per-buck you can get for computational capacity? Intel boxes running Linux! Sharded databases (and shared-nothing architectures like Teradata) look like this. Let’s just get a ton of them, and figure out the rest with software (fault tolerance, distributed process management, job control, etc.). And the rest is history: this is how Google, Facebook, and Yahoo build out their data centers.
  • #7: So does this approach work for more than just MapReduce? Sure! You’ve got all this CPU and memory sitting here – let’s spin up more than just MapReduce. Let’s add an agent for a distributed SQL query engine, a distributed NoSQL database, distributed search…
  • #8: Pricing data. Cloudera: HW + SW per-year list prices for Basic through EDH at various configs. Old Way: various sources, one of note being Cowen / Goldmacher’s coverage initiation of Teradata (June 17, 2013): list price of the high-end appliance (which he thinks is more comparable to our solution) is $57K/TB + maintenance, for an annual cost of $39K/TB; prices have likely decreased, but we estimate they are still in excess of $30K/TB/year; list price of their low-end appliance is $12K/TB + maintenance, or $8K per year.