Full stack analytics with Hadoop 2 
Trento, 2014-09-11
CS.ML 
Data Scientist 
ML & Data Mining 
Academia & Industry 

Learning Hadoop 2 for Packt Publishing (together with Garry Turkington). TBD.
This talk is about tools
Your mileage may vary
I will avoid benchmarks
Back in 2012 
HDFS 
Google paper (2003) 
Distributed storage 
Block ops 
(Diagram: a Name Node coordinating block operations across multiple Data Nodes)
MapReduce 
Google paper (2006) 
Divide and conquer functional model 
Concepts from database research 
Batch workloads 
Aggregation operations (e.g. GROUP BY)
Two phases 
Map 
Reduce
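To make the two phases concrete, here is a minimal word count in the classic Hadoop MapReduce Java API. This sketch is not from the original deck; class names are illustrative.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: tokenize each input line into (word, 1) pairs.
class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, ONE);
    }
  }
}

// Reduce phase: the framework groups by key; we sum the counts per word.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    context.write(key, new IntWritable(sum));
  }
}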
Programs are chains 
of jobs
All in all 
Great when records (jobs) are independent 
Composability monsters 
Computation vs. communication tradeoff 
Low level API 
Tuning required
Computation with MapReduce 
CRUNCH
Higher level abstractions, still geared towards batch loads
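For a taste of what that higher level looks like, here is word count with Crunch. A minimal sketch, assuming the Crunch 0.x MRPipeline API; it is not from the original deck.

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.types.writable.Writables;

public class CrunchWordCount {
  public static void main(String[] args) {
    Pipeline pipeline = new MRPipeline(CrunchWordCount.class);
    PCollection<String> lines = pipeline.readTextFile(args[0]);

    // Split lines into words; Crunch plans the underlying MapReduce jobs.
    PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
      @Override
      public void process(String line, Emitter<String> emitter) {
        for (String word : line.split("\\s+")) {
          emitter.emit(word);
        }
      }
    }, Writables.strings());

    // count() performs the group-by and aggregation.
    PTable<String, Long> counts = words.count();
    pipeline.writeTextFile(counts, args[1]);
    pipeline.done();
  }
}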
Dremel (Impala, Drill) 
Google paper (2010) 
Access blocks directly from data nodes (partition the fs namespace) 
Columnar store (optimize for OLAP) 
Appeals to database / BI crowds 
Ridiculously fast (as long as you have memory)
Computation beyond MapReduce 
Iterative workloads 
Low latency queries 
Real-time computation 
High level abstractions
Hadoop 2 
Applications (Hive, Pig, Crunch, Cascading, etc…) 
Batch (MapReduce) | Streaming (Storm, Spark, Samza) | In memory (Spark) | Interactive (Tez) | Graph (Giraph) | HPC (MPI) 
Resource Management (YARN) 
HDFS
Full stack analytics with Hadoop 2
Tez (Dryad) 
Microsoft paper (2007) 
Generalization of MapReduce as dataflow 
Express dependencies, I/O pipelining 
Low level API for building DAGs 
Mainly an execution engine (Hive-on-Tez, Pig-on-Tez)
DAG dag = new DAG("WordCount"); 
dag.addVertex(tokenizerVertex) 
.addVertex(summerVertex) 
.addEdge( 
new Edge(tokenizerVertex, summerVertex, 
edgeConf.createDefaultEdgeProperty())); 
package org.apache.tez.mapreduce.examples;

import java.io.IOException;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileAlreadyExistsException;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.hadoop.yarn.api.records.LocalResource;
import org.apache.tez.client.TezClient;
import org.apache.tez.dag.api.DAG;
import org.apache.tez.dag.api.Edge;
import org.apache.tez.dag.api.InputDescriptor;
import org.apache.tez.dag.api.OutputDescriptor;
import org.apache.tez.dag.api.ProcessorDescriptor;
import org.apache.tez.dag.api.TezConfiguration;
import org.apache.tez.dag.api.Vertex;
import org.apache.tez.dag.api.client.DAGClient;
import org.apache.tez.dag.api.client.DAGStatus;
import org.apache.tez.mapreduce.committer.MROutputCommitter;
import org.apache.tez.mapreduce.common.MRInputAMSplitGenerator;
import org.apache.tez.mapreduce.hadoop.MRHelpers;
import org.apache.tez.mapreduce.input.MRInput;
import org.apache.tez.mapreduce.output.MROutput;
import org.apache.tez.mapreduce.processor.SimpleMRProcessor;
import org.apache.tez.runtime.api.Output;
import org.apache.tez.runtime.library.api.KeyValueReader;
import org.apache.tez.runtime.library.api.KeyValueWriter;
import org.apache.tez.runtime.library.api.KeyValuesReader;
import org.apache.tez.runtime.library.conf.OrderedPartitionedKVEdgeConfigurer;
import org.apache.tez.runtime.library.partitioner.HashPartitioner;

import com.google.common.base.Preconditions;

public class WordCount extends Configured implements Tool {

  public static class TokenProcessor extends SimpleMRProcessor {
    IntWritable one = new IntWritable(1);
    Text word = new Text();

    @Override
    public void run() throws Exception {
      Preconditions.checkArgument(getInputs().size() == 1);
      Preconditions.checkArgument(getOutputs().size() == 1);
      MRInput input = (MRInput) getInputs().values().iterator().next();
      KeyValueReader kvReader = input.getReader();
      Output output = getOutputs().values().iterator().next();
      KeyValueWriter kvWriter = (KeyValueWriter) output.getWriter();
      while (kvReader.next()) {
        StringTokenizer itr = new StringTokenizer(kvReader.getCurrentValue().toString());
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          kvWriter.write(word, one);
        }
      }
    }
  }

  public static class SumProcessor extends SimpleMRProcessor {
    @Override
    public void run() throws Exception {
      Preconditions.checkArgument(getInputs().size() == 1);
      MROutput out = (MROutput) getOutputs().values().iterator().next();
      KeyValueWriter kvWriter = out.getWriter();
      KeyValuesReader kvReader = (KeyValuesReader) getInputs().values().iterator().next()
          .getReader();
      while (kvReader.next()) {
        Text word = (Text) kvReader.getCurrentKey();
        int sum = 0;
        for (Object value : kvReader.getCurrentValues()) {
          sum += ((IntWritable) value).get();
        }
        kvWriter.write(word, new IntWritable(sum));
      }
    }
  }

  private DAG createDAG(FileSystem fs, TezConfiguration tezConf,
      Map<String, LocalResource> localResources, Path stagingDir,
      String inputPath, String outputPath) throws IOException {
    Configuration inputConf = new Configuration(tezConf);
    inputConf.set(FileInputFormat.INPUT_DIR, inputPath);
    InputDescriptor id = new InputDescriptor(MRInput.class.getName())
        .setUserPayload(MRInput.createUserPayload(inputConf,
            TextInputFormat.class.getName(), true, true));

    Configuration outputConf = new Configuration(tezConf);
    outputConf.set(FileOutputFormat.OUTDIR, outputPath);
    OutputDescriptor od = new OutputDescriptor(MROutput.class.getName())
        .setUserPayload(MROutput.createUserPayload(
            outputConf, TextOutputFormat.class.getName(), true));

    Vertex tokenizerVertex = new Vertex("tokenizer", new ProcessorDescriptor(
        TokenProcessor.class.getName()), -1, MRHelpers.getMapResource(tezConf));
    tokenizerVertex.addInput("MRInput", id, MRInputAMSplitGenerator.class);

    Vertex summerVertex = new Vertex("summer",
        new ProcessorDescriptor(
            SumProcessor.class.getName()), 1, MRHelpers.getReduceResource(tezConf));
    summerVertex.addOutput("MROutput", od, MROutputCommitter.class);

    OrderedPartitionedKVEdgeConfigurer edgeConf = OrderedPartitionedKVEdgeConfigurer
        .newBuilder(Text.class.getName(), IntWritable.class.getName(),
            HashPartitioner.class.getName(), null).build();

    DAG dag = new DAG("WordCount");
    dag.addVertex(tokenizerVertex)
        .addVertex(summerVertex)
        .addEdge(
            new Edge(tokenizerVertex, summerVertex, edgeConf.createDefaultEdgeProperty()));
    return dag;
  }

  private static void printUsage() {
    System.err.println("Usage: " + " wordcount <in1> <out1>");
    ToolRunner.printGenericCommandUsage(System.err);
  }

  public boolean run(String inputPath, String outputPath, Configuration conf) throws Exception {
    System.out.println("Running WordCount");
    // conf and UGI
    TezConfiguration tezConf;
    if (conf != null) {
      tezConf = new TezConfiguration(conf);
    } else {
      tezConf = new TezConfiguration();
    }
    UserGroupInformation.setConfiguration(tezConf);
    String user = UserGroupInformation.getCurrentUser().getShortUserName();

    // staging dir
    FileSystem fs = FileSystem.get(tezConf);
    String stagingDirStr = Path.SEPARATOR + "user" + Path.SEPARATOR
        + user + Path.SEPARATOR + ".staging" + Path.SEPARATOR
        + Path.SEPARATOR + Long.toString(System.currentTimeMillis());
    Path stagingDir = new Path(stagingDirStr);
    tezConf.set(TezConfiguration.TEZ_AM_STAGING_DIR, stagingDirStr);
    stagingDir = fs.makeQualified(stagingDir);

    // No need to add jar containing this class as assumed to be part of
    // the tez jars.
    // TEZ-674 Obtain tokens based on the Input / Output paths. For now assuming staging dir
    // is the same filesystem as the one used for Input/Output.
    TezClient tezSession = new TezClient("WordCountSession", tezConf);
    tezSession.start();

    DAGClient dagClient = null;
    try {
      if (fs.exists(new Path(outputPath))) {
        throw new FileAlreadyExistsException("Output directory "
            + outputPath + " already exists");
      }
      Map<String, LocalResource> localResources =
          new TreeMap<String, LocalResource>();
      DAG dag = createDAG(fs, tezConf, localResources,
          stagingDir, inputPath, outputPath);
      tezSession.waitTillReady();
      dagClient = tezSession.submitDAG(dag);

      // monitoring
      DAGStatus dagStatus = dagClient.waitForCompletionWithAllStatusUpdates(null);
      if (dagStatus.getState() != DAGStatus.State.SUCCEEDED) {
        System.out.println("DAG diagnostics: " + dagStatus.getDiagnostics());
        return false;
      }
      return true;
    } finally {
      fs.delete(stagingDir, true);
      tezSession.stop();
    }
  }

  @Override
  public int run(String[] args) throws Exception {
    Configuration conf = getConf();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      printUsage();
      return 2;
    }
    WordCount job = new WordCount();
    job.run(otherArgs[0], otherArgs[1], conf);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new WordCount(), args);
    System.exit(res);
  }
}
Spark 
AMPLab paper (2010), builds on Dryad 
Resilient Distributed Datasets (RDDs) 
High level API (and a REPL) 
Also an execution engine (Hive-on-Spark, Pig-on-Spark)
// spark is a JavaSparkContext
JavaRDD<String> file = spark.textFile("hdfs://infile.txt");

JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() {
  public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); }
});

JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
  public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); }
});

JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
  public Integer call(Integer a, Integer b) { return a + b; }
});

counts.saveAsTextFile("hdfs://outfile.txt");
Rule of thumb 
Avoid spill-to-disk 
Spark and Tez don’t mix well 
Join on 50+ TB = Hive+Tez, MapReduce 
Direct access to API (in memory) = Spark 
OLAP = Hive+Tez, Cloudera Impala
Good stuff. So what?
The data <adjective> 
(Diagram: ingestion from sources such as S3, MySQL, NFS, … into HDFS, with metadata, processing, and workflow coordination layered on top)
Analytics on Hadoop 2 
Batch & interactive 
Datawarehousing & computing 
Dataset size and velocity 
Integrations with existing tools 
Distributions will constrain your stack
Use cases 
Datawarehousing 
Exploratory Data Analysis 
Stream processing 
Predictive Analytics
Datawarehousing 
Data ingestion 
Pipelines 
Transform and enrich (ETL) queries - batch 
Low latency (presentation) queries - interactive 
Interoperable data formats and metadata 
Workflow Orchestration
Collection and ingestion 
$ hadoop distcp 
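A minimal example of that: copying a dataset from S3 into HDFS in one line (bucket and paths are hypothetical):

$ hadoop distcp s3n://my-bucket/tweets hdfs://namenode/data/tweets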
Once data is in HDFS
Apache Hive 
HiveQL 
Data stored on HDFS 
Metadata kept in MySQL (metastore) 
Metadata exposed to third parties (HCatalog) 
Suitable both for interactive and batch queries
set hive.execution.engine=tez
set hive.execution.engine=mr
The nature of Hive tables 
CREATE TABLE and (LOAD DATA) produce metadata 
Schema based on the data “as it has already arrived” 
Data files underlying a Hive table are no different from any other file on HDFS 
Primitive types behave as in Java
Data formats 
Record oriented (Avro, text) 
Column oriented (Parquet, ORC)
Text (tab separated) 
create external table tweets 
( 
created_at string, 
tweet_id string, 
text string, 
in_reply_to string, 
retweeted boolean, 
user_id string, 
place_id string 
) ROW FORMAT DELIMITED 
FIELDS TERMINATED BY 't' 
STORED AS TEXTFILE 
LOCATION ‘$input’ 
$ hadoop fs -cat /data/tweets.tsv 
2014-03-12T17:34:26.000Z!443802208698908672! Oh &amp; I'm chuffed for 
@GeraintThomas86, doing Wales proud in yellow!! #ParisNice #Cymru! NULL! 
223224878! NULL 
2014-03-12T17:34:26.000Z!443802208706908160! Stalker48 Kembali Lagi Cek 
Disini http://t.co/4BMTFByFH5 236! NULL! 629845435! NULL 
2014-03-12T17:34:26.000Z!443802208728268800! @Piconn ou melhor, eu era :c 
mudei! NULL! 255768055! NULL 
2014-03-12T17:34:26.000Z!443802208698912768! I swear Ryan's always in his 
own world. He's always like 4 hours behind everyone else.! NULL! 
2379282889! NULL 
2014-03-12T17:34:26.000Z!443802208702713856! @maggersforever0 lmfao you 
gotta see this, its awesome http://t.co/1PvXEELlqi! NULL! 355858832! 
NULL 
2014-03-12T17:34:26.000Z!443802208698896384! Crazy... http://t.co/G4QRMSKGkh! NULL! 106439395! NULL!
SELECT COUNT(*) 
FROM tweets
Apache Avro 
Record oriented 
Migrations (forward, backward) 
Schema on write 
Interoperability 

{ 
  "namespace": "com.mycompany.avrotables", 
  "name": "tweets", 
  "type": "record", 
  "fields": [ 
    {"name": "created_at", "type": "string", "doc": "date_time of tweet"}, 
    {"name": "tweet_id_str", "type": "string"}, 
    {"name": "text", "type": "string"}, 
    {"name": "in_reply_to", "type": ["string", "null"]}, 
    {"name": "is_retweeted", "type": ["string", "null"]}, 
    {"name": "user_id", "type": "string"}, 
    {"name": "place_id", "type": ["string", "null"]} 
  ] 
} 

CREATE TABLE tweets 
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe' 
WITH SERDEPROPERTIES ( 
  'avro.schema.url'='hdfs:///schema/avro/tweets_avro.avsc' 
) 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'; 

insert into table tweets select * from tweets_ext;
Some thoughts on schemas 
Only make additive changes (see the example below) 
Think about schema distribution 
Manage schema versions explicitly
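As a minimal illustration of an additive change (the field is hypothetical, not from the deck): add an optional field with a default, so readers and writers on the old schema keep working:

{"name": "lang", "type": ["null", "string"], "default": null}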
Parquet 
Ad hoc use case 
Cloudera Impala’s default file format 
Execution engine agnostic 
HIVE-5783 
Let it handle block size 

create table tweets ( 
  created_at string, 
  tweet_id string, 
  text string, 
  in_reply_to string, 
  retweeted boolean, 
  user_id string, 
  place_id string 
) STORED AS PARQUET; 

insert into table tweets 
select * from tweets_ext;
If possible, use both
Table Optimization 
Create tables with workloads in mind 
Partitions 
Bucketing 
Join strategies
Plenty of tunables 
# partitions 
SET hive.exec.dynamic.partition=true; 
SET hive.exec.dynamic.partition.mode=nonstrict; 
SET hive.exec.max.dynamic.partitions.pernode=10000; 
SET hive.exec.max.dynamic.partitions=100000; 
SET hive.exec.max.created.files=1000000; 
! 
# merge small files 
SET hive.merge.size.per.task=256000000; 
SET hive.merge.mapfiles=true; 
SET hive.merge.mapredfiles=true; 
SET hive.merge.smallfiles.avgsize=16000000; 
# Compression 
SET mapred.output.compress=true; 
SET mapred.output.compression.type=BLOCK; 
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.Lz4Codec; 
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.Lz4Codec; 
Apache Oozie 
Data pipelines 
Workflow execution and coordination 
Time and availability based execution 
Configuration over code 
MapReduce centric 
Actions: Hive, Pig, fs, shell, sqoop 

<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:0.1">
  ...
  <action name="[NODE-NAME]">
    <hive xmlns="uri:oozie:hive-action:0.2">
      <job-tracker>[JOB-TRACKER]</job-tracker>
      <name-node>[NAME-NODE]</name-node>
      <prepare>
        <delete path="[PATH]"/>
        ...
        <mkdir path="[PATH]"/>
        ...
      </prepare>
      <job-xml>[HIVE SETTINGS FILE]</job-xml>
      <configuration>
        <property>
          <name>[PROPERTY-NAME]</name>
          <value>[PROPERTY-VALUE]</value>
        </property>
        ...
      </configuration>
      <script>[HIVE-SCRIPT]</script>
      <param>[PARAM-VALUE]</param>
      ...
      <param>[PARAM-VALUE]</param>
      <file>[FILE-PATH]</file>
      ...
      <archive>[FILE-PATH]</archive>
      ...
    </hive>
    <ok to="[NODE-NAME]"/>
    <error to="[NODE-NAME]"/>
  </action>
  ...
</workflow-app>
EDA 
(Figure: luminosity in xkcd comics, courtesy of R-bloggers)
Sample the dataset
Use Hive-on-Tez, Impala
Spark & IPython Notebook 

from pyspark import SparkContext
sc = SparkContext(CLUSTER_URL, 'ipython-notebook')

Works with Avro, Parquet etc. 
Move computation close to data 
NumPy, scikit-learn, matplotlib 
Setup can be tedious
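Tying this back to “sample the dataset”: once the context is up, pulling a small sample into the notebook is a one-liner (the path and fraction are hypothetical):

sample = sc.textFile('hdfs:///data/tweets.tsv').sample(False, 0.01).collect()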
Stream processing 
Statistics in real time 
Data feeds 
Machine generated (sensor data, logs) 
Predictive analytics
Several niches 
Low latency (Storm, S4) 
Persistency and resiliency (Samza) 
Apply complex logic (Spark Streaming) 
Type of message stream (Kafka)
Apache Samza 
Kafka for streaming 
YARN for resource management and execution 
Samza API for processing 
Sweet spot: seconds, minutes 
(Stack: Samza API on top of YARN on top of Kafka)
public void process( 
IncomingMessageEnvelope envelope, 
MessageCollector collector, 
TaskCoordinator coordinator)
public void window( 
MessageCollector collector, 
TaskCoordinator coordinator)
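Put together, a minimal task implementing both callbacks might look like this. A sketch only; system and stream names are hypothetical, and window() fires on the task.window.ms timer.

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;
import org.apache.samza.task.WindowableTask;

// Counts messages per window and emits the count downstream.
public class TweetCounterTask implements StreamTask, WindowableTask {
  private static final SystemStream OUTPUT = new SystemStream("kafka", "tweet-counts");
  private int count = 0;

  @Override
  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    count++;  // called once per incoming message
  }

  @Override
  public void window(MessageCollector collector, TaskCoordinator coordinator) {
    // called on a timer; flush the running count and reset
    collector.send(new OutgoingMessageEnvelope(OUTPUT, String.valueOf(count)));
    count = 0;
  }
}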
Bootstrap streams 
Samza can consume messages from multiple streams 
Rewind on historical data does not preserve ordering 
If a task has any bootstrap streams defined, it will read these streams until they are fully processed
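In job configuration, declaring a bootstrap stream looks roughly like this (a sketch using Samza 0.8-era property names; the stream names are hypothetical):

task.inputs=kafka.tweets,kafka.user-profiles
systems.kafka.streams.user-profiles.samza.bootstrap=true
systems.kafka.streams.user-profiles.samza.offset.default=oldest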
Predictive modelling 
Learning from data 
Predictive model = statistical learning 
Simple = parallelizable 
Garbage in = garbage out
Couple of things we can do 
1. Parameter tuning 
2. Feature engineering 
3. Learn on all data 
Train against all data 
Ensemble methods (cooperative and competitive) 
Avoid multi pass / iterations 
Apply models to live data 
Keep models up to date
Off the shelf 
Apache Mahout (MapReduce, Spark) 
MLlib (Spark) 
Cascading-pattern (MapReduce, Tez, Spark)
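As a sketch of the MLlib route (Spark 1.x API; the path, master, and parameters are hypothetical): train a logistic regression model on data in HDFS, then score records against it.

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.classification.LogisticRegressionModel;
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.util.MLUtils;

public class TrainModel {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext("yarn-client", "train-model");
    // LabeledPoint = (label, feature vector); LibSVM is one of the supported input formats
    JavaRDD<LabeledPoint> training =
        MLUtils.loadLibSVMFile(sc.sc(), "hdfs:///data/train.libsvm").toJavaRDD();
    LogisticRegressionModel model =
        LogisticRegressionWithSGD.train(training.rdd(), 100);
    // score a record against the trained model
    double prediction = model.predict(training.first().features());
    System.out.println("prediction: " + prediction);
  }
}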
Apache Mahout 0.9 
Once the default solution for ML with MapReduce 
Quality may vary 
Good components are really good 
Is it a library? A framework? A recommendation system?
The good 
The go-to if you need a Recommendation System 
SGD (optimization) 
Random Forest (classification/regression) 
SVD (feature engineering) 
ALS (collaborative filtering)
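As an illustration of the recommender side, here is a minimal user-based recommender with Mahout's non-distributed Taste API. A sketch, not from the deck; the ratings file and user id are hypothetical.

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class UserRecommender {
  public static void main(String[] args) throws Exception {
    // CSV of userID,itemID,rating triples
    DataModel model = new FileDataModel(new File("ratings.csv"));
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
    // top 5 items for user 42
    List<RecommendedItem> recs = recommender.recommend(42L, 5);
    for (RecommendedItem item : recs) {
      System.out.println(item);
    }
  }
}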
The puzzling 
SVM? 
Model updates are implementation specific! 
Feature encoding and input format are often model specific
Apache Mahout trunk 
Moving away from MapReduce 
Spark + Scala DSL = new classes of algorithms 
Major code cleanup
It needs major infrastructure work around it
batch + streaming
There’s a buzzword for that 
http://lambda-architecture.net/
Wrap up
With Hadoop 2 
Cluster as an Operating System 
YARN, mostly 
Multiparadigm, better interop 
Same system, different tools, multiple use cases 
Batch + interactive
This said 
Ops is where a lot of time goes 
Building clusters is hard 
Distro fragmentation 
Bleeding edge rush 
Heavy lifting needed
That’s all, folks
Thanks for having me
Let’s discuss