Full stack analytics with Hadoop 2 
Trento, 2014-09-11
CS.ML 
Data Scientist 
ML & Data Mining 
Academia & Industry 

Learning Hadoop 2 for Packt Publishing (together with Garry Turkington). TBD.
This talk is about tools
Your mileage may vary
I will avoid benchmarks
Back in 2012 
HDFS 
Google paper (2003) 
Distributed storage 
Block ops 
(Diagram: a Name Node coordinating block operations across multiple Data Nodes)
MapReduce 
Google paper (2006) 
Divide and conquer functional model 
Concepts from database research 
Batch workloads 
Aggregation operations (e.g. GROUP BY)
Two phases 
Map 
Reduce
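To make the two phases concrete, here is a minimal word count in the classic Hadoop MapReduce Java API. This sketch is not from the original deck; class names are illustrative.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: tokenize each input line into (word, 1) pairs.
class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, ONE);
    }
  }
}

// Reduce phase: the framework groups by key; we sum the counts per word.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    context.write(key, new IntWritable(sum));
  }
}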
Programs are chains 
of jobs
All in all 
Great when records (jobs) are independent 
Composability monsters 
Computation vs. communication tradeoff 
Low level API 
Tuning required
Computation with MapReduce 
CRUNCH
Higher level abstractions, still geared towards batch loads
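For a taste of what that higher level looks like, here is word count with Crunch. A minimal sketch, assuming the Crunch 0.x MRPipeline API; it is not from the original deck.

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.types.writable.Writables;

public class CrunchWordCount {
  public static void main(String[] args) {
    Pipeline pipeline = new MRPipeline(CrunchWordCount.class);
    PCollection<String> lines = pipeline.readTextFile(args[0]);

    // Split lines into words; Crunch plans the underlying MapReduce jobs.
    PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
      @Override
      public void process(String line, Emitter<String> emitter) {
        for (String word : line.split("\\s+")) {
          emitter.emit(word);
        }
      }
    }, Writables.strings());

    // count() performs the group-by and aggregation.
    PTable<String, Long> counts = words.count();
    pipeline.writeTextFile(counts, args[1]);
    pipeline.done();
  }
}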
Dremel (Impala, Drill) 
Google paper (2010) 
Access blocks directly from data nodes (partition the fs namespace) 
Columnar store (optimize for OLAP) 
Appeals to database / BI crowds 
Ridiculously fast (as long as you have memory)
Computation beyond MapReduce 
Iterative workloads 
Low latency queries 
Real-time computation 
High level abstractions
Hadoop 2 
Applications (Hive, Pig, Crunch, Cascading, etc…) 
Batch (MapReduce) | Streaming (Storm, Spark, Samza) | In memory (Spark) | Interactive (Tez) | Graph (Giraph) | HPC (MPI) 
Resource Management (YARN) 
HDFS
Full stack analytics with Hadoop 2
Tez (Dryad) 
Microsoft paper (2007) 
Generalization of MapReduce as dataflow 
Express dependencies, I/O pipelining 
Low level API for building DAGs 
Mainly an execution engine (Hive-on-Tez, Pig-on-Tez)
DAG dag = new DAG("WordCount"); 
dag.addVertex(tokenizerVertex) 
.addVertex(summerVertex) 
.addEdge( 
new Edge(tokenizerVertex, summerVertex, 
edgeConf.createDefaultEdgeProperty())); 
package org.apache.tez.mapreduce.examples;

import java.io.IOException;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileAlreadyExistsException;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.hadoop.yarn.api.records.LocalResource;
import org.apache.tez.client.TezClient;
import org.apache.tez.dag.api.DAG;
import org.apache.tez.dag.api.Edge;
import org.apache.tez.dag.api.InputDescriptor;
import org.apache.tez.dag.api.OutputDescriptor;
import org.apache.tez.dag.api.ProcessorDescriptor;
import org.apache.tez.dag.api.TezConfiguration;
import org.apache.tez.dag.api.Vertex;
import org.apache.tez.dag.api.client.DAGClient;
import org.apache.tez.dag.api.client.DAGStatus;
import org.apache.tez.mapreduce.committer.MROutputCommitter;
import org.apache.tez.mapreduce.common.MRInputAMSplitGenerator;
import org.apache.tez.mapreduce.hadoop.MRHelpers;
import org.apache.tez.mapreduce.input.MRInput;
import org.apache.tez.mapreduce.output.MROutput;
import org.apache.tez.mapreduce.processor.SimpleMRProcessor;
import org.apache.tez.runtime.api.Output;
import org.apache.tez.runtime.library.api.KeyValueReader;
import org.apache.tez.runtime.library.api.KeyValueWriter;
import org.apache.tez.runtime.library.api.KeyValuesReader;
import org.apache.tez.runtime.library.conf.OrderedPartitionedKVEdgeConfigurer;
import org.apache.tez.runtime.library.partitioner.HashPartitioner;

import com.google.common.base.Preconditions;

public class WordCount extends Configured implements Tool {

  public static class TokenProcessor extends SimpleMRProcessor {
    IntWritable one = new IntWritable(1);
    Text word = new Text();

    @Override
    public void run() throws Exception {
      Preconditions.checkArgument(getInputs().size() == 1);
      Preconditions.checkArgument(getOutputs().size() == 1);
      MRInput input = (MRInput) getInputs().values().iterator().next();
      KeyValueReader kvReader = input.getReader();
      Output output = getOutputs().values().iterator().next();
      KeyValueWriter kvWriter = (KeyValueWriter) output.getWriter();
      while (kvReader.next()) {
        StringTokenizer itr = new StringTokenizer(kvReader.getCurrentValue().toString());
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          kvWriter.write(word, one);
        }
      }
    }
  }

  public static class SumProcessor extends SimpleMRProcessor {
    @Override
    public void run() throws Exception {
      Preconditions.checkArgument(getInputs().size() == 1);
      MROutput out = (MROutput) getOutputs().values().iterator().next();
      KeyValueWriter kvWriter = out.getWriter();
      KeyValuesReader kvReader = (KeyValuesReader) getInputs().values().iterator().next()
          .getReader();
      while (kvReader.next()) {
        Text word = (Text) kvReader.getCurrentKey();
        int sum = 0;
        for (Object value : kvReader.getCurrentValues()) {
          sum += ((IntWritable) value).get();
        }
        kvWriter.write(word, new IntWritable(sum));
      }
    }
  }

  private DAG createDAG(FileSystem fs, TezConfiguration tezConf,
      Map<String, LocalResource> localResources, Path stagingDir,
      String inputPath, String outputPath) throws IOException {
    Configuration inputConf = new Configuration(tezConf);
    inputConf.set(FileInputFormat.INPUT_DIR, inputPath);
    InputDescriptor id = new InputDescriptor(MRInput.class.getName())
        .setUserPayload(MRInput.createUserPayload(inputConf,
            TextInputFormat.class.getName(), true, true));

    Configuration outputConf = new Configuration(tezConf);
    outputConf.set(FileOutputFormat.OUTDIR, outputPath);
    OutputDescriptor od = new OutputDescriptor(MROutput.class.getName())
        .setUserPayload(MROutput.createUserPayload(
            outputConf, TextOutputFormat.class.getName(), true));

    Vertex tokenizerVertex = new Vertex("tokenizer", new ProcessorDescriptor(
        TokenProcessor.class.getName()), -1, MRHelpers.getMapResource(tezConf));
    tokenizerVertex.addInput("MRInput", id, MRInputAMSplitGenerator.class);

    Vertex summerVertex = new Vertex("summer",
        new ProcessorDescriptor(
            SumProcessor.class.getName()), 1, MRHelpers.getReduceResource(tezConf));
    summerVertex.addOutput("MROutput", od, MROutputCommitter.class);

    OrderedPartitionedKVEdgeConfigurer edgeConf = OrderedPartitionedKVEdgeConfigurer
        .newBuilder(Text.class.getName(), IntWritable.class.getName(),
            HashPartitioner.class.getName(), null).build();

    DAG dag = new DAG("WordCount");
    dag.addVertex(tokenizerVertex)
        .addVertex(summerVertex)
        .addEdge(
            new Edge(tokenizerVertex, summerVertex, edgeConf.createDefaultEdgeProperty()));
    return dag;
  }

  private static void printUsage() {
    System.err.println("Usage: " + " wordcount <in1> <out1>");
    ToolRunner.printGenericCommandUsage(System.err);
  }

  public boolean run(String inputPath, String outputPath, Configuration conf) throws Exception {
    System.out.println("Running WordCount");
    // conf and UGI
    TezConfiguration tezConf;
    if (conf != null) {
      tezConf = new TezConfiguration(conf);
    } else {
      tezConf = new TezConfiguration();
    }
    UserGroupInformation.setConfiguration(tezConf);
    String user = UserGroupInformation.getCurrentUser().getShortUserName();

    // staging dir
    FileSystem fs = FileSystem.get(tezConf);
    String stagingDirStr = Path.SEPARATOR + "user" + Path.SEPARATOR
        + user + Path.SEPARATOR + ".staging" + Path.SEPARATOR
        + Path.SEPARATOR + Long.toString(System.currentTimeMillis());
    Path stagingDir = new Path(stagingDirStr);
    tezConf.set(TezConfiguration.TEZ_AM_STAGING_DIR, stagingDirStr);
    stagingDir = fs.makeQualified(stagingDir);

    // No need to add jar containing this class as assumed to be part of
    // the tez jars.
    // TEZ-674 Obtain tokens based on the Input / Output paths. For now assuming staging dir
    // is the same filesystem as the one used for Input/Output.
    TezClient tezSession = new TezClient("WordCountSession", tezConf);
    tezSession.start();

    DAGClient dagClient = null;
    try {
      if (fs.exists(new Path(outputPath))) {
        throw new FileAlreadyExistsException("Output directory "
            + outputPath + " already exists");
      }
      Map<String, LocalResource> localResources =
          new TreeMap<String, LocalResource>();
      DAG dag = createDAG(fs, tezConf, localResources,
          stagingDir, inputPath, outputPath);
      tezSession.waitTillReady();
      dagClient = tezSession.submitDAG(dag);

      // monitoring
      DAGStatus dagStatus = dagClient.waitForCompletionWithAllStatusUpdates(null);
      if (dagStatus.getState() != DAGStatus.State.SUCCEEDED) {
        System.out.println("DAG diagnostics: " + dagStatus.getDiagnostics());
        return false;
      }
      return true;
    } finally {
      fs.delete(stagingDir, true);
      tezSession.stop();
    }
  }

  @Override
  public int run(String[] args) throws Exception {
    Configuration conf = getConf();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      printUsage();
      return 2;
    }
    WordCount job = new WordCount();
    job.run(otherArgs[0], otherArgs[1], conf);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new WordCount(), args);
    System.exit(res);
  }
}
Spark 
AMPLab paper (2010), builds on Dryad 
Resilient Distributed Datasets (RDDs) 
High level API (and a REPL) 
Also an execution engine (Hive-on-Spark, Pig-on-Spark)
// spark is a JavaSparkContext
JavaRDD<String> file = spark.textFile("hdfs://infile.txt");

JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() {
  public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); }
});

JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
  public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); }
});

JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
  public Integer call(Integer a, Integer b) { return a + b; }
});

counts.saveAsTextFile("hdfs://outfile.txt");
Rule of thumb 
Avoid spill-to-disk 
Spark and Tez don’t mix well 
Join on 50+ TB = Hive+Tez, MapReduce 
Direct access to API (in memory) = Spark 
OLAP = Hive+Tez, Cloudera Impala
Good stuff. So what?
The data <adjective> 
(Diagram: ingestion from sources such as S3, MySQL, NFS, … into HDFS, with metadata, processing, and workflow coordination layered on top)
Analytics on Hadoop 2 
Batch & interactive 
Datawarehousing & computing 
Dataset size and velocity 
Integrations with existing tools 
Distributions will constrain your stack
Use cases 
Datawarehousing 
Exploratory Data Analysis 
Stream processing 
Predictive Analytics
Datawarehousing 
Data ingestion 
Pipelines 
Transform and enrich (ETL) queries - batch 
Low latency (presentation) queries - interactive 
Interoperable data formats and metadata 
Workflow Orchestration
Collection and ingestion 
$ hadoop distcp 
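A minimal example of that: copying a dataset from S3 into HDFS in one line (bucket and paths are hypothetical):

$ hadoop distcp s3n://my-bucket/tweets hdfs://namenode/data/tweets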
Once data is in HDFS
Apache Hive 
HiveQL 
Data stored on HDFS 
Metadata kept in MySQL (metastore) 
Metadata exposed to third parties (HCatalog) 
Suitable both for interactive and batch queries
set hive.execution.engine=tez
set hive.execution.engine=mr
The nature of Hive tables 
CREATE TABLE and (LOAD DATA) produce metadata 
Schema based on the data “as it has already arrived” 
Data files underlying a Hive table are no different from any other file on HDFS 
Primitive types behave as in Java
Data formats 
Record oriented (Avro, text) 
Column oriented (Parquet, ORC)
Text (tab separated) 
create external table tweets 
( 
created_at string, 
tweet_id string, 
text string, 
in_reply_to string, 
retweeted boolean, 
user_id string, 
place_id string 
) ROW FORMAT DELIMITED 
FIELDS TERMINATED BY 't' 
STORED AS TEXTFILE 
LOCATION ‘$input’ 
$ hadoop fs -cat /data/tweets.tsv 
2014-03-12T17:34:26.000Z!443802208698908672! Oh &amp; I'm chuffed for 
@GeraintThomas86, doing Wales proud in yellow!! #ParisNice #Cymru! NULL! 
223224878! NULL 
2014-03-12T17:34:26.000Z!443802208706908160! Stalker48 Kembali Lagi Cek 
Disini http://t.co/4BMTFByFH5 236! NULL! 629845435! NULL 
2014-03-12T17:34:26.000Z!443802208728268800! @Piconn ou melhor, eu era :c 
mudei! NULL! 255768055! NULL 
2014-03-12T17:34:26.000Z!443802208698912768! I swear Ryan's always in his 
own world. He's always like 4 hours behind everyone else.! NULL! 
2379282889! NULL 
2014-03-12T17:34:26.000Z!443802208702713856! @maggersforever0 lmfao you 
gotta see this, its awesome http://t.co/1PvXEELlqi! NULL! 355858832! 
NULL 
2014-03-12T17:34:26.000Z!443802208698896384! Crazy... http://t.co/G4QRMSKGkh! NULL! 106439395! NULL!
SELECT COUNT(*) 
FROM tweets
Apache Avro 
Record oriented 
Migrations (forward, backward) 
Schema on write 
Interoperability 

{ 
  "namespace": "com.mycompany.avrotables", 
  "name": "tweets", 
  "type": "record", 
  "fields": [ 
    {"name": "created_at", "type": "string", "doc": "date_time of tweet"}, 
    {"name": "tweet_id_str", "type": "string"}, 
    {"name": "text", "type": "string"}, 
    {"name": "in_reply_to", "type": ["string", "null"]}, 
    {"name": "is_retweeted", "type": ["string", "null"]}, 
    {"name": "user_id", "type": "string"}, 
    {"name": "place_id", "type": ["string", "null"]} 
  ] 
} 

CREATE TABLE tweets 
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe' 
WITH SERDEPROPERTIES ( 
  'avro.schema.url'='hdfs:///schema/avro/tweets_avro.avsc' 
) 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'; 

insert into table tweets select * from tweets_ext;
Some thoughts on schemas 
Only make additive changes (see the example below) 
Think about schema distribution 
Manage schema versions explicitly
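As a minimal illustration of an additive change (the field is hypothetical, not from the deck): add an optional field with a default, so readers and writers on the old schema keep working:

{"name": "lang", "type": ["null", "string"], "default": null}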
Parquet 
Ad hoc use case 
Cloudera Impala’s default file format 
Execution engine agnostic 
HIVE-5783 
Let it handle block size 

create table tweets ( 
  created_at string, 
  tweet_id string, 
  text string, 
  in_reply_to string, 
  retweeted boolean, 
  user_id string, 
  place_id string 
) STORED AS PARQUET; 

insert into table tweets 
select * from tweets_ext;
If possible, use both
Table Optimization 
Create tables with workloads in mind 
Partitions 
Bucketing 
Join strategies
Plenty of tunables 
# partitions 
SET hive.exec.dynamic.partition=true; 
SET hive.exec.dynamic.partition.mode=nonstrict; 
SET hive.exec.max.dynamic.partitions.pernode=10000; 
SET hive.exec.max.dynamic.partitions=100000; 
SET hive.exec.max.created.files=1000000; 
! 
# merge small files 
SET hive.merge.size.per.task=256000000; 
SET hive.merge.mapfiles=true; 
SET hive.merge.mapredfiles=true; 
SET hive.merge.smallfiles.avgsize=16000000; 
# Compression 
SET mapred.output.compress=true; 
SET mapred.output.compression.type=BLOCK; 
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.Lz4Codec; 
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.Lz4Codec; 
Apache Oozie 
Data pipelines 
Workflow execution and coordination 
Time and availability based execution 
Configuration over code 
MapReduce centric 
Actions: Hive, Pig, fs, shell, sqoop 

<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:0.1">
  ...
  <action name="[NODE-NAME]">
    <hive xmlns="uri:oozie:hive-action:0.2">
      <job-tracker>[JOB-TRACKER]</job-tracker>
      <name-node>[NAME-NODE]</name-node>
      <prepare>
        <delete path="[PATH]"/>
        ...
        <mkdir path="[PATH]"/>
        ...
      </prepare>
      <job-xml>[HIVE SETTINGS FILE]</job-xml>
      <configuration>
        <property>
          <name>[PROPERTY-NAME]</name>
          <value>[PROPERTY-VALUE]</value>
        </property>
        ...
      </configuration>
      <script>[HIVE-SCRIPT]</script>
      <param>[PARAM-VALUE]</param>
      ...
      <param>[PARAM-VALUE]</param>
      <file>[FILE-PATH]</file>
      ...
      <archive>[FILE-PATH]</archive>
      ...
    </hive>
    <ok to="[NODE-NAME]"/>
    <error to="[NODE-NAME]"/>
  </action>
  ...
</workflow-app>
EDA 
(Figure: luminosity in xkcd comics, courtesy of R-bloggers)
Sample the dataset
Use Hive-on-Tez, Impala
Spark & IPython Notebook 

from pyspark import SparkContext
sc = SparkContext(CLUSTER_URL, 'ipython-notebook')

Works with Avro, Parquet etc. 
Move computation close to data 
NumPy, scikit-learn, matplotlib 
Setup can be tedious
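Tying this back to “sample the dataset”: once the context is up, pulling a small sample into the notebook is a one-liner (the path and fraction are hypothetical):

sample = sc.textFile('hdfs:///data/tweets.tsv').sample(False, 0.01).collect()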
Stream processing 
Statistics in real time 
Data feeds 
Machine generated (sensor data, logs) 
Predictive analytics
Several niches 
Low latency (Storm, S4) 
Persistency and resiliency (Samza) 
Apply complex logic (Spark Streaming) 
Type of message stream (Kafka)
Apache Samza 
Kafka for streaming 
YARN for resource management and execution 
Samza API for processing 
Sweet spot: seconds, minutes 
(Stack: Samza API on top of YARN on top of Kafka)
public void process( 
IncomingMessageEnvelope envelope, 
MessageCollector collector, 
TaskCoordinator coordinator)
public void window( 
MessageCollector collector, 
TaskCoordinator coordinator)
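Put together, a minimal task implementing both callbacks might look like this. A sketch only; system and stream names are hypothetical, and window() fires on the task.window.ms timer.

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;
import org.apache.samza.task.WindowableTask;

// Counts messages per window and emits the count downstream.
public class TweetCounterTask implements StreamTask, WindowableTask {
  private static final SystemStream OUTPUT = new SystemStream("kafka", "tweet-counts");
  private int count = 0;

  @Override
  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    count++;  // called once per incoming message
  }

  @Override
  public void window(MessageCollector collector, TaskCoordinator coordinator) {
    // called on a timer; flush the running count and reset
    collector.send(new OutgoingMessageEnvelope(OUTPUT, String.valueOf(count)));
    count = 0;
  }
}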
Bootstrap streams 
Samza can consume messages from multiple streams 
Rewind on historical data does not preserve ordering 
If a task has any bootstrap streams defined, it will read these streams until they are fully processed
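In job configuration, declaring a bootstrap stream looks roughly like this (a sketch using Samza 0.8-era property names; the stream names are hypothetical):

task.inputs=kafka.tweets,kafka.user-profiles
systems.kafka.streams.user-profiles.samza.bootstrap=true
systems.kafka.streams.user-profiles.samza.offset.default=oldest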
Predictive modelling 
Learning from data 
Predictive model = statistical learning 
Simple = parallelizable 
Garbage in = garbage out
Couple of things we can do 
1. Parameter tuning 
2. Feature engineering 
3. Learn on all data 
Train against all data 
Ensemble methods (cooperative and competitive) 
Avoid multi pass / iterations 
Apply models to live data 
Keep models up to date
Off the shelf 
Apache Mahout (MapReduce, Spark) 
MLlib (Spark) 
Cascading-pattern (MapReduce, Tez, Spark)
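As a sketch of the MLlib route (Spark 1.x API; the path, master, and parameters are hypothetical): train a logistic regression model on data in HDFS, then score records against it.

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.classification.LogisticRegressionModel;
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.util.MLUtils;

public class TrainModel {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext("yarn-client", "train-model");
    // LabeledPoint = (label, feature vector); LibSVM is one of the supported input formats
    JavaRDD<LabeledPoint> training =
        MLUtils.loadLibSVMFile(sc.sc(), "hdfs:///data/train.libsvm").toJavaRDD();
    LogisticRegressionModel model =
        LogisticRegressionWithSGD.train(training.rdd(), 100);
    // score a record against the trained model
    double prediction = model.predict(training.first().features());
    System.out.println("prediction: " + prediction);
  }
}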
Apache Mahout 0.9 
Once the default solution for ML with MapReduce 
Quality may vary 
Good components are really good 
Is it a library? A framework? A recommendation system?
The good 
The go-to if you need a Recommendation System 
SGD (optimization) 
Random Forest (classification/regression) 
SVD (feature engineering) 
ALS (collaborative filtering)
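As an illustration of the recommender side, here is a minimal user-based recommender with Mahout's non-distributed Taste API. A sketch, not from the deck; the ratings file and user id are hypothetical.

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class UserRecommender {
  public static void main(String[] args) throws Exception {
    // CSV of userID,itemID,rating triples
    DataModel model = new FileDataModel(new File("ratings.csv"));
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
    // top 5 items for user 42
    List<RecommendedItem> recs = recommender.recommend(42L, 5);
    for (RecommendedItem item : recs) {
      System.out.println(item);
    }
  }
}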
The puzzling 
SVM? 
Model updates are implementation specific! 
Feature encoding and input format are often model specific
Apache Mahout trunk 
Moving away from MapReduce 
Spark + Scala DSL = new classes of algorithms 
Major code cleanup
It needs major infrastructure work around it
batch + streaming
There’s a buzzword for that 
http://lambda-architecture.net/
Wrap up
With Hadoop 2 
Cluster as an Operating System 
YARN, mostly 
Multiparadigm, better interop 
Same system, different tools, multiple use cases 
Batch + interactive
This said 
Ops is where a lot of time goes 
Building clusters is hard 
Distro fragmentation 
Bleeding edge rush 
Heavy lifting needed
That’s all, folks
Thanks for having me
Let’s discuss