Introduction to Hadoop - FinistJug

INTRODUCTION TO HADOOP
Brest – 29 October 2014
David Morin - @davAtBzh

Me
David Morin
@davAtBzh
Solutions Engineer at

Hadoop fundamentals
● Distributed file system for high volumes of data
● Runs on commodity servers (limits costs)
● Scalable and fault tolerant

Hadoop Distributed File System (HDFS)

MapReduce: word count
[Figure: Map phase, then Reduce phase]
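
The word count flow can be sketched in plain Java, without any Hadoop dependency. This is only an in-memory illustration of the map, shuffle and reduce steps, not the Hadoop API: the class name WordCountSketch and its methods are hypothetical, and a real job would distribute these steps across the cluster.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory sketch of MapReduce word count (not the Hadoop API).
public class WordCountSketch {

    // Map phase: emit one (word, 1) pair for each word of an input line.
    public static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(token, 1));
            }
        }
        return pairs;
    }

    // Shuffle: group every emitted value under its key, with keys sorted.
    public static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum all the counts collected for one word.
    public static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    // Chain the three steps over a list of input lines.
    public static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            emitted.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        shuffle(emitted).forEach((word, counts) -> result.put(word, reduce(word, counts)));
        return result;
    }

    public static void main(String[] args) {
        // prints {brown=1, dog=1, fox=2, lazy=1, quick=1, the=3}
        System.out.println(wordCount(Arrays.asList("the quick brown fox", "the lazy dog", "the fox")));
    }
}
```

In Hadoop, each map call would run on a different input split and the shuffle would move data between nodes; here everything stays in one JVM to keep the three steps visible.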

Data Locality Optimization
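
As a rough illustration of the idea, the sketch below picks a node for a map task by preferring a free node that already stores a replica of the input block. It is a simplification under stated assumptions: the class DataLocalitySketch, the method pickNode and the node names are hypothetical, and the real Hadoop scheduler also considers rack-local nodes before falling back to a remote read.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch of data-local task placement (not the Hadoop scheduler).
public class DataLocalitySketch {

    // Prefer a free node holding a replica of the block (data-local);
    // otherwise fall back to any free node, which reads the block remotely.
    public static Optional<String> pickNode(List<String> replicaHosts, Set<String> freeNodes) {
        for (String host : replicaHosts) {
            if (freeNodes.contains(host)) {
                return Optional.of(host); // data-local: no network transfer
            }
        }
        return freeNodes.stream().findFirst(); // remote read, or empty if no node is free
    }

    public static void main(String[] args) {
        // Assumed cluster state: the block is replicated on node1..node3,
        // and only node4 and node2 currently have a free slot.
        List<String> replicas = Arrays.asList("node1", "node2", "node3");
        Set<String> free = new LinkedHashSet<>(Arrays.asList("node4", "node2"));
        System.out.println(pickNode(replicas, free)); // prints Optional[node2]
    }
}
```

Moving the computation to the data, rather than the data to the computation, is what keeps network traffic low on large inputs.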

Hadoop v1: drawbacks
– One NameNode: single point of failure (SPOF)
– One JobTracker: SPOF and not scalable (limited number of nodes)
– MapReduce only: the platform should be open to non-MR applications

Hadoop v2
Improvements:
– HDFS v2: NameNode high availability (standby NameNode)
– YARN (Yet Another Resource Negotiator)
● JobTracker => ResourceManager + ApplicationMasters (one per application)
● Can be used by non-MapReduce applications
– MapReduce v2: runs on YARN

Pig
● With Pig, writing MR jobs becomes easy
● Dataflow model: data is the key!
● Language: Pig Latin
● No limit: User Defined Functions (UDFs)
http://pig.apache.org/docs/r0.13.0/

Pig
● Pig word count
lines = LOAD '/user/XXX/file.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);
DUMP wordcount;

Pig
● The same word count as a Java MapReduce job, for comparison
import …
public class WordCount2 {
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        static enum CountersEnum { INPUT_WORDS }
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        private boolean caseSensitive;
        private Set<String> patternsToSkip = new HashSet<String>();
        private Configuration conf;
        private BufferedReader fis;
        ...
=> 130 lines of code!

Hive
● SQL-like language: HQL
● UDFs
● Hive word count
CREATE TABLE docs (line STRING);
LOAD DATA INPATH 'text' OVERWRITE INTO TABLE docs;
CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
(SELECT explode(split(line, '\\s')) AS word FROM docs) w
GROUP BY word
ORDER BY word;

ZooKeeper
● Distributed coordination service
● Dynamic configuration
● Distributed locking
