INTRODUCTION TO HADOOP 
Brest – 29 October 2014 
David Morin - @davAtBzh
Me 
David Morin 
@davAtBzh 
Solutions Engineer at
What is Hadoop?
An elephant – this one?
No, this one!
The father
Let's go!
Timeline
Hadoop fundamentals 
● Distributed file system for high volumes of data 
● Runs on commodity servers (limits costs) 
● Scalable and fault tolerant
Hadoop Distributed File System (HDFS)
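As a rough illustration of the idea behind HDFS, the sketch below cuts a file into fixed-size blocks and copies each block to several nodes, which is what lets the cluster survive the loss of a server. Everything in it is an assumption made for the example: the class name `BlockPlacementSketch`, the node names, the 128 MB block size, the replication factor of 3, and the round-robin placement (real HDFS placement is decided by the NameNode and is rack-aware).

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (not the HDFS API): split a file into blocks and replicate each block. */
final class BlockPlacementSketch {

    /** Number of fixed-size blocks needed for fileSize bytes; the last block may be partial. */
    static long blockCount(long fileSize, long blockSize) {
        return (fileSize + blockSize - 1) / blockSize; // ceiling division
    }

    /**
     * Assigns every block to `replication` distinct nodes, round-robin.
     * Assumption: round-robin stands in for the real, rack-aware placement policy.
     */
    static List<List<String>> placeBlocks(long blocks, List<String> nodes, int replication) {
        if (replication > nodes.size()) {
            throw new IllegalArgumentException("replication exceeds number of nodes");
        }
        List<List<String>> placement = new ArrayList<>();
        for (long b = 0; b < blocks; b++) {
            List<String> replicas = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                // Consecutive indexes modulo the node count are distinct while replication <= nodes.
                replicas.add(nodes.get((int) ((b + r) % nodes.size())));
            }
            placement.add(replicas);
        }
        return placement;
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024; // assumed block size: 128 MB
        long fileSize = 300L * 1024 * 1024;  // hypothetical 300 MB file
        List<String> nodes = List.of("node1", "node2", "node3", "node4"); // hypothetical nodes
        long blocks = blockCount(fileSize, blockSize);
        System.out.println(blocks + " blocks: " + placeBlocks(blocks, nodes, 3));
    }
}
```

With these assumed values the file needs 3 blocks, and each block lives on three different nodes, so losing any single node still leaves at least two copies of every block.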
MapReduce
MapReduce: word count 
Map / Reduce
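The word count flow can be sketched in plain Java, without any Hadoop dependency, to make the three steps visible: map emits a (word, 1) pair per word, shuffle groups the pairs by word, and reduce sums each group. This is only an in-memory imitation under stated assumptions: the class `WordCountSketch` and its method names are invented for the example, and a real job would run the map and reduce steps in parallel on many nodes.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory imitation of the MapReduce word count (not the Hadoop API). */
final class WordCountSketch {

    /** Map phase: emit a (word, 1) pair for every word of one input line. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) { // a blank line yields one empty token
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    /** Shuffle phase: group all emitted values by key (sorted by key, as a TreeMap). */
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce phase: sum the values collected for one word. */
    static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    /** Chains the three phases over all input lines. */
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        shuffle(pairs).forEach((word, counts) -> result.put(word, reduce(counts)));
        return result;
    }

    public static void main(String[] args) {
        // prints {dog=1, fox=1, lazy=1, quick=1, the=2}
        System.out.println(wordCount(List.of("the quick fox", "the lazy dog")));
    }
}
```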
Data Locality Optimization
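Data locality means moving the computation to the data rather than the data to the computation. A minimal sketch of that preference, assuming a scheduler that knows which nodes hold a replica of the input block and which nodes are free: it picks a free node that already stores the block, and only falls back to another free node (a remote read) when none exists. The class `LocalitySketch`, the method `pickNode`, and the node names are hypothetical, and the rack-local middle step of real Hadoop scheduling is left out.

```java
import java.util.List;
import java.util.Optional;

/** Toy scheduler decision (not the Hadoop API): prefer a node that already holds the data. */
final class LocalitySketch {

    /**
     * Picks a node for a map task.
     * Returns a free node holding a replica of the input block if there is one,
     * otherwise the first free node, otherwise empty when no node is free.
     */
    static Optional<String> pickNode(List<String> replicaHolders, List<String> freeNodes) {
        for (String node : freeNodes) {
            if (replicaHolders.contains(node)) {
                return Optional.of(node); // data-local: the block is read from local disk
            }
        }
        return freeNodes.stream().findFirst(); // remote read over the network, or empty
    }

    public static void main(String[] args) {
        List<String> replicaHolders = List.of("node2", "node3"); // hypothetical block locations
        List<String> freeNodes = List.of("node1", "node3");      // hypothetical free nodes
        System.out.println(pickNode(replicaHolders, freeNodes)); // prints Optional[node3]
    }
}
```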
MapReduce in action
Hadoop v1: drawbacks 
– One NameNode: single point of failure (SPOF) 
– One JobTracker: SPOF and a scalability bottleneck (limits the number of nodes) 
– MapReduce only: the platform cannot run non-MapReduce applications
Hadoop v2 
Improvements: 
– HDFS v2: NameNode high availability (active and standby NameNodes) 
– YARN (Yet Another Resource Negotiator) 
● JobTracker => ResourceManager + ApplicationMasters (one per application) 
● Can be used by non-MapReduce applications 
– MapReduce v2: runs on YARN
YARN
Pig 
● With Pig, writing MapReduce jobs becomes easy 
● Dataflow model: data is the key! 
● Language: Pig Latin 
● No limit: User Defined Functions (UDFs) 
http://pig.apache.org/docs/r0.13.0/
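Pig 
● Pig word count 
-- Load each line of the input file as a single chararray field
lines = LOAD '/user/XXX/file.txt' AS (line:chararray); 
-- Split each line into words, one word per record
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word; 
-- Group identical words, then count the records in each group
grouped = GROUP words BY word; 
wordcount = FOREACH grouped GENERATE group, COUNT(words); 
DUMP wordcount;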
Pig 
The same word count as a Java MapReduce job: 
import … 
public class WordCount2 { 
  public static class TokenizerMapper 
      extends Mapper<Object, Text, Text, IntWritable> { 
    static enum CountersEnum { INPUT_WORDS } 
    private final static IntWritable one = new IntWritable(1); 
    private Text word = new Text(); 
    private boolean caseSensitive; 
    private Set<String> patternsToSkip = new HashSet<String>(); 
    private Configuration conf; 
    private BufferedReader fis; 
    ... 
=> 130 lines of code!
Hive 
● SQL-like language: HiveQL (HQL) 
● UDFs 
● Hive word count 
CREATE TABLE docs (line STRING); 
LOAD DATA INPATH 'text' OVERWRITE INTO TABLE docs; 
CREATE TABLE word_counts AS 
SELECT word, count(1) AS count FROM 
(SELECT explode(split(line, '\\s+')) AS word FROM docs) w 
GROUP BY word 
ORDER BY word;
ZooKeeper 
● Distributed coordination service 
● Dynamic configuration 
● Distributed locking
Batch, but not only...
??

Introduction to Hadoop - FinistJug
