
5.7 HADOOP OVERVIEW
Hadoop is an open-source software framework to store and process massive amounts of data in a distributed fashion on large clusters of commodity hardware. Basically, Hadoop accomplishes two tasks:
1. Massive data storage.
2. Faster data processing.

5.7.1 Key Aspects of Hadoop


Figure 5.7 describes the key aspects of Hadoop.

Open-source software: It is free to download, use, and contribute to.
Framework: Everything that you will need to develop and execute an application is provided: programs, tools, etc.
Distributed: Divides and stores data across multiple computers; computation/processing is done in parallel across multiple connected nodes.
Massive storage: Stores colossal amounts of data across nodes of low-cost commodity hardware.
Faster processing: Large amounts of data are processed in parallel, yielding quick response.

Figure 5.7 Key aspects of Hadoop.


5.7.2 Hadoop Components
Figure 5.8 depicts the Hadoop components.
Hadoop Ecosystem: FLUME, OOZIE, MAHOUT, HIVE, PIG, SQOOP, HBASE
Core Components: MapReduce Programming, Hadoop Distributed File System (HDFS)

Figure 5.8 Hadoop components.

Hadoop Core Components


1. HDFS:
(a) Storage component.
(b) Distributes data across several nodes.
(c) Natively redundant.
2. MapReduce:
(a) Computational framework.
(b) Splits a task across multiple nodes.
(c) Processes data in parallel (a minimal sketch follows this list).
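To make the MapReduce model concrete, here is a minimal word-count sketch against the standard org.apache.hadoop.mapreduce API; the class names (TokenMapper, SumReducer) are illustrative, not from this chapter:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: each node runs this over its local split of the input,
// emitting (word, 1) pairs in parallel.
class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                ctx.write(word, ONE);
            }
        }
    }
}

// Reducer: receives all counts for one word (grouped by the framework)
// and sums them to produce the final total.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        ctx.write(key, new IntWritable(sum));
    }
}

The framework runs one mapper per input split and groups all values for a key before handing them to a reducer, which is how a single job spreads across the nodes of the cluster.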
Hadoop Ecosystem: The Hadoop ecosystem comprises support projects that enhance the functionality of Hadoop's core components. The ecosystem projects are as follows:
1. HIVE
2. PIG
3. SQOOP
4. HBASE
5. FLUME
6. OOZIE
7. MAHOUT

5.7.3 Hadoop Conceptual Layer
Hadoop is conceptually divided into a Data Storage Layer, which stores huge volumes of data, and a Data Processing Layer, which processes data in parallel to extract richer and meaningful insights from it (Figure 5.9).

5.7.4 High-Level Architecture of Hadoop
Hadoop has a distributed master-slave architecture. The master node is known as the NameNode and the slave nodes are known as DataNodes. Figure 5.10 depicts the master-slave architecture of the Hadoop framework.

Figure 5.9 Hadoop conceptual layer: data processing and data storage layers.

Master Node: Computation (MapReduce) and Storage (HDFS).
Slave Nodes: each runs Computation (MapReduce) and Storage (HDFS).

Figure 5.10 Hadoop high-level architecture.


Reference: Hadoop in Practice, Alex Holmes.

Let us look at the key components of the Master Node.


1. Master HDFS: Its main responsibility is partitioning the data storage across the slave nodes. It also keeps track of the locations of data on the DataNodes.
2. Master MapReduce: It decides and schedules computation tasks on the slave nodes.

5.8 USE CASE OF HADOOP


5.8.1 ClickStream Data
ClickStream data (mouse clicks) helps you to understand the purchasing behavior of customers. ClickStream
analysis helps online marketers to optimize their product web pages, promotional content, etc. to improve
their business.
ClickStream data analysis using Hadoop, key benefits: joins ClickStream data with CRM and sales data; stores years of data without much incremental cost; Hive or Pig scripts to analyze the data.

Figure 5.11 ClickStream data analysis.


The ClickStream analysis (Figure 5.11) using Hadoop provides three key benefits:
1. Hadoop helps to join ClickStream data with other data sources such as Customer Relationship Management data (customer demographics data, sales data, and information about marketing campaigns). This additional data often provides the much-needed information to understand customer behavior.
2. Hadoop's scalability property helps you to store years of data without much incremental cost. This helps you to perform temporal or year-over-year analysis on ClickStream data which your competitors may miss.
3. Business analysts can use Apache Pig or Apache Hive for website analysis. With these tools, you can organize ClickStream data by user session, refine it, and feed it to visualization or analytics tools.
Reference: http://hortonworks.com/wp-content/uploads/2014/05/Hortonworks.BusinessValueofHadoop.v1.0.pdf

5.9 HADOOP DISTRIBUTORS

The companies shown in Figure 5.12 provide products that include Apache Hadoop, commercial support, and/or tools and utilities related to Hadoop.

Cloudera: CDH 4.0, CDH 5.0
Hortonworks: HDP 1.0, HDP 2.0
MapR: M3, M5, M8
Apache Hadoop: Hadoop 1.0, Hadoop 2.0

Figure 5.12 Common Hadoop distributors.

5.10 HDFS (HADOOP DISTRIBUTED FILE SYSTEM)
Some key points of the Hadoop Distributed File System are as follows:
1. Storage component of Hadoop.
2. Distributed File System.
3. Modeled after the Google File System.
4. Optimized for high throughput (HDFS leverages large block sizes and moves computation to where the data is stored).
5. You can replicate a file a configured number of times, which makes HDFS fault tolerant in terms of both software and hardware.
6. Re-replicates data blocks automatically when nodes fail.
7. You can realize the power of HDFS when you perform reads or writes on large files (gigabytes and larger).
8. Sits on top of a native file system such as ext3 or ext4, as described in Figure 5.13.
Figure 5.14 describes important key points of HDFS. Figure 5.15 describes the Hadoop Distributed File System architecture. The client application interacts with the NameNode for metadata-related activities and communicates with the DataNodes to read and write files. The DataNodes converse with each other for pipeline reads and writes.
Let us assume that the file "Sample.txt" is of size 192 MB. As per the default data block size (64 MB), it will be split into three blocks and replicated across the nodes of the cluster based on the default replication factor.
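To sanity-check that arithmetic, here is a small self-contained Java snippet; the sizes mirror the example above and the class name is just for illustration:

public class BlockMath {
    public static void main(String[] args) {
        long fileSize  = 192L * 1024 * 1024;  // Sample.txt: 192 MB
        long blockSize = 64L * 1024 * 1024;   // default HDFS block size: 64 MB
        int replication = 3;                  // default replication factor

        // Number of blocks is the file size divided by the block size, rounded up.
        long blocks = (fileSize + blockSize - 1) / blockSize;
        System.out.println("blocks = " + blocks);                               // prints 3
        System.out.println("block replicas stored = " + blocks * replication);  // prints 9
    }
}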
5.10.1 HDFS Daemons
5.10.1.1 NameNode
HDFS breaks a large file into smaller pieces called blocks. The NameNode uses a rack ID to identify DataNodes in the rack. A rack is a collection of DataNodes within the cluster. The NameNode keeps track of the blocks of a file as they are placed on various DataNodes. The NameNode manages file-related operations such as read, write, create, and delete. Its main job is managing the File System Namespace. A file system namespace is the collection of files in the cluster. The NameNode stores the HDFS namespace. The namespace includes the mapping of blocks to files and file properties, and is stored in a file called FsImage. The NameNode uses an EditLog (transaction log) to record every transaction that happens to the file system metadata. Refer to Figure 5.16. When the NameNode starts up, it reads the FsImage and EditLog from disk and applies all transactions from the EditLog to the in-memory representation of the FsImage. Then it flushes out a new version of the FsImage to disk and truncates the old EditLog, because its changes are now reflected in the FsImage. There is a single NameNode per cluster.
Reference: http://hadoop.apache.org/docs/r1.0.4/hdfs_design.html
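As a rough conceptual sketch of that startup sequence (all types and method names here are hypothetical stand-ins, not Hadoop's actual internals):

import java.io.IOException;
import java.util.List;

// Hypothetical stand-ins for the on-disk FsImage and EditLog.
interface FsImageFile { NamespaceState load() throws IOException; void save(NamespaceState s) throws IOException; }
interface EditLogFile { List<MetadataTransaction> readAll() throws IOException; void truncate() throws IOException; }
interface MetadataTransaction { void applyTo(NamespaceState s); }
class NamespaceState { /* in-memory mapping of files to blocks, plus file properties */ }

class NameNodeStartup {
    // Mirrors the description above: read FsImage, replay EditLog into memory,
    // flush a new FsImage, then truncate the old EditLog.
    static NamespaceState start(FsImageFile fsImage, EditLogFile editLog) throws IOException {
        NamespaceState state = fsImage.load();             // 1. read last checkpoint
        for (MetadataTransaction tx : editLog.readAll()) { // 2. replay logged transactions
            tx.applyTo(state);
        }
        fsImage.save(state);                               // 3. flush new FsImage to disk
        editLog.truncate();                                // 4. old edits now live in the image
        return state;
    }
}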

HDFS sits above the native OS file system, which in turn sits above the disk storage.

Figure 5.13 Hadoop Distributed File System.

Block-structured file system. Default replication factor: 3. Default block size: 64 MB.

Figure 5.14 Hadoop Distributed File System key points.
In the figure, the client application uses the Hadoop file system client to consult the NameNode for the locations of blocks A, B, and C of Sample.txt (Nodes A, B, and C), and then reads and writes the replicated blocks on DataNodes A, B, and C.
Figure 5.15 Hadoop Distributed File System Architecture.
Reference: Hadoop in Practice, Alex Holmes.

The NameNode manages file-related operations. FsImage is the file in which the entire file system namespace is stored; the EditLog records every transaction that occurs to the file system metadata.

Figure 5.16 NameNode.


5.10.1.2 DataNode
There are multiple DataNodes per cluster. During pipeline reads and writes, the DataNodes communicate with each other. A DataNode also continuously sends a "heartbeat" message to the NameNode to ensure the connectivity between the NameNode and the DataNode. In case there is no heartbeat from a DataNode, the NameNode replicates that DataNode's data elsewhere within the cluster and keeps on running as if nothing had happened. Let us explain the concept behind the heartbeat report sent by the DataNodes to the NameNode.
Reference: Wrox Certified Big Data Developer.
PICTURE THIS...
You work for a renowned IT organization. Every day when you come to office, you are required to swipe in to record your attendance. This record of attendance is then shared with your manager to keep him posted on who all from his team have reported for work. Your manager is able to allocate tasks to the team members who are present in office. The tasks for the day cannot be allocated to team members who have not turned in. Likewise, the heartbeat report is the way by which DataNodes inform the NameNode that they are up and functional and can be assigned tasks. Figure 5.17 depicts the above scenario.
NameNode

DataNodes that send heartbeats remain in service; when no heartbeat arrives from a DataNode, the NameNode replicates its blocks to other DataNodes.
Figure 5.17 NameNode and DataNode Communication.
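The same idea as a rough sketch in Java; the class, timeout, and re-replication hook below are hypothetical, since real DataNodes report over Hadoop's RPC layer with far richer state:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class HeartbeatMonitor {
    // Hypothetical timeout: a DataNode silent this long is considered dead.
    private static final long TIMEOUT_MS = 10 * 60 * 1000;
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    // Called whenever a DataNode reports in ("I am up and functional").
    void onHeartbeat(String dataNodeId) {
        lastHeartbeat.put(dataNodeId, System.currentTimeMillis());
    }

    // Periodically scan for DataNodes that have stopped reporting and
    // trigger re-replication of the blocks they held.
    void checkLiveness() {
        long now = System.currentTimeMillis();
        lastHeartbeat.forEach((node, seen) -> {
            if (now - seen > TIMEOUT_MS) {
                lastHeartbeat.remove(node);
                reReplicateBlocksOf(node); // hypothetical hook
            }
        });
    }

    private void reReplicateBlocksOf(String node) {
        System.out.println("re-replicating blocks previously held by " + node);
    }
}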

5.10.1.3 Secondary NameNode


The Secondary NameNode takes a snapshot of the HDFS metadata at intervals specified in the Hadoop configuration. Since the memory requirements of the Secondary NameNode are the same as those of the NameNode, it is better to run the NameNode and the Secondary NameNode on different machines. In case of failure of the NameNode, the Secondary NameNode can be configured manually to bring up the cluster. However, the Secondary NameNode does not record any real-time changes that happen to the HDFS metadata.
5.10.2 Anatomy of File Read
Figure 5.18 describes the anatomy of File Read.

In the figure, the HDFS client opens the file through the DistributedFileSystem (1: open), which retrieves the block locations from the NameNode (2: get block location); the client then reads the blocks directly from the DataNodes through an FSDataInputStream (3, 4, and 5: read) and finally closes the stream (6: close).

Figure 5.18 File Read.
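From application code, the read path looks like this; it uses the standard org.apache.hadoop.fs API, and the file path is only an example:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();       // picks up core-site.xml/hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);           // DistributedFileSystem when HDFS is configured
        Path file = new Path("/user/demo/Sample.txt");  // example path
        // open() contacts the NameNode for block locations; the reads that
        // follow go directly to the DataNodes holding the blocks.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}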


5.10.3 Anatomy of File Write
Figure 5.19 describes the anatomy of File Write. In the figure, the HDFS client creates the file through the DistributedFileSystem (1: create), which issues a create RPC to the NameNode (2: create); the client then writes packets through an FSDataOutputStream to a pipeline of DataNodes (3 and 4: write packet), acknowledgements travel back up the pipeline (5: ack packet), the client closes the stream (6: close), and the DistributedFileSystem informs the NameNode that the write is complete (7: complete).

Figure 5.19 File Write.

1. The client calls create() on the DistributedFileSystem to create a file.
2. The DistributedFileSystem makes an RPC call to the NameNode to create a new file in the file system namespace.
3. As the client writes data, the DFSOutputStream splits it into packets and writes them to an internal queue called the "data queue". The DataStreamer consumes the data queue and asks
the NameNode to allocate new blocks by selecting a list of suitable DataNodes to store the replicas. This list of DataNodes makes a pipeline. Here, we will go with the default replication factor of three, so there will be three nodes in the pipeline for the first block.
4. The DataStreamer streams the packets to the first DataNode in the pipeline, which stores each packet and forwards it to the second DataNode in the pipeline. In the same way, the second DataNode stores the packet and forwards it to the third DataNode in the pipeline.
5. In addition to the internal queue, the DFSOutputStream also manages an "Ack queue" of packets that are waiting for acknowledgement by the DataNodes. A packet is removed from the "Ack queue" only when it has been acknowledged by all the DataNodes in the pipeline.
6. When the client finishes writing the file, it calls close() on the stream.
7. This flushes all the remaining packets to the DataNode pipeline and waits for the relevant acknowledgments before contacting the NameNode to signal that the creation of the file is complete.
Reference: Hadoop: The Definitive Guide, 3rd Edition, O'Reilly Publication.
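The corresponding write path from application code, again with the standard org.apache.hadoop.fs API (the path and payload are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/output.txt");  // example path
        // create() issues the create RPC to the NameNode (steps 1 and 2);
        // writes are packetized and pipelined to the DataNodes behind the
        // scenes (steps 3-5); close() flushes and completes the file (steps 6 and 7).
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeBytes("hello, HDFS\n");
        }
    }
}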
