
5.7 HADOOP OVERVIEW
Hadoop is an open-source software framework to store and process massive amounts of data in a distributed fashion on large clusters of commodity hardware. Basically, Hadoop accomplishes two tasks:
1. Massive data storage.
2. Faster data processing.

5.7.1 Key Aspects of Hadoop


Figure 5.7 describes the key aspects of Hadoop.

Open-source software: It is free to download, use, and contribute to.
Framework: Everything that you will need to develop and execute an application is provided: programs, tools, etc.
Distributed: Divides and stores data across multiple computers; computation/processing is done in parallel across multiple connected nodes.
Massive storage: Stores colossal amounts of data across nodes of low-cost commodity hardware.
Faster processing: Large amounts of data are processed in parallel, yielding quick response.

Figure 5.7 Key aspects of Hadoop.


5.7.2 Hadoop Components
Figure 5.8 depicts the Hadoop components.
Hadoop Ecosystem: FLUME, OOZIE, MAHOUT, HIVE, PIG, SQOOP, HBASE
Core Components: MapReduce Programming, Hadoop Distributed File System (HDFS)

Figure 5.8 Hadoop components.

Hadoop Core Components


1. HDFS:
(a) Storage component.
(b) Distributes data across several nodes.
(c) Natively redundant.
2. MapReduce:
(a) Computational framework.
(b) Splits a task across multiple nodes.
(c) Processes data in parallel (a minimal sketch follows this list).
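To make the MapReduce model concrete, here is a minimal word-count sketch against the standard org.apache.hadoop.mapreduce API; the class names (TokenMapper, SumReducer) are illustrative, not from this chapter:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: each node runs this over its local split of the input,
// emitting (word, 1) pairs in parallel.
class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                ctx.write(word, ONE);
            }
        }
    }
}

// Reducer: receives all counts for one word (grouped by the framework)
// and sums them to produce the final total.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        ctx.write(key, new IntWritable(sum));
    }
}

The framework runs one mapper per input split and groups all values for a key before handing them to a reducer, which is how a single job spreads across the nodes of the cluster.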
Hadoop Ecosystem: The Hadoop ecosystem comprises support projects that enhance the functionality of Hadoop's core components. The ecosystem projects are as follows:
1. HIVE
2. PIG
3. SQOOP
4. HBASE
5. FLUME
6. OOZIE
7. MAHOUT

5.7.3 Hadoop Conceptual Layer
Hadoop is conceptually divided into a Data Storage Layer, which stores huge volumes of data, and a Data Processing Layer, which processes data in parallel to extract richer and meaningful insights from it (Figure 5.9).

5.7.4 High-Level Architecture of Hadoop
Hadoop has a distributed master-slave architecture. The master node is known as the NameNode and the slave nodes are known as DataNodes. Figure 5.10 depicts the master-slave architecture of the Hadoop framework.

Figure 5.9 Hadoop conceptual layer: data processing and data storage layers.

Master Node: Computation (MapReduce) and Storage (HDFS).
Slave Nodes: each runs Computation (MapReduce) and Storage (HDFS).

Figure 5.10 Hadoop high-level architecture.


Reference: Hadoop in Practice, Alex Holmes.

Let us look at the key components of the Master Node.


1. Master HDFS: Its main responsibility is partitioning the data storage across the slave nodes. It also keeps track of the locations of data on the DataNodes.
2. Master MapReduce: It decides and schedules computation tasks on the slave nodes.

5.8 USE CASE OF HADOOP


5.8.1 ClickStream Data
ClickStream data (mouse clicks) helps you to understand the purchasing behavior of customers. ClickStream
analysis helps online marketers to optimize their product web pages, promotional content, etc. to improve
their business.
ClickStream data analysis using Hadoop, key benefits: joins ClickStream data with CRM and sales data; stores years of data without much incremental cost; Hive or Pig scripts to analyze the data.

Figure 5.11 ClickStream data analysis.


The ClickStream analysis (Figure 5.11) using Hadoop provides three key benefits:
1. Hadoop helps to join ClickStream data with other data sources such as Customer Relationship Management data (customer demographics data, sales data, and information about marketing campaigns). This additional data often provides the much-needed information to understand customer behavior.
2. Hadoop's scalability property helps you to store years of data without much incremental cost. This helps you to perform temporal or year-over-year analysis on ClickStream data which your competitors may miss.
3. Business analysts can use Apache Pig or Apache Hive for website analysis. With these tools, you can organize ClickStream data by user session, refine it, and feed it to visualization or analytics tools.
Reference: http://hortonworks.com/wp-content/uploads/2014/05/Hortonworks.BusinessValueofHadoop.v1.0.pdf

5.9 HADOOP DISTRIBUTORS

The companies shown in Figure 5.12 provide products that include Apache Hadoop, commercial support, and/or tools and utilities related to Hadoop.

Cloudera: CDH 4.0, CDH 5.0
Hortonworks: HDP 1.0, HDP 2.0
MapR: M3, M5, M8
Apache Hadoop: Hadoop 1.0, Hadoop 2.0

Figure 5.12 Common Hadoop distributors.

5.10 HDFS (HADOOP DISTRIBUTED FILE SYSTEM)
Some key points of the Hadoop Distributed File System are as follows:
1. Storage component of Hadoop.
2. Distributed File System.
3. Modeled after the Google File System.
4. Optimized for high throughput (HDFS leverages large block sizes and moves computation to where the data is stored).
5. You can replicate a file a configured number of times, which makes HDFS fault tolerant in terms of both software and hardware.
6. Re-replicates data blocks automatically when nodes fail.
7. You can realize the power of HDFS when you perform reads or writes on large files (gigabytes and larger).
8. Sits on top of a native file system such as ext3 or ext4, as described in Figure 5.13.
Figure 5.14 describes important key points of HDFS. Figure 5.15 describes the Hadoop Distributed File System architecture. The client application interacts with the NameNode for metadata-related activities and communicates with the DataNodes to read and write files. The DataNodes converse with each other for pipeline reads and writes.
Let us assume that the file "Sample.txt" is of size 192 MB. As per the default data block size (64 MB), it will be split into three blocks and replicated across the nodes of the cluster based on the default replication factor.
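To sanity-check that arithmetic, here is a small self-contained Java snippet; the sizes mirror the example above and the class name is just for illustration:

public class BlockMath {
    public static void main(String[] args) {
        long fileSize  = 192L * 1024 * 1024;  // Sample.txt: 192 MB
        long blockSize = 64L * 1024 * 1024;   // default HDFS block size: 64 MB
        int replication = 3;                  // default replication factor

        // Number of blocks is the file size divided by the block size, rounded up.
        long blocks = (fileSize + blockSize - 1) / blockSize;
        System.out.println("blocks = " + blocks);                               // prints 3
        System.out.println("block replicas stored = " + blocks * replication);  // prints 9
    }
}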
5.10.1 HDFS Daemons
5.10.1.1 NameNode
HDFS breaks a large file into smaller pieces called blocks. The NameNode uses a rack ID to identify DataNodes in the rack. A rack is a collection of DataNodes within the cluster. The NameNode keeps track of the blocks of a file as they are placed on various DataNodes. The NameNode manages file-related operations such as read, write, create, and delete. Its main job is managing the File System Namespace. A file system namespace is the collection of files in the cluster. The NameNode stores the HDFS namespace. The namespace includes the mapping of blocks to files and file properties, and is stored in a file called FsImage. The NameNode uses an EditLog (transaction log) to record every transaction that happens to the file system metadata. Refer to Figure 5.16. When the NameNode starts up, it reads the FsImage and EditLog from disk and applies all transactions from the EditLog to the in-memory representation of the FsImage. Then it flushes out a new version of the FsImage to disk and truncates the old EditLog, because its changes are now reflected in the FsImage. There is a single NameNode per cluster.
Reference: http://hadoop.apache.org/docs/r1.0.4/hdfs_design.html
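As a rough conceptual sketch of that startup sequence (all types and method names here are hypothetical stand-ins, not Hadoop's actual internals):

import java.io.IOException;
import java.util.List;

// Hypothetical stand-ins for the on-disk FsImage and EditLog.
interface FsImageFile { NamespaceState load() throws IOException; void save(NamespaceState s) throws IOException; }
interface EditLogFile { List<MetadataTransaction> readAll() throws IOException; void truncate() throws IOException; }
interface MetadataTransaction { void applyTo(NamespaceState s); }
class NamespaceState { /* in-memory mapping of files to blocks, plus file properties */ }

class NameNodeStartup {
    // Mirrors the description above: read FsImage, replay EditLog into memory,
    // flush a new FsImage, then truncate the old EditLog.
    static NamespaceState start(FsImageFile fsImage, EditLogFile editLog) throws IOException {
        NamespaceState state = fsImage.load();             // 1. read last checkpoint
        for (MetadataTransaction tx : editLog.readAll()) { // 2. replay logged transactions
            tx.applyTo(state);
        }
        fsImage.save(state);                               // 3. flush new FsImage to disk
        editLog.truncate();                                // 4. old edits now live in the image
        return state;
    }
}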

HDFS sits above the native OS file system, which in turn sits above the disk storage.

Figure 5.13 Hadoop Distributed File System.

Block-structured file system. Default replication factor: 3. Default block size: 64 MB.

Figure 5.14 Hadoop Distributed File System key points.
In the figure, the client application uses the Hadoop file system client to consult the NameNode for the locations of blocks A, B, and C of Sample.txt (Nodes A, B, and C), and then reads and writes the replicated blocks on DataNodes A, B, and C.
Figure 5.15 Hadoop Distributed File System Architecture.
Reference: Hadoop in Practice, Alex Holmes.

The NameNode manages file-related operations. FsImage is the file in which the entire file system namespace is stored; the EditLog records every transaction that occurs to the file system metadata.

Figure 5.16 NameNode.


5.10.1.2 DataNode
There are multiple DataNodes per cluster. During pipeline reads and writes, the DataNodes communicate with each other. A DataNode also continuously sends a "heartbeat" message to the NameNode to ensure the connectivity between the NameNode and the DataNode. In case there is no heartbeat from a DataNode, the NameNode replicates that DataNode's data elsewhere within the cluster and keeps on running as if nothing had happened. Let us explain the concept behind the heartbeat report sent by the DataNodes to the NameNode.
Reference: Wrox Certified Big Data Developer.
PICTURE THIS...
You work for a renowned IT organization. Every day when you come to office, you are required to swipe in to record your attendance. This record of attendance is then shared with your manager to keep him posted on who all from his team have reported for work. Your manager is able to allocate tasks to the team members who are present in office. The tasks for the day cannot be allocated to team members who have not turned in. Likewise, the heartbeat report is the way by which DataNodes inform the NameNode that they are up and functional and can be assigned tasks. Figure 5.17 depicts the above scenario.
NameNode

DataNodes that send heartbeats remain in service; when no heartbeat arrives from a DataNode, the NameNode replicates its blocks to other DataNodes.
Figure 5.17 NameNode and DataNode Communication.
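The same idea as a rough sketch in Java; the class, timeout, and re-replication hook below are hypothetical, since real DataNodes report over Hadoop's RPC layer with far richer state:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class HeartbeatMonitor {
    // Hypothetical timeout: a DataNode silent this long is considered dead.
    private static final long TIMEOUT_MS = 10 * 60 * 1000;
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    // Called whenever a DataNode reports in ("I am up and functional").
    void onHeartbeat(String dataNodeId) {
        lastHeartbeat.put(dataNodeId, System.currentTimeMillis());
    }

    // Periodically scan for DataNodes that have stopped reporting and
    // trigger re-replication of the blocks they held.
    void checkLiveness() {
        long now = System.currentTimeMillis();
        lastHeartbeat.forEach((node, seen) -> {
            if (now - seen > TIMEOUT_MS) {
                lastHeartbeat.remove(node);
                reReplicateBlocksOf(node); // hypothetical hook
            }
        });
    }

    private void reReplicateBlocksOf(String node) {
        System.out.println("re-replicating blocks previously held by " + node);
    }
}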

5.10.1.3 Secondary NameNode


The Secondary NameNode takes a snapshot of the HDFS metadata at intervals specified in the Hadoop configuration. Since the memory requirements of the Secondary NameNode are the same as those of the NameNode, it is better to run the NameNode and the Secondary NameNode on different machines. In case of failure of the NameNode, the Secondary NameNode can be configured manually to bring up the cluster. However, the Secondary NameNode does not record any real-time changes that happen to the HDFS metadata.
5.10.2 Anatomy of File Read
Figure 5.18 describes the anatomy of File Read.

In the figure, the HDFS client opens the file through the DistributedFileSystem (1: open), which retrieves the block locations from the NameNode (2: get block location); the client then reads the blocks directly from the DataNodes through an FSDataInputStream (3, 4, and 5: read) and finally closes the stream (6: close).

Figure 5.18 File Read.
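From application code, the read path looks like this; it uses the standard org.apache.hadoop.fs API, and the file path is only an example:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();       // picks up core-site.xml/hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);           // DistributedFileSystem when HDFS is configured
        Path file = new Path("/user/demo/Sample.txt");  // example path
        // open() contacts the NameNode for block locations; the reads that
        // follow go directly to the DataNodes holding the blocks.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}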


5.10.3 Anatomy of File Write
Figure 5.19 describes the anatomy of File Write. In the figure, the HDFS client creates the file through the DistributedFileSystem (1: create), which issues a create RPC to the NameNode (2: create); the client then writes packets through an FSDataOutputStream to a pipeline of DataNodes (3 and 4: write packet), acknowledgements travel back up the pipeline (5: ack packet), the client closes the stream (6: close), and the DistributedFileSystem informs the NameNode that the write is complete (7: complete).

Figure 5.19 File Write.

1. The client calls create() on the DistributedFileSystem to create a file.
2. The DistributedFileSystem makes an RPC call to the NameNode to create a new file in the file system namespace.
3. As the client writes data, the DFSOutputStream splits it into packets and writes them to an internal queue called the "data queue". The DataStreamer consumes the data queue and asks
the NameNode to allocate new blocks by selecting a list of suitable DataNodes to store the replicas. This list of DataNodes makes a pipeline. Here, we will go with the default replication factor of three, so there will be three nodes in the pipeline for the first block.
4. The DataStreamer streams the packets to the first DataNode in the pipeline, which stores each packet and forwards it to the second DataNode in the pipeline. In the same way, the second DataNode stores the packet and forwards it to the third DataNode in the pipeline.
5. In addition to the internal queue, the DFSOutputStream also manages an "Ack queue" of packets that are waiting for acknowledgement by the DataNodes. A packet is removed from the "Ack queue" only when it has been acknowledged by all the DataNodes in the pipeline.
6. When the client finishes writing the file, it calls close() on the stream.
7. This flushes all the remaining packets to the DataNode pipeline and waits for the relevant acknowledgments before contacting the NameNode to signal that the creation of the file is complete.
Reference: Hadoop: The Definitive Guide, 3rd Edition, O'Reilly Publication.
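The corresponding write path from application code, again with the standard org.apache.hadoop.fs API (the path and payload are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/output.txt");  // example path
        // create() issues the create RPC to the NameNode (steps 1 and 2);
        // writes are packetized and pipelined to the DataNodes behind the
        // scenes (steps 3-5); close() flushes and completes the file (steps 6 and 7).
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeBytes("hello, HDFS\n");
        }
    }
}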
