Hadoop
5.7 HADOOP OVERVIEW
Open-source software framework to store and process massive amounts of data in a distributed fashion on large clusters of commodity hardware. Basically, Hadoop accomplishes two tasks:
1. Massive data storage.
2. Faster data processing.
Framework: Everything that you will need to develop and execute an application is provided (programs, tools, etc.).
Core Components
1. Data storage: HDFS (Hadoop Distributed File System).
2. Data processing: MapReduce programming.
The master node handles both the computation layer (MapReduce) and the storage layer (HDFS).
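By way of illustration, a minimal word-count job written against the standard org.apache.hadoop.mapreduce API could use a mapper and reducer along the lines of the sketch below. The class names are hypothetical, and the job driver and configuration are omitted for brevity.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical word-count example: the mapper emits (word, 1) pairs
// and the reducer sums the counts for each word.
public class WordCountExample {

  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Split each input line into words and emit a count of 1 per word.
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      // Add up all the 1s emitted for this word.
      int sum = 0;
      for (IntWritable value : values) {
        sum += value.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
}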
A typical use case joins clickstream data with CRM and sales data, stores years of data without much incremental cost, and uses Hive or Pig scripts to analyze the data.
The companies shown in Figure 5.12 provide products that include Apache Hadoop, commercial support, and/or tools and utilities related to Hadoop.
HDFS
HDFS provides disk storage as a block-structured file system. Default replication factor: 3. Default block size: 64 MB.
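As a quick check of these defaults, a small client-side sketch like the one below (assuming a reachable HDFS cluster and the standard Hadoop client libraries; the class name is made up) can print the block size and replication factor that would be applied to new files.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

// Hypothetical helper that prints the cluster's default block size and
// replication factor as seen by an HDFS client.
public class HdfsDefaults {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();     // reads core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);         // connects to the configured file system

    long blockSizeBytes = fs.getDefaultBlockSize();   // e.g. 64 MB on older clusters
    short replication = fs.getDefaultReplication();   // e.g. 3

    System.out.println("Default block size (MB): " + blockSizeBytes / (1024 * 1024));
    System.out.println("Default replication factor: " + replication);
  }
}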
In Figure 5.15, a client application stores Sample.txt through the Hadoop file system: the file is split into blocks A, B, and C, the NameNode keeps track of which DataNode holds each block, and the blocks are replicated across DataNodes A, B, and C.
Figure 5.15 Hadoop Distributed File System Architecture.
Reference: Hadoop in Practice, Alex Holmes.
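To connect the figure to the HDFS client API, the hypothetical snippet below asks the NameNode, through the FileSystem interface, where the blocks of a file such as Sample.txt are stored. The path used here is assumed for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical snippet: list which DataNodes hold each block of Sample.txt.
public class ListBlockLocations {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/user/demo/Sample.txt");   // assumed path for illustration

    FileStatus status = fs.getFileStatus(file);
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

    for (BlockLocation block : blocks) {
      // Each entry covers one block of the file and names its hosting DataNodes.
      System.out.println("offset=" + block.getOffset()
          + " length=" + block.getLength()
          + " hosts=" + String.join(",", block.getHosts()));
    }
  }
}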
[Figure: DataNodes send heartbeats to the NameNode; when no heartbeat is received from a DataNode, the NameNode replicates its blocks to other DataNodes.]
[Figure: anatomy of a file read - the client JVM on the client node opens the file through DistributedFileSystem, reads the blocks from the DataNodes through FSDataInputStream, and then closes the stream.]
5.10.5 Anatomy of File Write
[Figure: anatomy of a file write - the client creates the file through DistributedFileSystem, streams write packets to the DataNode pipeline through FSDataOutputStream, receives packet acknowledgments, closes the stream, and the NameNode is notified that the file is complete.]
Figure 5.19 Anatomy of File Write.
3. As the client writes data, DFSOutputStream splits it into packets and writes them to an internal data queue. The DataStreamer consumes this queue and asks the NameNode to allocate new blocks by selecting a list of suitable DataNodes to store the replicas. This list of DataNodes makes a pipeline. Here, we will go with the default replication factor of three, so there will be three nodes in the pipeline for the first block.
4. DataStreamer streams the packets to the first DataNode in the pipeline. It stores each packet and forwards it to the second DataNode in the pipeline. In the same way, the second DataNode stores the packet and forwards it to the third DataNode in the pipeline.
5. In addition to the internal queue, DFSOutputStream also manages an "Ack queue" of packets that are waiting for acknowledgement by the DataNodes. A packet is removed from the "Ack queue" only if it is acknowledged by all the DataNodes in the pipeline.
6. When the client finishes writing the file, it calls close() on the stream.
7. This flushes all the remaining packets to the DataNode pipeline and waits for the relevant acknowledgments before communicating with the NameNode to signal that the creation of the file is complete.
Reference: Hadoop: The Definitive Guide, 3rd Edition, O'Reilly Publication.
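The write path described above can be exercised from a simple client program. The sketch below is a hypothetical example (the target path and contents are made up): create() sets up the file with the NameNode, write() hands data to DFSOutputStream for packetizing and pipelining, and close() triggers the flush-and-complete behaviour of steps 6 and 7.

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical client that writes a small file to HDFS.
public class WriteToHdfs {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/user/demo/hello.txt");   // assumed target path

    // create() contacts the NameNode to create the file entry.
    FSDataOutputStream out = fs.create(file, true);

    // write() hands data to DFSOutputStream, which packetizes it and streams
    // the packets down the DataNode pipeline (steps 3-5 above).
    out.write("Hello, HDFS".getBytes(StandardCharsets.UTF_8));

    // close() flushes the remaining packets, waits for acknowledgments, and
    // signals the NameNode that the file is complete (steps 6-7 above).
    out.close();
  }
}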