Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop Tutorial | Simplilearn

What’s in it for you?
Why Hadoop?
What is Hadoop?
Hadoop HDFS
Hadoop MapReduce
Hadoop YARN
Use case of Hadoop
Demo on HDFS, MapReduce and YARN

Why Hadoop?

Tim sells food grains in his shop

The customers were happy as Tim was very quick
with the orders

Tim sensed a good demand for other products, so he
thought of expanding his business

He started selling fruits, vegetables, meat, and dairy
products in addition to food grains

But it wasn’t as easy as he expected it to be. The
number of customers increased, and he was not
able to cater to their needs on time

He had to look into assisting his customers with
each of their orders and billing. It was too difficult for
him to manage alone

To start delivering orders on time and to manage the
customers’ demands, Tim hired 3 more people to
work with him

Matt took care of the fruit and vegetable section.
Luke handled the dairy and meat section. Ann was
appointed as the cashier

However, this was still not a solution to Tim’s problem
as there was not enough space in the shop for all the
items

The storage area was a bottleneck since storing and accessing
items became more and more difficult with increased supply and
demand

Tim came up with an idea to overcome this issue. He
decided to expand the storage area and distribute each
category of product on different floors

Now, customers were happy. They picked up their products
from the respective sections and were then billed at the cashier

Now, let us compare this story to big data

Earlier, data was generated at a moderate rate, and all the
data was structured in nature. One processor was enough to
process all of it

With the increase in data generation, different types of data
were generated at high speed. It became difficult for a single
processor to process different types of data

A massive amount of data of different types that cannot be
stored and processed using traditional databases is known as
big data

To overcome this issue, multiple processors were used to
process each type of data

But now the problem was that one storage system was
accessed by all the processors and the storage became the
bottleneck

Just like how Tim adopted the distributed approach, the
storage system was also distributed and by doing so, the data
was stored in individual databases

Through this story, we see the two approaches that are
used by Hadoop: HDFS and MapReduce

HDFS refers to the distributed storage space, just like how Tim distributed the
storage space amongst the various sections

Each person took care of a separate section, and at the end the customers
went to the cashier for the final billing. This organized the process and made it
easier. This is how Hadoop MapReduce works

This was a rough story of big data
generation and why Hadoop is required. I
will now explain in detail as to what
Hadoop is

This sounds interesting. I would like
to know more about Hadoop

What is Hadoop?

What is Hadoop?
Hadoop is a framework that stores and processes big data in a distributed and parallel fashion

That sounds interesting, so how
does Hadoop store and process all
of this big data?

Hadoop has individual components, which
are used for storing and processing big
data

Components of Hadoop
HDFS: the storage unit of Hadoop
MapReduce: the processing unit of Hadoop
YARN: the resource management unit of Hadoop

Hadoop HDFS

What is HDFS?
Hadoop Distributed File System (HDFS) is known for its distributed storage method.
It distributes the data amongst many computers. In addition to this, the data is
replicated to avoid data loss
Data is split into blocks. Each block holds up to 128 MB of data by default
and is stored on multiple DataNodes

What is HDFS?
Let us now see how 500 MB of data is stored in the traditional method
Here, the entire set of data is stored in one database. This overloads the
database, and if it crashes, we lose all our data
Using Hadoop HDFS, this problem is taken care of as data is distributed amongst
many systems. By doing so, no single system is overloaded

What is HDFS?
Hadoop Distributed File System (HDFS) is specially designed for
storing massive datasets on commodity hardware
HDFS has two main components that help with its storage:
NameNode and DataNode

What is HDFS?
NameNode
• NameNode is the master of the system
• It stores all the metadata
DataNode
• DataNode is known as the slave node. There are multiple DataNodes
• It performs the read/write operations and stores the actual data

What is HDFS?
NameNode
• NameNode manages all the DataNodes
• The DataNodes send signals known as heartbeats to the NameNode.
This signal gives the status of the DataNode
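
The heartbeat idea can be sketched in a few lines of Python. This is only an illustration, not HDFS code: the HeartbeatMonitor class, its method names, and the 30-second timeout are all assumptions made for this example.

```python
class HeartbeatMonitor:
    """Toy NameNode-side tracker of DataNode heartbeats (illustrative only)."""

    def __init__(self, timeout_seconds=30.0):
        # The timeout is an assumed value chosen for this sketch.
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}  # DataNode name -> time of the last heartbeat

    def record_heartbeat(self, datanode, now):
        """Remember when a DataNode last reported in."""
        self.last_seen[datanode] = now

    def live_nodes(self, now):
        """DataNodes whose last heartbeat is recent enough."""
        return sorted(n for n, t in self.last_seen.items()
                      if now - t <= self.timeout_seconds)

    def dead_nodes(self, now):
        """DataNodes that have been silent for longer than the timeout."""
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > self.timeout_seconds)


monitor = HeartbeatMonitor(timeout_seconds=30.0)
monitor.record_heartbeat("DataNode 1", now=0.0)
monitor.record_heartbeat("DataNode 2", now=0.0)
monitor.record_heartbeat("DataNode 1", now=25.0)  # DataNode 2 stays silent
print("live:", monitor.live_nodes(now=40.0))
print("dead:", monitor.dead_nodes(now=40.0))
```

Passing the current time in as an argument keeps the sketch deterministic. A real monitor would read a clock instead.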

What is HDFS?
As mentioned earlier, the actual data is stored in DataNodes. Data is stored there in the
form of blocks. The default size of each block is 128 MB

What is HDFS?
Now, let’s consider storing a file of size 530 MB in HDFS
File.txt (530 MB) is split into five blocks:
Block A: 128 MB
Block B: 128 MB
Block C: 128 MB
Block D: 128 MB
Block E: 18 MB
The final block uses only the remaining space for storage
All these data blocks are stored in DataNodes (computers):
DataNode 1, DataNode 2, DataNode 3, DataNode 4, DataNode 5
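
The block arithmetic for the 530 MB file can be checked with a short Python sketch. The function name split_into_blocks is made up for this example. Only the 128 MB default block size comes from the slides.

```python
BLOCK_SIZE_MB = 128  # default block size stated in the slides


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks a file of the given size occupies."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        # The final block only holds whatever data is left over.
        sizes.append(remainder)
    return sizes


blocks = split_into_blocks(530)
for label, size in zip("ABCDE", blocks):
    print(f"Block {label}: {size} MB")
```

Four full blocks account for 512 MB, so the fifth block holds the remaining 18 MB.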

What happens if the computer that
contains block A crashes? Do we lose
the data in block A?

No, we don’t. That’s the beauty of Hadoop
HDFS. It uses replication to prevent the
loss of data

Replication in HDFS
HDFS overcomes the issue of DataNode failure by creating copies of the
data; this is known as the replication method
Block A is replicated. The replication factor is 3. The replicas are stored in
different DataNodes
Two replicas of the same block cannot be stored on the same DataNode
Similarly, every other block is replicated. The DataNodes are spread over Racks 1 to 5:
Block A: DN 1, DN 5, DN 6
Block B: DN 4, DN 8, DN 9
Block C: DN 7, DN 11, DN 15
Block D: DN 2, DN 10, DN 14
Block E: DN 3, DN 12, DN 13
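
A minimal Python sketch of the replication rule, assuming a simple round-robin layout. This is not the placement policy HDFS itself uses, it does not model racks, and the resulting layout differs from the one on the slide. The function names place_replicas and surviving_replicas are hypothetical.

```python
def place_replicas(blocks, datanodes, replication_factor=3):
    """Assign each block to `replication_factor` distinct DataNodes (round-robin)."""
    if replication_factor > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication factor")
    placement = {}
    start = 0
    for block in blocks:
        # Consecutive indices modulo the node count are always distinct here,
        # so two replicas of a block never share a DataNode.
        placement[block] = [datanodes[(start + i) % len(datanodes)]
                            for i in range(replication_factor)]
        start += replication_factor
    return placement


def surviving_replicas(placement, failed_datanode):
    """Return, per block, the replicas that remain after one DataNode fails."""
    return {block: [node for node in nodes if node != failed_datanode]
            for block, nodes in placement.items()}


datanodes = [f"DN {i}" for i in range(1, 16)]
placement = place_replicas(["A", "B", "C", "D", "E"], datanodes)
survivors = surviving_replicas(placement, failed_datanode="DN 1")
for block, nodes in survivors.items():
    print(f"Block {block}: {len(nodes)} replica(s) left on {nodes}")
```

Because every block lives on three different DataNodes, losing any single DataNode still leaves at least two copies of each block.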

Architecture of HDFS
The NameNode stores the metadata (Name, replicas, ….)
The DataNodes store the data blocks. A rack is a collection of DataNodes, and
blocks are replicated across racks (Rack 1, Rack 2)
Read: the client sends a read request to the NameNode (metadata ops). The NameNode
checks the read permission and tells the client to read the data from the DataNodes.
The client then reads the data from the DataNodes
Write: the client writes the data to the DataNodes, and the blocks are replicated
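
The read and write paths can be imitated with a toy in-memory model in Python. Everything here is an assumption made for illustration: the class and function names are invented, permission checks are left out, and the client writes every replica itself, which is a simplification of what HDFS does.

```python
class NameNode:
    """Holds only metadata: which blocks make up a file and where replicas live."""

    def __init__(self):
        self.block_map = {}  # filename -> list of (block_id, [DataNode names])

    def add_file(self, filename, block_locations):
        self.block_map[filename] = block_locations

    def get_block_locations(self, filename):
        return self.block_map[filename]


class DataNode:
    """Stores the actual block contents."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write_block(self, block_id, data):
        self.blocks[block_id] = data

    def read_block(self, block_id):
        return self.blocks[block_id]


def write_file(namenode, datanodes, filename, chunks, replication_factor=2):
    """Store each chunk as a block on several DataNodes, then record the metadata."""
    names = sorted(datanodes)
    locations = []
    for index, chunk in enumerate(chunks):
        block_id = f"{filename}#{index}"
        targets = [names[(index + r) % len(names)] for r in range(replication_factor)]
        for target in targets:
            datanodes[target].write_block(block_id, chunk)
        locations.append((block_id, targets))
    namenode.add_file(filename, locations)


def read_file(namenode, datanodes, filename):
    """Ask the NameNode where the blocks are, then read them from DataNodes."""
    parts = []
    for block_id, holders in namenode.get_block_locations(filename):
        parts.append(datanodes[holders[0]].read_block(block_id))
    return "".join(parts)


namenode = NameNode()
datanodes = {f"DataNode {i}": DataNode(f"DataNode {i}") for i in range(1, 4)}
write_file(namenode, datanodes, "File.txt", ["Welcome ", "to ", "Hadoop"])
print(read_file(namenode, datanodes, "File.txt"))
```

The NameNode never touches the file contents in this sketch. It only answers the question of where each block is stored.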

Features of HDFS
Fault tolerant: HDFS is fault tolerant as multiple copies of data are made
Data security: Provides end-to-end encryption that protects data
Scalability: Multiple nodes can be added to the cluster depending on the requirement
Flexibility: Hadoop is flexible in storing any type of data, like structured,
semi-structured or unstructured data

Now that we have stored data in
HDFS, how can we process it?

For processing data, Hadoop has a unit
known as MapReduce

Why MapReduce?
In the traditional approach, big data was processed at the master node
This was a disadvantage as it consumed more time to process various types of
data
To overcome this issue, data was processed at each slave node. This approach
is known as MapReduce

Hadoop MapReduce

What is MapReduce?
Hadoop MapReduce is a programming technique in which huge amounts of data are
processed in a parallel and distributed fashion

What is MapReduce?
MapReduce tasks: Map tasks and Reduce tasks
Map and Reduce steps: Input Data -> map() -> Shuffle and Sort -> reduce() -> Output Data
Input data is divided to form the input splits
The map phase is the first phase. Here, the data in each split is passed to a map
function to produce output values
In the shuffle and sort phase, the output of the map phase is taken and similar data
is grouped
In the reduce phase, the output values from the shuffling phase are aggregated. It then
returns a single output value

What is MapReduce?
Let us now see how MapReduce works with an example

Input data
Welcome to Hadoop
Hadoop is interesting
Hadoop is easy
Input Splits
Welcome to Hadoop
Hadoop is interesting
Hadoop is easy
Map phase
Welcome, 1
to, 1
Hadoop, 1
Hadoop, 1
is, 1
interesting, 1
Hadoop, 1
is, 1
easy, 1
Shuffle and Sort phase
easy, 1
Hadoop, 1
Hadoop, 1
Hadoop, 1
interesting, 1
is, 1
is, 1
to, 1
Welcome, 1
Reducer phase
easy, 1
Hadoop, 3
interesting, 1
is, 2
to, 1
Welcome, 1
Final Output
easy 1
Hadoop 3
interesting 1
is 2
to 1
Welcome 1
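
The same word count can be reproduced with a small Python sketch that runs the three phases in a single process. It imitates the data flow only and does not use the Hadoop API. The function names map_phase, shuffle_and_sort and reduce_phase are chosen for this example.

```python
from collections import defaultdict


def map_phase(split):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in split.split()]


def shuffle_and_sort(mapped_pairs):
    """Group the counts by word and sort the words alphabetically, ignoring case."""
    grouped = defaultdict(list)
    for word, count in mapped_pairs:
        grouped[word].append(count)
    return sorted(grouped.items(), key=lambda item: item[0].lower())


def reduce_phase(grouped):
    """Sum the grouped counts to get one total per word."""
    return [(word, sum(counts)) for word, counts in grouped]


input_splits = ["Welcome to Hadoop", "Hadoop is interesting", "Hadoop is easy"]
mapped = [pair for split in input_splits for pair in map_phase(split)]
result = reduce_phase(shuffle_and_sort(mapped))
for word, total in result:
    print(word, total)
```

In a real cluster each split would be mapped on a different node. Here the list comprehension simply visits the splits one after another.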

Features of MapReduce
Good load balancing: Splitting the stages into Map and Reduce tasks improves the
load balancing
Re-execution of tasks: There is an automatic re-execution if a certain task fails
Simple programming model: MapReduce has one of the simplest programming models,
which is based on Java. Java is a very common programming language
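
Automatic re-execution of a failed task can be pictured with the following Python sketch. The helper run_with_retries, the FlakyTask class and the limit of four attempts are assumptions made for this example. They are not taken from the slides or from the Hadoop API.

```python
class FlakyTask:
    """A stand-in task that fails a fixed number of times before succeeding."""

    def __init__(self, failures_before_success):
        self.remaining_failures = failures_before_success

    def __call__(self):
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise RuntimeError("simulated task failure")
        return "done"


def run_with_retries(task, max_attempts=4):
    """Run the task again after a failure, up to max_attempts times in total."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(), attempt
        except RuntimeError:
            if attempt == max_attempts:
                raise  # give up once the attempt budget is used


output, attempts = run_with_retries(FlakyTask(failures_before_success=2))
print(f"task returned {output!r} after {attempts} attempt(s)")
```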

HDFS and MapReduce were the two units
of Hadoop 1.0

Hadoop 1.0 was also known as
MapReduce Version 1

The disadvantage of this version was
that the JobTracker handled both the
processing of jobs and resource
allocation

As a result, the JobTracker was overburdened
by handling both job scheduling and
resource management

To overcome this issue, Hadoop 2
introduced YARN as the resource management
layer, which supports many processing frameworks

Hadoop YARN

What is YARN?
Yet Another Resource Negotiator (YARN) acts as the resource management
unit of Hadoop
Apache YARN consists of
Resource Manager: It is the master daemon. It manages the assignment of
resources such as CPU and memory
Node Manager: It is the slave daemon. It reports the resource
usage to the Resource Manager
Application Master: It negotiates resources from the Resource
Manager and works with the Node Manager

What is YARN?
A container is a collection of physical resources such as CPU and RAM
The App Master requests containers from the Resource Manager. It uses the
containers allocated by the Node Manager
The client program sends an application request to the Resource
Manager
The Node Manager reports the status of its node to the Resource
Manager
The Resource Manager contacts the Node Manager to request
resources (containers). The Node Manager grants the request
The App Master contacts the Node Manager to use the containers and runs in
one of the containers allocated on one of the nodes
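
A toy Python model of this request flow, under loud assumptions: the class names, the first-fit allocation rule, and the memory and vcore numbers are invented for illustration. This is not the YARN API or its scheduler.

```python
from dataclasses import dataclass


@dataclass
class Container:
    """A bundle of resources granted on one node."""
    node: str
    memory_mb: int
    vcores: int


class ResourceManager:
    """Tracks free resources per node and hands out containers (first fit)."""

    def __init__(self):
        self.free = {}  # node name -> [free memory in MB, free vcores]

    def report_node_status(self, node, free_memory_mb, free_vcores):
        """Called on behalf of a Node Manager to report what its node has free."""
        self.free[node] = [free_memory_mb, free_vcores]

    def allocate(self, memory_mb, vcores):
        """Grant a container on the first node with enough room, else None."""
        for node in sorted(self.free):
            free_memory, free_vcores = self.free[node]
            if free_memory >= memory_mb and free_vcores >= vcores:
                self.free[node] = [free_memory - memory_mb, free_vcores - vcores]
                return Container(node, memory_mb, vcores)
        return None


rm = ResourceManager()
rm.report_node_status("node-1", free_memory_mb=2048, free_vcores=2)
rm.report_node_status("node-2", free_memory_mb=4096, free_vcores=4)

app_master = rm.allocate(memory_mb=1024, vcores=1)  # container for the App Master
task = rm.allocate(memory_mb=2048, vcores=2)        # container requested by the App Master
too_big = rm.allocate(memory_mb=4096, vcores=1)     # no node has this much left
print(app_master, task, too_big, sep="\n")
```

The last request returns None because neither node has 4096 MB free any more. A real scheduler would queue the request rather than drop it.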

Features of YARN
Job scheduling: YARN is responsible for processing job requests and
allocating resources
Multitenancy: Different versions of MapReduce can run on YARN. This makes
upgrading of MapReduce manageable
Scalability: Depending on the requirement, the number of nodes can be increased

Many companies use Hadoop for storing
and processing data. Now, let me tell you
about one such company

Use case - Pinterest

You would have probably heard of the
popular image sharing website Pinterest

Pinterest is a social media platform which allows you to pin any
interesting information you find on its site

Pinterest has more than 250 million users and nearly 30 billion pins. All of this
amounts to big data for Pinterest

Problem
Pinterest faced a challenge in processing a tremendous amount of data
It was difficult to analyze which data should be displayed in a user’s
personalized discovery engine

Solution
Pinterest uses Hadoop to process and analyze big data in a way that helps
the company show the most relevant content to its users
Through continuous analysis of the data, Pinterest can provide its users with
features such as related pins, guided search and so on

This is how Pinterest benefited from
Hadoop. Let’s also start using Hadoop to
put an end to the big data challenges we
are facing

Demo on HDFS, MapReduce
and YARN
