Apache Hadoop: The Hadoop Ecosystem
By Rohit Raj
Why Hadoop?
Suppose we want to process some data. In the traditional approach, data was stored on local machines and processed there. As data volumes grew, local machines were no longer capable of storing such huge data sets, so data started to be stored on remote servers. Now suppose we need to process that data. In the traditional approach, the data has to be fetched from the servers before it can be processed. If this data is, say, 500 GB, fetching it is complex and expensive in practice. This approach is also called the Enterprise Approach.
In the Hadoop approach, instead of fetching the data onto local machines, we send the query to the data. Obviously, the query is far smaller than the data itself. Moreover, at the server, the query is divided into several parts, and all these parts process the data simultaneously. This is called parallel execution and is made possible by MapReduce. Not only is there no need to fetch the data, but the processing also takes less time. The result of the query is then sent back to the user. Thus Hadoop makes data storage, processing, and analysis far easier than the traditional approach.
What Is Hadoop?
Apache Hadoop is an open-source framework for the distributed storage and parallel processing of very large data sets across clusters of commodity hardware.
Companies Using Hadoop-Based Systems:
Facebook
Google
JPMorgan Chase
Goldman Sachs
Yahoo
AWS
Microsoft
IBM
Cloudera
IQVIA
Rackspace Technology
And many more
Job Opportunities With Apache Hadoop:
Data Scientist
Big Data Visualizer
Big Data Research Analyst
Big Data Engineer
Big Data Analyst
Big Data Architect
And many more
Hadoop Architecture
1. Hadoop Distributed File System (HDFS): On a local PC, the default file system block size on the hard disk is 4 KB. When we install Hadoop, HDFS uses a far larger default block size, since it is meant to store huge data sets: 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x and later. The block size can also be changed (for example, raised to 256 MB). HDFS works with a Name Node and Data Nodes. The Name Node is a master service that keeps the metadata, i.e. which commodity hardware each piece of data resides on, while the Data Nodes store the actual data. Because the blocks are large, a file needs far fewer of them, so the storage required for metadata is reduced, making HDFS more efficient. Hadoop also stores three replicas of every block, by default on different nodes, so the failure of a single machine does not cause data loss. (A minimal hdfs-site.xml sketch for these settings follows this list.)
2. MapReduce: In the simplest terms, MapReduce breaks a query into multiple parts, and each part processes the data concurrently. This parallel execution makes queries run faster and makes Hadoop a suitable and optimal choice for dealing with Big Data.
3. YARN: Yet Another Resource Negotiator works like an operating system for Hadoop; since operating systems are resource managers, YARN likewise manages the resources of the Hadoop cluster so that it can serve big data workloads better.
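To make the block-size and replication settings in item 1 concrete, here is a minimal hdfs-site.xml sketch. The property names dfs.blocksize and dfs.replication are the standard HDFS configuration keys; the values shown are only illustrative.

    <!-- hdfs-site.xml: illustrative values for the two settings above -->
    <configuration>
      <property>
        <name>dfs.blocksize</name>
        <value>134217728</value> <!-- 128 MB, the Hadoop 2.x+ default -->
      </property>
      <property>
        <name>dfs.replication</name>
        <value>3</value> <!-- keep three replicas of every block -->
      </property>
    </configuration>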
Hadoop has a master-slave topology. In this topology, we have one master node and multiple slave nodes. The master node's function is to assign tasks to the various slave nodes and to manage resources. The slave nodes do the actual computing. Slave nodes store the real data, whereas the master holds the metadata, i.e. data about the data.
MapReduce
MapReduce is the processing layer of Hadoop. The MapReduce programming model is designed for processing large volumes of data in parallel by dividing the work into a set of independent tasks. You put your business logic into the way MapReduce works, and the rest is taken care of by the framework. The work (a complete job) submitted by the user to the master is divided into small pieces of work (tasks) and assigned to slaves.
MapReduce programs are written in a particular style influenced by functional programming constructs, specifically idioms for processing lists of data. In MapReduce, we take input from a list and convert it into output that is again a list. It is the heart of Hadoop: Hadoop is so powerful and efficient because MapReduce performs the processing in parallel.
Two key words: 1. Map, 2. Reduce
1. Map() performs sorting and filtering of the data, thereby organizing it into groups. Map generates key-value pairs as its result, which are later processed by the Reduce() method.
2. Reduce(), as the name suggests, does the summarization by aggregating the mapped data. Put simply, Reduce() takes the output generated by Map() as input and combines those tuples into a smaller set of tuples.
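To make Map() and Reduce() concrete, here is the classic word-count job written against the Hadoop MapReduce Java API, closely following the official Apache MapReduce tutorial; the input and output paths are supplied on the command line.

    // WordCount.java: a minimal sketch using the Hadoop MapReduce Java API.
    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map(): emit a (word, 1) pair for every word in this mapper's input split.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce(): the framework groups pairs by key; sum the counts per word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // map-side pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The mapper emits a (word, 1) pair for every word; the framework sorts and groups the pairs by key; the reducer sums each group. Reusing the reducer as a combiner pre-aggregates counts on the map side and cuts shuffle traffic.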
Hadoop YARN
Yet Another Resource Negotiator: as the name implies, YARN is the component that manages resources across the cluster. In short, it performs scheduling and resource allocation for the Hadoop system.
It consists of three major components:
a. Resource Manager
b. Node Manager
c. Application Master
The Resource Manager has the privilege of allocating resources to the applications in the system, whereas the Node Managers manage the resources, such as CPU, memory, and bandwidth, on each machine and report back to the Resource Manager. The Application Master (one per application) works as an interface between the Resource Manager and the Node Managers, negotiating resources as the application requires.
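As a quick illustration, a minimal yarn-site.xml sketch might declare where the Resource Manager runs and how much of each machine a Node Manager may offer to containers. The property names are standard YARN configuration keys; the hostname and sizes are placeholders.

    <!-- yarn-site.xml: placeholder values; the property names are standard -->
    <configuration>
      <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>master.example.com</value> <!-- where the Resource Manager runs -->
      </property>
      <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>8192</value> <!-- memory each Node Manager may hand to containers -->
      </property>
      <property>
        <name>yarn.nodemanager.resource.cpu-vcores</name>
        <value>4</value> <!-- vcores each Node Manager may hand to containers -->
      </property>
      <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value> <!-- enables the MapReduce shuffle on YARN -->
      </property>
    </configuration>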
Hadoop HDFS
HDFS stands for Hadoop Distributed File System. It provides the data storage layer of Hadoop. HDFS splits each data unit into smaller units called blocks and stores them in a distributed manner. It runs two daemons: one for the master node, the NameNode, and one for the slave nodes, the DataNode.
HDFS has a master-slave architecture. The NameNode daemon runs on the master server. It is responsible for namespace management and regulates file access by clients. The DataNode daemon runs on the slave nodes and is responsible for storing the actual business data. Internally, a file gets split into a number of data blocks that are stored on a group of slave machines.
The NameNode manages modifications to the file system namespace: actions such as opening, closing, and renaming files or directories. The NameNode also keeps track of the mapping of blocks to DataNodes. The DataNodes serve read/write requests from the file system's clients, and also create, delete, and replicate blocks on demand from the NameNode.
Terms related to HDFS:
Heartbeat: the signal that a DataNode continuously sends to the NameNode. If the NameNode does not receive a heartbeat from a DataNode, it considers that node dead.
Balancing: if a DataNode crashes, the blocks stored on it are gone too, and those blocks become under-replicated compared to the rest. The master node (NameNode) then signals the DataNodes holding replicas of the lost blocks to re-replicate them, so that the overall distribution of blocks stays balanced.
Replication: the actual copying of blocks is done by the DataNodes.
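These block and replication mechanics are easy to observe from the HDFS command line. A small sketch, assuming a running cluster with stock settings; the file and directory names are illustrative.

    # Copy a local file into HDFS; it is split into blocks (128 MB by
    # default on Hadoop 2.x+), each replicated three times by default.
    hdfs dfs -mkdir -p /user/demo
    hdfs dfs -put bigfile.log /user/demo/

    # Show how the file was split into blocks and where each replica lives.
    hdfs fsck /user/demo/bigfile.log -files -blocks -locations

    # Change the replication factor for this one file and wait for it to apply.
    hdfs dfs -setrep -w 2 /user/demo/bigfile.log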
Features Of Hadoop
Economically Feasible: It is cheaper to store and process data than it was in the traditional approach, since the machines used to store the data are only commodity hardware.
Easy to Use: The projects and tools provided by Apache Hadoop are easy to work with when analyzing complex data sets.
Open Source: Hadoop is distributed as open-source software under the Apache License, so one does not need to pay for it; just download it and use it.
Scalability: Hadoop is highly scalable by nature. To scale the cluster up or down, one only needs to change the number of commodity machines in the cluster.
Fault Tolerance: Since Hadoop stores three copies of every block, the data is safe even if one copy is lost to a commodity hardware failure. Moreover, Hadoop 3 supports multiple NameNodes, so even the NameNode as a single point of failure has been removed.
Locality of Data: This is one of the most alluring and promising features of Hadoop. To process a query over a data set, instead of bringing the data to the local computer, we send the computation to the servers where the data resides and fetch only the final result. This is called data locality.
Distributed Processing: HDFS and MapReduce ensure distributed storage and processing of the data.
Advantages and Disadvantages
Difference Between Hadoop and RDBMS
The Hadoop Ecosystem
It is a platform/framework that helps to solve Big Data problems.
Thank You
