Introduction to Apache Hadoop Eco-System
Md. Hasan Basri
Technology Enthusiast
linkedin.com/in/pothiq
twitter.com/pothiq
pothiq@gmail.com
"The name my kid gave a stuffed yellow
elephant. Short, relatively easy to spell and
pronounce, meaningless and not used
elsewhere: those are my naming criteria.
Kids are good at generating such."
- Doug Cutting, Creator of Hadoop
“Hadoop is the popular open
source implementation of
MapReduce, a powerful tool
designed for deep analysis
and transformation of very
large data sets.”
https://hadoop.apache.org/
When to Use Hadoop?
1. For Processing Really BIG Data.
2. For Storing a Diverse Set of Data.
3. For Parallel Data Processing.
When NOT to Use Hadoop?
1. For Real-Time Data Analysis.
2. For a Relational Database System.
3. For a General Network File System.
4. For Non-Parallel Data Processing.
Hadoop feature releases
Map-Reduce vs YARN Architecture
Hadoop Core Components:
What is JobTracker?
JobTracker is the master daemon of
Apache Hadoop's classic MapReduce
engine.
JobTracker is an essential
service that farms out all
MapReduce tasks to the
different nodes in the cluster,
ideally to nodes that
already contain the data, or at
the very least to nodes in
the same rack as the nodes
containing the data.
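The locality preference described above can be sketched roughly as follows. This is a toy illustration, not JobTracker's actual scheduler: the `Node` class, the `pick_node` function, and the three locality labels are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical cluster node: a name plus the rack it sits in."""
    name: str
    rack: str


def pick_node(free_nodes, replica_nodes):
    """Pick the free node closest to the data.

    Preference order: a node holding the data, then a node in the same
    rack as a replica, then any free node.
    """
    replica_names = {n.name for n in replica_nodes}
    replica_racks = {n.rack for n in replica_nodes}
    for node in free_nodes:
        if node.name in replica_names:
            return node, "data-local"
    for node in free_nodes:
        if node.rack in replica_racks:
            return node, "rack-local"
    if free_nodes:
        return free_nodes[0], "off-rack"
    return None, "none"
```

Moving the computation to the data, rather than the data to the computation, is what keeps network traffic low when inputs are very large.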
What is NameNode?
NameNode: also known as the master in a Hadoop cluster.
The main functions performed by the NameNode are listed below:
• It stores the metadata of the actual data, e.g. filename,
path, number of blocks, block IDs, block locations, number of
replicas, and slave-related configuration.
• It manages the filesystem namespace.
• It regulates client access to files.
• It assigns work to the slaves (DataNodes).
• It executes filesystem namespace operations such as
opening and closing files and renaming files and directories.
• It keeps metadata in memory for fast retrieval,
so it requires a large amount of memory for its
operation.
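To make the metadata role concrete, here is a minimal in-memory sketch. The `ToyNameNode` and `FileMeta` classes and their methods are hypothetical and greatly simplified; they are not part of any Hadoop API.

```python
from dataclasses import dataclass, field


@dataclass
class FileMeta:
    """Per-file metadata: path, replication factor, ordered block IDs."""
    path: str
    replication: int
    block_ids: list = field(default_factory=list)


class ToyNameNode:
    """Hypothetical, simplified model of the NameNode's in-memory state."""

    def __init__(self):
        self.namespace = {}        # path -> FileMeta
        self.block_locations = {}  # block ID -> set of DataNode names

    def create(self, path, block_ids, replication=3):
        if path in self.namespace:
            raise FileExistsError(path)
        self.namespace[path] = FileMeta(path, replication, list(block_ids))

    def rename(self, old, new):
        # A namespace operation: only metadata changes, no data moves.
        meta = self.namespace.pop(old)
        meta.path = new
        self.namespace[new] = meta

    def locate(self, path):
        """Return (block ID, sorted DataNode names) pairs for a file."""
        meta = self.namespace[path]
        return [(b, sorted(self.block_locations.get(b, ())))
                for b in meta.block_ids]
```

Because every file and block needs an entry in structures like these, memory use on the NameNode grows with the number of files and blocks rather than with the number of bytes stored.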
What is Secondary NameNode?
Despite its name, the Secondary NameNode is not a backup node for the
NameNode. First, a brief note on the NameNode itself.
The NameNode holds the metadata for HDFS, such as block information
and sizes. This information is kept in main memory and also on disk
for persistent storage.
On disk, the information is stored in two different files:
Editlogs: keeps track of every change made to HDFS.
Fsimage: stores a snapshot of the file system.
The Secondary NameNode periodically merges the edit log into the
fsimage (a checkpoint), so that the edit log stays small.
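The relationship between the two files can be sketched as a snapshot plus a replayable log. In standard HDFS the Secondary NameNode periodically merges the edit log into the fsimage; the code below is a toy model of that merge, where the dict-based image and the tuple-shaped edits are assumptions made for illustration and not the real on-disk formats.

```python
import copy


def apply_edit(image, edit):
    """Apply one logged change to an image (path -> list of block IDs)."""
    op = edit[0]
    if op == "create":
        _, path = edit
        image[path] = []
    elif op == "add_block":
        _, path, block_id = edit
        image[path].append(block_id)
    elif op == "rename":
        _, old, new = edit
        image[new] = image.pop(old)
    elif op == "delete":
        _, path = edit
        del image[path]
    else:
        raise ValueError(f"unknown edit: {op}")


def checkpoint(fsimage, edit_log):
    """Replay the edit log on a copy of the snapshot.

    Returns the new snapshot and an empty log, leaving the inputs untouched.
    """
    new_image = copy.deepcopy(fsimage)
    for edit in edit_log:
        apply_edit(new_image, edit)
    return new_image, []
```

Without periodic checkpoints the log would grow without bound, and rebuilding the namespace at startup would mean replaying every change since the last snapshot.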
What is DataNode?
• A DataNode is also known as a slave node.
• In the Hadoop HDFS architecture, DataNodes store the
actual data in HDFS.
• DataNodes are responsible for serving read and write
requests from clients.
• DataNodes can be deployed on commodity hardware.
• Each DataNode sends information to the NameNode
about the files and blocks stored on that node and
responds to the NameNode for all filesystem
operations.
• When a DataNode starts up, it announces itself to
the NameNode along with the list of blocks it is
responsible for.
• A DataNode is usually configured with a lot of hard
disk space, because the actual data is stored on
the DataNode.
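The startup announcement described above is often called a block report. A rough sketch of how the NameNode side might track it follows; `BlockMap` and its methods are hypothetical names, and the default target of 3 replicas is an assumption based on the commonly cited HDFS default.

```python
from collections import defaultdict


class BlockMap:
    """Hypothetical NameNode-side view of which DataNodes hold each block."""

    def __init__(self):
        self.locations = defaultdict(set)  # block ID -> DataNode names

    def block_report(self, datanode, block_ids):
        """Replace everything previously known about one DataNode."""
        for holders in self.locations.values():
            holders.discard(datanode)
        for block_id in block_ids:
            self.locations[block_id].add(datanode)

    def under_replicated(self, target=3):
        """Blocks with fewer live replicas than the target (assumed 3)."""
        return sorted(b for b, holders in self.locations.items()
                      if len(holders) < target)
```

A map like this is what lets the master notice missing replicas and schedule new copies on other DataNodes.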
What is HDFS?
HDFS is a distributed file system that splits files into blocks and spreads them across
many machines, so that many files can be stored and read in parallel at high throughput.
It is one of the basic components of the Hadoop framework.
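A quick back-of-the-envelope calculation shows what block storage means for a single file. The 128 MB block size and replication factor of 3 used here are assumptions based on commonly cited HDFS defaults; both are configurable, and `hdfs_footprint` is a hypothetical helper.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed default block size, in bytes
REPLICATION = 3                 # assumed default replication factor


def hdfs_footprint(file_size_bytes, block_size=BLOCK_SIZE,
                   replication=REPLICATION):
    """Estimate block count and raw storage for one file.

    The last block only occupies as much disk as it actually holds, so raw
    usage is the file size times the replication factor.
    """
    blocks = (file_size_bytes + block_size - 1) // block_size  # ceiling
    return {
        "blocks": blocks,
        "block_replicas": blocks * replication,
        "raw_bytes": file_size_bytes * replication,
    }
```

Under these assumptions a 1 GiB file occupies 8 blocks, 24 block replicas across the cluster, and 3 GiB of raw disk.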
Sequence Diagram for Hadoop-MapReduce
Programming Model
Big Data Hadoop Real Life Use Cases:
1. Healthcare
2. Wildlife
3. Retail Industry
4. Income Tax to scrutinize bank accounts
5. Fraud Detection
6. Sentiment Analysis
7. Network Security
8. Education etc.
Companies Using Hadoop:
Why Hadoop?
1. Ability to store and process huge amounts of any kind of data, quickly.
2. Computing model processes big data fast
3. Fault tolerance
4. Flexibility
5. Low Cost
6. Scalability
   • Vertical scaling doesn’t cut it
   • Disk seek times
   • Hardware failures
   • Processing times
   • Horizontal scaling is linear
7. It’s not just for batch processing anymore
Hadoop Timeline
• Google published the GFS and MapReduce papers in 2003 and 2004.
• Around the same time, Doug Cutting and Mike Cafarella were building “Nutch”, an open source web search engine.
• Hadoop was split out of Nutch in 2006, driven primarily by Doug Cutting, who joined Yahoo! that year.
• It’s been evolving ever since.
What is BIG-DATA?
Big data is a term that describes the
large volume of data – both
structured and unstructured – that
inundates a business on a day-to-day
basis. But it’s not the amount of data
that’s important. It’s what
organizations do with the data that
matters. Big data can be analyzed for
insights that lead to better decisions
and strategic business moves.
Big Data Current Considerations
Volume. Organizations collect data from a variety of sources, including business transactions, social media
and information from sensor or machine-to-machine data.
Velocity. Data streams in at an unprecedented speed and must be dealt with in a timely manner. RFID tags,
sensors and smart metering are driving the need to deal with torrents of data in near-real time.
Variety. Data comes in all types of formats – from structured, numeric data in traditional databases to
unstructured text documents, email, video, audio, stock ticker data and financial transactions.
Variability. In addition to the increasing velocities and varieties of data, data flows can be highly
inconsistent with periodic peaks. Is something trending in social media? Daily, seasonal and event-triggered
peak data loads can be challenging to manage. Even more so with unstructured data.
Complexity. Today's data comes from multiple sources, which makes it difficult to link, match, cleanse and
transform data across systems. However, it’s necessary to connect and correlate relationships, hierarchies
and multiple data linkages or your data can quickly spiral out of control.
What is MapReduce?
MapReduce is a programming
model or pattern within the
Hadoop framework that is used to
process, in parallel, big data stored in the
Hadoop Distributed File System (HDFS). It is a
core component, integral to the
functioning of the Hadoop
framework.
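The model is easiest to see with the classic word count, run here in a single Python process. This is a sketch of the map, shuffle, and reduce steps only, not Hadoop's Java MapReduce API; the function names are illustrative.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
```

In a real cluster the map calls run on many nodes at once, each on its own slice of the input, and the shuffle moves the grouped pairs across the network to the reducers.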
MapReduce is a programming model
Major Components of Hadoop
Core Hadoop Ecosystem, Query Engines, External Data Storage
Core Hadoop Ecosystem
Query Engines
Real World Application Architecture
External Data Storage
Useful URLs
https://data-flair.training/blogs/hadoop-ecosystem-components/
https://www.quora.com/What-is-a-Hadoop-ecosystem
https://www.geeksforgeeks.org/hadoop-ecosystem/
https://www.edureka.co/blog/hadoop-ecosystem
https://www.simplilearn.com/big-data-and-hadoop-ecosystem-tutorial