HADOOP: A DISTRIBUTED
FRAMEWORK FOR BIG DATA
Submitted by
Name: P. Maharajothi
Class: II M.Sc. (Computer Science)
Batch: 2017-2019
In-charge Staff: Ms. M. Florence Dayana




• Hadoop’s history and advantages
• Architecture in detail
• Hadoop in industry
INTRODUCTION


• An Apache top-level project: an open-source implementation of
frameworks for reliable, scalable, distributed computing and
storage.
• It is a flexible and highly available architecture for large-scale
computation and data processing on a network of commodity
hardware.
DEFINE HADOOP
Designed to answer the question:
“How to process big data with reasonable cost and time?”
BRIEF HISTORY OF HADOOP
EXAMPLE: GOOGLE (2003)
GOOGLE ORIGINS
Hadoop’s design was inspired by papers Google published on its
internal infrastructure: the Google File System (2003) and
MapReduce (2004).
Hadoop:
• An open-source software framework that supports data-intensive
distributed applications, licensed under the Apache v2 license.
Goals / Requirements:
• Abstract and facilitate the storage and processing of large and/or
rapidly growing data sets
• Structured and non-structured data
• Simple programming models
WHAT IS HADOOP?
HADOOP FRAMEWORK TOOL
• Distributed, with some centralization
• The main nodes of the cluster are where most of the computational
power and storage of the system lies
• Main nodes run a TaskTracker to accept and reply to MapReduce
tasks, and also a DataNode to store needed blocks as closely as
possible
• A central control node runs the NameNode to keep track of HDFS
directories and files, and the JobTracker to dispatch compute tasks
to the TaskTrackers
HADOOP ARCHITECTURE
DIAGRAM: SWITCHES AND NODES
Name Node:
• Stores metadata for the files, like the directory structure of a
typical file system.
• The server holding the Name Node instance is quite crucial, as
there is only one.
• Keeps a transaction log for file deletes, adds, etc. Does not use
transactions for whole blocks or file streams, only metadata.
• Handles creation of additional replica blocks when necessary
after a Data Node failure.
NAME NODE
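The Name Node's re-replication job described above can be sketched in a few lines. This is a hypothetical simplification in plain Python, not Hadoop's actual internals: the function and node names are illustrative, and the real Name Node also weighs rack placement when choosing targets.

```python
# Hypothetical sketch of the Name Node's re-replication check:
# when a Data Node fails, any block whose live replica count falls
# below the target must be copied to another node.
TARGET_REPLICAS = 3  # HDFS default replication factor

def under_replicated(block_map, failed_node, target=TARGET_REPLICAS):
    """Return the blocks that need new replicas after `failed_node` dies.

    block_map maps a block ID to the list of Data Nodes holding it.
    """
    needy = []
    for block, nodes in block_map.items():
        live = [n for n in nodes if n != failed_node]
        if len(live) < target:
            needy.append(block)
    return needy

blocks = {"blk_1": ["dn1", "dn2", "dn3"],
          "blk_2": ["dn2", "dn4", "dn5"]}
print(under_replicated(blocks, "dn1"))  # only blk_1 loses a replica
```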
Data Node:
• Stores the actual data in HDFS
• Can run on any underlying file system (ext3/4, NTFS, etc.)
• Notifies the Name Node of which blocks it holds
• The Name Node replicates blocks 2x in the local rack, 1x elsewhere
DATA NODE ARCHITECTURE
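The replication factor mentioned above is a cluster setting rather than hard-coded behavior. A minimal excerpt of what this looks like in `hdfs-site.xml`, using the standard `dfs.replication` property:

```xml
<!-- hdfs-site.xml (excerpt) -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>  <!-- replicas per block; 3 is the HDFS default -->
  </property>
</configuration>
```

Individual files can also be given a different replication count at write time, so this value is only the cluster-wide default.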
• Hadoop Distributed File System (HDFS)
• Tailored to the needs of MapReduce
• Targeted towards many reads of file streams
• Writes are more costly
• High degree of data replication (3x by default)
• No need for RAID on normal nodes
• Large block size (64 MB)
• Location awareness of Data Nodes in the network
HADOOP FILE SYSTEM
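The two numbers on the previous slide (64 MB blocks, 3x replication) determine how a file is split and how much raw disk it consumes. A small back-of-the-envelope sketch in plain Python (the function name is ours, not a Hadoop API):

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB default block size
REPLICATION = 3                 # 3 replicas per block by default

def hdfs_storage(file_size_bytes, block_size=BLOCK_SIZE,
                 replication=REPLICATION):
    """Return (block_count, raw_bytes_stored) for one file.

    The last block may be partially filled, so raw storage is the
    file size times the replication factor, not blocks * block_size.
    """
    blocks = math.ceil(file_size_bytes / block_size)
    return blocks, file_size_bytes * replication

# A 200 MB file: ceil(200/64) = 4 blocks, 600 MB of raw cluster storage
print(hdfs_storage(200 * 1024 * 1024))
```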
HADOOP MAPREDUCE ENGINE
MapReduce Engine:
• Job Tracker & Task Tracker
• The JobTracker splits up data into smaller tasks (“Map”) and sends
them to the TaskTracker process on each node
• Each TaskTracker reports back to the JobTracker node on job
progress, sends data (“Reduce”), or requests new jobs



• None of these components is necessarily limited to using HDFS
• Many other distributed file systems with quite different
architectures work with them
• Many other software packages besides Hadoop’s MapReduce
platform make use of HDFS
COMPONENTS
Hadoop is in use at most organizations that handle big data:
• Yahoo!
• Facebook
• Amazon
• Netflix
• Etc.
Some examples of scale:
• Yahoo!’s Search Webmap runs on a 10,000-core Linux cluster and
powers Yahoo! web search
• Facebook’s Hadoop cluster hosted 100+ PB of data (July 2012),
growing at roughly ½ PB/day (Nov 2012)
HADOOP IN INDUSTRY
THANK YOU

Hadoop — Maharajothi, II M.Sc. Computer Science, Bon Secours College for Women
