HADOOP: A DISTRIBUTED
FRAMEWORK FOR BIG DATA
Submitted by
Name: P. Maharajothi
Class: II M.Sc. (Computer Science)
Batch: 2017-2019
In-charge Staff: Ms. M. Florence Dayana




• Hadoop’s history and advantages
• Architecture in detail
• Hadoop in industry
INTRODUCTION


• An Apache top-level project: an open-source implementation of
frameworks for reliable, scalable, distributed computing and
storage.
• It is a flexible and highly available architecture for large-scale
computation and data processing on a network of commodity
hardware.
DEFINE HADOOP
Designed to answer the question:
“How to process big data with reasonable cost and time?”
BRIEF HISTORY OF HADOOP
EXAMPLE: GOOGLE (2003)
GOOGLE ORIGINS
Hadoop’s design was inspired by papers Google published on its
internal infrastructure: the Google File System (2003) and
MapReduce (2004).
Hadoop:
• An open-source software framework that supports data-intensive
distributed applications, licensed under the Apache v2 license.
Goals / Requirements:
• Abstract and facilitate the storage and processing of large and/or
rapidly growing data sets
• Structured and non-structured data
• Simple programming models
WHAT IS HADOOP?
HADOOP FRAMEWORK TOOL
• Distributed, with some centralization
• The main nodes of the cluster are where most of the computational
power and storage of the system lies
• Main nodes run a TaskTracker to accept and reply to MapReduce
tasks, and also a DataNode to store needed blocks as closely as
possible
• A central control node runs the NameNode to keep track of HDFS
directories and files, and the JobTracker to dispatch compute tasks
to the TaskTrackers
HADOOP ARCHITECTURE
DIAGRAM: SWITCHES AND NODES
Name Node:
• Stores metadata for the files, like the directory structure of a
typical file system.
• The server holding the Name Node instance is quite crucial, as
there is only one.
• Keeps a transaction log for file deletes, adds, etc. Does not use
transactions for whole blocks or file streams, only metadata.
• Handles creation of additional replica blocks when necessary
after a Data Node failure.
NAME NODE
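The Name Node's re-replication job described above can be sketched in a few lines. This is a hypothetical simplification in plain Python, not Hadoop's actual internals: the function and node names are illustrative, and the real Name Node also weighs rack placement when choosing targets.

```python
# Hypothetical sketch of the Name Node's re-replication check:
# when a Data Node fails, any block whose live replica count falls
# below the target must be copied to another node.
TARGET_REPLICAS = 3  # HDFS default replication factor

def under_replicated(block_map, failed_node, target=TARGET_REPLICAS):
    """Return the blocks that need new replicas after `failed_node` dies.

    block_map maps a block ID to the list of Data Nodes holding it.
    """
    needy = []
    for block, nodes in block_map.items():
        live = [n for n in nodes if n != failed_node]
        if len(live) < target:
            needy.append(block)
    return needy

blocks = {"blk_1": ["dn1", "dn2", "dn3"],
          "blk_2": ["dn2", "dn4", "dn5"]}
print(under_replicated(blocks, "dn1"))  # only blk_1 loses a replica
```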
Data Node:
• Stores the actual data in HDFS
• Can run on any underlying file system (ext3/4, NTFS, etc.)
• Notifies the Name Node of which blocks it holds
• The Name Node replicates blocks 2x in the local rack, 1x elsewhere
DATA NODE ARCHITECTURE
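The replication factor mentioned above is a cluster setting rather than hard-coded behavior. A minimal excerpt of what this looks like in `hdfs-site.xml`, using the standard `dfs.replication` property:

```xml
<!-- hdfs-site.xml (excerpt) -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>  <!-- replicas per block; 3 is the HDFS default -->
  </property>
</configuration>
```

Individual files can also be given a different replication count at write time, so this value is only the cluster-wide default.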
• Hadoop Distributed File System (HDFS)
• Tailored to the needs of MapReduce
• Targeted towards many reads of file streams
• Writes are more costly
• High degree of data replication (3x by default)
• No need for RAID on normal nodes
• Large block size (64 MB)
• Location awareness of Data Nodes in the network
HADOOP FILE SYSTEM
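The two numbers on the previous slide (64 MB blocks, 3x replication) determine how a file is split and how much raw disk it consumes. A small back-of-the-envelope sketch in plain Python (the function name is ours, not a Hadoop API):

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB default block size
REPLICATION = 3                 # 3 replicas per block by default

def hdfs_storage(file_size_bytes, block_size=BLOCK_SIZE,
                 replication=REPLICATION):
    """Return (block_count, raw_bytes_stored) for one file.

    The last block may be partially filled, so raw storage is the
    file size times the replication factor, not blocks * block_size.
    """
    blocks = math.ceil(file_size_bytes / block_size)
    return blocks, file_size_bytes * replication

# A 200 MB file: ceil(200/64) = 4 blocks, 600 MB of raw cluster storage
print(hdfs_storage(200 * 1024 * 1024))
```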
HADOOP MAPREDUCE ENGINE
MapReduce Engine:
• Job Tracker & Task Tracker
• The JobTracker splits up data into smaller tasks (“Map”) and sends
them to the TaskTracker process on each node
• Each TaskTracker reports back to the JobTracker node on job
progress, sends data (“Reduce”), or requests new jobs



• None of these components is necessarily limited to using HDFS
• Many other distributed file systems with quite different
architectures work with them
• Many other software packages besides Hadoop’s MapReduce
platform make use of HDFS
COMPONENTS
Hadoop is in use at most organizations that handle big data:
• Yahoo!
• Facebook
• Amazon
• Netflix
• Etc.
Some examples of scale:
• Yahoo!’s Search Webmap runs on a 10,000-core Linux cluster and
powers Yahoo! web search
• Facebook’s Hadoop cluster hosted 100+ PB of data (July 2012),
growing at roughly ½ PB/day (Nov 2012)
HADOOP IN INDUSTRY
THANK YOU

Hadoop — Maharajothi, II M.Sc. Computer Science, Bon Secours College for Women
