0% found this document useful (0 votes)
48 views

Hadoop

Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It uses MapReduce as a programming model and HDFS as a distributed file system. HDFS stores large files across clusters and replicates data for reliability, while MapReduce allows parallel processing of datasets in a fault-tolerant manner. A typical Hadoop cluster integrates these components, with a master node running job and name nodes and slave nodes running task and data nodes.

Uploaded by

jefferyleclerc
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
48 views

Hadoop

Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It uses MapReduce as a programming model and HDFS as a distributed file system. HDFS stores large files across clusters and replicates data for reliability, while MapReduce allows parallel processing of datasets in a fault-tolerant manner. A typical Hadoop cluster integrates these components, with a master node running job and name nodes and slave nodes running task and data nodes.

Uploaded by

jefferyleclerc
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 7

Hadoop/MapReduce

Object-oriented framework presentation


CSCI 5448
Casey McTaggart
What is Apache Hadoop?
• Large scale, open source software framework
▫ Yahoo! has been the largest contributor to date
• Dedicated to scalable, distributed, data-intensive
computing
• Handles thousands of nodes and petabytes of data
• Supports applications under a free license
• 3 Hadoop subprojects:
▫ Hadoop Common: common utilities package
▫ HFDS: Hadoop Distributed File System with high
throughput access to application data
▫ MapReduce: A software framework for distributed
processing of large data sets on computer clusters
Hadoop MapReduce
• MapReduce is a programming model and software
framework first developed by Google (Google’s
MapReduce paper submitted in 2004)
• Intended to facilitate and simplify the processing of
vast amounts of data in parallel on large clusters of
commodity hardware in a reliable, fault-tolerant
manner
▫ Petabytes of data
▫ Thousands of nodes
• Computational processing occurs on both:
▫ Unstructured data : filesystem
▫ Structured data : database
Hadoop Distributed File System (HFDS)
• Inspired by Google File System
• Scalable, distributed, portable filesystem written in Java for
Hadoop framework
▫ Primary distributed storage used by Hadoop applications
• HFDS can be part of a Hadoop cluster or can be a stand-alone
general purpose distributed file system
• An HFDS cluster primarily consists of
▫ NameNode that manages file system metadata
▫ DataNode that stores actual data
• Stores very large files in blocks across machines in a large
cluster
▫ Reliability and fault tolerance ensured by replicating data across
multiple hosts
• Has data awareness between nodes
• Designed to be deployed on low-cost hardware
More on Hadoop file systems

• Hadoop can work directly with any distributed


file system which can be mounted by the
underlying OS
• However, doing this means a loss of locality as
Hadoop needs to know which servers are closest
to the data
• Hadoop-specific file systems like HFDS are
developed for locality, speed, fault tolerance,
integration with Hadoop, and reliability
Typical Hadoop cluster integrates
MapReduce and HFDS
• Master/slave architecture
• Master node contains
▫ Job tracker node (MapReduce layer)
▫ Task tracker node (MapReduce layer)
▫ Name node (HFDS layer)
▫ Data node (HFDS layer)
• Multiple slave nodes contain
▫ Task tracker node (MapReduce layer)
▫ Data node (HFDS layer)
• MapReduce layer has job and task tracker nodes
• HFDS layer has name and data nodes
Hadoop simple cluster graphic
MapReduce layer HFDS layer

Master Node

JobTracker TaskTracker Name Data

Slave Node
1..*
TaskTracker Data

You might also like