Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It uses MapReduce as a programming model and HDFS as a distributed file system. HDFS stores large files across clusters and replicates data for reliability, while MapReduce allows parallel processing of datasets in a fault-tolerant manner. A typical Hadoop cluster integrates these components, with a master node running the job tracker and name node, and slave nodes running task trackers and data nodes.
Hadoop
Hadoop/MapReduce
Object-oriented framework presentation
CSCI 5448
Casey McTaggart

What is Apache Hadoop?
• Large-scale, open-source software framework
  ▫ Yahoo! has been the largest contributor to date
• Dedicated to scalable, distributed, data-intensive computing
• Handles thousands of nodes and petabytes of data
• Supports applications under a free license
• 3 Hadoop subprojects:
  ▫ Hadoop Common: common utilities package
  ▫ HDFS: Hadoop Distributed File System with high-throughput access to application data
  ▫ MapReduce: a software framework for distributed processing of large data sets on computer clusters

Hadoop MapReduce
• MapReduce is a programming model and software framework first developed by Google (Google's MapReduce paper was published in 2004)
• Intended to facilitate and simplify the processing of vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner
  ▫ Petabytes of data
  ▫ Thousands of nodes
• Computational processing can occur on both:
  ▫ Unstructured data (in a filesystem)
  ▫ Structured data (in a database)

Hadoop Distributed File System (HDFS)
• Inspired by the Google File System
• Scalable, distributed, portable filesystem written in Java for the Hadoop framework
  ▫ Primary distributed storage used by Hadoop applications
• HDFS can be part of a Hadoop cluster or can be a stand-alone, general-purpose distributed file system
• An HDFS cluster primarily consists of
  ▫ A NameNode that manages file system metadata
  ▫ DataNodes that store the actual data
• Stores very large files in blocks across machines in a large cluster
  ▫ Reliability and fault tolerance ensured by replicating data across multiple hosts
• Has data awareness between nodes
• Designed to be deployed on low-cost hardware

More on Hadoop file systems
• Hadoop can work directly with any distributed file system that can be mounted by the underlying OS
• However, doing so loses data locality: Hadoop needs to know which servers are closest to the data, and a generic mounted file system does not provide that information
• Hadoop-specific file systems like HDFS are developed for locality, speed, fault tolerance, integration with Hadoop, and reliability

Typical Hadoop cluster integrates MapReduce and HDFS
• Master/slave architecture
• Master node contains
  ▫ Job tracker node (MapReduce layer)
  ▫ Task tracker node (MapReduce layer)
  ▫ Name node (HDFS layer)
  ▫ Data node (HDFS layer)
• Multiple slave nodes contain
  ▫ Task tracker node (MapReduce layer)
  ▫ Data node (HDFS layer)
• MapReduce layer has job and task tracker nodes
• HDFS layer has name and data nodes

Hadoop simple cluster graphic
[Figure: simple Hadoop cluster, showing the MapReduce layer and the HDFS layer]
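
To make the MapReduce programming model from the earlier slides concrete, here is a minimal single-process sketch of a word count in plain Python. It does not use Hadoop or any Hadoop API. The function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample input are assumptions chosen for illustration. On a real cluster the map and reduce calls would run in parallel on different nodes, and the framework would handle the grouping step between them.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(split: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input split."""
    for word in split.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group emitted values by key (done by the framework in practice)."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(splits: List[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over a list of input splits."""
    mapped = [pair for split in splits for pair in map_phase(split)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical input: each string stands in for one block of a larger file.
    splits = ["hadoop stores data", "hadoop processes data in parallel"]
    print(word_count(splits))
```

Each string in `splits` stands in for one block of a large file. Because every `map_phase` call depends only on its own split, the calls could be scheduled on whichever nodes already hold that block, which is the locality property the file system slides describe.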