Unit 4: Hadoop
BIG DATA

Hadoop
High Availability Distributed Object Oriented Platform
Hadoop
Hadoop is an open-source framework that allows one to store and process big data in a distributed environment across clusters of computers using simple programming models.
The latest stable version of Apache Hadoop, at the time of writing, is 3.3.1.
• The four main components of Hadoop are −
• Hadoop Distributed File System (HDFS) − This is a storage system that breaks large
files into smaller pieces and distributes them across multiple computers in a
cluster. It ensures data reliability and enables parallel processing of data across the
cluster.
• MapReduce − This is a programming model used for processing and analyzing large datasets in parallel across the cluster. It consists of two main tasks: Map, which processes and transforms input data into intermediate key-value pairs, and Reduce, which aggregates and summarizes the intermediate data to produce the final output (a minimal word-count sketch follows this list).
• YARN (Yet Another Resource Negotiator) − YARN is a resource management and
job scheduling component of Hadoop. It allocates resources (CPU, memory) to
various applications running on the cluster and manages their execution efficiently.
• Hadoop Common − This includes libraries and utilities used by other Hadoop
components. It provides tools and infrastructure for the entire Hadoop ecosystem,
such as authentication, configuration, and logging.
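To make the Map and Reduce tasks above concrete, here is a minimal word-count sketch using the standard Hadoop MapReduce Java API. The word-count problem and the class names (TokenizerMapper, IntSumReducer) are illustrative choices, not part of these slides.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map task: transform each input line into intermediate (word, 1) key-value pairs.
class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);   // emit an intermediate key-value pair
        }
    }
}

// Reduce task: aggregate all counts for the same word into a final total.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);     // emit the final (word, total) pair
    }
}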
Hadoop is not a data warehouse
Hadoop is not a data warehouse: the two serve different purposes and have different architectures. Hadoop is a framework for storing and processing large volumes of unstructured and semi-structured data across distributed clusters of computers. It is designed for handling big data and supports batch processing of large datasets using technologies like HDFS and MapReduce.
Advantage of Hadoop
The biggest advantage of Hadoop is its ability to handle
and process large volumes of data efficiently.
Hadoop is designed to distribute data and processing tasks
across multiple computers in a cluster, allowing it to
scale easily to handle massive datasets that traditional
databases or processing systems struggle to manage.
This enables organizations to store, process, and analyze
huge amounts of data, gaining valuable insights and
making informed decisions that would not be possible
with conventional technologies.
Which software is used in Hadoop?
• Hadoop Distributed File System (HDFS) stores large datasets across a cluster of computers, breaking them into smaller pieces for efficient storage and retrieval (see the sketch after this list).
• YARN manages computing resources across the
cluster, allocating resources to different applications
and ensuring efficient execution.
• MapReduce is the processing engine that divides
data processing tasks into smaller parts and
executes them in parallel across the cluster.
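A minimal sketch of how an application might talk to HDFS from Java, using the standard org.apache.hadoop.fs.FileSystem API; the file paths are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopyExample {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS and other settings from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into HDFS; HDFS transparently splits it into
        // blocks and replicates them across DataNodes in the cluster.
        fs.copyFromLocalFile(new Path("/tmp/input.txt"),        // hypothetical local path
                             new Path("/user/data/input.txt")); // hypothetical HDFS path
        fs.close();
    }
}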
Traditional Approach
• In this approach, an enterprise has a single computer to store and process big data. For storage purposes, programmers rely on the database vendor of their choice, such as Oracle, IBM, etc.
• In this approach, the user interacts with the
application, which in turn handles the part of
data storage and analysis.
Limitation
• This approach works fine with those
applications that process less voluminous data
that can be accommodated by standard
database servers, or up to the limit of the
processor that is processing the data.
• But when it comes to dealing with huge amounts of scalable data, pushing all of that data through a single database server becomes a bottleneck.
Google’s Solution
• Google solved this problem using an algorithm called MapReduce. This algorithm divides the task into small parts, assigns them to many computers, and collects the results from them; when integrated, these form the result dataset.
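A toy, single-machine illustration of this divide-and-collect idea in plain Java, with no Hadoop involved: the "dataset" is split into parts, each part is mapped to words in parallel, and the per-part results are collected into one integrated summary.

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class DivideAndCollect {
    public static void main(String[] args) {
        // Toy stand-in for a large dataset, already split into parts.
        List<String> parts = Arrays.asList("big data", "big clusters", "data clusters");

        // Process the parts in parallel, then collect the integrated
        // result: a word -> count summary across all parts.
        Map<String, Long> counts = parts.parallelStream()
                .flatMap(part -> Arrays.stream(part.split(" ")))
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));

        System.out.println(counts); // e.g. {big=2, data=2, clusters=2}
    }
}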
Hadoop
• Using the solution provided by Google, Doug
Cutting and his team developed an Open Source
Project called HADOOP.
• Hadoop runs applications using the MapReduce algorithm, where the data is processed in parallel on different nodes.
• Hadoop is used to develop applications that can perform complete statistical analysis on huge amounts of data.
How does Hadoop solve the problem of Big Data?
• The proposed solution for the problem of big data
should:
• Implement good recovery strategies
• Be horizontally scalable as data grows
• Be cost-effective
• Minimize the learning curve
• Be easy for programmers and data analysts, and
even for non-programmers, to work with
Hadoop Architecture
Hadoop has two major layers, namely:
• Processing/Computation layer (MapReduce), and
• Storage layer (Hadoop Distributed File System).
MapReduce
• MapReduce is a parallel programming model for writing distributed applications, devised at Google for efficient processing of large amounts of data (multi-terabyte datasets) on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
• The MapReduce program runs on Hadoop, which is an Apache open-source framework.
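As a sketch of how such a program is run on Hadoop: the driver below packages the hypothetical TokenizerMapper and IntSumReducer from the earlier word-count sketch into a Job and submits it to the cluster, with input and output HDFS paths taken from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);   // map phase
        job.setCombinerClass(IntSumReducer.class);   // optional local pre-aggregation
        job.setReducerClass(IntSumReducer.class);    // reduce phase
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}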
Hadoop Distributed File System
• The Hadoop Distributed File System (HDFS)
is based on the Google File System (GFS)
and provides a distributed file system that
is designed to run on commodity hardware.
• It has many similarities with existing
distributed file systems. However, the
differences from other distributed file
systems are significant.
• It is highly fault-tolerant and is designed to
be deployed on low-cost hardware.
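As an illustration of this distribution, the standard FileSystem API can report which DataNodes hold the blocks of a given file. A small sketch, with a hypothetical HDFS path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/data/input.txt")); // hypothetical

        // Each BlockLocation lists the DataNodes holding one replicated block.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + " -> hosts " + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}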

Besides these two core layers, the Hadoop framework also includes:
• Hadoop Common − These are Java libraries and utilities required by other Hadoop modules.
• Hadoop YARN − This is a framework for job scheduling and cluster resource management.
Hadoop with Big Data: Applications
• Is Hadoop a Database?
• Typically, Hadoop is not a database. Rather, it
is a software ecosystem that allows for parallel
computing of extremely large data sets.
