Unit 4: Hadoop
BIG DATA

Hadoop
High Availability Distributed Object Oriented Platform
Hadoop
Hadoop is an open-source framework that allows one to store and process big data in a distributed environment across clusters of computers using simple programming models.
The latest stable version of Apache Hadoop, at the time of writing, is 3.3.1.
• The four main components of Hadoop are −
• Hadoop Distributed File System (HDFS) − This is a storage system that breaks large
files into smaller pieces and distributes them across multiple computers in a
cluster. It ensures data reliability and enables parallel processing of data across the
cluster.
• MapReduce − This is a programming model used for processing and analyzing large datasets in parallel across the cluster. It consists of two main tasks: Map, which processes and transforms input data into intermediate key-value pairs, and Reduce, which aggregates and summarizes the intermediate data to produce the final output (a minimal word-count sketch follows this list).
• YARN (Yet Another Resource Negotiator) − YARN is a resource management and
job scheduling component of Hadoop. It allocates resources (CPU, memory) to
various applications running on the cluster and manages their execution efficiently.
• Hadoop Common − This includes libraries and utilities used by other Hadoop
components. It provides tools and infrastructure for the entire Hadoop ecosystem,
such as authentication, configuration, and logging.
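To make the Map and Reduce tasks above concrete, here is a minimal word-count sketch using the standard Hadoop MapReduce Java API. The word-count problem and the class names (TokenizerMapper, IntSumReducer) are illustrative choices, not part of these slides.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map task: transform each input line into intermediate (word, 1) key-value pairs.
class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);   // emit an intermediate key-value pair
        }
    }
}

// Reduce task: aggregate all counts for the same word into a final total.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);     // emit the final (word, total) pair
    }
}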
Hadoop is not a data warehouse
Hadoop is not a data warehouse: the two serve different purposes and have different architectures. Hadoop is a framework for storing and processing large volumes of unstructured and semi-structured data across distributed clusters of computers. It is designed for handling big data and supports batch processing of large datasets using technologies like HDFS and MapReduce.
Advantage of Hadoop
The biggest advantage of Hadoop is its ability to handle
and process large volumes of data efficiently.
Hadoop is designed to distribute data and processing tasks
across multiple computers in a cluster, allowing it to
scale easily to handle massive datasets that traditional
databases or processing systems struggle to manage.
This enables organizations to store, process, and analyze
huge amounts of data, gaining valuable insights and
making informed decisions that would not be possible
with conventional technologies.
Which software is used in Hadoop?
• Hadoop Distributed File System (HDFS) stores large datasets across a cluster of computers, breaking them into smaller pieces for efficient storage and retrieval (see the sketch after this list).
• YARN manages computing resources across the
cluster, allocating resources to different applications
and ensuring efficient execution.
• MapReduce is the processing engine that divides
data processing tasks into smaller parts and
executes them in parallel across the cluster.
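A minimal sketch of how an application might talk to HDFS from Java, using the standard org.apache.hadoop.fs.FileSystem API; the file paths are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopyExample {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS and other settings from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into HDFS; HDFS transparently splits it into
        // blocks and replicates them across DataNodes in the cluster.
        fs.copyFromLocalFile(new Path("/tmp/input.txt"),        // hypothetical local path
                             new Path("/user/data/input.txt")); // hypothetical HDFS path
        fs.close();
    }
}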
Traditional Approach
• In this approach, an enterprise has a single computer to store and process big data. For storage purposes, programmers rely on the database vendor of their choice, such as Oracle, IBM, etc.
• In this approach, the user interacts with the
application, which in turn handles the part of
data storage and analysis.
Limitation
• This approach works fine with those
applications that process less voluminous data
that can be accommodated by standard
database servers, or up to the limit of the
processor that is processing the data.
• But when it comes to dealing with huge amounts of scalable data, pushing all of that data through a single database server becomes a bottleneck.
Google’s Solution
• Google solved this problem using an algorithm called MapReduce. This algorithm divides the task into small parts, assigns them to many computers, and collects the results from them; when integrated, these form the result dataset.
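A toy, single-machine illustration of this divide-and-collect idea in plain Java, with no Hadoop involved: the "dataset" is split into parts, each part is mapped to words in parallel, and the per-part results are collected into one integrated summary.

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class DivideAndCollect {
    public static void main(String[] args) {
        // Toy stand-in for a large dataset, already split into parts.
        List<String> parts = Arrays.asList("big data", "big clusters", "data clusters");

        // Process the parts in parallel, then collect the integrated
        // result: a word -> count summary across all parts.
        Map<String, Long> counts = parts.parallelStream()
                .flatMap(part -> Arrays.stream(part.split(" ")))
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));

        System.out.println(counts); // e.g. {big=2, data=2, clusters=2}
    }
}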
Hadoop
• Using the solution provided by Google, Doug
Cutting and his team developed an Open Source
Project called HADOOP.
• Hadoop runs applications using the MapReduce algorithm, where the data is processed in parallel on different nodes.
• Hadoop is used to develop applications that can perform complete statistical analysis on huge amounts of data.
How does Hadoop solve the problem of Big Data?
• The proposed solution for the problem of big data
should:
• Implement good recovery strategies
• Be horizontally scalable as data grows
• Be cost-effective
• Minimize the learning curve
• Be easy for programmers and data analysts, and
even for non-programmers, to work with
Hadoop Architecture
Hadoop has two major layers, namely:
• Processing/Computation layer (MapReduce), and
• Storage layer (Hadoop Distributed File System).
MapReduce
• MapReduce is a parallel programming model for writing distributed applications, devised at Google for efficient processing of large amounts of data (multi-terabyte datasets) on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
• The MapReduce program runs on Hadoop, which is an Apache open-source framework.
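As a sketch of how such a program is run on Hadoop: the driver below packages the hypothetical TokenizerMapper and IntSumReducer from the earlier word-count sketch into a Job and submits it to the cluster, with input and output HDFS paths taken from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);   // map phase
        job.setCombinerClass(IntSumReducer.class);   // optional local pre-aggregation
        job.setReducerClass(IntSumReducer.class);    // reduce phase
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}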
Hadoop Distributed File System
• The Hadoop Distributed File System (HDFS)
is based on the Google File System (GFS)
and provides a distributed file system that
is designed to run on commodity hardware.
• It has many similarities with existing
distributed file systems. However, the
differences from other distributed file
systems are significant.
• It is highly fault-tolerant and is designed to
be deployed on low-cost hardware.
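As an illustration of this distribution, the standard FileSystem API can report which DataNodes hold the blocks of a given file. A small sketch, with a hypothetical HDFS path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/data/input.txt")); // hypothetical

        // Each BlockLocation lists the DataNodes holding one replicated block.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + " -> hosts " + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}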

Besides these two core layers, the Hadoop framework also includes:
• Hadoop Common − These are Java libraries and utilities required by other Hadoop modules.
• Hadoop YARN − This is a framework for job scheduling and cluster resource management.
Hadoop with Big Data: Applications
• Is Hadoop a Database?
• Typically, Hadoop is not a database. Rather, it
is a software ecosystem that allows for parallel
computing of extremely large data sets.
