Bigdata Lecture 2

The document discusses the evolution of Hadoop File Systems, focusing on the need for a distributed file system due to limitations of traditional systems. It outlines the influence of the Google File System (GFS) and the development of Hadoop 1.0, highlighting key features such as fault tolerance, scalability, and high throughput. The document also addresses the architecture of HDFS, including its advantages and limitations.

Big Data & Big Data Analytics
Dr. Iman Ahmed ElSayed
Spring 24-25 / Fourth Level

Lecture 2 - Evolution of Hadoop File Systems
Lecture Contents:

1. The Beginning and the Need for a Distributed File System
2. Google Influence (GFS)
3. Nutch Distributed File System (NDFS)
4. Birth of Hadoop 1.0
5. Hadoop's Rise
6. Evolution of HDFS
7. Current HDFS System & Ecosystem Integration
8. Key Features of HDFS

The Beginning and the Need for a Distributed File System
Limitations of a traditional file system:

• Single point of failure
• Capacity (storage) limitations
• Performance bottlenecks
• Lack of parallelism

"Imagine searching
through a library. One
person searching a huge
library takes a long time.
But if many people each
search a small section,
it's much faster."

4
Big Data
The Beginning and the Need
for a Distributed File System
• Distributed Storage: data is spread across multiple machines (nodes) in a cluster. This eliminates the single point of failure and allows for increased storage capacity.

• Parallel Processing: multiple machines can work on different parts of the data simultaneously, speeding up analysis (a toy sketch follows below).

• Data Locality: processing happens on the same machines where the data resides, minimizing data transfer and improving performance.
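A minimal Python sketch of the "many people each search a small section" idea, using process-based parallelism; the corpus and search term are made up for illustration:

```python
from multiprocessing import Pool

# Toy "library": each partition plays the role of one machine's local shelf.
PARTITIONS = [
    ["hadoop stores data in blocks", "gfs inspired hdfs"],
    ["the namenode holds metadata", "datanodes hold the blocks"],
    ["replication gives fault tolerance"],
]

def search_partition(args):
    """One searcher scans one small section for the term."""
    partition, term = args
    return [line for line in partition if term in line]

if __name__ == "__main__":
    term = "blocks"
    with Pool() as pool:
        # All searchers work on their own sections at the same time.
        partial = pool.map(search_partition, [(p, term) for p in PARTITIONS])
    # Combine the partial answers into one result list.
    print([line for part in partial for line in part])
```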
Principles of a Distributed File System:

1. Fault Tolerance
2. Scalability
3. High Throughput

The Origins

Two projects shaped Hadoop's file system: the Google File System (published 2003) and Apache Nutch (2002).

Google File System:
• A scalable, distributed, and fault-tolerant file system.
• Tailored for data-intensive applications.
• Runs on inexpensive commodity hardware.
• Delivers high aggregate performance.

Apache Nutch:
• Doug Cutting and Mike Cafarella started working on a web search engine project.
• The project had a significant impact on handling big data.
• It exposed the necessity of a distributed file system to manage vast datasets.
• It paved the way for the development of HDFS.
Google File System (GFS)

Google's need for a scalable file system:

• Explosive data growth
• Commodity hardware
• Web crawling and indexing

The problem: existing file systems couldn't meet these demands, which led Google to develop GFS.
Key Concepts of GFS

Chunk Servers: files are divided into fixed-size chunks (typically 64 MB). These chunks are stored on multiple chunk servers, which are the worker nodes in the GFS cluster.
P.S.: "The data is broken into pieces, and those pieces are stored on many machines."

Master Node: the central coordinator of the GFS cluster. It stores metadata about the file system, including the location of chunks, file namespaces, and access control information. It does not store the actual data.
P.S.: "The master node keeps track of where all the pieces are."
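A minimal sketch of chunking and master metadata in Python; the chunk size matches the 64 MB figure above, while the server names and round-robin placement are simplifying assumptions, not GFS's actual policy:

```python
import itertools

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the typical GFS chunk size
SERVERS = ["chunkserver-a", "chunkserver-b", "chunkserver-c"]  # hypothetical names

def split_into_chunks(file_size):
    """Byte ranges a file of `file_size` bytes is cut into."""
    return [(start, min(start + CHUNK_SIZE, file_size))
            for start in range(0, file_size, CHUNK_SIZE)]

# The "master node" side: metadata only, never the data itself.
chunk_locations = {}                   # (file name, chunk index) -> placement
placement = itertools.cycle(SERVERS)   # naive round-robin placement

def register_file(name, file_size):
    for index, byte_range in enumerate(split_into_chunks(file_size)):
        chunk_locations[(name, index)] = {"bytes": byte_range,
                                          "server": next(placement)}

register_file("crawl.dat", 200 * 1024 * 1024)  # a 200 MB file -> 4 chunks
for chunk, meta in chunk_locations.items():
    print(chunk, meta)
```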
Large File Sizes: GFS was designed to handle very large files, which are common in Big Data applications. The large chunk size helps to reduce metadata overhead and improve performance.

Data Replication: GFS achieves fault tolerance through data replication. Each chunk is replicated multiple times (typically three) and stored on different chunk servers.
P.S.: "Each piece of data is copied multiple times, so if one machine fails, the data is still safe."
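A toy sketch of three-way replication and re-replication after a failure, with hypothetical server names; real GFS placement is rack-aware and more sophisticated:

```python
import random

REPLICATION = 3
servers = {"s1": set(), "s2": set(), "s3": set(), "s4": set()}  # server -> chunk ids

def place_chunk(chunk_id):
    """Store a chunk on REPLICATION distinct servers."""
    for name in random.sample(sorted(servers), REPLICATION):
        servers[name].add(chunk_id)

def fail_server(name):
    """Simulate a crash: re-replicate every chunk that lost a copy."""
    lost = servers.pop(name)
    for chunk_id in lost:
        holders = [s for s, chunks in servers.items() if chunk_id in chunks]
        spares = [s for s in servers if s not in holders]
        if len(holders) < REPLICATION and spares:
            servers[random.choice(spares)].add(chunk_id)

for cid in range(4):
    place_chunk(cid)
fail_server("s2")
print(servers)  # every chunk still has 3 copies spread over the survivors
```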

This replication model influenced the architecture of Hadoop 1.0, described next.
Hadoop v1.0 (HDFS)

HDFS Inspiration from GFS:

Open-Source Implementation: HDFS is an open-source implementation of the concepts pioneered by Google's GFS.

Core Principles Adopted: HDFS adopted the core principles of GFS, including:
• Distributed storage
• Data replication for fault tolerance
• Handling large files
• Using commodity hardware

Adaptation: HDFS was designed to be more general-purpose than GFS, catering to a broader range of Big Data applications.
Hadoop v1.0 (HDFS) Success
[Figure slide]
Hadoop v1.0 (HDFS) Architecture

NameNode (Single Point of Failure): the NameNode is the central master server that manages the file system namespace and metadata. In HDFS v1 there is only one NameNode, making it a single point of failure: if it goes down, the entire file system becomes inaccessible.
P.S.: "The NameNode is like the librarian; it knows where all the books are, but there is only one librarian."

DataNodes: the worker nodes that store the actual data blocks. DataNodes report to the NameNode and perform read/write operations on the data blocks.
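A toy sketch of how a NameNode could track DataNode block reports and heartbeats; the class shape, timeout, and node names are illustrative assumptions, not Hadoop's actual implementation:

```python
import time

class NameNode:
    """Toy metadata server: tracks which DataNode reported which blocks."""
    def __init__(self, heartbeat_timeout=10.0):
        self.block_map = {}       # block id -> set of DataNode names
        self.last_heartbeat = {}  # DataNode name -> last report time
        self.timeout = heartbeat_timeout

    def receive_block_report(self, datanode, block_ids):
        self.last_heartbeat[datanode] = time.time()
        for block_id in block_ids:
            self.block_map.setdefault(block_id, set()).add(datanode)

    def dead_nodes(self):
        """DataNodes that have missed their heartbeat window."""
        now = time.time()
        return [dn for dn, t in self.last_heartbeat.items()
                if now - t > self.timeout]

nn = NameNode()
nn.receive_block_report("datanode-1", ["blk_1", "blk_2"])
nn.receive_block_report("datanode-2", ["blk_2", "blk_3"])
print(nn.block_map)     # blk_2 is held by both DataNodes
print(nn.dead_nodes())  # [] -- both reported recently
```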
Blocks: files are divided into fixed-size blocks (64 MB by default in Hadoop 1; 128 MB in later versions). These blocks are distributed across multiple DataNodes.

Replication Factor: HDFS achieves fault tolerance through data replication. The replication factor determines the number of copies of each block (default is 3).
P.S.: "Each block is copied 3 times, and those 3 copies are on different DataNodes."
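A small worked example of the block and replication arithmetic, assuming a 128 MB block size and the default replication factor of 3:

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, assumed block size
REPLICATION = 3                  # default replication factor

def storage_footprint(file_size_bytes):
    """Blocks needed for a file, and the raw cluster storage it consumes."""
    blocks = math.ceil(file_size_bytes / BLOCK_SIZE)
    raw_bytes = blocks * BLOCK_SIZE * REPLICATION  # upper bound: last block may be partial
    return blocks, raw_bytes

blocks, raw = storage_footprint(1 * 1024**3)  # a 1 GB file
print(blocks)          # 8 blocks
print(raw / 1024**3)   # up to 3 GB of raw storage for 1 GB of data
```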
Hadoop v1.0 (HDFS): Advantages and Limitations

Advantages:
• Scalability: the ability to scale horizontally by adding more DataNodes.
• Fault Tolerance: data replication keeps data available even when individual nodes fail.
• High Throughput: the ability to handle large volumes of data and provide high throughput for read/write operations.

Limitations:
• Single NameNode (Scalability Bottleneck): the single NameNode is a major limitation, as it can become a bottleneck for large clusters and is a single point of failure.
• Limited Namespace: the NameNode's memory limits the number of files and blocks that can be managed (see the sketch below).
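A back-of-the-envelope sketch of the namespace limit; the ~150 bytes per file/block metadata object is a commonly cited rule of thumb, and the heap size is an assumption:

```python
BYTES_PER_OBJECT = 150        # rough heap cost per file/block object (rule of thumb)
HEAP_BYTES = 64 * 1024**3     # assume a 64 GB NameNode heap

# Every file costs at least one file object plus one object per block,
# so the heap caps the total number of metadata objects.
objects_supported = HEAP_BYTES // BYTES_PER_OBJECT
print(f"~{objects_supported / 1e6:.0f} million metadata objects in a 64 GB heap")
# Many small files exhaust this budget far faster than a few huge files,
# which is one reason HDFS favors large files.
```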
