Bigdata Lecture 2

The document discusses the evolution of Hadoop File Systems, focusing on the need for a distributed file system due to limitations of traditional systems. It outlines the influence of the Google File System (GFS) and the development of Hadoop 1.0, highlighting key features such as fault tolerance, scalability, and high throughput. The document also addresses the architecture of HDFS, including its advantages and limitations.

Big Data & Big Data Analytics
Dr. Iman Ahmed ElSayed
Spring 24-25 / Fourth Level

Lecture 2 - Evolution of Hadoop File Systems
Lecture Contents:

1. The Beginning and the Need for a Distributed File System
2. Google Influence (GFS)
3. Nutch Distributed File System (NDFS)
4. Birth of Hadoop 1.0
5. Hadoop's Rise
6. Evolution of HDFS
7. Current HDFS System & Ecosystem Integration
8. Key Features of HDFS

The Beginning and the Need for a Distributed File System
Limitations of a traditional file system:

• Single point of failure
• Capacity (storage) limitations
• Performance bottlenecks
• Lack of parallelism

"Imagine searching
through a library. One
person searching a huge
library takes a long time.
But if many people each
search a small section,
it's much faster."

4
Big Data
The Beginning and the Need
for a Distributed File System
• Distributed Storage: data is spread across multiple machines (nodes) in a cluster. This eliminates the single point of failure and allows for increased storage capacity.

• Parallel Processing: multiple machines can work on different parts of the data simultaneously, speeding up analysis (a toy sketch follows below).

• Data Locality: processing happens on the same machines where the data resides, minimizing data transfer and improving performance.
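A minimal Python sketch of the "many people each search a small section" idea, using process-based parallelism; the corpus and search term are made up for illustration:

```python
from multiprocessing import Pool

# Toy "library": each partition plays the role of one machine's local shelf.
PARTITIONS = [
    ["hadoop stores data in blocks", "gfs inspired hdfs"],
    ["the namenode holds metadata", "datanodes hold the blocks"],
    ["replication gives fault tolerance"],
]

def search_partition(args):
    """One searcher scans one small section for the term."""
    partition, term = args
    return [line for line in partition if term in line]

if __name__ == "__main__":
    term = "blocks"
    with Pool() as pool:
        # All searchers work on their own sections at the same time.
        partial = pool.map(search_partition, [(p, term) for p in PARTITIONS])
    # Combine the partial answers into one result list.
    print([line for part in partial for line in part])
```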
Principles of a Distributed File System:

1. Fault Tolerance
2. Scalability
3. High Throughput

The Origins

Two projects shaped Hadoop's file system: the Google File System (published 2003) and Apache Nutch (2002).

Google File System:
• A scalable, distributed, and fault-tolerant file system.
• Tailored for data-intensive applications.
• Runs on inexpensive commodity hardware.
• Delivers high aggregate performance.

Apache Nutch:
• Doug Cutting and Mike Cafarella started working on a web search engine project.
• The project had a significant impact on handling big data.
• It exposed the necessity of a distributed file system to manage vast datasets.
• It paved the way for the development of HDFS.
Google File System (GFS)

Google's need for a scalable file system:

• Explosive data growth
• Commodity hardware
• Web crawling and indexing

The problem: existing file systems couldn't meet these demands, which led Google to develop GFS.
Key Concepts of GFS

Chunk Servers: files are divided into fixed-size chunks (typically 64 MB). These chunks are stored on multiple chunk servers, which are the worker nodes in the GFS cluster.
P.S.: "The data is broken into pieces, and those pieces are stored on many machines."

Master Node: the central coordinator of the GFS cluster. It stores metadata about the file system, including the location of chunks, file namespaces, and access control information. It does not store the actual data.
P.S.: "The master node keeps track of where all the pieces are."
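A minimal sketch of chunking and master metadata in Python; the chunk size matches the 64 MB figure above, while the server names and round-robin placement are simplifying assumptions, not GFS's actual policy:

```python
import itertools

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the typical GFS chunk size
SERVERS = ["chunkserver-a", "chunkserver-b", "chunkserver-c"]  # hypothetical names

def split_into_chunks(file_size):
    """Byte ranges a file of `file_size` bytes is cut into."""
    return [(start, min(start + CHUNK_SIZE, file_size))
            for start in range(0, file_size, CHUNK_SIZE)]

# The "master node" side: metadata only, never the data itself.
chunk_locations = {}                   # (file name, chunk index) -> placement
placement = itertools.cycle(SERVERS)   # naive round-robin placement

def register_file(name, file_size):
    for index, byte_range in enumerate(split_into_chunks(file_size)):
        chunk_locations[(name, index)] = {"bytes": byte_range,
                                          "server": next(placement)}

register_file("crawl.dat", 200 * 1024 * 1024)  # a 200 MB file -> 4 chunks
for chunk, meta in chunk_locations.items():
    print(chunk, meta)
```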
Large File Sizes: GFS was designed to handle very large files, which are common in Big Data applications. The large chunk size helps to reduce metadata overhead and improve performance.

Data Replication: GFS achieves fault tolerance through data replication. Each chunk is replicated multiple times (typically three) and stored on different chunk servers.
P.S.: "Each piece of data is copied multiple times, so if one machine fails, the data is still safe."
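A toy sketch of three-way replication and re-replication after a failure, with hypothetical server names; real GFS placement is rack-aware and more sophisticated:

```python
import random

REPLICATION = 3
servers = {"s1": set(), "s2": set(), "s3": set(), "s4": set()}  # server -> chunk ids

def place_chunk(chunk_id):
    """Store a chunk on REPLICATION distinct servers."""
    for name in random.sample(sorted(servers), REPLICATION):
        servers[name].add(chunk_id)

def fail_server(name):
    """Simulate a crash: re-replicate every chunk that lost a copy."""
    lost = servers.pop(name)
    for chunk_id in lost:
        holders = [s for s, chunks in servers.items() if chunk_id in chunks]
        spares = [s for s in servers if s not in holders]
        if len(holders) < REPLICATION and spares:
            servers[random.choice(spares)].add(chunk_id)

for cid in range(4):
    place_chunk(cid)
fail_server("s2")
print(servers)  # every chunk still has 3 copies spread over the survivors
```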

This replication model influenced the architecture of Hadoop 1.0, described next.
Hadoop v1.0 (HDFS)

HDFS Inspiration from GFS:

Open-Source Implementation: HDFS is an open-source implementation of the concepts pioneered by Google's GFS.

Core Principles Adopted: HDFS adopted the core principles of GFS, including:
• Distributed storage
• Data replication for fault tolerance
• Handling large files
• Using commodity hardware

Adaptation: HDFS was designed to be more general-purpose than GFS, catering to a broader range of Big Data applications.
Hadoop v1.0 (HDFS) Success
[Figure slide]
Hadoop v1.0 (HDFS) Architecture

NameNode (Single Point of Failure): the NameNode is the central master server that manages the file system namespace and metadata. In HDFS v1 there is only one NameNode, making it a single point of failure: if it goes down, the entire file system becomes inaccessible.
P.S.: "The NameNode is like the librarian; it knows where all the books are, but there is only one librarian."

DataNodes: the worker nodes that store the actual data blocks. DataNodes report to the NameNode and perform read/write operations on the data blocks.
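A toy sketch of how a NameNode could track DataNode block reports and heartbeats; the class shape, timeout, and node names are illustrative assumptions, not Hadoop's actual implementation:

```python
import time

class NameNode:
    """Toy metadata server: tracks which DataNode reported which blocks."""
    def __init__(self, heartbeat_timeout=10.0):
        self.block_map = {}       # block id -> set of DataNode names
        self.last_heartbeat = {}  # DataNode name -> last report time
        self.timeout = heartbeat_timeout

    def receive_block_report(self, datanode, block_ids):
        self.last_heartbeat[datanode] = time.time()
        for block_id in block_ids:
            self.block_map.setdefault(block_id, set()).add(datanode)

    def dead_nodes(self):
        """DataNodes that have missed their heartbeat window."""
        now = time.time()
        return [dn for dn, t in self.last_heartbeat.items()
                if now - t > self.timeout]

nn = NameNode()
nn.receive_block_report("datanode-1", ["blk_1", "blk_2"])
nn.receive_block_report("datanode-2", ["blk_2", "blk_3"])
print(nn.block_map)     # blk_2 is held by both DataNodes
print(nn.dead_nodes())  # [] -- both reported recently
```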
Blocks: files are divided into fixed-size blocks (64 MB by default in Hadoop 1; 128 MB in later versions). These blocks are distributed across multiple DataNodes.

Replication Factor: HDFS achieves fault tolerance through data replication. The replication factor determines the number of copies of each block (default is 3).
P.S.: "Each block is copied 3 times, and those 3 copies are on different DataNodes."
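A small worked example of the block and replication arithmetic, assuming a 128 MB block size and the default replication factor of 3:

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, assumed block size
REPLICATION = 3                  # default replication factor

def storage_footprint(file_size_bytes):
    """Blocks needed for a file, and the raw cluster storage it consumes."""
    blocks = math.ceil(file_size_bytes / BLOCK_SIZE)
    raw_bytes = blocks * BLOCK_SIZE * REPLICATION  # upper bound: last block may be partial
    return blocks, raw_bytes

blocks, raw = storage_footprint(1 * 1024**3)  # a 1 GB file
print(blocks)          # 8 blocks
print(raw / 1024**3)   # up to 3 GB of raw storage for 1 GB of data
```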
Hadoop v1.0 (HDFS): Advantages and Limitations

Advantages:
• Scalability: the ability to scale horizontally by adding more DataNodes.
• Fault Tolerance: data replication keeps data available even when individual nodes fail.
• High Throughput: the ability to handle large volumes of data and provide high throughput for read/write operations.

Limitations:
• Single NameNode (Scalability Bottleneck): the single NameNode is a major limitation, as it can become a bottleneck for large clusters and is a single point of failure.
• Limited Namespace: the NameNode's memory limits the number of files and blocks that can be managed (see the sketch below).
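A back-of-the-envelope sketch of the namespace limit; the ~150 bytes per file/block metadata object is a commonly cited rule of thumb, and the heap size is an assumption:

```python
BYTES_PER_OBJECT = 150        # rough heap cost per file/block object (rule of thumb)
HEAP_BYTES = 64 * 1024**3     # assume a 64 GB NameNode heap

# Every file costs at least one file object plus one object per block,
# so the heap caps the total number of metadata objects.
objects_supported = HEAP_BYTES // BYTES_PER_OBJECT
print(f"~{objects_supported / 1e6:.0f} million metadata objects in a 64 GB heap")
# Many small files exhaust this budget far faster than a few huge files,
# which is one reason HDFS favors large files.
```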
