Bigdata Lecture 2
Analytics
Dr. Iman Ahmed ElSayed
5. Hadoop’s Rise
6. Evolution of HDFS
Performance Bottlenecks
Lack of Parallelism
Big Data
The Beginning and the Need
for a Distributed File System
"Imagine searching
through a library. One
person searching a huge
library takes a long time.
But if many people each
search a small section,
it's much faster."
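The library analogy can be sketched in a few lines of Python. This is only an illustration of dividing the work, not a benchmark: the function names, the number of workers, and the sample "library" are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def search_section(section, target):
    """One 'person' scans one small section of the library."""
    return [title for title in section if target in title]


def parallel_search(library, target, workers=4):
    # Split the library into roughly equal sections, one per worker.
    size = max(1, -(-len(library) // workers))  # ceiling division
    sections = [library[i:i + size] for i in range(0, len(library), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns the partial results in section order.
        results = pool.map(lambda s: search_section(s, target), sections)
        # Merge the partial results into one list.
        return [title for part in results for title in part]


# Hypothetical library of 1000 titles.
library = [f"book-{i}" for i in range(1000)]
matches = parallel_search(library, "book-99")
```

Each worker only looks at its own section, and the partial answers are merged at the end. A distributed file system applies the same idea across machines instead of threads.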
Distributed Storage
Parallel Processing
Data Locality
Principles of a Distributed File System
01 Fault Tolerance
02 Scalability
03 High Throughput
Big Data The Origins
Big Data The Origins - GFS
• It’s a scalable, distributed, and fault-tolerant file system.
• Delivers high aggregate performance.
• Doug Cutting and Mike Cafarella started working on a web search engine project.
• Paved the way for the development of HDFS.
Big Data Google File System (GFS)
Commodity Hardware
Web Crawling and Indexing
Big Data Key Concepts of GFS
Chunk Servers:
Files are divided into fixed-size chunks (typically 64 MB).
These chunks are stored on multiple chunk servers, which are
the worker nodes in the GFS cluster.
P.S.: "The data is broken into pieces, and those pieces are
stored on many machines."
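A minimal sketch of the chunking idea, assuming a 64 MB chunk size and a simple round-robin placement. The placement rule and the server names are hypothetical and chosen only for illustration; they are not how GFS actually decides where chunks go.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the typical chunk size mentioned above


def split_into_chunks(file_size, chunk_size=CHUNK_SIZE):
    """Return (offset, length) pairs covering a file of file_size bytes."""
    chunks = []
    offset = 0
    while offset < file_size:
        # The last chunk may be shorter than chunk_size.
        length = min(chunk_size, file_size - offset)
        chunks.append((offset, length))
        offset += length
    return chunks


def assign_round_robin(chunks, servers):
    """Hypothetical placement: chunk i goes to server i mod len(servers)."""
    return {i: servers[i % len(servers)] for i in range(len(chunks))}


# A 200 MB file spread over three assumed chunk servers.
chunks = split_into_chunks(200 * 1024 * 1024)
placement = assign_round_robin(chunks, ["cs-a", "cs-b", "cs-c"])
```

Here the 200 MB file becomes three full 64 MB chunks plus one 8 MB chunk, and the chunks end up on different machines.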
Data Replication:
GFS achieves fault tolerance through data replication.
Each chunk is replicated multiple times (typically three) and
stored on different chunk servers.
P.S.: "Each piece of data is copied multiple times, so if one
machine fails, the data is still safe."
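The replication idea can be sketched as follows, assuming a replication factor of three and a simple "next servers in the ring" placement. The placement rule, function names, and server names are assumptions for illustration, not the real GFS policy.

```python
REPLICATION_FACTOR = 3  # typical value mentioned above


def place_replicas(chunk_id, servers, replicas=REPLICATION_FACTOR):
    """Pick `replicas` distinct servers for one chunk (hypothetical rule)."""
    if replicas > len(servers):
        raise ValueError("not enough servers for the requested replication")
    start = chunk_id % len(servers)
    # Consecutive servers in the ring are always distinct.
    return [servers[(start + k) % len(servers)] for k in range(replicas)]


def surviving_replicas(placement, failed):
    """Return, per chunk, the replicas that remain after some servers fail."""
    return {
        chunk_id: [s for s in holders if s not in failed]
        for chunk_id, holders in placement.items()
    }


servers = ["cs-a", "cs-b", "cs-c", "cs-d", "cs-e"]
placement = {chunk_id: place_replicas(chunk_id, servers) for chunk_id in range(4)}
after_failure = surviving_replicas(placement, failed={"cs-b"})
```

With one server down, every chunk still has at least two copies available, which is the point of the note above.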
Big Data Hadoop v1.0 (HDFS) success
Big Data Hadoop v1.0 (HDFS) architecture
P.S.: "The NameNode is like the librarian: it knows where all the
books are, but there is only one librarian."
DataNodes: the worker nodes that store the actual data blocks.
DataNodes report to the NameNode and perform read/write
operations on the data blocks.
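A toy in-memory model of the two roles, assuming only what is stated above: DataNodes hold the blocks and report to the NameNode, and the NameNode knows where each block lives. The class and method names are hypothetical and are not the Hadoop API.

```python
class DataNode:
    """Worker node: stores the actual data blocks."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block_id -> bytes

    def write_block(self, block_id, data):
        self.blocks[block_id] = data

    def read_block(self, block_id):
        return self.blocks[block_id]

    def block_report(self):
        # List of block ids this node currently holds.
        return sorted(self.blocks)


class NameNode:
    """The single 'librarian': keeps block locations, never the data itself."""

    def __init__(self):
        self.locations = {}  # block_id -> set of DataNode names

    def receive_report(self, datanode):
        for block_id in datanode.block_report():
            self.locations.setdefault(block_id, set()).add(datanode.name)

    def locate(self, block_id):
        return sorted(self.locations.get(block_id, ()))


namenode = NameNode()
dn1, dn2 = DataNode("dn1"), DataNode("dn2")
dn1.write_block("blk_1", b"hello")
dn2.write_block("blk_1", b"hello")
dn2.write_block("blk_2", b"world")
for node in (dn1, dn2):
    namenode.receive_report(node)
```

A client would ask the NameNode where a block is, then read it from one of the listed DataNodes. Because there is only one NameNode object here, losing it would lose every location, which mirrors the "only one librarian" remark.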
Big Data Hadoop v1.0 (HDFS)
Advantages Limitations