Chapter 2 provides an overview of Big Data and its limitations, introducing Hadoop as an open-source framework for processing large-scale data. It details the architecture of Hadoop, particularly HDFS, including its components, read/write processes, and features such as fault tolerance and scalability. Additionally, it outlines various components of the Hadoop ecosystem, including data ingestion, processing, and management tools.


📌 Chapter 2: Hadoop & HDFS - Detailed Summary
🔹 1. Introduction to Big Data
● Big Data emerged due to the limitations of traditional systems in handling large-scale data.
● Traditional systems work well for OLTP (Online Transaction Processing) and Business Intelligence (BI) but are not scalable due to:
○ High costs
○ Complex management
○ Memory limitations
○ Heavy computations
● Big Data use cases: predictive analytics, fraud detection, machine learning, trend analysis, and processing of semi-structured & unstructured data

5Vs of Big Data

1. Volume – Large-scale data
2. Velocity – Speed of data generation
3. Variety – Structured, semi-structured, unstructured
4. Veracity – Data accuracy and reliability
5. Value – Extracting useful insights

Big Data Architectural Strategies

●​ Distributed Computing
●​ Massively Parallel Processing (MPP)
●​ NoSQL Databases

🔹 2. Introduction to Hadoop
● Hadoop is an open-source framework used for processing large-scale data.
● Components of Hadoop:
○ HDFS (Hadoop Distributed File System) → Storage
○ MapReduce → Parallel Processing (see the word-count sketch after this list)
● Advantages of Hadoop:
○ Low-cost (runs on commodity hardware)
○ Fault-tolerant (data replication)
○ Scalable (easy to add more machines)
○ Flexible (works with structured & unstructured data)
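To make the MapReduce model concrete, here is a minimal word-count sketch in Java against the standard org.apache.hadoop.mapreduce API. It is illustrative rather than the chapter's own example: the class name is an assumption, the input/output paths are passed on the command line, and a configured Hadoop client with the MapReduce client libraries is taken for granted.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Minimal word-count sketch: map tasks emit (word, 1) pairs in parallel,
// reduce tasks sum the counts for each word.
public class WordCount {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);   // emit (word, 1)
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));  // emit (word, total)
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on each mapper
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Input and output are HDFS paths supplied on the command line.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Run it with something like `hadoop jar wordcount.jar WordCount /user/input /user/output` (paths illustrative): map tasks run in parallel on the nodes holding the input blocks, and the reduce tasks aggregate the per-word counts.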
Hadoop Ecosystem

●​ Storage → HDFS
●​ Processing → MapReduce, Spark
●​ Data Ingestion → Flume, Sqoop, Storm
●​ Data Management → Hive, Pig, HBase
●​ Scheduling & Monitoring → YARN, Zookeeper, Oozie
●​ Machine Learning → Apache Mahout

🔹 3. Hadoop Architecture
HDFS (Hadoop Distributed File System)

● Key Features:
○ Distributed storage – Data is split into blocks (default size: 128 MB)
○ Fault-tolerant – Blocks are replicated across multiple nodes (default: 3 copies)
○ Scalability – More machines can be added without downtime

HDFS Components:

1. NameNode (Master Node)
○ Stores metadata (file locations, blocks) (see the metadata sketch after this list)
○ Controls client access
○ Single point of failure (solution: Standby NameNode)

2. DataNode (Worker Nodes)
○ Stores actual data blocks
○ Sends periodic heartbeats to the NameNode
○ Handles read/write requests

3. Secondary NameNode
○ NOT a backup NameNode
○ Merges edits and fsimage for checkpointing
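As a small illustration of the NameNode's metadata role, the sketch below uses the standard org.apache.hadoop.fs.FileSystem API to read a file's block size, replication factor, and block locations. The path is hypothetical, and a client configured via core-site.xml/hdfs-site.xml is assumed.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Query HDFS metadata (served by the NameNode) for an existing file.
public class BlockInfo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();       // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/user/output/file.txt");  // hypothetical path

    FileStatus status = fs.getFileStatus(file);
    System.out.println("Block size  : " + status.getBlockSize());    // e.g. 128 MB
    System.out.println("Replication : " + status.getReplication());  // e.g. 3

    // Block locations tell a client which DataNodes hold each block.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation b : blocks) {
      System.out.println("Block at offset " + b.getOffset()
          + " on hosts " + String.join(", ", b.getHosts()));
    }
    fs.close();
  }
}
```

Everything printed here comes from NameNode metadata; no block data is transferred to the client.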

🔹 4. HDFS Read/Write Process

HDFS Read Process:

1. Client requests to read a file
2. NameNode returns the locations of the file's blocks
3. Client connects to the DataNodes and reads the blocks (see the read sketch below)
4. If a DataNode fails, the client switches to another DataNode holding a replica
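A minimal client-side read under the same assumptions (hypothetical path, configured client): fs.open() obtains block locations from the NameNode, and the returned stream pulls bytes directly from DataNodes, failing over to another replica if one is unreachable.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Read a file from HDFS and copy its contents to stdout.
public class HdfsRead {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    try (FSDataInputStream in = fs.open(new Path("/user/output/file.txt"))) {  // hypothetical path
      // The stream reads block data straight from DataNodes.
      IOUtils.copyBytes(in, System.out, 4096, false);
    }
    fs.close();
  }
}
```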

HDFS Write Process:

1. Client requests to write a file
2. NameNode checks permissions and assigns DataNodes
3. Data is split into blocks and written through a DataNode pipeline (see the write sketch below)
4. Each block is replicated (default = 3 copies)
5. A success acknowledgment is sent back through the pipeline once replication completes
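The corresponding write sketch, again with illustrative paths: fs.create() asks the NameNode to allocate blocks and target DataNodes, and the data is streamed through the replication pipeline as it is written.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Write a small file to HDFS; blocks are replicated via the DataNode pipeline.
public class HdfsWrite {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path target = new Path("/user/output/hello.txt");        // hypothetical path
    try (FSDataOutputStream out = fs.create(target, true)) {  // true = overwrite if it exists
      out.write("Hello, HDFS!\n".getBytes(StandardCharsets.UTF_8));
    }
    // After close, the NameNode knows the file's blocks and replication factor.
    System.out.println("Replication: " + fs.getFileStatus(target).getReplication());
    fs.close();
  }
}
```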

🔹 5. HDFS Features & Enhancements

Rack Awareness in HDFS

● DataNodes are grouped into racks (connected via network switches).
● HDFS tries to place replicas across different racks to ensure:
○ Fault tolerance (if one rack fails, another rack still holds copies)
○ Reduced network traffic (cross-rack data transfer is minimized)

HDFS Federation (Hadoop 2.0+)

● Before: a single NameNode (bottleneck issue)
● Now: multiple NameNodes handle different namespaces, solving the scalability issue

🔹 6. Hadoop Ecosystem Components

Distributed Processing:

● MapReduce → Traditional batch processing
● Apache Spark → In-memory computing for batch and near-real-time workloads; often cited as up to 100x faster than MapReduce for iterative jobs (see the Spark sketch below)
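For contrast with the MapReduce job in section 2, here is a hedged sketch of the same word count using Spark's Java API (assumptions: the spark-core dependency and a Spark runtime that can reach HDFS; the paths are illustrative). The intermediate data stays in memory as RDDs, which is where most of the speed-up over disk-based MapReduce comes from.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

// Word count with Spark's in-memory RDD API.
public class SparkWordCount {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("spark word count");
    JavaSparkContext sc = new JavaSparkContext(conf);

    JavaRDD<String> lines = sc.textFile("hdfs:///user/output/file.txt");  // illustrative path
    JavaPairRDD<String, Integer> counts = lines
        .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())  // split lines into words
        .mapToPair(word -> new Tuple2<>(word, 1))                       // (word, 1) pairs
        .reduceByKey(Integer::sum);                                     // sum counts per word

    counts.saveAsTextFile("hdfs:///user/output/wordcounts");  // illustrative path
    sc.stop();
  }
}
```

Submit it with spark-submit; for repeated or interactive analysis, caching RDDs in memory avoids rereading HDFS on every pass.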

Querying & Data Management:

● Hive → SQL-like queries for Hadoop (HiveQL)
● Pig → Dataflow scripting language
● HBase → NoSQL database (column-oriented)

Data Ingestion:

● Sqoop → Import/export data between Hadoop and relational databases
● Flume → Collect and transfer logs and streaming data
● Storm → Real-time, low-latency stream processing

Job Scheduling & Monitoring:

● YARN → Manages resource allocation for tasks
● Zookeeper → Synchronization and coordination service
● Oozie → Job scheduling and workflow management

🔹 7. HDFS Commands (CLI Usage)

Basic File Commands

● hdfs dfs -ls / → List directory contents
● hdfs dfs -mkdir /user/output → Create a directory
● hdfs dfs -put file.txt /user/output/ → Upload a file to HDFS
● hdfs dfs -cat /user/output/file.txt → View file content
● hdfs dfs -get /user/output/file.txt ~/ → Download a file from HDFS

File Management Commands

● hdfs dfs -rm /user/output/file.txt → Delete a file
● hdfs dfs -mv /user/output/file.txt /user/newfile.txt → Move/rename a file
● hdfs dfs -setrep 4 /user/output/file.txt → Change the replication factor
