📌 Chapter 2: Hadoop & HDFS - Detailed Summary
🔹 1. Introduction to Big Data
● Big Data emerged due to the limitations of traditional systems in handling large-scale data.
● Traditional systems work well for OLTP (Online Transaction Processing) and Business Intelligence (BI) but are not scalable due to:
○ High costs
○ Complex management
○ Memory limitations
○ Heavy computations
● Big Data use cases:
○ Predictive analytics, fraud detection, machine learning, trend analysis, semi-structured & unstructured data processing
5Vs of Big Data
1. Volume – Large-scale data
2. Velocity – Speed of data generation
3. Variety – Structured, semi-structured, unstructured
4. Veracity – Data accuracy and reliability
5. Value – Extracting useful insights
Big Data Architectural Strategies
● Distributed Computing
● Massively Parallel Processing (MPP)
● NoSQL Databases
🔹 2. Introduction to Hadoop
● Hadoop is an open-source framework used for processing large-scale data.
● Components of Hadoop:
○ HDFS (Hadoop Distributed File System) → Storage
○ MapReduce → Parallel Processing
● Advantages of Hadoop:
○ Low-cost (runs on commodity hardware)
○ Fault-tolerant (data replication)
○ Scalable (easy to add more machines)
○ Flexible (works with structured & unstructured data)
Hadoop Ecosystem
● Storage → HDFS
● Processing → MapReduce, Spark
● Data Ingestion → Flume, Sqoop, Storm
● Data Management → Hive, Pig, HBase
● Scheduling & Monitoring → YARN, Zookeeper, Oozie
● Machine Learning → Apache Mahout
🔹 3. Hadoop Architecture
HDFS (Hadoop Distributed File System)
● Key Features:
○ Distributed storage – Data is split into blocks (default size: 128MB)
○ Fault-tolerant – Blocks are replicated across multiple nodes (default: 3 copies)
○ Scalability – Can add more machines without downtime
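You can check these defaults on a running cluster; a minimal sketch, assuming the hdfs client is on your PATH:

# Configured block size in bytes (134217728 bytes = 128MB)
hdfs getconf -confKey dfs.blocksize
# Configured default replication factor (3 unless overridden)
hdfs getconf -confKey dfs.replication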
HDFS Components:
1. NameNode (Master Node)
○ Stores metadata (the filesystem namespace and the file-to-block mapping)
○ Controls client access
○ Single point of failure (mitigated by a Standby NameNode in HA setups)
2. DataNode (Worker Nodes)
○ Stores actual data blocks
○ Sends periodic heartbeats to NameNode
○ Handles read/write requests
3. Secondary NameNode
○ NOT a backup NameNode
○ Periodically merges the edit log into the fsimage (checkpointing), which keeps NameNode restarts fast
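To see these roles on a live cluster, you can ask the NameNode for a status report (a minimal sketch; dfsadmin typically requires HDFS superuser privileges):

# Cluster summary from the NameNode: capacity, live/dead DataNodes,
# and each DataNode's last heartbeat time
hdfs dfsadmin -report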
🔹 4. HDFS Read/Write Process
HDFS Read Process:
1. Client requests to read a file
2. NameNode sends block locations
3. Client connects to DataNode & reads the file
4. If a DataNode fails, the client switches to another DataNode holding a replica
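To observe this from the client side, read a file and then list the DataNodes holding its blocks (paths match the CLI examples later in this chapter):

# Read: the client gets block locations from the NameNode,
# then streams the data directly from the DataNodes
hdfs dfs -cat /user/output/file.txt
# Show each block of the file and the DataNodes that store it
hdfs fsck /user/output/file.txt -files -blocks -locations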
HDFS Write Process:
1. Client requests to write a file
2. NameNode checks permissions & assigns DataNodes
3. Data is split into blocks and streamed through a pipeline of DataNodes
4. Each DataNode forwards the block to the next in the pipeline, creating the replicas (default = 3 copies)
5. Acknowledgments flow back through the pipeline to the client, which confirms completion with the NameNode
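A matching write, followed by a check that the pipeline produced the expected replicas (file.txt is an illustrative local file):

# Write: blocks are streamed through a pipeline of (by default) 3 DataNodes
hdfs dfs -put file.txt /user/output/
# Verify the replication actually achieved for each block
hdfs fsck /user/output/file.txt -blocks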
🔹 5. HDFS Features & Enhancements
Rack Awareness in HDFS
● DataNodes are grouped into racks (connected via network switches).
● HDFS stores replicas across different racks (default policy: first replica on the writer's node, the second and third on two different nodes of another rack) to ensure:
○ Fault tolerance (if one rack fails, another has copies)
○ Reduced network traffic (minimizes cross-rack data transfer)
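To inspect the rack mapping the NameNode is using (a minimal sketch; -printTopology typically requires HDFS superuser privileges):

# Print the DataNodes grouped by the rack each one is assigned to
hdfs dfsadmin -printTopology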
HDFS Federation (Hadoop 2.0+)
● Before: Single NameNode (Bottleneck issue)
● Now: Multiple NameNodes handle different namespaces, solving scalability issues
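On a federated cluster you can list the configured namespaces and their NameNodes; a minimal sketch, and the output depends on the cluster's hdfs-site.xml:

# Nameservice IDs configured via the dfs.nameservices property
hdfs getconf -confKey dfs.nameservices
# The NameNode host(s) serving each nameservice
hdfs getconf -namenodes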
🔹 6. Hadoop Ecosystem Components
Distributed Processing:
● MapReduce → Traditional batch processing
● Apache Spark → Up to 100x faster than MapReduce for in-memory workloads (near-real-time, in-memory computing)
Querying & Data Management:
● Hive → SQL-like queries for Hadoop (HiveQL)
● Pig → Dataflow scripting language
● HBase → NoSQL database (column-oriented)
Data Ingestion:
● Sqoop → Import/export data between Hadoop & relational databases (see the sketch after this list)
● Flume → Collect & transfer logs & streaming data
● Storm → Real-time processing (low-latency)
Job Scheduling & Monitoring:
● YARN → Manages resource allocation for tasks
● Zookeeper → Synchronization & coordination service
● Oozie → Job scheduling and workflow management
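As a concrete taste of the ingestion tools above, here is a sketch of a Sqoop import; the database host, credentials, and table name are hypothetical:

# Hypothetical source: MySQL database "sales" on dbhost, table "orders"
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username analyst -P \
  --table orders \
  --target-dir /user/output/orders
# -P prompts for the password; the table lands as files under the target dir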
🔹 7. HDFS Commands (CLI Usage)
Basic File Commands
● hdfs dfs -ls / → List directory contents
● hdfs dfs -mkdir /user/output → Create a directory
● hdfs dfs -put file.txt /user/output/ → Upload a file to HDFS
● hdfs dfs -cat /user/output/file.txt → View file content
● hdfs dfs -get /user/output/file.txt ~/ → Download a file from HDFS
File Management Commands
● hdfs dfs -rm /user/output/file.txt → Delete a file
● hdfs dfs -mv /user/output/file.txt /user/newfile.txt → Move/rename a file
● hdfs dfs -setrep 4 /user/output/file.txt → Change the replication factor (here, to 4)
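Putting these together, a typical end-to-end session might look like this (paths are illustrative):

hdfs dfs -mkdir -p /user/output # create the target directory
hdfs dfs -put file.txt /user/output/ # upload a local file
hdfs dfs -ls /user/output # confirm it arrived
hdfs dfs -cat /user/output/file.txt # print its contents
hdfs dfs -setrep 4 /user/output/file.txt # raise its replication factor to 4
hdfs dfs -get /user/output/file.txt ~/ # copy it back to the local home directory
hdfs dfs -rm /user/output/file.txt # delete it from HDFS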