Hadoop Recap
2. Unstructured Data
• Definition: Data that does not have a fixed format or schema and cannot be easily
stored in traditional databases.
• Characteristics:
o Often large in volume and stored in files or object storage.
o Requires special tools for processing (e.g., Natural Language Processing,
computer vision).
o Harder to query and analyze directly.
• Examples:
o Media files: Images, videos, audio recordings.
o Text data: Social media posts, emails, PDF documents.
o IoT data in raw formats: Logs, binary sensor outputs.
3. Semi-Structured Data
• Definition: Data that does not follow a strict tabular format but contains
organizational markers (tags, keys) to separate elements. It is flexible yet partially
organized.
• Characteristics:
o Does not conform to relational models but can be parsed with tools.
o Common in data exchange formats such as JSON and XML.
o Easier to work with compared to unstructured data.
• Examples:
o NoSQL databases (e.g., MongoDB, Cassandra).
o CSV files with inconsistent row structures.
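As a quick illustration of "parsed with tools", here is a minimal Python sketch (the records are made up) showing how the keys in a JSON feed give the data structure even when individual records differ:

```python
import json

# Hypothetical semi-structured records: same feed, but fields vary per record.
raw_records = [
    '{"user": "alice", "action": "login", "device": "mobile"}',
    '{"user": "bob", "action": "purchase", "amount": 42.5}',
]

for line in raw_records:
    record = json.loads(line)           # tags/keys give the data its structure
    user = record["user"]               # key present in every record
    amount = record.get("amount", 0.0)  # key that may be missing: use a default instead of failing
    print(user, record["action"], amount)
```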
4. Geospatial Data
• Format: Represents geographical information (coordinates, polygons).
• Examples:
o GPS data, satellite imagery, map boundaries (GeoJSON, Shapefiles).
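For a concrete feel of the GeoJSON format mentioned above, here is a small Python sketch with a hand-written Point feature (the place name and coordinates are illustrative only):

```python
import json

# A minimal GeoJSON Feature; coordinates are given as [longitude, latitude].
geojson_feature = '''
{
  "type": "Feature",
  "geometry": {"type": "Point", "coordinates": [77.5946, 12.9716]},
  "properties": {"name": "Bengaluru"}
}
'''

feature = json.loads(geojson_feature)
lon, lat = feature["geometry"]["coordinates"]   # unpack the point's position
print(feature["properties"]["name"], "->", (lat, lon))
```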
Monolithic System
• Architecture:
o Consists of a single, integrated system that contains all resources (CPU, RAM,
Storage).
• Resources:
o CPU Cores: 4
o RAM: 8 GB
o Storage: 1 TB
• Scaling:
o Vertical Scaling: Increases performance by adding more resources (e.g.,
upgrading CPU, RAM, or storage) to the same machine.
o Limitations:
• Performance gains diminish after a certain point due to hardware
constraints.
• Eventually, a single machine's hardware ceiling prevents performance from
scaling in proportion to the resources added.
Distributed System
• Architecture:
o Composed of multiple interconnected nodes, each operating independently
but contributing to a common system.
• Node Resources (Example of three nodes):
o Node 1:
• CPU Cores: 4
• RAM: 8 GB
• Storage: 1 TB
o Node 2:
• CPU Cores: 4
• RAM: 8 GB
• Storage: 1 TB
o Node 3:
• CPU Cores: 4
• RAM: 8 GB
• Storage: 1 TB
• Scaling:
o Horizontal Scaling: Increases performance by adding more nodes to the system
(e.g., adding more machines).
o Advantages:
• Performance scales roughly in proportion to the number of nodes added,
approaching near-linear scaling.
• Each node can independently handle its own workload, improving
overall system performance and fault tolerance.
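To make the scaling contrast concrete, a tiny Python sketch that totals the example cluster's resources; horizontal scaling simply appends another node, while the monolithic machine's capacity is fixed:

```python
# Resources of the single monolithic machine from the example above.
monolith = {"cores": 4, "ram_gb": 8, "storage_tb": 1}

# Three identical nodes, as in the distributed example.
cluster = [{"cores": 4, "ram_gb": 8, "storage_tb": 1} for _ in range(3)]

def total(nodes, key):
    return sum(node[key] for node in nodes)

print("Monolithic:", monolith)
print("Cluster   :", {k: total(cluster, k) for k in monolith})  # 12 cores, 24 GB, 3 TB

# Horizontal scaling: add a fourth node and total capacity grows again.
cluster.append({"cores": 4, "ram_gb": 8, "storage_tb": 1})
print("After adding a node:", {k: total(cluster, k) for k in monolith})
```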
1. Monolithic Systems
• Definition: In monolithic architecture, all components of a system (data storage,
processing, and user interface) are tightly integrated into a single framework or
application.
• Characteristics:
o Centralized design with all operations performed on a single machine or tightly
coupled system.
o Easier to manage in small-scale applications.
o Requires more resources as data grows, leading to performance bottlenecks.
o Limited scalability, as the system can only handle the data and processing
power of one machine.
• Challenges in Big Data:
o Difficulty in handling large volumes of data.
o Single point of failure: If the system goes down, the entire operation stops.
o Harder to scale horizontally (i.e., adding more machines).
• Example: Traditional relational databases (like MySQL on a single server) where both
storage and processing occur on the same machine.
2. Distributed Systems
• Definition: In distributed architecture, data storage and processing are split across
multiple machines or nodes, working together as a unified system.
• Characteristics:
o Data and tasks are distributed across multiple machines (nodes) that
collaborate to process data efficiently.
o Highly scalable by adding more nodes to handle increasing data volumes and
processing demands.
o Fault-tolerant: Even if one node fails, the system continues to operate using the
remaining nodes.
o Offers better performance and resilience for handling massive datasets.
o Enables parallel processing, which significantly speeds up data analysis in Big
Data environments.
• Advantages for Big Data:
o Scales horizontally, making it ideal for processing large datasets.
o Handles a variety of data types and sources efficiently.
o Can process data in real-time, distributing workloads across multiple nodes.
• Example: Apache Hadoop, Apache Spark, and other distributed systems designed for
Big Data processing, where tasks are spread across multiple servers.
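As a hedged illustration of how Spark spreads such work across a cluster, here is a minimal PySpark word count; it assumes a local Spark installation, with the local[*] master standing in for a real multi-node cluster:

```python
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session; "local[*]" uses all local cores and
# stands in for a real YARN or standalone cluster.
spark = SparkSession.builder.master("local[*]").appName("wordcount-sketch").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["big data needs distributed systems",
                        "distributed systems scale horizontally"])

counts = (lines.flatMap(lambda line: line.split())   # split lines into words
               .map(lambda word: (word, 1))          # emit (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))     # aggregate counts per word

print(counts.collect())
spark.stop()
```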
Key Differences:
• Fault Tolerance: Low in monolithic systems (single point of failure); high in
distributed systems (nodes can fail without affecting the system).
• Suitability for Big Data: Monolithic systems are not well-suited for large
datasets; distributed systems are ideal for handling Big Data.
Conclusion:
• Monolithic systems may work for smaller-scale applications but struggle with the
volume, variety, and velocity of Big Data.
• Distributed systems are essential for efficiently processing and analyzing Big Data,
offering scalability, fault tolerance, and high performance.
Master Spark Concepts Zero to Big Data Hero:
Designing a Big Data System
Here are the notes on the three essential factors to consider when designing a good Big
Data System:
1. Storage
• Requirement: Big Data systems need to store massive volumes of data that
traditional systems cannot handle effectively.
• Solution: Implement Distributed Storage.
o Definition: A storage system that spreads data across multiple locations or
nodes, allowing for better management of large datasets.
o Benefits:
• Scalability: Easily accommodates increasing data sizes by adding more
storage nodes.
• Reliability: Redundant storage across nodes enhances data durability and
availability.
• Performance: Enables faster data retrieval and processing through
parallel access.
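A toy Python sketch (not real HDFS code) of the idea behind these benefits: split a file into blocks and place each block, with redundant copies, on different nodes:

```python
# Toy illustration only; real systems such as HDFS handle placement automatically.
BLOCK_SIZE_MB = 128
REPLICATION = 3
nodes = ["node1", "node2", "node3", "node4"]

def place_blocks(file_size_mb):
    num_blocks = -(-file_size_mb // BLOCK_SIZE_MB)   # ceiling division
    placement = {}
    for b in range(num_blocks):
        # Spread each block's replicas over distinct nodes, round-robin.
        placement[f"block-{b}"] = [nodes[(b + r) % len(nodes)] for r in range(REPLICATION)]
    return placement

for block, replicas in place_blocks(500).items():    # a 500 MB file -> 4 blocks
    print(block, "->", replicas)
```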
2. Processing / Computation
• Challenge: Traditional processing systems are designed for data residing on a single
machine, which limits their capability to handle distributed data.
• Solution: Utilize Distributed Processing.
o Definition: A computation model where data processing tasks are distributed
across multiple nodes in a cluster.
o Benefits:
• Efficiency: Processes large volumes of data in parallel, significantly
reducing processing time.
• Flexibility: Adapts to various types of data and processing tasks without
requiring significant changes to the underlying architecture.
• Fault Tolerance: If one node fails, other nodes can continue processing,
ensuring system reliability.
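A small single-machine stand-in for distributed processing, using Python's multiprocessing module to work on partitions of data in parallel; in a real cluster the partitions would live on different nodes:

```python
from multiprocessing import Pool

def process_partition(partition):
    # Placeholder for real work on one chunk of data (filtering, aggregation, ...).
    return sum(partition)

if __name__ == "__main__":
    data = list(range(1_000_000))
    # Split the data into 4 partitions, one per worker.
    partitions = [data[i::4] for i in range(4)]
    with Pool(processes=4) as pool:
        partial_sums = pool.map(process_partition, partitions)  # runs in parallel
    print(sum(partial_sums))  # combine the partial results
```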
3. Scalability
• Requirement: The system must be able to adapt to increasing data volumes and
processing demands.
• Solution: Design for Scalability.
o Definition: The capability of a system to grow and manage increased demands
efficiently.
o Benefits:
• Horizontal Scaling: Adding more nodes to the system allows for increased
capacity and performance.
• Cost-Effectiveness: Scaling out (adding more machines) is often more
economical than scaling up (upgrading a single machine).
• Future-Proofing: A scalable architecture can accommodate future growth
without requiring a complete redesign.
Summary
When designing a Big Data system, it's crucial to focus on:
• Storage: Implement distributed storage solutions to handle massive datasets.
• Processing: Use distributed processing methods to efficiently compute across
multiple nodes.
• Scalability: Ensure the system can grow to meet increasing data and processing
demands, leveraging both horizontal scaling and cost-effective strategies.
Master Spark Concepts Zero to Big Data Hero:
Hadoop Architecture Evolution
Hadoop Architecture 1.0
Core Components:
1. HDFS (Hadoop Distributed File System):
o A distributed storage system that stores large data files across multiple nodes
in the cluster.
o Data is split into blocks (64 MB by default in Hadoop 1.x, 128 MB in later
versions) and stored on DataNodes.
o Key Components:
▪ NameNode (Master): Stores metadata like file structure, block locations,
and replication details.
▪ DataNode (Slave): Stores the actual data blocks and sends periodic
updates to the NameNode.
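For a sense of how HDFS is used day to day, a short Python sketch that drives the standard hdfs dfs commands; it assumes a configured Hadoop client on the PATH, and the file and directory names are placeholders:

```python
import subprocess

# Assumes a working Hadoop client configuration; paths below are placeholders.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", "/data/raw"], check=True)      # create a directory
subprocess.run(["hdfs", "dfs", "-put", "local_logs.txt", "/data/raw/"], check=True)  # upload a file
subprocess.run(["hdfs", "dfs", "-ls", "/data/raw"], check=True)               # list its contents
```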
2. MapReduce:
o A distributed data processing framework that processes data in parallel.
o It involves two phases:
▪ Map Phase: Processes and filters data, generating key-value pairs.
▪ Reduce Phase: Aggregates the data produced by the Map phase.
o Key Components:
▪ JobTracker (Master): Manages job scheduling and resource allocation.
▪ TaskTracker (Slave): Executes individual tasks assigned by the JobTracker.
Hadoop Architecture 2.0
1. Introduction of YARN:
o YARN separates resource management from job scheduling and monitoring.
o Key components:
▪ ResourceManager (Master): Manages resources across the cluster.
▪ NodeManager (Slave): Runs on each node and reports resource usage.
▪ ApplicationMaster: Handles application-specific task scheduling.
2. Data Storage in HDFS:
o HDFS remains the storage layer; in addition to its existing block-level fault
tolerance, it now supports NameNode High Availability (HA) using standby NameNodes.
3. Resource Allocation:
o ResourceManager assigns resources dynamically to various applications (not
just MapReduce) via containers.
4. Application Submission:
o User submits a job to the ResourceManager.
o The ApplicationMaster is launched to coordinate tasks for that specific job.
5. Task Execution:
o NodeManagers on individual nodes launch containers to execute tasks.
o Tasks can belong to any framework, such as MapReduce, Spark, or Tez.
6. Dynamic Resource Utilization:
o Containers dynamically allocate CPU, memory, and disk resources based on the
workload, improving utilization.
7. Improved Scalability and Fault Tolerance:
o YARN allows scaling to thousands of nodes by delegating specific
responsibilities to the ApplicationMaster and NodeManagers.
o NameNode HA minimizes downtime.
8. Support for Multiple Workloads:
o Beyond MapReduce, Hadoop 2.0 supports frameworks like Spark, Flink, and
HBase for a variety of workloads.
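As a hedged example of running a non-MapReduce workload on YARN, this is roughly how a Spark application could be submitted to the ResourceManager, which then negotiates containers for its executors; the application name and resource numbers are made up:

```python
import subprocess

# Assumes Spark and the Hadoop client configuration are installed; values are illustrative only.
subprocess.run([
    "spark-submit",
    "--master", "yarn",           # hand the application to YARN's ResourceManager
    "--deploy-mode", "cluster",   # the ApplicationMaster runs inside the cluster
    "--num-executors", "4",       # containers requested for executors
    "--executor-memory", "2g",
    "--executor-cores", "2",
    "my_job.py",                  # hypothetical PySpark application
], check=True)
```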
Hadoop Ecosystem Tools:
• Sqoop:
o Facilitates data transfer between HDFS and relational databases. It automates
the import/export processes, making data movement efficient. Cloud
Alternative: Azure Data Factory (ADF).
• Pig:
o A high-level scripting language used for data cleaning and transformation,
which simplifies complex data manipulation tasks. Underlying Technology: Uses
MapReduce.
• Hive:
o Provides a SQL-like interface for querying data stored in HDFS, translating
queries into MapReduce jobs for execution.
• Oozie:
o A workflow scheduler for managing complex data processing workflows,
allowing for dependencies and scheduling of multiple tasks. Cloud Alternative:
Azure Data Factory.
• HBase:
o A NoSQL database for quick, random access to large datasets, facilitating real-
time data processing. Cloud Alternative: CosmosDB.
Conclusion
Hadoop revolutionized the way we process and store Big Data. While its core components
like HDFS and YARN remain vital, the complexity of MapReduce and the necessity to learn
multiple ecosystem tools present significant challenges. Despite its evolution, alternatives
are emerging to simplify Big Data processing, but Hadoop’s foundational role in the Big Data
era is undeniable.
Master Spark Concepts Zero to Big Data Hero:
How MapReduce Works in Hadoop (Simplified)
MapReduce is a programming model used for processing large datasets across distributed
systems like Hadoop. It divides tasks into smaller jobs, processes them in parallel, and then
combines the results.
Here’s a simple breakdown of how MapReduce works:
1. The Basic Process:
MapReduce involves two main steps:
• Map: Process input data and generate intermediate results.
• Reduce: Aggregate the results to produce the final output.
2. Map Step:
o The Map function processes each input record and emits key-value pairs (e.g.,
("apple", 1) for every occurrence of the word "apple").
3. Shuffle & Sort Step:
o The framework groups the intermediate pairs by key, so each key ends up with
the list of all its values.
4. Reduce Step:
o The Reduce function takes each key and its list of values (e.g., for the word
"apple", the values could be [1, 1, 1]).
o The Reduce function processes these values (e.g., sums them up) and produces
a final result (e.g., the total count of the word "apple").
5. Output Data:
o The final results are saved in HDFS for further use or analysis.
Reduce Function:
• Input: Key ("apple") and its list of values ([1, 1, 1]).
• Output: Total count for the word ("apple", 3).
6. Summary of Steps (see the Python sketch below):
1. Map: Process each record and generate key-value pairs.
2. Shuffle & Sort: Group the pairs by key.
3. Reduce: Aggregate the values for each key.
4. Output: Save the final results.
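These four steps can be imitated in plain Python; the sketch below only simulates on one machine what Hadoop performs across many nodes, using the same "apple" example as above:

```python
from collections import defaultdict

lines = ["apple banana apple", "banana apple"]

# 1. Map: emit (word, 1) for every word in every record.
mapped = [(word, 1) for line in lines for word in line.split()]

# 2. Shuffle & Sort: group the pairs by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)        # e.g. "apple" -> [1, 1, 1]

# 3. Reduce: aggregate the values for each key.
reduced = {word: sum(counts) for word, counts in grouped.items()}

# 4. Output: in Hadoop this would be written back to HDFS.
print(reduced)                         # {'apple': 3, 'banana': 2}
```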
Master Spark Concepts Zero to Big Data Hero:
Disadvantages of MapReduce and Why Hadoop Became Obsolete
Hadoop’s MapReduce framework revolutionized big data processing in its early years but
eventually became less favorable due to the following disadvantages:
1. Limitations of MapReduce
1. Complex Programming Model:
o Writing MapReduce jobs requires significant boilerplate code for even simple
operations.
o Developers need to write multiple jobs for iterative tasks.
2. Batch Processing Only:
o MapReduce is designed for batch processing, making it unsuitable for real-time
or streaming data processing.
3. High Latency:
o The system writes intermediate data to disk between the Map and Reduce
phases, resulting in high input/output overhead and slower performance.
4. Iterative Computations Are Inefficient:
o Iterative tasks like machine learning or graph processing require multiple
MapReduce jobs, with each job reading and writing to disk, causing
inefficiency.
5. Lack of In-Memory Processing:
o MapReduce does not leverage in-memory computation, which is much faster
than disk-based processing (see the Spark caching sketch after this list).
6. Resource Utilization:
o MapReduce uses a static allocation of resources, leading to underutilization of
cluster resources.
7. Not Fault-Tolerant for Iterative Tasks:
o While MapReduce can recover from node failures, the re-execution of failed
tasks for iterative workloads is time-consuming.
8. Dependency on Hadoop 1.0’s Architecture:
o The reliance on the JobTracker/TaskTracker model caused scalability issues
and made resource management inefficient.
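To illustrate the in-memory point in item 5, a short PySpark sketch: caching a dataset keeps it in memory across iterations instead of re-reading it from disk each time, which a chain of MapReduce jobs cannot do. The dataset and iteration count are made up, and a local Spark installation is assumed:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("cache-sketch").getOrCreate()
sc = spark.sparkContext

data = sc.parallelize(range(1_000_000)).cache()   # keep the dataset in memory

# Iterative work reuses the cached data instead of re-reading it from disk,
# unlike a chain of MapReduce jobs that writes/reads HDFS between steps.
for i in range(5):
    total = data.map(lambda x: x * i).sum()
    print(i, total)

spark.stop()
```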
Conclusion
Hadoop MapReduce played a pivotal role in big data processing during its time but became
obsolete due to its inefficiency and inability to adapt to modern requirements. Apache
Spark, with its fast, versatile, and easy-to-use framework, has emerged as the go-to solution
for distributed data processing.