Here is a detailed explanation of the key topics related to Big Data and Data Analytics,
structured for easy understanding or academic use.
1. Big Data Definition
Big Data refers to extremely large and complex datasets that cannot be managed, processed,
or analyzed using traditional data processing tools. These datasets are generated at high
velocity from various sources like social media, sensors, mobile devices, and IoT.
2. Characteristic Features of Big Data (5Vs)
1. Volume
o Refers to the sheer amount of data (terabytes to zettabytes).
o Example: Facebook generates over 4 petabytes of data daily.
2. Velocity
o The speed at which data is generated, collected, and processed.
o Example: Financial markets require real-time processing of stock data.
3. Variety
o Different types of data: structured (SQL), semi-structured (XML, JSON), and
unstructured (text, video).
o Example: Emails, sensor data, images.
4. Veracity
o The quality and trustworthiness of the data.
o Big Data may include incomplete, inconsistent, or inaccurate data.
5. Value
o The potential to extract meaningful insights that drive decision-making.
o Example: Predicting customer behavior from shopping patterns.
3. Big Data Applications
Healthcare: Disease prediction, personalized medicine, hospital resource
optimization.
Finance: Fraud detection, risk analysis, algorithmic trading.
Retail: Customer behavior analysis, dynamic pricing, inventory management.
Telecommunications: Network optimization, customer churn prediction.
Smart Cities: Traffic management, pollution monitoring.
Education: Student performance prediction, personalized learning.
Social Media: Sentiment analysis, trend prediction.
4. Big Data vs Traditional Data
Feature | Big Data | Traditional Data
Volume | Terabytes to zettabytes | Megabytes to gigabytes
Velocity | Real-time or near real-time | Batch processing
Variety | Structured, semi-structured, unstructured | Mainly structured
Storage | Distributed systems (HDFS, cloud) | Centralized (RDBMS)
Processing | Parallel and distributed | Sequential
Tools Used | Hadoop, Spark, NoSQL | SQL, Excel, OLAP
5. Risks of Big Data
Privacy Concerns: Sensitive personal data can be misused.
Security Threats: Larger data sets increase exposure to breaches.
Data Quality Issues: Inconsistent or inaccurate data leads to misleading insights.
Ethical Dilemmas: Misuse of data for surveillance, manipulation.
Compliance Risk: Non-compliance with data laws (e.g., GDPR).
6. Structure of Big Data
Big Data is classified based on structure:
Structured Data: Tabular data stored in relational databases (e.g., customer records).
Semi-Structured Data: JSON, XML, logs (e.g., API data).
Unstructured Data: Text, audio, video, social media posts.
7. Challenges of Conventional Systems
Scalability Limitations: Not designed for growing data volume.
Performance Bottlenecks: Slow when handling high-velocity or variety data.
Data Silos: Hard to integrate various data types from different sources.
Cost of Infrastructure: Expensive to scale traditional systems.
Limited Real-time Capability: Mostly batch processing, not suitable for streaming
data.
8. Web Data
Web Data refers to information collected from the internet:
Types:
Social media (Facebook, Twitter)
Blogs and forums
Clickstream data
E-commerce transactions
Characteristics:
High volume and velocity
Unstructured or semi-structured
Rich in user-generated content
Uses:
Sentiment analysis
User behavior analytics
Market research
9. Evolution of Analytic Scalability
a. Manual Analysis
Excel, Access, manual charting for small datasets.
b. Data Warehousing
Centralized systems using ETL processes (Extract, Transform, Load).
SQL and OLAP tools for analysis.
c. Big Data Tools
Hadoop, Spark, and NoSQL for massive datasets.
Can handle structured + unstructured data in parallel.
d. Cloud Analytics
AWS, Azure, GCP platforms offer scalable infrastructure.
Serverless, pay-as-you-go options.
e. Edge Analytics
Data analyzed at the point of generation (IoT sensors, mobile).
In more detail, analytic scalability refers to the ability of data systems to handle growing volume, velocity, and complexity of data while maintaining performance. Its evolution has paralleled advances in data storage, processing, and infrastructure:
1. Manual Analysis (before the 1980s):
o Excel, Access, and manual charting for small datasets.
2. Data Warehousing (1980s–1990s):
o Centralized systems using ETL processes (Extract, Transform, Load).
o Focus: Structured data from internal sources.
o Scalability Strategy: Vertical scaling (larger machines).
o Tools: Relational databases, OLAP cubes.
o Limitations: Expensive hardware, rigid schema, limited scalability.
3. Distributed Computing Era (2000s):
o Shift to Big Data: Hadoop, Spark, and NoSQL for massive datasets, processing structured and unstructured data in parallel.
o Key Technologies: Hadoop, MapReduce.
o Scalability Strategy: Horizontal scaling (more machines).
o Benefits: Can process massive datasets across commodity hardware.
o Drawbacks: High latency, complex programming models.
4. Real-Time and Cloud Analytics (2010s):
o Rise of streaming data and cloud services.
o Tools: Spark, Kafka, Redshift, BigQuery, Snowflake.
o Scalability Strategy: Elastic scaling in cloud environments.
o Features: In-memory processing, real-time analytics, pay-as-you-go pricing.
5. Modern AI-Driven Analytics (2020s–present):
o Integration with ML/AI workflows.
o Technologies: Lakehouses (e.g., Delta Lake), vector databases, GPU
acceleration.
o Scalability Strategy: Serverless, autoscaling, containerized environments
(e.g., Kubernetes).
o Focus: Unstructured data, multimodal analytics, scalable AI pipelines.
10. Evolution of Analytic Processes, Tools, and Methods
Type | Description | Tools
Descriptive | What happened? | Excel, SQL, Tableau
Diagnostic | Why did it happen? | BI tools, SQL, Python
Predictive | What is likely to happen? | Machine learning, Python, R
Prescriptive | What should be done? | AI, optimization tools
Cognitive | AI that mimics human thinking | Chatbots, recommendation systems
A brief timeline of the evolution of analytic processes, tools, and methods:
🧠 1. Early Stage: Descriptive Analytics (Pre-1980s)
🔧 Tools:
Spreadsheets (e.g., Lotus 1-2-3, later Excel)
Basic statistical software (e.g., SPSS, SAS
beginnings)
📊 Methods:
Descriptive statistics (mean, median,
variance)
Basic reporting and trend analysis
⚙️Processes:
Manual data collection and processing
Reports generated periodically
(weekly/monthly)
Focus on "What happened?"
📈 2. Growth Phase: Diagnostic Analytics (1980s–1990s)
🔧 Tools:
Relational databases (e.g., Oracle, DB2,
SQL Server)
Business Intelligence (BI) tools (e.g.,
Cognos, SAP BW)
📊 Methods:
OLAP (Online Analytical Processing)
Drill-down and slice-and-dice techniques
⚙️Processes:
Centralized data warehouses
Data cleansing and transformation begin
Start of ETL (Extract, Transform, Load)
processes
Focus on "Why did it happen?"
📊 3. Expansion Phase: Predictive Analytics (2000s–2010s)
🔧 Tools:
Advanced statistical software (SAS, R,
Python)
Machine learning libraries (Scikit-learn,
TensorFlow)
Big data platforms (Hadoop, Spark)
📊 Methods:
Regression, decision trees, clustering
Forecasting, classification, and time-series
models
⚙️Processes:
Data lakes emerge alongside data
warehouses
Real-time processing becomes possible
Focus on "What will happen?"
🤖 4. Current Phase: Prescriptive & Cognitive Analytics (2015–Now)
🔧 Tools:
AI/ML platforms (AWS SageMaker, Azure
ML, Google Vertex AI)
NLP, deep learning, LLMs (e.g., ChatGPT,
BERT)
No-code/low-code platforms, AutoML
📊 Methods:
Optimization algorithms
Reinforcement learning
Natural Language Processing (NLP)
Explainable AI (XAI)
⚙️Processes:
Cloud-native architectures
DataOps and MLOps practices
Automated decision-making
Focus on "What should we do?"
🧭 Future Trends
Generative AI for data storytelling and
simulation
Quantum analytics (in early research)
Ethical and responsible AI
Full automation with real-time adaptive
analytics
11. Analysis vs Reporting
Criteria | Analysis | Reporting
Purpose | Discover insights, trends, relationships | Communicate known data
Time Frame | Future or real-time predictions | Past and current events
Tools | R, Python, ML tools, Spark | Excel, BI tools, Crystal Reports
User | Data Scientists, Analysts | Business Managers, Executives
Complexity | Complex, interactive | Simple, formatted

Aspect | Reporting | Analysis
Definition | The process of organizing and presenting data in a structured format | The process of examining data to uncover patterns, relationships, or insights
Purpose | To inform about what has happened | To understand why it happened and what to do next
Questions Answered | What happened? | Why did it happen? What does it mean? What should we do?
Nature | Descriptive | Diagnostic, predictive, prescriptive
Tools Used | BI tools (Power BI, Tableau), Excel, dashboards | BI + analytics tools (R, Python, SQL, ML platforms)
Frequency | Regular, scheduled (daily, weekly, monthly) | Ad hoc, real-time, or project-specific
Output | Static reports, charts, dashboards | Insights, models, recommendations
Skill Level | Basic (data reading & visualization) | Advanced (statistics, data science, critical thinking)
Decision Support | Low: informs stakeholders | High: drives strategic decisions
12. Modern Data Analytic Tools
Tool/Platform | Use Case
Hadoop | Distributed storage and processing (batch jobs)
Apache Spark | Fast, in-memory processing
Tableau / Power BI | Visualizations and dashboards
Python / R | Statistical analysis, machine learning
Jupyter Notebooks | Interactive coding and visualizations
Google BigQuery / AWS Redshift / Azure Synapse | Cloud-based scalable analytics
Databricks | Unified data engineering and ML platform
1. Business Intelligence (BI) Tools
Used for data visualization, dashboards, and reporting.
Tool | Description
Power BI (Microsoft) | User-friendly dashboarding and reporting with strong integration with Microsoft products
Tableau (Salesforce) | Advanced, interactive visualizations; widely used in enterprise BI
Looker (Google Cloud) | Data modeling + BI; good for centralized data governance
Qlik Sense | Strong associative data engine and self-service analytics
2. Data Integration & ETL Tools
Used to extract, transform, and load data from multiple sources.
Tool | Description
Apache NiFi | Open-source tool for automating data flows
Talend | ETL and data quality management
Fivetran / Stitch | Cloud-native, no-code ELT pipelines
Apache Airflow | Workflow orchestration for managing ETL pipelines
🧠 3. Advanced Analytics & Machine Learning Tools
Used for statistical modeling, predictive analytics, and machine learning.
Tool | Description
Python | Most popular language for data science (with libraries like pandas, scikit-learn, TensorFlow)
R | Specialized in statistics and visual analytics
Azure ML / AWS SageMaker / Google Vertex AI | Cloud platforms for end-to-end ML workflows
RapidMiner / DataRobot | AutoML platforms for fast model development without heavy coding
☁️4. Cloud Data Warehouses
Used for storing and querying large volumes of structured data.
Tool | Description
Snowflake | Scalable, cloud-native data warehouse with separation of compute and storage
Google BigQuery | Serverless, high-speed analytics on large datasets
Amazon Redshift | Fully managed, petabyte-scale data warehouse
Databricks | Unified analytics platform built on Apache Spark; supports ML and big data
🛠️5. Data Governance & Cataloging Tools
Used for data discovery, lineage, and compliance.
Tool | Description
Alation | Data catalog for governance and self-service discovery
Collibra | Enterprise data governance and compliance platform
Apache Atlas | Open-source metadata management and data lineage tool
UNIT II
Hadoop-Requirement of Hadoop Framework-Design principle of Hadoop-Comparison with other
system- Hadoop Components-Hadoop 1 vs Hadoop 2- Hadoop Daemon's-HDFS Commands -Map
Reduce Programming: I/O formats-Map side join-Reduce Side Join-Secondary sorting-Pipelining
MapReduce jobs
1. Hadoop
Hadoop is an open-source framework that allows distributed processing of large data sets
across clusters of computers using simple programming models. It is designed to scale up
from single servers to thousands of machines.
2. Requirement of Hadoop Framework
Traditional systems struggle with big data due to volume, variety, and velocity. Hadoop
addresses these issues by enabling:
Distributed storage (via HDFS)
Parallel processing (via MapReduce)
Fault tolerance and scalability
3. Design Principles of Hadoop
1. Scalability
Hadoop is designed to scale horizontally across thousands of commodity machines. It
can easily handle petabytes of data by simply adding more nodes.
2. Fault Tolerance
o HDFS replicates data blocks across multiple nodes (default: 3 copies).
o If a node fails, Hadoop continues processing using other replicas without data
loss.
3. Data Locality Optimization
o Moves computation close to where the data resides instead of moving large
data across the network.
o Reduces network congestion and improves processing speed.
4. High Throughput
Optimized for batch processing of large datasets using parallelism. Hadoop uses the
MapReduce model to ensure efficient use of cluster resources.
5. Simplicity & Flexibility
Developers can write applications in Java, Python, or other languages using simple
APIs. It supports structured, semi-structured, and unstructured data.
6. Cost-Effectiveness
Designed to run on low-cost, commodity hardware rather than high-end servers.
7. Open-Source and Extensible
Built and maintained under the Apache Foundation, Hadoop is free to use and can be
extended with tools like Hive, Pig, Spark, and HBase.
4. Comparison with Other Systems
Feature | Traditional Systems | Hadoop
Scalability | Limited | Highly scalable
Fault Tolerance | Low | Built-in fault tolerance
Cost | Expensive infrastructure | Commodity hardware
Data Handling | Structured only | Structured + unstructured
5. Hadoop Components / Architecture
Hadoop is a framework written in Java that uses a large cluster of commodity hardware to store and process very large volumes of data.
Hadoop is built around the MapReduce programming model introduced by Google. Many large companies, e.g. Facebook, Yahoo, Netflix, and eBay, use Hadoop to deal with big data.
The Hadoop architecture mainly consists of four components:
HDFS(Hadoop Distributed File System)
MapReduce
YARN(Yet Another Resource Negotiator)
Common Utilities or Hadoop Common
1. MapReduce
MapReduce is a programming model (running on the YARN framework in Hadoop 2) whose major feature is distributed, parallel processing across a Hadoop cluster, which is what makes Hadoop fast.
Purpose: Data processing engine
Function: Processes data in parallel using Map and Reduce tasks.
Workflow: Input → Map → Shuffle/Sort → Reduce → Output
In the first phase the Map function is applied, and in the second phase the Reduce function is applied: the output of Map() becomes the input of Reduce(), and Reduce() produces the final output.
The input is a set of data. The Map() function breaks the input data blocks into tuples, i.e. key-value pairs. These key-value pairs are then sent as input to Reduce(). The Reduce() function combines the tuples by key, performs operations such as sorting or summation, and writes the result to the final output. What exactly the Reducer computes depends on the business requirement of the industry; in every job Map() runs first and Reduce() follows.
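To make the Input → Map → Shuffle/Sort → Reduce → Output flow concrete, here is a minimal word-count sketch written as Hadoop Streaming scripts in Python (the file names mapper.py and reducer.py are illustrative; Hadoop Streaming feeds each script lines on standard input and reads tab-separated key-value pairs from its standard output):

# mapper.py - emits one (word, 1) pair per word read from standard input
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")          # key and value separated by a tab

# reducer.py - input arrives sorted by key, so counts for a word are adjacent
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

Because the framework sorts the mapper output by key before it reaches the reducer, the reducer only has to watch for the key to change to know that one word's counts are finished.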
Map Task:
RecordReader: The RecordReader breaks the input into records and provides key-value pairs to the Map() function. The key is typically the record's positional information (its byte offset) and the value is the data associated with it.
Map: The map is a user-defined function that processes the tuples obtained from the RecordReader. For each input record, the Map() function may emit zero, one, or many key-value pairs.
Combiner: The combiner groups the data in the map workflow; it behaves like a local reducer. The intermediate key-value pairs generated by the Map are combined with its help. Using a combiner is optional.
Partitioner: The partitioner takes the key-value pairs generated in the mapper phase and creates one shard (partition) per reducer. It computes the hash code of each key and takes it modulo the number of reducers (key.hashCode() % number of reducers) to decide which reducer receives the pair.
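The partitioning rule just described can be sketched in a few lines of Python (the number of reducers and the sample keys are made up; Hadoop's default HashPartitioner applies the same idea using the key's Java hashCode()):

# Sketch of hash partitioning: each key goes to reducer hash(key) % num_reducers
num_reducers = 3                          # assumed job setting
for key in ["alice", "bob", "carol"]:     # sample keys
    partition = hash(key) % num_reducers  # Hadoop uses key.hashCode() instead of hash()
    print(f"{key} -> reducer {partition}")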
Reduce Task
Shuffle and Sort: The Reducer's work starts with this step. The process in which the intermediate key-value pairs generated by the Mappers are transferred to the Reducer tasks is known as shuffling; during this process the system sorts the data by key. Shuffling begins as soon as some map tasks finish, so it does not wait for all Mappers to complete, which speeds up the job.
Reduce: The main task of the Reduce function is to gather the tuples generated by Map and perform sorting and aggregation on those key-value pairs, depending on the key element.
OutputFormat: Once all operations are performed, the key-value pairs are written to the output file by the RecordWriter, one record per line, with the key and value separated by a tab by default.
2. HDFS
HDFS (Hadoop Distributed File System) is used for storage. It is designed to run on commodity hardware (inexpensive devices) using a distributed file system design, and it favors storing data in large blocks rather than in many small blocks.
HDFS provides fault tolerance and high availability to the storage layer and to the other devices present in the Hadoop cluster. The data storage nodes in HDFS are:
NameNode(Master)
DataNode(Slave)
NameNode: The NameNode acts as the master in a Hadoop cluster and guides the DataNodes (slaves). The NameNode mainly stores metadata, i.e. data about the data, such as the transaction logs that keep track of user activity in the cluster.
Metadata also includes the name and size of each file and the location information (block number, block IDs) of the DataNodes holding it, which the NameNode uses to find the closest DataNode for faster communication. The NameNode instructs the DataNodes to perform operations such as create, delete, and replicate.
DataNode: DataNodes act as slaves and are mainly used for storing the data in a Hadoop cluster; the number of DataNodes can range from one to several hundred or more. The more DataNodes there are, the more data the cluster can store, so DataNodes should have a high storage capacity to hold a large number of file blocks.
High Level Architecture Of Hadoop
File Blocks in HDFS: Data in HDFS is always stored in terms of blocks. A file is divided into blocks of 128 MB by default, and this block size can be changed manually.
To understand how a file is broken into blocks, suppose you upload a 400 MB file to HDFS. The file is divided into blocks of 128 MB + 128 MB + 128 MB + 16 MB = 400 MB, i.e. four blocks are created, each of 128 MB except the last one. Hadoop does not know (or care) what data is stored in these blocks, so it simply treats the final, smaller block as a partial block. In the Linux file system a block is about 4 KB, far smaller than the default block size in the Hadoop file system. Because Hadoop is configured for storing very large data, on the order of petabytes, the Hadoop file system differs from other file systems in that it can scale; nowadays block sizes of 128 MB to 256 MB are commonly used in Hadoop.
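The 400 MB example above can be reproduced with a few lines of Python (block and file sizes are the values used in the text):

# Splitting a 400 MB file into 128 MB HDFS blocks (values from the example above)
block_size_mb = 128
file_size_mb = 400

blocks = []
remaining = file_size_mb
while remaining > 0:
    blocks.append(min(block_size_mb, remaining))
    remaining -= block_size_mb

print(blocks)       # [128, 128, 128, 16]
print(len(blocks))  # 4 blocks, the last one partial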
Replication in HDFS: Replication ensures the availability of the data. Replication means making copies of something, and the number of copies kept is the replication factor. Just as HDFS stores data as blocks, Hadoop is also configured to make copies of those blocks.
By default, the replication factor in Hadoop is set to 3, and it can be changed manually as per your requirement. In the example above we created 4 file blocks, and each block is stored 3 times, i.e. 4 × 3 = 12 blocks in total, for backup purposes.
This is because Hadoop runs on commodity hardware (inexpensive system hardware) that can crash at any time; we are not using supercomputers for the Hadoop setup. That is why HDFS needs a feature that keeps copies of the file blocks for backup: this is fault tolerance.
Note that keeping so many replicas of the file blocks consumes a lot of extra storage, but for large organizations the data is far more important than the storage, so the overhead is accepted. You can configure the replication factor in the hdfs-site.xml file.
Rack Awareness: A rack is simply a physical collection of nodes in the Hadoop cluster (perhaps 30 to 40 machines). A large Hadoop cluster consists of many racks. Using this rack information, the NameNode chooses the closest DataNode for read/write operations, which maximizes performance and reduces network traffic.
HDFS Architecture
3. YARN (Yet Another Resource Negotiator)
YARN is the framework on which MapReduce runs. YARN performs two operations: job scheduling and resource management. The purpose of the job scheduler is to divide a big task into small jobs so that each job can be assigned to different slaves in the Hadoop cluster and processing can be maximized. The job scheduler also keeps track of which jobs are important, which have higher priority, dependencies between jobs, and other information such as job timing. The resource manager manages all the resources made available for running the Hadoop cluster.
Features of YARN
Multi-Tenancy
Scalability
Cluster-Utilization
Compatibility
4. Hadoop common or Common Utilities
Purpose: Shared utilities and libraries
Function: Provides necessary Java libraries and APIs used by other components.
6. Hadoop 1 vs Hadoop 2
Feature | Hadoop 1 | Hadoop 2
Resource Management | MapReduce handles resource management as well as data processing; this extra workload on MapReduce affects performance | YARN handles resource management (it allocates the resources and keeps everything running); MRv2 handles data processing
Scalability | Limited (up to ~4,000 nodes, ~40,000 tasks) | Highly scalable (tens of thousands of nodes)
Job Types Supported | Only MapReduce | MapReduce + other frameworks (e.g., Spark)
Fault Tolerance | Handled by JobTracker (single point of failure); single master and multiple slaves | Better fault tolerance via the YARN architecture; multiple masters and multiple slaves
Cluster Utilization | Less efficient (resource contention) | Improved utilization with container-based execution
Multi-tenancy | Not supported | Supported (multiple apps/users share the cluster)
Flexibility | Tightly coupled MapReduce engine | Decoupled processing model via YARN
1. Components: In Hadoop 1 we have MapReduce but Hadoop 2 has YARN(Yet Another
Resource Negotiator) and MapReduce version 2.
Hadoop 1 | Hadoop 2
HDFS | HDFS
MapReduce | YARN / MRv2
2. Daemons:
Hadoop 1 | Hadoop 2
NameNode | NameNode
DataNode | DataNode
Secondary NameNode | Secondary NameNode
JobTracker | ResourceManager
TaskTracker | NodeManager
3. Working:
In Hadoop 1, HDFS is used for storage and, on top of it, MapReduce handles both resource management and data processing. Because of this extra workload, MapReduce's performance suffers.
In Hadoop 2, HDFS is again used for storage, and on top of HDFS sits YARN, which handles resource management: it allocates the resources and keeps everything running.
4. Limitations: Hadoop 1 uses a master-slave architecture consisting of a single master and multiple slaves. If the master node crashes then, regardless of how good the slave nodes are, the cluster is lost, and rebuilding it (copying system files, image files, etc. onto another system) is very time-consuming, which organizations cannot tolerate today. Hadoop 2 is also a master-slave architecture, but it supports multiple masters (i.e. active NameNodes and standby NameNodes) and multiple slaves. If the active master node crashes, a standby master node takes over; you can configure multiple active-standby combinations. Hadoop 2 therefore eliminates the single point of failure.
7. Hadoop Daemons
NameNode: Manages metadata in HDFS.
DataNode: Stores actual data blocks.
Secondary NameNode: Merges edit logs with the fsimage (periodic checkpointing).
ResourceManager: Manages cluster resources (Hadoop 2).
NodeManager: Manages node-level tasks (Hadoop 2).
JobTracker/TaskTracker: Used in Hadoop 1 for job execution.
1. NameNode
The NameNode runs on the master system. Its primary purpose is to manage all the metadata. Metadata is the list of files stored in HDFS (Hadoop Distributed File System). As we know, data is stored in the form of blocks in a Hadoop cluster, so the metadata records which DataNode (i.e. which location) holds each block of a file. All information regarding the logs of transactions happening in the Hadoop cluster (when or who read/wrote the data) is also kept in the metadata, which is held in memory.
Features:
It never stores the data that is present in the files themselves.
Because the NameNode runs on the master system, the master should have good processing power and more RAM than the slaves.
It stores information about the DataNodes, such as their block IDs and number of blocks.
2. DataNode
The DataNode runs on the slave systems. The NameNode always instructs the DataNodes on storing the data. A DataNode is a program running on a slave system that serves read/write requests from clients. Since the data is stored on the DataNodes, they should have a large storage capacity to hold more data.
3. Secondary NameNode
The Secondary NameNode takes periodic (e.g. hourly) backups, or checkpoints, of the NameNode's metadata and stores them in a file named fsimage. If the Hadoop cluster fails or crashes, this file can be transferred to a new system, the metadata can be loaded there, a new master can be created from it, and the cluster can be brought back up correctly.
This is the benefit of the Secondary NameNode. In Hadoop 2, the High Availability and Federation features reduce the importance of the Secondary NameNode.
Major functions of the Secondary NameNode:
It merges the edit logs and the fsimage from the NameNode.
It periodically reads the metadata from the NameNode's RAM and writes it to disk.
Because the Secondary NameNode keeps track of checkpoints in the Hadoop Distributed File System, it is also known as the checkpoint node.
Hadoop Daemon | Default Port
NameNode | 50070
DataNode | 50075
Secondary NameNode | 50090
These ports can be configured manually in the hdfs-site.xml and mapred-site.xml files.
4. Resource Manager
The ResourceManager, also known as the global master daemon, runs on the master system. It manages the resources for the applications running in a Hadoop cluster and mainly consists of two components:
A. ApplicationsManager
B. Scheduler
The ApplicationsManager is responsible for accepting job submissions from clients and for negotiating a container on a slave node in the Hadoop cluster to host the ApplicationMaster. The Scheduler is responsible for allocating resources to the applications running in the Hadoop cluster.
How to start the ResourceManager?
yarn-daemon.sh start resourcemanager
How to stop the ResourceManager?
yarn-daemon.sh stop resourcemanager
5. Node Manager
The NodeManager runs on the slave systems and manages the resources (memory and disk) available on its node. Each slave node in a Hadoop cluster runs a single NodeManager daemon. It also sends this monitoring information to the ResourceManager.
How to start the NodeManager?
yarn-daemon.sh start nodemanager
How to stop the NodeManager?
yarn-daemon.sh stop nodemanager
In a Hadoop cluster, the ResourceManager and NodeManager web interfaces can be reached at URLs of the form http://<hostname>:<port>, using the ports below.
Hadoop Daemon | Default Port
ResourceManager | 8088
NodeManager | 8042
8. HDFS Commands
Some common commands include:
hdfs dfs -ls /: List a directory
hdfs dfs -put <local file> /: Upload a file
hdfs dfs -get /<file> .: Download a file
hdfs dfs -rm /<file>: Delete a file
hdfs dfs -mkdir <folder name>: Create a directory
jps: List the Hadoop daemons (Java processes) running on the machine
hdfs dfs -touchz <file_path>: Create an empty file
copyFromLocal (or) put: To copy files/folders from the local file system to the HDFS store. This is the most important command. Local file system means the files present on the OS. Syntax:
hdfs dfs -copyFromLocal <local file path> <dest(present on hdfs)>
cat: To print file contents. Syntax:
bin/hdfs dfs -cat <path>
copyToLocal (or) get: To copy files/folders from the HDFS store to the local file system. Syntax:
bin/hdfs dfs -copyToLocal <srcfile(on hdfs)> <local file dest>
cp: This command is used to copy files within HDFS. Let's copy the folder geeks to geeks_copied. Syntax:
bin/hdfs dfs -cp <src(on hdfs)> <dest(on hdfs)>
mv: This command is used to move files within HDFS. Let's cut-paste a file from the geeks folder to geeks_copied. Syntax:
bin/hdfs dfs -mv <src(on hdfs)> <dest(on hdfs)>
rmr: This command deletes files from HDFS recursively. It is a very useful command when you want to delete a non-empty directory. Syntax:
bin/hdfs dfs -rmr <filename/directoryName>
1. I/O Formats in MapReduce
InputFormat: Defines how input files are split and read.
➤ Examples: TextInputFormat, KeyValueInputFormat,
SequenceFileInputFormat
OutputFormat: Defines how output data is written.
➤ Examples: TextOutputFormat, SequenceFileOutputFormat
Purpose: Converts input into key-value pairs for processing and formats output from
Reducer.
In MapReduce, both input and output data are formatted as key-value
pairs. Input formats define how input data is read and converted into key-
value pairs, while output formats specify how the processed data is
written. These formats determine how the MapReduce framework interacts
with the input and output data sources.
Input Formats:
TextInputFormat:
Reads input data as lines, with the line's byte offset as the key and the line content
as the value. This is the default input format.
SequenceFileInputFormat:
Reads binary data, commonly used when processing data from other MapReduce
jobs.
KeyValueTextInputFormat:
Parses input data as key-value pairs separated by a delimiter, often used for
structured data.
Other Input Formats:
Other input formats exist for specific data sources, such as database tables or
HBase.
Output Formats:
TextOutputFormat: Writes output as plain text files, where keys and values
are separated by a tab character by default.
SequenceFileOutputFormat: Writes binary output, useful for passing data
between MapReduce jobs.
DBOutputFormat: Writes output to relational databases or HBase tables.
Other Output Formats: Output formats can be customized to write to various
data sinks, including other Hadoop file systems or cloud storage.
Key-Value Pairs:
The MapReduce framework operates on key-value pairs, where both the key and the value can be any serializable object.
The specific types of keys and values used in a MapReduce job depend on
the job's logic and the chosen input and output formats.
The key and value classes must implement the Writable interface to allow the
framework to serialize and deserialize them.
Additionally, key classes often implement the WritableComparable interface to
facilitate sorting by the framework.
In summary: Input and output formats are essential components of
MapReduce that determine how data is read, processed, and written. They
define the structure and format of the data that the MapReduce framework
uses to perform distributed processing.
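As a rough illustration of the difference between the two text-based input formats, the sketch below shows how the same line would be presented to the Mapper as a key-value pair (the sample line, offset, and tab delimiter are assumptions; in a real job the format class is set in the driver configuration):

# How one input line becomes a key-value pair under two input formats
line = "user42\tclicked product 17"   # sample record with a tab delimiter
byte_offset = 0                       # position of the line within the file

# TextInputFormat: key = byte offset of the line, value = the whole line
text_input_pair = (byte_offset, line)

# KeyValueTextInputFormat: key = text before the delimiter, value = the rest
key, _, value = line.partition("\t")
key_value_pair = (key, value)

print(text_input_pair)   # (0, 'user42\tclicked product 17')
print(key_value_pair)    # ('user42', 'clicked product 17')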
🔗 2. Map Side Join
Definition: A join operation performed in the Map phase, avoiding the shuffle and
sort phases.
Best For: When one dataset is small enough to fit into memory (lookup table).
Efficiency: Faster and more efficient than reduce-side join.
🔗 3. Reduce Side Join
Definition: Join logic executed during the Reduce phase after data is shuffled and
sorted.
Best For: Large datasets or when both datasets are dynamic.
Drawback: Requires network transfer and more sorting — less efficient.
🔁 4. Secondary Sorting
Definition: Sorting values within the same key before they reach the Reducer.
Use Case: When Reducer needs values in a specific order (e.g., time-series data).
Implementation: Requires a custom partitioner, sort comparator, and grouping
comparator.
🔁 Secondary Sorting – Explained
✅ Definition
Secondary Sorting is a technique in Hadoop MapReduce that allows sorting values
associated with a key before they reach the Reducer.
It helps when the Reducer logic depends on the order of values for each key.
🧠 When to Use
When the order of values matters for computation (e.g., finding top-N,
chronological order).
Common in time-series data, log processing, or ranking problems.
🔍 How It Works
1. Composite key: A key that combines both the main key and the sort key (e.g.,
(userID, timestamp)).
2. Custom Partitioner: Ensures all records with the same primary key go to the same
Reducer.
3. Custom Sort Comparator: Sorts composite keys by both primary and secondary
parts.
4. Custom Grouping Comparator: Groups only by the primary key so the Reducer
sees values in sorted order.
📦 Example Use Case
Sort all transactions (amount) for each customer_id by transaction date before aggregation.
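A minimal local simulation of what the composite key, sort comparator, and grouping comparator achieve, using made-up transaction records:

from itertools import groupby

# Made-up (customer_id, transaction_date, amount) records
records = [
    ("C1", "2024-03-02", 50),
    ("C2", "2024-01-15", 20),
    ("C1", "2024-01-10", 75),
    ("C2", "2024-02-01", 10),
]

# Composite key (customer_id, transaction_date): sorting on both parts is
# what the custom sort comparator does during the shuffle phase
records.sort(key=lambda r: (r[0], r[1]))

# Grouping only by the primary key (customer_id), like the grouping comparator,
# so each "reducer call" sees that customer's transactions already in date order
for customer_id, txns in groupby(records, key=lambda r: r[0]):
    print(customer_id, [(date, amount) for _, date, amount in txns])
# C1 [('2024-01-10', 75), ('2024-03-02', 50)]
# C2 [('2024-01-15', 20), ('2024-02-01', 10)]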
✅ Advantages
Enables ordered processing of values in the Reducer.
Essential for top-k queries, trend analysis, and sequential decision tasks.
❌ Limitations
Requires custom code (comparators and partitioner).
Slightly more complex than standard MapReduce.
5. Pipelining MapReduce Jobs
Definition: Output of one MapReduce job is used as input to the next job in a
sequence.
Purpose: Enables complex data processing workflows.
Tools: Achieved through job chaining in code or using tools like Apache Oozie.
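As an illustration of job chaining in code, the sketch below uses the Python mrjob library (an assumption, not a tool named above): the output of the first MRStep (word counts) becomes the input of the second MRStep, which picks the most frequent word.

# Chaining two MapReduce steps with mrjob: step 1 counts words,
# step 2 consumes step 1's output and finds the most frequent word
from mrjob.job import MRJob
from mrjob.step import MRStep

class MostFrequentWord(MRJob):

    def steps(self):
        return [
            MRStep(mapper=self.mapper_words, reducer=self.reducer_count),
            MRStep(reducer=self.reducer_max),
        ]

    def mapper_words(self, _, line):
        for word in line.split():
            yield word.lower(), 1

    def reducer_count(self, word, counts):
        # Emit everything under one key so the next step sees all totals together
        yield None, (sum(counts), word)

    def reducer_max(self, _, count_word_pairs):
        yield max(count_word_pairs)

if __name__ == "__main__":
    MostFrequentWord.run()

Running the script locally (python <script> <input file>) or on a cluster (with the -r hadoop runner) executes the two steps in sequence, with the intermediate output handled automatically; Apache Oozie or plain driver code that feeds one job's output directory to the next job achieves the same chaining.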
🔗 Map Side Join – Explained
✅ Definition
A Map Side Join is a join operation performed entirely in the Map phase, without using
the Reducer. It allows combining two datasets before the shuffle and sort stage, improving
performance.
🧠 When to Use
One of the datasets is small enough to fit into memory.
The larger dataset is sorted and partitioned in the same way as the smaller dataset.
You want to avoid network overhead and improve speed.
🔍 How It Works
1. The small dataset is loaded into memory (e.g., a HashMap) at the Mapper’s setup
phase.
2. Each record of the large dataset is streamed through the Mapper.
3. The Mapper performs a lookup and join using the in-memory data.
4. The output is written directly—no need for a Reduce phase.
📦 Example Use Case
Joining a large transaction dataset with a small customer lookup file (e.g., customer_id →
customer_name).
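A hedged sketch of that use case as a Hadoop Streaming mapper in Python: the small lookup file (here assumed to be customers.csv, e.g. shipped to every mapper via the distributed cache) is loaded into a dictionary during setup, and each transaction streamed on standard input is joined against it, with no reduce phase at all.

# map_side_join_mapper.py - joins a large transaction stream against a small
# in-memory customer lookup; file and field names are assumptions for illustration
import csv
import sys

# Setup: load the small dataset (customer_id -> customer_name) into memory
customers = {}
with open("customers.csv", newline="") as f:
    for customer_id, customer_name in csv.reader(f):
        customers[customer_id] = customer_name

# Map: each input line of the large dataset is "customer_id,amount";
# emit the joined record directly - no shuffle, sort, or reducer needed
for line in sys.stdin:
    customer_id, amount = line.strip().split(",", 1)
    name = customers.get(customer_id, "UNKNOWN")
    print(f"{customer_id}\t{name}\t{amount}")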
✅ Advantages
Faster execution (no shuffle/sort).
Less resource-intensive.
Simple to implement when small dataset is static.
❌ Limitations
Not suitable when both datasets are large.
Requires careful partitioning and sorting if pre-joining sorted files.
🔗 Reduce Side Join – Explained
✅ Definition
A Reduce Side Join is a join operation that takes place during the Reduce phase. Both
datasets are independently processed by the Mappers and then shuffled and sorted by key
before being joined in the Reducer.
🧠 When to Use
Both datasets are large or dynamic.
Datasets cannot be loaded into memory.
You need a general-purpose join regardless of dataset size.
🔍 How It Works
1. Both datasets are passed through separate Mappers.
2. Each Mapper tags the records with an identifier (e.g., A or B) and emits them using
the join key.
3. During the shuffle phase, records with the same key from both datasets are grouped
together.
4. The Reducer receives all matching records and performs the join logic.
📦 Example Use Case
Joining a large orders dataset with a large products dataset using product_id as the join
key.
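A hedged sketch of the reducer side of such a join with Hadoop Streaming: the mappers are assumed to tag each record with its source (P for products, O for orders) and emit product_id as the key, so after the shuffle and sort all records for one product arrive together.

# reduce_side_join_reducer.py - joins tagged records on product_id.
# Assumed mapper output format (tab-separated key and value):
#   <product_id>\tP,<product_name>        from the products dataset
#   <product_id>\tO,<order_id>,<quantity> from the orders dataset
import sys
from itertools import groupby

def keyed_lines(stream):
    for line in stream:
        key, value = line.rstrip("\n").split("\t", 1)
        yield key, value

# After shuffling/sorting, all values for a given product_id are adjacent
for product_id, group in groupby(keyed_lines(sys.stdin), key=lambda kv: kv[0]):
    product_name = None
    orders = []
    for _, value in group:
        tag, _, payload = value.partition(",")
        if tag == "P":
            product_name = payload
        else:
            orders.append(payload)
    # Join: pair every order with the matching product record
    for order in orders:
        print(f"{product_id}\t{product_name}\t{order}")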
✅ Advantages
Works with any dataset size.
Handles complex joins.
Does not require pre-sorting or partitioning.
❌ Limitations
Slower due to shuffle and sort phase.
Higher network overhead.
More resource-intensive than Map Side Join.