Here is a detailed explanation of the key topics related to Big Data and Data Analytics,
structured for easy understanding or academic use.
1. Big Data Definition
Big Data refers to extremely large and complex datasets that cannot be managed, processed,
or analyzed using traditional data processing tools. These datasets are generated at high
velocity from various sources like social media, sensors, mobile devices, and IoT.
2. Characteristic Features of Big Data (5Vs)
1. Volume
o Refers to the sheer amount of data (terabytes to zettabytes).
o Example: Facebook generates over 4 petabytes of data daily.
2. Velocity
o The speed at which data is generated, collected, and processed.
o Example: Financial markets require real-time processing of stock data.
3. Variety
o Different types of data: structured (SQL), semi-structured (XML, JSON), and
unstructured (text, video).
o Example: Emails, sensor data, images.
4. Veracity
o The quality and trustworthiness of the data.
o Big Data may include incomplete, inconsistent, or inaccurate data.
5. Value
o The potential to extract meaningful insights that drive decision-making.
o Example: Predicting customer behavior from shopping patterns.
3. Big Data Applications
Healthcare: Disease prediction, personalized medicine, hospital resource
optimization.
Finance: Fraud detection, risk analysis, algorithmic trading.
Retail: Customer behavior analysis, dynamic pricing, inventory management.
Telecommunications: Network optimization, customer churn prediction.
Smart Cities: Traffic management, pollution monitoring.
Education: Student performance prediction, personalized learning.
Social Media: Sentiment analysis, trend prediction.
4. Big Data vs Traditional Data
Feature | Big Data | Traditional Data
Volume | Terabytes to zettabytes | Megabytes to gigabytes
Velocity | Real-time or near real-time | Batch processing
Variety | Structured, semi-structured, unstructured | Mainly structured
Storage | Distributed systems (HDFS, cloud) | Centralized (RDBMS)
Processing | Parallel and distributed | Sequential
Tools Used | Hadoop, Spark, NoSQL | SQL, Excel, OLAP
5. Risks of Big Data
Privacy Concerns: Sensitive personal data can be misused.
Security Threats: Larger data sets increase exposure to breaches.
Data Quality Issues: Inconsistent or inaccurate data leads to misleading insights.
Ethical Dilemmas: Misuse of data for surveillance, manipulation.
Compliance Risk: Non-compliance with data laws (e.g., GDPR).
6. Structure of Big Data
Big Data is classified based on structure:
Structured Data: Tabular data stored in relational databases (e.g., customer records).
Semi-Structured Data: JSON, XML, logs (e.g., API data).
Unstructured Data: Text, audio, video, social media posts.
7. Challenges of Conventional Systems
Scalability Limitations: Not designed for growing data volume.
Performance Bottlenecks: Slow when handling high-velocity or variety data.
Data Silos: Hard to integrate various data types from different sources.
Cost of Infrastructure: Expensive to scale traditional systems.
Limited Real-time Capability: Mostly batch processing, not suitable for streaming
data.
8. Web Data
Web Data refers to information collected from the internet:
Types:
Social media (Facebook, Twitter)
Blogs and forums
Clickstream data
E-commerce transactions
Characteristics:
High volume and velocity
Unstructured or semi-structured
Rich in user-generated content
Uses:
Sentiment analysis
User behavior analytics
Market research
9. Evolution of Analytic Scalability
a. Manual Analysis
Excel, Access, manual charting for small datasets.
b. Data Warehousing
Centralized systems using ETL processes (Extract, Transform, Load).
SQL and OLAP tools for analysis.
c. Big Data Tools
Hadoop, Spark, and NoSQL for massive datasets.
Can handle structured + unstructured data in parallel.
d. Cloud Analytics
AWS, Azure, GCP platforms offer scalable infrastructure.
Serverless, pay-as-you-go options.
e. Edge Analytics
Data analyzed at the point of generation (IoT sensors, mobile).
In more detail, analytic scalability refers to the ability of data systems to handle growing volume, velocity, and complexity of data while maintaining performance. Its evolution has paralleled advances in data storage, processing, and infrastructure:
1. Manual Analysis (before the 1980s):
o Excel, Access, and manual charting for small datasets.
2. Data Warehousing (1980s–1990s):
o Centralized systems using ETL processes (Extract, Transform, Load).
o Focus: Structured data from internal sources.
o Scalability Strategy: Vertical scaling (larger machines).
o Tools: Relational databases, OLAP cubes.
o Limitations: Expensive hardware, rigid schema, limited scalability.
3. Distributed Computing Era (2000s):
o Shift to Big Data: Hadoop, Spark, and NoSQL for massive datasets, processing structured and unstructured data in parallel.
o Key Technologies: Hadoop, MapReduce.
o Scalability Strategy: Horizontal scaling (more machines).
o Benefits: Can process massive datasets across commodity hardware.
o Drawbacks: High latency, complex programming models.
4. Real-Time and Cloud Analytics (2010s):
o Rise of streaming data and cloud services.
o Tools: Spark, Kafka, Redshift, BigQuery, Snowflake.
o Scalability Strategy: Elastic scaling in cloud environments.
o Features: In-memory processing, real-time analytics, pay-as-you-go pricing.
5. Modern AI-Driven Analytics (2020s–present):
o Integration with ML/AI workflows.
o Technologies: Lakehouses (e.g., Delta Lake), vector databases, GPU
acceleration.
o Scalability Strategy: Serverless, autoscaling, containerized environments
(e.g., Kubernetes).
o Focus: Unstructured data, multimodal analytics, scalable AI pipelines.
10. Evolution of Analytic Processes, Tools, and Methods
Type | Description | Tools
Descriptive | What happened? | Excel, SQL, Tableau
Diagnostic | Why did it happen? | BI tools, SQL, Python
Predictive | What is likely to happen? | Machine learning, Python, R
Prescriptive | What should be done? | AI, optimization tools
Cognitive | AI that mimics human thinking | Chatbots, recommendation systems
A brief timeline of the evolution of analytic processes, tools, and methods:
🧠 1. Early Stage: Descriptive Analytics (Pre-1980s)
🔧 Tools:
Spreadsheets (e.g., Lotus 1-2-3, later Excel)
Basic statistical software (e.g., SPSS, SAS
beginnings)
📊 Methods:
Descriptive statistics (mean, median,
variance)
Basic reporting and trend analysis
⚙️Processes:
Manual data collection and processing
Reports generated periodically
(weekly/monthly)
Focus on "What happened?"
📈 2. Growth Phase: Diagnostic Analytics (1980s–1990s)
🔧 Tools:
Relational databases (e.g., Oracle, DB2,
SQL Server)
Business Intelligence (BI) tools (e.g.,
Cognos, SAP BW)
📊 Methods:
OLAP (Online Analytical Processing)
Drill-down and slice-and-dice techniques
⚙️Processes:
Centralized data warehouses
Data cleansing and transformation begin
Start of ETL (Extract, Transform, Load)
processes
Focus on "Why did it happen?"
📊 3. Expansion Phase: Predictive Analytics (2000s–2010s)
🔧 Tools:
Advanced statistical software (SAS, R,
Python)
Machine learning libraries (Scikit-learn,
TensorFlow)
Big data platforms (Hadoop, Spark)
📊 Methods:
Regression, decision trees, clustering
Forecasting, classification, and time-series
models
⚙️Processes:
Data lakes emerge alongside data
warehouses
Real-time processing becomes possible
Focus on "What will happen?"
🤖 4. Current Phase: Prescriptive & Cognitive Analytics (2015–Now)
🔧 Tools:
AI/ML platforms (AWS SageMaker, Azure
ML, Google Vertex AI)
NLP, deep learning, LLMs (e.g., ChatGPT,
BERT)
No-code/low-code platforms, AutoML
📊 Methods:
Optimization algorithms
Reinforcement learning
Natural Language Processing (NLP)
Explainable AI (XAI)
⚙️Processes:
Cloud-native architectures
DataOps and MLOps practices
Automated decision-making
Focus on "What should we do?"
🧭 Future Trends
Generative AI for data storytelling and
simulation
Quantum analytics (in early research)
Ethical and responsible AI
Full automation with real-time adaptive
analytics
11. Analysis vs Reporting
Criteria | Analysis | Reporting
Purpose | Discover insights, trends, relationships | Communicate known data
Time Frame | Future or real-time predictions | Past and current events
Tools | R, Python, ML tools, Spark | Excel, BI tools, Crystal Reports
User | Data Scientists, Analysts | Business Managers, Executives
Complexity | Complex, interactive | Simple, formatted

Aspect | Reporting | Analysis
Definition | The process of organizing and presenting data in a structured format | The process of examining data to uncover patterns, relationships, or insights
Purpose | To inform about what has happened | To understand why it happened and what to do next
Questions Answered | What happened? | Why did it happen? What does it mean? What should we do?
Nature | Descriptive | Diagnostic, predictive, prescriptive
Tools Used | BI tools (Power BI, Tableau), Excel, dashboards | BI + analytics tools (R, Python, SQL, ML platforms)
Frequency | Regular, scheduled (daily, weekly, monthly) | Ad hoc, real-time, or project-specific
Output | Static reports, charts, dashboards | Insights, models, recommendations
Skill Level | Basic (data reading & visualization) | Advanced (statistics, data science, critical thinking)
Decision Support | Low: informs stakeholders | High: drives strategic decisions
12. Modern Data Analytic Tools
Tool/Platform | Use Case
Hadoop | Distributed storage and processing (batch jobs)
Apache Spark | Fast, in-memory processing
Tableau / Power BI | Visualizations and dashboards
Python / R | Statistical analysis, machine learning
Jupyter Notebooks | Interactive coding and visualizations
Google BigQuery / AWS Redshift / Azure Synapse | Cloud-based scalable analytics
Databricks | Unified data engineering and ML platform
1. Business Intelligence (BI) Tools
Used for data visualization, dashboards, and reporting.
Tool | Description
Power BI (Microsoft) | User-friendly dashboarding and reporting with strong integration with Microsoft products
Tableau (Salesforce) | Advanced, interactive visualizations; widely used in enterprise BI
Looker (Google Cloud) | Data modeling + BI; good for centralized data governance
Qlik Sense | Strong associative data engine and self-service analytics
2. Data Integration & ETL Tools
Used to extract, transform, and load data from multiple sources.
Tool | Description
Apache NiFi | Open-source tool for automating data flows
Talend | ETL and data quality management
Fivetran / Stitch | Cloud-native, no-code ELT pipelines
Apache Airflow | Workflow orchestration for managing ETL pipelines
🧠 3. Advanced Analytics & Machine Learning Tools
Used for statistical modeling, predictive analytics, and machine learning.
Tool | Description
Python | Most popular language for data science (with libraries like pandas, scikit-learn, TensorFlow)
R | Specialized in statistics and visual analytics
Azure ML / AWS SageMaker / Google Vertex AI | Cloud platforms for end-to-end ML workflows
RapidMiner / DataRobot | AutoML platforms for fast model development without heavy coding
☁️4. Cloud Data Warehouses
Used for storing and querying large volumes of structured data.
Tool | Description
Snowflake | Scalable, cloud-native data warehouse with separation of compute and storage
Google BigQuery | Serverless, high-speed analytics on large datasets
Amazon Redshift | Fully managed, petabyte-scale data warehouse
Databricks | Unified analytics platform built on Apache Spark; supports ML and big data
🛠️5. Data Governance & Cataloging Tools
Used for data discovery, lineage, and compliance.
Tool | Description
Alation | Data catalog for governance and self-service discovery
Collibra | Enterprise data governance and compliance platform
Apache Atlas | Open-source metadata management and data lineage tool
UNIT II
Hadoop-Requirement of Hadoop Framework-Design principle of Hadoop-Comparison with other
system- Hadoop Components-Hadoop 1 vs Hadoop 2- Hadoop Daemon's-HDFS Commands -Map
Reduce Programming: I/O formats-Map side join-Reduce Side Join-Secondary sorting-Pipelining
MapReduce jobs
1. Hadoop
Hadoop is an open-source framework that allows distributed processing of large data sets
across clusters of computers using simple programming models. It is designed to scale up
from single servers to thousands of machines.
2. Requirement of Hadoop Framework
Traditional systems struggle with big data due to volume, variety, and velocity. Hadoop
addresses these issues by enabling:
Distributed storage (via HDFS)
Parallel processing (via MapReduce)
Fault tolerance and scalability
3. Design Principles of Hadoop
1. Scalability
Hadoop is designed to scale horizontally across thousands of commodity machines. It
can easily handle petabytes of data by simply adding more nodes.
2. Fault Tolerance
o HDFS replicates data blocks across multiple nodes (default: 3 copies).
o If a node fails, Hadoop continues processing using other replicas without data
loss.
3. Data Locality Optimization
o Moves computation close to where the data resides instead of moving large
data across the network.
o Reduces network congestion and improves processing speed.
4. High Throughput
Optimized for batch processing of large datasets using parallelism. Hadoop uses the
MapReduce model to ensure efficient use of cluster resources.
5. Simplicity & Flexibility
Developers can write applications in Java, Python, or other languages using simple
APIs. It supports structured, semi-structured, and unstructured data.
6. Cost-Effectiveness
Designed to run on low-cost, commodity hardware rather than high-end servers.
7. Open-Source and Extensible
Built and maintained under the Apache Foundation, Hadoop is free to use and can be
extended with tools like Hive, Pig, Spark, and HBase.
4. Comparison with Other Systems
Feature | Traditional Systems | Hadoop
Scalability | Limited | Highly scalable
Fault Tolerance | Low | Built-in fault tolerance
Cost | Expensive infrastructure | Commodity hardware
Data Handling | Structured only | Structured + unstructured
5. Hadoop Components / Architecture
Hadoop is a framework written in Java that uses a large cluster of commodity hardware to store and process very large volumes of data.
Hadoop is built around the MapReduce programming model introduced by Google. Many large companies, e.g. Facebook, Yahoo, Netflix, and eBay, use Hadoop to deal with big data.
The Hadoop architecture mainly consists of four components:
HDFS(Hadoop Distributed File System)
MapReduce
YARN(Yet Another Resource Negotiator)
Common Utilities or Hadoop Common
1. MapReduce
MapReduce is a programming model (running on the YARN framework in Hadoop 2) whose major feature is distributed, parallel processing across a Hadoop cluster, which is what makes Hadoop fast.
Purpose: Data processing engine
Function: Processes data in parallel using Map and Reduce tasks.
Workflow: Input → Map → Shuffle/Sort → Reduce → Output
In the first phase the Map function is applied, and in the second phase the Reduce function is applied: the output of Map() becomes the input of Reduce(), and Reduce() produces the final output.
The input is a set of data. The Map() function breaks the input data blocks into tuples, i.e. key-value pairs. These key-value pairs are then sent as input to Reduce(). The Reduce() function combines the tuples by key, performs operations such as sorting or summation, and writes the result to the final output. What exactly the Reducer computes depends on the business requirement of the industry; in every job Map() runs first and Reduce() follows.
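To make the Input → Map → Shuffle/Sort → Reduce → Output flow concrete, here is a minimal word-count sketch written as Hadoop Streaming scripts in Python (the file names mapper.py and reducer.py are illustrative; Hadoop Streaming feeds each script lines on standard input and reads tab-separated key-value pairs from its standard output):

# mapper.py - emits one (word, 1) pair per word read from standard input
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")          # key and value separated by a tab

# reducer.py - input arrives sorted by key, so counts for a word are adjacent
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

Because the framework sorts the mapper output by key before it reaches the reducer, the reducer only has to watch for the key to change to know that one word's counts are finished.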
Map Task:
RecordReader: The RecordReader breaks the input into records and provides key-value pairs to the Map() function. The key is typically the record's positional information (its byte offset) and the value is the data associated with it.
Map: The map is a user-defined function that processes the tuples obtained from the RecordReader. For each input record, the Map() function may emit zero, one, or many key-value pairs.
Combiner: The combiner groups the data in the map workflow; it behaves like a local reducer. The intermediate key-value pairs generated by the Map are combined with its help. Using a combiner is optional.
Partitioner: The partitioner takes the key-value pairs generated in the mapper phase and creates one shard (partition) per reducer. It computes the hash code of each key and takes it modulo the number of reducers (key.hashCode() % number of reducers) to decide which reducer receives the pair.
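The partitioning rule just described can be sketched in a few lines of Python (the number of reducers and the sample keys are made up; Hadoop's default HashPartitioner applies the same idea using the key's Java hashCode()):

# Sketch of hash partitioning: each key goes to reducer hash(key) % num_reducers
num_reducers = 3                          # assumed job setting
for key in ["alice", "bob", "carol"]:     # sample keys
    partition = hash(key) % num_reducers  # Hadoop uses key.hashCode() instead of hash()
    print(f"{key} -> reducer {partition}")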
Reduce Task
Shuffle and Sort: The Reducer's work starts with this step. The process in which the intermediate key-value pairs generated by the Mappers are transferred to the Reducer tasks is known as shuffling; during this process the system sorts the data by key. Shuffling begins as soon as some map tasks finish, so it does not wait for all Mappers to complete, which speeds up the job.
Reduce: The main task of the Reduce function is to gather the tuples generated by Map and perform sorting and aggregation on those key-value pairs, depending on the key element.
OutputFormat: Once all operations are performed, the key-value pairs are written to the output file by the RecordWriter, one record per line, with the key and value separated by a tab by default.
2. HDFS
HDFS (Hadoop Distributed File System) is used for storage. It is designed to run on commodity hardware (inexpensive devices) using a distributed file system design, and it favors storing data in large blocks rather than in many small blocks.
HDFS provides fault tolerance and high availability to the storage layer and to the other devices present in the Hadoop cluster. The data storage nodes in HDFS are:
NameNode(Master)
DataNode(Slave)
NameNode: The NameNode acts as the master in a Hadoop cluster and guides the DataNodes (slaves). The NameNode mainly stores metadata, i.e. data about the data, such as the transaction logs that keep track of user activity in the cluster.
Metadata also includes the name and size of each file and the location information (block number, block IDs) of the DataNodes holding it, which the NameNode uses to find the closest DataNode for faster communication. The NameNode instructs the DataNodes to perform operations such as create, delete, and replicate.
DataNode: DataNodes act as slaves and are mainly used for storing the data in a Hadoop cluster; the number of DataNodes can range from one to several hundred or more. The more DataNodes there are, the more data the cluster can store, so DataNodes should have a high storage capacity to hold a large number of file blocks.
High Level Architecture Of Hadoop
File Blocks in HDFS: Data in HDFS is always stored in terms of blocks. A file is divided into blocks of 128 MB by default, and this block size can be changed manually.
To understand how a file is broken into blocks, suppose you upload a 400 MB file to HDFS. The file is divided into blocks of 128 MB + 128 MB + 128 MB + 16 MB = 400 MB, i.e. four blocks are created, each of 128 MB except the last one. Hadoop does not know (or care) what data is stored in these blocks, so it simply treats the final, smaller block as a partial block. In the Linux file system a block is about 4 KB, far smaller than the default block size in the Hadoop file system. Because Hadoop is configured for storing very large data, on the order of petabytes, the Hadoop file system differs from other file systems in that it can scale; nowadays block sizes of 128 MB to 256 MB are commonly used in Hadoop.
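The 400 MB example above can be reproduced with a few lines of Python (block and file sizes are the values used in the text):

# Splitting a 400 MB file into 128 MB HDFS blocks (values from the example above)
block_size_mb = 128
file_size_mb = 400

blocks = []
remaining = file_size_mb
while remaining > 0:
    blocks.append(min(block_size_mb, remaining))
    remaining -= block_size_mb

print(blocks)       # [128, 128, 128, 16]
print(len(blocks))  # 4 blocks, the last one partial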
Replication in HDFS: Replication ensures the availability of the data. Replication means making copies of something, and the number of copies kept is the replication factor. Just as HDFS stores data as blocks, Hadoop is also configured to make copies of those blocks.
By default, the replication factor in Hadoop is set to 3, and it can be changed manually as per your requirement. In the example above we created 4 file blocks, and each block is stored 3 times, i.e. 4 × 3 = 12 blocks in total, for backup purposes.
This is because Hadoop runs on commodity hardware (inexpensive system hardware) that can crash at any time; we are not using supercomputers for the Hadoop setup. That is why HDFS needs a feature that keeps copies of the file blocks for backup: this is fault tolerance.
Note that keeping so many replicas of the file blocks consumes a lot of extra storage, but for large organizations the data is far more important than the storage, so the overhead is accepted. You can configure the replication factor in the hdfs-site.xml file.
Rack Awareness: A rack is simply a physical collection of nodes in the Hadoop cluster (perhaps 30 to 40 machines). A large Hadoop cluster consists of many racks. Using this rack information, the NameNode chooses the closest DataNode for read/write operations, which maximizes performance and reduces network traffic.
HDFS Architecture
3. YARN (Yet Another Resource Negotiator)
YARN is the framework on which MapReduce runs. YARN performs two operations: job scheduling and resource management. The purpose of the job scheduler is to divide a big task into small jobs so that each job can be assigned to different slaves in the Hadoop cluster and processing can be maximized. The job scheduler also keeps track of which jobs are important, which have higher priority, dependencies between jobs, and other information such as job timing. The resource manager manages all the resources made available for running the Hadoop cluster.
Features of YARN
Multi-Tenancy
Scalability
Cluster-Utilization
Compatibility
4. Hadoop common or Common Utilities
Purpose: Shared utilities and libraries
Function: Provides necessary Java libraries and APIs used by other components.
6. Hadoop 1 vs Hadoop 2
Feature | Hadoop 1 | Hadoop 2
Resource Management | MapReduce handles resource management as well as data processing; this extra workload on MapReduce affects performance | YARN handles resource management (it allocates the resources and keeps everything running); MRv2 handles data processing
Scalability | Limited (up to ~4,000 nodes, ~40,000 tasks) | Highly scalable (tens of thousands of nodes)
Job Types Supported | Only MapReduce | MapReduce + other frameworks (e.g., Spark)
Fault Tolerance | Handled by JobTracker (single point of failure); single master and multiple slaves | Better fault tolerance via the YARN architecture; multiple masters and multiple slaves
Cluster Utilization | Less efficient (resource contention) | Improved utilization with container-based execution
Multi-tenancy | Not supported | Supported (multiple apps/users share the cluster)
Flexibility | Tightly coupled MapReduce engine | Decoupled processing model via YARN
1. Components: In Hadoop 1 we have MapReduce but Hadoop 2 has YARN(Yet Another
Resource Negotiator) and MapReduce version 2.
Hadoop 1 | Hadoop 2
HDFS | HDFS
MapReduce | YARN / MRv2
2. Daemons:
Hadoop 1 | Hadoop 2
NameNode | NameNode
DataNode | DataNode
Secondary NameNode | Secondary NameNode
JobTracker | ResourceManager
TaskTracker | NodeManager
3. Working:
In Hadoop 1, HDFS is used for storage and, on top of it, MapReduce handles both resource management and data processing. Because of this extra workload, MapReduce's performance suffers.
In Hadoop 2, HDFS is again used for storage, and on top of HDFS sits YARN, which handles resource management: it allocates the resources and keeps everything running.
4. Limitations: Hadoop 1 uses a master-slave architecture consisting of a single master and multiple slaves. If the master node crashes then, regardless of how good the slave nodes are, the cluster is lost, and rebuilding it (copying system files, image files, etc. onto another system) is very time-consuming, which organizations cannot tolerate today. Hadoop 2 is also a master-slave architecture, but it supports multiple masters (i.e. active NameNodes and standby NameNodes) and multiple slaves. If the active master node crashes, a standby master node takes over; you can configure multiple active-standby combinations. Hadoop 2 therefore eliminates the single point of failure.
7. Hadoop Daemons
NameNode: Manages metadata in HDFS.
DataNode: Stores actual data blocks.
Secondary NameNode: Merges edit logs with the fsimage (periodic checkpointing).
ResourceManager: Manages cluster resources (Hadoop 2).
NodeManager: Manages node-level tasks (Hadoop 2).
JobTracker/TaskTracker: Used in Hadoop 1 for job execution.
1. NameNode
The NameNode runs on the master system. Its primary purpose is to manage all the metadata. Metadata is the list of files stored in HDFS (Hadoop Distributed File System). As we know, data is stored in the form of blocks in a Hadoop cluster, so the metadata records which DataNode (i.e. which location) holds each block of a file. All information regarding the logs of transactions happening in the Hadoop cluster (when or who read/wrote the data) is also kept in the metadata, which is held in memory.
Features:
It never stores the data that is present in the files themselves.
Because the NameNode runs on the master system, the master should have good processing power and more RAM than the slaves.
It stores information about the DataNodes, such as their block IDs and number of blocks.
2. DataNode
The DataNode runs on the slave systems. The NameNode always instructs the DataNodes on storing the data. A DataNode is a program running on a slave system that serves read/write requests from clients. Since the data is stored on the DataNodes, they should have a large storage capacity to hold more data.
3. Secondary NameNode
The Secondary NameNode takes periodic (e.g. hourly) backups, or checkpoints, of the NameNode's metadata and stores them in a file named fsimage. If the Hadoop cluster fails or crashes, this file can be transferred to a new system, the metadata can be loaded there, a new master can be created from it, and the cluster can be brought back up correctly.
This is the benefit of the Secondary NameNode. In Hadoop 2, the High Availability and Federation features reduce the importance of the Secondary NameNode.
Major functions of the Secondary NameNode:
It merges the edit logs and the fsimage from the NameNode.
It periodically reads the metadata from the NameNode's RAM and writes it to disk.
Because the Secondary NameNode keeps track of checkpoints in the Hadoop Distributed File System, it is also known as the checkpoint node.
Hadoop Daemon | Default Port
NameNode | 50070
DataNode | 50075
Secondary NameNode | 50090
These ports can be configured manually in the hdfs-site.xml and mapred-site.xml files.
4. Resource Manager
The ResourceManager, also known as the global master daemon, runs on the master system. It manages the resources for the applications running in a Hadoop cluster and mainly consists of two components:
A. ApplicationsManager
B. Scheduler
The ApplicationsManager is responsible for accepting job submissions from clients and for negotiating a container on a slave node in the Hadoop cluster to host the ApplicationMaster. The Scheduler is responsible for allocating resources to the applications running in the Hadoop cluster.
How to start the ResourceManager?
yarn-daemon.sh start resourcemanager
How to stop the ResourceManager?
yarn-daemon.sh stop resourcemanager
5. Node Manager
The NodeManager runs on the slave systems and manages the resources (memory and disk) available on its node. Each slave node in a Hadoop cluster runs a single NodeManager daemon. It also sends this monitoring information to the ResourceManager.
How to start the NodeManager?
yarn-daemon.sh start nodemanager
How to stop the NodeManager?
yarn-daemon.sh stop nodemanager
In a Hadoop cluster, the ResourceManager and NodeManager web interfaces can be reached at URLs of the form http://<hostname>:<port>, using the ports below.
Hadoop Daemon | Default Port
ResourceManager | 8088
NodeManager | 8042
8. HDFS Commands
Some common commands include:
hdfs dfs -ls /: List a directory
hdfs dfs -put <local file> /: Upload a file
hdfs dfs -get /<file> .: Download a file
hdfs dfs -rm /<file>: Delete a file
hdfs dfs -mkdir <folder name>: Create a directory
jps: List the Hadoop daemons (Java processes) running on the machine
hdfs dfs -touchz <file_path>: Create an empty file
copyFromLocal (or) put: To copy files/folders from the local file system to the HDFS store. This is the most important command. Local file system means the files present on the OS. Syntax:
hdfs dfs -copyFromLocal <local file path> <dest(present on hdfs)>
cat: To print file contents. Syntax:
bin/hdfs dfs -cat <path>
copyToLocal (or) get: To copy files/folders from the HDFS store to the local file system. Syntax:
bin/hdfs dfs -copyToLocal <srcfile(on hdfs)> <local file dest>
cp: This command is used to copy files within HDFS. Let's copy the folder geeks to geeks_copied. Syntax:
bin/hdfs dfs -cp <src(on hdfs)> <dest(on hdfs)>
mv: This command is used to move files within HDFS. Let's cut-paste a file from the geeks folder to geeks_copied. Syntax:
bin/hdfs dfs -mv <src(on hdfs)> <dest(on hdfs)>
rmr: This command deletes files from HDFS recursively. It is a very useful command when you want to delete a non-empty directory. Syntax:
bin/hdfs dfs -rmr <filename/directoryName>
1. I/O Formats in MapReduce
InputFormat: Defines how input files are split and read.
➤ Examples: TextInputFormat, KeyValueInputFormat,
SequenceFileInputFormat
OutputFormat: Defines how output data is written.
➤ Examples: TextOutputFormat, SequenceFileOutputFormat
Purpose: Converts input into key-value pairs for processing and formats output from
Reducer.
In MapReduce, both input and output data are formatted as key-value
pairs. Input formats define how input data is read and converted into key-
value pairs, while output formats specify how the processed data is
written. These formats determine how the MapReduce framework interacts
with the input and output data sources.
Input Formats:
TextInputFormat:
Reads input data as lines, with the line's byte offset as the key and the line content
as the value. This is the default input format.
SequenceFileInputFormat:
Reads binary data, commonly used when processing data from other MapReduce
jobs.
KeyValueTextInputFormat:
Parses input data as key-value pairs separated by a delimiter, often used for
structured data.
Other Input Formats:
Other input formats exist for specific data sources, such as database tables or
HBase.
Output Formats:
TextOutputFormat: Writes output as plain text files, where keys and values
are separated by a tab character by default.
SequenceFileOutputFormat: Writes binary output, useful for passing data
between MapReduce jobs.
DBOutputFormat: Writes output to relational databases or HBase tables.
Other Output Formats: Output formats can be customized to write to various
data sinks, including other Hadoop file systems or cloud storage.
Key-Value Pairs:
The MapReduce framework operates on key-value pairs, where both the key and the value can be any serializable object.
The specific types of keys and values used in a MapReduce job depend on
the job's logic and the chosen input and output formats.
The key and value classes must implement the Writable interface to allow the
framework to serialize and deserialize them.
Additionally, key classes often implement the WritableComparable interface to
facilitate sorting by the framework.
In summary: Input and output formats are essential components of
MapReduce that determine how data is read, processed, and written. They
define the structure and format of the data that the MapReduce framework
uses to perform distributed processing.
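As a rough illustration of the difference between the two text-based input formats, the sketch below shows how the same line would be presented to the Mapper as a key-value pair (the sample line, offset, and tab delimiter are assumptions; in a real job the format class is set in the driver configuration):

# How one input line becomes a key-value pair under two input formats
line = "user42\tclicked product 17"   # sample record with a tab delimiter
byte_offset = 0                       # position of the line within the file

# TextInputFormat: key = byte offset of the line, value = the whole line
text_input_pair = (byte_offset, line)

# KeyValueTextInputFormat: key = text before the delimiter, value = the rest
key, _, value = line.partition("\t")
key_value_pair = (key, value)

print(text_input_pair)   # (0, 'user42\tclicked product 17')
print(key_value_pair)    # ('user42', 'clicked product 17')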
🔗 2. Map Side Join
Definition: A join operation performed in the Map phase, avoiding the shuffle and
sort phases.
Best For: When one dataset is small enough to fit into memory (lookup table).
Efficiency: Faster and more efficient than reduce-side join.
🔗 3. Reduce Side Join
Definition: Join logic executed during the Reduce phase after data is shuffled and
sorted.
Best For: Large datasets or when both datasets are dynamic.
Drawback: Requires network transfer and more sorting — less efficient.
🔁 4. Secondary Sorting
Definition: Sorting values within the same key before they reach the Reducer.
Use Case: When Reducer needs values in a specific order (e.g., time-series data).
Implementation: Requires a custom partitioner, sort comparator, and grouping
comparator.
🔁 Secondary Sorting – Explained
✅ Definition
Secondary Sorting is a technique in Hadoop MapReduce that allows sorting values
associated with a key before they reach the Reducer.
It helps when the Reducer logic depends on the order of values for each key.
🧠 When to Use
When the order of values matters for computation (e.g., finding top-N,
chronological order).
Common in time-series data, log processing, or ranking problems.
🔍 How It Works
1. Composite key: A key that combines both the main key and the sort key (e.g.,
(userID, timestamp)).
2. Custom Partitioner: Ensures all records with the same primary key go to the same
Reducer.
3. Custom Sort Comparator: Sorts composite keys by both primary and secondary
parts.
4. Custom Grouping Comparator: Groups only by the primary key so the Reducer
sees values in sorted order.
📦 Example Use Case
Sort all transactions (amount) for each customer_id by transaction date before aggregation.
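A minimal local simulation of what the composite key, sort comparator, and grouping comparator achieve, using made-up transaction records:

from itertools import groupby

# Made-up (customer_id, transaction_date, amount) records
records = [
    ("C1", "2024-03-02", 50),
    ("C2", "2024-01-15", 20),
    ("C1", "2024-01-10", 75),
    ("C2", "2024-02-01", 10),
]

# Composite key (customer_id, transaction_date): sorting on both parts is
# what the custom sort comparator does during the shuffle phase
records.sort(key=lambda r: (r[0], r[1]))

# Grouping only by the primary key (customer_id), like the grouping comparator,
# so each "reducer call" sees that customer's transactions already in date order
for customer_id, txns in groupby(records, key=lambda r: r[0]):
    print(customer_id, [(date, amount) for _, date, amount in txns])
# C1 [('2024-01-10', 75), ('2024-03-02', 50)]
# C2 [('2024-01-15', 20), ('2024-02-01', 10)]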
✅ Advantages
Enables ordered processing of values in the Reducer.
Essential for top-k queries, trend analysis, and sequential decision tasks.
❌ Limitations
Requires custom code (comparators and partitioner).
Slightly more complex than standard MapReduce.
5. Pipelining MapReduce Jobs
Definition: Output of one MapReduce job is used as input to the next job in a
sequence.
Purpose: Enables complex data processing workflows.
Tools: Achieved through job chaining in code or using tools like Apache Oozie.
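As an illustration of job chaining in code, the sketch below uses the Python mrjob library (an assumption, not a tool named above): the output of the first MRStep (word counts) becomes the input of the second MRStep, which picks the most frequent word.

# Chaining two MapReduce steps with mrjob: step 1 counts words,
# step 2 consumes step 1's output and finds the most frequent word
from mrjob.job import MRJob
from mrjob.step import MRStep

class MostFrequentWord(MRJob):

    def steps(self):
        return [
            MRStep(mapper=self.mapper_words, reducer=self.reducer_count),
            MRStep(reducer=self.reducer_max),
        ]

    def mapper_words(self, _, line):
        for word in line.split():
            yield word.lower(), 1

    def reducer_count(self, word, counts):
        # Emit everything under one key so the next step sees all totals together
        yield None, (sum(counts), word)

    def reducer_max(self, _, count_word_pairs):
        yield max(count_word_pairs)

if __name__ == "__main__":
    MostFrequentWord.run()

Running the script locally (python <script> <input file>) or on a cluster (with the -r hadoop runner) executes the two steps in sequence, with the intermediate output handled automatically; Apache Oozie or plain driver code that feeds one job's output directory to the next job achieves the same chaining.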
🔗 Map Side Join – Explained
✅ Definition
A Map Side Join is a join operation performed entirely in the Map phase, without using
the Reducer. It allows combining two datasets before the shuffle and sort stage, improving
performance.
🧠 When to Use
One of the datasets is small enough to fit into memory.
The larger dataset is sorted and partitioned in the same way as the smaller dataset.
You want to avoid network overhead and improve speed.
🔍 How It Works
1. The small dataset is loaded into memory (e.g., a HashMap) at the Mapper’s setup
phase.
2. Each record of the large dataset is streamed through the Mapper.
3. The Mapper performs a lookup and join using the in-memory data.
4. The output is written directly—no need for a Reduce phase.
📦 Example Use Case
Joining a large transaction dataset with a small customer lookup file (e.g., customer_id →
customer_name).
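A hedged sketch of that use case as a Hadoop Streaming mapper in Python: the small lookup file (here assumed to be customers.csv, e.g. shipped to every mapper via the distributed cache) is loaded into a dictionary during setup, and each transaction streamed on standard input is joined against it, with no reduce phase at all.

# map_side_join_mapper.py - joins a large transaction stream against a small
# in-memory customer lookup; file and field names are assumptions for illustration
import csv
import sys

# Setup: load the small dataset (customer_id -> customer_name) into memory
customers = {}
with open("customers.csv", newline="") as f:
    for customer_id, customer_name in csv.reader(f):
        customers[customer_id] = customer_name

# Map: each input line of the large dataset is "customer_id,amount";
# emit the joined record directly - no shuffle, sort, or reducer needed
for line in sys.stdin:
    customer_id, amount = line.strip().split(",", 1)
    name = customers.get(customer_id, "UNKNOWN")
    print(f"{customer_id}\t{name}\t{amount}")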
✅ Advantages
Faster execution (no shuffle/sort).
Less resource-intensive.
Simple to implement when small dataset is static.
❌ Limitations
Not suitable when both datasets are large.
Requires careful partitioning and sorting if pre-joining sorted files.
🔗 Reduce Side Join – Explained
✅ Definition
A Reduce Side Join is a join operation that takes place during the Reduce phase. Both
datasets are independently processed by the Mappers and then shuffled and sorted by key
before being joined in the Reducer.
🧠 When to Use
Both datasets are large or dynamic.
Datasets cannot be loaded into memory.
You need a general-purpose join regardless of dataset size.
🔍 How It Works
1. Both datasets are passed through separate Mappers.
2. Each Mapper tags the records with an identifier (e.g., A or B) and emits them using
the join key.
3. During the shuffle phase, records with the same key from both datasets are grouped
together.
4. The Reducer receives all matching records and performs the join logic.
📦 Example Use Case
Joining a large orders dataset with a large products dataset using product_id as the join
key.
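A hedged sketch of the reducer side of such a join with Hadoop Streaming: the mappers are assumed to tag each record with its source (P for products, O for orders) and emit product_id as the key, so after the shuffle and sort all records for one product arrive together.

# reduce_side_join_reducer.py - joins tagged records on product_id.
# Assumed mapper output format (tab-separated key and value):
#   <product_id>\tP,<product_name>        from the products dataset
#   <product_id>\tO,<order_id>,<quantity> from the orders dataset
import sys
from itertools import groupby

def keyed_lines(stream):
    for line in stream:
        key, value = line.rstrip("\n").split("\t", 1)
        yield key, value

# After shuffling/sorting, all values for a given product_id are adjacent
for product_id, group in groupby(keyed_lines(sys.stdin), key=lambda kv: kv[0]):
    product_name = None
    orders = []
    for _, value in group:
        tag, _, payload = value.partition(",")
        if tag == "P":
            product_name = payload
        else:
            orders.append(payload)
    # Join: pair every order with the matching product record
    for order in orders:
        print(f"{product_id}\t{product_name}\t{order}")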
✅ Advantages
Works with any dataset size.
Handles complex joins.
Does not require pre-sorting or partitioning.
❌ Limitations
Slower due to shuffle and sort phase.
Higher network overhead.
More resource-intensive than Map Side Join.