
BIG DATA - COMPLETE SEMESTER NOTES

UNIT – I: INTRODUCTION TO BIG DATA


Types and Classification of Digital Data

Types of Digital Data:

1. Structured Data
   • Organized in tabular format.
   • Stored in RDBMS.
   • Examples: bank transactions, sensor logs.

2. Semi-Structured Data
   • Partially organized.
   • Does not conform to formal data models.
   • Examples: XML, JSON, NoSQL documents.

3. Unstructured Data
   • No fixed format.
   • Examples: emails, audio, video, social media content.

Classification of Digital Data:

• Human-generated Data: Emails, social media posts.
• Machine-generated Data: Sensor data, server logs.
• Metadata: Data about other data.

Introduction to Big Data

Evolution of Big Data:

• Emerged due to exponential growth of internet, mobile data, IoT.


• Traditional systems failed to process unstructured or huge volumes of data.

Definition of Big Data:

Big Data is defined by the 5 V's:

• Volume – Large amounts of data.
• Velocity – Speed of data generation and processing.
• Variety – Different types of data (text, video, logs).
• Veracity – Trustworthiness of the data.
• Value – Useful insights extracted from data.

Traditional BI vs Big Data

Feature        Traditional BI   Big Data
-------        --------------   --------
Storage        GB to TB         TB to PB
Data Type      Structured       All types
Architecture   Centralized      Distributed
Processing     Batch            Batch + real-time
Tools          SQL, OLAP        Hadoop, Spark

Coexistence of Big Data and Data Warehouse


• Big Data complements data warehouses.
• Warehouses handle structured historical data.
• Big Data handles real-time and semi/unstructured data.

Big Data Analytics

What It Is:

• Advanced techniques to extract actionable insights from huge and diverse data.

What It Isn’t:

• Not just collecting massive data or using fast computers.


• It's not only for data scientists.

Why the Sudden Hype:

• Cost-effective storage.
• Real-time decisions.
• Cloud computing.

Classification of Analytics:

1. Descriptive – What happened?
2. Diagnostic – Why did it happen?
3. Predictive – What will happen?
4. Prescriptive – What action should be taken?

Challenges for Businesses:

• Poor data quality.


• Lack of skilled professionals.
• Integration with existing systems.
• Privacy and security.

Importance of Big Data Analytics:

• Customer behavior analysis.


• Fraud detection.
• Operational efficiency.
• Real-time alerts.

Data Science and Terminologies

Data Science:

• Interdisciplinary field.
• Combines statistics, machine learning, data engineering, domain expertise.

Important Terminologies:

• HDFS: Distributed file storage.
• MapReduce: Batch processing framework.
• Hive: SQL-based query tool.
• Pig: Dataflow scripting language.
• Spark: In-memory data processing engine.
• Flume: Ingests streaming log and event data.
• Sqoop: Transfers data between RDBMS and Hadoop.
• YARN: Resource manager in Hadoop.

UNIT – II: HADOOP ECOSYSTEM


Features of Hadoop:
• Open-source.

• Highly scalable.
• Fault-tolerant.
• Runs on commodity hardware.
• Data replication for fault recovery.

Key Advantages:
• Cost-effective.
• Handles structured, semi-structured, and unstructured data.
• Supports multiple languages (Java, Python, etc.).
• Ecosystem includes various tools for different tasks.

Versions of Hadoop:
• Hadoop 1.x: Single NameNode, scalability issues.
• Hadoop 2.x: Introduced YARN, better resource management.
• Hadoop 3.x: Erasure coding, containerization support, better performance.

Hadoop Ecosystem Overview:


• HDFS – Storage layer.
• MapReduce – Processing layer.
• YARN – Resource manager.
• Hive – SQL-like queries.
• Pig – Scripting language.
• HBase – Columnar storage DB.
• Oozie – Workflow scheduler.
• Flume – Log ingestion.
• Sqoop – Transfers data between RDBMS and Hadoop.

Distributions:
• Cloudera, Hortonworks, MapR, Amazon EMR.

Need for Hadoop:


• Traditional RDBMSs can’t handle high volume and variety.
• Provides distributed storage and processing.

RDBMS vs Hadoop

Aspect        RDBMS        Hadoop
------        -----        ------
Data Types    Structured   All types
Schema        Fixed        Dynamic
Scalability   Vertical     Horizontal
Cost          Expensive    Low-cost (commodity hardware)
Real-time     Possible     Not in MapReduce (Spark preferred)

Distributed Computing Challenges:


• Node failure.
• Network latency.
• Synchronization.
• Load balancing.

History of Hadoop:
• Inspired by Google's GFS and MapReduce papers.
• Created by Doug Cutting and Mike Cafarella.
• Yahoo! adopted it and funded its development.

HDFS:
• Master-slave architecture.
• NameNode: Metadata.
• DataNodes: Store blocks.
• Replication factor (default = 3).
• Designed for write-once, read-many workloads.
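The block-splitting and replication ideas above can be sketched as a toy in Python. The block size, the datanode names, and the round-robin placement below are illustrative assumptions, not HDFS's real behavior (the default block size is 128 MB and placement is rack-aware):

```python
# Toy sketch of HDFS-style block splitting and replica placement.
# Block size, node names, and round-robin policy are illustrative
# assumptions; real HDFS uses 128 MB blocks and rack awareness.

BLOCK_SIZE = 4                      # bytes per block (toy value)
REPLICATION = 3                     # default HDFS replication factor
DATANODES = ["dn1", "dn2", "dn3", "dn4"]

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split a file's bytes into fixed-size blocks (last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks: int, nodes=DATANODES, replication: int = REPLICATION):
    """Assign each block to `replication` distinct datanodes, round-robin.

    Returns NameNode-style metadata: block index -> list of datanodes.
    """
    placement = {}
    for b in range(num_blocks):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement

blocks = split_into_blocks(b"hello hdfs!")
metadata = place_replicas(len(blocks))
print(len(blocks), metadata[0])
```

This mirrors the master-slave split: the metadata dictionary lives on the NameNode, while the block contents would live only on the DataNodes.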

UNIT – III: PROCESSING DATA WITH HADOOP & NOSQL

MapReduce Programming

Introduction:

• Programming model for distributed processing of large datasets.

Components:

• Mapper: Processes input data and emits key-value pairs.
• Reducer: Aggregates values based on keys from the mapper.
• Combiner: Optional local reducer to optimize performance.
• Partitioner: Decides which reducer a key-value pair should go to.
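The four components above can be illustrated with a toy, single-process word count in Python. This is only a simulation of the flow (real Hadoop distributes these stages across nodes, with Java APIs); the function names `mapper`, `combiner`, `partitioner`, `reducer`, and `run_job` are made up for the sketch:

```python
from collections import defaultdict

# Toy single-process simulation of the MapReduce flow:
# mapper -> combiner -> shuffle (via partitioner) -> reducer.

def mapper(line):
    """Emit a (word, 1) pair for every word in the input line."""
    for word in line.split():
        yield word.lower(), 1

def combiner(pairs):
    """Locally pre-aggregate counts to cut shuffle traffic."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return local.items()

def partitioner(key, num_reducers):
    """Decide which reducer receives this key (hash partitioning)."""
    return hash(key) % num_reducers

def reducer(key, values):
    """Aggregate all counts for one key."""
    return key, sum(values)

def run_job(lines, num_reducers=2):
    # Shuffle: group combined pairs by reducer bucket, then by key.
    shuffled = [defaultdict(list) for _ in range(num_reducers)]
    for line in lines:
        for key, value in combiner(mapper(line)):
            shuffled[partitioner(key, num_reducers)][key].append(value)
    # Reduce phase: each bucket is processed independently.
    result = {}
    for bucket in shuffled:
        for key, values in bucket.items():
            k, v = reducer(key, values)
            result[k] = v
    return result

print(run_job(["big data big ideas", "data wins"]))
```

Note how the combiner runs on the map side: it shrinks `("big", 1), ("big", 1)` into `("big", 2)` before anything crosses the (simulated) network, which is exactly its role in Hadoop.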

NoSQL Databases

Introduction:

• Non-relational databases designed for horizontal scalability and flexible data models.

Types:

1. Key-Value Stores (e.g., Redis, Riak)
2. Document Stores (e.g., MongoDB, CouchDB)
3. Column Stores (e.g., Cassandra, HBase)
4. Graph Databases (e.g., Neo4j)
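To make the first two models above concrete, here is a minimal Python sketch contrasting key-value lookup with document-style querying. The records, field names, and the `find` helper are hypothetical; real systems (Redis, MongoDB) offer far richer operations:

```python
# Toy contrast of two NoSQL data models using plain Python objects.
# Records and the `find` helper are made up for illustration.

# 1. Key-value: the store treats the value as an opaque blob,
#    retrievable only by its exact key.
kv_store = {}
kv_store["user:42"] = '{"name": "Alice", "age": 25}'

# 2. Document: the store understands the value's structure,
#    so queries can reach inside individual fields.
doc_store = [
    {"_id": 1, "name": "Alice", "age": 25, "tags": ["admin"]},
    {"_id": 2, "name": "Bob",   "age": 31, "tags": []},
]

def find(collection, predicate):
    """Minimal MongoDB-style find(): filter documents by a predicate."""
    return [doc for doc in collection if predicate(doc)]

over_30 = find(doc_store, lambda d: d["age"] > 30)
print(over_30[0]["name"])   # Bob
```

The practical difference: the key-value store cannot answer "who is older than 30?" without fetching and parsing every value, while the document store can filter on fields directly, and both scale horizontally by sharding on the key or `_id`.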

Advantages:

• Schema-free
• Horizontal scaling
• High performance
• Better handling of unstructured data

Use in Industry:

• Real-time web apps


• E-commerce
• Social media analytics
• IoT applications

SQL vs NoSQL vs NewSQL

Feature          SQL               NoSQL                          NewSQL
-------          ---               -----                          ------
Schema           Fixed             Dynamic                        Fixed
Scalability      Vertical          Horizontal                     Horizontal
ACID Support     Full              Limited                        Full
Query Language   SQL               Varies                         SQL
Ideal for        Structured data   Unstructured/semi-structured   OLTP + Big Data

UNIT – IV: MONGODB
Necessity of MongoDB
• High availability and scalability
• Schema flexibility
• Rich querying and indexing capabilities

Terms in MongoDB vs RDBMS

MongoDB      RDBMS
-------      -----
Document     Row
Collection   Table
Field        Column
Index        Index
_id          Primary Key

Datatypes in MongoDB
• String, Integer, Double, Boolean
• Array
• ObjectId
• Embedded documents
• Null, Date

MongoDB Query Language

// Insert a document
db.users.insertOne({ name: "Alice", age: 25 });

// Find documents where age > 20
db.users.find({ age: { $gt: 20 } });

// Update the first matching document
db.users.updateOne({ name: "Alice" }, { $set: { age: 26 } });

// Delete the first matching document
db.users.deleteOne({ name: "Alice" });

UNIT – V: R PROGRAMMING
Introduction to R
• Statistical computing language
• Open-source and powerful for data analysis and visualization

Operators in R
• Arithmetic: +, -, *, /, ^
• Relational: <, <=, >, >=, ==, !=
• Logical: &, |, !

Control Statements and Functions


• if, else, for, while, repeat

add <- function(x, y) {
  return(x + y)
}

Data Structures
• Vectors: One-dimensional
• Matrices: Two-dimensional
• Lists: Collection of elements
• Data Frames: Table-like structure
• Factors: Categorical data
• Tables: Frequency counts

Input and Output

name <- readline("Enter your name: ")
write.csv(df, "output.csv")  # df: a data frame to export

Graphs in R
• plot(), barplot(), hist(), boxplot(), pie()

Apply Family
• apply(), lapply(), sapply(), tapply(), mapply()
• Used for repetitive operations on data structures

END OF SEMESTER NOTES
