
BIG DATA - COMPLETE SEMESTER NOTES

UNIT – I: INTRODUCTION TO BIG DATA


Types and Classification of Digital Data

Types of Digital Data:

1. Structured Data
   • Organized in tabular format.
   • Stored in RDBMS.
   • Examples: bank transactions, sensor logs.

2. Semi-Structured Data
   • Partially organized.
   • Does not conform to formal data models.
   • Examples: XML, JSON, NoSQL documents.

3. Unstructured Data
   • No fixed format.
   • Examples: emails, audio, video, social media content.

Classification of Digital Data:

• Human-generated Data: Emails, social media posts.
• Machine-generated Data: Sensor data, server logs.
• Metadata: Data about other data.

Introduction to Big Data

Evolution of Big Data:

• Emerged due to exponential growth of internet, mobile data, IoT.


• Traditional systems failed to process unstructured or huge volumes of data.

Definition of Big Data:

Big Data is defined by the 5 V's:

• Volume – Large amounts of data.
• Velocity – Speed of data generation and processing.
• Variety – Different types of data (text, video, logs).
• Veracity – Trustworthiness of the data.
• Value – Useful insights extracted from data.

Traditional BI vs Big Data

Feature        Traditional BI   Big Data
-------        --------------   --------
Storage        GB to TB         TB to PB
Data Type      Structured       All types
Architecture   Centralized      Distributed
Processing     Batch            Batch + real-time
Tools          SQL, OLAP        Hadoop, Spark

Coexistence of Big Data and Data Warehouse


• Big Data complements data warehouses.
• Warehouses handle structured historical data.
• Big Data handles real-time and semi/unstructured data.

Big Data Analytics

What It Is:

• Advanced techniques to extract actionable insights from huge and diverse data.

What It Isn’t:

• Not just collecting massive data or using fast computers.


• It's not only for data scientists.

Why the Sudden Hype:

• Cost-effective storage.
• Real-time decisions.
• Cloud computing.

Classification of Analytics:

1. Descriptive – What happened?
2. Diagnostic – Why did it happen?
3. Predictive – What will happen?
4. Prescriptive – What action should be taken?

Challenges for Businesses:

• Poor data quality.


• Lack of skilled professionals.
• Integration with existing systems.
• Privacy and security.

Importance of Big Data Analytics:

• Customer behavior analysis.


• Fraud detection.
• Operational efficiency.
• Real-time alerts.

Data Science and Terminologies

Data Science:

• Interdisciplinary field.
• Combines statistics, machine learning, data engineering, domain expertise.

Important Terminologies:

• HDFS: Distributed file storage.
• MapReduce: Batch processing framework.
• Hive: SQL-based query tool.
• Pig: Dataflow scripting language.
• Spark: In-memory data processing engine.
• Flume: Ingests streaming log and event data.
• Sqoop: Transfers data between RDBMS and Hadoop.
• YARN: Resource manager in Hadoop.

UNIT – II: HADOOP ECOSYSTEM


Features of Hadoop:
• Open-source.

• Highly scalable.
• Fault-tolerant.
• Runs on commodity hardware.
• Data replication for fault recovery.

Key Advantages:
• Cost-effective.
• Handles structured, semi-structured, and unstructured data.
• Supports multiple languages (Java, Python, etc.).
• Ecosystem includes various tools for different tasks.

Versions of Hadoop:
• Hadoop 1.x: Single NameNode, scalability issues.
• Hadoop 2.x: Introduced YARN, better resource management.
• Hadoop 3.x: Erasure coding, containerization support, better performance.

Hadoop Ecosystem Overview:


• HDFS – Storage layer.
• MapReduce – Processing layer.
• YARN – Resource manager.
• Hive – SQL-like queries.
• Pig – Scripting language.
• HBase – Columnar storage DB.
• Oozie – Workflow scheduler.
• Flume – Log ingestion.
• Sqoop – Transfers data between RDBMS and Hadoop.

Distributions:
• Cloudera, Hortonworks, MapR, Amazon EMR.

Need for Hadoop:


• Traditional RDBMSs can’t handle high volume and variety.
• Provides distributed storage and processing.

RDBMS vs Hadoop

Aspect        RDBMS        Hadoop
------        -----        ------
Data Types    Structured   All types
Schema        Fixed        Dynamic
Scalability   Vertical     Horizontal
Cost          Expensive    Low-cost (commodity hardware)
Real-time     Possible     Not in MapReduce (Spark preferred)

Distributed Computing Challenges:


• Node failure.
• Network latency.
• Synchronization.
• Load balancing.

History of Hadoop:
• Inspired by Google's GFS and MapReduce papers.
• Created by Doug Cutting and Mike Cafarella.
• Yahoo! adopted it and funded its development.

HDFS:
• Master-slave architecture.
• NameNode: Metadata.
• DataNodes: Store blocks.
• Replication factor (default = 3).
• Designed for write-once, read-many workloads.
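The block-splitting and replication ideas above can be sketched as a toy in Python. The block size, the datanode names, and the round-robin placement below are illustrative assumptions, not HDFS's real behavior (the default block size is 128 MB and placement is rack-aware):

```python
# Toy sketch of HDFS-style block splitting and replica placement.
# Block size, node names, and round-robin policy are illustrative
# assumptions; real HDFS uses 128 MB blocks and rack awareness.

BLOCK_SIZE = 4                      # bytes per block (toy value)
REPLICATION = 3                     # default HDFS replication factor
DATANODES = ["dn1", "dn2", "dn3", "dn4"]

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split a file's bytes into fixed-size blocks (last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks: int, nodes=DATANODES, replication: int = REPLICATION):
    """Assign each block to `replication` distinct datanodes, round-robin.

    Returns NameNode-style metadata: block index -> list of datanodes.
    """
    placement = {}
    for b in range(num_blocks):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement

blocks = split_into_blocks(b"hello hdfs!")
metadata = place_replicas(len(blocks))
print(len(blocks), metadata[0])
```

This mirrors the master-slave split: the metadata dictionary lives on the NameNode, while the block contents would live only on the DataNodes.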

UNIT – III: PROCESSING DATA WITH HADOOP & NOSQL

MapReduce Programming

Introduction:

• Programming model for distributed processing of large datasets.

Components:

• Mapper: Processes input data and emits key-value pairs.
• Reducer: Aggregates values based on keys from the mapper.
• Combiner: Optional local reducer to optimize performance.
• Partitioner: Decides which reducer a key-value pair should go to.
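The four components above can be illustrated with a toy, single-process word count in Python. This is only a simulation of the flow (real Hadoop distributes these stages across nodes, with Java APIs); the function names `mapper`, `combiner`, `partitioner`, `reducer`, and `run_job` are made up for the sketch:

```python
from collections import defaultdict

# Toy single-process simulation of the MapReduce flow:
# mapper -> combiner -> shuffle (via partitioner) -> reducer.

def mapper(line):
    """Emit a (word, 1) pair for every word in the input line."""
    for word in line.split():
        yield word.lower(), 1

def combiner(pairs):
    """Locally pre-aggregate counts to cut shuffle traffic."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return local.items()

def partitioner(key, num_reducers):
    """Decide which reducer receives this key (hash partitioning)."""
    return hash(key) % num_reducers

def reducer(key, values):
    """Aggregate all counts for one key."""
    return key, sum(values)

def run_job(lines, num_reducers=2):
    # Shuffle: group combined pairs by reducer bucket, then by key.
    shuffled = [defaultdict(list) for _ in range(num_reducers)]
    for line in lines:
        for key, value in combiner(mapper(line)):
            shuffled[partitioner(key, num_reducers)][key].append(value)
    # Reduce phase: each bucket is processed independently.
    result = {}
    for bucket in shuffled:
        for key, values in bucket.items():
            k, v = reducer(key, values)
            result[k] = v
    return result

print(run_job(["big data big ideas", "data wins"]))
```

Note how the combiner runs on the map side: it shrinks `("big", 1), ("big", 1)` into `("big", 2)` before anything crosses the (simulated) network, which is exactly its role in Hadoop.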

NoSQL Databases

Introduction:

• Non-relational databases designed for horizontal scalability and flexible data models.

Types:

1. Key-Value Stores (e.g., Redis, Riak)
2. Document Stores (e.g., MongoDB, CouchDB)
3. Column Stores (e.g., Cassandra, HBase)
4. Graph Databases (e.g., Neo4j)
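To make the first two models above concrete, here is a minimal Python sketch contrasting key-value lookup with document-style querying. The records, field names, and the `find` helper are hypothetical; real systems (Redis, MongoDB) offer far richer operations:

```python
# Toy contrast of two NoSQL data models using plain Python objects.
# Records and the `find` helper are made up for illustration.

# 1. Key-value: the store treats the value as an opaque blob,
#    retrievable only by its exact key.
kv_store = {}
kv_store["user:42"] = '{"name": "Alice", "age": 25}'

# 2. Document: the store understands the value's structure,
#    so queries can reach inside individual fields.
doc_store = [
    {"_id": 1, "name": "Alice", "age": 25, "tags": ["admin"]},
    {"_id": 2, "name": "Bob",   "age": 31, "tags": []},
]

def find(collection, predicate):
    """Minimal MongoDB-style find(): filter documents by a predicate."""
    return [doc for doc in collection if predicate(doc)]

over_30 = find(doc_store, lambda d: d["age"] > 30)
print(over_30[0]["name"])   # Bob
```

The practical difference: the key-value store cannot answer "who is older than 30?" without fetching and parsing every value, while the document store can filter on fields directly, and both scale horizontally by sharding on the key or `_id`.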

Advantages:

• Schema-free
• Horizontal scaling
• High performance
• Better handling of unstructured data

Use in Industry:

• Real-time web apps


• E-commerce
• Social media analytics
• IoT applications

SQL vs NoSQL vs NewSQL

Feature          SQL               NoSQL                          NewSQL
-------          ---               -----                          ------
Schema           Fixed             Dynamic                        Fixed
Scalability      Vertical          Horizontal                     Horizontal
ACID Support     Full              Limited                        Full
Query Language   SQL               Varies                         SQL
Ideal for        Structured data   Unstructured/semi-structured   OLTP + Big Data

UNIT – IV: MONGODB
Necessity of MongoDB
• High availability and scalability
• Schema flexibility
• Rich querying and indexing capabilities

Terms in MongoDB vs RDBMS

MongoDB      RDBMS
-------      -----
Document     Row
Collection   Table
Field        Column
Index        Index
_id          Primary Key

Datatypes in MongoDB
• String, Integer, Double, Boolean
• Array
• ObjectId
• Embedded documents
• Null, Date

MongoDB Query Language

// Insert a document
db.users.insertOne({ name: "Alice", age: 25 });

// Find documents where age > 20
db.users.find({ age: { $gt: 20 } });

// Update the first matching document
db.users.updateOne({ name: "Alice" }, { $set: { age: 26 } });

// Delete the first matching document
db.users.deleteOne({ name: "Alice" });

UNIT – V: R PROGRAMMING
Introduction to R
• Statistical computing language
• Open-source and powerful for data analysis and visualization

Operators in R
• Arithmetic: +, -, *, /, ^
• Relational: <, <=, >, >=, ==, !=
• Logical: &, |, !

Control Statements and Functions


• if, else, for, while, repeat

add <- function(x, y) {
  return(x + y)
}

Data Structures
• Vectors: One-dimensional
• Matrices: Two-dimensional
• Lists: Collection of elements
• Data Frames: Table-like structure
• Factors: Categorical data
• Tables: Frequency counts

Input and Output

name <- readline("Enter your name: ")
write.csv(df, "output.csv")  # df: a data frame to export

Graphs in R
• plot(), barplot(), hist(), boxplot(), pie()

Apply Family
• apply(), lapply(), sapply(), tapply(), mapply()
• Used for repetitive operations on data structures

END OF SEMESTER NOTES
