
Big Data and Spark

What is Big Data?


• Definition: A term that describes massive volumes of data that
traditional systems struggle to handle due to size, complexity, and
speed.
• Scale: Approximately 2.5 quintillion bytes
(2,500,000,000,000,000,000) of data generated daily worldwide.
◦ Example: Imagine every photo uploaded to Instagram, every tweet,
and every Google search in a single day—combined, that’s Big Data!
• Why it matters: Businesses use Big Data to uncover trends, predict
behaviors, and make smarter decisions.
The 3 Vs of Big Data (+1 Bonus V)
1. Volume: The sheer scale of data.
◦ Example: Netflix storing petabytes of user watch history.
2. Variety: Different types and forms of data.
◦ Structured: Organized data like spreadsheets or databases (e.g., MySQL customer
records, CSV files).
◦ Semi-Structured: Partially organized, like JSON or XML files (e.g., API responses).
◦ Unstructured: Unorganized data like audio (podcasts), video (YouTube clips), images
(memes), and log files (server logs).
3. Velocity: The speed at which data is generated and processed.
◦ Examples:
▪ 900 million photos uploaded daily on Facebook.
▪ 600 million tweets posted on Twitter daily.
▪ 0.5 million hours of video uploaded to YouTube daily.
▪ 3.5 billion searches on Google daily.
4. Veracity (The 4th V): The uncertainty, noise, or poor quality of data.
◦ Example: Social media posts with typos, incomplete sensor data from IoT devices, or
outdated customer records.
Fun Fact: Some experts also talk about a 5th V—Value—extracting meaningful insights
from data.
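
To make the three Variety categories concrete, here is a minimal Python sketch of loading each kind of data. The file names (customers.csv, response.json, server.log) are hypothetical placeholders, not files from this session:

import csv
import json

# Structured: fixed rows and columns with a known schema.
with open("customers.csv", newline="") as f:  # hypothetical file
    customers = list(csv.DictReader(f))  # each row becomes a dict with known fields

# Semi-structured: self-describing keys, but a flexible shape.
with open("response.json") as f:  # hypothetical file
    payload = json.load(f)  # nested dicts and lists; the schema may vary per record

# Unstructured: free-form content with no predefined schema.
with open("server.log") as f:  # hypothetical file
    raw_lines = f.readlines()  # plain text; any structure must be inferred later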
Why Big Data?
• Purpose: To process and analyze massive datasets that traditional
systems (e.g., relational databases) can’t handle efficiently.
• Real-World Use Cases:
◦ E-commerce: Amazon recommending products based on your
browsing history.
◦ Healthcare: Analyzing patient data to predict disease outbreaks.
◦ Finance: Detecting fraudulent transactions in real time.

Key Insight: Big Data isn’t just about size—it’s about unlocking hidden
patterns and insights.
Big Data System Requirements

1. Store: Must store massive amounts of data reliably.


◦ Example: Storing years of social media posts or IoT sensor readings.
2. Process: Must process data quickly and efficiently.
◦ Example: Analyzing customer reviews to improve a product in
hours, not weeks.
3. Scale: Must grow seamlessly as data needs increase.
◦ Example: Adding more servers to handle Black Friday shopping
spikes.
Two Ways to Build a System
1. Monolithic:
◦ Definition: One powerful machine with lots of CPU, RAM, and
storage.
◦ Pros: Simple to set up initially.
◦ Cons:
▪ Hard to scale after hitting hardware limits.
▪ Adding resources (vertical scaling) doesn’t always double
performance.
◦ Example: A single supercomputer struggling to process a year’s
worth of Twitter data.
Two Ways to Build a System
2. Distributed:
◦ Definition: Many smaller machines working together as one system.
◦ Pros:
▪ Linear scalability (2x machines = ~2x performance).
▪ True horizontal scaling—add more machines as needed.
◦ Cons: More complex to manage.
◦ Example: Google’s search engine running on thousands of servers
worldwide.
◦ Key Takeaway: All modern Big Data systems (like Hadoop and
Spark) use distributed architecture.
What is Hadoop?
• Definition: An open-source framework
designed to solve Big Data problems by
enabling distributed storage and processing.
• Core Idea: Break data into smaller chunks,
store them across multiple machines, and
process them in parallel.
Hadoop Evolution
• 2003: Google publishes the Google File System (GFS) paper—how to
store massive datasets across many machines.
• 2004: Google releases the MapReduce paper—a programming model
for processing large datasets in parallel.
• 2006: Yahoo builds HDFS (Hadoop Distributed File System) and
MapReduce based on Google’s ideas.
• 2008: Hadoop becomes a top-level Apache open-source project, freely
available to all.
• 2013: Hadoop 2.0 introduces YARN and major performance upgrades.

Fun Fact: Hadoop is named after a toy elephant belonging to its creator
Doug Cutting’s son!
Hadoop Core Components
1. HDFS (Hadoop Distributed File System):
◦ Distributed storage system that splits data into blocks and spreads
them across multiple nodes.
◦ Example: A 1TB video file split into 128MB chunks stored on 10
machines.
2. YARN (Yet Another Resource Negotiator):
◦ Manages resources (CPU, memory) across the cluster and schedules
tasks.
◦ Example: Ensures one job doesn’t hog all the computing power.
3. MapReduce:
◦ A programming model for distributed data processing.
◦ Example: Counting word frequencies in a massive text file by
splitting the task across nodes (see the sketch below).
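
The classic illustration of the MapReduce model is word counting. Production Hadoop jobs are typically written in Java against the Hadoop API; the sketch below is only a simplified, single-machine Python analogue of the map and reduce phases, using made-up sample data:

from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in one input split.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle + reduce: group the pairs by word and sum the counts.
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    return dict(counts)

# Simulate two input splits that would normally sit on different nodes.
splits = ["big data big ideas", "big clusters process data"]
all_pairs = [pair for split in splits for pair in map_phase(split)]
print(reduce_phase(all_pairs))  # {'big': 3, 'data': 2, 'ideas': 1, ...}

In real Hadoop, the framework handles the splitting, shuffling, and distribution across nodes; the developer supplies only the map and reduce logic.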
Hadoop Ecosystem
• Hive: SQL-like tool for querying and analyzing data stored in HDFS.
◦ Example: Finding the most popular product in a sales dataset.
• Pig: Scripting language to process and transform data (great for
unstructured data).
◦ Example: Converting raw log files into a structured report.
• Sqoop: Transfers data between Hadoop and relational databases.
◦ Example: Importing customer data from MySQL into HDFS.
• HBase: NoSQL database for real-time, random access to data on HDFS.
◦ Example: Storing and querying live Twitter feeds.
• Oozie: Workflow scheduler to manage and automate Hadoop jobs.
◦ Example: Running a daily report generation job at midnight.
Introduction to Apache Spark
• Definition: A distributed, general-purpose, in-memory compute engine
designed for speed and flexibility.
• Key Features:
◦ Processes data in-memory (much faster than Hadoop’s disk-based
MapReduce).
◦ Plug & Play: Works with various systems:
▪ Storage: Local storage, HDFS, Amazon S3, etc.
▪ Resource Managers: YARN, Mesos, Kubernetes.
◦ Written in Scala, with official support for Java, Scala, Python, and
R.
• Why Spark?:
◦ Up to 100x faster than Hadoop MapReduce for certain tasks (e.g.,
iterative machine learning).
◦ Easier to use with high-level APIs.
Example: Analyzing live streaming data (e.g., stock market ticks) in real time.
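
For a taste of the high-level API, here is a minimal PySpark word count. This is a sketch assuming a local pyspark installation; the input path input.txt is a hypothetical placeholder:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local Spark session (on a real cluster this would go through
# YARN, Mesos, or Kubernetes instead of local[*]).
spark = SparkSession.builder.appName("WordCount").master("local[*]").getOrCreate()

# Read a text file, split each line into words, and count them in memory.
lines = spark.read.text("input.txt")  # hypothetical input file
words = lines.select(F.explode(F.split(F.col("value"), r"\s+")).alias("word"))
counts = words.groupBy("word").count().orderBy(F.col("count").desc())

counts.show(10)  # print the ten most frequent words
spark.stop()

The same code runs unchanged whether the data lives on local disk, HDFS, or Amazon S3; only the input path and cluster configuration change.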
Thank You
