
Big Data and Spark

What is Big Data?


• Definition: A term that describes massive volumes of data that
traditional systems struggle to handle due to size, complexity, and
speed.
• Scale: Approximately 2.5 quintillion bytes
(2,500,000,000,000,000,000) of data generated daily worldwide.
◦ Example: Imagine every photo uploaded to Instagram, every tweet,
and every Google search in a single day—combined, that’s Big Data!
• Why it matters: Businesses use Big Data to uncover trends, predict
behaviors, and make smarter decisions.
The 3 Vs of Big Data (+1 Bonus V)
1. Volume: The sheer scale of data.
◦ Example: Netflix storing petabytes of user watch history.
2. Variety: Different types and forms of data.
◦ Structured: Organized data like spreadsheets or databases (e.g., MySQL customer
records, CSV files).
◦ Semi-Structured: Partially organized, like JSON or XML files (e.g., API responses).
◦ Unstructured: Unorganized data like audio (podcasts), video (YouTube clips), images
(memes), and log files (server logs).
3. Velocity: The speed at which data is generated and processed.
◦ Examples:
▪ 900 million photos uploaded daily on Facebook.
▪ 600 million tweets posted on Twitter daily.
▪ 0.5 million hours of video uploaded to YouTube daily.
▪ 3.5 billion searches on Google daily.
4. Veracity (The 4th V): The uncertainty, noise, or poor quality of data.
◦ Example: Social media posts with typos, incomplete sensor data from IoT devices, or
outdated customer records.
Fun Fact: Some experts also talk about a 5th V—Value—extracting meaningful insights
from data.
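
To make the three Variety categories concrete, here is a minimal Python sketch of loading each kind of data. The file names (customers.csv, response.json, server.log) are hypothetical placeholders, not files from this session:

import csv
import json

# Structured: fixed rows and columns with a known schema.
with open("customers.csv", newline="") as f:  # hypothetical file
    customers = list(csv.DictReader(f))  # each row becomes a dict with known fields

# Semi-structured: self-describing keys, but a flexible shape.
with open("response.json") as f:  # hypothetical file
    payload = json.load(f)  # nested dicts and lists; the schema may vary per record

# Unstructured: free-form content with no predefined schema.
with open("server.log") as f:  # hypothetical file
    raw_lines = f.readlines()  # plain text; any structure must be inferred later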
Why Big Data?
• Purpose: To process and analyze massive datasets that traditional
systems (e.g., relational databases) can’t handle efficiently.
• Real-World Use Cases:
◦ E-commerce: Amazon recommending products based on your
browsing history.
◦ Healthcare: Analyzing patient data to predict disease outbreaks.
◦ Finance: Detecting fraudulent transactions in real time.

Key Insight: Big Data isn’t just about size—it’s about unlocking hidden
patterns and insights.
Big Data System Requirements

1. Store: Must store massive amounts of data reliably.


◦ Example: Storing years of social media posts or IoT sensor readings.
2. Process: Must process data quickly and efficiently.
◦ Example: Analyzing customer reviews to improve a product in
hours, not weeks.
3. Scale: Must grow seamlessly as data needs increase.
◦ Example: Adding more servers to handle Black Friday shopping
spikes.
Two Ways to Build a System
1. Monolithic:
◦ Definition: One powerful machine with lots of CPU, RAM, and
storage.
◦ Pros: Simple to set up initially.
◦ Cons:
▪ Hard to scale after hitting hardware limits.
▪ Adding resources (vertical scaling) doesn’t always double
performance.
◦ Example: A single supercomputer struggling to process a year’s
worth of Twitter data.
Two Ways to Build a System
2. Distributed:
◦ Definition: Many smaller machines working together as one system.
◦ Pros:
▪ Linear scalability (2x machines = ~2x performance).
▪ True horizontal scaling—add more machines as needed.
◦ Cons: More complex to manage.
◦ Example: Google’s search engine running on thousands of servers
worldwide.
◦ Key Takeaway: All modern Big Data systems (like Hadoop and
Spark) use distributed architecture.
What is Hadoop?
• Definition: An open-source framework
designed to solve Big Data problems by
enabling distributed storage and processing.
• Core Idea: Break data into smaller chunks,
store them across multiple machines, and
process them in parallel.
Hadoop Evolution
• 2003: Google publishes the Google File System (GFS) paper—how to
store massive datasets across many machines.
• 2004: Google releases the MapReduce paper—a programming model
for processing large datasets in parallel.
• 2006: Yahoo builds HDFS (Hadoop Distributed File System) and
MapReduce based on Google’s ideas.
• 2008: Hadoop becomes a top-level Apache open-source project, freely
available to all.
• 2013: Hadoop 2.0 introduces YARN and major performance upgrades.

Fun Fact: Hadoop is named after a toy elephant belonging to its creator
Doug Cutting’s son!
Hadoop Core Components
1. HDFS (Hadoop Distributed File System):
◦ Distributed storage system that splits data into blocks and spreads
them across multiple nodes.
◦ Example: A 1TB video file split into 128MB chunks stored on 10
machines.
2. YARN (Yet Another Resource Negotiator):
◦ Manages resources (CPU, memory) across the cluster and schedules
tasks.
◦ Example: Ensures one job doesn’t hog all the computing power.
3. MapReduce:
◦ A programming model for distributed data processing.
◦ Example: Counting word frequencies in a massive text file by
splitting the task across nodes (see the sketch below).
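
The classic illustration of the MapReduce model is word counting. Production Hadoop jobs are typically written in Java against the Hadoop API; the sketch below is only a simplified, single-machine Python analogue of the map and reduce phases, using made-up sample data:

from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in one input split.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle + reduce: group the pairs by word and sum the counts.
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    return dict(counts)

# Simulate two input splits that would normally sit on different nodes.
splits = ["big data big ideas", "big clusters process data"]
all_pairs = [pair for split in splits for pair in map_phase(split)]
print(reduce_phase(all_pairs))  # {'big': 3, 'data': 2, 'ideas': 1, ...}

In real Hadoop, the framework handles the splitting, shuffling, and distribution across nodes; the developer supplies only the map and reduce logic.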
Hadoop Ecosystem
• Hive: SQL-like tool for querying and analyzing data stored in HDFS.
◦ Example: Finding the most popular product in a sales dataset.
• Pig: Scripting language to process and transform data (great for
unstructured data).
◦ Example: Converting raw log files into a structured report.
• Sqoop: Transfers data between Hadoop and relational databases.
◦ Example: Importing customer data from MySQL into HDFS.
• HBase: NoSQL database for real-time, random access to data on HDFS.
◦ Example: Storing and querying live Twitter feeds.
• Oozie: Workflow scheduler to manage and automate Hadoop jobs.
◦ Example: Running a daily report generation job at midnight.
Introduction to Apache Spark
• Definition: A distributed, general-purpose, in-memory compute engine
designed for speed and flexibility.
• Key Features:
◦ Processes data in-memory (much faster than Hadoop’s disk-based
MapReduce).
◦ Plug & Play: Works with various systems:
▪ Storage: Local storage, HDFS, Amazon S3, etc.
▪ Resource Managers: YARN, Mesos, Kubernetes.
◦ Written in Scala, with official support for Java, Scala, Python, and
R.
• Why Spark?:
◦ Up to 100x faster than Hadoop MapReduce for certain tasks (e.g.,
iterative machine learning).
◦ Easier to use with high-level APIs.
Example: Analyzing live streaming data (e.g., stock market ticks) in real time.
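
For a taste of the high-level API, here is a minimal PySpark word count. This is a sketch assuming a local pyspark installation; the input path input.txt is a hypothetical placeholder:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local Spark session (on a real cluster this would go through
# YARN, Mesos, or Kubernetes instead of local[*]).
spark = SparkSession.builder.appName("WordCount").master("local[*]").getOrCreate()

# Read a text file, split each line into words, and count them in memory.
lines = spark.read.text("input.txt")  # hypothetical input file
words = lines.select(F.explode(F.split(F.col("value"), r"\s+")).alias("word"))
counts = words.groupBy("word").count().orderBy(F.col("count").desc())

counts.show(10)  # print the ten most frequent words
spark.stop()

The same code runs unchanged whether the data lives on local disk, HDFS, or Amazon S3; only the input path and cluster configuration change.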
Thank You
