BigData-Session1
BigData-Session1
Key Insight: Big Data isn’t just about size—it’s about unlocking hidden
patterns and insights.
fi
Big Data System Requirements
Fun Fact: Hadoop is named after a toy elephant belonging to its creator
Doug Cutting’s son!
Hadoop Core Components
1.HDFS (Hadoop Distributed File System):
◦ Distributed storage system that splits data into blocks and spreads
them across multiple nodes.
◦ Example: A 1TB video le split into 128MB chunks stored on 10
machines.
2.YARN (Yet Another Resource Negotiator):
◦ Manages resources (CPU, memory) across the cluster and schedules
tasks.
◦ Example: Ensures one job doesn’t hog all the computing power.
3.MapReduce:
◦ A programming model for distributed data processing.
◦ Example: Counting word frequencies in a massive text le by
splitting the task across nodes.
fi
fi
Hadoop Ecosystem
• Hive: SQL-like tool for querying and analyzing data stored in HDFS.
◦ Example: Finding the most popular product in a sales dataset.
• Pig: Scripting language to process and transform data (great for
unstructured data).
◦ Example: Converting raw log les into a structured report.
• Sqoop: Transfers data between Hadoop and relational databases.
◦ Example: Importing customer data from MySQL into HDFS.
• HBase: NoSQL database for real-time, random access to data on HDFS.
◦ Example: Storing and querying live Twitter feeds.
• Oozie: Work ow scheduler to manage and automate Hadoop jobs.
◦ Example: Running a daily report generation job at midnight.
fl
fi
Introduction to Apache Spark
• De nition: A distributed, general-purpose, in-memory compute engine
designed for speed and exibility.
• Key Features:
◦ Processes data in-memory (much faster than Hadoop’s disk-based
MapReduce).
◦ Plug & Play: Works with various systems:
▪ Storage: Local storage, HDFS, Amazon S3, etc.
▪ Resource Managers: YARN, Mesos, Kubernetes.
◦ Written in Scala, with of cial support for Java, Scala, Python, and
R.
• Why Spark?:
◦ Up to 100x faster than Hadoop MapReduce for certain tasks (e.g.,
iterative machine learning).
◦ Easier to use with high-level APIs.
Example: Analyzing live streaming data (e.g., stock market ticks) in real-
time.
fi
fl
fi
Thank You