BIG DATA WITH HADOOP AND SPARK
Course Objectives:
1. Understand the concepts of big data and its impact on businesses.
2. Learn about the Hadoop ecosystem and its components, such as HDFS, MapReduce, Hive,
Pig, and HBase.
3. Gain hands-on experience with Hadoop and Spark, including writing applications and running
them on a cluster.
4. Learn about the different types of big data analytics and how to use Hadoop and Spark to
perform them.
5. Be able to apply Hadoop and Spark to real-world big data problems.
Course Outcomes:
At the end of the course, students will be able to:
CO1: Apply advanced data processing skills using the Hadoop and Spark ecosystems.
CO2: Design scalable data storage and management with HDFS.
CO3: Explain the distributed computing concepts underlying MapReduce, YARN, and Spark.
CO4: Build real-time data processing pipelines with Spark Streaming and Kafka.
UNIT-I
1. Introduction to Big Data and Hadoop (7 Hours)
(a) What is Big Data?
(b) The Rise of Bytes
(c) Data Explosion and its Sources
(d) Types of Data – Structured, Semi-structured, Unstructured data
(e) Characteristics of Big Data
(f) Limitations of Traditional Large-Scale Systems
(g) Use Cases for Big Data
(h) Challenges of Big Data
(i) Hadoop Introduction - What is Hadoop? Why Hadoop?
(j) Supported Operating Systems
(k) Organizations using Hadoop
(l) Hadoop Job Trends
(m) History of Hadoop
(n) Hadoop Core Components – MapReduce & HDFS
UNIT-II
2. HDFS Architecture (4 Hours)
(a) Regular File System v/s HDFS
(b) HDFS Architecture
(c) Components of HDFS – NameNode, DataNode, SecondaryNameNode
(d) HDFS Features – Fault Tolerance, Horizontal Scaling, Data Replication, Rack Awareness
(e) Anatomy of a file write on HDFS
(f) Anatomy of a file read on HDFS
(g) Hands-on with Hadoop HDFS, the Web UI, and Linux Terminal Commands
(h) HDFS File System Operations
(i) NameNode Metadata, File System Namespace, NameNode Operations
(j) Data Block Split
(k) Benefits of the Data Block Approach
(l) Topology, Data Replication Representation
(m) HDFS Programming Basics – Java API (see the sketch after this list)
(n) Hadoop Configuration API
(o) HDFS API Overview
(p) When Hadoop is not suitable
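To ground the Java API items above, here is a minimal sketch of basic HDFS file operations, written in Scala against Hadoop's FileSystem API; the NameNode address and file paths are placeholder assumptions:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object HdfsBasics {
      def main(args: Array[String]): Unit = {
        // Hadoop Configuration API: reads core-site.xml/hdfs-site.xml if present
        val conf = new Configuration()
        conf.set("fs.defaultFS", "hdfs://localhost:9000") // assumed NameNode address

        val fs = FileSystem.get(conf)

        // Write a small file (the NameNode records metadata; DataNodes store blocks)
        val dir = new Path("/user/demo")
        fs.mkdirs(dir)
        val out = fs.create(new Path(dir, "hello.txt"))
        out.writeBytes("hello hdfs\n")
        out.close()

        // List the directory, mirroring `hdfs dfs -ls /user/demo`
        fs.listStatus(dir).foreach(status => println(status.getPath))

        fs.close()
      }
    }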
3. MapReduce (2.5 Hours)
(a) What is MapReduce and why is it popular?
(b) MapReduce Framework – Introduction, Driver, Mapper, Reducer, Combiner, Split, Shuffle & Sort (illustrated in the WordCount sketch after this list)
(c) Hadoop 1.0 Limitations
(d) MapReduce Limitations
(e) YARN Architecture
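A compact WordCount, the canonical MapReduce example, showing the Driver, Mapper, Reducer, and Combiner roles named in (b). It calls Hadoop's Java MapReduce API from Scala 2.13; input and output paths come from the command line:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{IntWritable, Text}
    import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
    import scala.jdk.CollectionConverters._

    // Mapper: emit (word, 1) for each token in the input split
    class TokenMapper extends Mapper[Object, Text, Text, IntWritable] {
      private val one  = new IntWritable(1)
      private val word = new Text()
      override def map(key: Object, value: Text,
                       context: Mapper[Object, Text, Text, IntWritable]#Context): Unit =
        value.toString.split("\\s+").filter(_.nonEmpty).foreach { w =>
          word.set(w); context.write(word, one)
        }
    }

    // Reducer (reused as Combiner): sum the counts after shuffle & sort
    class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
      override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                          context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit =
        context.write(key, new IntWritable(values.asScala.map(_.get).sum))
    }

    // Driver: configures and submits the job
    object WordCount {
      def main(args: Array[String]): Unit = {
        val job = Job.getInstance(new Configuration(), "word count")
        job.setJarByClass(classOf[TokenMapper])
        job.setMapperClass(classOf[TokenMapper])
        job.setCombinerClass(classOf[SumReducer]) // local pre-aggregation
        job.setReducerClass(classOf[SumReducer])
        job.setOutputKeyClass(classOf[Text])
        job.setOutputValueClass(classOf[IntWritable])
        FileInputFormat.addInputPath(job, new Path(args(0)))
        FileOutputFormat.setOutputPath(job, new Path(args(1)))
        System.exit(if (job.waitForCompletion(true)) 0 else 1)
      }
    }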
UNIT-III
4. Hive (3.5 Hours)
(a) Limitations of MapReduce
(b) Need for High Level Languages
(c) Analytical Processing (OLAP) – Data Warehousing with Apache Hive
(d) What is Hive?
(e) Hive Query Language
(f) Background of Hive
(g) Hive Installation and Configuration
(h) Hive Architecture, Data Types, Data Model, Examples
(i) Create/Show Database, Drop Tables
(j) SELECT, INSERT, OVERWRITE, EXPLAIN
(k) CREATE, ALTER, DROP, TRUNCATE, JOINS
(l) SerDe (Serialization / Deserialization)
(m) Partitions and Buckets
(n) Limitations of Hive
(o) SQL vs. Hive
(p) File Formats: Avro, Parquet, and ORC (a HiveQL sketch follows this list)
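As one possible illustration of the HiveQL statements listed above (CREATE, INSERT OVERWRITE, SELECT, partitions, ORC storage), the sketch below runs HiveQL through Spark's Hive integration. It assumes a Spark build with Hive support; the database, table, and sample values are made up for the example:

    import org.apache.spark.sql.SparkSession

    object HiveDemo {
      def main(args: Array[String]): Unit = {
        // enableHiveSupport() lets Spark use the Hive metastore and HiveQL
        val spark = SparkSession.builder()
          .appName("hive-demo")
          .enableHiveSupport()
          .getOrCreate()

        spark.sql("CREATE DATABASE IF NOT EXISTS retail")
        // A partitioned table stored as ORC (one of the formats in item (p))
        spark.sql("""CREATE TABLE IF NOT EXISTS retail.sales
                     (item STRING, amount DOUBLE)
                     PARTITIONED BY (sale_date STRING) STORED AS ORC""")
        spark.sql("""INSERT OVERWRITE TABLE retail.sales
                     PARTITION (sale_date = '2024-01-01')
                     VALUES ('pen', 2.5), ('book', 10.0)""")
        spark.sql("SELECT item, SUM(amount) FROM retail.sales GROUP BY item").show()

        spark.stop()
      }
    }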
5. Scala (Object Oriented and Functional Programming) (2.5 Hours)
(a) Getting started With Scala.
(b) Scala Background, Scala vs. Java, and Basics.
(c) Interactive Scala – REPL, data types, variables, expressions, simple functions.
(d) Running the program with Scala Compiler.
(e) Exploring the Type Lattice and Using Type Inference.
(f) Defining Methods and Pattern Matching (see the sketch after this list).
(g) Scala setup on Windows.
(h) Scala setup on Unix.
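A few first-steps snippets for the topics above (values, type inference, simple functions, pattern matching), as one might type them at the Scala REPL; the names are arbitrary:

    object ScalaBasics extends App {
      val greeting: String = "hello" // immutable value, explicit type
      var counter = 0                // mutable variable, type inferred as Int
      counter += 1

      // A simple function; the result type Int is inferred
      def square(x: Int) = x * x
      println(square(4)) // prints 16

      // Pattern matching with type patterns and a guard
      def describe(x: Any): String = x match {
        case 0               => "zero"
        case n: Int if n > 0 => s"positive int $n"
        case s: String       => s"a string: $s"
        case _               => "something else"
      }
      println(describe(7))
      println(describe(greeting))
    }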
UNIT-IV
6. Functional Programming, Object-Oriented Programming, and Integrations (2.5 Hours)
(a) Classes and Properties. Objects. Packaging and Imports. Traits.
(b) Objects, Classes, Inheritance, Lists with Multiple Related Types, and the apply Method (sketched after this list)
(c) What is SBT? Integration of Scala in Eclipse IDE. Integration of SBT with Eclipse.
(d) Batch versus real-time data processing
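A minimal sketch for items (a) and (b): a trait, two classes extending it, a companion object providing apply, and one list holding multiple related types. The shapes are an illustrative choice, not from the source:

    trait Shape {
      def area: Double
    }

    class Circle(val radius: Double) extends Shape {
      def area: Double = math.Pi * radius * radius
    }

    class Rectangle(val w: Double, val h: Double) extends Shape {
      def area: Double = w * h
    }

    object Circle {
      // Companion-object factory: Circle(1.0) instead of new Circle(1.0)
      def apply(radius: Double): Circle = new Circle(radius)
    }

    object Shapes extends App {
      // One list with multiple related types, typed by the common trait
      val shapes: List[Shape] = List(Circle(1.0), new Rectangle(2.0, 3.0))
      shapes.foreach(s => println(f"${s.area}%.2f"))
    }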
7. Spark Core (3 Hours)
(a) Introduction to Spark, Spark versus Hadoop
(b) Architecture of Spark.
(c) Data Partitioning and Parallelism
(d) Coding Spark jobs in Scala
(e) Exploring the Spark Shell – Creating a SparkContext.
(f) RDD Programming – Operations on RDDs (see the sketch after this list).
(g) Transformations and Actions.
(h) Loading Data and Saving Data.
(i) Key Value Pair RDD.
(j) Root Cause Analysis (RCA) of Spark Application Failures
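A minimal RDD job in Scala covering items (e)-(i): creating a SparkContext, loading data, transformations, a key-value pair RDD, actions, and saving. The input file name and the local master URL are assumptions for a standalone run:

    import org.apache.spark.{SparkConf, SparkContext}

    object RddDemo {
      def main(args: Array[String]): Unit = {
        // local[*] runs in-process; on a cluster the master URL differs
        val sc = new SparkContext(
          new SparkConf().setAppName("rdd-demo").setMaster("local[*]"))

        val lines = sc.textFile("input.txt")     // loading data (assumed file)
        val counts = lines
          .flatMap(_.split("\\s+"))              // transformation (lazy)
          .map(word => (word, 1))                // key-value pair RDD
          .reduceByKey(_ + _)                    // shuffle + aggregation

        counts.take(10).foreach(println)         // action triggers execution
        counts.saveAsTextFile("counts-out")      // saving data

        sc.stop()
      }
    }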
UNIT-V
8. Spark SQL (2 Hours)
(a) Introduction to Apache Spark SQL
(b) The SQLContext
(c) Importing and saving data
(d) Processing Text, JSON, and Parquet Files
(e) DataFrames (sketched after this list)
(f) Using Hive
(g) PySpark and ML demo with use cases
(h) Connectivity with MySQL
(i) Error Handling
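A short DataFrame sketch for the items above: reading JSON, querying through both SQL and the DataFrame API, and writing Parquet. File names are placeholders, and the commented JDBC block shows the usual shape of a MySQL connection with illustrative credentials:

    import org.apache.spark.sql.SparkSession

    object SparkSqlDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("sql-demo").master("local[*]").getOrCreate()

        // DataFrame from a JSON file (placeholder name)
        val people = spark.read.json("people.json")
        people.createOrReplaceTempView("people")

        // Same query via SQL and via the DataFrame API
        spark.sql("SELECT name, age FROM people WHERE age > 21").show()
        people.filter(people("age") > 21).select("name").show()

        // Save as Parquet and read it back
        people.write.mode("overwrite").parquet("people.parquet")
        spark.read.parquet("people.parquet").printSchema()

        // Connectivity with MySQL via JDBC (URL and credentials illustrative):
        // val orders = spark.read.format("jdbc")
        //   .option("url", "jdbc:mysql://localhost:3306/shop")
        //   .option("dbtable", "orders")
        //   .option("user", "root").option("password", "secret")
        //   .load()

        spark.stop()
      }
    }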
9. Spark Streaming (1.5 Hours)
(a) Introduction to Spark Streaming.
(b) Architecture of Spark Streaming
(c) Processing Distributed Log Files in Real Time
(d) Discretized Streams (DStreams) of RDDs.
(e) Applying Transformations and Actions on Streaming Data (see the sketch after this list)
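A minimal DStream sketch: a socket source (e.g. fed by `nc -lk 9999`), word counts per micro-batch, and an output action. The host, port, and 5-second batch interval are assumptions:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamingDemo {
      def main(args: Array[String]): Unit = {
        // local[2]: one thread for the receiver, one for processing
        val conf = new SparkConf().setAppName("streaming-demo").setMaster("local[2]")
        // A DStream is a sequence of RDDs, one per 5-second micro-batch
        val ssc = new StreamingContext(conf, Seconds(5))

        val lines = ssc.socketTextStream("localhost", 9999)

        // The familiar RDD transformations, applied to each micro-batch
        val counts = lines.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)
        counts.print() // output action, runs once per batch

        ssc.start()
        ssc.awaitTermination()
      }
    }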
10. Kafka (1.5 Hours)
(a) Understanding Kafka Cluster
(b) Installing and Configuring Kafka Cluster
(c) Kafka Producer. Kafka Consumer
(d) Producer and Consumer in Action
(e) Reading Data from Kafka
(f) Lab: Implement a Kafka Producer and Consumer using real-time streaming data (starter sketch after this list)
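A starter sketch for the lab: a producer and a consumer built on the kafka-clients Java API from Scala 2.13. The broker address, topic, and group id are placeholder assumptions, and a real consumer would poll in a loop rather than once:

    import java.time.Duration
    import java.util.Properties
    import org.apache.kafka.clients.consumer.KafkaConsumer
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
    import scala.jdk.CollectionConverters._

    object KafkaDemo {
      val topic = "events" // assumed topic name

      def produce(): Unit = {
        val props = new Properties()
        props.put("bootstrap.servers", "localhost:9092")
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        val producer = new KafkaProducer[String, String](props)
        producer.send(new ProducerRecord(topic, "k1", "hello kafka"))
        producer.close()
      }

      def consume(): Unit = {
        val props = new Properties()
        props.put("bootstrap.servers", "localhost:9092")
        props.put("group.id", "demo-group")
        props.put("auto.offset.reset", "earliest")
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
        val consumer = new KafkaConsumer[String, String](props)
        consumer.subscribe(List(topic).asJava)
        // Single poll for demonstration only
        consumer.poll(Duration.ofSeconds(5)).asScala
          .foreach(r => println(s"${r.key} -> ${r.value}"))
        consumer.close()
      }

      def main(args: Array[String]): Unit = { produce(); consume() }
    }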
TOTAL………………. (30 Hours)
Text Books:
1. Raj Kamal and Preeti Saxena, Big Data Analytics: Introduction to Hadoop, Spark, and Machine-Learning, McGraw Hill Education, 2019.
2. Tom White, Hadoop: The Definitive Guide: Storage and Analysis at Internet Scale, 4th Edition, O'Reilly Media, 2015.
Reference Books:
1. Dirk deRoos, Paul C. Zikopoulos, Roman B. Melnyk, Bruce Brown, and Rafael Coss, Hadoop for Dummies.
2. Srinath Perera and Thilina Gunarathne, Hadoop MapReduce Cookbook.