Introduction to Hadoop & Spark | ReachUs@CloudxLab.com
Welcome to Big Data with Hadoop & Spark
Please introduce yourself while others are joining
Session 1 - Big Data with Hadoop & Spark
Duration: 3 hours
Agenda:
• Introduction to Big Data
• 10 mins. break
• Spark & Hadoop Architecture
Notes:
• Please introduce yourself using the chat window while others are joining
• The session is being recorded; the recording and presentation will be shared
• This is Session 1 of 18 in the Big Data with Hadoop & Spark specialization.
• It serves as an introduction to the Big Data technology stack.
Asking Questions?
• Everyone except the instructor is muted
• Please ask questions by typing in the Q&A window
• The instructor will read out the questions before answering
• To get better answers, keep your messages short and avoid chat language
About CloudxLab
Videos Quizzes Hands-On Projects Case Studies
Real Life Use Cases
Making learning fun and for life
Automated Hands-on Assessments
Learn by doing
Automated Hands-on Assessments
Problem Statement Hands On
Cloud based Lab
Assessment
Automated Hands-on Assessments
Problem
Statement
Evaluation
Automated Hands-on Assessments
Python Assessment Jupyter Notebook
Questions?
https://discuss.cloudxlab.com
reachus@cloudxlab.com
Course Instructor
Sandeep Giri
Worked On Large Scale Computing
Graduated from IIT Roorkee
Software Engineer
Loves Explaining Technologies
Founder
Course Objective
Learn to process Big Data with Hadoop, Spark & related technologies
Data Variety
ETL: Extract, Transform, Load
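A minimal sketch of the three ETL steps, using only the Python standard library. The sample CSV text, the `ratings` table, and the column names `user_id` and `movie_id` are made up for illustration; a real pipeline would read from and write to actual systems.

```python
import csv
import io
import sqlite3

# Hypothetical raw input: a small CSV export with inconsistent formatting.
RAW_CSV = """user_id,movie_id,rating
 kumar ,Matrix,4.0
GIRI,ice age,3.5
"""

def extract(text):
    """Extract: read rows from the raw source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: normalise names and convert ratings to numbers."""
    return [
        (row["user_id"].strip().upper(),
         row["movie_id"].strip().lower(),
         float(row["rating"]))
        for row in rows
    ]

def load(records, conn):
    """Load: write the cleaned records into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ratings "
        "(user_id TEXT, movie_id TEXT, rating REAL)"
    )
    conn.executemany("INSERT INTO ratings VALUES (?, ?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
load(transform(extract(RAW_CSV)), conn)
rows = conn.execute(
    "SELECT user_id, movie_id, rating FROM ratings ORDER BY user_id"
).fetchall()
print(rows)
```

Keeping the three steps as separate functions makes it easy to swap the source or the target without touching the cleaning logic.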
Distributed Systems
1. Groups of networked computers
2. Interact with each other
3. To achieve a common goal
Question
How many bytes are in one petabyte?
About 1.1259 x 10^15 (that is, 2^50 bytes, using binary prefixes)
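The figure above can be checked with a few lines of Python. This is a small sketch; the constant names are illustrative. It assumes the slide uses binary prefixes (1 KB = 1024 bytes), which is what yields 1.1259 x 10^15 rather than exactly 10^15.

```python
# Binary prefixes: each step (kilo, mega, giga, tera, peta) multiplies by 1024.
BYTES_PER_PETABYTE_BINARY = 1024 ** 5   # same as 2 ** 50
# Decimal (SI) prefixes: each step multiplies by 1000.
BYTES_PER_PETABYTE_DECIMAL = 1000 ** 5  # same as 10 ** 15

print(BYTES_PER_PETABYTE_BINARY)           # 1125899906842624
print(f"{BYTES_PER_PETABYTE_BINARY:.4e}")  # 1.1259e+15
```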
Question
How much data does Facebook store in one day?
600 TB
What is Big Data?
• Simply: data of very big size
• Can’t be processed with the usual tools
• Distributed architecture needed
• Structured / unstructured
Characteristics of Big Data
VOLUME - Data at Rest
Problems related to storing huge amounts of data reliably.
e.g. Storage of a website's logs, storage of data by Gmail.
FB: 300 PB, 600 TB/day
VELOCITY - Data in Motion
Problems involving the handling of data arriving at a fast rate.
e.g. Number of requests being received by Facebook, YouTube streaming, Google Analytics
VARIETY - Data in Many Forms
Problems involving complex data structures.
e.g. Maps, Social Graphs, Recommendations
Characteristics of Big Data - Variety
Problems involving complex data structures
e.g. Maps, Social Graphs, Recommendations
Question
Time taken to read 1 TB from HDD?
Around 6 hours
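A rough check of the "around 6 hours" estimate. The slide does not state a disk speed; the 50 MB/s sustained throughput used here is an assumption chosen for illustration, and real drives vary widely.

```python
TERABYTE = 1024 ** 4  # bytes, binary prefix

def read_time_hours(size_bytes, throughput_mb_per_s):
    """Hours needed to read size_bytes sequentially at the given MB/s."""
    bytes_per_second = throughput_mb_per_s * 1024 ** 2
    return size_bytes / bytes_per_second / 3600

# Assumed sustained throughput of a single hard disk (not from the slides).
hours = read_time_hours(TERABYTE, throughput_mb_per_s=50)
print(f"{hours:.1f} hours")
```

Doubling the assumed throughput halves the time, so the answer is only as good as the disk-speed assumption.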
Is One Petabyte Big Data?
If you have to count just the vowels in 1 petabyte of data every day, do you need a distributed system?
Yes.
Most of the existing systems can’t handle it.
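A back-of-the-envelope sketch of why the answer is "yes": counting vowels is trivial work, but reading 1 petabyte within a day is not. The 50 MB/s per-disk throughput is an assumption for illustration and is not from the slides.

```python
import math

PETABYTE = 1024 ** 5          # bytes, binary prefix
SECONDS_PER_DAY = 24 * 60 * 60

def count_vowels(text):
    """The work itself is simple: a single pass over the text."""
    return sum(1 for ch in text.lower() if ch in "aeiou")

def disks_needed(size_bytes, deadline_s, disk_mb_per_s):
    """Minimum number of disks read in parallel to finish within the deadline."""
    required_bytes_per_s = size_bytes / deadline_s
    return math.ceil(required_bytes_per_s / (disk_mb_per_s * 1024 ** 2))

# Sustained read rate needed to get through 1 PB in 24 hours.
required_gb_per_s = PETABYTE / SECONDS_PER_DAY / 1024 ** 3
# Assumed 50 MB/s per disk (hypothetical figure).
parallel_disks = disks_needed(PETABYTE, SECONDS_PER_DAY, disk_mb_per_s=50)

print(count_vowels("Big Data"))
print(f"{required_gb_per_s:.1f} GB/s, {parallel_disks} disks in parallel")
```

Under these assumptions the bottleneck is I/O, not computation, which is why the data has to be spread across many machines and read in parallel.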
Why Big Data?
Why Is It Important Now?
Devices x Connectivity => Applications
Devices: Smart phones
4.6 billion mobile phones.
1 - 2 billion people accessing the internet.
Connectivity: Wifi, 4G, NFC, GPS
Applications: Social Networks, Internet of Things
The devices became cheaper, faster and smaller.
The connectivity improved. Result: many applications.
Computing Components
To process & store data we need:
1. CPU - Speed
2. RAM - Speed & Size
3. HDD or SSD - Disk Size + Speed
4. Network
Which Components Impact the Speed of
Computing?
A. CPU
B. Memory Size
C. Memory Read Speed
D. Disk Speed
E. Disk Size
F. Network Speed
G. All of Above
Example Big Data Customers
1. Ecommerce - Recommendations
Example Big Data Problems
Recommendations - How?
USER ID   MOVIE ID         RATING
KUMAR     matrix           4.0
KUMAR     Ice age          3.5
GIRI      apocalypse now   3.6
GIRI      Ice age          3.5

USER ID   MOVIE ID         RATING
KUMAR     apocalypse now   3.6
GIRI      matrix           4.0
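One way the second table could be derived from the first is a very simple "users who share a movie" rule. This is only a sketch of that idea, not necessarily the method the course teaches: the `recommend` function is hypothetical, and it reuses the other user's rating as the suggested rating.

```python
from collections import defaultdict

# Ratings from the first table on the slide.
ratings = [
    ("KUMAR", "matrix", 4.0),
    ("KUMAR", "Ice age", 3.5),
    ("GIRI", "apocalypse now", 3.6),
    ("GIRI", "Ice age", 3.5),
]

def recommend(ratings):
    """Suggest movies rated by users who share at least one movie with you."""
    by_user = defaultdict(dict)
    for user, movie, rating in ratings:
        by_user[user][movie] = rating

    suggestions = []
    for user, seen in by_user.items():
        for other, other_seen in by_user.items():
            # Skip yourself and users with no movie in common.
            if other == user or not (seen.keys() & other_seen.keys()):
                continue
            for movie, rating in other_seen.items():
                if movie not in seen:
                    suggestions.append((user, movie, rating))
    return suggestions

recommendations = recommend(ratings)
for row in recommendations:
    print(row)
```

The nested loop compares every user with every other user, which is fine for four rows but grows quadratically. With more than two users the same movie could also be suggested more than once. That growth is part of what turns recommendations into a Big Data problem.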
Example Big Data Customers
2. Ecommerce - A/B Testing
Big Data Customers
Telecommunications
1. Customer Churn Prevention
2. Network Performance Optimization
3. Call Detail Record (CDR) Analysis
4. Analyzing Network to Predict Failure
Government
1. Fraud Detection
2. Cyber Security Welfare
3. Justice
Example Big Data Customers
Healthcare & Life Sciences
1. Health information exchange
2. Gene sequencing
3. Healthcare improvements
4. Drug Safety
Big Data Solutions
1. Apache Hadoop
   ○ Apache Spark
2. Cassandra
3. MongoDB
4. Google Compute Engine
5. AWS
What is Hadoop?
A. Created by Doug Cutting (of Yahoo)
B. Built for Nutch search engine project
C. Joined by Mike Cafarella
D. Based on Google's GFS, MapReduce (GMR) & Bigtable
E. Named after a toy elephant
F. Open Source - Apache
G. Powerful, Popular & Supported
H. Framework to handle Big Data
I. For distributed, scalable and reliable computing
J. Written in Java
Components (diagram): File Storage, Resource Manager, Compute Engine, NoSQL Datastore, SQL-like Interface, SQL Interface, Workflow, Machine Learning / Stats, Spark
Apache Spark
Spark Core: a fast and general engine for large-scale data processing.
• Really fast MapReduce
• Up to 100x faster than Hadoop MapReduce in memory
• Up to 10x faster on disk
• Builds on similar paradigms as MapReduce
• Integrated with Hadoop
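To make the MapReduce paradigm concrete, here is a word count written in plain Python. It does not use Spark or Hadoop. The function names `map_phase`, `shuffle` and `reduce_phase` are illustrative only; they mirror the three stages a cluster would run across many machines.

```python
from collections import defaultdict
from functools import reduce

lines = ["big data with hadoop", "big data with spark"]

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in a line."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all values that belong to the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine the values of each key into a single result."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}

pairs = [pair for line in lines for pair in map_phase(line)]
word_counts = reduce_phase(shuffle(pairs))
print(word_counts)
```

Because each line is mapped independently and each key is reduced independently, both phases can run in parallel on different machines. Engines such as Hadoop MapReduce and Spark rely on this property.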
Spark Architecture
Languages: SparkR, Java, Python, Scala
Libraries: SQL, Dataframes, Streaming, MLLib, GraphX
Spark Core
Resource/cluster managers: Standalone, Amazon EC2, Hadoop YARN, Apache Mesos
Data Sources: HDFS, HBase, Hive, Tachyon, Cassandra
Thank you. For the full course please enroll at
https://cloudxlab.com/
My Courses
My Course List
Topics or PlayLists
Learning Item

Introduction to Big data with Hadoop & Spark | Big Data Hadoop Spark Tutorial | CloudxLab