Week 1 Lecture 1
Data Analytics Landscape
[Venn diagram: Data Analytics sits at the intersection of People, Science and Data]
Data Never Sleeps!
This is what happens in one minute on the Internet (2023).
Big Data: Changing the Game of Organizations
✔ Big Data is the term for a collection of data sets so large and complex
that it becomes difficult to process using on-hand database
management tools or traditional data processing applications.
✔ Big Data is data whose scale, diversity, and complexity require new
architecture, techniques, algorithms, and analytics to manage it and
extract value and hidden knowledge from it…
The 4 V’s of Big Data
The 4 V’s of Big Data - Volume
[Mind map around Volume:]
• Data measured in kilobytes to terabytes
• Large amount of data generation
• Huge storage required
• Saved in records, tables, files
• Online or offline transactions
https://www.whishworks.com/blog/big-data/understanding-the-3-vs-of-big-data-volume-velocity-and-variety
The 4 V’s of Big Data - Velocity
[Infographic around Velocity:]
• The New York Stock Exchange captures 8 TB of trade information per day (2022)
• Analysis of streaming data
• Modern cars have close to 100 sensors that monitor items such as fuel level and tire pressure
• 18.9 billion network connections (almost 2.5 connections per person on earth)
The 4 V’s – Variety
The 4 V’s – Veracity
Data is accurate, precise and trusted
The 4 V’s – The 5th V: Value
• Cost reduction
• Time saving
Four A’s of Big Data
[Cycle diagram: the four A’s (Acquisition, Aggregation, Analysis, Application) turn scattered data into integrated data and, ultimately, data into knowledge]
Why Does Business Need Big Data?
The Problem With ETL (Extract, Transform, Load)
• With exploding amounts of stored data, the ETL process becomes a real bottleneck.
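As a concrete (if toy) illustration, here is a minimal ETL sketch in Python; the file name sales.csv, the column names, and the SQLite target are all hypothetical stand-ins:

import csv
import sqlite3

# Extract: read raw rows from a source file (hypothetical sales.csv)
with open("sales.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Transform: clean and reshape in memory; this is the step that stops
# scaling once the data no longer fits on one machine
cleaned = [
    {"region": r["region"].strip().upper(), "amount": float(r["amount"])}
    for r in rows
    if r["amount"]  # drop rows with missing amounts
]

# Load: write the transformed rows into the target store
con = sqlite3.connect("warehouse.db")
con.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (:region, :amount)", cleaned)
con.commit()

The pain point the slide refers to: both the extract and the transform step here materialize the whole dataset in memory on a single machine, which is exactly what breaks as volumes explode.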
Scaling Up & Scaling Out a Database
[Diagrams: Storage Area Network; Scaling Up a Database]
Difference between Business Intelligence & Data Science?
What is Business Intelligence?
DATA SCIENCE & BUSINESS INTELLIGENCE
[Chart: analytics maturity vs. business value]
Data Science (high business value)
• Predictive analysis & prescriptive analysis
• Why…? What will…? What should I do?
Source: http://hashtaggers586.blogspot.com/
Difference between Business Intelligence & Data Science

                 Business Intelligence              Data Science
Data Analysis    Yes                                Yes
Statistics       Yes                                Yes
Visualization    Yes                                Yes
Data Sources     Usually SQL, often a data          Less structured (logs, cloud data,
                 warehouse                          SQL, NoSQL, text)
[Section divider: People, Science, Data Venn diagram]
Big Data Architecture Main Components
Data Platform – Data Life Cycle
[Diagram: Data Platform life cycle]
• Data Ingestion: batch, real time, streaming
• Enrich: data flow & data engineering
• Serve: data warehouse, operational database
• Machine Learning & AI
The Rise of Big Data
Earlier, with limited data, only one processor and one storage unit were needed.
[Diagram: one processor, one storage unit holding structured data]
The Rise of Big Data
Data generation increased, leading to a high volume of data along with different data formats. Soon, a single processor was not enough to process such a high volume and variety of data, as it was very time consuming.
[Diagram: one processor, storage holding structured and unstructured data]
The Rise of Big Data
Hence, multiple processors were used to process the high volume of data, and this saved time. But the single storage unit became the bottleneck, due to which network overhead was generated.
[Diagram: multiple processors sharing one storage unit with structured and unstructured data]
The Rise of Big Data
The solution was to use distributed storage for each processor. This enabled easy storage and access of data, and no network overhead was generated. This method is known as parallel processing with distributed storage.
[Diagram: multiple processors, each with its own storage for structured and unstructured data]
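To make "parallel processing with distributed storage" concrete, here is a minimal sketch using Python's multiprocessing module; the chunks and the line-counting job are hypothetical stand-ins for real data splits and real processing:

from multiprocessing import Pool

# Each "storage unit" is modeled as a separate chunk of data that
# one worker processes locally, so no data moves between workers.
chunks = [
    ["alpha", "beta", "gamma"],
    ["delta", "epsilon"],
    ["zeta", "eta", "theta", "iota"],
]

def process_chunk(chunk):
    # Stand-in for real work (parsing, filtering, aggregating)
    return len(chunk)

if __name__ == "__main__":
    with Pool(processes=3) as pool:
        # Workers run in parallel, each on its own chunk
        partial_counts = pool.map(process_chunk, chunks)
    print(sum(partial_counts))  # combine partial results: 9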
Big Data Challenges and Solution
Hadoop as a Solution
What is Hadoop?
Hadoop is a framework that manages big data storage in a distributed way and
processes it in parallel.
Components of Hadoop
[Diagram: Hadoop components: HDFS for distributed storage, MapReduce for distributed processing]
HDFS
Features of HDFS
• Provides distributed storage
• Can be implemented on commodity hardware
• Provides data security
• Highly fault tolerant
What is MapReduce?
Hadoop MapReduce is a programming technique in which huge volumes of data are
processed in a parallel and distributed fashion.
[Diagram: Big Data split across processors, producing combined Output]
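A classic way to see the model is a word count. The sketch below simulates the map, shuffle and reduce phases in plain Python; the input lines are hypothetical, and in a real Hadoop job the shuffle step is done by the framework between the two phases:

from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the input split
    for word in line.split():
        yield word.lower(), 1

def reduce_phase(word, counts):
    # Reduce: combine all values that the shuffle grouped under one key
    return word, sum(counts)

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# Shuffle: group intermediate pairs by key (Hadoop does this for you)
grouped = defaultdict(list)
for line in lines:  # in Hadoop, each mapper sees one input split
    for word, one in map_phase(line):
        grouped[word].append(one)

results = dict(reduce_phase(w, c) for w, c in grouped.items())
print(results)  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, ...}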
Hadoop Use Case – Combating Fraudulent Activities
Detecting fraudulent transactions is one of the many problems any bank faces.
Hadoop Use Case – Combating Fraudulent Activities
Approaches used by Zions’ security team to combat fraudulent activities:
• Security Information Management (SIM) tools. Problem: processing of unstructured data (like server logs, customer data, customer transactions) was not feasible at scale.
• Parallel processing system. With Hadoop, Zions could now store massive amounts of data, and processing of unstructured data (like server logs, customer data, customer transactions) was now possible.
Hadoop EcoSystem
• Hadoop has four main modules: Hadoop Common, HDFS, MapReduce and YARN. The way these modules are woven together is what makes Hadoop so successful.
• Hadoop's core functionality is the driver of Hadoop's adoption. Many Apache side projects use its core functions, which makes it so popular.
• Because of all those side projects, Hadoop has turned into an ecosystem, which also adds to its popularity.
[Ecosystem diagram: Tableau, Impala, Sqoop, KNIME, Chukwa, RapidMiner, Kafka, Flume, HBase, HDFS, Arrow, Hive, MapReduce, Pig, Spark, YARN, Flink, Drill, Storm, ZooKeeper, Ambari, Oozie]
Hadoop Is Everywhere?
1. Although Hadoop is very popular, it is not a silver bullet. It's not the tool that you should use for everything.
2. Often it does not make sense to deploy a Hadoop cluster, because it can be overkill.
3. Hadoop does not run on a single server. You basically need at least five servers, better six, to run a small cluster. Because of that, the initial platform costs are quite high. One option you have is to use specialized systems like Cassandra, MongoDB or other NoSQL DBs for storage. Or you move to Amazon and use Amazon's Simple Storage Service, or S3.
4. Guess what the tech behind S3 is. Yes, HDFS. That's why AWS also has the equivalent of MapReduce, named Elastic MapReduce.
5. The great thing about S3 is that you can start very small. When your system grows, you don't have to worry about S3's server scaling.
Should You Learn the Big Data Ecosystem?
1. Yes, I definitely recommend you get to know how Hadoop works and how to use it.
2. As I have explained, the ecosystem is quite large. Many big data projects use Hadoop or can interface with it. That's why it is generally a good idea to know as many big data technologies as possible.
3. Not in depth, but to the point that you know how they work and how you can use them.
4. Your main goal should be to be able to hit the ground running when you join a big data project. Plus, most of the technologies are open source, so you can try them out for free.
A State-of-the-Art Enterprise Big Data Platform
Vision
1. Design and build an Enterprise Data Management platform comprising a Data Lake/DWH
2. Design and build data governance, including Data Quality, Master Data Management, Metadata Management, Data Catalog and Security
3. Design and build consumption layers, including Reporting, Dashboarding and AI/ML models
Data Warehousing vs Data Lake

          Data Warehouse                       Data Lake
Storage   Expensive for large data volumes     Designed for low-cost storage
Agility   Less agile, fixed configuration      Highly agile, configure and reconfigure as needed
[Architecture diagram: Data Producers → Data Acquisition → Data Curation → Data Hub → Data Access → Consumption]
• Source systems: internal and external sources, APIs; structured, semi-structured and unstructured data
• Ingestion: real-time, batch and streaming ingestion into staging
• Data Lake: raw, silver and curated layers
• Transformation: ELT/ETL transformation, data quality, standardization, cleansing, profiling, archival and purging; MDM repository
• Data storage: DWH, cubes
• Data consumption: reports/dashboards, analytics models, data preparation/building/training/deployment/ops, downstream applications; access via APIs and exchange
• Users: report users, data scientists, admin/operations
Various Key Personas to Run the Program
• Data Engineer Persona
• Data Governance Persona
• Analytics Persona (Citizen Data Science and Reporting)
How to choose technology to implement a Data & AI project in any organization?
What is Data Ingestion
The process of importing, transferring, loading and processing data for later use or
storage in a database.
• Involves connecting to various data sources, extracting the data, and detecting
changed data.
• Data ingestion subsystems need to fetch data from a variety of sources (such as
RDBMS, web logs, application logs, streaming data, social media, etc.); a minimal change-detection sketch follows the diagram below.
[Diagram: data ingested into Hadoop in batch, real-time and streaming modes]
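To illustrate the "detecting changed data" point, here is a minimal incremental-ingestion sketch in Python; the source_table, the updated_at column, and the checkpoint file are all hypothetical:

import json
import sqlite3
from pathlib import Path

CHECKPOINT = Path("ingest_checkpoint.json")

def load_watermark():
    # The watermark records how far we ingested on the previous run
    if CHECKPOINT.exists():
        return json.loads(CHECKPOINT.read_text())["last_seen"]
    return "1970-01-01 00:00:00"

def ingest_changed_rows(con):
    last_seen = load_watermark()
    # Only fetch rows that changed since the previous run
    rows = con.execute(
        "SELECT id, payload, updated_at FROM source_table "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    ).fetchall()
    for row in rows:
        pass  # hand each changed row to the downstream sink here
    if rows:
        # Advance the watermark so the next run skips these rows
        CHECKPOINT.write_text(json.dumps({"last_seen": rows[-1][2]}))
    return len(rows)

# Usage: ingest_changed_rows(sqlite3.connect("source.db"))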
Apache Kafka Introduction
Kafka Use Cases
Kafka can be used for various purposes in an organization, such as:
• Messaging Service: millions of messages can be sent and received in real time using Kafka.
• Real-Time Stream Processing: Kafka can be used to process a continuous stream of information in real time and pass it to stream processing systems such as Storm.
• Log Aggregation: Kafka can be used to collect physical log files from multiple systems and store them in a central location such as HDFS.
• Event Sourcing: a time-ordered sequence of events can be maintained through Kafka.
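As a small sketch of the messaging use case, using the kafka-python client (assuming a broker running on localhost:9092; the topic name and payload are hypothetical):

from kafka import KafkaConsumer, KafkaProducer

# Producer side: publish a message to a topic
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("transactions", b'{"account": 42, "amount": 99.5}')
producer.flush()  # make sure the message actually left the client

# Consumer side: read the stream from the beginning of the topic
consumer = KafkaConsumer(
    "transactions",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.value)  # raw bytes; deserialize as needed
    break  # stop after one message in this sketch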
Some of the uses of Kafka at LinkedIn are as follows:
• Monitoring: collect metrics and create monitoring dashboards.
What is SQL?
What is NoSQL?
Features of NoSQL
Non-Relational
• NoSQL databases never follow the relational model
• Never provide tables with flat fixed-column records
• Work with self-contained aggregates or BLOBs
• Don't require object-relational mapping and data normalization
• No complex features like query languages, query planners, referential integrity joins, ACID

Schema-free
• NoSQL databases are either schema-free or have relaxed schemas
• Do not require any sort of definition of the schema of the data
• Offer heterogeneous structures of data in the same domain

Simple API
• Offer easy-to-use interfaces for storage and querying of data
• APIs allow low-level data manipulation & selection methods
• Text-based protocols, mostly used with HTTP and REST with JSON
• Mostly no standards-based query language
• Web-enabled databases running as internet-facing services

Distributed
• Multiple NoSQL databases can be executed in a distributed fashion
• Offer auto-scaling and fail-over capabilities
• Often the ACID concept can be sacrificed for scalability and throughput
• Mostly no synchronous replication between distributed nodes; asynchronous multi-master replication, peer-to-peer, HDFS replication
• Only providing eventual consistency
• Shared-nothing architecture, which enables less coordination and higher distribution
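To make "schema-free" concrete, here is a small sketch using the pymongo client (assuming a local MongoDB instance; the database and collection names are hypothetical):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
users = client["demo_db"]["users"]  # hypothetical database/collection

# No schema declaration needed: documents in the same collection
# can have heterogeneous structures
users.insert_one({"name": "Alice", "email": "alice@example.com"})
users.insert_one({"name": "Bob", "tags": ["vip"], "visits": 17})

# Simple API: low-level selection by example document
print(users.find_one({"name": "Bob"}))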
Types of NoSQL Databases
Advantages of NoSQL?
• Can be used as a primary or analytic data source
• Big data capability
• No single point of failure
• Easy replication
• Provides fast performance and horizontal scalability
• Can handle structured, semi-structured, and unstructured data with equal effect
• Handles big data across its velocity, variety, volume, and complexity
Disadvantages of NoSQL?
• No standardization rules
• Limited query capabilities
• RDBMS databases and tools are comparatively mature
• Does not offer traditional database capabilities, like consistency when multiple transactions are performed simultaneously
• When the volume of data increases, it becomes difficult to maintain unique keys
• Doesn't work as well with relational data
• Mostly open-source options, which are not yet as established in enterprises
Difference Between Structured & Unstructured Data
[Table: features of structured vs. unstructured data]
Data Warehousing vs Data Lake