Week 1 Lecture 2
Landscape
6/25/22 11:59 AM 1
Data Processing
Batch Vs Real-time Processing
Figure: Real-time Processing Pipeline (labels: Data, Analytics, Storage, Consumer, Insight)
Batch Processing vs Real-Time Processing
The features below compare batch and real-time analytics in enterprise use cases.
Should You Do Batch Processing or Stream Processing?
• It is a good idea to start with batch processing.
• Batch processing is the foundation of every good big data platform.
• A batch processing architecture is simple, and therefore quick to set up. Platform simplicity means it will also be relatively cheap to run.
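To make the batch vs. stream distinction concrete, here is a minimal sketch in plain Python. The event values and the names `batch_average` and `StreamingAverage` are hypothetical, chosen only for illustration. The batch function needs the whole dataset before it can answer, while the streaming class updates its answer as each event arrives.

```python
from typing import List


def batch_average(values: List[float]) -> float:
    """Batch style: the full dataset is available before processing starts."""
    return sum(values) / len(values)


class StreamingAverage:
    """Stream style: keep a small running state and update it per event."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0

    def update(self, value: float) -> float:
        # Incremental mean: past events do not need to be stored.
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean


# Hypothetical watch durations in minutes.
events = [30.0, 45.0, 15.0, 60.0]

stream = StreamingAverage()
running = [stream.update(v) for v in events]
```

Both approaches end at the same value once all events have been seen; the difference is that the streaming version has a usable answer after every event.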
Limitations of MapReduce in Hadoop
Hadoop Ecosystem
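Figure: Hadoop ecosystem. Labels recoverable from the diagram:
• Sources: Log Files, ERP, RDBMS, Social Channel, Sensor, Machine, Mobile
• Processing: Staging, Raw Data (Reservoir), Batch Analytic Compute Engine, Computed Information, Result Store
• Consumption: Reports, Services, Query Tools, Alerting Tools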
Big Data Platform Blueprint
Big Data Technologies Landscape
Ingestion (Ingestion Architecture):
• Scalable and extensible to capture streaming and batch data.
• Provide capability to apply business logic, filters, validations, data quality, routing, etc., per business requirements.
Storage/Retention (Data Storage):
• Depending on the requirements, data is placed into Hadoop HDFS, Hive, HBase, Elastic Search or in-memory.
• Metadata management
• Policy-based data retention is provided.
Processing (Data Processing):
• Processing is provided for both batch and near-time use cases
• Provision workflows for repeatable data processing
• Provide late data arrival handling
Access (Visualization and APIs):
• Dashboards and applications that provide valuable business insights
• Data will be made available to consumers using API, MQ feed and DB access
Figure (batch pipeline diagram). Labels recoverable from the diagram:
• Sources: Source 2, Source 3, …, Source N, ingested with Informatica (push/pull)
• Time-defined batch data offloaded to the raw data layer
• Processing: HQL (Hive Query Language), ML Model (Python, Spark code)
• Access: Impala
• Apache Parquet (file format) appears at three stages
Real Time Streaming Pipeline
Figure: pipeline stages, left to right:
• Source: Source 1, Source 2, Source 3, …, Source N
• Kafka Topics
• Raw Data Layer / Data Lake: HDFS, Apache Parquet (file format)
• Data Processing: Processing Engine, ML Model (Python code)
• Curated Layer: Hive DB, Apache Parquet (file format), Impala
• Visualization: MicroStrategy
AI and Data Platform Architecture Diagram – Complete
Figure: platform components shown in the diagram:
• Sources: Operational Databases (MySQL instance, SQL Server instance), Other Data Sources
• Data Streams (Real Time), Batch Processing & Job Management
• Raw Data Lake, Data Warehouse, Key-Value Document Store
• Data Transformations and Data Quality Scripts
• Interactive Query, AI & ML Toolkit, AI & ML Ops (Deployment), Search Engine
• Self Service BI, Alerts and Notifications
• Consumers: User Mobile App, Apps, Users
AI and Data Platform Architecture Diagram – Complete With AWS Technologies
Figure: AWS services shown in the diagram:
• Sources: Operational Databases (MySQL instance, SQL Server instance), Other Data Sources
• Amazon Kinesis Data Streams, AWS Glue
• Amazon Redshift, Amazon DynamoDB
• AWS Lambda
• Amazon Athena, AWS Deep Learning AMIs, Amazon SageMaker, Amazon Elasticsearch Service
• Amazon QuickSight, Tableau, Amazon Simple Notification Service
• Consumers: User Mobile App, Users
Netflix Use Case
About Netflix
Netflix is a streaming service that allows customers to watch a wide variety of award-winning TV shows, movies, documentaries, and more on thousands of internet-connected devices.
• 118 million users
• 140 million hours every day
• 3 trillion user events per day
• 12 petabytes of logs every day
The Goal Of Netflix (Keep The Goal in Mind)
How Does Netflix Gather Big Data?
• When is a program paused?
• Do portions of programs get re-watched?
• Do the credits get skipped?
• Etc.
Batch Processing: Finding Hit Movies to Recommend to Users
Batch processing using historical data
Inputs:
• Location
• Country
• Impression Events
• Play Events
• Completion Events
Output: Hit Movies
Pipeline: Events (Event 1, Event 2, Event 3, …, Event 4) → Chukwa (Ingestion) → Amazon S3 (Storage) → Elastic MapReduce (Analytics)
Chukwa (a scalable data collection system) wrote incoming messages into Hadoop sequence files, stored in Amazon S3. These files could then be analyzed by Elastic MapReduce jobs (on a daily and hourly basis).
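The map and reduce steps behind a "hit movies" count can be sketched in plain Python. This is only an illustration of the MapReduce idea, not Netflix's actual job: the event records, field names, and function names below are all hypothetical, and a real job would read the stored log files instead of an in-memory list.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical play events; real logs would be read from files in storage.
play_events = [
    {"user": "u1", "title": "Movie A", "country": "US"},
    {"user": "u2", "title": "Movie B", "country": "US"},
    {"user": "u3", "title": "Movie A", "country": "PK"},
    {"user": "u4", "title": "Movie A", "country": "US"},
]


def map_phase(events: List[dict]) -> Iterator[Tuple[str, int]]:
    """Map: emit a (title, 1) pair for every play event."""
    for event in events:
        yield event["title"], 1


def reduce_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Reduce: sum the counts for each title."""
    totals: Dict[str, int] = defaultdict(int)
    for title, count in pairs:
        totals[title] += count
    return dict(totals)


def top_titles(totals: Dict[str, int], n: int) -> List[Tuple[str, int]]:
    """Rank titles by play count, highest first."""
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]


play_counts = reduce_phase(map_phase(play_events))
hit_movies = top_titles(play_counts, n=1)
```

Running the same job once per hour or once per day over the newly stored files matches the batch schedule described above.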
The Trending Now Feature
Inputs to Trending Now: Click Events, Play Events, Impression Events, Date & Time Logs
Play events (while you watch):
• Title you watched last, where you stopped watching,
• where you used 30s rewind,
• etc.
Impression events (not watching):
• Browsing the Netflix library: scrolling up and down, scrolling left or right, clicking on a movie, etc.
Real Time Processing To Recommend The Movies
Looking at the past is not enough! Let's do real-time processing.
Stream processing based on the user's incoming data
Inputs:
• Location
• Country
• Impression Events
• Play Events
• Completion Events
Output: Recommended Movies for a Particular User
• Used the Kafka platform with a Cassandra database
• They replaced their custom analytics tool with Apache Spark.
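One way to picture stream processing is a sliding time window over incoming play events. The sketch below is plain Python standing in for a streaming engine; it is not the Kafka or Spark API, and the class name `TrendingWindow`, the window length, and the sample events are assumptions made for illustration.

```python
from collections import Counter, deque
from typing import Deque, List, Tuple


class TrendingWindow:
    """Count plays per title over the last `window_seconds` of event time."""

    def __init__(self, window_seconds: int) -> None:
        self.window_seconds = window_seconds
        self.events: Deque[Tuple[int, str]] = deque()  # (timestamp, title)
        self.counts: Counter = Counter()

    def add(self, timestamp: int, title: str) -> None:
        # Assumes events arrive in timestamp order.
        self.events.append((timestamp, title))
        self.counts[title] += 1
        self._expire(timestamp)

    def _expire(self, now: int) -> None:
        # Drop events that have fallen out of the window.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            _, old_title = self.events.popleft()
            self.counts[old_title] -= 1
            if self.counts[old_title] == 0:
                del self.counts[old_title]

    def top(self, n: int) -> List[Tuple[str, int]]:
        """Return the n most played titles inside the current window."""
        return self.counts.most_common(n)


# Hypothetical (timestamp in seconds, title) play events.
window = TrendingWindow(window_seconds=60)
for ts, title in [(0, "Movie A"), (10, "Movie A"), (50, "Movie B"),
                  (70, "Movie B"), (75, "Movie B")]:
    window.add(ts, title)
```

Unlike the batch job, the ranking changes as soon as new events arrive and old ones age out of the window.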
Netflix Streaming Pipeline based on Spark
Figure labels: Impression events, Play events, Beacon, Live Data, Viewing History, Trending data, Recommender System
What Did Netflix Achieve by Using Data Science?
Trending now
Continue watching
People Science
Data
Data Analytics
The Science – Algorithms
How can data science algorithms/software help in decision making?
Figure labels: Software, Algorithms in decision making
Reference: Data Science for Business by Foster Provost & Tom Fawcett
1. CLASSIFICATION: Customers Classification
• Linear Classifiers
• Support Vector Machines
• Decision Trees
• Random Forest
• Neural Networks
Figure: Customers Data → Classifier Model → Diamond, Gold, Silver
2. REGRESSION: Predict Sales
• Linear Regression
• Lasso Regression
• Logistic Regression (typically used for classification)
• Support Vector Machine
• Multivariate Regression Algorithm
Figure: Sales Data → Regression Model → plot of Revenue (millions) against the X axis
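As a small illustration of linear regression, the sketch below fits a straight line with ordinary least squares in plain Python. The data is made up, and treating the X axis as advertising spend is an assumption; the slide does not say what the X axis represents.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: advertising spend vs. revenue (both in millions).
spend = [1.0, 2.0, 3.0, 4.0]
revenue = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(spend, revenue)
predicted = slope * 5.0 + intercept  # forecast revenue for a spend of 5
```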
3. SIMILARITY MATCHING
• Nearest Neighbor Distance
• Levenshtein Distance
• Damerau-Levenshtein Distance
• Needleman-Wunsch Distance
• Hamming Distance
Figure: Customers Data → Model
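Two of the distances listed above can be written in a few lines of Python. This is a minimal sketch using the standard dynamic-programming formulation of Levenshtein distance; the function names are illustrative.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]
```

For similarity matching on customer records, a small distance between two names (for example "Jon Smith" and "John Smith") suggests they may refer to the same customer.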
4. CLUSTERING: Business problems & clustering
• K-means
• Mean-Shift Clustering
• DBSCAN
• EM-Clustering
Figure label: Clustering Model
Business questions:
• Do customers form natural groups or clusters?
• What product should we offer?
• How should our sales team be structured?
Approach:
• Use k-means or other clustering machine learning algorithms to address the challenge.
Result:
• You can identify the clusters using a clustering algorithm, as shown in the figure, on unlabeled data based on particular attributes.
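The k-means idea can be sketched in plain Python as alternating assignment and update steps. The customer attributes and values are hypothetical, and the initial centroids are passed in explicitly (an assumption that keeps the example deterministic; library implementations usually pick them randomly).

```python
from typing import List, Tuple

Point = Tuple[float, float]


def squared_distance(p: Point, q: Point) -> float:
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def k_means(points: List[Point], centroids: List[Point],
            iterations: int = 10) -> Tuple[List[Point], List[int]]:
    """Plain k-means: alternate assignment and centroid update steps."""
    centroids = list(centroids)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its points.
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, labels


# Hypothetical customers: (visits per month, average basket value).
customers = [(1.0, 10.0), (2.0, 12.0), (1.5, 11.0),
             (9.0, 80.0), (10.0, 85.0), (9.5, 82.0)]
centroids, labels = k_means(customers, centroids=[customers[0], customers[3]])
```

No labels are supplied; the two groups emerge from the attribute values alone, which is what "unlabeled data" means here.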
5. PROFILING: Profiling Customers
• Classification
• Clustering
• Exploratory Data Analysis
Figure label: Data profiling
6. LINK PREDICTIONS: Suggestions
• Common Neighbors
• Adamic Adar
• Preferential attachment
• Resource allocation
• Same community
• Total neighbors
Figure label: Link Prediction Models
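Several of the link prediction scores listed above can be computed directly from an adjacency structure. The sketch below uses a small hypothetical friendship graph; the node names and function names are illustrative only.

```python
import math
from typing import Dict, Set

# Hypothetical undirected friendship graph as adjacency sets.
graph: Dict[str, Set[str]] = {
    "ann": {"bob", "cat", "dan"},
    "bob": {"ann", "cat"},
    "cat": {"ann", "bob", "eve"},
    "dan": {"ann"},
    "eve": {"cat"},
}


def common_neighbors(g: Dict[str, Set[str]], u: str, v: str) -> int:
    """Number of nodes connected to both u and v."""
    return len(g[u] & g[v])


def adamic_adar(g: Dict[str, Set[str]], u: str, v: str) -> float:
    # Shared neighbors with few connections count more.
    # A shared neighbor of two distinct nodes has degree >= 2, so log > 0.
    return sum(1.0 / math.log(len(g[w])) for w in g[u] & g[v])


def resource_allocation(g: Dict[str, Set[str]], u: str, v: str) -> float:
    """Like Adamic-Adar, but divides by the degree itself."""
    return sum(1.0 / len(g[w]) for w in g[u] & g[v])


def preferential_attachment(g: Dict[str, Set[str]], u: str, v: str) -> int:
    """Product of the two degrees: well-connected nodes attract links."""
    return len(g[u]) * len(g[v])
```

A higher score for a pair that is not yet connected (for example "bob" and "eve", who share "cat") makes that pair a candidate suggestion.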
7. CO-OCCURRENCE:
Data Science
Sales Data
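Co-occurrence analysis on sales data usually starts by counting how often two items appear in the same basket. The following is a minimal sketch with hypothetical baskets; the function name `count_pairs` is illustrative.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, Set


def count_pairs(baskets: Iterable[Set[str]]) -> Counter:
    """Count how many baskets contain each unordered pair of items."""
    pair_counts: Counter = Counter()
    for basket in baskets:
        # Sort so that each pair has one canonical ordering.
        for pair in combinations(sorted(basket), 2):
            pair_counts[pair] += 1
    return pair_counts


# Hypothetical sales baskets.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk"},
]
pair_counts = count_pairs(baskets)
```

Pairs with high counts are candidates for "bought together" offers.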
8. DATA REDUCTION
• Correlation analysis
• Identifying important & less important features
• Drop duplicate information
Figure: Data → Data Science → Optimized Dataset
Goal:
• Converting the large dataset into smaller datasets to process data in less time and in an effective way.
• Making sure the integrity of the data remains the same.
Approach:
• Decide which attributes are important to you using correlation or other techniques, and remove less important attributes.
• Group the similar attributes.
Result:
• Now you can perform analytics easily on a smaller dataset, saving time and reducing cost.
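One simple form of correlation-based data reduction is to drop a column when it is almost perfectly correlated with a column already kept. The sketch below uses Pearson correlation in plain Python; the column names, values, and the 0.95 threshold are assumptions chosen for illustration.

```python
import math
from typing import Dict, List


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation; assumes neither column is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def drop_redundant(columns: Dict[str, List[float]],
                   threshold: float = 0.95) -> List[str]:
    """Keep a column only if it is not highly correlated with one already kept."""
    kept: List[str] = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical dataset: height_in repeats height_cm in other units.
data = {
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "height_in": [59.1, 63.0, 66.9, 70.9],
    "purchases": [3.0, 9.0, 2.0, 7.0],
}
kept_columns = drop_redundant(data)
```

The dropped column carried no extra information, so analytics on the smaller dataset gives the same answers at lower cost.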
Machine Learning in Production
Figure labels: Machine Learning, Stream Processing, Batch Processing
Machine Learning Models Do Not Work Forever
Which Platform Supports Retraining Models Automatically?
• You can create models and then use them in production. But an automatic retraining loop is almost nowhere to be seen.
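A retraining loop needs some signal that a model has stopped working. The sketch below is one hypothetical way to produce that signal: track accuracy over the most recent predictions and flag the model when it falls below a threshold. The class name, window size, and threshold are assumptions, not a feature of any particular platform.

```python
from collections import deque
from typing import Deque


class RetrainMonitor:
    """Flag retraining when recent accuracy falls below a threshold."""

    def __init__(self, window: int, min_accuracy: float) -> None:
        self.window = window
        self.min_accuracy = min_accuracy
        # Only the most recent `window` outcomes are kept.
        self.outcomes: Deque[bool] = deque(maxlen=window)

    def record(self, prediction: object, actual: object) -> None:
        """Store whether the latest prediction matched the true outcome."""
        self.outcomes.append(prediction == actual)

    def needs_retraining(self) -> bool:
        # Wait until the window is full before judging the model.
        if len(self.outcomes) < self.window:
            return False
        accuracy = sum(self.outcomes) / len(self.outcomes)
        return accuracy < self.min_accuracy


monitor = RetrainMonitor(window=4, min_accuracy=0.75)
for predicted_label, true_label in [("hit", "hit")] * 4:
    monitor.record(predicted_label, true_label)
healthy = monitor.needs_retraining()   # all recent predictions were correct

for predicted_label, true_label in [("hit", "miss")] * 2:
    monitor.record(predicted_label, true_label)
degraded = monitor.needs_retraining()  # recent accuracy is now 2 out of 4
```

In a full loop, a positive flag would trigger a new training run on fresh data; that part is left out here.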
Machine Learning Training Parameter Management
Data is Stronger Than Opinions
You Have The Data. USE IT
• This doesn't work
• You bring the data
• Show the statistics
• Discussion ends there.
The Business
People Science
Data
Data Science
DATA-DRIVEN DECISION MAKING (DDD)
Reference: Data Science for Business by Foster Provost & Tom Fawcett
VALUE: LEVERAGING DATA FOR VALUE-ADDED HEALTHCARE
6 WAYS A DATA SCIENTIST CAN ADD VALUE TO ANY BUSINESS 1/2
https://www.simplilearn.com/why-and-how-data-science-matters-to-business-article
The Applications – Some Areas
Telecom Tourism
Big Data In Telecom
Figure: data categories around Telecom. Category labels: Cross-channel Interactions, Customer Attributes, Network Data, Transactional Data, Customer Insights.
Items shown in the figure:
• Customer Analytics, Customer Journey, Footfall analysis
• Direct, Shop Visit, IVR, Chat, Web, Social Media
• CDR
• 3G UE Agents, APN, Probes, Mobile Access Nodes, Core Network, CDRs/XDRs, DPI/Drive Tests, Service Degradations, Platform Outages
• Demographic, Age, Gender, Ethnicity, Geography, Segment, Data usage
• Federal Agencies, City Councils, Municipalities, Business Metrics
Big Data & Tourism Department
Big Data & Airports
USE CASE: Data Science in Insurance Industry
Marketing Automation
Big Data In Banking
Figure: Data Science & Predictive Analytics in Banking
Color key:
• Defensive: save the bank
• Offensive: drive profits & competitive advantage
Areas shown:
• Risk Data Aggregation & Reporting
• AML Compliance
• Wealth Management & Private Banking
• Predictive Analytics Across Banking
• Consumer Banking
• Various Types of Fraud
• Payments
• Cyber Security
• Financial Trading Applications
Big Data In Retail
Use cases:
• Cross Selling & Up Selling
• Recommendation Engine
• Purchase Likelihood
• Attribution Modeling
Techniques:
• Market Basket Analysis
• Collaborative Filtering
• Propensity Model
• Markov Chain Monte Carlo
Big Data In Healthcare
• Personalized Medicine: Merge and analyze data sets from multiple sources to create personalized treatment.
• Patient Data Analysis: Track patients' activities, movements, and symptoms to discover or identify diseases.
• Disease Modeling and Mapping: One of the flashiest uses of data science in the past few years has been in tracking (and finding ways to halt or prevent) diseases.
• Medical Test Automation: Data science enables automation of medical tests and provides real-time analytics, for example for BP, diabetes, etc.
USE CASE: Energy Efficiency (IOT Example)
Columns: Challenge, Data science & Big Data, Results
USE CASE: Smart Lighting (IOT Example)
Challenge:
• Presence sensors to ensure lights are not in use when rooms are empty.
• Daylight harvesting to employ natural lighting to minimize artificial lighting needs.
• Personal dimming to allow individuals the option to directly control the lighting in the room or space
Data science & Big Data:
• Reactive/Predictive maintenance
• Adaptive lighting solution (presence detection)
• Learn occupants' individual lighting preferences
• Management and control by measuring the light intensity
Results:
• Product Optimization
• Personal settings
• Scheduling
• Less energy consumption
• Customer satisfaction
USE CASE: Customers Segmentation
Columns: Challenge, Data science & Big Data, Results
USE CASES: Customer’s Analytics
05/18/2025 07:38 AM 60
Fraudulent wire transfer
Excess Reserves
Credit Card Customers
Misinformation in loan applications
Potential Best Customers
Customers At Risk
Target Potential Customers
Automated Documentation
Long Loan-cycle times
Goal Setting
Insurance claims
Liquidity Forecasts
Risk Scoring
Real time Blocking
Rule based AML
Fraud Detection systems
Spot Identity Fraud
Energy and Utilities
ML / AI
Data Platform should have these key Value Propositions
3 operational levers for utilities to employ AI use cases for better performance
Prioritizing use cases should consider both total potential value and feasibility to maximize impact
International Utilities Company: Improve maintenance by increasing the number of resolutions at first visit
Context: Maintenance engineers are a scarce resource, and when maintenance is required, details are not always precise enough. These requests come from e-mails and phone calls. Before AI/ML, at least two visits were often required because the right tools were lacking.
Approach: Using AI/ML, they were able to predict the faulty machine parts from the e-mails and phone calls received.
Impact:
• Lower engineering and inventory costs due to a higher resolution ratio.
• Resolution ratio jumped from 15% to 60% for all maintenance operations.
• Realized impact: $450k / year due to better use of engineering resources and inventory
International Water Management Company: Reduces regulatory cost associated with N2O measurement
Context: This company is commissioned by many Japanese municipalities to operate water purification plants. Japan's water purification plants have set standards and regulations for greenhouse gas emissions for each treatment method. The Japanese local government has also requested this company to take measures based on the measurement results for N2O. The measurement equipment costs $100,000 per unit, which is a large cost burden if many units are installed.
Approach: Using ML/AI, they forecast N2O concentration from water plant time-series data, such as temperature, transparency, pH, and quantity of chemicals, and were able to add weather data. Instead of a large number of water quality sensors, this approach combines a small number of water quality sensors with soft sensors based on predictive models.
Impact:
• MAPE: <15% for all water plants
• ROI estimates: $1 million per year in savings
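MAPE (mean absolute percentage error) is the accuracy measure quoted above. A minimal sketch of how it is computed follows; the readings and forecasts are made-up numbers, not data from the case study.

```python
from typing import List


def mape(actual: List[float], forecast: List[float]) -> float:
    """Mean absolute percentage error, in percent."""
    # Assumes no actual value is zero.
    errors = [abs((a - f) / a) for a, f in zip(actual, forecast)]
    return 100.0 * sum(errors) / len(errors)


# Hypothetical N2O readings and model forecasts (arbitrary units).
measured = [10.0, 20.0, 40.0]
predicted = [9.0, 22.0, 38.0]
score = mape(measured, predicted)
```

A score below 15 would meet the "<15%" target stated for the water plants.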
QUESTIONS & ANSWER SESSION
Healthcare
There are Hundreds of Opportunities to Optimize Every Division of A Health Care Player
Healthcare Organizations are Scaling Value Using AI
Companies Adopting AI
Moving Through the AI Maturity Curve
Large US provider Identifies FWA and secures Large ROI
A rule-driven system allowed potential cases of fraud, waste & abuse (FWA) to fall through the cracks.