Week 1 Lecture 2

The document discusses the differences between batch and real-time data processing, highlighting their respective features and use cases in analytics. It also covers the limitations of MapReduce in Hadoop, the architecture of big data platforms, and the technologies involved in data ingestion, storage, processing, and visualization. Additionally, it presents a case study on Netflix's use of big data and data science to enhance user experience and recommend content.

Uploaded by

parth25stat


Data Analytics

Landscape

6/25/22 11:59 AM 1
Data Processing

2
Batch Vs Real-time Processing

[Figure: two pipelines. Batch processing pipeline: Data, Storage, Analytics, Insight. Real-time processing pipeline: Data, Analytics, Insight, Storage, Consumer.]

3
Batch Processing vs Real-Time Processing
The features below compare batch and real-time analytics in enterprise use cases.

Batch Processing
• A large group of data/transactions is processed in a single run.
• Jobs run without any manual intervention.
• The entire data set is pre-selected and fed in using command-line parameters and scripts.
• It is used to execute multiple operations, handle heavy data loads, reporting, and offline data workflows.
• Example: regular reports requiring decision making.

Real-Time Processing
• Data processing takes place instantaneously upon data entry or command receipt.
• It must respond within stringent time constraints.
• Example: fraud detection.
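The contrast between the two styles can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the transaction amounts, the `FRAUD_LIMIT` threshold, and both function names are hypothetical and do not come from the slides.

```python
FRAUD_LIMIT = 1000.0  # hypothetical threshold, chosen only for illustration


def batch_report(transactions):
    """Batch style: the whole pre-selected data set is processed in one run."""
    return {"count": len(transactions), "total": sum(transactions)}


def realtime_alerts(stream):
    """Real-time style: each record is checked as soon as it arrives."""
    for amount in stream:
        if amount > FRAUD_LIMIT:
            # The alert is emitted before any later record has been read.
            yield amount


transactions = [120.0, 4500.0, 80.0, 2300.0]  # hypothetical amounts
report = batch_report(transactions)
alerts = list(realtime_alerts(iter(transactions)))
```

The batch function needs the complete list before it can answer, while the generator reacts record by record, which is why fraud detection sits on the real-time side.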

4
Should You Do Batch Processing or Stream Processing?

• It is a good idea to start with batch processing.
• Batch processing is the foundation of every good big data platform.
• A batch processing architecture is simple, and therefore quick to set up. Platform simplicity means it will also be relatively cheap to run.
• A batch processing platform will enable you to quickly ask the big questions. The answers will give you invaluable insight into your data and customers.
• When the time comes and you also need to do analytics on the fly, add a streaming pipeline to your batch processing big data platform.

5
Limitations of MapReduce in Hadoop

The limitations of MapReduce in Hadoop are listed below:

• Unsuitable for OLTP (Online Transaction Processing): OLTP requires a large number of short transactions, whereas MapReduce is a batch-oriented framework.
• Unfit for processing graphs: graphs are processed with the Apache Giraph library, which adds additional complexity on top of MapReduce.
• Unfit for iterative execution: being a stateless execution model, MapReduce does not fit use cases like k-means that need iterative execution.

6
Hadoop Ecosystem

[Figure: data flows from Data Sources through Data Collection and (Analytical) Data Processing into a Result Store and on to the Data Consumer.
• Data sources: log files, ERP, RDBMS, social channel, sensor, machine, mobile
• Data collection: staging
• (Analytical) data processing: raw data (reservoir), batch compute engine, computed information
• Result store
• Data consumer: reports, services, analytic query tools, alerting tools
Legend: data in motion, data at rest]

7
Big Data Platform Blueprint

1. Ingestion is all about getting the data in from the source and making it available to later stages.
2. The analyze/process stage is where the actual analytics is done, in the form of stream and batch processing.
3. Storage is the typical big data storage where you just store everything. It enables you to analyze the big picture.
4. Displaying data is as important as ingesting, storing, and analyzing it. People need to be able to make data-driven decisions.

8
Big Data Technologies Landscape

Ingestion
Ingestion Architecture:
• Scalable and extensible, to capture streaming and batch data.
• Provides the capability to apply business logic, filters, validations, data quality checks, routing, etc. according to business requirements.
Technology Stack:
• Apache Flume
• Apache Kafka
• Apache Storm
• Apache Sqoop
• NFS Gateway

Storage/Retention
Data Storage:
• Depending on the requirements, data is placed into Hadoop HDFS, Hive, HBase, Elasticsearch or in-memory.
• Metadata management
• Policy-based data retention is provided.
Technology Stack:
• HDFS
• Hive Tables
• HBase / MapR DB
• Elasticsearch

Processing
Data Processing:
• Processing is provided for both batch and near-time use cases.
• Provision workflows for repeatable data processing.
• Provide late data arrival handling.
Technology Stack:
• MapReduce
• Hive
• Apache Spark
• AWS Elastic MapReduce

Access
Visualization and APIs:
• Dashboards and applications that provide valuable business insights.
• Data will be made available to consumers using API, MQ feed and DB access.
Technology Stack:
• Qlik / Tableau / Spotfire / MicroStrategy
• REST APIs

Management, Monitoring, Governance
Ambari, Cloudera Manager, Cloudera Navigator, MapR MCS
9
Batch Processing Pipeline

[Figure: batch pipeline with the stages Sources, Raw Data Layer / Data Lake, Data Processing, Active Layer, Curated Layer, Visualization.
• Sources: Source 1 to Source N (push/pull); time-defined batch data is offloaded to the raw data layer
• Raw data layer / data lake: HDFS, Apache Parquet (file format)
• Data processing: Informatica, HQL (Hive Query Language), processing engine, ML model (Python, Spark code)
• Active and curated layers: Hive DB, Apache Parquet (file format)
• Visualization: Impala, MicroStrategy]

10
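Real Time Streaming Pipeline

[Figure: streaming pipeline with the stages Source, Raw Data Layer / Data Lake, Data Processing, Curated Layer, Visualization.
• Sources: Source 1 to Source N, delivered through Kafka topics
• Raw data layer / data lake: HDFS, Apache Parquet (file format)
• Data processing: processing engine, ML model (Python code)
• Curated layer: Hive DB, Apache Parquet (file format)
• Visualization: Impala, MicroStrategy]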
11
AI and Data Platform Architecture Diagram – Complete

[Figure: architecture with the columns Data Sources, Data Ingestion, Raw Data Lake / AI & ML / Processing, Processed Data Stores, Consumer Apps.
• Data sources: operational databases (MySQL instance, SQL Server instance), user mobile app, other data sources
• Data ingestion: data streams (real time), batch processing & job management
• Raw data lake, AI & ML, processing: raw data lake, data transformations and data quality scripts, interactive query, AI & ML toolkit, AI & ML Ops (deployment)
• Processed data stores: key-value document store, data warehouse, search engine
• Consumer apps: self-service BI, alerts and notifications, users, apps]

12
AI and Data Platform Architecture Diagram – Complete With AWS Technologies

[Figure: the same architecture (Data Sources, Data Ingestion, Raw Data Lake / AI & ML / Processing, Processed Data Stores, Consumer Apps) annotated with AWS services.
• Data sources: operational databases (MySQL instance, SQL Server instance), user mobile app, other data sources
• Services shown: Amazon Kinesis Data Streams, AWS Glue, Amazon Simple Storage Service, AWS Lambda, Amazon Athena, AWS Deep Learning AMIs, Amazon SageMaker, Amazon DynamoDB, Amazon Redshift, Amazon Elasticsearch Service, Amazon QuickSight, Tableau, Amazon Simple Notification Service
• Consumers: users, apps]

13
Netflix Use Case

How did Netflix add value to its business with the help of big data and data science?

14
About Netflix

Netflix is a streaming service that allows customers to watch a wide variety of award-winning TV shows, movies, documentaries, and more on thousands of internet-connected devices.

• 118 million users
• 140 million hours every day
• 3 trillion user events per day
• 12 petabytes of logs every day

15
The Goal Of Netflix (Keep The Goal in Mind)

The goal for Netflix

is to keep you
subscribed and to get
new subscribers.

16
How Does Netflix Gather Big Data?

• Rating
• Searches
• Date the movie/show was watched
• On which device it is watched
• Does the nature of shows watched vary depending on device?
• When is a program paused?
• Do the credits get skipped?
• Do portions of programs get re-watched?
• Etc.

17
Batch Processing: For Finding Hit Movies to Recommend to Users

Inputs:
• Location
• Country
• Impression events
• Play events
• Completion events

Batch processing using historical data produces the hit movies.

Batch processing: Netflix knows the exact episode of a TV show that gets you hooked, not only globally but for every country.
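A batch job of this kind can be sketched as a single pass over stored history. The sketch below is an assumption-laden toy, not Netflix's actual logic: the event tuples, the titles, and the function name `hit_titles_by_country` are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical historical completion events: (country, title)
completion_events = [
    ("US", "Show A"), ("US", "Show A"), ("US", "Show B"),
    ("DE", "Show B"), ("DE", "Show B"), ("DE", "Show A"),
]


def hit_titles_by_country(events):
    """Count completions per country over the full history and pick the top title."""
    counts = defaultdict(Counter)
    for country, title in events:
        counts[country][title] += 1
    return {country: c.most_common(1)[0][0] for country, c in counts.items()}


hits = hit_titles_by_country(completion_events)
```

Because the job reads the complete history, it can answer per-country questions, but its result is only as fresh as the last run.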
18
The Netflix Old Batch Processing Pipeline
When Netflix started out, they had a very simple batch processing system architecture, shown in this figure.

[Figure: Events 1 to 4 flow into Chukwa (ingestion), then Amazon S3 (storage), then Elastic MapReduce (analytics).]

Chukwa (a scalable data collection system) wrote incoming messages into Hadoop sequence files, stored in Amazon S3. These files could then be analyzed by Elastic MapReduce jobs (on a daily and hourly basis).
19
The Trending Now Feature

Inputs: click events, play events, impression events, and date & time logs.

Play events (while you watch):
• Title you watched last, where you stopped watching,
• where you used the 30s rewind,
• etc.

Impression events (not watching):
Browsing the Netflix library, e.g. scrolling up and down, scrolling left or right, clicking on a movie, etc.

Output: Trending Now
20
Real Time Processing To Recommend The Movies
Looking at the past is not enough! Let's do real-time processing.

Inputs:
• Location
• Country
• Impression events
• Play events
• Completion events

Stream processing based on the user's incoming data:
• Used the Kafka platform with a Cassandra database.
• They replaced their custom analytics tool with Apache Spark.

Output: recommended movies for a particular user.
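The idea of updating a result with every incoming event can be sketched without any streaming framework. The following toy keeps play counts over a sliding window of recent events; it does not use Kafka, Spark, or Cassandra, and the class name `TrendingNow`, the window size, and the titles are hypothetical.

```python
from collections import Counter, deque


class TrendingNow:
    """Keep play counts over the most recent `window` events only."""

    def __init__(self, window):
        self.window = window
        self.recent = deque()
        self.counts = Counter()

    def on_event(self, title):
        """Update state for one incoming play event and return the current leader."""
        self.recent.append(title)
        self.counts[title] += 1
        if len(self.recent) > self.window:
            # Forget the oldest event so the result reflects recent activity only.
            old = self.recent.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]
        return self.counts.most_common(1)[0][0]


trending = TrendingNow(window=3)
leaders = [trending.on_event(t) for t in ["A", "A", "B", "B", "B"]]
```

Unlike a batch job, the answer changes as soon as the fourth event arrives, without re-reading any history.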

21
Netflix Streaming Pipeline based on Spark

[Figure: devices send impression events and play events (beacon, live data) to Kafka (ingestion), Spark (analytics) and Cassandra (storage). The figure also shows trending data, viewing history, and the recommender system.]
What Netflix Achieved by Using Data Science?

Finding the next smash-hit series

Personalized video ranker

Video - Video Similarity Algorithm

23
24
The Science

People Science
Data

Data Analytics

Technology Processes Business

25
The Science – Algorithms
How can data science algorithms/software help in decision making?

Software algorithms in decision making:

Rule-based decision making
• Boolean data (yes or no)
• Examples: time or threshold-based alarms, simple pattern matching

Statistical reasoning
• Simple regression
• Numerical data allowing for curve fitting
• Examples: interpolation, outlier detection, predictive maintenance, quality control using various metrics

Machine learning
• Classification tasks
• Arbitrary data that needs to be abstracted into numbers
• Examples: identification of relevant features from large input datasets

Artificial intelligence
• Dynamic adaptation to novelty
• Autonomous selection of the best methodology when presented with arbitrary data
• Examples: autonomous vehicles, human-like conversational skills, intelligent digital assistant
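The first two levels can be contrasted in a short sketch: a fixed threshold alarm (rule-based) next to a z-score outlier test (statistical reasoning). The sensor readings, the `limit` of 30.0, and the `cutoff` of 2.0 are hypothetical values picked for illustration.

```python
from statistics import mean, pstdev

readings = [20.1, 19.8, 20.3, 20.0, 35.0, 19.9]  # hypothetical sensor values


def threshold_alarm(value, limit=30.0):
    """Rule-based: a fixed yes/no rule."""
    return value > limit


def zscore_outliers(values, cutoff=2.0):
    """Statistical: flag values far from the mean, in units of standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [v for v in values if sigma > 0 and abs(v - mu) / sigma > cutoff]


rule_hits = [v for v in readings if threshold_alarm(v)]
stat_hits = zscore_outliers(readings)
```

The rule needs someone to choose the limit in advance, while the statistical test derives what counts as unusual from the data itself.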
26
Machine Learning Algorithms

• Deep Learning: Deep Boltzmann Machine, Deep Belief Networks, Convolutional Neural Networks, Stacked Auto-Encoders
• Ensemble: Random Forest, Gradient Boosting Machines, Boosting, Bootstrapped Aggregation, AdaBoost, Stacked Generalization, Gradient Boosted Regression Trees
• Neural Networks: Radial Basis Function Network, Perceptron, Back-Propagation, Hopfield Network
• Regularization: Ridge Regression, Least Absolute Shrinkage & Selection Operator, Elastic Net, Least Angle Regression
• Rule System: Cubist, One Rule, Zero Rule, Repeated Incremental Pruning to Produce Error Reduction
• Regression: Linear Regression, Ordinary Least Squares Regression, Stepwise Regression, Multivariate Adaptive Regression Splines, Locally Estimated Scatterplot Smoothing, Logistic Regression
• Bayesian: Naïve Bayes, Averaged One-Dependence Estimators, Bayesian Belief Network, Gaussian Naïve Bayes, Multinomial Naïve Bayes, Bayesian Network
• Decision Tree: Classification & Regression Tree, Iterative Dichotomiser 3, C4.5, C5.0, Chi-squared Automatic Interaction Detection, Decision Stump, Conditional Decision Trees, M5
• Dimensionality Reduction: Principal Component Analysis, Partial Least Squares Regression, Sammon Mapping, Multidimensional Scaling, Projection Pursuit, Principal Component Regression, Partial Least Squares Discriminant Analysis, Mixture Discriminant Analysis, Quadratic Discriminant Analysis, Regularized Discriminant Analysis, Flexible Discriminant Analysis, Linear Discriminant Analysis
• Instance Based: K-Nearest Neighbor, Learning Vector Quantization, Self-Organizing Map, Locally Weighted Learning
• Clustering: K-means, K-medians, Expectation Maximization, Hierarchical Clustering
BUSINESS PROBLEM TO DATA MINING TASKS

• Classification: Among all the customers, which are likely to respond?
• Regression: How much will a given customer use the service?
• Similarity Matching: Attempts to identify similar individuals based on data known about them.
• Clustering: Do our customers form natural groups or segments?
• Profiling: What is the typical cell phone usage of this customer segment?
• Link Prediction: Recommending movies to customers on the basis of watched and rated movies.
• Co-occurrence: Which items are commonly purchased together?
• Data Reduction: What is important to trade off for improved insight?

Answering Business Questions
• Who are the most profitable customers?
• Is there really a difference between the profitable customers and the average customer?
• But who really are these customers? Can I characterize/classify them?
• Will some particular new customer be profitable? How much revenue should I expect this customer to generate?

Reference: Data Science for Business by Foster Provost & Tom Fawcett
28
1. CLASSIFICATION: Customers Classification
• Linear Classifiers
• Support Vector Machines
• Decision Trees
• Random Forest
• Neural Networks

[Figure: customers data is fed into a classifier model, which outputs the classes Diamond, Gold and Silver.]

• Our challenge is to classify the customers into different categories and predict their class.
• Build the classifier model on labeled data; it will classify the customers into different groups.
• The classifier model will classify the customers into groups, e.g. Diamond, Gold or Silver.
• These classes can help us to encourage the customers to buy things.
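As a minimal sketch of the build-then-classify flow, the code below uses a nearest-centroid classifier, which is swapped in here for brevity and is not one of the algorithms listed on the slide. The features (annual spend, orders per year), the sample values, and the function names are hypothetical.

```python
from math import dist

# Hypothetical labeled data: (annual spend, orders per year) -> tier
labeled = [
    ((9000, 40), "Diamond"), ((8500, 35), "Diamond"),
    ((4000, 18), "Gold"), ((4500, 20), "Gold"),
    ((800, 4), "Silver"), ((1200, 6), "Silver"),
]


def fit_centroids(samples):
    """Training step: average the feature vectors of each class."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, x in enumerate(features):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def classify(centroids, features):
    """Prediction step: pick the class whose centroid is closest (Euclidean)."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


centroids = fit_centroids(labeled)
tier = classify(centroids, (4200, 19))
```

In practice the features would be scaled first, since raw spend dominates the distance here.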

29
2. REGRESSION: Predict Sales
• Linear Regression
• Lasso Regression
• Logistic Regression
• Support Vector Machine
• Multivariate Regression Algorithm

[Figure: sales data is fed into a regression model, which outputs a plot of revenue (millions) against the X axis.]

• Predicting how much revenue will be made by the company in upcoming years based on historical data.
• How many resources do we need next year?
• Build a regression model that will predict the revenue for the upcoming period.
• Now we can see our estimated revenue and plan accordingly.
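For simple linear regression, the fitted line $y = a x + b$ has slope $a = \sum (x_i - \bar{x})(y_i - \bar{y}) / \sum (x_i - \bar{x})^2$ and intercept $b = \bar{y} - a\bar{x}$. The sketch below applies those formulas to made-up numbers; the years, the revenue figures, and the function name `fit_line` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input variable; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


years = [2018, 2019, 2020, 2021]
revenue = [10.0, 12.0, 14.0, 16.0]  # hypothetical revenue in millions
slope, intercept = fit_line(years, revenue)
forecast_2022 = slope * 2022 + intercept
```

With perfectly linear toy data the forecast simply continues the trend; real sales data would also need an error estimate.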

30
3. SIMILARITY MATCHING
• Nearest Neighbor Distance
• Levenshtein Distance
• Damerau-Levenshtein Distance
• Needleman-Wunsch Distance
• Hamming Distance

[Figure: customers data is fed into a model.]

• Targeting the people who are similar to your existing profitable customers.
• Finding which people are similar to existing profitable customers by using classification, regression and clustering models.
• Reach out to those types of customers and market your business.
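Two of the listed distances are easy to show directly. The sketch below implements the Hamming distance and the Levenshtein distance (with the standard row-by-row dynamic programme) for strings; the function names and example strings are illustrative only.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]
```

A smaller distance means two records are more alike, so the nearest records to a profitable customer are the candidates to target.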

31
4. CLUSTERING: Business problems & clustering
• K-means
• Mean-Shift Clustering
• DBSCAN
• EM-Clustering

[Figure: clustering model.]

• Do customers form natural groups or clusters?
• What product should we offer?
• How should our sales team be structured?
• Use k-means or other clustering machine learning algorithms to address the challenge.
• You can identify the clusters in unlabeled data, based on particular attributes, using a clustering algorithm as shown in the figure.
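A one-dimensional k-means run shows the idea on unlabeled data. This is a minimal sketch with fixed starting centroids rather than a library implementation; the spend values, the starting centroids, and the function name `kmeans_1d` are hypothetical.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Lloyd's algorithm in one dimension, starting from the given centroids."""
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, clusters


spend = [10, 12, 11, 50, 52, 49]  # hypothetical monthly spend per customer
centers, groups = kmeans_1d(spend, centroids=[0.0, 100.0])
```

Each pass depends on the centroids produced by the previous pass, which is the iterative pattern that plain MapReduce handles poorly.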

32
5. PROFILING: Profiling Customers
• Classification
• Clustering
• Exploratory Data Analysis

[Figure: cell phone usage data of customers is fed into data profiling.]

• What is the typical cell phone usage of a particular customer segment?
• Profiling requires complex calculations. For example, profiling cell phone usage might require a complex description of night & weekend airtime averages, international usage, roaming charges, text minutes, and so on.
• Through profiling customers you can make new policies and offer call and message packages.

33
6. LINK PREDICTIONS: Suggestions
• Common Neighbors
• Adamic Adar
• Preferential attachment
• Resource allocation
• Same community
• Total neighbors

[Figure: the customer's movies data is fed into link prediction models, which output the movies you might enjoy.]

• To recommend movies to customers, one can think of a graph between customers and the movies they have watched or rated.
• Use graph distance or another machine learning link prediction model to find out which type of movies a particular customer wants to watch.
• Link prediction is very common on social media websites like Facebook, Twitter, etc. Through this you attract users to watch new movies or use new products.
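The common-neighbors idea can be sketched on a small customer-movie graph: two customers are linked through the movies they share, and an unseen movie is scored by how many shared movies connect the target customer to the people who watched it. The viewing data and the function name `recommend` are hypothetical.

```python
# Hypothetical viewing graph: customer -> set of watched movies
watched = {
    "ana": {"Movie A", "Movie B"},
    "ben": {"Movie A", "Movie B", "Movie C"},
    "cy": {"Movie D"},
}


def recommend(user, watched):
    """Rank unseen movies by the number of shared movies with each customer who watched them."""
    seen = watched[user]
    scores = {}
    for other, movies in watched.items():
        if other == user:
            continue
        shared = len(seen & movies)  # common neighbors of the two customers
        for movie in movies - seen:
            scores[movie] = scores.get(movie, 0) + shared
    # Keep only movies reachable through at least one shared title.
    return sorted((m for m, s in scores.items() if s > 0), key=lambda m: (-scores[m], m))


suggestions = recommend("ana", watched)
```

Measures such as Adamic-Adar refine the same idea by giving less weight to very popular shared neighbors.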

34
7. CO-OCCURRENCE:
• Market basket analysis
• Association Rules

[Figure: sales data is fed into data science.]

• What items are purchased together?
• Using association rules or another machine learning algorithm, we will identify which items are purchased together.
• We can offer discounts on sets of items which customers purchase together; this can help us to increase our revenue.
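The first step of market basket analysis is counting how often items appear together. The sketch below computes the support of every item pair, i.e. the fraction of baskets containing both items; the baskets and the function name `pair_support` are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sales data: one set of items per purchase
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]


def pair_support(baskets):
    """Fraction of baskets that contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # Sorting makes ("bread", "butter") and ("butter", "bread") the same key.
        counts.update(combinations(sorted(basket), 2))
    return {pair: n / len(baskets) for pair, n in counts.items()}


support = pair_support(baskets)
top_pair = max(support, key=support.get)
```

Association rule algorithms build on these counts by also asking how often one item leads to the other.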

35
8. DATA REDUCTION
• Correlation analysis
• Identifying important & less important features
• Drop duplicate information

[Figure: data is fed into data science, which outputs an optimized dataset.]

• Converting the large dataset into smaller datasets to process data in less time and in an effective way.
• Making sure the integrity of the data remains the same.
• Decide which attributes are important to you using correlation or other techniques, and remove less important attributes.
• Group the similar attributes.
• Now you can perform analytics easily on the smaller dataset, saving time and reducing cost.

36
Machine Learning in Production

Machine
Learning

Stream
Batch
Processing

Why is machine learning in production harder than you think?

37
Machine Learning Models Do Not Work Forever

• Machine learning model training is a never-ending job. Every time new data comes in, we need to retrain the model based on the latest dataset.

• What you do in development or education is create a model and fit it to the data. Is that model then basically done forever?

• In the IoT world, the problem is that machines are very different. They behave very differently.

38
Which Platform Supports Retraining Model Automatically?

• Automatic re-training and re-deploying is a very big problem for a lot of companies, because most existing platforms don't have this capability.

• Look at AWS machine learning, for instance. The process is: build, train, tune, deploy. Where's the loop of retraining?

• You can create models and then use them in production, but this loop is almost nowhere to be seen.

39
Machine Learning Training Parameter Management

• To train a model you manipulate the input parameters of the model.

• For example, in deep learning:
• How many layers you use, the size of each layer (how many neurons you have in a layer), what activation function you use, how long you train, and so on.
• You also need to keep track of what data you used to train which model.
• All those parameters need to be manipulated automatically, and models trained and tested.
• To do all that, you basically need a database that keeps track of those variables.
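Such a tracking database can be sketched with Python's built-in `sqlite3` module. This is only one possible layout, not a recommendation of a specific tool: the table name `runs`, its columns, the dataset file names, and the accuracy values are all hypothetical.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
conn.execute(
    "CREATE TABLE runs (id INTEGER PRIMARY KEY, dataset TEXT, params TEXT, accuracy REAL)"
)


def log_run(dataset, params, accuracy):
    """Record which data and which training parameters produced which score."""
    conn.execute(
        "INSERT INTO runs (dataset, params, accuracy) VALUES (?, ?, ?)",
        (dataset, json.dumps(params, sort_keys=True), accuracy),
    )


def best_run():
    """Return (dataset, params, accuracy) of the highest-scoring run."""
    row = conn.execute(
        "SELECT dataset, params, accuracy FROM runs ORDER BY accuracy DESC LIMIT 1"
    ).fetchone()
    return row[0], json.loads(row[1]), row[2]


# Hypothetical runs with made-up scores
log_run("sales_2021.csv", {"layers": 2, "neurons": 64, "activation": "relu", "epochs": 10}, 0.81)
log_run("sales_2022.csv", {"layers": 3, "neurons": 128, "activation": "tanh", "epochs": 20}, 0.86)
dataset, params, accuracy = best_run()
```

Storing the dataset name next to the parameters is what lets an automated retraining loop reproduce or compare any earlier model.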

40
Data is Stronger Than Opinions
You Have The Data. USE IT

"This doesn't work" → You bring the data → Show the statistics → Discussion ends there.

41
42
The Business

People Science
Data

Data Science

Technology Processes Business

43
DATA-DRIVEN DECISION MAKING (DDD)

DDD refers to the practice of taking decisions using data, rather than purely on intuition:

• Using data and trending historical data
• Validating assumptions, if any
• Using champion/challenger to test scenarios
• Using experiments
• Using a baseline
• Continuous improvement
₋ Customer experiences
₋ Costs
₋ Revenues

If you can't measure it, you can't manage it.

[Figure: Data-Driven Decision Making (across the firm), Automated DDD, Data Science, Data Engineering and Processing (including "Big Data" technologies), and other positive effects of data processing (e.g., faster transaction processing).]

Reference: Data Science for business by foster provost & tom fawcett
VALUE: LEVERAGING DATA FOR VALUE-ADDED HEALTHCARE

• Increase Revenue
• Proactive Decision Making
• Strategic Partnerships
• Increase the Value of Data as an Asset (Data Monetization)
• Enhance Customer Experience
• Enhance Operational Efficiency
• Develop/Enhance Products & Services
• Innovation

6 WAYS A DATA SCIENTIST CAN ADD VALUE TO ANY BUSINESS 1/2

• Empowering management and officers to make better decisions.
• Data scientists direct the action based on trends, which in turn helps in defining goals.
• Data scientists challenge the staff to adopt the best practices and focus on the issues that matter.
• Identifying opportunities.

https://www.simplilearn.com/why-and-how-data-science-matters-to-business-article
The Applications – Some Areas

• Logistics
• Banking
• Insurance
• Customer Analytics
• Energy Efficiency & IOT
• Retail
• Marketing
• Manufacturing
• Healthcare
• Telecom
• Tourism

47
Big Data In Telecom
• Customer Analytics
• Direct • Customer Journey
• Shop Visit • Footfall analysis
• IVR
• Chat
• Web
Cross-
• Social Media
channel CDR
Interactions

• 3G UE Agents
• Demographic
• APN, Probes
• Age, Gender
• Mobile Access Nodes
Customer • Ethnicity
• Core Network Network Data
Attributes • Geography
• CDRs/XDRs
•
Teleco • Segment
DPI/Drive Tests
• Service Degradations m • Data usage
• Platform Outages
Customer Transactional
Insights Data

• Churn / Propensity Other • Ordering / fulfillment

• NPS / Customer satisfaction score Enterprise • Trouble Ticketing
• Loyalty Data • Billing
• CLV
• revenue

• Federal Agencies
• City Councils
• Municipalities
• Business Metrics 48
Big Data & Tourism Department

BOOKING PRE-ARRIVAL STAY CHECK OUT OPERATIONS

• Booking Activity by Channel • Segmentation & Clustering •• Top Guests

Guest by Revenue
Satisfaction Score • Top Guests By Revenue • Wage Cost
• Loyalty Points Spend pattern
• Cancellations & Reschedule • Campaign ROI • Social Media Follower Base • Loyalty Points Spend Pattern • Total Labour Cost
• Repeat Customer Revenue
• Upgrades / Downgrade • Cross sell/Up Sell • Customer Retention Rate • Repeat Customer Revenue • Labour Turnover
• Guest Acquisition Cost
• ADR & Occupancy Ratio • Improved Loyalty Signups •• Processing Costs per
Guest Segmentation • Guest Acquisition Cost • Food Cost
• Look-to-book Ratio • Propensity Modeling • Feedbacks
Transaction& Complaints • Guest Segmentation • Average per room cost
• Advance Booking Ratio • Affinity Modeling •• Guest Spending Pattern
Social Sentiment Score • Feedbacks & Complaints • Average hourly Pay
• New Guest Market vs
• No Show • Influence Modeling • Most Preferred Channel • Guest Spending Pattern
Return Guest

49
Big Data & Airports

50
USE CASE: Data Science in Insurance Industry

Customer Fraud Customer

Experience Detection Insights

Marketing Automation

51
Big Data In Banking
RISK DATA AGGREGATION &
REPORTING

PREDICTIVE WEALTH
MANAGEMENT &
AML
COMPLIANCE
ANALYTICS PRIVATE BANKING

ACROSS
BANKING DATA SCIENCE &
PREDICTIVE ANALYTICS
CONSUMER IN BANKING VARIOUS
BANKING TYPES OF
COLOR KEY FRAUD

DEFENSIVE
SAVE THE BANK

OFFENSIVE
DRIVE PROFITS &
COMPETITIVE
PAYMENTS
CYBER
ADVANTAGE SECURITY

FINANCIAL TRADING
APPLICATIONS

52
Big Data In Retail
CROSS SELLING & UP RECOMMENDATION
SELLING ENGINE

PURCHASE ATTRIBUTION
LIKELIHOOD MODELING
Market Collaborative
Basket Filtering
Analysis
Markov
Propensity Chain Monte
Model Carlo

CHURN Survival Retail Optimization PRICING

ANALYTICS Analysis Techniques
ANALYTICS
RFM Panel Data
Analysis Regression
Multi-
Cluster
variate Time
Analysis
Series
MARKETING
CUSTOMER
MIX MODEL
ANALYTICS

DEMAND CUSTOMER, STORE AND

FORECASTING PRODUCT SEGMENTATION
53
Big Data In Manufacturing

• Reduction of Supply Chain Risk
• Optimization of Operations to a Higher Degree than Ever
• Perfecting Quality as a Competitive Advantage
• Predictive Maintenance to Reduce Cost
• After-Sales Improvements
• Mass and Individual Customization
• New Data-Driven Revenue Sources and Business Models
• From Local to Enterprise-Level Data Analytics

54
Big Data In Healthcare

• Personalized Medicine: Merge and analyze data sets from multiple sources to create personalized treatment.
• Patient Data Analysis: Track patients' activities, movements and symptoms to discover or identify diseases.
• Disease Modeling and Mapping: One of the flashiest uses of data science in the past few years has been in tracking (and finding ways to halt or prevent) diseases.
• Medical Test Automation: Data science enables automation of medical tests and provides real-time analytics, for example for BP, diabetes, etc.

55
USE CASE: Energy Efficiency (IOT Example)

Challenge
• Our mission is to provide an innovative and affordable solution that accelerates the transition to sustainable buildings.

Data science & Big Data
• Identify inefficiencies of energy consumption
• Statistical correlation between presence of people and inefficiencies
• Control policies for devices to reduce the energy consumption

Results
• Product optimization
• Less energy consumption
• Customer satisfaction

56
USE CASE: Smart Lighting (IOT Example)

Challenge
• Presence sensors to ensure lights are not in use when rooms are empty.
• Daylight harvesting to employ natural lighting to minimize artificial lighting needs.
• Personal dimming to allow individuals the option to directly control the lighting in the room or space.

Data science & Big Data
• Reactive/predictive maintenance
• Adaptive lighting solution (presence detection)
• Learn occupants' individual lighting preferences
• Management and control by measuring the light intensity

Results
• Product optimization
• Personal settings
• Scheduling
• Less energy consumption
• Customer satisfaction

57
USE CASE: Customer Segmentation

Challenge
• Identifying the customers which are most likely to purchase your product.
• Identifying the type of customers.
• Classification of customers based on their purchases.

Data science & Big Data
• Customer segmentation based on their attributes.
• Multivariate analysis to find the customers which are most likely to purchase your products.
• Exploratory data analysis and finding information from data.

Results
• Increase sales
• Reach your customers easily
• Target customers
• Offer discounts to your customers based on their category, like gold or silver.
• Customer satisfaction

58
USE CASES: Customer Analytics

Marketing & Advertising
• Personalized marketing
• Right offer, at the right time, in the right location & context, to the right person

Customer Service
• Identifying customer pain points and issues proactively
• Updating their FAQs or other communications with existing customers.

Retention & Loyalty
• Deploy and refine predictive models that help them retain customers with proactive approaches.
• Investments, in terms of offers and upgrades, can be made at the right time to increase the likelihood of retaining desirable customers.

Customer Experience
• Sentiment analysis
• Identifying the connection between the customer's experience and the company's financial performance.
59
Currency Conversion
Fraudulent wire transfer
Excess Reserves
Credit Card Customers
Misinformation in loan applications
Potential Best Customers
Customers At Risk
Target Potential Customers
Automated Documentation
Long Loan-cycle times
Goal Setting
Insurance claims
Liquidity Forecasts
Risk Scoring
Real time Blocking
Rule based AML
Fraud Detection systems
Spot Identity Fraud

Energy and
Utilities

05/18/2025 07:38 AM Dr. Ehsan Ullah Warriach 78

Use Case Overview: Solar Generation Forecast

ML / AI

Data Platform should have these key Value Propositions

3 operational levers for utilities to employ AI use cases for better performance

Prioritizing use cases should consider both total potential values as well as feasibility for maximizing impact

International Utilities Company: Improve maintenance by increasing the number of resolutions at first visit

Context: Maintenance engineers are a scarce resource, and when maintenance is required, details are not always precise enough. These requests come from e-mails and phone calls. Before AI/ML, at least two visits were often required due to the lack of the right tools.

Approach: Using AI/ML, they were able to predict the faulty parts of machines from the e-mails and phone calls received.

Impact:
• Lower engineering and inventory costs due to a higher resolution ratio.
• Resolution ratio jumps from 15% to 60% for all maintenance operations.
• Realized impact: $450k / year due to better use of engineering resources and inventory.

International Water Management Company: Reduces regulatory cost associated with N2O measurement

Context: This company is commissioned by many Japanese municipalities to operate water purification plants. Japan's water purification plants have set standards and regulations for greenhouse gas emissions for each treatment method. The Japanese local government has also requested this company to take measures based on the measurement results for N2O. The cost of the measurement equipment is $100,000 per unit, which is a large cost burden if many units are installed.

Approach: By using ML/AI they forecasted N2O concentration using time-series data from the water plant, such as temperature, transparency, pH and quantity of chemicals, and were able to add weather data. Instead of a large number of water quality sensors, this approach combines a small number of water quality sensors with soft sensors based on predictive models.

Impact:
• MAPE: <15% for all water plants
• ROI estimate: $1 million per year in savings

QUESTIONS & ANSWER SESSION

Healthcare

There are Hundreds of Opportunities to Optimize Every Division of a Health Care Player

Healthcare Organizations are Scaling Value Using AI

Companies Adopting AI

Moving Through the AI Maturity Curve

Large US provider Identifies FWA and secures Large ROI

A rule-driven system allowed potential cases of fraud, waste & abuse (FWA) to fall through the cracks.

• Create a new model that could leverage historical results
• Now able to find more suspicious claims
• ID potential losses before payout
• Estimated ROI = $15M

In addition to using supervised machine learning to identify known behaviors of overpayments, unsupervised machine learning models identify unknown behaviors of overpayments by discovering claims that appear to be anomalous. Investigators can use this information to prioritize the review of anomalous claims and to retrain their supervised machine learning models with results from their latest investigations.

05/18/2025 07:38 AM Dr. Ehsan Ullah Warriach 78