Copyright © 2014 Improve Digital - All Rights Reserved
Approximation algorithms for
stream and batch processing
Gabriele Modena
Data Scientist Improve Digital

E: g.modena@improvedigital.com
Real Time Advertisement Technology
Media Owners Advertisers
Adtech 101
<150 msec
• Geographically distributed adserver fleet
• 200+ billion events / month
• Hundreds of TB in a Hadoop cluster
Improve Digital data platform
• Reporting & BI
– How much revenue did publisher X generate last month? Which are the top advertisers?
• Trend analysis
– Is the day-to-day traffic on site Y increasing or decreasing?
• Fraud detection
– Is the traffic legit or coming from a botnet?
• Predictive modelling
– How likely is this impression to generate a click or a conversion?
• Pattern recognition
– How are advertisers bidding and buying on inventory? Who is our audience?
Historically
• Batch pipelines
• Incremental processing
• Realtime pipelines
• Monitoring and trend analysis
Batch dataset != Realtime dataset
Batch models != Realtime models
Goals
• Write jobs once
• Unify models and analytics codebase
• Dataset semantics
• Experimentation
Analytics Architecture
[Diagram: real-time log collection → brokerage (Kafka + Samza) → processing (YARN + Spark + MapReduce); arrows labelled Push, Expose and Publish; stores: Database, HDFS, Redis]
Kafka and Samza
• Kafka (http://kafka.apache.org) as a
distributed message queue
• Topic-based
• Producers write, consumers read
• Messages are persistently stored – topics
can be re-read
• We use Samza for coordinating ingestion, ETL
and distributed stream processing
Apache Spark
• Spark (Zaharia et al. 2010)
• “Iterative” computing
• Generalization of MapReduce (Isard 2007)
• Runs atop Hadoop (YARN)

• Spark Streaming
• Break data into batches and pass it to
Spark engine (same API & data structures)
Challenges
• Conceptually everything is a stream
• Satisfy a tradeoff between
• Latency
• Memory
• Accuracy

• On infinitely expanding datasets
Make big data small
Samples, sketches and summaries
Reservoir Sampling (Vitter, 1985)
• Hard to parallelize
• How to use samples to answer certain queries?
Count distinct? TopK?
• From an infinitely expanding dataset
• With constant memory and in a single pass
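As an illustration of the constant-memory, single-pass property, here is a minimal sketch of reservoir sampling (the classic "Algorithm R" variant). The object and method names are hypothetical and chosen for this example only; they are not taken from the talk.

```scala
import scala.collection.mutable.ArrayBuffer
import scala.util.Random

object ReservoirSampling {
  // Keeps a uniform random sample of at most k items from a stream of
  // unknown length, using O(k) memory and a single pass over the data.
  def sample[T](stream: Iterator[T], k: Int, rng: Random): Vector[T] = {
    val reservoir = ArrayBuffer.empty[T]
    var seen = 0L
    for (item <- stream) {
      seen += 1
      if (reservoir.size < k) {
        // The first k items always enter the reservoir
        reservoir += item
      } else {
        // Draw a slot in [0, seen); the item is kept only if the slot
        // falls inside the reservoir, i.e. with probability k / seen
        val j = (rng.nextDouble() * seen).toLong
        if (j < k) reservoir(j.toInt) = item
      }
    }
    reservoir.toVector
  }
}
```

A call such as `ReservoirSampling.sample(events.iterator, 1000, new Random())` would keep 1000 items regardless of how long `events` (a hypothetical input) turns out to be. Note that each partition needs its own reservoir and merging them uniformly is not trivial, which is one reason the technique is hard to parallelize.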
Cardinality estimation (count distinct)
How many users are visiting a site?
Claim
The cardinality of a multiset of
uniformly-distributed random
numbers can be estimated by
calculating the maximum number
of leading zeros in the binary
representation of each number in
the set.
Intuitively

1. Apply a hash function to each element and
take the binary representation of the output
2. If the maximum number of leading zeros
observed is n, an estimate for the number of
distinct elements in the set is 2^n
3. Account for variance by averaging over subsets
HyperLogLog (Flajolet et al. 2008)
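The three steps can be sketched in plain Scala as follows. This is a simplified illustration, not Algebird's implementation: the names `LeadingZerosSketch`, `Sketch`, `add` and `estimate` are hypothetical, a 32-bit MurmurHash3 from the Scala standard library is assumed to be a good enough hash, and the large-range correction of the original algorithm is left out.

```scala
import scala.util.hashing.MurmurHash3

object LeadingZerosSketch {
  // p index bits select one of m = 2^p registers ("subsets");
  // the remaining 32 - p hash bits are scanned for leading zeros.
  final class Sketch(p: Int) {
    require(p >= 7 && p <= 16, "bias constant below assumes m >= 128")
    private val m = 1 << p
    private val registers = new Array[Int](m)

    def add(item: String): Unit = {
      val h = MurmurHash3.stringHash(item)   // step 1: hash the element
      val idx = h >>> (32 - p)               // first p bits pick the register
      val rest = h << p                      // remaining bits, left-aligned
      // step 2: rank = leading zeros + 1, capped when all remaining bits are 0
      val rank = math.min(Integer.numberOfLeadingZeros(rest), 32 - p) + 1
      if (rank > registers(idx)) registers(idx) = rank
    }

    def estimate: Double = {
      // step 3: combine the registers with a bias-corrected harmonic mean
      val alpha = 0.7213 / (1.0 + 1.079 / m)
      val raw = alpha * m * m / registers.map(r => math.pow(2.0, -r)).sum
      val zeros = registers.count(_ == 0)
      // small-range correction: fall back to linear counting
      if (raw <= 2.5 * m && zeros > 0) m * math.log(m.toDouble / zeros) else raw
    }
  }
}
```

Because each register only keeps a maximum, adding the same element twice leaves the sketch unchanged, and two sketches can be merged by taking the register-wise maximum. That merge is what makes the structure usable across partitions and batches.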
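HyperLogLog (with Spark + Algebird)
import com.twitter.algebird.HyperLogLogMonoid

// 12 bits => 2^12 = 4096 registers
val hll = new HyperLogLogMonoid(12)

// users is assumed to be a DStream[String] of user ids
val approxUsers = users.mapPartitions(ids => ids.map(uuid =>
  hll(uuid.getBytes))).reduce(_ + _)

// Merge the sketch of each batch into a running global sketch
var h = hll.zero
approxUsers.foreachRDD(rdd => {
  if (rdd.count() != 0) {
    val partial = rdd.first()
    h += partial
  }
})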
HyperLogLog (< 2% error rate in 15kB)
[Charts: Count (Exact vs. Approximate); Memory]
Frequency estimation
What are the top 10 most visited sites (out of a few million)?
Count-Min Sketch
(Cormode and Muthukrishnan, 2005)
It’s the hashing trick!
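To make the "hashing trick" concrete, here is a minimal Count-Min Sketch in plain Scala. It is a sketch under stated assumptions rather than the Algebird implementation: the names `CountMin`, `Sketch`, `add` and `estimate` are hypothetical, and a seeded MurmurHash3 stands in for the pairwise-independent hash family used in the paper.

```scala
import scala.util.hashing.MurmurHash3

object CountMin {
  // A depth x width table of counters; each row hashes items with its own seed.
  // eps bounds the overestimate (as a fraction of the total count),
  // delta bounds the probability of exceeding that error.
  final class Sketch(eps: Double, delta: Double) {
    val width: Int = math.ceil(math.E / eps).toInt
    val depth: Int = math.ceil(math.log(1.0 / delta)).toInt
    private val table = Array.ofDim[Long](depth, width)

    private def bucket(row: Int, item: String): Int = {
      val h = MurmurHash3.stringHash(item, row)  // row index used as hash seed
      ((h % width) + width) % width              // map to a non-negative bucket
    }

    // Increment one counter per row
    def add(item: String, count: Long = 1L): Unit =
      for (row <- 0 until depth) table(row)(bucket(row, item)) += count

    // Collisions can only inflate counters, so the minimum over the rows
    // never underestimates the true frequency
    def estimate(item: String): Long =
      (0 until depth).map(row => table(row)(bucket(row, item))).min
  }
}
```

With `eps = 0.01` and `delta = 1E-3` this gives a table of 7 rows by 272 counters, independent of how many distinct items are seen. Two sketches built with the same parameters can be merged by adding their tables cell by cell.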
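CMS (with Spark + Algebird)
import com.twitter.algebird.CountMinSketchMonoid

val eps = 0.01     // error bound, as a fraction of the total count
val delta = 1E-3   // probability of exceeding the error bound
val seed = 1
val perc = 0.003   // heavy hitters threshold (fraction of the total count)

// publishers is assumed to be a DStream of publisher ids
val approxImpressions = publishers.mapPartitions(ids => {
  // Algebird's constructor takes eps before delta
  val cms = new CountMinSketchMonoid(eps, delta, seed, perc)
  ids.map(publisher_id => cms.create(publisher_id.toLong))
}).reduce(_ ++ _)

var globalCMS = new CountMinSketchMonoid(eps, delta, seed, perc).zero
approxImpressions.foreachRDD(rdd => {
  if (rdd.count() != 0) {
    val partial = rdd.first()
    globalCMS ++= partial
    // Top 5 heavy hitters by estimated frequency
    val globalTopK = globalCMS.heavyHitters.map(id => (id,
      globalCMS.frequency(id).estimate)).toSeq.sortBy(_._2).reverse.slice(0, 5)
  }
})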
CMS results
[Chart: Exact vs. Approximate]
Learning from data
Iterative methods are hard to
scale in MapReduce
Using sketches
• Linear regression
– OLS + SGD on batches of data
– Recursive Least Squares with Forgetting (Vahidi et al. 2005)
• Streaming k-means (Ailon et al. 2009, Shindler et al. 2011, Ostrovsky et al. 2012)
– Single iteration-to-convergence
– Use sketches to reduce dimensionality (k log N centroids)
– Mini-batch updates + forgetfulness
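As an example of a model that learns in a single pass, recursive least squares with a forgetting factor can be sketched as below. This is the textbook single-factor update, assumed here for illustration; it is not necessarily the exact variant of Vahidi et al., and the names `RecursiveLeastSquares`, `Model`, `lambda` and `initP` are hypothetical.

```scala
object RecursiveLeastSquares {
  // lambda in (0, 1] is the forgetting factor: smaller values discount old
  // observations faster. initP scales the initial covariance matrix P.
  final class Model(dim: Int, lambda: Double = 0.99, initP: Double = 1000.0) {
    val theta = new Array[Double](dim)  // current coefficient estimate
    private val p = Array.tabulate(dim, dim)((i, j) => if (i == j) initP else 0.0)

    def predict(x: Array[Double]): Double =
      (0 until dim).map(i => theta(i) * x(i)).sum

    // One observation (x, y) updates the model in O(dim^2) time and memory
    def update(x: Array[Double], y: Double): Unit = {
      // P x
      val px = Array.tabulate(dim)(i => (0 until dim).map(j => p(i)(j) * x(j)).sum)
      // lambda + x^T P x
      val denom = lambda + (0 until dim).map(i => x(i) * px(i)).sum
      // gain vector K = P x / (lambda + x^T P x)
      val gain = px.map(_ / denom)
      // prediction error, computed before theta changes
      val err = y - predict(x)
      for (i <- 0 until dim) theta(i) += gain(i) * err
      // P <- (P - K (P x)^T) / lambda, using the symmetry of P
      for (i <- 0 until dim; j <- 0 until dim)
        p(i)(j) = (p(i)(j) - gain(i) * px(j)) / lambda
    }
  }
}
```

The same `update` call works on a live stream and on a replayed batch, so one model definition serves both pipelines.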
Conclusion
• Streaming is part of the broader system
• Approximation can help us scale both streaming and batch loads
– Make “big data” small
– Unification
• Data collection and distribution is key
– Publishing results follows
• Large scale analytics = Architecture + Algos + Data Structures
More Related Content

What's hot (20)

Titan and Cassandra at WellAware
Titan and Cassandra at WellAwareTitan and Cassandra at WellAware
Titan and Cassandra at WellAware
twilmes
 
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Spark Summit
 
On-Prem Solution for the Selection of Wind Energy Models
On-Prem Solution for the Selection of Wind Energy ModelsOn-Prem Solution for the Selection of Wind Energy Models
On-Prem Solution for the Selection of Wind Energy Models
Databricks
 
Introduction to machine learning with GPUs
Introduction to machine learning with GPUsIntroduction to machine learning with GPUs
Introduction to machine learning with GPUs
Carol McDonald
 
Taste Java In The Clouds
Taste Java In The CloudsTaste Java In The Clouds
Taste Java In The Clouds
Jacky Chu
 
Big Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLabBig Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLab
Impetus Technologies
 
State of the Art Robot Predictive Maintenance with Real-time Sensor Data
State of the Art Robot Predictive Maintenance with Real-time Sensor DataState of the Art Robot Predictive Maintenance with Real-time Sensor Data
State of the Art Robot Predictive Maintenance with Real-time Sensor Data
Mathieu Dumoulin
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache Spark
Cloudera, Inc.
 
Distributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark MeetupDistributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark Meetup
Vijay Srinivas Agneeswaran, Ph.D
 
Scaling hadoopapplications
Scaling hadoopapplicationsScaling hadoopapplications
Scaling hadoopapplications
Milind Bhandarkar
 
Automatski - RSA-2048 Cryptography Cracked using Shor's Algorithm on a Quantu...
Automatski - RSA-2048 Cryptography Cracked using Shor's Algorithm on a Quantu...Automatski - RSA-2048 Cryptography Cracked using Shor's Algorithm on a Quantu...
Automatski - RSA-2048 Cryptography Cracked using Shor's Algorithm on a Quantu...
Aditya Yadav
 
Hadoop ensma poitiers
Hadoop ensma poitiersHadoop ensma poitiers
Hadoop ensma poitiers
Rim Moussa
 
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Mathieu Dumoulin
 
MapR and Machine Learning Primer
MapR and Machine Learning PrimerMapR and Machine Learning Primer
MapR and Machine Learning Primer
Mathieu Dumoulin
 
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLabBeyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Vijay Srinivas Agneeswaran, Ph.D
 
Pivotal HD and Spring for Apache Hadoop
Pivotal HD and Spring for Apache HadoopPivotal HD and Spring for Apache Hadoop
Pivotal HD and Spring for Apache Hadoop
marklpollack
 
Special Purpose Quantum Annealing Quantum Computer v1.0
Special Purpose Quantum Annealing Quantum Computer v1.0Special Purpose Quantum Annealing Quantum Computer v1.0
Special Purpose Quantum Annealing Quantum Computer v1.0
Aditya Yadav
 
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Spark Summit
 
Apache Spark vs rest of the world – Problems and Solutions by Arkadiusz Jachn...
Apache Spark vs rest of the world – Problems and Solutions by Arkadiusz Jachn...Apache Spark vs rest of the world – Problems and Solutions by Arkadiusz Jachn...
Apache Spark vs rest of the world – Problems and Solutions by Arkadiusz Jachn...
Big Data Spain
 
Converged and Containerized Distributed Deep Learning With TensorFlow and Kub...
Converged and Containerized Distributed Deep Learning With TensorFlow and Kub...Converged and Containerized Distributed Deep Learning With TensorFlow and Kub...
Converged and Containerized Distributed Deep Learning With TensorFlow and Kub...
Mathieu Dumoulin
 
Titan and Cassandra at WellAware
Titan and Cassandra at WellAwareTitan and Cassandra at WellAware
Titan and Cassandra at WellAware
twilmes
 
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Spark Summit
 
On-Prem Solution for the Selection of Wind Energy Models
On-Prem Solution for the Selection of Wind Energy ModelsOn-Prem Solution for the Selection of Wind Energy Models
On-Prem Solution for the Selection of Wind Energy Models
Databricks
 
Introduction to machine learning with GPUs
Introduction to machine learning with GPUsIntroduction to machine learning with GPUs
Introduction to machine learning with GPUs
Carol McDonald
 
Taste Java In The Clouds
Taste Java In The CloudsTaste Java In The Clouds
Taste Java In The Clouds
Jacky Chu
 
Big Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLabBig Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLab
Impetus Technologies
 
State of the Art Robot Predictive Maintenance with Real-time Sensor Data
State of the Art Robot Predictive Maintenance with Real-time Sensor DataState of the Art Robot Predictive Maintenance with Real-time Sensor Data
State of the Art Robot Predictive Maintenance with Real-time Sensor Data
Mathieu Dumoulin
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache Spark
Cloudera, Inc.
 
Automatski - RSA-2048 Cryptography Cracked using Shor's Algorithm on a Quantu...
Automatski - RSA-2048 Cryptography Cracked using Shor's Algorithm on a Quantu...Automatski - RSA-2048 Cryptography Cracked using Shor's Algorithm on a Quantu...
Automatski - RSA-2048 Cryptography Cracked using Shor's Algorithm on a Quantu...
Aditya Yadav
 
Hadoop ensma poitiers
Hadoop ensma poitiersHadoop ensma poitiers
Hadoop ensma poitiers
Rim Moussa
 
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Mathieu Dumoulin
 
MapR and Machine Learning Primer
MapR and Machine Learning PrimerMapR and Machine Learning Primer
MapR and Machine Learning Primer
Mathieu Dumoulin
 
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLabBeyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Vijay Srinivas Agneeswaran, Ph.D
 
Pivotal HD and Spring for Apache Hadoop
Pivotal HD and Spring for Apache HadoopPivotal HD and Spring for Apache Hadoop
Pivotal HD and Spring for Apache Hadoop
marklpollack
 
Special Purpose Quantum Annealing Quantum Computer v1.0
Special Purpose Quantum Annealing Quantum Computer v1.0Special Purpose Quantum Annealing Quantum Computer v1.0
Special Purpose Quantum Annealing Quantum Computer v1.0
Aditya Yadav
 
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Spark Summit
 
Apache Spark vs rest of the world – Problems and Solutions by Arkadiusz Jachn...
Apache Spark vs rest of the world – Problems and Solutions by Arkadiusz Jachn...Apache Spark vs rest of the world – Problems and Solutions by Arkadiusz Jachn...
Apache Spark vs rest of the world – Problems and Solutions by Arkadiusz Jachn...
Big Data Spain
 
Converged and Containerized Distributed Deep Learning With TensorFlow and Kub...
Converged and Containerized Distributed Deep Learning With TensorFlow and Kub...Converged and Containerized Distributed Deep Learning With TensorFlow and Kub...
Converged and Containerized Distributed Deep Learning With TensorFlow and Kub...
Mathieu Dumoulin
 

Similar to Approximation algorithms for stream and batch processing (20)

Streamlio and IoT analytics with Apache Pulsar
Streamlio and IoT analytics with Apache PulsarStreamlio and IoT analytics with Apache Pulsar
Streamlio and IoT analytics with Apache Pulsar
Streamlio
 
Data streaming algorithms
Data streaming algorithmsData streaming algorithms
Data streaming algorithms
Sandeep Joshi
 
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge
Using Apache Pulsar to Provide Real-Time IoT Analytics on the EdgeUsing Apache Pulsar to Provide Real-Time IoT Analytics on the Edge
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge
DataWorks Summit
 
The Analytics Frontier of the Hadoop Eco-System
The Analytics Frontier of the Hadoop Eco-SystemThe Analytics Frontier of the Hadoop Eco-System
The Analytics Frontier of the Hadoop Eco-System
inside-BigData.com
 
Big Data Analysis : Deciphering the haystack
Big Data Analysis : Deciphering the haystack Big Data Analysis : Deciphering the haystack
Big Data Analysis : Deciphering the haystack
Srinath Perera
 
Approximation Data Structures for Streaming Applications
Approximation Data Structures for Streaming ApplicationsApproximation Data Structures for Streaming Applications
Approximation Data Structures for Streaming Applications
Debasish Ghosh
 
Introduction to Large Scale Data Analysis with WSO2 Analytics Platform
Introduction to Large Scale Data Analysis with WSO2 Analytics PlatformIntroduction to Large Scale Data Analysis with WSO2 Analytics Platform
Introduction to Large Scale Data Analysis with WSO2 Analytics Platform
Srinath Perera
 
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Databricks
 
Practical Data Science Workshop - Recommendation Systems - Collaborative Filt...
Practical Data Science Workshop - Recommendation Systems - Collaborative Filt...Practical Data Science Workshop - Recommendation Systems - Collaborative Filt...
Practical Data Science Workshop - Recommendation Systems - Collaborative Filt...
Chris Fregly
 
CSE545 sp23 (2) Streaming Algorithms 2-4.pdf
CSE545 sp23 (2) Streaming Algorithms 2-4.pdfCSE545 sp23 (2) Streaming Algorithms 2-4.pdf
CSE545 sp23 (2) Streaming Algorithms 2-4.pdf
AlexanderKyalo3
 
CSE545 sp23 (2) Streaming Algorithms 2-4.pdf
CSE545 sp23 (2) Streaming Algorithms 2-4.pdfCSE545 sp23 (2) Streaming Algorithms 2-4.pdf
CSE545 sp23 (2) Streaming Algorithms 2-4.pdf
Gabriel Kamau
 
Analysing streaming data in real time (AWS)
Analysing streaming data in real time (AWS)Analysing streaming data in real time (AWS)
Analysing streaming data in real time (AWS)
javier ramirez
 
Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs
Scalable and Efficient Algorithms for Analysis of Massive, Streaming GraphsScalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs
Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs
Jason Riedy
 
Real time streaming analytics
Real time streaming analyticsReal time streaming analytics
Real time streaming analytics
Anirudh
 
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge_David
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge_DavidUsing Apache Pulsar to Provide Real-Time IoT Analytics on the Edge_David
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge_David
StreamNative
 
Big Data and the Web: Algorithms for Data Intensive Scalable Computing
Big Data and the Web: Algorithms for Data Intensive Scalable ComputingBig Data and the Web: Algorithms for Data Intensive Scalable Computing
Big Data and the Web: Algorithms for Data Intensive Scalable Computing
Gabriela Agustini
 
Big data-and-the-web
Big data-and-the-webBig data-and-the-web
Big data-and-the-web
Aravindharamanan S
 
Data Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and RData Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and R
Radek Maciaszek
 
ICTER 2014 Invited Talk: Large Scale Data Processing in the Real World: from ...
ICTER 2014 Invited Talk: Large Scale Data Processing in the Real World: from ...ICTER 2014 Invited Talk: Large Scale Data Processing in the Real World: from ...
ICTER 2014 Invited Talk: Large Scale Data Processing in the Real World: from ...
Srinath Perera
 
Data Scientist Toolbox
Data Scientist ToolboxData Scientist Toolbox
Data Scientist Toolbox
Andrei Savu
 
Streamlio and IoT analytics with Apache Pulsar
Streamlio and IoT analytics with Apache PulsarStreamlio and IoT analytics with Apache Pulsar
Streamlio and IoT analytics with Apache Pulsar
Streamlio
 
Data streaming algorithms
Data streaming algorithmsData streaming algorithms
Data streaming algorithms
Sandeep Joshi
 
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge
Using Apache Pulsar to Provide Real-Time IoT Analytics on the EdgeUsing Apache Pulsar to Provide Real-Time IoT Analytics on the Edge
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge
DataWorks Summit
 
The Analytics Frontier of the Hadoop Eco-System
The Analytics Frontier of the Hadoop Eco-SystemThe Analytics Frontier of the Hadoop Eco-System
The Analytics Frontier of the Hadoop Eco-System
inside-BigData.com
 
Big Data Analysis : Deciphering the haystack
Big Data Analysis : Deciphering the haystack Big Data Analysis : Deciphering the haystack
Big Data Analysis : Deciphering the haystack
Srinath Perera
 
Approximation Data Structures for Streaming Applications
Approximation Data Structures for Streaming ApplicationsApproximation Data Structures for Streaming Applications
Approximation Data Structures for Streaming Applications
Debasish Ghosh
 
Introduction to Large Scale Data Analysis with WSO2 Analytics Platform
Introduction to Large Scale Data Analysis with WSO2 Analytics PlatformIntroduction to Large Scale Data Analysis with WSO2 Analytics Platform
Introduction to Large Scale Data Analysis with WSO2 Analytics Platform
Srinath Perera
 
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Databricks
 
Practical Data Science Workshop - Recommendation Systems - Collaborative Filt...
Practical Data Science Workshop - Recommendation Systems - Collaborative Filt...Practical Data Science Workshop - Recommendation Systems - Collaborative Filt...
Practical Data Science Workshop - Recommendation Systems - Collaborative Filt...
Chris Fregly
 
CSE545 sp23 (2) Streaming Algorithms 2-4.pdf
CSE545 sp23 (2) Streaming Algorithms 2-4.pdfCSE545 sp23 (2) Streaming Algorithms 2-4.pdf
CSE545 sp23 (2) Streaming Algorithms 2-4.pdf
AlexanderKyalo3
 
CSE545 sp23 (2) Streaming Algorithms 2-4.pdf
CSE545 sp23 (2) Streaming Algorithms 2-4.pdfCSE545 sp23 (2) Streaming Algorithms 2-4.pdf
CSE545 sp23 (2) Streaming Algorithms 2-4.pdf
Gabriel Kamau
 
Analysing streaming data in real time (AWS)
Analysing streaming data in real time (AWS)Analysing streaming data in real time (AWS)
Analysing streaming data in real time (AWS)
javier ramirez
 
Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs
Scalable and Efficient Algorithms for Analysis of Massive, Streaming GraphsScalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs
Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs
Jason Riedy
 
Real time streaming analytics
Real time streaming analyticsReal time streaming analytics
Real time streaming analytics
Anirudh
 
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge_David
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge_DavidUsing Apache Pulsar to Provide Real-Time IoT Analytics on the Edge_David
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge_David
StreamNative
 
Big Data and the Web: Algorithms for Data Intensive Scalable Computing
Big Data and the Web: Algorithms for Data Intensive Scalable ComputingBig Data and the Web: Algorithms for Data Intensive Scalable Computing
Big Data and the Web: Algorithms for Data Intensive Scalable Computing
Gabriela Agustini
 
Data Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and RData Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and R
Radek Maciaszek
 
ICTER 2014 Invited Talk: Large Scale Data Processing in the Real World: from ...
ICTER 2014 Invited Talk: Large Scale Data Processing in the Real World: from ...ICTER 2014 Invited Talk: Large Scale Data Processing in the Real World: from ...
ICTER 2014 Invited Talk: Large Scale Data Processing in the Real World: from ...
Srinath Perera
 
Data Scientist Toolbox
Data Scientist ToolboxData Scientist Toolbox
Data Scientist Toolbox
Andrei Savu
 
Ad

Recently uploaded (20)

"Machine Learning in Agriculture: 12 Production-Grade Models", Danil Polyakov
"Machine Learning in Agriculture: 12 Production-Grade Models", Danil Polyakov"Machine Learning in Agriculture: 12 Production-Grade Models", Danil Polyakov
"Machine Learning in Agriculture: 12 Production-Grade Models", Danil Polyakov
Fwdays
 
Tableau Finland User Group June 2025.pdf
Tableau Finland User Group June 2025.pdfTableau Finland User Group June 2025.pdf
Tableau Finland User Group June 2025.pdf
elinavihriala
 
15 Benefits of Data Analytics in Business Growth.pdf
15 Benefits of Data Analytics in Business Growth.pdf15 Benefits of Data Analytics in Business Growth.pdf
15 Benefits of Data Analytics in Business Growth.pdf
AffinityCore
 
Glary Utilities Pro 5.157.0.183 Crack + Key Download [Latest]
Glary Utilities Pro 5.157.0.183 Crack + Key Download [Latest]Glary Utilities Pro 5.157.0.183 Crack + Key Download [Latest]
Glary Utilities Pro 5.157.0.183 Crack + Key Download [Latest]
Designer
 
HPC High Performance Course Presentation.pptx
HPC High Performance Course Presentation.pptxHPC High Performance Course Presentation.pptx
HPC High Performance Course Presentation.pptx
naziaahmadnm
 
Chronic constipation presentaion final.ppt
Chronic constipation presentaion final.pptChronic constipation presentaion final.ppt
Chronic constipation presentaion final.ppt
DrShashank7
 
Content Moderation Services_ Leading the Future of Online Safety.docx
Content Moderation Services_ Leading the Future of Online Safety.docxContent Moderation Services_ Leading the Future of Online Safety.docx
Content Moderation Services_ Leading the Future of Online Safety.docx
sofiawilliams5966
 
一比一原版(USC毕业证)南加利福尼亚大学毕业证如何办理
一比一原版(USC毕业证)南加利福尼亚大学毕业证如何办理一比一原版(USC毕业证)南加利福尼亚大学毕业证如何办理
一比一原版(USC毕业证)南加利福尼亚大学毕业证如何办理
Taqyea
 
Internal Architecture of Database Management Systems
Internal Architecture of Database Management SystemsInternal Architecture of Database Management Systems
Internal Architecture of Database Management Systems
M Munim
 
Blue Dark Professional Geometric Business Project Presentation .pdf
Blue Dark Professional Geometric Business Project Presentation .pdfBlue Dark Professional Geometric Business Project Presentation .pdf
Blue Dark Professional Geometric Business Project Presentation .pdf
mohammadhaidarayoobi
 
GROUP 7 CASE STUDY Real Life Incident.pptx
GROUP 7 CASE STUDY Real Life Incident.pptxGROUP 7 CASE STUDY Real Life Incident.pptx
GROUP 7 CASE STUDY Real Life Incident.pptx
mardoglenn21
 
Artificial-Intelligence-in-Autonomous-Vehicles (1)-1.pptx
Artificial-Intelligence-in-Autonomous-Vehicles (1)-1.pptxArtificial-Intelligence-in-Autonomous-Vehicles (1)-1.pptx
Artificial-Intelligence-in-Autonomous-Vehicles (1)-1.pptx
AbhijitPal87
 
Geospatial Data_ Unlocking the Power for Smarter Urban Planning.docx
Geospatial Data_ Unlocking the Power for Smarter Urban Planning.docxGeospatial Data_ Unlocking the Power for Smarter Urban Planning.docx
Geospatial Data_ Unlocking the Power for Smarter Urban Planning.docx
sofiawilliams5966
 
EPC UNIT-V forengineeringstudentsin.pptx
EPC UNIT-V forengineeringstudentsin.pptxEPC UNIT-V forengineeringstudentsin.pptx
EPC UNIT-V forengineeringstudentsin.pptx
ExtremerZ
 
Chapter4.1.pptx you can come to the house and statistics
Chapter4.1.pptx you can come to the house and statisticsChapter4.1.pptx you can come to the house and statistics
Chapter4.1.pptx you can come to the house and statistics
SotheaPheng
 
Cyber Security Presentation(Neon)xu.pptx
Cyber Security Presentation(Neon)xu.pptxCyber Security Presentation(Neon)xu.pptx
Cyber Security Presentation(Neon)xu.pptx
vilakshbhargava
 
Chapter 5.1.pptxsertj you can get it done before the election and I will
Chapter 5.1.pptxsertj you can get it done before the election and I willChapter 5.1.pptxsertj you can get it done before the election and I will
Chapter 5.1.pptxsertj you can get it done before the election and I will
SotheaPheng
 
delta airlines new york office (Airwayscityoffice)
delta airlines new york office (Airwayscityoffice)delta airlines new york office (Airwayscityoffice)
delta airlines new york office (Airwayscityoffice)
jamespromind
 
refractiveindexexperimentdetailed-250528162156-4516aa1c.pptx
refractiveindexexperimentdetailed-250528162156-4516aa1c.pptxrefractiveindexexperimentdetailed-250528162156-4516aa1c.pptx
refractiveindexexperimentdetailed-250528162156-4516aa1c.pptx
KannanDamodaram
 
IST606_SecurityManagement-slides_ 4 pdf
IST606_SecurityManagement-slides_ 4  pdfIST606_SecurityManagement-slides_ 4  pdf
IST606_SecurityManagement-slides_ 4 pdf
nwanjamakane
 
"Machine Learning in Agriculture: 12 Production-Grade Models", Danil Polyakov
"Machine Learning in Agriculture: 12 Production-Grade Models", Danil Polyakov"Machine Learning in Agriculture: 12 Production-Grade Models", Danil Polyakov
"Machine Learning in Agriculture: 12 Production-Grade Models", Danil Polyakov
Fwdays
 
Tableau Finland User Group June 2025.pdf
Tableau Finland User Group June 2025.pdfTableau Finland User Group June 2025.pdf
Tableau Finland User Group June 2025.pdf
elinavihriala
 
15 Benefits of Data Analytics in Business Growth.pdf
15 Benefits of Data Analytics in Business Growth.pdf15 Benefits of Data Analytics in Business Growth.pdf

Approximation algorithms for stream and batch processing

  • 1. Copyright © 2014 Improve Digital - All Rights Reserved. Approximation algorithms for stream and batch processing. Gabriele Modena, Data Scientist, Improve Digital. E: g.modena@improvedigital.com
  • 2. Real Time Advertisement Technology: Media Owners, Advertisers
  • 3. Adtech 101: <150 msec • Geographically distributed adserver fleet • 200+ billion events / month • Hundreds of TB in a Hadoop cluster
  • 4. Improve Digital data platform • Reporting & BI: How much revenue did publisher X generate last month? Which are the top advertisers? • Trend analysis: Is the day-to-day traffic on site Y increasing or decreasing? • Fraud detection: Is the traffic legit or coming from a botnet? • Predictive modelling: How likely is this impression to generate a click or a conversion? • Pattern recognition: How are advertisers bidding and buying on inventory? Who is our audience?
  • 5. Historically • Batch pipelines • Incremental processing • Realtime pipelines • Monitoring and trend analysis. Batch dataset != Realtime dataset; Batch models != Realtime models
  • 6. Goals • Write jobs once • Unify models and • Analytics codebase • Dataset semantics • Experimentation
  • 7. Analytics Architecture • Real-time log collection • Brokerage (Kafka + Samza) • Processing (YARN + Spark + MapReduce) • Push, Expose, Publish • Database, HDFS, Redis
  • 8. Kafka and Samza • Kafka (http://kafka.apache.org) as a distributed message queue • Topic-based • Producers write, consumers read • Messages are persistently stored – topics can be re-read • We use Samza for coordinating ingestion, ETL and distributed stream processing
  • 9. Apache Spark • Spark (Zaharia et al. 2010) • “Iterative” computing • Generalization of MapReduce (Isard 2007) • Runs atop Hadoop (YARN) • Spark Streaming • Break data into batches and pass it to Spark engine (same API & data structures)
  • 10. Challenges • Conceptually everything is a stream • Satisfy a tradeoff between latency, memory and accuracy • On infinitely expanding datasets
  • 11. Make big data small: samples, sketches and summaries
  • 12. Reservoir Sampling (Vitter, 1985) • Hard to parallelize • How to use samples to answer certain queries? Count distinct? TopK? • From an infinitely expanding dataset • With constant memory and in a single pass
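The "constant memory, single pass" property of reservoir sampling can be illustrated with a minimal sketch of its simplest variant (often called Algorithm R). This is plain Scala with no Spark dependency; the object and method names are hypothetical and are not taken from the deck.

```scala
import scala.collection.mutable.ArrayBuffer
import scala.util.Random

object ReservoirSampling {
  /** Keep a uniform random sample of k items from a stream of unknown length. */
  def sample[A](stream: Iterator[A], k: Int, rng: Random = new Random()): Vector[A] = {
    require(k > 0, "sample size must be positive")
    val reservoir = ArrayBuffer.empty[A]
    var seen = 0L
    stream.foreach { item =>
      seen += 1
      if (reservoir.size < k) {
        // the first k items fill the reservoir directly
        reservoir += item
      } else {
        // draw a slot in [0, seen); the item is kept with probability k / seen
        val j = (rng.nextDouble() * seen).toLong
        if (j < k) reservoir(j.toInt) = item
      }
    }
    reservoir.toVector
  }
}
```

Memory stays at k items however long the stream runs. The running counter `seen` is shared state across the whole stream, which is one reason the slide calls the technique hard to parallelize.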
  • 13. Cardinality estimation (count distinct): How many users are visiting a site?
  • 14. Claim: The cardinality of a multiset of uniformly-distributed random numbers can be estimated by calculating the maximum number of leading zeros in the binary representation of each number in the set.
  • 15. HyperLogLog (Flajolet et al. 2008), intuitively: 1. Apply a hash function to each element and take the binary representation of the output 2. If the maximum number of leading zeros observed is n, an estimate for the number of distinct elements in the set is 2^n 3. Account for variance by averaging on subsets
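The three steps above can be sketched in plain Scala as a simplified HyperLogLog. Several choices here are assumptions made for illustration, not details from the deck: the 32-bit `MurmurHash3.stringHash` from the Scala standard library, the class name `Hll`, and the omission of the large-range correction. The "averaging on subsets" is done the way HyperLogLog does it, with a harmonic mean over 2^b registers.

```scala
import scala.util.hashing.MurmurHash3

object HllSketch {
  /** Simplified HyperLogLog over a 32-bit hash, with m = 2^b registers. */
  final class Hll(val b: Int) {
    require(b >= 7 && b <= 16, "the bias constant below assumes m >= 128")
    private val m = 1 << b
    private val registers = new Array[Int](m)

    def add(item: String): Unit = {
      val h = MurmurHash3.stringHash(item)
      val idx = h >>> (32 - b)   // step 1: the first b bits pick a subset (register)
      val rest = h << b          // the remaining 32 - b bits, left-aligned
      // step 2: rank = leading zeros of the remaining bits, plus one
      val rank = math.min(Integer.numberOfLeadingZeros(rest), 32 - b) + 1
      if (rank > registers(idx)) registers(idx) = rank
    }

    /** Two sketches of the same size merge by taking the register-wise maximum. */
    def merge(other: Hll): Hll = {
      require(other.b == b, "register counts must match")
      val out = new Hll(b)
      for (i <- 0 until m) out.registers(i) = math.max(registers(i), other.registers(i))
      out
    }

    def estimate: Double = {
      val alpha = 0.7213 / (1.0 + 1.079 / m)
      // step 3: harmonic mean of the per-register estimates 2^rank
      val raw = alpha * m * m / registers.map(r => math.pow(2.0, -r)).sum
      val zeros = registers.count(_ == 0)
      // small cardinalities: fall back to linear counting on the empty registers
      if (raw <= 2.5 * m && zeros > 0) m * math.log(m.toDouble / zeros) else raw
    }
  }
}
```

With b = 12 the sketch holds 4096 small counters, and the expected relative error is roughly 1.04 / sqrt(4096), about 1.6%. The `merge` method is what makes the structure usable as a monoid in a distributed reduce.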
  • 16. HyperLogLog (with Spark + Algebird)
      import com.twitter.algebird._

      val hll = new HyperLogLogMonoid(12)

      // map each user id to a single-element HLL, then merge them with the monoid's +
      val approxUsers = users.mapPartitions(user => user.map(uuid => hll(uuid.getBytes))).reduce(_ + _)

      // fold the partial HLL of each batch into a running global sketch
      var h = hll.zero
      approxUsers.foreach(rdd => {
        if (rdd.count() != 0) {
          val partial = rdd.first()
          h += partial
        }
      })
  • 17. HyperLogLog (< 2% error rate in 15kB) [Figure: count and memory, exact vs. approximate]
  • 18. Frequency estimation: Top 10 most visited sites (out of a few million)?
  • 19. Count Min Sketch (Cormode and Muthukrishnan, 2005): It’s the hashing trick!
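A minimal plain-Scala sketch of the Count-Min data structure shows where the "hashing trick" comes in: every item is hashed into one counter per row, and a query returns the smallest of those counters. The table dimensions follow the standard bounds, width = ceil(e / eps) and depth = ceil(ln(1 / delta)). Deriving one hash function per row by seeding `MurmurHash3` with the row index is an assumption made for brevity, not a pairwise-independent hash family, and the class and method names are hypothetical.

```scala
import scala.util.hashing.MurmurHash3

object CountMin {
  final class Sketch(eps: Double, delta: Double) {
    val width: Int = math.ceil(math.E / eps).toInt
    val depth: Int = math.ceil(math.log(1.0 / delta)).toInt
    private val table = Array.ofDim[Long](depth, width)

    // one hash function per row, obtained by using the row index as the seed
    private def bucket(row: Int, item: String): Int = {
      val h = MurmurHash3.stringHash(item, row)
      ((h % width) + width) % width   // map a possibly negative hash into [0, width)
    }

    def add(item: String, count: Long = 1L): Unit =
      for (row <- 0 until depth) table(row)(bucket(row, item)) += count

    /** Collisions only ever add to a counter, so the minimum never underestimates. */
    def estimate(item: String): Long =
      (0 until depth).map(row => table(row)(bucket(row, item))).min
  }
}
```

With eps = 0.01 and delta = 1E-3 this is a table of 7 rows by 272 columns, regardless of how many distinct items pass through it.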
  • 20. CMS (with Spark + Algebird)
      import com.twitter.algebird._

      val eps = 0.01
      val delta = 1E-3
      val seed = 1
      val perc = 0.003

      // one sketch per publisher id, merged with the monoid's ++
      val approxImpressions = publishers.mapPartitions(publisher => {
        val cms = new CountMinSketchMonoid(delta, eps, seed, perc)
        publisher.map(publisher_id => cms.create(publisher_id.toLong))
      }).reduce(_ ++ _)

      // fold the partial sketch of each batch into a running global sketch
      var globalCMS = new CountMinSketchMonoid(delta, eps, seed, perc).zero
      approxImpressions.foreach(rdd => {
        if (rdd.count() != 0) {
          val partial = rdd.first()
          globalCMS ++= partial
          // top 5 heavy hitters by estimated frequency
          val globalTopK = globalCMS.heavyHitters.map(id => (id, globalCMS.frequency(id).estimate)).toSeq.sortBy(_._2).reverse.slice(0, 5)
        }
      })
  • 21. CMS results [Figure: exact vs. approximate]
  • 22. Learning from data
  • 23. Iterative methods are hard to scale in MapReduce
  • 24. Using sketches • Linear Regression – OLS + SGD on batches of data – Recursive Least Squares with Forgetting (Vahidi et al. 2005) • Streaming kmeans (Ailon et al. 2009, Shindler et al. 2011, Ostrovsky et al. 2012) – Single iteration-to-convergence – Use sketches to reduce dimensionality (k log N centroids) – Mini batch updates + forgetfulness
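Recursive least squares with forgetting can be sketched in a few lines of plain Scala. This is the textbook exponential-forgetting form, which may differ from the variant in the cited paper. The forgetting factor of 0.99, the initial matrix P = delta * I with delta = 1000, and all names are illustrative assumptions, not values from the deck.

```scala
object ForgettingRls {
  /** Online linear regression; lambda in (0, 1] discounts older observations. */
  final class Rls(dim: Int, lambda: Double = 0.99, delta: Double = 1000.0) {
    val w: Array[Double] = new Array[Double](dim)
    // P tracks the inverse of the discounted feature covariance
    private val p: Array[Array[Double]] =
      Array.tabulate(dim, dim)((i, j) => if (i == j) delta else 0.0)

    def predict(x: Array[Double]): Double =
      (0 until dim).map(i => w(i) * x(i)).sum

    def update(x: Array[Double], y: Double): Unit = {
      // px = P x
      val px = Array.tabulate(dim)(i => (0 until dim).map(j => p(i)(j) * x(j)).sum)
      val denom = lambda + (0 until dim).map(i => x(i) * px(i)).sum
      val gain = px.map(_ / denom)
      val err = y - predict(x)
      for (i <- 0 until dim) w(i) += gain(i) * err
      // P <- (P - gain (P x)^T) / lambda; P is symmetric, so x^T P = (P x)^T
      for (i <- 0 until dim; j <- 0 until dim)
        p(i)(j) = (p(i)(j) - gain(i) * px(j)) / lambda
    }
  }
}
```

Each update costs O(dim^2) time and memory and touches every observation once, which is what makes the method usable on a stream. A lambda below 1 lets the weights follow a drifting relationship instead of averaging over the entire history.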
  • 25. Conclusion • Streaming is part of the broader system • Approximation can help us scale both streaming and batch loads – Make “big data” small – Unification • Data collection and distribution is key – Publishing results follows • Large scale analytics = Architecture + Algos + Data Structures
  • 26. Approximation algorithms for stream and batch processing. Gabriele Modena, Data Scientist, Improve Digital. E: g.modena@improvedigital.com