Data Stream Algorithms in Storm and R
Radek Maciaszek
Who Am I?
• Radek Maciaszek
• Consulting at DataMine Lab (www.dataminelab.com): data mining, business intelligence and data warehouse consultancy
• Data scientist at a hedge fund in London
• BSc Computer Science, MSc in Cognitive and Decision Sciences, MSc in Bioinformatics
• Has worked with many companies on Big Data and real-time processing projects, including Orange, ad4game, Unanimis, SkimLinks, CognitiveMatch, OpenX and many others
Agenda
• Why streaming algorithms?
• Streaming algorithms crash course
• Apache Storm
• Storm + R – for more traditional statistics
• Use Cases
Data Explosion
• Exponential growth of information
[IDC, 2012] International Data Corporation, a market research company
Data, data everywhere [Economist]
• “In 2013, the available storage capacity could hold 33% of all
data. By 2020, it will be able to store less than 15%” [IDC, 2014]
Data Streams – crash course
• Reasons to use data stream processing
• Data doesn’t fit into available memory and/or disk
• (Near) real-time data processing
• Scalability, cloud processing
• Examples
• Network traffic (ISP)
• Fraud detection
• Web traffic (e.g. online advertising)
Use Case – Dynamic Sampling
• OpenX – the ad server
• Customers with tens of millions of ad views per hour
• Challenge
• Create samples for statistical analysis, e.g. A/B testing, ANOVA, etc.
• How to sample data in real time from an input stream of unknown size
• Solution
• Reservoir Sampling – draws a sample of constant size from a stream with an unknown number of elements
Data Streaming algorithms
• Sampling
• Use a statistic of a sample to estimate the corresponding statistic of the population. The bigger the sample, the better the estimate.
• Reservoir Sampling – sample a population without knowing its size.
• Algorithm:
• Store the first n elements in the reservoir.
• For each subsequent k-th element of the input stream (k > n), insert it at a random position in the reservoir with probability n/k (a decreasing probability)
Source: Maldonado, et al; 2011
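The reservoir sampling steps above can be sketched in a few lines of base R. This is an illustrative sketch, not the implementation used in the talk: the name `make_reservoir` and the simulated stream `1:1000` are assumptions made for the example.

```r
# Reservoir sampling sketch: keep a uniform random sample of n items from a
# stream whose total length is not known in advance.
# make_reservoir is a hypothetical helper name, not from the original slides.
make_reservoir <- function(n) {
  items <- c()   # the reservoir itself
  k <- 0         # number of stream elements seen so far
  add <- function(x) {
    k <<- k + 1
    if (k <= n) {
      items[k] <<- x                  # fill the reservoir with the first n items
    } else if (runif(1) < n / k) {
      items[sample.int(n, 1)] <<- x   # replace a random slot with probability n/k
    }
    invisible(NULL)
  }
  list(add = add,
       sample = function() items,
       seen = function() k)
}

set.seed(1)                     # only to make the illustration repeatable
res <- make_reservoir(5)
for (x in 1:1000) res$add(x)    # simulate a stream of 1000 elements
print(res$sample())
```

The reservoir only ever holds n elements, so memory use stays constant however long the stream runs.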
Moving average
• Example: online computation of the moving average of a time series at time "t"
• MA(t) = (x(t-M+1) + … + x(t)) / M
• Where: M – window size of the moving average
• At time "t", predict "t+1" by removing the oldest element "t-M" from the window and adding element "t"
• Requires the last M elements to be stored in memory
• There are many more algorithms: mean, variance, regression, percentile
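The windowed update described above can be sketched in base R as a closure that keeps the last M values and a running total. The name `make_moving_average` and the sample input are assumptions made for this illustration.

```r
# Online moving average sketch: keeps only the last M observations in memory.
# make_moving_average is a hypothetical helper name, not from the original slides.
make_moving_average <- function(M) {
  window <- numeric(0)   # the last (at most) M observations
  total <- 0             # running sum of the window
  function(x) {
    window <<- c(window, x)
    total <<- total + x
    if (length(window) > M) {
      total <<- total - window[1]   # drop the oldest element from the sum
      window <<- window[-1]
    }
    total / length(window)          # current moving average
  }
}

ma <- make_moving_average(3)
out <- sapply(c(1, 2, 3, 4, 5), ma)
print(out)
```

Until M values have arrived, the sketch averages over whatever it has seen so far; that warm-up behaviour is a design choice of the example, not something stated on the slide.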
Use Case - Counting Unique Users
• Large UK ad-network
• Challenge – calculate the number of unique visitors, one of the most important metrics in online advertising
• Hadoop MapReduce: it worked, but took a long time and too much memory.
• Better solutions:
• Cardinality estimation on stream data, e.g. the HyperLogLog algorithm
• A highly efficient algorithm for counting the number of distinct elements
• Many other use cases:
• ISP estimates of traffic usage
• Cardinality estimation in DB query optimisation
• Algorithm and implementation details
• Transform the input data into i.i.d. (independent and identically distributed) uniform random bits
• Hash(x) = bit1 bit2 …
• Where P(bit1) = P(bit2) = 1/2
• 1xxx -> P = 1/2, n >= 2
• 11xx -> P = 1/4, n >= 4
• 111x -> P = 1/8, n >= 8
• In general: p leading 1s -> P = 1/2^p, n >= 2^p
• Record the biggest p, where p = position of the first "0"
• Flajolet (1983) estimated the bias: E[p] ≈ log2(φ·n), with φ ≈ 0.77351
• 1983 - Flajolet & Martin (one of the first streaming algorithms)
Probabilistic counting
[Figure: unhashed vs. hashed values]
Source: http://git.io/veCtc
Probabilistic counting - algorithm
• Algorithm: p is the position of the first zero in the bitmap
• Estimate the size using: n ≈ 2^p / 0.77351
• R proof-of-concept implementation: http://git.io/ve8Ia
• Example:
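A minimal base R sketch of probabilistic counting in the style of Flajolet & Martin follows. It is not the proof-of-concept linked above. To avoid depending on a hashing package, it assumes the hash values are already available as uniform random integers and simulates them with `sample.int`. The names `rho` and `fm_estimate` are hypothetical, and the correction factor 0.77351 is the constant from the Flajolet & Martin analysis.

```r
# rho: 0-based position of the least significant 1 bit of a positive integer.
rho <- function(h) {
  r <- 0
  while (h %% 2 == 0) {
    h <- h %/% 2
    r <- r + 1
  }
  r
}

# Probabilistic counting with a single bitmap (hypothetical helper name).
fm_estimate <- function(hashes, bits = 32) {
  bitmap <- logical(bits)
  for (h in hashes) {
    bitmap[rho(h) + 1] <- TRUE                         # mark the observed bit position
  }
  p <- match(FALSE, bitmap, nomatch = bits + 1) - 1    # position of the first zero
  2^p / 0.77351                                        # bias-corrected estimate
}

set.seed(42)
# Assumption: these integers stand in for hash values of 10000 distinct items.
hash_values <- sample.int(2^31 - 1, 10000)
# A stream in which the same items (hash values) appear many times.
stream <- sample(hash_values, 50000, replace = TRUE)

est_stream <- fm_estimate(stream)
est_unique <- fm_estimate(unique(stream))
print(c(est_stream, est_unique))
```

Duplicates set bits that are already set, so the estimate depends only on the distinct values. A single bitmap gives only a coarse estimate (a power of two divided by the constant), which is why practical variants average over many bitmaps.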
Can we do better?
• LogLog – instead of keeping track of the whole bitmap of 0s and 1s, keep track only of the position of the largest 0
• This takes about log log n bits, but at the cost of lost precision
• SuperLogLog – discard the largest x% (typically 30%) of the values before estimating; more complex analysis
• HyperLogLog – harmonic mean of estimates
• Fast, cheap and about 98% accurate
• What if you want more traditional statistics?
Reference: Flajolet; Fusy et al. 2007
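The harmonic-mean idea behind HyperLogLog can be sketched in base R as follows. This is a simplified illustration under stated assumptions: hash values are simulated as uniform random 31-bit integers, the small-range and large-range corrections of the published algorithm are omitted, and the rank is taken from the low-order bits (for a uniform hash any fixed bit order behaves the same way). The name `hll_estimate` is hypothetical.

```r
# Simplified HyperLogLog sketch (no small/large range corrections).
# hashes: integers assumed to be uniform over 31 bits; b: number of index bits.
hll_estimate <- function(hashes, b = 10, hash_bits = 31) {
  m <- 2^b                      # number of registers
  registers <- integer(m)
  remaining <- hash_bits - b    # bits left for the rank after the index
  for (h in hashes) {
    j <- h %% m + 1             # register index from the low b bits
    w <- h %/% m                # remaining bits
    rank <- 1L                  # 1-based position of the first 1 bit in w
    while (w %% 2 == 0 && rank <= remaining) {
      w <- w %/% 2
      rank <- rank + 1L
    }
    if (rank > registers[j]) registers[j] <- rank
  }
  alpha <- 0.7213 / (1 + 1.079 / m)         # bias constant, valid for m >= 128
  alpha * m^2 / sum(2^(-registers))         # normalised harmonic mean
}

set.seed(7)
n_distinct <- 100000
# Assumption: these integers stand in for hash values of distinct items.
hashes <- sample.int(2^31 - 1, n_distinct)
hll_est <- hll_estimate(hashes)
print(hll_est)
```

With m = 1024 registers the theoretical standard error is about 1.04 / sqrt(m), roughly 3%, and the registers together need only a few kilobits of memory.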
R – Open Source Statistics
• Open Source = low cost of adoption. Useful for prototyping.
• Large global community - more than 2.5 million users
• ~5,000 open source free packages
• Extensively used for modelling and visualisations
Source: Rexer Analytics
Use Case – Real-time Machine Learning
• Gaming ad-network
• 150m+ ad impressions per day
• Lambda architecture (fast and batch layers): Storm used in
parallel to Hadoop
• Challenge
• Decide in real time which ad to display, versus the old system that made decisions once every hour
• Use a sophisticated statistical environment for A/B testing
• Solution
• Use the Beta distribution to compare the effectiveness of the ads
• Use Storm to do real-time statistics
Use Case – Beta Distributions
• Comparing two ads:
• Ratio: CTR = Clicks / Views
• Wolfram Alpha: beta distribution (5, (30-5))
Source: Wolfram Alpha
Beta distributions prototyping – the R code
• Bootstrapping in R
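The R code from this slide did not survive extraction. The sketch below is therefore not the author's code and uses Monte Carlo draws from Beta distributions rather than bootstrapping. It follows the parametrisation shown above, Beta(clicks, views - clicks). The function name `compare_ads` and the figures for the second ad (20 clicks out of 40 views) are hypothetical; only 5 clicks out of 30 views comes from the slide.

```r
# Estimate the probability that ad A has a higher click-through rate than ad B
# by sampling from Beta(clicks, views - clicks) for each ad.
# compare_ads is a hypothetical helper name, not from the original slides.
compare_ads <- function(clicks_a, views_a, clicks_b, views_b, draws = 10000) {
  ctr_a <- rbeta(draws, clicks_a, views_a - clicks_a)
  ctr_b <- rbeta(draws, clicks_b, views_b - clicks_b)
  mean(ctr_a > ctr_b)   # share of draws in which A beats B
}

set.seed(123)
# Ad A: 5 clicks / 30 views (from the slide); ad B: 20 / 40 (made up for illustration).
p_a_better <- compare_ads(5, 30, 20, 40)
print(p_a_better)
```

A value close to 0 or 1 means one ad clearly outperforms the other; values near 0.5 suggest that more data is needed before switching.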
Apache Storm
• Real-time calculations – the Hadoop of real time
• Fault tolerance, easy to scale
• Easy to develop: has local and distributed modes
• Storm multi-lang can be used with any language, including R
Storm Architecture
• Nimbus
• Master - equivalent of Hadoop JobTracker
• Distributes workload across cluster
• Heartbeat, reallocation of workers when needed
• Supervisor
• Runs the workers
• Communicates with Nimbus using ZooKeeper
• ZooKeeper
• Coordination, node discovery
Source: Apache Storm
Storm Topology
Image source: Storm github wiki
Can integrate with third-party languages and databases:
• Java
• Python
• Ruby
• Redis
• HBase
• Cassandra
• Graph of stream computations
• Basic primitive nodes
• Spout – source of streams (Twitter API, queue, logs)
• Bolt – consumes streams, does the work, produces streams
• Storm Trident
Storm + R
• Storm Multi-Language protocol
• Multiple Storm-R multi-language packages provide Storm/R plumbing
• Recommended package: http://cran.r-project.org/web/packages/Storm
• Example R code
Storm and R
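library(Storm)

# Create the Storm wrapper; tuples arrive via the multi-language protocol.
storm = Storm$new();
storm$lambda = function(s) {
  t = s$tuple;
  t$output = vector(mode="character", length=1);
  clicks = as.numeric(t$input[1]);
  views = as.numeric(t$input[2]);
  # Draw one sample from the Beta distribution of the click-through rate.
  t$output[1] = rbeta(1, clicks, views - clicks);
  s$emit(t);
  # Alternative: mark the tuple as failed instead of emitting it.
  # s$fail(t);
}
storm$run();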
Storm and Java integration
• Define Spout/Bolt in any programming language
• Executed as subprocess – JSON over stdin/stdout
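public static class RBolt extends ShellBolt implements IRichBolt {
    public RBolt() {
        // Launch script.R as a subprocess; tuples are exchanged as JSON over stdin/stdout.
        super("Rscript", "script.R");
    }
    // declareOutputFields(...) and getComponentConfiguration() still need to be implemented.
}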
Source: Apache Storm
Storm + R = flexibility
• Integration with existing Storm ecosystem – NoSQL, Kafka
• SOA framework - DRPC
• Scaling up your existing R processes
• Trident
Source: Apache Storm
Storm References
• https://storm.apache.org
• Storm and Java implementations of stream algorithms:
• https://github.com/addthis/stream-lib
• https://github.com/aggregateknowledge/java-hll
• https://github.com/pmerienne/trident-ml
Thank you
• Summary:
• Data stream algorithms
• Storm – can be used with stream algorithms
• Storm + R – for more traditional statistics
• Questions and discussion
• https://uk.linkedin.com/in/radekmaciaszek
• http://www.dataminelab.com

Ethical Frameworks for Trustworthy AI – Opportunities for Researchers in Huma...
Karim Baïna
 
Cyber Security Presentation(Neon)xu.pptx
Cyber Security Presentation(Neon)xu.pptxCyber Security Presentation(Neon)xu.pptx
Cyber Security Presentation(Neon)xu.pptx
vilakshbhargava
 
How to Choose the Right Online Proofing Software
How to Choose the Right Online Proofing SoftwareHow to Choose the Right Online Proofing Software
How to Choose the Right Online Proofing Software
skalatskayaek
 
delta airlines new york office (Airwayscityoffice)
delta airlines new york office (Airwayscityoffice)delta airlines new york office (Airwayscityoffice)
delta airlines new york office (Airwayscityoffice)
jamespromind
 
HPC High Performance Course Presentation.pptx
HPC High Performance Course Presentation.pptxHPC High Performance Course Presentation.pptx
HPC High Performance Course Presentation.pptx
naziaahmadnm
 
time_series_forecasting_constructor_uni.pptx
time_series_forecasting_constructor_uni.pptxtime_series_forecasting_constructor_uni.pptx
time_series_forecasting_constructor_uni.pptx
stefanopinto1113
 
Tableau Finland User Group June 2025.pdf
Tableau Finland User Group June 2025.pdfTableau Finland User Group June 2025.pdf
Tableau Finland User Group June 2025.pdf
elinavihriala
 
一比一原版(USC毕业证)南加利福尼亚大学毕业证如何办理
一比一原版(USC毕业证)南加利福尼亚大学毕业证如何办理一比一原版(USC毕业证)南加利福尼亚大学毕业证如何办理
一比一原版(USC毕业证)南加利福尼亚大学毕业证如何办理
Taqyea
 
Artificial-Intelligence-in-Autonomous-Vehicles (1)-1.pptx
Artificial-Intelligence-in-Autonomous-Vehicles (1)-1.pptxArtificial-Intelligence-in-Autonomous-Vehicles (1)-1.pptx
Artificial-Intelligence-in-Autonomous-Vehicles (1)-1.pptx
AbhijitPal87
 
BADS-MBA-Unit 1 that what data science and Interpretation
BADS-MBA-Unit 1 that what data science and InterpretationBADS-MBA-Unit 1 that what data science and Interpretation
BADS-MBA-Unit 1 that what data science and Interpretation
srishtisingh1813
 
Arrays in c programing. practicals and .ppt
Arrays in c programing. practicals and .pptArrays in c programing. practicals and .ppt
Arrays in c programing. practicals and .ppt
Carlos701746
 
IST606_SecurityManagement-slides_ 4 pdf
IST606_SecurityManagement-slides_ 4  pdfIST606_SecurityManagement-slides_ 4  pdf
IST606_SecurityManagement-slides_ 4 pdf
nwanjamakane
 
Human body make Structure analysis the part of the human
Human body make Structure analysis the part of the humanHuman body make Structure analysis the part of the human
Human body make Structure analysis the part of the human
ankit392215
 
GROUP 7 CASE STUDY Real Life Incident.pptx
GROUP 7 CASE STUDY Real Life Incident.pptxGROUP 7 CASE STUDY Real Life Incident.pptx
GROUP 7 CASE STUDY Real Life Incident.pptx
mardoglenn21
 
lecture 33333222234555555555555555556.pptx
lecture 33333222234555555555555555556.pptxlecture 33333222234555555555555555556.pptx
lecture 33333222234555555555555555556.pptx
obsinaafilmakuush
 
LECT CONCURRENCY………………..pdf document or power point
LECT CONCURRENCY………………..pdf document or power pointLECT CONCURRENCY………………..pdf document or power point
LECT CONCURRENCY………………..pdf document or power point
nwanjamakane
 
Chapter 5.1.pptxsertj you can get it done before the election and I will
Chapter 5.1.pptxsertj you can get it done before the election and I willChapter 5.1.pptxsertj you can get it done before the election and I will
Chapter 5.1.pptxsertj you can get it done before the election and I will
SotheaPheng
 
"Machine Learning in Agriculture: 12 Production-Grade Models", Danil Polyakov
"Machine Learning in Agriculture: 12 Production-Grade Models", Danil Polyakov"Machine Learning in Agriculture: 12 Production-Grade Models", Danil Polyakov
"Machine Learning in Agriculture: 12 Production-Grade Models", Danil Polyakov
Fwdays
 
Glary Utilities Pro 5.157.0.183 Crack + Key Download [Latest]
Glary Utilities Pro 5.157.0.183 Crack + Key Download [Latest]Glary Utilities Pro 5.157.0.183 Crack + Key Download [Latest]
Glary Utilities Pro 5.157.0.183 Crack + Key Download [Latest]
Designer
 
15 Benefits of Data Analytics in Business Growth.pdf
15 Benefits of Data Analytics in Business Growth.pdf15 Benefits of Data Analytics in Business Growth.pdf
15 Benefits of Data Analytics in Business Growth.pdf
AffinityCore
 
Ethical Frameworks for Trustworthy AI – Opportunities for Researchers in Huma...
Ethical Frameworks for Trustworthy AI – Opportunities for Researchers in Huma...Ethical Frameworks for Trustworthy AI – Opportunities for Researchers in Huma...
Ethical Frameworks for Trustworthy AI – Opportunities for Researchers in Huma...
Karim Baïna
 
Cyber Security Presentation(Neon)xu.pptx
Cyber Security Presentation(Neon)xu.pptxCyber Security Presentation(Neon)xu.pptx
Cyber Security Presentation(Neon)xu.pptx
vilakshbhargava
 
How to Choose the Right Online Proofing Software
How to Choose the Right Online Proofing SoftwareHow to Choose the Right Online Proofing Software
How to Choose the Right Online Proofing Software
skalatskayaek
 
delta airlines new york office (Airwayscityoffice)
delta airlines new york office (Airwayscityoffice)delta airlines new york office (Airwayscityoffice)
delta airlines new york office (Airwayscityoffice)
jamespromind
 
HPC High Performance Course Presentation.pptx
HPC High Performance Course Presentation.pptxHPC High Performance Course Presentation.pptx
HPC High Performance Course Presentation.pptx
naziaahmadnm
 
time_series_forecasting_constructor_uni.pptx
time_series_forecasting_constructor_uni.pptxtime_series_forecasting_constructor_uni.pptx
time_series_forecasting_constructor_uni.pptx
stefanopinto1113
 
Tableau Finland User Group June 2025.pdf
Tableau Finland User Group June 2025.pdfTableau Finland User Group June 2025.pdf
Tableau Finland User Group June 2025.pdf
elinavihriala
 
一比一原版(USC毕业证)南加利福尼亚大学毕业证如何办理
一比一原版(USC毕业证)南加利福尼亚大学毕业证如何办理一比一原版(USC毕业证)南加利福尼亚大学毕业证如何办理
一比一原版(USC毕业证)南加利福尼亚大学毕业证如何办理
Taqyea
 

Data Stream Algorithms in Storm and R

  • 1. Data Stream Algorithms in Storm and R Radek Maciaszek
  • 2. Who Am I? • Radek Maciaszek • Consulting at DataMine Lab (www.dataminelab.com) - data mining, business intelligence and data warehouse consultancy. • Data scientist at a hedge fund in London • BSc Computer Science, MSc in Cognitive and Decisions Sciences, MSc in Bioinformatics • Over my career I have worked with many companies on Big Data and real-time processing projects, including Orange, ad4game, Unanimis, SkimLinks, CognitiveMatch, OpenX and many others.
  • 3. Agenda • Why streaming algorithms? • Streaming algorithms crash course • Apache Storm • Storm + R – for more traditional statistics • Use Cases
  • 4. Data Explosion • Exponential growth of information [IDC, 2012] (International Data Corporation, a market research company)
  • 5. Data, data everywhere [Economist] • “In 2013, the available storage capacity could hold 33% of all data. By 2020, it will be able to store less than 15%” [IDC, 2014]
  • 6. Data Streams – crash course • Reasons to use data stream processing • Data doesn’t fit into available memory and/or disk • (Near) real-time data processing • Scalability, cloud processing • Examples • Network traffic (ISP) • Fraud detection • Web traffic (e.g. online advertising)
  • 7. Use Case – Dynamic Sampling • OpenX – the ad server • Customers with tens of millions of ad views per hour • Challenge • Create samples for statistical analysis, e.g. A/B testing, ANOVA, etc. • How to sample data in real time from an input stream of unknown size • Solution • Reservoir Sampling – draws a fixed-size sample from a stream with an unknown number of elements
  • 8. Data Streaming algorithms • Sampling • Use a statistic of a sample to estimate the statistic of the population. The bigger the sample, the better the estimate. • Reservoir Sampling – sample a population without knowing its size. • Algorithm: • Store the first n elements in the reservoir. • Insert the k-th element of the input stream (k > n) into a random slot of the reservoir with probability n/k (decreasing probability) Source: Maldonado, et al; 2011
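The two algorithm steps on this slide can be sketched in Java roughly as follows. This is a minimal illustration, not the implementation used at OpenX: the class name `ReservoirSampler` and its methods are hypothetical, and the seed parameter exists only to make runs repeatable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical helper for illustration; not part of Storm or any library.
class ReservoirSampler<T> {
    private final int capacity;       // n: fixed sample size
    private final List<T> reservoir;  // the current sample
    private final Random random;
    private long seen = 0;            // k: number of elements observed so far

    ReservoirSampler(int capacity, long seed) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
        this.reservoir = new ArrayList<>(capacity);
        this.random = new Random(seed);
    }

    // Offer the next stream element to the sampler.
    void offer(T item) {
        seen++;
        if (reservoir.size() < capacity) {
            reservoir.add(item);      // store the first n elements
            return;
        }
        // Draw j uniformly from [0, k); j < n happens with probability n/k.
        long j = (long) (random.nextDouble() * seen);
        if (j < capacity) {
            reservoir.set((int) j, item); // j doubles as a uniformly random slot
        }
    }

    // Copy of the current sample.
    List<T> sample() {
        return new ArrayList<>(reservoir);
    }

    long seen() {
        return seen;
    }
}
```

Memory use stays at n elements regardless of stream length, which is what makes the technique usable on an unbounded input.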
  • 9. Moving average • Example: online mean of the moving average of a time series at time “t” • Where: M – window size of the moving average. • At time “t”, predict “t+1” by removing the oldest element (“t-M”) from the window and adding element “t”. • Requires the last M elements to be stored in memory • There are many more algorithms: mean, variance, regression, percentile
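The slide's formula did not survive in the transcript, so the following Java sketch assumes a plain unweighted moving average over the last M values. The class name `MovingAverage` is hypothetical. It keeps a running sum so each update costs constant time.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper for illustration: unweighted moving average over a window of M values.
class MovingAverage {
    private final int window;                              // M: window size
    private final Deque<Double> values = new ArrayDeque<>(); // the last M elements
    private double sum = 0.0;                              // running sum of the window

    MovingAverage(int window) {
        if (window <= 0) {
            throw new IllegalArgumentException("window must be positive");
        }
        this.window = window;
    }

    // Add the element at time t and return the updated average.
    double add(double x) {
        values.addLast(x);
        sum += x;
        if (values.size() > window) {
            sum -= values.removeFirst();                   // drop the element at t-M
        }
        return sum / values.size();
    }
}
```

On very long streams the running sum can accumulate floating-point error, so recomputing it from the stored window from time to time is a reasonable precaution.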
  • 10. Use Case - Counting Unique Users • Large UK ad-network • Challenge – calculate the number of unique visitors - one of the most important metrics in online advertising • Hadoop MapReduce. It worked but took a long time and too much memory. • Better solutions: • Cardinality estimation on stream data, e.g. HyperLogLog algorithm • Highly effective algorithm to count the number of distinct elements • Many other use cases: • ISP estimates of traffic usage • Cardinality in DB query optimisation • Algorithm and implementation details
  • 11. Probabilistic counting • 1983 - Flajolet & Martin (first streaming algorithm) • Transform input data into i.i.d. (independent and identically distributed) uniform random bits of information • Hash(x) = bit1 bit2 … • Where P(bit1) = P(bit2) = 1/2 • 1xxx -> P = 1/2, n >= 2; 11xx -> P = 1/4, n >= 4; 111x -> P = 1/8, n >= 8; in general, k leading 1s -> P = 1/2^k, n >= 2^k • p = position of the first “0” • Record the biggest p • Flajolet (1983) estimated the bias • [Figure: unhashed vs. hashed values] Source: http://git.io/veCtc
  • 12. Probabilistic counting - algorithm • Algorithm: p – calculates position of first zero in the bitmap • Estimate the size using: • R proof-of-concept implementation: http://git.io/ve8Ia • Example:
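The estimator formula and the example on this slide were images and are missing from the transcript. As a stand-in, here is a minimal Java sketch of Flajolet-Martin probabilistic counting. Several choices are assumptions rather than the author's code: the class name `ProbabilisticCounter` is hypothetical, the hash is a SplitMix64-style bit mixer, the correction constant 0.77351 is the one from the published algorithm, and estimates are averaged over several bitmaps (the stochastic-averaging variant from the same paper) to reduce variance. It counts runs of 1s in the low-order bits, which has the same distribution as the leading-bit pattern shown on the previous slide.

```java
// Hypothetical sketch of Flajolet-Martin probabilistic counting with stochastic averaging.
class ProbabilisticCounter {
    private static final double PHI = 0.77351; // bias correction from the published algorithm
    private final int[] bitmaps;               // one 32-bit bitmap per bucket

    ProbabilisticCounter(int buckets) {
        if (buckets <= 0) {
            throw new IllegalArgumentException("buckets must be positive");
        }
        this.bitmaps = new int[buckets];
    }

    // SplitMix64-style finalizer, used here as a stand-in for a uniform hash.
    static long mix64(long z) {
        z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
        z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
        return z ^ (z >>> 31);
    }

    void add(long item) {
        long h = mix64(item + 0x9e3779b97f4a7c15L);
        int bucket = (int) Long.remainderUnsigned(h, bitmaps.length);
        long rest = Long.divideUnsigned(h, bitmaps.length);
        // Number of trailing 1s, i.e. the position of the first 0 (capped to fit the bitmap).
        int p = Math.min(Long.numberOfTrailingZeros(~rest), 31);
        bitmaps[bucket] |= (1 << p);
    }

    double estimate() {
        double sumR = 0.0;
        for (int bitmap : bitmaps) {
            // R = position of the first zero in this bucket's bitmap.
            sumR += Integer.numberOfTrailingZeros(~bitmap);
        }
        double meanR = sumR / bitmaps.length;
        return bitmaps.length / PHI * Math.pow(2.0, meanR);
    }
}
```

The sketch is biased for very small cardinalities (an empty counter does not report zero), so it is only meaningful once the stream holds many more distinct items than there are buckets.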
  • 13. Can we do better? • LogLog – instead of keeping track of all 0s and 1s, keep track only of the largest observed position • This takes log log n bits, at the cost of lost precision • SuperLogLog – discard the largest values (typically keeping only the smallest 70%) before estimating; more complex analysis • HyperLogLog – harmonic mean of estimates • Fast, cheap and 98% correct • What if you want more traditional statistics? Reference: Flajolet; Fusy et al. 2007
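To make the "harmonic mean of estimates" point concrete, here is a compact Java sketch of HyperLogLog. It is an illustration under stated assumptions, not a production implementation (the stream-lib and java-hll projects listed in the references are the ones to use in practice): the class name `HyperLogLogSketch` is hypothetical, the hash is a SplitMix64-style mixer, and the constants follow the published algorithm for 128 or more registers.

```java
// Hypothetical, simplified HyperLogLog sketch for illustration only.
class HyperLogLogSketch {
    private final int b;             // number of index bits
    private final int m;             // number of registers, m = 2^b
    private final byte[] registers;  // largest rank seen per register

    HyperLogLogSketch(int b) {
        if (b < 7 || b > 16) {
            throw new IllegalArgumentException("b must be in [7, 16]");
        }
        this.b = b;
        this.m = 1 << b;
        this.registers = new byte[m];
    }

    // SplitMix64-style finalizer, used here as a stand-in for a uniform hash.
    static long mix64(long z) {
        z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
        z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
        return z ^ (z >>> 31);
    }

    void add(long item) {
        long h = mix64(item + 0x9e3779b97f4a7c15L);
        int idx = (int) (h >>> (64 - b));   // first b bits select the register
        long rest = h << b;                 // remaining bits, left-aligned
        // Rank = position of the first 1-bit in the remaining bits.
        int rank = Math.min(Long.numberOfLeadingZeros(rest), 64 - b) + 1;
        if (rank > registers[idx]) {
            registers[idx] = (byte) rank;
        }
    }

    double estimate() {
        double sum = 0.0;
        int zeros = 0;
        for (byte r : registers) {
            sum += Math.pow(2.0, -r);       // terms of the harmonic mean
            if (r == 0) {
                zeros++;
            }
        }
        double alpha = 0.7213 / (1.0 + 1.079 / m); // constant for m >= 128
        double raw = alpha * m * (double) m / sum;
        if (raw <= 2.5 * m && zeros > 0) {
            // Small-range correction: fall back to linear counting.
            return m * Math.log((double) m / zeros);
        }
        return raw;
    }
}
```

With b = 12 the sketch uses 4096 one-byte registers and has a typical relative error of about 1.04/sqrt(4096), roughly 1.6%, which is in line with the accuracy figure quoted on the slide.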
  • 14. R – Open Source Statistics • Open Source = low cost of adopting. Useful in prototyping. • Large global community - more than 2.5 million users • ~5,000 open source free packages • Extensively used for modelling and visualisations Source: Rexer Analytics
  • 15. Use Case – Real-time Machine Learning • Gaming ad-network • 150m+ ad impressions per day • Lambda architecture (fast and batch layers): Storm used in parallel to Hadoop • Challenge • Make real-time decisions on which ad to display, versus the old system, which made decisions once an hour • Use sophisticated statistical environment for A/B testing • Solution • Beta Distribution to compare effectiveness of the ads • Use Storm to do real-time statistics
  • 16. Use Case – Beta Distributions • Comparing two ads: • Ratio: CTR = Clicks / Views • Wolfram Alpha: beta distribution (5, (30-5)) Source: Wolfram Alpha
  • 17. Beta distributions prototyping – the R code • Bootstrapping in R
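The R code on this slide was an image and is not in the transcript. The idea of comparing two ads through Beta(clicks, views - clicks) distributions can be sketched in Java as below. The class and method names are hypothetical. Because the Java standard library has no Beta sampler, the sketch relies on the fact that, for positive integers a and b, the a-th smallest of a + b - 1 uniform draws follows Beta(a, b); it therefore needs at least one click and one non-click per ad.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical helper for illustration: Monte Carlo comparison of two ads' click-through rates.
class BetaCtrComparison {

    // Draw from Beta(a, b) for positive integers a and b using order statistics.
    static double sampleBeta(int a, int b, Random rng) {
        if (a < 1 || b < 1) {
            throw new IllegalArgumentException("a and b must be at least 1");
        }
        double[] u = new double[a + b - 1];
        for (int i = 0; i < u.length; i++) {
            u[i] = rng.nextDouble();
        }
        Arrays.sort(u);
        return u[a - 1];   // a-th smallest uniform ~ Beta(a, b)
    }

    // Estimate P(CTR of ad A > CTR of ad B) from click and view counts.
    static double probabilityABeatsB(int clicksA, int viewsA,
                                     int clicksB, int viewsB,
                                     int draws, long seed) {
        Random rng = new Random(seed);
        int wins = 0;
        for (int i = 0; i < draws; i++) {
            double ctrA = sampleBeta(clicksA, viewsA - clicksA, rng);
            double ctrB = sampleBeta(clicksB, viewsB - clicksB, rng);
            if (ctrA > ctrB) {
                wins++;
            }
        }
        return (double) wins / draws;
    }
}
```

For example, `BetaCtrComparison.probabilityABeatsB(50, 100, 5, 100, 2000, 7L)` compares an ad with 50 clicks in 100 views against one with 5 clicks in 100 views. Sorting makes each draw cost O((a + b) log(a + b)), so this approach suits prototyping with modest counts rather than large-scale use.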
  • 18. Apache Storm • Real-time calculations – the Hadoop of real time • Fault tolerance, easy to scale • Easy to develop - has local and distributed mode • Storm multi-lang can be used with any language, including R Getty Images
  • 19. Storm Architecture • Nimbus • Master - equivalent of Hadoop JobTracker • Distributes workload across cluster • Heartbeat, reallocation of workers when needed • Supervisor • Runs the workers • Communicates with Nimbus using ZK • Zookeeper • coordination, nodes discovery Source: Apache Storm
  • 20. Storm Topology • Graph of stream computations • Basic primitive nodes • Spout – source of streams (Twitter API, queue, logs) • Bolt – consumes streams, does the work, produces streams • Storm Trident • Can integrate with third-party languages and databases: Java, Python, Ruby, Redis, HBase, Cassandra • Image source: Storm GitHub wiki
  • 21. Storm + R • Storm Multi-Language protocol • Multiple Storm-R multi-language packages provide Storm/R plumbing • Recommended package: http://cran.r-project.org/web/packages/Storm • Example R code
  • 22. Storm and R
      storm = Storm$new();
      storm$lambda = function(s) {
        t = s$tuple;
        t$output = vector(mode="character", length=1);
        clicks = as.numeric(t$input[1]);
        views = as.numeric(t$input[2]);
        # draw one sample from Beta(clicks, views - clicks)
        t$output[1] = rbeta(1, clicks, views - clicks);
        s$emit(t);
        # alternative: mark the tuple as failed instead of emitting it.
        # s$fail(t);
      }
      storm$run();
  • 23. Storm and Java integration • Define Spout/Bolt in any programming language • Executed as subprocess – JSON over stdin/stdout public static class RBolt extends ShellBolt implements IRichBolt { public RBolt() { super("Rscript", "script.R"); } } Source: Apache Storm
  • 24. Storm + R = flexibility • Integration with existing Storm ecosystem – NoSQL, Kafka • SOA framework - DRPC • Scaling up your existing R processes • Trident Source: Apache Storm
  • 25. Storm References • https://storm.apache.org • Storm and Java stream algorithms implementations: • https://github.com/addthis/stream-lib • https://github.com/aggregateknowledge/java-hll • https://github.com/pmerienne/trident-ml
  • 26. Thank you • Summary: • Data stream algorithms • Storm – can be used with stream algorithms • Storm + R – more traditional • Questions and discussion • https://uk.linkedin.com/in/radekmaciaszek • http://www.dataminelab.com