Realtime Data Analysis Patterns
Mikio Braun 
@mikiobraun 
streamdrill & TU Berlin 
O'Reilly Strata+Hadoop, Barcelona
Nov 21, 2014 
Mikio L. Braun, @mikiobraun Realtime Data Analysis Patterns (c) 2014 by Mikio Braun
How it all started: Realtime Twitter Retweet Trends
Rails app + PostgreSQL
About 100 tweets/second, and it got worse
Road from there 
● Version 1.0: Rails + PostgreSQL 
– store and batch 
● Version 2.0: Scala + Cassandra 
– stream processing & working data on disk 
● Version 3.0: streamdrill 
– “in-memory realtime analytics database” 
– approximate algorithms to bound resources
– moderate parallelism for some things 
Lessons learned? 
Not just one kind of realtime.
Applications 
● Finance
● Gaming
● Monitoring
● Advertisement
● Sensor Networks
● Social Media
Attribution: flickr users kenteegardin, fguillen, torkildr, Docklandsboy, brewbooks, ellbrown, JasonAHowie 
Two Dimensions of Real-Time
● Complexity
– counting
– trends
– outlier detection
– recommendation
– prediction (churn, etc.)
● Latency
– now (ms, RTB)
– seconds (fraud)
– hours (monitoring)
– days (reporting)
What makes realtime hard 
● Many Events 
– 100 events / second 
– 360k per hour 
– 8.6M per day 
– 260M per month 
– 3.2B per year 
● Many Objects 
Image: http://www.flickr.com/photos/arenamontanus/269158554/
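The event volumes on this slide follow from simple multiplication. A minimal sketch of that arithmetic, assuming a constant rate, 30-day months and 365-day years (the function name `event_volumes` is made up for illustration):

```python
# Back-of-the-envelope event volumes for a constant event rate.
# Assumptions (not from the slides): 30-day months, 365-day years.
EVENTS_PER_SECOND = 100


def event_volumes(rate_per_second):
    """Return the number of events per hour, day, month and year."""
    per_hour = rate_per_second * 60 * 60
    per_day = per_hour * 24
    return {
        "hour": per_hour,
        "day": per_day,
        "month": per_day * 30,
        "year": per_day * 365,
    }


volumes = event_volumes(EVENTS_PER_SECOND)
for period, count in volumes.items():
    print(f"{count:,} events per {period}")
```

Under these assumptions the exact figures are 360,000 per hour, 8,640,000 per day, 259,200,000 per month and 3,153,600,000 per year, which the slide rounds to 360k, 8.6M, 260M and 3.2B.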
Classes of Realtime 
● Events per second (100s? 1000s? 10k?) 
● Number of objects (A few dozen? Millions?) 
● Complexity (Counting? Trends?) 
● Latency (Milliseconds? Hours?) 
General Architecture 
Data Acquisition 
● Flat files / HDFS 
● Apache Flume / Logstash 
● Apache Kafka for distributed logging 
Processing 
● Depending on Latency: Batch or Streaming 
● Batch 
– Apache Hadoop 
– Apache Spark 
– Apache Flink 
● Streaming 
– Apache Storm 
– Apache Samza 
Query Layer 
● Hadoop/Storm/Spark have no query layer 
● Use a database backend such as Redis to store the results
Lambda Architecture: Mixing Batch & Streaming
http://lambda-architecture.net/
Kappa Architecture 
http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
Scaling vs. Approximation 
● Scaling is expensive 
● Not all results are relevant 
● Data changes all the time anyway 
● Approximate: trade accuracy for resource usage
Approximation harmful? 
Heavy Hitters
● Count activities over large item sets (millions or more, e.g. IP addresses, Twitter users)
● Interested in most active elements only.
● Fixed tables of counts:
frank 15
paul 12
jan 8
felix 5
leo 3
alex 2
● Case 1: element already in database
– paul: 12 → 13
● Case 2: new element
– nico replaces alex (count 2) and gets count 3
Metwally, Agrawal, El Abbadi, Efficient Computation of Frequent and Top-k Elements in Data Streams, International Conference on Database Theory, 2005
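The two cases above can be sketched in a few lines. This is a minimal illustration of a fixed-size count table in the style of the cited paper, not streamdrill's implementation; the class name `SpaceSaving` and its methods are assumptions, and the linear scan for the smallest entry is chosen for brevity rather than speed.

```python
class SpaceSaving:
    """Fixed-size table of approximate counts (illustrative sketch)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.counts = {}  # element -> approximate count

    def update(self, element):
        if element in self.counts:
            # Case 1: element already in the table, just increment it.
            self.counts[element] += 1
        elif len(self.counts) < self.capacity:
            # Table not yet full: start counting the new element.
            self.counts[element] = 1
        else:
            # Case 2: table is full. Evict the entry with the smallest
            # count; the new element takes over that count plus one.
            victim = min(self.counts, key=self.counts.get)
            smallest = self.counts.pop(victim)
            self.counts[element] = smallest + 1

    def top(self, k):
        """Return the k entries with the largest counts."""
        ranked = sorted(self.counts.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:k]


# Reproduce the example table from the slide.
table = SpaceSaving(capacity=6)
table.counts = {"frank": 15, "paul": 12, "jan": 8, "felix": 5, "leo": 3, "alex": 2}
table.update("paul")  # case 1: paul goes from 12 to 13
table.update("nico")  # case 2: nico replaces alex and gets count 3
print(table.top(3))
```

Because a newcomer inherits the evicted count, counts can overestimate but the table never grows beyond its capacity, which is what bounds the memory.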
Count Min Sketch
● Summarize histograms over large feature sets
● Like bloom filters, but better
● Table of counters: n different hash functions, each with m bins
● Update for new entry: increment one bin per hash function
● Query: Take minimum over all hash functions (query result in the pictured example: 1)
G. Cormode and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications. LATIN 2004, J. Algorithms 55(1): 58-75 (2005).
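A minimal sketch of the update and query steps described above. The class name, the default sizes, and the use of salted SHA-256 to derive one hash function per row are assumptions made for illustration; a production implementation would typically use faster hash functions.

```python
import hashlib


class CountMinSketch:
    """n hash functions (rows) times m bins (columns) of counters."""

    def __init__(self, num_hashes=4, num_bins=1000):
        self.num_hashes = num_hashes
        self.num_bins = num_bins
        self.table = [[0] * num_bins for _ in range(num_hashes)]

    def _bin(self, row, item):
        # Derive a separate hash function per row by salting with the row index.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.num_bins

    def add(self, item, count=1):
        # Update: increment one bin in every row.
        for row in range(self.num_hashes):
            self.table[row][self._bin(row, item)] += count

    def query(self, item):
        # Query: take the minimum over all rows. Collisions can only
        # inflate a bin, so the minimum is the least-inflated estimate.
        return min(
            self.table[row][self._bin(row, item)]
            for row in range(self.num_hashes)
        )


sketch = CountMinSketch()
for name in ["paul"] * 5 + ["frank"] * 3 + ["jan"]:
    sketch.add(name)
print(sketch.query("paul"), sketch.query("nico"))
```

The estimate never falls below the true count and never exceeds the total number of updates, which is why the table size has to be chosen with the expected stream size in mind.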
HyperLogLog
● Hash stream to generate random bit strings
● Look for infrequent events
● If an event has probability one hundredth → should have seen about 100 events on average by the time it occurs.
● Average to improve estimate.
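The idea above can be sketched as follows. The "infrequent event" is a hash value that starts with a long run of zero bits; each register remembers the longest run seen, and the registers are averaged (with a harmonic mean) to stabilize the estimate. The class name, the choice of SHA-1 as hash, and the register count are assumptions for illustration; the constant 0.7213 / (1 + 1.079 / m) is the usual bias correction for m ≥ 128 registers.

```python
import hashlib
import math


class HyperLogLog:
    """Approximate distinct counter with 2**p registers (illustrative sketch)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # Hash the item to a 64-bit random-looking bit string.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # The first p bits select a register.
        index = h >> (64 - self.p)
        # The remaining bits are scanned for the position of the first 1-bit.
        width = 64 - self.p
        rest = h & ((1 << width) - 1)
        rank = width - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        harmonic = sum(2.0 ** -r for r in self.registers)
        estimate = alpha * self.m * self.m / harmonic
        if estimate <= 2.5 * self.m:
            # Small-range correction: fall back to linear counting.
            zeros = self.registers.count(0)
            if zeros:
                estimate = self.m * math.log(self.m / zeros)
        return estimate


hll = HyperLogLog()
for i in range(100_000):
    hll.add(f"user-{i % 50_000}")  # 50,000 distinct values, each seen twice
print(round(hll.count()))
```

With 4096 registers the expected relative error is roughly 1.04 / sqrt(4096), about 1.6%, while memory stays fixed regardless of how many elements pass through.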
Comparing Approx. Algorithms 
● Heavy Hitters: 
– approx. counts + top-k 
– large memory requirement 
● Count Min Sketch:
– approx. counts for all, but no top-k, no elements
– needs to know size beforehand
● HyperLogLog:
– approx. number of distinct elements 
Exponential Decay 
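The transcript preserves only the title of this slide. One common way to realize an exponentially decaying counter is to store a value together with the time of its last update and to decay lazily on access. The sketch below assumes a half-life parameter and is not necessarily how streamdrill implements it; the class name `DecayingCounter` is made up for illustration.

```python
class DecayingCounter:
    """Counter whose value halves every `half_life` time units."""

    def __init__(self, half_life):
        self.half_life = half_life
        self.value = 0.0
        self.last_update = None

    def add(self, now, amount=1.0):
        # Decay the stored value up to `now`, then add the new event.
        self.value = self.read(now) + amount
        self.last_update = now

    def read(self, now):
        # Report the decayed value at time `now` without modifying state.
        if self.last_update is None:
            return 0.0
        elapsed = now - self.last_update
        return self.value * 0.5 ** (elapsed / self.half_life)


counter = DecayingCounter(half_life=60.0)
counter.add(now=0.0)   # value is 1.0
counter.add(now=60.0)  # old event has decayed to 0.5, so value is 1.5
print(counter.read(now=120.0))
```

Because decay is applied only when an entry is touched, old activity fades without a background job sweeping the whole table.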
Beyond Counting 
Streamdrill & Demos 
● Realtime Analysis Solutions 
● Core Engine: 
– Heavy Hitters + exponential decay + secondary indices
– Instant counts & top-k results over time windows 
– In-memory 
– Written in Scala 
● Modules 
– Profiling and Trending 
– Recommendations 
– Count Distinct 
Example: Twitter Stock Analysis 
http://play.streamdrill.com/vis/
Example: Twitter Stock Analysis 
● Trends: 
– symbol:combinations $AAPL:$GOOG 
– symbol:hashtag $AAPL:#trading 
– symbol:keywords $GOOG:disruption 
– symbol:mentions $GOOG:WallStreetCom 
– symbol trend $AAPL 
– symbol:url $FB:http://on.wsj.com/15fHaZW
● Architecture: Twitter sends tweets to the Tweet Analyzer, which sends updates to streamdrill; JavaScript accesses streamdrill via REST
Realtime user profiles 
● Process 10k events / second on one machine 
● Track about 1 Million counts per 1 GB 
● Shard by user for higher accuracy 
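Sharding by user can be sketched as routing every event to a shard chosen by a stable hash of the user id, so that each shard keeps its own bounded table for a subset of users. The function names, the shard count, and the use of CRC32 are assumptions for illustration; plain dictionaries stand in for the per-shard count tables.

```python
import zlib

NUM_SHARDS = 4  # hypothetical number of shards
shards = [dict() for _ in range(NUM_SHARDS)]  # one count table per shard


def shard_for(user_id, num_shards=NUM_SHARDS):
    """Pick a shard with a hash that is stable across processes and runs."""
    return zlib.crc32(user_id.encode("utf-8")) % num_shards


def record_event(user_id, item):
    """Count one (user, item) event in the shard that owns the user."""
    counts = shards[shard_for(user_id)]
    key = (user_id, item)
    counts[key] = counts.get(key, 0) + 1


for user, item in [("frank", "a"), ("paul", "b"), ("frank", "a")]:
    record_event(user, item)

print(shards[shard_for("frank")][("frank", "a")])
```

Since all events of a user land in the same shard, that user's counts compete for table space with fewer other users, which is one plausible reading of why sharding improves accuracy.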
Realtime Data Analysis Patterns 
● Acquisition / Processing / Query Layer 
● Acquisition: Flat files and distributed logs 
● Processing: Scaling batch or streaming 
● Query Layer: Separate query from processing 
● Lambda and Kappa Architecture 
● Approximation as alternative to scaling 
● Trends with indices as building blocks for data analysis
Thank You 
Mikio Braun 
mikio@streamdrill.com 
@mikiobraun 
Mikio L. Braun, @mikiobraun Realtime Data Analysis Patterns (c) 2014 by Mikio Braun

More Related Content

What's hot (20)

Agile metrics
Agile metricsAgile metrics
Agile metrics
Sebastian Radics
 
Achieving Elite and High Performance DevOps Using DORA Metrics
Achieving Elite and High Performance DevOps Using DORA MetricsAchieving Elite and High Performance DevOps Using DORA Metrics
Achieving Elite and High Performance DevOps Using DORA Metrics
Aggregage
 
Application Integration - A Key Component of your Digital Transformation
Application Integration - A Key Component of your Digital TransformationApplication Integration - A Key Component of your Digital Transformation
Application Integration - A Key Component of your Digital Transformation
Safe Software
 
Sami Zahran Quality Gates
Sami Zahran Quality GatesSami Zahran Quality Gates
Sami Zahran Quality Gates
Dr. Sami Zahran
 
Design Thinking Case Studies | In Their Own Words | Ideafarms
Design Thinking Case Studies | In Their Own Words | IdeafarmsDesign Thinking Case Studies | In Their Own Words | Ideafarms
Design Thinking Case Studies | In Their Own Words | Ideafarms
Ideafarms
 
How to Misuse and Abuse DORA Metrics.pptx
How to Misuse and Abuse DORA Metrics.pptxHow to Misuse and Abuse DORA Metrics.pptx
How to Misuse and Abuse DORA Metrics.pptx
Bryan Finster
 
DevOps Transition Strategies
DevOps Transition StrategiesDevOps Transition Strategies
DevOps Transition Strategies
Alec Lazarescu
 
Agile Transformation
Agile TransformationAgile Transformation
Agile Transformation
Max Carlin
 
Reinventing Application Performance Testing with Service Virtualization
Reinventing Application Performance Testing with Service VirtualizationReinventing Application Performance Testing with Service Virtualization
Reinventing Application Performance Testing with Service Virtualization
CA Technologies
 
What do you mean by “API as a Product”?
What do you mean by “API as a Product”?What do you mean by “API as a Product”?
What do you mean by “API as a Product”?
Nordic APIs
 
DevOps
DevOps DevOps
DevOps
Hakan Yüksel
 
Quarterly Based Multiple Project Timeline
Quarterly Based Multiple Project TimelineQuarterly Based Multiple Project Timeline
Quarterly Based Multiple Project Timeline
SlideTeam
 
Agile Metrics
Agile MetricsAgile Metrics
Agile Metrics
nick945
 
10 Business Advantages of DevOps
10 Business Advantages of DevOps10 Business Advantages of DevOps
10 Business Advantages of DevOps
cliqtechno
 
Requirements engineering. IREB practices
Requirements engineering. IREB practicesRequirements engineering. IREB practices
Requirements engineering. IREB practices
Eugene Bulba
 
The Art of Scalability - Managing growth
The Art of Scalability - Managing growthThe Art of Scalability - Managing growth
The Art of Scalability - Managing growth
Lorenzo Alberton
 
Agile evolution lifecycle - From implementing Agile to being Agile
Agile evolution lifecycle - From implementing Agile to being AgileAgile evolution lifecycle - From implementing Agile to being Agile
Agile evolution lifecycle - From implementing Agile to being Agile
Michal Epstein
 
Transforming Organizations with CI/CD
Transforming Organizations with CI/CDTransforming Organizations with CI/CD
Transforming Organizations with CI/CD
Cprime
 
Agile Fundamentals
Agile FundamentalsAgile Fundamentals
Agile Fundamentals
Atlassian
 
Agile
AgileAgile
Agile
Abhinav Regmi
 
Achieving Elite and High Performance DevOps Using DORA Metrics
Achieving Elite and High Performance DevOps Using DORA MetricsAchieving Elite and High Performance DevOps Using DORA Metrics
Achieving Elite and High Performance DevOps Using DORA Metrics
Aggregage
 
Application Integration - A Key Component of your Digital Transformation
Application Integration - A Key Component of your Digital TransformationApplication Integration - A Key Component of your Digital Transformation
Application Integration - A Key Component of your Digital Transformation
Safe Software
 
Sami Zahran Quality Gates
Sami Zahran Quality GatesSami Zahran Quality Gates
Sami Zahran Quality Gates
Dr. Sami Zahran
 
Design Thinking Case Studies | In Their Own Words | Ideafarms
Design Thinking Case Studies | In Their Own Words | IdeafarmsDesign Thinking Case Studies | In Their Own Words | Ideafarms
Design Thinking Case Studies | In Their Own Words | Ideafarms
Ideafarms
 
How to Misuse and Abuse DORA Metrics.pptx
How to Misuse and Abuse DORA Metrics.pptxHow to Misuse and Abuse DORA Metrics.pptx
How to Misuse and Abuse DORA Metrics.pptx
Bryan Finster
 
DevOps Transition Strategies
DevOps Transition StrategiesDevOps Transition Strategies
DevOps Transition Strategies
Alec Lazarescu
 
Agile Transformation
Agile TransformationAgile Transformation
Agile Transformation
Max Carlin
 
Reinventing Application Performance Testing with Service Virtualization
Reinventing Application Performance Testing with Service VirtualizationReinventing Application Performance Testing with Service Virtualization
Reinventing Application Performance Testing with Service Virtualization
CA Technologies
 
What do you mean by “API as a Product”?
What do you mean by “API as a Product”?What do you mean by “API as a Product”?
What do you mean by “API as a Product”?
Nordic APIs
 
Quarterly Based Multiple Project Timeline
Quarterly Based Multiple Project TimelineQuarterly Based Multiple Project Timeline
Quarterly Based Multiple Project Timeline
SlideTeam
 
Agile Metrics
Agile MetricsAgile Metrics
Agile Metrics
nick945
 
10 Business Advantages of DevOps
10 Business Advantages of DevOps10 Business Advantages of DevOps
10 Business Advantages of DevOps
cliqtechno
 
Requirements engineering. IREB practices
Requirements engineering. IREB practicesRequirements engineering. IREB practices
Requirements engineering. IREB practices
Eugene Bulba
 
The Art of Scalability - Managing growth
The Art of Scalability - Managing growthThe Art of Scalability - Managing growth
The Art of Scalability - Managing growth
Lorenzo Alberton
 
Agile evolution lifecycle - From implementing Agile to being Agile
Agile evolution lifecycle - From implementing Agile to being AgileAgile evolution lifecycle - From implementing Agile to being Agile
Agile evolution lifecycle - From implementing Agile to being Agile
Michal Epstein
 
Transforming Organizations with CI/CD
Transforming Organizations with CI/CDTransforming Organizations with CI/CD
Transforming Organizations with CI/CD
Cprime
 
Agile Fundamentals
Agile FundamentalsAgile Fundamentals
Agile Fundamentals
Atlassian
 

Similar to Realtime Data Analysis Patterns (20)

Introduction to Data streaming - 05/12/2014
Introduction to Data streaming - 05/12/2014Introduction to Data streaming - 05/12/2014
Introduction to Data streaming - 05/12/2014
Raja Chiky
 
A Deep dive into Big Data Analytics Data mining
A Deep dive into Big Data Analytics Data miningA Deep dive into Big Data Analytics Data mining
A Deep dive into Big Data Analytics Data mining
theniche69
 
5.1 mining data streams
5.1 mining data streams5.1 mining data streams
5.1 mining data streams
Krish_ver2
 
ACM DEBS 2015: Realtime Streaming Analytics Patterns
ACM DEBS 2015: Realtime Streaming Analytics PatternsACM DEBS 2015: Realtime Streaming Analytics Patterns
ACM DEBS 2015: Realtime Streaming Analytics Patterns
Srinath Perera
 
DEBS 2015 Tutorial : Patterns for Realtime Streaming Analytics
DEBS 2015 Tutorial : Patterns for Realtime Streaming AnalyticsDEBS 2015 Tutorial : Patterns for Realtime Streaming Analytics
DEBS 2015 Tutorial : Patterns for Realtime Streaming Analytics
Sriskandarajah Suhothayan
 
Streamlio and IoT analytics with Apache Pulsar
Streamlio and IoT analytics with Apache PulsarStreamlio and IoT analytics with Apache Pulsar
Streamlio and IoT analytics with Apache Pulsar
Streamlio
 
WSO2Con USA 2017: Analytics Patterns for Your Digital Enterprise
WSO2Con USA 2017: Analytics Patterns for Your Digital EnterpriseWSO2Con USA 2017: Analytics Patterns for Your Digital Enterprise
WSO2Con USA 2017: Analytics Patterns for Your Digital Enterprise
WSO2
 
Analytics Patterns for Your Digital Enterprise
Analytics Patterns for Your Digital EnterpriseAnalytics Patterns for Your Digital Enterprise
Analytics Patterns for Your Digital Enterprise
Sriskandarajah Suhothayan
 
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data TutorialESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial
eswcsummerschool
 
WebAction-Sami Abkay
WebAction-Sami AbkayWebAction-Sami Abkay
WebAction-Sami Abkay
Inside Analysis
 
Is this normal?
Is this normal?Is this normal?
Is this normal?
Theo Schlossnagle
 
Seminaire bigdata23102014
Seminaire bigdata23102014Seminaire bigdata23102014
Seminaire bigdata23102014
Raja Chiky
 
Mining Data Streams
Mining Data StreamsMining Data Streams
Mining Data Streams
SujaAldrin
 
Hadoop at datasift
Hadoop at datasiftHadoop at datasift
Hadoop at datasift
Jairam Chandar
 
Db2425082511
Db2425082511Db2425082511
Db2425082511
IJMER
 
Approximation algorithms for stream and batch processing
Approximation algorithms for stream and batch processingApproximation algorithms for stream and batch processing
Approximation algorithms for stream and batch processing
Gabriele Modena
 
Efficient Way to Identify User Aware Rare Sequential Patterns in Document Str...
Efficient Way to Identify User Aware Rare Sequential Patterns in Document Str...Efficient Way to Identify User Aware Rare Sequential Patterns in Document Str...
Efficient Way to Identify User Aware Rare Sequential Patterns in Document Str...
ijtsrd
 
Mining Stream Data using k-Means clustering Algorithm
Mining Stream Data using k-Means clustering AlgorithmMining Stream Data using k-Means clustering Algorithm
Mining Stream Data using k-Means clustering Algorithm
Manishankar Medi
 
Design and Implementation of A Data Stream Management System
Design and Implementation of A Data Stream Management SystemDesign and Implementation of A Data Stream Management System
Design and Implementation of A Data Stream Management System
Erdi Olmezogullari
 
A Survey on Big Data Mining Challenges
A Survey on Big Data Mining ChallengesA Survey on Big Data Mining Challenges
A Survey on Big Data Mining Challenges
Editor IJMTER
 
Introduction to Data streaming - 05/12/2014
Introduction to Data streaming - 05/12/2014Introduction to Data streaming - 05/12/2014
Introduction to Data streaming - 05/12/2014
Raja Chiky
 
A Deep dive into Big Data Analytics Data mining
A Deep dive into Big Data Analytics Data miningA Deep dive into Big Data Analytics Data mining
A Deep dive into Big Data Analytics Data mining
theniche69
 
5.1 mining data streams
5.1 mining data streams5.1 mining data streams
5.1 mining data streams
Krish_ver2
 
ACM DEBS 2015: Realtime Streaming Analytics Patterns
ACM DEBS 2015: Realtime Streaming Analytics PatternsACM DEBS 2015: Realtime Streaming Analytics Patterns
ACM DEBS 2015: Realtime Streaming Analytics Patterns
Srinath Perera
 
DEBS 2015 Tutorial : Patterns for Realtime Streaming Analytics
DEBS 2015 Tutorial : Patterns for Realtime Streaming AnalyticsDEBS 2015 Tutorial : Patterns for Realtime Streaming Analytics
DEBS 2015 Tutorial : Patterns for Realtime Streaming Analytics
Sriskandarajah Suhothayan
 
Streamlio and IoT analytics with Apache Pulsar
Streamlio and IoT analytics with Apache PulsarStreamlio and IoT analytics with Apache Pulsar
Streamlio and IoT analytics with Apache Pulsar
Streamlio
 
WSO2Con USA 2017: Analytics Patterns for Your Digital Enterprise
WSO2Con USA 2017: Analytics Patterns for Your Digital EnterpriseWSO2Con USA 2017: Analytics Patterns for Your Digital Enterprise
WSO2Con USA 2017: Analytics Patterns for Your Digital Enterprise
WSO2
 
Analytics Patterns for Your Digital Enterprise
Analytics Patterns for Your Digital EnterpriseAnalytics Patterns for Your Digital Enterprise
Analytics Patterns for Your Digital Enterprise
Sriskandarajah Suhothayan
 
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data TutorialESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial
eswcsummerschool
 
Seminaire bigdata23102014
Seminaire bigdata23102014Seminaire bigdata23102014
Seminaire bigdata23102014
Raja Chiky
 
Mining Data Streams
Mining Data StreamsMining Data Streams
Mining Data Streams
SujaAldrin
 
Db2425082511
Db2425082511Db2425082511
Db2425082511
IJMER
 
Approximation algorithms for stream and batch processing
Approximation algorithms for stream and batch processingApproximation algorithms for stream and batch processing
Approximation algorithms for stream and batch processing
Gabriele Modena
 
Efficient Way to Identify User Aware Rare Sequential Patterns in Document Str...
Efficient Way to Identify User Aware Rare Sequential Patterns in Document Str...Efficient Way to Identify User Aware Rare Sequential Patterns in Document Str...
Efficient Way to Identify User Aware Rare Sequential Patterns in Document Str...
ijtsrd
 
Mining Stream Data using k-Means clustering Algorithm
Mining Stream Data using k-Means clustering AlgorithmMining Stream Data using k-Means clustering Algorithm
Mining Stream Data using k-Means clustering Algorithm
Manishankar Medi
 
Design and Implementation of A Data Stream Management System
Design and Implementation of A Data Stream Management SystemDesign and Implementation of A Data Stream Management System
Design and Implementation of A Data Stream Management System
Erdi Olmezogullari
 
A Survey on Big Data Mining Challenges
A Survey on Big Data Mining ChallengesA Survey on Big Data Mining Challenges
A Survey on Big Data Mining Challenges
Editor IJMTER
 
Ad

More from Mikio L. Braun (9)

Bringing ML To Production, What Is Missing? AMLD 2020
Bringing ML To Production, What Is Missing? AMLD 2020Bringing ML To Production, What Is Missing? AMLD 2020
Bringing ML To Production, What Is Missing? AMLD 2020
Mikio L. Braun
 
Academia to industry looking back on a decade of ml
Academia to industry looking back on a decade of mlAcademia to industry looking back on a decade of ml
Academia to industry looking back on a decade of ml
Mikio L. Braun
 
Architecting AI Applications
Architecting AI ApplicationsArchitecting AI Applications
Architecting AI Applications
Mikio L. Braun
 
Machine Learning for Time Series, Strata London 2018
Machine Learning for Time Series, Strata London 2018Machine Learning for Time Series, Strata London 2018
Machine Learning for Time Series, Strata London 2018
Mikio L. Braun
 
Hardcore Data Science - in Practice
Hardcore Data Science - in PracticeHardcore Data Science - in Practice
Hardcore Data Science - in Practice
Mikio L. Braun
 
Data flow vs. procedural programming: How to put your algorithms into Flink
Data flow vs. procedural programming: How to put your algorithms into FlinkData flow vs. procedural programming: How to put your algorithms into Flink
Data flow vs. procedural programming: How to put your algorithms into Flink
Mikio L. Braun
 
Scalable Machine Learning
Scalable Machine LearningScalable Machine Learning
Scalable Machine Learning
Mikio L. Braun
 
Cassandra - An Introduction
Cassandra - An IntroductionCassandra - An Introduction
Cassandra - An Introduction
Mikio L. Braun
 
Cassandra - Eine Einführung
Cassandra - Eine EinführungCassandra - Eine Einführung
Cassandra - Eine Einführung
Mikio L. Braun
 
Bringing ML To Production, What Is Missing? AMLD 2020
Bringing ML To Production, What Is Missing? AMLD 2020Bringing ML To Production, What Is Missing? AMLD 2020
Bringing ML To Production, What Is Missing? AMLD 2020
Mikio L. Braun
 
Academia to industry looking back on a decade of ml
Academia to industry looking back on a decade of mlAcademia to industry looking back on a decade of ml
Academia to industry looking back on a decade of ml
Mikio L. Braun
 
Architecting AI Applications
Architecting AI ApplicationsArchitecting AI Applications
Architecting AI Applications
Mikio L. Braun
 
Machine Learning for Time Series, Strata London 2018
Machine Learning for Time Series, Strata London 2018Machine Learning for Time Series, Strata London 2018
Machine Learning for Time Series, Strata London 2018
Mikio L. Braun
 
Hardcore Data Science - in Practice
Hardcore Data Science - in PracticeHardcore Data Science - in Practice
Hardcore Data Science - in Practice
Mikio L. Braun
 
Data flow vs. procedural programming: How to put your algorithms into Flink
Data flow vs. procedural programming: How to put your algorithms into FlinkData flow vs. procedural programming: How to put your algorithms into Flink
Data flow vs. procedural programming: How to put your algorithms into Flink
Mikio L. Braun
 
Scalable Machine Learning
Scalable Machine LearningScalable Machine Learning
Scalable Machine Learning
Mikio L. Braun
 
Cassandra - An Introduction
Cassandra - An IntroductionCassandra - An Introduction
Cassandra - An Introduction
Mikio L. Braun
 
Cassandra - Eine Einführung
Cassandra - Eine EinführungCassandra - Eine Einführung
Cassandra - Eine Einführung
Mikio L. Braun
 
Ad

Recently uploaded (17)

HPC_Course_Presentation_No_Images included.pptx
HPC_Course_Presentation_No_Images included.pptxHPC_Course_Presentation_No_Images included.pptx
HPC_Course_Presentation_No_Images included.pptx
naziaahmadnm
 
OSI_Security_Architecture Computer Science.pptx
OSI_Security_Architecture Computer Science.pptxOSI_Security_Architecture Computer Science.pptx
OSI_Security_Architecture Computer Science.pptx
faizanaseem873
 
Presentation About The Buttons | Selma SALTIK
Presentation About The Buttons | Selma SALTIKPresentation About The Buttons | Selma SALTIK
Presentation About The Buttons | Selma SALTIK
SELMA SALTIK
 
Networking_Essentials_version_3.0_-_Module_7.pptx
Networking_Essentials_version_3.0_-_Module_7.pptxNetworking_Essentials_version_3.0_-_Module_7.pptx
Networking_Essentials_version_3.0_-_Module_7.pptx
elestirmen
 
ARTIFICIAL INTELLIGENCE.pptx2565567765676
ARTIFICIAL INTELLIGENCE.pptx2565567765676ARTIFICIAL INTELLIGENCE.pptx2565567765676
ARTIFICIAL INTELLIGENCE.pptx2565567765676
areebaimtiazpmas
 
Networking concepts from zero to hero that covers the security aspects
Networking concepts from zero to hero that covers the security aspectsNetworking concepts from zero to hero that covers the security aspects
Networking concepts from zero to hero that covers the security aspects
amansinght675
 
原版西班牙马拉加大学毕业证(UMA毕业证书)如何办理
原版西班牙马拉加大学毕业证(UMA毕业证书)如何办理原版西班牙马拉加大学毕业证(UMA毕业证书)如何办理
原版西班牙马拉加大学毕业证(UMA毕业证书)如何办理
Taqyea
 
5 Reasons cheap WordPress hosting is costing you more | Reversed Out
5 Reasons cheap WordPress hosting is costing you more | Reversed Out5 Reasons cheap WordPress hosting is costing you more | Reversed Out
5 Reasons cheap WordPress hosting is costing you more | Reversed Out
Reversed Out Creative
 
10 Latest Technologies and Their Benefits to End.pptx
10 Latest Technologies and Their Benefits to End.pptx10 Latest Technologies and Their Benefits to End.pptx
10 Latest Technologies and Their Benefits to End.pptx
EphraimOOghodero
 
Frontier Unlimited Internet Setup Step-by-Step Guide.pdf
Frontier Unlimited Internet Setup Step-by-Step Guide.pdfFrontier Unlimited Internet Setup Step-by-Step Guide.pdf
Frontier Unlimited Internet Setup Step-by-Step Guide.pdf
Internet Bundle Now
 
Cloud VPS Provider in India: The Best Hosting Solution for Your Business
Cloud VPS Provider in India: The Best Hosting Solution for Your BusinessCloud VPS Provider in India: The Best Hosting Solution for Your Business
Cloud VPS Provider in India: The Best Hosting Solution for Your Business
DanaJohnson510230
 
How to Make Money as a Cam Model – Tips, Tools & Real Talk
How to Make Money as a Cam Model – Tips, Tools & Real TalkHow to Make Money as a Cam Model – Tips, Tools & Real Talk
How to Make Money as a Cam Model – Tips, Tools & Real Talk
Cam Sites Expert
 
all Practical Project LAST summary note.docx
all Practical Project LAST summary note.docxall Practical Project LAST summary note.docx
all Practical Project LAST summary note.docx
seidjemal94
 
Transport Conjjjjjjjjjjjjjjjjjjjjjjjsulting by Slidesgo.pptx
Transport Conjjjjjjjjjjjjjjjjjjjjjjjsulting by Slidesgo.pptxTransport Conjjjjjjjjjjjjjjjjjjjjjjjsulting by Slidesgo.pptx
Transport Conjjjjjjjjjjjjjjjjjjjjjjjsulting by Slidesgo.pptx
ssuser80a7e81
 
basic to advance network security concepts
basic to advance network security conceptsbasic to advance network security concepts
basic to advance network security concepts
amansinght675
 
AI REPLACING HUMANS /FATHER OF AI/BIRTH OF AI
AI REPLACING HUMANS /FATHER OF AI/BIRTH OF AIAI REPLACING HUMANS /FATHER OF AI/BIRTH OF AI
AI REPLACING HUMANS /FATHER OF AI/BIRTH OF AI
skdav34
 
Essential Tech Stack for Effective Shopify Dropshipping Integration.pdf
Essential Tech Stack for Effective Shopify Dropshipping Integration.pdfEssential Tech Stack for Effective Shopify Dropshipping Integration.pdf
Essential Tech Stack for Effective Shopify Dropshipping Integration.pdf
CartCoders
 
HPC_Course_Presentation_No_Images included.pptx
HPC_Course_Presentation_No_Images included.pptxHPC_Course_Presentation_No_Images included.pptx
HPC_Course_Presentation_No_Images included.pptx
naziaahmadnm
 
OSI_Security_Architecture Computer Science.pptx
OSI_Security_Architecture Computer Science.pptxOSI_Security_Architecture Computer Science.pptx
OSI_Security_Architecture Computer Science.pptx
faizanaseem873
 
Presentation About The Buttons | Selma SALTIK
Presentation About The Buttons | Selma SALTIKPresentation About The Buttons | Selma SALTIK
Presentation About The Buttons | Selma SALTIK
SELMA SALTIK
 
Networking_Essentials_version_3.0_-_Module_7.pptx
Realtime Data Analysis Patterns

  • 1. Realtime Data Analysis Patterns Mikio Braun @mikiobraun streamdrill & TU Berlin O'Reilly Strata+Hadoop, Barcelona Nov 21, 2014 Mikio L. Braun, @mikiobraun Realtime Data Analysis Patterns (c) 2014 by Mikio Braun
  • 2. How it all started: Realtime Twitter Retweet Trends Rails app + PostgreSQL About 100 tweets/second, and it got worse
  • 3. Road from there ● Version 1.0: Rails + PostgreSQL – store and batch ● Version 2.0: Scala + Cassandra – stream processing & working data on disk ● Version 3.0: streamdrill – “in-memory realtime analytics database” – approximate algorithms to bound resources – moderate parallelism for some things
  • 4. Lessons learned? Not just one kind of realtime.
  • 5. Applications Finance, Gaming, Monitoring, Advertisement, Sensor Networks, Social Media Attribution: flickr users kenteegardin, fguillen, torkildr, Docklandsboy, brewbooks, ellbrown, JasonAHowie
  • 6. Two Dimensions of Real-Time Complexity: ● counting ● trends ● outlier detection ● recommendation ● prediction (churn, etc.) Latency: ● now (ms, RTB) ● seconds (fraud) ● hours (monitoring) ● days (reporting)
  • 7. What makes realtime hard ● Many Events – 100 events / second – 360k per hour – 8.6M per day – 260M per month – 3.2B per year ● Many Objects http://www.flickr.com/photos/arenamontanus/269158554/
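
The event totals on slide 7 follow from multiplying the sustained rate by the length of each period. A minimal sketch of that arithmetic, assuming a 30-day month and a 365-day year (the slide does not say which lengths were used); the names `EVENTS_PER_SECOND`, `SECONDS` and `events_per` are illustrative only:

```python
# Sustained event rate taken from the slide.
EVENTS_PER_SECOND = 100

# Period lengths in seconds. The month and year lengths are assumptions.
SECONDS = {
    "hour": 3_600,
    "day": 86_400,
    "month": 30 * 86_400,   # assumed 30-day month
    "year": 365 * 86_400,   # assumed 365-day year
}


def events_per(period, rate=EVENTS_PER_SECOND):
    """Total events accumulated over one period at a constant rate."""
    return rate * SECONDS[period]


if __name__ == "__main__":
    for period in SECONDS:
        print(f"{period:>5}: {events_per(period):,}")
```

Rounded, these totals agree with the slide: 360k per hour, 8.6M per day, about 260M per month and about 3.2B per year.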
  • 8. Classes of Realtime ● Events per second (100s? 1000s? 10k?) ● Number of objects (A few dozen? Millions?) ● Complexity (Counting? Trends?) ● Latency (Milliseconds? Hours?)
  • 9. General Architecture
  • 10. Data Acquisition ● Flat files / HDFS ● Apache Flume / Logstash ● Apache Kafka for distributed logging
  • 11. Processing ● Depending on Latency: Batch or Streaming ● Batch – Apache Hadoop – Apache Spark – Apache Flink ● Streaming – Apache Storm – Apache Samza
  • 12. Query Layer ● Hadoop/Storm/Spark have no query layer ● Use a database backend such as Redis to store the results
  • 13. Lambda Architecture: Mixing Batch & Streaming http://lambda-architecture.net/
  • 14. Kappa Architecture http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
  • 15. Scaling vs. Approximation ● Scaling is expensive ● Not all results are relevant ● Data changes all the time anyway ● Approximate: Trade accuracy for resource usage
  • 16. Approximation harmful?
  • 17. Heavy Hitters ● Count activities over large item sets (millions or more, e.g. IP addresses, Twitter users) ● Interested in the most active elements only. Fixed table of counts: frank 15, paul 12, jan 8, felix 5, leo 3, alex 2. Case 1: element already in the table: paul 12 → 13. Case 2: new element: nico replaces alex (2) and enters with count 3. Metwally, Agrawal, Abbadi, Efficient Computation of Frequent and Top-k Elements in Data Streams, International Conference on Database Theory, 2005
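
The two cases on slide 17 can be sketched in a few lines. This is a simplified illustration in the style of the Space-Saving algorithm from the cited paper, not streamdrill's code. The class name `SpaceSaving` and its methods are hypothetical, and the linear scan for the smallest entry is a shortcut taken to keep the sketch short.

```python
class SpaceSaving:
    """Fixed-size table of approximate counts (Space-Saving style sketch)."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.counts = {}  # item -> approximate count

    def add(self, item):
        if item in self.counts:
            # Case 1: element already in the table, just increment it.
            self.counts[item] += 1
        elif len(self.counts) < self.capacity:
            # Table not full yet: start a new exact count.
            self.counts[item] = 1
        else:
            # Case 2: new element and full table. Evict the smallest entry
            # and let the newcomer inherit its count plus one.
            victim = min(self.counts, key=self.counts.get)
            smallest = self.counts.pop(victim)
            self.counts[item] = smallest + 1

    def top(self, k):
        """Return the k entries with the largest approximate counts."""
        ranked = sorted(self.counts.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:k]


if __name__ == "__main__":
    hh = SpaceSaving(capacity=6)
    # Seed the table with the example counts from the slide.
    hh.counts = {"frank": 15, "paul": 12, "jan": 8, "felix": 5, "leo": 3, "alex": 2}
    hh.add("paul")  # case 1: paul goes from 12 to 13
    hh.add("nico")  # case 2: nico replaces alex and enters with 3
    print(hh.top(3))
```

Inheriting the evicted count means a reported count can overestimate the true count, but never underestimate it, which is why the table is useful for the most active elements and unreliable for the tail.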
  • 18. Count Min Sketch ● Summarize histograms over large feature sets ● Like Bloom filters, but better ● Update: for a new entry, increment one bin per hash function ● Query: Take minimum over all hash functions [Figure: table with one row of m bins for each of n different hash functions; example query result: 1] G. Cormode and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications. LATIN 2004, J. Algorithms 55(1): 58-75 (2005).
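
A minimal sketch of the update and query rules from slide 18, assuming one row of bins per hash function. The class name `CountMinSketch`, the parameters `depth` and `width`, and the use of salted BLAKE2b to derive the hash functions are choices made for this illustration, not details from the talk.

```python
import hashlib


class CountMinSketch:
    """depth rows (one per hash function) of width bins each."""

    def __init__(self, depth, width):
        self.depth = depth
        self.width = width
        self.table = [[0] * width for _ in range(depth)]

    def _bin(self, row, item):
        # Derive a separate hash function per row by salting with the row index.
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        # Update: increment one bin in every row.
        for row in range(self.depth):
            self.table[row][self._bin(row, item)] += count

    def estimate(self, item):
        # Query: collisions only ever add to a bin, so the minimum over
        # all rows is the tightest available upper bound on the true count.
        return min(
            self.table[row][self._bin(row, item)] for row in range(self.depth)
        )


if __name__ == "__main__":
    cms = CountMinSketch(depth=4, width=64)
    for word in ["a", "b", "a", "c", "a"]:
        cms.add(word)
    print(cms.estimate("a"))  # at least 3, exact unless every row collides
```

The sketch stores no keys at all, which matches the comparison on slide 20: it can answer count queries for any element, but it cannot list elements or produce a top-k on its own.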
  • 19. HyperLogLog ● Hash the stream to generate random bit strings ● Look for infrequent events ● If an event has probability one hundredth, you should have seen about 100 events on average by the time it occurs ● Average over several estimates to improve the result.
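
The idea on slide 19 can be made concrete with a compact HyperLogLog sketch. The infrequent event here is a long run of leading zero bits in the hash, and the averaging is a harmonic mean over many registers. The class name `HyperLogLog`, the parameter `p`, the BLAKE2b hash and the restriction to at least 128 registers are assumptions of this illustration. The large-range correction from the published algorithm is left out.

```python
import hashlib
import math


class HyperLogLog:
    """Approximate distinct counter with 2**p registers (sketch, p >= 7)."""

    def __init__(self, p=10):
        if not 7 <= p <= 16:
            raise ValueError("p must be between 7 and 16 in this sketch")
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        # Bias-correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        bucket = h >> (64 - self.p)            # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)  # remaining bits
        # Rank: position of the first 1 bit in the remaining bits. A rank of
        # r has probability 2**-r, so large ranks are the infrequent events.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[bucket] = max(self.registers[bucket], rank)

    def count(self):
        # Harmonic mean over the registers to average the estimates.
        raw = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


if __name__ == "__main__":
    hll = HyperLogLog(p=10)
    for i in range(20_000):
        hll.add(f"user-{i}")
    print(round(hll.count()))  # close to 20000, typically within a few percent
```

With 2**p registers the typical relative error is about 1.04 / sqrt(2**p), so `p=10` gives roughly 3 percent while storing only 1024 small integers.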
  • 20. Comparing Approx. Algorithms ● Heavy Hitters: – approx. counts + top-k – large memory requirement ● Count Min Sketch – approx. counts for all, but no top-k, no elements – needs to know size beforehand ● HyperLogLog – approx. number of distinct elements
  • 21. Exponential Decay
  • 22. Beyond Counting
  • 23. Streamdrill & Demos ● Realtime Analysis Solutions ● Core Engine: – Heavy Hitters + exponential decay + secondary indices – Instant counts & top-k results over time windows – In-memory – Written in Scala ● Modules – Profiling and Trending – Recommendations – Count Distinct
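
Slide 23 lists exponential decay as part of the core engine, but the transcript does not say how it is done. One common way to build such a counter is sketched here, assuming a half-life parameter and timestamps in seconds. The class name `DecayingCounter` and its methods are hypothetical, and nothing here is taken from streamdrill's implementation.

```python
import math


class DecayingCounter:
    """Counter whose value halves every half_life seconds (sketch)."""

    def __init__(self, half_life):
        if half_life <= 0:
            raise ValueError("half_life must be positive")
        self.rate = math.log(2) / half_life
        self.value = 0.0
        self.last = float("-inf")  # timestamp of the most recent update

    def value_at(self, now):
        """Decayed value at time now, without modifying the counter."""
        elapsed = max(0.0, now - self.last)
        return self.value * math.exp(-self.rate * elapsed)

    def add(self, now, weight=1.0):
        # Decay the stored value up to now, then add the new event.
        # Events that arrive out of order are treated as arriving at self.last.
        self.value = self.value_at(now) + weight
        self.last = max(self.last, now)


if __name__ == "__main__":
    c = DecayingCounter(half_life=60.0)
    c.add(0.0)
    c.add(60.0)
    print(c.value_at(120.0))  # the two events have decayed to about 0.75
```

Only one number and one timestamp are stored per counter, so old events fade out without keeping a window of raw events. That is what makes decayed counts a natural fit for a fixed-size heavy-hitters table.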
  • 24. Example: Twitter Stock Analysis http://play.streamdrill.com/vis/
  • 25. Example: Twitter Stock Analysis ● Trends: – symbol:combinations $AAPL:$GOOG – symbol:hashtag $AAPL:#trading – symbol:keywords $GOOG:disruption – symbol:mentions $GOOG:WallStreetCom – symbol trend $AAPL – symbol:url $FB:http://on.wsj.com/15fHaZW
  • 26. Example: Twitter Stock Analysis
  • 27. Example: Twitter Stock Analysis
  • 28. Example: Twitter Stock Analysis [Diagram: Twitter sends tweets to the Tweet Analyzer, which sends updates to streamdrill; JavaScript reads results via REST]
  • 29. Realtime User Profiles
  • 30. Realtime User Profiles
  • 31. Realtime User Profiles
  • 32. Realtime User Profiles
  • 33. Realtime user profiles ● Process 10k events / second on one machine ● Track about 1 million counts per GB ● Shard by user for higher accuracy
  • 34. Realtime Data Analysis Patterns ● Acquisition / Processing / Query Layer ● Acquisition: Flat files and distributed logs ● Processing: Scaling batch or streaming ● Query Layer: Separate query from processing ● Lambda and Kappa Architecture ● Approximation as alternative to scaling ● Trends with indices as building blocks for data analysis
  • 35. Thank You Mikio Braun [email protected] @mikiobraun