Aggregation Computation Over Distributed Data Streams (partial content)
Yueshen Xu
Middleware, CCNT, Zhejiang University
12/15/11
Paper reference
What's Different: Distributed, Continuous Monitoring of Duplicate-Resilient Aggregates on Data Streams
Published in ICDE, 2006. Cited 61 times. By Graham Cormode, S. Muthukrishnan, et al.
Bell Lab Expert/27, Rutgers Expert/45
I think it is a good read for beginners in distributed data streams.
Background
Distributed data streams: where and why?
Large-scale monitoring applications. Just one example: many sensors distributed over a wide area.
Distributed streaming model
Query paradigm: centralized vs. decentralized
Constraints and Features
Constraints (all resources are restricted):
Space: embedded devices do not have enough memory.
Processing power: limited for the same reason.
Communication capability: unreliable, spotty and sporadic.
Features: queries are continuous, unlike the ad hoc queries in a DBMS.
What's different?
Trouble
Duplication: the root of all evil.
Why? Wide-scale monitoring invariably encounters the same events at different points.
Instances: the same flow is observed at different routers; the same individual is observed by several mobile sensors.
Requirement: duplicate-resilient aggregates.
Two vital questions:
What is the amount of duplication in the network?
What are the versions of classical aggregates in the presence of duplicates?
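The two questions can be made concrete with a tiny Python sketch. The router names and flow IDs below are made up for illustration and do not come from the paper; exact sets are used only because the data is small.

```python
# Hypothetical observations: flow IDs seen at two routers (illustrative data only).
router_a = ["f1", "f2", "f3", "f2"]
router_b = ["f2", "f3", "f4"]

# Duplicate-sensitive aggregate: adding per-router counts counts shared flows repeatedly.
naive_count = len(router_a) + len(router_b)

# Duplicate-resilient aggregate: the number of distinct flows across the network.
distinct_count = len(set(router_a) | set(router_b))

# Amount of duplication: observations that repeat an already-seen flow.
duplication = naive_count - distinct_count

print(naive_count, distinct_count, duplication)  # prints: 7 4 3
```

An exact set like this needs memory and communication proportional to the number of distinct items, which is what the resource constraints rule out at scale.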
Topic
What kinds of topics are researchers interested in? Aggregation computation, routing algorithms, ...
What is aggregation? Summarization, namely a statistic describing the original data set.
Examples: min, max, quantile, heavy hitters, distinct count, average, sum, ...
These are not unfamiliar to anyone who has worked with data streams.
Why aggregation? Transaction.
Problems and Concerns
Distinct count: obtain the number of distinct data (items, records, etc.) in multi-sets, namely the cardinality.
Distinct sample: important, but I am sorry that I have not finished this part.
What does this paper care about?
Priority: correctness and communication cost, then computational cost and space cost.
These are the concerns attached to algorithms applied in distributed environments.
Distinct Counting: Flajolet-Martin Sketch
P. Flajolet, G. Martin. Probabilistic Counting Algorithms for Data Base Applications. Journal of Computer and System Sciences, 1985 (cited 628 times).
Goal: estimate the cardinality of a multi-set of data using relatively small space in a one-pass scan.
A sketch is a kind of data structure from which the aggregation result is obtained. (skyline)
I think this method can be regarded as a classical application of probability without complexity.
A question: how would you deal with this problem?
The computing paradigm of sketching is inherently appropriate for data streams.
Flajolet-Martin Sketch (Cont.)
Preliminaries: what do we need?
The multi-set M, containing all items/records, with |M| = n.
The upper bound U on the number of distinct items/records, which is more than n.
The bitmap B, consisting of L elements, with 2^L = U.
The hash function h(x), which transforms each item/record x into a binary string distributed uniformly over the range [1...2^L], written b1 b2 ... bL, in which b1 is the lowest digit and bL is the highest.
The function p(x), which returns the position of the leftmost '1' in that string.
Note: counting, not computing.
[Figure: a record x is hashed by h(x) into bitmap B, whose cells are numbered 1 to L.]
PPT vs. whiteboard?
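A minimal Python sketch of h and p, under stated assumptions: SHA-256 stands in for the uniform hash (the slides do not name one), L = 32 is an arbitrary choice, hash values range over 0 to 2^L - 1 rather than 1 to 2^L, and an all-zero hash value is mapped to position L. Because b1 is the lowest digit, the "leftmost 1" of b1 b2 ... bL is the lowest-order set bit of the integer.

```python
import hashlib

L = 32  # assumed bitmap length, so U = 2**L


def h(item: str, bits: int = L) -> int:
    """Hash an item to an integer of `bits` bits (SHA-256 is an assumed choice)."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") & ((1 << bits) - 1)


def p(value: int, bits: int = L) -> int:
    """1-based position of the lowest-order '1' bit; `bits` if value is 0."""
    if value == 0:
        return bits
    # value & -value isolates the lowest set bit; bit_length gives its position.
    return (value & -value).bit_length()


print(p(0b0001))  # prints: 1
print(p(0b0110))  # prints: 2
print(p(0b0100))  # prints: 3
```

Equal items always hash to the same value, so p(h(x)) is the same for every copy of x. That is the property the bitmap relies on.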
Flajolet-Martin Sketch (Cont.)
The algorithm itself
The core task: for each record, mark in bitmap B the position of the leftmost '1' of its hash value, as given by p(x).
    for i := 1 to L do bitmap[i] := 0;
    for all x in M do
    begin
        index := p(hash(x));
        if bitmap[index] = 0 then bitmap[index] := 1;
    end
Why?
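The pseudocode can be sketched in runnable Python as follows. This is an illustration, not the authors' code: SHA-256 as the hash, the default of 32 cells, 0-based list indices, and sending an all-zero hash value to the last cell are all assumptions.

```python
import hashlib


def fm_bitmap(items, bits=32):
    """Build a Flajolet-Martin bitmap of `bits` cells from an iterable of items."""
    bitmap = [0] * bits
    mask = (1 << bits) - 1
    for x in items:
        digest = hashlib.sha256(str(x).encode("utf-8")).digest()
        value = int.from_bytes(digest[:8], "big") & mask
        # 0-based index of the lowest-order 1 bit; an all-zero value goes to the last cell.
        index = (value & -value).bit_length() - 1 if value else bits - 1
        bitmap[index] = 1  # setting an already-set cell changes nothing
    return bitmap


with_duplicates = fm_bitmap(["a", "b", "a", "c", "b"])
without_duplicates = fm_bitmap(["c", "b", "a"])
print(with_duplicates == without_duplicates)  # prints: True
```

The comparison at the end shows that repeated items and a different arrival order leave the bitmap unchanged, since each item always sets the same cell.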
Flajolet-Martin Sketch (Cont.)
The explanation
The fact: after execution, bitmap[k] equals 1 iff a pattern of the form 0^(k-1)1 has appeared among the hashed values of the records in M.
The probability: the pattern 0^(k-1)1 occurs with probability 1/2^k.
Occurrence times: so if |M| = n, then bitmap[1] is accessed approximately n/2 times, bitmap[2] approximately n/4 times, and so on.
Extension: bitmap[k] will almost certainly be zero if k >> log2(n) and one if k << log2(n), with a fringe of zeros and ones for k ≈ log2(n).
Selection: the leftmost 0, the rightmost 1, or something else.
The most practical part is over; what is left, the proofs and the error analysis, is very complicated and entirely mathematical.
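One way to turn the selection step into a number, sketched in Python: take R as the position of the leftmost 0 and estimate the distinct count as 2^R / φ. The constant φ ≈ 0.77351 comes from the original Flajolet-Martin paper rather than from these slides, and the hand-written bitmap is purely illustrative.

```python
PHI = 0.77351  # correction constant from Flajolet and Martin (1985)


def leftmost_zero(bitmap):
    """0-based index of the first 0 cell; len(bitmap) if every cell is 1."""
    for i, bit in enumerate(bitmap):
        if bit == 0:
            return i
    return len(bitmap)


def estimate_distinct(bitmap):
    """Estimate the distinct count as 2**R / PHI, with R the leftmost zero."""
    return 2 ** leftmost_zero(bitmap) / PHI


# Illustrative bitmap: cells 0..2 set, then the fringe of zeros and ones.
bitmap = [1, 1, 1, 0, 1, 0, 0, 0]
print(leftmost_zero(bitmap))  # prints: 3
print(estimate_distinct(bitmap))  # 2**3 / 0.77351
```

A single bitmap gives a coarse estimate, because R can only move the result in powers of two. The original paper reduces the variance by averaging over many bitmaps.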
Flajolet-Martin Sketch (Cont.)
Conclusion
Analysis (nice qualities for distributed aggregation):
Bit-based: reduces the space complexity by a constant factor.
Space complexity: O(log(n)) -> O(log(log(n)))
Duplicate-insensitive -> duplicate-resilient and flexible.
Order-insensitive -> stable and robust.
Additivity -> two FM sketches can be merged, and the merger is simply the bitwise OR of each pair of corresponding bitmaps.
Questions:
How to choose the value of U?
What is the relationship between U and n?
How to analyze the error?
...
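The additivity property is short enough to sketch in Python. The two site bitmaps are hand-written for illustration, and the merge is only meaningful if both sites use the same hash function and the same bitmap length.

```python
def merge(bitmap_a, bitmap_b):
    """Merge two FM bitmaps by the bitwise OR of corresponding cells."""
    if len(bitmap_a) != len(bitmap_b):
        raise ValueError("bitmaps must have the same length")
    return [a | b for a, b in zip(bitmap_a, bitmap_b)]


# Illustrative bitmaps from two monitoring sites.
site_1 = [1, 0, 1, 0, 0]
site_2 = [1, 1, 0, 0, 0]

merged = merge(site_1, site_2)
print(merged)  # prints: [1, 1, 1, 0, 0]
```

Because OR is idempotent, commutative and associative, a site can forward its bitmap along any route, and even more than once, without changing the merged result. Only L bits travel per sketch.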
Question
What is the relationship between sketch and skyline? Are they the same? Or...
Does aggregation computation belong to the research field of data mining? No, I think.
Q&A

Trends Spotting Strategic foresight for tomorrow’s education systems - Debora...
EduSkills OECD
 
CBSE - Grade 11 - Mathematics - Ch 2 - Relations And Functions - Notes (PDF F...
CBSE - Grade 11 - Mathematics - Ch 2 - Relations And Functions - Notes (PDF F...CBSE - Grade 11 - Mathematics - Ch 2 - Relations And Functions - Notes (PDF F...
CBSE - Grade 11 - Mathematics - Ch 2 - Relations And Functions - Notes (PDF F...
Sritoma Majumder
 
Search Engine Optimization (SEO) for Website Success
Search Engine Optimization (SEO) for Website SuccessSearch Engine Optimization (SEO) for Website Success
Search Engine Optimization (SEO) for Website Success
Muneeb Rana
 
Forestry Model Exit Exam_2025_Wollega University, Gimbi Campus.pdf
Forestry Model Exit Exam_2025_Wollega University, Gimbi Campus.pdfForestry Model Exit Exam_2025_Wollega University, Gimbi Campus.pdf
Forestry Model Exit Exam_2025_Wollega University, Gimbi Campus.pdf
ChalaKelbessa
 
HUMAN SKELETAL SYSTEM ANATAMY AND PHYSIOLOGY
HUMAN SKELETAL SYSTEM ANATAMY AND PHYSIOLOGYHUMAN SKELETAL SYSTEM ANATAMY AND PHYSIOLOGY
HUMAN SKELETAL SYSTEM ANATAMY AND PHYSIOLOGY
DHARMENDRA SAHU
 
LDMMIA Reiki Yoga S8 Free Workshop Grad Level
LDMMIA Reiki Yoga S8 Free Workshop Grad LevelLDMMIA Reiki Yoga S8 Free Workshop Grad Level
LDMMIA Reiki Yoga S8 Free Workshop Grad Level
LDM & Mia eStudios
 
How to Manage Maintenance Request in Odoo 18
How to Manage Maintenance Request in Odoo 18How to Manage Maintenance Request in Odoo 18
How to Manage Maintenance Request in Odoo 18
Celine George
 
SEM II 3202 STRUCTURAL MECHANICS, B ARCH, REGULATION 2021, ANNA UNIVERSITY, R...
SEM II 3202 STRUCTURAL MECHANICS, B ARCH, REGULATION 2021, ANNA UNIVERSITY, R...SEM II 3202 STRUCTURAL MECHANICS, B ARCH, REGULATION 2021, ANNA UNIVERSITY, R...
SEM II 3202 STRUCTURAL MECHANICS, B ARCH, REGULATION 2021, ANNA UNIVERSITY, R...
RVSPSOA
 
la storia dell'Inghilterra, letteratura inglese
la storia dell'Inghilterra, letteratura inglesela storia dell'Inghilterra, letteratura inglese
la storia dell'Inghilterra, letteratura inglese
LetiziaLucente
 
State institute of educational technology
State institute of educational technologyState institute of educational technology
State institute of educational technology
vp5806484
 
Artificial intelligence Presented by JM.
Artificial intelligence Presented by JM.Artificial intelligence Presented by JM.
Artificial intelligence Presented by JM.
jmansha170
 
Dashboard Overview in Odoo 18 - Odoo Slides
Dashboard Overview in Odoo 18 - Odoo SlidesDashboard Overview in Odoo 18 - Odoo Slides
Dashboard Overview in Odoo 18 - Odoo Slides
Celine George
 
Cloud Computing ..PPT ( Faizan ALTAF )..
Cloud Computing ..PPT ( Faizan ALTAF )..Cloud Computing ..PPT ( Faizan ALTAF )..
Cloud Computing ..PPT ( Faizan ALTAF )..
faizanaltaf231
 
Freckle Project April 2025 Survey and report May 2025.pptx
Freckle Project April 2025 Survey and report May 2025.pptxFreckle Project April 2025 Survey and report May 2025.pptx
Freckle Project April 2025 Survey and report May 2025.pptx
EveryLibrary
 
AR3201 WORLD ARCHITECTURE AND URBANISM EARLY CIVILISATIONS TO RENAISSANCE QUE...
AR3201 WORLD ARCHITECTURE AND URBANISM EARLY CIVILISATIONS TO RENAISSANCE QUE...AR3201 WORLD ARCHITECTURE AND URBANISM EARLY CIVILISATIONS TO RENAISSANCE QUE...
AR3201 WORLD ARCHITECTURE AND URBANISM EARLY CIVILISATIONS TO RENAISSANCE QUE...
Mani Sasidharan
 
Semisolid_Dosage_Forms.pptx
Semisolid_Dosage_Forms.pptxSemisolid_Dosage_Forms.pptx
Semisolid_Dosage_Forms.pptx
Shantanu Ranjan
 
Webcrawler_Mule_AIChain_MuleSoft_Meetup_Hyderabad
Webcrawler_Mule_AIChain_MuleSoft_Meetup_HyderabadWebcrawler_Mule_AIChain_MuleSoft_Meetup_Hyderabad
Webcrawler_Mule_AIChain_MuleSoft_Meetup_Hyderabad
Veera Pallapu
 
EUPHORIA GENERAL QUIZ FINALS | QUIZ CLUB OF PSGCAS | 21 MARCH 2025
EUPHORIA GENERAL QUIZ FINALS | QUIZ CLUB OF PSGCAS | 21 MARCH 2025EUPHORIA GENERAL QUIZ FINALS | QUIZ CLUB OF PSGCAS | 21 MARCH 2025
EUPHORIA GENERAL QUIZ FINALS | QUIZ CLUB OF PSGCAS | 21 MARCH 2025
Quiz Club of PSG College of Arts & Science
 
Stewart Butler - OECD - How to design and deliver higher technical education ...
Stewart Butler - OECD - How to design and deliver higher technical education ...Stewart Butler - OECD - How to design and deliver higher technical education ...
Stewart Butler - OECD - How to design and deliver higher technical education ...
EduSkills OECD
 
Smart Borrowing: Everything You Need to Know About Short Term Loans in India
Smart Borrowing: Everything You Need to Know About Short Term Loans in IndiaSmart Borrowing: Everything You Need to Know About Short Term Loans in India
Smart Borrowing: Everything You Need to Know About Short Term Loans in India
fincrifcontent
 
Trends Spotting Strategic foresight for tomorrow’s education systems - Debora...
Trends Spotting Strategic foresight for tomorrow’s education systems - Debora...Trends Spotting Strategic foresight for tomorrow’s education systems - Debora...
Trends Spotting Strategic foresight for tomorrow’s education systems - Debora...
EduSkills OECD
 
CBSE - Grade 11 - Mathematics - Ch 2 - Relations And Functions - Notes (PDF F...
CBSE - Grade 11 - Mathematics - Ch 2 - Relations And Functions - Notes (PDF F...CBSE - Grade 11 - Mathematics - Ch 2 - Relations And Functions - Notes (PDF F...
CBSE - Grade 11 - Mathematics - Ch 2 - Relations And Functions - Notes (PDF F...
Sritoma Majumder
 
Search Engine Optimization (SEO) for Website Success
Search Engine Optimization (SEO) for Website SuccessSearch Engine Optimization (SEO) for Website Success
Search Engine Optimization (SEO) for Website Success
Muneeb Rana
 
Forestry Model Exit Exam_2025_Wollega University, Gimbi Campus.pdf
Forestry Model Exit Exam_2025_Wollega University, Gimbi Campus.pdfForestry Model Exit Exam_2025_Wollega University, Gimbi Campus.pdf
Forestry Model Exit Exam_2025_Wollega University, Gimbi Campus.pdf
ChalaKelbessa
 
HUMAN SKELETAL SYSTEM ANATAMY AND PHYSIOLOGY
HUMAN SKELETAL SYSTEM ANATAMY AND PHYSIOLOGYHUMAN SKELETAL SYSTEM ANATAMY AND PHYSIOLOGY
HUMAN SKELETAL SYSTEM ANATAMY AND PHYSIOLOGY
DHARMENDRA SAHU
 
LDMMIA Reiki Yoga S8 Free Workshop Grad Level
LDMMIA Reiki Yoga S8 Free Workshop Grad LevelLDMMIA Reiki Yoga S8 Free Workshop Grad Level
LDMMIA Reiki Yoga S8 Free Workshop Grad Level
LDM & Mia eStudios
 
How to Manage Maintenance Request in Odoo 18
How to Manage Maintenance Request in Odoo 18How to Manage Maintenance Request in Odoo 18
How to Manage Maintenance Request in Odoo 18
Celine George
 
SEM II 3202 STRUCTURAL MECHANICS, B ARCH, REGULATION 2021, ANNA UNIVERSITY, R...
SEM II 3202 STRUCTURAL MECHANICS, B ARCH, REGULATION 2021, ANNA UNIVERSITY, R...SEM II 3202 STRUCTURAL MECHANICS, B ARCH, REGULATION 2021, ANNA UNIVERSITY, R...
SEM II 3202 STRUCTURAL MECHANICS, B ARCH, REGULATION 2021, ANNA UNIVERSITY, R...
RVSPSOA
 
la storia dell'Inghilterra, letteratura inglese
la storia dell'Inghilterra, letteratura inglesela storia dell'Inghilterra, letteratura inglese
la storia dell'Inghilterra, letteratura inglese
LetiziaLucente
 
State institute of educational technology
State institute of educational technologyState institute of educational technology
State institute of educational technology
vp5806484
 
Artificial intelligence Presented by JM.
Artificial intelligence Presented by JM.Artificial intelligence Presented by JM.
Artificial intelligence Presented by JM.
jmansha170
 
Dashboard Overview in Odoo 18 - Odoo Slides
Dashboard Overview in Odoo 18 - Odoo SlidesDashboard Overview in Odoo 18 - Odoo Slides
Dashboard Overview in Odoo 18 - Odoo Slides
Celine George
 
Cloud Computing ..PPT ( Faizan ALTAF )..
Cloud Computing ..PPT ( Faizan ALTAF )..Cloud Computing ..PPT ( Faizan ALTAF )..
Cloud Computing ..PPT ( Faizan ALTAF )..
faizanaltaf231
 
Freckle Project April 2025 Survey and report May 2025.pptx
Freckle Project April 2025 Survey and report May 2025.pptxFreckle Project April 2025 Survey and report May 2025.pptx
Freckle Project April 2025 Survey and report May 2025.pptx
EveryLibrary
 
AR3201 WORLD ARCHITECTURE AND URBANISM EARLY CIVILISATIONS TO RENAISSANCE QUE...
AR3201 WORLD ARCHITECTURE AND URBANISM EARLY CIVILISATIONS TO RENAISSANCE QUE...AR3201 WORLD ARCHITECTURE AND URBANISM EARLY CIVILISATIONS TO RENAISSANCE QUE...
AR3201 WORLD ARCHITECTURE AND URBANISM EARLY CIVILISATIONS TO RENAISSANCE QUE...
Mani Sasidharan
 
Semisolid_Dosage_Forms.pptx
Semisolid_Dosage_Forms.pptxSemisolid_Dosage_Forms.pptx
Semisolid_Dosage_Forms.pptx
Shantanu Ranjan
 
Webcrawler_Mule_AIChain_MuleSoft_Meetup_Hyderabad
Webcrawler_Mule_AIChain_MuleSoft_Meetup_HyderabadWebcrawler_Mule_AIChain_MuleSoft_Meetup_Hyderabad
Webcrawler_Mule_AIChain_MuleSoft_Meetup_Hyderabad
Veera Pallapu
 
Stewart Butler - OECD - How to design and deliver higher technical education ...
Stewart Butler - OECD - How to design and deliver higher technical education ...Stewart Butler - OECD - How to design and deliver higher technical education ...
Stewart Butler - OECD - How to design and deliver higher technical education ...
EduSkills OECD
 

Aggregation computation over distributed data streams

  • 1. Aggregation Computation Over Distributed Data Streams (partial content). Yueshen Xu, Middleware, CCNT, Zhejiang University. Middleware, CCNT, ZJU, 12/15/11
  • 2. Paper reference. What's Different: Distributed, Continuous Monitoring of Duplicate-Resilient Aggregates on Data Streams. Published in ICDE, 2006; cited 61 times. By Graham Cormode, S. Muthukrishnan, et al. (Bell Lab Expert/27, Rutgers Expert/45). I think it is good reading for newcomers to distributed data streams.
  • 3. Background: distributed data streams. Where and why? Large-scale monitoring applications; many sensors distributed over a wide area (just one example). Distributed streaming model. Query paradigm: centralized vs. decentralized.
  • 4. Constraints and Features. Constraints (all resources are restricted): space (embedded equipment does not have enough memory); processing power (for the same reason); communication capability (unreliable, spotty and sporadic). Features: queries are continuous, unlike ad hoc queries in a DBMS. What's different?
  • 5. Trouble: duplication, the root of all evil. Why? Wide-scale monitoring invariably encounters the same events at different points. Instances: the same flow is observed at different routers; the same individual is observed by several mobile sensors. Requirement: duplicate-resilient aggregates. Two vital questions: How much duplication is there in the network? What are the versions of classical aggregates in the presence of duplicates?
  • 6. Topic. What kinds of topics are researchers interested in? Aggregation computation, routing algorithms, … What is aggregation? Summarization, namely a statistic describing the original data sets. Examples: min, max, quantile, heavy hitters, distinct counts, average, sum, … These are familiar to anyone who has worked with data streams. Why aggregation? Transaction.
  • 7. Problems and Concerns. Distinct count: obtain the number of distinct data (items, records, etc.) in multi-sets, namely the cardinality. Distinct sample: important, but I'm sorry that I haven't finished this part. What is this paper concerned with? Priority: correctness and communication cost, then computational cost and space cost. These are the features attached to algorithms applied in distributed environments.
  • 8. Distinct Counting: Flajolet-Martin Sketch. Reference: P. Flajolet, G. Martin. Probabilistic Counting Algorithms for Data Base Applications. Journal of Computer and System Sciences, 1985 (cited 628 times). Goal: estimate the cardinalities of multi-sets of data in a single pass, using relatively small space. The sketch is a kind of data structure from which the aggregation result is obtained (compare: skyline). I think this method can be regarded as a classic application of probability without much complexity. A question for you: how would you deal with this problem? The computing paradigm of sketching is inherently appropriate for data streams.
  • 9. Flajolet-Martin Sketch (cont.). Preliminaries: what do we need? (Counting, not computing.) The multi-set M, containing all items/records, with |M| = n. U, the upper bound on the number of distinct items/records, which is more than n. The bitmap B, consisting of L elements, with 2^L = U. The hash function h(x), which transforms each item/record x into a binary string b_1 b_2 … b_L distributed uniformly over the range [1…2^L], where b_1 is the lowest digit and b_L is the highest. The function p(x), which returns the position of the leftmost ‘1’. Diagram: record x → h(x) → bitmap B with positions 1 to L (for example 1 1 … 0 0). PPT vs. whiteboard?
  • 10. Flajolet-Martin Sketch (cont.). The algorithm itself. The core task: for each record x, mark in bitmap B the position, given by p(x), of the leftmost ‘1’ of its hash value.
        for i := 1 to L do bitmap[i] := 0;
        for all x in M do
        begin
          index := p(hash(x));
          if bitmap[index] = 0 then bitmap[index] := 1;
        end
    Why?
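The pseudocode on slide 10 can be sketched in Python as follows. This is a minimal illustration, not the paper's implementation: the choice of SHA-256 as the hash, the default L = 32, and the names `hash_bits`, `p` and `fm_bitmap` are assumptions made here. Following slide 9, b_1 is the lowest digit, so the "leftmost 1" is the lowest-order 1 bit of the hash value.

```python
import hashlib

L = 32  # bitmap length (assumption); covers up to U = 2**L distinct items


def hash_bits(x, L=L):
    """Map an item to an L-bit integer. SHA-256 is an assumption made for this
    sketch; any hash that spreads items uniformly would do."""
    digest = hashlib.sha256(repr(x).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % (1 << L)


def p(y, L=L):
    """1-based position of the lowest-order 1 bit of y (b_1 is the lowest digit).
    Returns L when y has no 1 bit at all."""
    if y == 0:
        return L
    return (y & -y).bit_length()


def fm_bitmap(items, L=L):
    """Build the bitmap: bitmap[k] is 1 iff some hash value has its first 1 at k."""
    bitmap = [0] * (L + 1)  # index 0 is unused so positions run from 1 to L
    for x in items:
        bitmap[p(hash_bits(x, L), L)] = 1
    return bitmap
```

Because a bit is only ever set to 1, feeding the same item again changes nothing, which is why `fm_bitmap(["a", "b", "a"])` and `fm_bitmap(["b", "a"])` give the same bitmap.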
  • 11. Flajolet-Martin Sketch (cont.). The explanation. The fact: after execution, bitmap[k] equals 1 iff a pattern of the form 0^(k-1)1 has appeared among the hashed values of the records in M. The probability: the pattern 0^(k-1)1 occurs with probability 1/2^k. Occurrence times: so if |M| = n, then bitmap[1] is accessed approximately n/2 times, bitmap[2] approximately n/4 times, and so on. Extension: bitmap[k] will almost certainly be zero if k >> log_2(n) and one if k << log_2(n), with a fringe of 0s and 1s for k ≈ log_2(n). Selection: the leftmost 0, the rightmost 1, or something else. The most practical part is over; what is left, the proofs and the error analysis, is very complicated and entirely mathematical. The algorithm again, for reference:
        for i := 1 to L do bitmap[i] := 0;
        for all x in M do
        begin
          index := p(hash(x));
          if bitmap[index] = 0 then bitmap[index] := 1;
        end
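Slide 11 leaves the "selection" step open. One option, taken from the Flajolet-Martin paper rather than from these slides, is to let R be the number of consecutive 1s at the start of the bitmap (the index of the leftmost 0, counted from zero) and to estimate n as 2^R / φ with φ ≈ 0.77351. The sketch below assumes that choice. It also averages R over several independently salted bitmaps to reduce variance, which is a simplification chosen here; the salting scheme, SHA-256, and all function names are hypothetical.

```python
import hashlib

PHI = 0.77351  # correction constant from Flajolet and Martin (1985)


def lowest_one_position(y, L):
    """1-based position of the lowest-order 1 bit of y, or L if y == 0."""
    return (y & -y).bit_length() if y else L


def leading_ones(items, salt, L=32):
    """Build one bitmap with a salted hash and return R, the number of
    consecutive 1s at its start (the index of the leftmost 0, from zero)."""
    bitmap = [0] * (L + 2)  # 1-based; the final slot is never set (sentinel)
    for x in items:
        digest = hashlib.sha256(f"{salt}:{x!r}".encode("utf-8")).digest()
        y = int.from_bytes(digest[:8], "big") % (1 << L)
        bitmap[lowest_one_position(y, L)] = 1
    k = 1
    while bitmap[k] == 1:
        k += 1
    return k - 1


def fm_estimate(items, num_bitmaps=64, L=32):
    """Estimate the number of distinct items as 2**mean(R) / PHI."""
    items = list(items)
    mean_r = sum(leading_ones(items, s, L) for s in range(num_bitmaps)) / num_bitmaps
    return (2 ** mean_r) / PHI
```

A single bitmap can only produce estimates that are powers of two divided by φ, so it is coarse; averaging R over more bitmaps tightens the estimate at the cost of more hashing per item. Duplicates do not matter: a stream that repeats each of 2000 values three times yields exactly the same estimate as the 2000 values alone.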
  • 12. Flajolet-Martin Sketch (cont.). Conclusion. Analysis (nice qualities for distributed aggregation): Bit-based, reducing the space complexity by a constant factor. Space complexity: O(log(n)) → O(log(log(n))). Duplicate-insensitive → duplicate-resilient and flexible. Order-insensitive → stable and robust. Additivity → two FM sketches can be merged, and the merger is simply the bitwise OR of each pair of corresponding bitmaps. Questions: How should the value of U be chosen? What is the relationship between U and n? How is the error analyzed? …
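The additivity property on slide 12 can be checked with a small Python sketch. Here the bitmap is packed into a single integer so that the merge is literally one bitwise OR; the packing, the SHA-256 hash, L = 32, and the names `fm_sketch`, `merge`, `site_a` and `site_b` are assumptions made for illustration only.

```python
import hashlib

L = 32  # bitmap length (assumption)


def fm_sketch(items, L=L):
    """Return the FM bitmap packed into an int: bit (k - 1) stands for bitmap[k]."""
    sketch = 0
    for x in items:
        digest = hashlib.sha256(repr(x).encode("utf-8")).digest()
        y = int.from_bytes(digest[:8], "big") % (1 << L)
        # y & -y isolates the lowest-order 1 bit; an all-zero hash maps to position L
        lowest_one = (y & -y) if y else 1 << (L - 1)
        sketch |= lowest_one
    return sketch


def merge(a, b):
    """Merge two sketches built with the same hash function and the same L."""
    return a | b


# Two monitoring sites whose streams overlap on the items 400..599.
site_a = fm_sketch(range(0, 600))
site_b = fm_sketch(range(400, 1000))

# The merged sketch equals the sketch of the union, so the overlap is not double counted.
assert merge(site_a, site_b) == fm_sketch(range(0, 1000))
```

The merge is only meaningful when every site uses the same hash function and bitmap length. Since OR is commutative, associative and idempotent, sketches can be combined in any order and re-sent without harm, which suits an unreliable network.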
  • 13. Question. What is the relationship between sketch and skyline? Are they the same? Or…? Does aggregation computation belong to the research field of data mining? No, I think.