Aggregation Computation Over Distributed Data Streams (partial content)
Yueshen Xu
Middleware, CCNT, Zhejiang University (ZJU)
12/15/11

Paper reference
• What's Different: Distributed, Continuous Monitoring of Duplicate-Resilient Aggregates on Data Streams
• Published in ICDE, 2006; cited 61 times
• By Graham Cormode, S. Muthukrishnan, et al.
• Slide callouts: Bell Labs, Expert/27; Rutgers, Expert/45
• I think it is a good read for newcomers to distributed data streams.

Background
• Distributed data streams: where and why?
  • Large-scale monitoring applications
  • Many sensors distributed over a wide area (just one example)
• Distributed streaming model
• Query paradigm: centralized vs. decentralized

Constraints and Features
• Constraints (all resources are restricted)
  • Space: embedded devices do not have enough memory
  • Processing power: limited for the same reason
  • Communication capability: unreliable, spotty, and sporadic
• Features
  • Queries are continuous, unlike ad hoc queries in a DBMS
• What's different?

Trouble: Duplication (the root of all evil)
• Why? Wide-scale monitoring invariably encounters the same events at different points.
• Instances
  • The same flow is observed at different routers
  • The same individual is observed by several mobile sensors
• Requirement: duplicate-resilient aggregates
• Two vital questions
  • How much duplication is there in the network?
  • What do the classical aggregates become in the presence of duplicates?

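To make the duplication problem concrete, here is a minimal sketch with made-up flow IDs observed at two routers (the IDs and router names are hypothetical, not from the paper). It shows why adding up per-site distinct counts overstates the true answer.

```python
# Hypothetical flow IDs; flows "f2" and "f3" pass through both routers.
router_a = ["f1", "f2", "f3", "f2"]
router_b = ["f2", "f3", "f4"]

# Naive aggregation: each site reports its own distinct count and we add them.
naive_total = len(set(router_a)) + len(set(router_b))

# Duplicate-resilient answer: distinct flows over the union of both sites.
true_distinct = len(set(router_a) | set(router_b))

# The gap is the amount of cross-site duplication.
duplication = naive_total - true_distinct

print(naive_total, true_distinct, duplication)
```

Shipping the full sets to one place gives the exact answer but costs communication, which is the motivation for compact, mergeable summaries.
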
Topic
• What topics are researchers interested in?
  • Aggregation computation
  • Routing algorithms
  • …
• What is aggregation?
  • Summarization, namely a statistic describing the original data set
  • Examples: min, max, quantile, heavy hitters, distinct count, average, sum, …
  • Familiar to anyone who has worked with data streams
• Why aggregation? → transaction

Problems and Concerns
• Distinct count: obtain the number of distinct data (items, records, etc.) in a multiset, namely its cardinality
• Distinct sample: important, but I'm sorry that I haven't finished this part
• What is this paper concerned with?
  • Priority: correctness, communication cost
  • Also: computational cost, space cost
  • These are the features attached to algorithms applied in distributed environments

Distinct Counting: Flajolet-Martin Sketch
• Reference: P. Flajolet, G. Martin. Probabilistic Counting Algorithms for Data Base Applications. Journal of Computer and System Sciences, 1985 (cited 628 times)
• Goal: estimate the cardinality of a multiset of data using relatively small space in a single pass
• A sketch is a kind of data structure through which the aggregation result is obtained (skyline)
• The computing paradigm of sketching is inherently appropriate for data streams
• I think this method can be regarded as a classic application of probability without much complexity.
• A question: how would you deal with this problem?

Flajolet-Martin Sketch (Cont.)
Preliminary → what do we need?
• The multiset M, containing all items/records, with |M| = n
• The upper bound U on the number of distinct items/records, which is more than n
• The bitmap B, consisting of L elements, with 2^L = U
• The hash function h(x), where x is an item/record, which transforms each item into a binary string b_1 b_2 … b_L distributed uniformly over the range [1 … 2^L], in which b_1 is the lowest digit and b_L is the highest
• The function p(x), which returns the position of the leftmost '1'
• Counting, not computing
Figure: a record x is hashed to h(x), which selects one of the positions 1 … L in the bitmap B (drawn as 1 1 … 0 0).
PPT vs. whiteboard?

Flajolet-Martin Sketch (Cont.)
The algorithm itself
• The core task: for each record, mark in bitmap B the position, given by p(x), of the leftmost '1' of its hash value

    for i := 1 to L do bitmap[i] := 0;
    for all x in M do
    begin
      index := p(hash(x));
      if bitmap[index] = 0 then bitmap[index] := 1;
    end

• Why?

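The pseudocode on this slide can be sketched in Python as follows. This is an illustrative version under stated assumptions, not the authors' code: `L = 32`, the use of SHA-256 as a stand-in for a uniform hash, and the convention `p(0) = L` are choices made here for the example.

```python
import hashlib

L = 32  # assumed bitmap length, so U = 2**L


def h(item: str) -> int:
    """Map an item to an L-bit integer (SHA-256 is an assumed stand-in hash)."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") & ((1 << L) - 1)


def p(y: int) -> int:
    """1-based position of the lowest-order 1 bit of y.

    With the bits written b_1 b_2 ... b_L (b_1 lowest), this is the
    leftmost '1'. Returning L for y == 0 is a convention assumed here.
    """
    if y == 0:
        return L
    return (y & -y).bit_length()


def fm_bitmap(items) -> list:
    """Build the bitmap; index 0 is unused to keep the 1-based indexing."""
    bitmap = [0] * (L + 1)
    for x in items:
        bitmap[p(h(x))] = 1  # setting an already-set bit changes nothing
    return bitmap


# Repeated items hash to the same position, so duplicates leave no trace.
print(fm_bitmap(["a", "b", "a", "a"]) == fm_bitmap(["b", "a"]))
```

The explicit `if bitmap[index] = 0` test from the pseudocode is dropped because an unconditional assignment has the same effect.
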
Flajolet-Martin Sketch (Cont.)
The explanation
• The fact: after execution, bitmap[k] equals 1 iff a pattern of the form 0^(k-1)1 has appeared among the hashed values of the records in M
• The probability: the pattern 0^(k-1)1 occurs with probability 1/2^k
• Occurrence times: so if |M| = n, then bitmap[1] is accessed approximately n/2 times, bitmap[2] approximately n/4 times, and so on
• Extension: bitmap[k] will almost certainly be 0 if k >> log_2(n) and 1 if k << log_2(n), with a fringe of 0s and 1s for k ≈ log_2(n)
• Selection: the leftmost 0, the rightmost 1, or something else
• The most practical part is over; what is left is the complicated part, devoted to proofs and error analysis, namely all mathematics.

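The reasoning on this slide can be written out as a short derivation. Two assumptions are made here that the slide does not state: n is read as the number of distinct items (which equals |M| when M has no duplicates), and the hash values are treated as independent and uniform. R denotes the number of consecutive 1s at the low end of the bitmap, that is, the 1-based position of the leftmost 0 minus one. The constant φ ≈ 0.77351 comes from Flajolet and Martin's analysis, not from the slides.

```latex
\begin{align*}
\Pr\bigl[\,p(h(x)) = k\,\bigr] &= 2^{-k}, \\
\mathbb{E}\bigl[\text{number of hits on } \mathrm{bitmap}[k]\bigr] &= n \cdot 2^{-k}, \\
\Pr\bigl[\,\mathrm{bitmap}[k] = 0\,\bigr] &= \bigl(1 - 2^{-k}\bigr)^{n} \approx e^{-n/2^{k}}, \\
\mathbb{E}[R] &\approx \log_2(\varphi n), \qquad \varphi \approx 0.77351, \\
\hat{n} &= 2^{R} / \varphi .
\end{align*}
```

For example, with n = 1000 (so log_2(n) ≈ 10), bitmap[5] stays 0 with probability about e^{-31.25}, which is negligible, while bitmap[15] stays 0 with probability about e^{-0.03} ≈ 0.97. This is the "almost certainly 1 below, almost certainly 0 above" behaviour described on the slide.
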
Flajolet-Martin Sketch (Cont.)
Conclusion: nice qualities for distributed aggregation
• Analysis
  • Bit-based, reducing the space complexity by a constant factor
  • Space complexity: O(log(n)) → O(log(log(n)))
  • Duplicate-insensitive → duplicate-resilient and flexible
  • Order-insensitive → stable and robust
  • Additivity → two FM sketches can be merged, and the merge is simply the bitwise OR of each pair of corresponding bitmaps
• Questions
  • How should the value of U be chosen?
  • What is the relationship between U and n?
  • How should the error be analyzed?
  • …

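The additivity, duplicate-insensitivity, and order-insensitivity properties can be checked with a small sketch. The bitmap is stored here as a single Python integer so that merging is one bitwise OR; the names `position`, `sketch`, and `merge`, the value `L = 32`, and the SHA-256 stand-in hash are all assumptions made for this illustration.

```python
import hashlib

L = 32  # assumed bitmap length


def position(item: str) -> int:
    """1-based position of the lowest 1 bit of the item's L-bit hash."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    y = int.from_bytes(digest, "big") & ((1 << L) - 1)
    return L if y == 0 else (y & -y).bit_length()


def sketch(items) -> int:
    """FM bitmap packed into an int: bit (k - 1) stands for bitmap[k]."""
    bits = 0
    for x in items:
        bits |= 1 << (position(x) - 1)
    return bits


def merge(a: int, b: int) -> int:
    """Merging two sketches is the bitwise OR of their bitmaps."""
    return a | b


# Hypothetical observations at two sites with overlapping items.
site_a = ["f1", "f2", "f3", "f2"]
site_b = ["f3", "f4", "f2"]

merged = merge(sketch(site_a), sketch(site_b))

print(merged == sketch(site_a + site_b))            # additivity
print(sketch(site_a) == sketch(["f3", "f1", "f2"]))  # order and duplicates ignored
```

Each site only needs to send its L-bit bitmap, and the coordinator obtains the same sketch it would have built from the combined stream.
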
Question
• What is the relationship between a sketch and a skyline? Are they the same? Or…
• Does aggregation computation belong to the research field of data mining? No, I think.

Q&A