Probabilistic data structures
https://www.linkedin.com/in/taras-yaroshchuk-551383105/
Taras Yaroshchuk
Senior Data Engineer at Sigma Software
- 4 years in Data Engineering
- AdTech, IoT, FinTech
- Scala/Java/Python
- Trying to contribute to the big data community
Skype/Telegram/FB/everywhere: taras.yaroshchuk
Use cases
● Membership (Bloom filter, Quotient filter, Cuckoo filter)
● Frequency (Frequent algorithm, Count-Min Sketch)
● Cardinality (Linear Counting, LogLog, HyperLogLog)
● Rank (Random sampling, q-digest, t-digest)
● Similarity (Locality-Sensitive Hashing, MinHash, SimHash)
Motivation
Hashing
Cryptographic hash functions
● Message-Digest Algorithm (MD5)
● Secure Hash Algorithms (SHA-256, SHA-512, etc)
● RadioGatún
Non-cryptographic hash functions
● FNV1
● CityHash, FarmHash
● MurmurHash3
42
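As an illustration of a non-cryptographic hash from the list above, here is a minimal sketch of 32-bit FNV-1 and of the `hash % size` bucketing used on the following slides. The function names `fnv1_32` and `bucket` are made up for this example; production code would normally rely on a tested library implementation.

```python
# 32-bit FNV-1 parameters (standard published constants).
FNV32_OFFSET_BASIS = 0x811C9DC5  # 2166136261
FNV32_PRIME = 0x01000193         # 16777619


def fnv1_32(data: bytes) -> int:
    """32-bit FNV-1: multiply by the prime first, then XOR in each byte."""
    h = FNV32_OFFSET_BASIS
    for byte in data:
        h = (h * FNV32_PRIME) & 0xFFFFFFFF  # keep the value within 32 bits
        h ^= byte
    return h


def bucket(key: str, size: int) -> int:
    """Map a string key to an index in [0, size), as in h(x) = FNV1(x) % size."""
    return fnv1_32(key.encode("utf-8")) % size
```

Such functions are fast and spread keys evenly, which is what the structures below need; they are not meant to resist deliberate attacks.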
Bloom Filter (Membership)
- Google Bigtable, HBase, Cassandra and PostgreSQL use Bloom filters to avoid disk lookups for non-existent rows or columns
- Medium uses a Bloom filter to avoid showing duplicate recommendations
- Google Chrome: checking URLs against a list of bad URLs
- Checking for compromised passwords
Bloom Filter (Membership)
- It is like a Set (for example, a HashSet), but doesn't store the elements themselves
- Supports 2 operations: add an element, check whether an element exists
- Backed by a bit array
- Uses multiple hash functions to pick the bits for an element
h1(x) = MurmurHash3(x) % 10
h2(x) = FNV1(x) % 10
Bit array (index, bit):
0 1 2 3 4
1 0 1 1 0
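The two operations can be sketched in a few lines. This is a minimal illustration, not a production implementation: the class name `BloomFilter` and the method `might_contain` are chosen for this example, and seeded SHA-256 from the standard library stands in for MurmurHash3 and FNV1, which Python does not ship.

```python
import hashlib


class BloomFilter:
    """Bit array plus several hash functions; the elements are never stored."""

    def __init__(self, size_bits: int, num_hashes: int) -> None:
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = [0] * size_bits

    def _positions(self, item: str):
        # Assumption: a seeded SHA-256 digest simulates independent hash functions.
        data = item.encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # False means "definitely not added"; True means "possibly added".
        return all(self.bits[pos] for pos in self._positions(item))


plates = BloomFilter(size_bits=1024, num_hashes=2)
plates.add("QWERTY777")
plates.add("NET1234")
```

Because bits are only ever set and never cleared, an element that was added is always reported as present; the only possible error is a false positive.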
Bloom Filter (Membership)
Example:
- camera on highway
- bad internet connection
- police in 400m
Bloom Filter (Membership)
Euroblyaha detection (cars with foreign plates that were not cleared through customs)
Plates: QWERTY777, NET1234, ASDF999
1. Add QWERTY777
h1 = MurmurHash3(QWERTY777) % 10 = 7
h2 = FNV1(QWERTY777) % 10 = 3
0 1 2 3 4 5 6 7 8 9
0 0 0 1 0 0 0 1 0 0
2. Add NET1234
h1 = MurmurHash3(NET1234) % 10 = 1
h2 = FNV1(NET1234) % 10 = 3
0 1 2 3 4 5 6 7 8 9
0 1 0 1 0 0 0 1 0 0
3. Contains ASDF999? (false: bits 5 and 6 are not set)
h1 = MurmurHash3(ASDF999) % 10 = 5
h2 = FNV1(ASDF999) % 10 = 6
0 1 2 3 4 5 6 7 8 9
0 1 0 1 0 0 0 1 0 0
4. Contains NET1234? (true: bits 1 and 3 are both set)
h1 = MurmurHash3(NET1234) % 10 = 1
h2 = FNV1(NET1234) % 10 = 3
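The walkthrough can be replayed in code. In this sketch the bit positions are hard-coded from the slide rather than computed, because real MurmurHash3 and FNV1 outputs for these plates would most likely differ; `SLIDE_POSITIONS`, `add` and `contains` are names invented for the example.

```python
# Bit positions (h1, h2) copied from the slide; they are illustrative values,
# not real MurmurHash3 / FNV1 outputs.
SLIDE_POSITIONS = {
    "QWERTY777": (7, 3),
    "NET1234": (1, 3),
    "ASDF999": (5, 6),
}

bits = [0] * 10


def add(plate: str) -> None:
    """Set every bit selected by the hash functions."""
    for pos in SLIDE_POSITIONS[plate]:
        bits[pos] = 1


def contains(plate: str) -> bool:
    """True only if all selected bits are set."""
    return all(bits[pos] == 1 for pos in SLIDE_POSITIONS[plate])


add("QWERTY777")
after_first = list(bits)    # bits 3 and 7 are set
add("NET1234")
after_second = list(bits)   # bit 1 is new, bit 3 was already set
asdf_found = contains("ASDF999")
net_found = contains("NET1234")
```

Note that NET1234 and QWERTY777 share bit 3. Shared bits like this are what eventually produce false positives as the array fills up.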
Bloom Filter (Membership)
A lookup has two possible answers:
- The element definitely doesn't exist in the set
- The element may exist in the set (false positives are possible). Let's say, 98%
p - false positive rate
m - size of the filter in bits
k - number of hash functions
n - number of elements inserted
k    m/n   p, %
4    6     5.62
6    8     2.15
8    12    0.314
11   16    0.0458
1 billion elements, p = 2% ~ 1 GB
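The table and the 1 GB figure follow from the standard Bloom filter approximations p ≈ (1 - e^(-kn/m))^k and m = -n·ln(p) / (ln 2)^2. The sketch below evaluates them; the function names are illustrative, and the results are approximations that can differ from the table in the last digit.

```python
import math


def false_positive_rate(k: int, bits_per_element: float) -> float:
    """Approximate p for k hash functions and m/n bits per element."""
    return (1.0 - math.exp(-k / bits_per_element)) ** k


def required_bits(n: int, p: float) -> int:
    """Bits needed to hold n elements at false positive rate p."""
    return math.ceil(-n * math.log(p) / (math.log(2) ** 2))


def optimal_hash_count(m: int, n: int) -> int:
    """Number of hash functions that minimises p for given m and n."""
    return max(1, round(m / n * math.log(2)))


n = 1_000_000_000
m = required_bits(n, 0.02)   # roughly 8.1 billion bits
size_bytes = m / 8           # roughly 1 GB
k = optimal_hash_count(m, n)
```

For one billion elements at p = 2% this gives a little over 8 bits per element and 6 hash functions, which is where the "about 1 GB" estimate comes from.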
Cassandra
bloom filter
What does it look like?
<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>22.0</version>
</dependency>
Count-Min Sketch (Frequency)
How many times has an element occurred?
Show the top X elements
For streaming applications that deal with huge amounts of data
● DNS DDoS
● Intent Surge
● Twitter trending hashtags
Count-Min Sketch (Frequency)
- Uses multiple hash functions
- Matrix of counters (not bits)
- Finds the most frequent elements
- Returns an upper-bound estimate: the true count is less than or equal to the reported one
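A minimal sketch of the structure described above: one row of counters per hash function, increments on add, and the minimum across rows on query. The class name `CountMinSketch` and its methods are chosen for this example, and seeded SHA-256 again stands in for MurmurHash3 and FNV1.

```python
import hashlib


class CountMinSketch:
    """Matrix of counters: `depth` rows (hash functions) by `width` columns."""

    def __init__(self, width: int, depth: int) -> None:
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row: int, item: str) -> int:
        # Assumption: a seeded SHA-256 digest simulates one hash function per row.
        digest = hashlib.sha256(bytes([row]) + item.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.rows[row][self._index(row, item)] += count

    def estimate(self, item: str) -> int:
        # Collisions can only inflate a counter, so the minimum is the best guess.
        return min(self.rows[row][self._index(row, item)] for row in range(self.depth))


stream = ["#quarantine", "#quarantine", "#brexit", "#brexit", "#alyonalyona",
          "#quarantine", "#tesla", "#tesla", "#quarantine", "#brexit",
          "#oscar", "#quarantine"]
sketch = CountMinSketch(width=10, depth=2)
for tag in stream:
    sketch.add(tag)
```

Memory use depends only on `width` and `depth`, not on the number of distinct elements, which is why the structure suits unbounded streams.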
Count-Min Sketch (Frequency)
h1(x) = MurmurHash3(x) % 10
h2(x) = FNV1(x) % 10
Stream: { #quarantine, #quarantine, #brexit, #brexit, #alyonalyona, #quarantine, #tesla, #tesla, #quarantine, #brexit, #oscar, #quarantine }
Initial state:
   0 1 2 3 4 5 6 7 8 9
h1 0 0 0 0 0 0 0 0 0 0
h2 0 0 0 0 0 0 0 0 0 0
1. #quarantine
2. #quarantine
h1(x) = MurmurHash3(quarantine) % 10 = 4
h2(x) = FNV1(quarantine) % 10 = 7
   0 1 2 3 4 5 6 7 8 9
h1 0 0 0 0 2 0 0 0 0 0
h2 0 0 0 0 0 0 0 2 0 0
3. #brexit
4. #brexit
h1(x) = MurmurHash3(brexit) % 10 = 8
h2(x) = FNV1(brexit) % 10 = 2
   0 1 2 3 4 5 6 7 8 9
h1 0 0 0 0 2 0 0 0 2 0
h2 0 0 2 0 0 0 0 2 0 0
After the whole stream has been processed (#tesla shares h1 = 8 with #brexit):
h1(x) = MurmurHash3(tesla) % 10 = 8
h2(x) = FNV1(tesla) % 10 = 9
   0 1 2 3 4 5 6 7 8 9
h1 0 0 0 0 6 0 1 0 5 0
h2 0 0 3 0 0 1 0 6 0 2
How many times #tesla?
Final answer = min(h1[8], h2[9]) = min(5, 2) = 2
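The final matrix and the #tesla query can be reproduced with hard-coded positions. The positions for #quarantine, #brexit and #tesla come from the slides; the slides give none for #alyonalyona and #oscar, so the values used here are an assumption, picked as one assignment that is consistent with the final matrix.

```python
# Column positions (h1, h2) per hashtag.
SLIDE_POSITIONS = {
    "#quarantine": (4, 7),   # from the slides
    "#brexit": (8, 2),       # from the slides
    "#tesla": (8, 9),        # from the slides
    "#alyonalyona": (6, 5),  # assumption: not given on the slides
    "#oscar": (4, 7),        # assumption: not given on the slides
}

stream = ["#quarantine", "#quarantine", "#brexit", "#brexit", "#alyonalyona",
          "#quarantine", "#tesla", "#tesla", "#quarantine", "#brexit",
          "#oscar", "#quarantine"]

table = [[0] * 10 for _ in range(2)]  # row 0 is h1, row 1 is h2
for tag in stream:
    for row, col in enumerate(SLIDE_POSITIONS[tag]):
        table[row][col] += 1


def estimate(tag: str) -> int:
    """Minimum of the counters selected by h1 and h2."""
    return min(table[row][col] for row, col in enumerate(SLIDE_POSITIONS[tag]))
```

With these values `estimate("#tesla")` is 2 and `estimate("#brexit")` is 3, both exact, while `estimate("#quarantine")` is 6 although the tag occurs 5 times: one other tag collides with it in both rows, which shows the overestimation in action.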
Count-Min Sketch (Frequency)
p = ⌈ln(1/σ)⌉
m = ⌈e/ɛ⌉, where e ≈ 2.71828
p - number of hash functions (rows)
σ - probability that an estimate exceeds the allowed error
m - number of counters per row
ɛ - overestimation factor (accepted overestimation divided by the number of elements)
Example:
We expect to store 10 million elements, σ should be ~1%, and the accepted overestimation is 10.
p = ⌈ln(1/0.01)⌉ = ⌈4.6⌉ = 5
ɛ = 10/10^7 = 10^-6
m = ⌈e/10^-6⌉ = 2,718,282
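The sizing example can be checked with a few lines of arithmetic. The function and variable names are illustrative, and the memory figure assumes 4-byte counters, which the slides do not specify.

```python
import math


def cms_depth(failure_probability: float) -> int:
    """Number of hash functions (rows): ceil(ln(1 / sigma))."""
    return math.ceil(math.log(1.0 / failure_probability))


def cms_width(epsilon: float) -> int:
    """Number of counters per row: ceil(e / epsilon)."""
    return math.ceil(math.e / epsilon)


n = 10**7                    # expected number of elements
epsilon = 10 / n             # accepted overestimation of 10 out of 10 million
depth = cms_depth(0.01)      # sigma of about 1%
width = cms_width(epsilon)
approx_bytes = depth * width * 4  # assumption: 32-bit counters
```

Under that assumption the whole sketch takes about 54 MB for 10 million elements.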
Conclusions
- Probabilistic data structures are not general purpose
- They should be used as an optimization
- They can save you memory and time
- They sound complex, but are not so scary in practice
- Learn them and impress your interviewer
https://www.amazon.com/Probabilistic-Data-Structures-Algorithms-Applications/dp/3748190484
Thanks!
