Stream Processing
Algorithms
1
AGENDA
• Introduction
• Data Stream Management
• Real life applications
• Streaming Queries
• Issues
• Sampling
• Filtering
• Counting Distinct Elements
• Moments
2
Introduction
3
The Stream Model
4
Data Stream Management
system
Stream processor – a kind of data-management system.
Any number of streams can enter the system.
Each stream can provide elements on its own schedule.
Streams need not have the same data rates or data types.
The time between elements of one stream need not be
uniform.
The rate of arrival of stream elements is not under the control
of the system; this distinguishes stream processing from a DBMS.
5
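[Figure: a data-stream-management system. Streams entering over time (e.g. ". . . 1, 5, 2, 7, 0, 9, 3", ". . . a, r, v, t, y, h, b", ". . . 0, 0, 1, 0, 1, 1, 0") feed a stream processor, which receives standing queries and ad-hoc queries, produces output, and uses limited working storage and archival storage.]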
6
Data Stream Management system-
contd..
A DBMS controls the rate at which data is read from the disk and
therefore never has to worry about data getting lost as it attempts to
execute queries.
Stream storage – i) Archival store
ii) Working store
Archival store
- larger
- can be examined only under special circumstances, using
time-consuming retrieval processes
7
Data Stream Management system-
contd..
Working store
- summaries or parts of streams may be placed here
- can be used for answering queries
- might be disk or main memory, depending
on how fast we need to process queries
- is of sufficiently limited capacity that it cannot
store all the data from all the streams.
8
Real Life Applications
• Web traffic
• Internet
• Sensor data
• Image data
9
Applications – Web traffic
10
Applications - Web traffic –
contd…
• A sudden increase in the click rate for a link
could indicate either of the following two:
11
Applications – Sensors
• Consider a temperature sensor bobbing about in the
ocean.
12
Applications – Sensors –
contd…
• If a GPS unit is attached to the sensor to report surface
height instead of temperature.
14
Applications – Image data
• Satellites often send down to earth streams consisting
of many terabytes of images per day.
15
Stream Queries
There are two ways of querying streams:
i) Standing query
ii) Ad-hoc query
16
Standing Queries
17
Standing Queries
Eg1 :
• Stream produced by the ocean-surface-temperature
sensor.
18
Standing Queries – contd…
• Eg2 :
• Another query - the maximum temperature ever recorded by that
sensor.
19
Standing Queries – contd…
• Eg 3 :
• We want the average temperature over all time.
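The maximum and average queries above can each be answered from a constant amount of state, without storing the stream. A minimal sketch, assuming readings arrive as floats; the class name `TemperatureSummary` and its methods are hypothetical names chosen for illustration:

```python
class TemperatureSummary:
    """Constant-space state for two standing queries over a stream of readings."""

    def __init__(self):
        self.maximum = None   # largest reading seen so far
        self.total = 0.0      # running sum of all readings
        self.count = 0        # number of readings seen so far

    def update(self, reading):
        """Process one stream element as it arrives."""
        if self.maximum is None or reading > self.maximum:
            self.maximum = reading
        self.total += reading
        self.count += 1

    def average(self):
        """Average over all time, or None if no reading has arrived yet."""
        return self.total / self.count if self.count else None


summary = TemperatureSummary()
for reading in [14.5, 15.0, 13.5, 16.0]:
    summary.update(reading)
```

Only three values (maximum, sum, count) are kept, so the working store stays the same size however long the stream runs.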
22
[Figure: a sliding window moving over a stream of characters; time runs from the past (left) to the future (right).]
23
Eg:
Web sites often like to report the number of unique users over
the past month.
• If we think of each login as a stream element, we can maintain a
window that is all logins in the most recent month.
• We must associate the arrival time with each login, so we
know when it no longer belongs to the window.
• If we think of the window as a relation Logins(name, time), then
it is simple to get the number of unique users over the past
month.
The SQL query is:
SELECT COUNT(DISTINCT(name))
FROM Logins
WHERE time >= t;
Here, t is a constant that represents the time one month before
the current time.
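The same query can be sketched outside SQL as a time-based window that drops logins once they are older than t. This is an illustration under assumptions, not a prescribed implementation: the class `LoginWindow`, its method names, and the use of plain integers for times are all hypothetical, and logins are assumed to arrive in non-decreasing time order.

```python
from collections import Counter, deque


class LoginWindow:
    """Time-based window over logins, supporting a distinct-user count."""

    def __init__(self, window_length):
        self.window_length = window_length  # e.g. one month, in any time unit
        self.logins = deque()               # (time, name) pairs, oldest first
        self.per_user = Counter()           # logins per user inside the window

    def add(self, name, time):
        """Record a login and expire anything that has left the window."""
        self.logins.append((time, name))
        self.per_user[name] += 1
        self._expire(time)

    def _expire(self, now):
        """Drop logins with time < t, where t = now - window_length."""
        t = now - self.window_length
        while self.logins and self.logins[0][0] < t:
            _, old_name = self.logins.popleft()
            self.per_user[old_name] -= 1
            if self.per_user[old_name] == 0:
                del self.per_user[old_name]

    def unique_users(self, now):
        """Counterpart of SELECT COUNT(DISTINCT(name)) ... WHERE time >= t."""
        self._expire(now)
        return len(self.per_user)


window = LoginWindow(window_length=30)
window.add("ann", 1)
window.add("bob", 10)
window.add("ann", 25)
window.add("cat", 45)
```

Note that this keeps every login in the window, so it only works when the whole window fits in the working store.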
24
Issues in stream
processing
• Streams often deliver elements very rapidly.
25
Issues in stream processing –
contd…
• Even when streams are “slow,” as in the sensor-data
example, there may be many such streams.
26
Issues in stream processing –
contd…
• Many problems about streaming data would be easy to solve if we had
enough memory.
• By using a hash function, one can avoid keeping the list of users.
• If the user hashes to bucket 0, then accept this search query for the
sample, and if not, then not.
31
Obtaining a Representative sample –
contd…
• More generally, we can obtain a sample consisting of any rational
fraction a/b of the users by hashing user names to b buckets, 0
through b − 1.
• Add the search query to the sample if the hash value is less than a.
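The a/b rule above can be sketched as follows. The choice of MD5 as the hash function is an assumption made only so that the example is deterministic across runs; any hash that spreads user names evenly over the b buckets would do. The function names `in_sample` and `sample_stream` are hypothetical.

```python
import hashlib


def in_sample(user, a, b):
    """Return True if this user's queries belong to the a/b sample."""
    digest = hashlib.md5(user.encode("utf-8")).digest()
    bucket = int.from_bytes(digest, "big") % b   # bucket in 0 .. b-1
    return bucket < a


def sample_stream(stream, a, b):
    """Keep the (user, query) pairs whose user hashes to a bucket below a."""
    return [(user, query) for user, query in stream if in_sample(user, a, b)]


users = [f"user{i}" for i in range(10000)]
kept_users = [u for u in users if in_sample(u, 3, 10)]
```

Because the decision depends only on the user name, either all of a user's queries are in the sample or none are, and no list of selected users has to be stored.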
32
Varying Sample Size
• In our running example, we retain all the search queries of the selected
users. Over time, more queries from those users are accumulated, and
new users that are selected for the sample will appear in the stream.
33
Exercise Problem
34
Exercise Problem contd…
For each of the queries below, indicate how you would construct the
sample. That is, tell what the key attributes should be.
• Estimate the fraction of courses where at least half the students got
“A.”
35
Filtering streams
• A common process on streams is selection, or filtering.
36
Filtering streams – contd…
37
Bloom Filtering – Example
39
Bloom Filtering – Example
• Since there are one billion members of S, approximately 1/8th of
the bits will be 1.
• If the bit to which that email address hashes is 1, then we let the
email through.
40
Bloom Filtering – Example
• Unfortunately, some spam email will get through.
• Approximately 1/8th of the stream elements whose email
address is not in S will happen to hash to a bit whose
value is 1 and will be let through.
• Since the majority of emails are spam (about 80%
according to some reports), eliminating 7/8th of the spam
is a significant benefit.
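The 1/8 figure can be checked with a short calculation. The sketch below assumes a single hash function and a bit array with eight times as many bits as S has members (the ratio that the 1/8 figure implies); n and m are symbols used here for the number of bits and the number of members of S.

```latex
\Pr[\text{a given bit is } 1]
  = 1 - \left(1 - \frac{1}{n}\right)^{m}
  \approx 1 - e^{-m/n}
  = 1 - e^{-1/8}
  \approx 0.1175
```

Under that assumption slightly fewer than 1/8 = 0.125 of the bits are 1, because some members of S hash to the same bit.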
41
Bloom Filtering – Example
• If we want to eliminate all spam, we need only check
for membership in S those good and bad emails that get
through the filter.
42
Bloom Filter
A Bloom filter consists of:
1. An array of n bits, initially all 0’s.
2. A collection of hash functions h1, h2, . . . , hk.
Each hash function maps “key” values to n buckets,
corresponding to the n bits of the bit-array
3. A set S of m key values.
43
Bloom Filter
n = 10
44
Bloom Filter
k=3
45
Bloom Filter
k=3
46
Bloom Filter – contd…
• To initialize the bit array, begin with all bits 0.
• Take each key value in S and hash it using each of the k hash
functions. Set to 1 each bit that is hi(K) for some hash function hi
and some key value K in S.
• To test a key K that arrives in the stream, check that all of h1(K),
h2(K), . . . , hk(K) are 1’s in the bit-array.
• If all are 1’s, then let the stream element through. If one or more of
these bits are 0, then K could not be in S, so reject the stream
element.
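The initialization and test steps above can be sketched as a small class. This is a sketch under assumptions: the k hash functions are simulated by salting SHA-256 with the index i, which is a choice made here for illustration, and the names `BloomFilter`, `add` and `might_contain` are hypothetical.

```python
import hashlib


class BloomFilter:
    def __init__(self, n, k):
        self.n = n              # number of bits in the array
        self.k = k              # number of hash functions
        self.bits = [0] * n     # initially all 0's

    def _positions(self, key):
        """Yield h1(key), ..., hk(key), each in the range 0 .. n-1."""
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest, "big") % self.n

    def add(self, key):
        """Set the bit at every hash position of a key from S."""
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        """False means the key is definitely not in S; True means it may be."""
        return all(self.bits[pos] for pos in self._positions(key))


S = ["ann@example.com", "bob@example.com", "cat@example.com"]
allowed = BloomFilter(n=1000, k=3)
for address in S:
    allowed.add(address)
```

Every member of S passes the test, so there are no false negatives; a non-member passes only if all k of its positions happen to have been set by other keys.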
47
Counting Distinct elements in a
stream
• Consider the problem of counting distinct elements in a
stream.
48
Counting Distinct elements in a
stream
• Consider a generalization of the problem of counting
distinct elements in a stream.
49
Moments
50
Moments – contd…
51
Surprise number
52
Alon-Matias-Szegedy (AMS)
algorithm
• Suppose we do not have enough space to
count all the mi’s for all the elements of the
stream.
54
AMS algorithm – contd…
55
AMS algorithm – contd…
a b c b d a c d a b d c a a b
• Average(X1, X2 , X3 ) = 55
• True value for our stream: 59
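The two numbers above can be reproduced with a short sketch of the AMS estimate, in which each variable records an element and the number of times it occurs from its position onward, and contributes n × (2 × value − 1). The positions 3, 8 and 13 are an assumption (they are not stated above); they are used because they give the average of 55. The function names are hypothetical.

```python
def ams_estimate(stream, positions):
    """Estimate the second moment from variables started at the given
    1-based positions; each variable yields n * (2 * value - 1)."""
    n = len(stream)
    estimates = []
    for pos in positions:
        element = stream[pos - 1]
        # value = occurrences of the element from this position onward
        value = stream[pos - 1:].count(element)
        estimates.append(n * (2 * value - 1))
    return sum(estimates) / len(estimates), estimates


def second_moment(stream):
    """Exact second moment: sum of squared occurrence counts."""
    counts = {}
    for element in stream:
        counts[element] = counts.get(element, 0) + 1
    return sum(c * c for c in counts.values())


stream = "a b c b d a c d a b d c a a b".split()
average, estimates = ams_estimate(stream, positions=[3, 8, 13])
```

With these positions the three variables give 75, 45 and 45, whose average is 55, against an exact value of 5² + 4² + 3² + 3² = 59.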
57
Higher order moments
58
Exercise
Compute the surprise number (second moment) for the stream
3, 1, 4, 1, 3, 4, 2, 1, 2. What is the third moment of this stream?
59
Solution for Exercise
3, 1, 4, 1, 3, 4, 2, 1, 2
Occurrence of 3 = 2 times
Occurrence of 1 = 3 times
Occurrence of 4 = 2 times
Occurrence of 2 = 2 times
Surprise number = 2^2 * 3 + 3^2 * 1 = 21
Third order moment = 2^3 + 3^3 + 2^3 + 2^3 = 51
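The solution can be checked mechanically. A minimal sketch, assuming the whole stream fits in memory (which is fine for an exercise but not for a real stream); the function name `moment` is hypothetical:

```python
from collections import Counter


def moment(stream, k):
    """k-th moment: sum over distinct elements of (occurrence count) ** k."""
    return sum(count ** k for count in Counter(stream).values())


stream = [3, 1, 4, 1, 3, 4, 2, 1, 2]
surprise_number = moment(stream, 2)
third_moment = moment(stream, 3)
```

The zeroth moment from the same function is the number of distinct elements, and the first moment is the stream length.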
60
Dealing with infinite streams
• The estimate we used for second and higher moments assumes
that n, the stream length, is a constant.
61
Dealing with infinite streams –
contd…
Problem: we must be careful how we select the positions for the variables.
If we do this selection once and for all, then as the stream gets longer,
- we are biased in favour of early positions,
- the estimate of the moment will be too large.
If we wait too long to pick positions, then
- early in the stream we do not have many variables,
- we will get an unreliable estimate.
62
Dealing with infinite streams – contd…
Solution :
• Maintain as many variables as we can store at all times, and throw
some out as the stream grows.
• At all times, the probability of picking any one position for a variable
is the same as that of picking any other position.
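One way to meet both requirements is reservoir sampling over positions: keep the first s positions, then keep the n-th position with probability s/n, replacing a uniformly chosen existing variable. The sketch below is an illustration under assumptions: the class and method names are hypothetical, and the estimate uses n × (2 × value − 1) per variable for the second moment.

```python
import random


class AMSReservoir:
    """Keep s AMS variables whose positions are uniform over the stream so far."""

    def __init__(self, s, rng=None):
        self.s = s
        self.n = 0                       # stream elements seen so far
        self.variables = []              # each is [element, value]
        self.rng = rng or random.Random()

    def process(self, element):
        self.n += 1
        # Every existing variable counts later occurrences of its element.
        for var in self.variables:
            if var[0] == element:
                var[1] += 1
        if len(self.variables) < self.s:
            # The first s positions are all kept.
            self.variables.append([element, 1])
        elif self.rng.random() < self.s / self.n:
            # Keep the new position with probability s/n and
            # throw out one existing variable chosen uniformly.
            victim = self.rng.randrange(self.s)
            self.variables[victim] = [element, 1]

    def second_moment_estimate(self):
        """Average of n * (2 * value - 1) over the current variables."""
        if not self.variables:
            return None
        total = sum(self.n * (2 * v - 1) for _, v in self.variables)
        return total / len(self.variables)


stream = "a b c b d a c d a b d c a a b".split()
small = AMSReservoir(s=3, rng=random.Random(0))
full = AMSReservoir(s=len(stream))
for element in stream:
    small.process(element)
    full.process(element)
```

When s is at least the stream length, every position is kept and the estimate equals the exact second moment; with smaller s the estimate varies with the random choices.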
63
Counting Bits – (1)
64
Counting Bits – (2)
• You can’t get an exact answer without storing the entire window.
• Real Problem: what if we cannot afford to store N bits?
• E.g., we are processing 1 billion streams and N = 1 billion.
• But we’re happy with an approximate answer.
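For contrast, the exact approach that stores all N bits can be sketched as below. The query form ("how many 1's are among the most recent k bits, for k ≤ N") is an assumption about the problem being posed, and the class name `ExactBitWindow` is hypothetical.

```python
from collections import deque


class ExactBitWindow:
    """Exact baseline: keeps the last n bits of the stream."""

    def __init__(self, n):
        self.window = deque(maxlen=n)   # stores the last n bits

    def add(self, bit):
        self.window.append(bit)         # the oldest bit falls out automatically

    def count_ones(self, k):
        """Number of 1's among the most recent k bits (k <= n)."""
        recent = list(self.window)[-k:] if k > 0 else []
        return sum(recent)


bits = ExactBitWindow(n=4)
for bit in [1, 0, 1, 1, 0, 1]:
    bits.add(bit)
```

This costs N bits per stream, which is exactly the storage that the bullets above say may be unaffordable.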
65
Something That Doesn’t (Quite)
Work
66
Example
67
What’s Good?
68
What’s Not So Good?
• As long as the 1’s are fairly evenly distributed, the error due to the
unknown region is small – no more than 50%.
• But it could be that all the 1’s are in the unknown area at the end.
• In that case, the error is unbounded.
69
Fixup
70