
Stream Processing Algorithms

AGENDA
• Introduction
• Data Stream Management
• Real life applications
• Streaming Queries
• Issues
• Sampling
• Filtering
• Counting Distinct Elements
• Moments

Introduction

• In a DBMS, input is under the control of the programmer.
• Stream management is important when the input rate is controlled externally.
• Example: Google queries.
The Stream Model

• Input tuples enter at a rapid rate, at one or more input ports.
• The system cannot store the entire stream accessibly.
• How do you make critical calculations about the stream using a limited amount of (secondary) memory?
Data Stream Management system
 Stream processor – a kind of data management system.
 Any number of streams can enter the system.
 Each stream can provide elements on its own schedule; streams need not have the same data rates or data types.
 The time between elements of one stream need not be uniform.
 The fact that the rate of arrival of stream elements is not under the control of the system is what distinguishes stream processing from a DBMS.
[Figure: A data stream management system. Several streams (e.g., . . . 1, 5, 2, 7, 0, 9, 3 and . . . a, r, v, t, y, h, b and . . . 0, 0, 1, 0, 1, 1, 0) enter the stream processor over time; the processor answers standing queries and ad-hoc queries and produces output, backed by a limited working storage and an archival storage.]
Data Stream Management system – contd..
 A DBMS controls the rate at which data is read from disk, and therefore never has to worry about data getting lost as it attempts to execute queries.
 Stream storage has two parts: i) archival store, ii) working store.
 Archival store
 – larger
 – can be examined only under special circumstances, using time-consuming retrieval processes
Data Stream Management system – contd..
 Working store
 – holds summaries or parts of streams
 – can be used for answering queries
 – might be disk or main memory, depending on how fast we need to process queries
 – of sufficiently limited capacity that it cannot store all the data from all the streams
Real Life Applications
• Web traffic
• Internet
• Sensor data
• Image data

Applications – Web traffic

• Web sites receive streams of various types.

Mining query streams
• Google wants to know what queries are more frequent today than yesterday.

Mining click streams
• Yahoo wants to know which of its pages got an unusual number of hits in the past hour.
• Many interesting things can be learned – e.g., an increase in queries like “dengue fever symptoms” enables us to predict the number of sufferers.
Applications – Web traffic – contd…
• A sudden increase in the click rate for a link could indicate either of the following:

• some news connected to that page, or

• the link is broken and needs to be repaired.
Applications – Sensors
• Consider a temperature sensor bobbing about in the ocean.
• It sends a reading of the surface temperature to a base station each hour.
• The data produced by this sensor is a stream of real numbers.
• The data rate is very low.
Applications – Sensors – contd…
• Suppose a GPS unit is attached to the sensor, to report surface height instead of temperature.
• The surface height varies quite rapidly, so the sensor would send back a reading every tenth of a second.
• If it sends a 4-byte real number each time, then it produces 3.5 megabytes per day.
• To learn something about ocean behavior, it is necessary to deploy a million sensors, each sending back a stream at the rate of ten readings per second.
Applications – Sensors – contd…
• With a million sensors, there would be one for every 150 square miles of ocean.
• Then 3.5 terabytes of data arrive every day.
• We definitely need to think about
 – what can be kept in working storage, and
 – what can only be archived.
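The rates quoted on these slides can be checked with a little arithmetic (using decimal megabytes and terabytes):

```python
# One sensor: a 4-byte reading every tenth of a second.
bytes_per_reading = 4
readings_per_sec = 10
seconds_per_day = 24 * 60 * 60               # 86,400

per_sensor_per_day = bytes_per_reading * readings_per_sec * seconds_per_day
print(per_sensor_per_day / 1e6)              # ~3.46 MB/day, i.e. roughly 3.5 MB

# A fleet of a million such sensors:
sensors = 1_000_000
fleet_per_day = per_sensor_per_day * sensors
print(fleet_per_day / 1e12)                  # ~3.46 TB/day, roughly 3.5 TB
```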
Applications – Image data
• Satellites often send down to earth streams consisting of many terabytes of images per day.
• Surveillance cameras produce images with lower resolution than satellites, but there can be many of them, each producing a stream of images at intervals like one second.
• London is said to have six million such cameras, each producing a stream.
Stream Queries
There are two ways of querying streams:

i) Standing queries
ii) Ad-hoc queries
Standing Queries

• The processor contains a place where standing queries are stored.
• These queries are, in a sense, permanently executing, and produce outputs at appropriate times.
Standing Queries
Eg 1:
• Consider the stream produced by the ocean-surface-temperature sensor.
• A standing query outputs an alert whenever the temperature exceeds 25 degrees centigrade.
• This query is easily answered, since it depends only on the most recent stream element.
Standing Queries – contd…
Eg 2:
• Another query: the maximum temperature ever recorded by that sensor.
• We can answer this query by retaining a simple summary: the maximum of all stream elements ever seen.
• It is not necessary to record the entire stream.
• When a new stream element arrives, we compare it with the stored maximum and set the maximum to whichever is larger.
Standing Queries – contd…
Eg 3:
• Suppose we want the average temperature over all time.
• We have only to record two values: the number of readings ever sent in the stream and the sum of those readings.
• We can adjust these values easily each time a new reading arrives.
• We produce their quotient as the answer to the query.
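All three standing queries above need only constant-size state. A minimal sketch (the class and its field names are illustrative, not from the slides):

```python
class TemperatureQueries:
    """Constant-state standing queries over a stream of temperature readings."""

    def __init__(self, alert_threshold=25.0):
        self.alert_threshold = alert_threshold   # Eg 1: alert level
        self.maximum = float("-inf")             # Eg 2: max ever seen
        self.count = 0                           # Eg 3: number of readings...
        self.total = 0.0                         # ...and their running sum

    def process(self, reading):
        self.maximum = max(self.maximum, reading)
        self.count += 1
        self.total += reading
        # Eg 1 depends only on the most recent element:
        return reading > self.alert_threshold

    def average(self):
        return self.total / self.count           # quotient of the two values

q = TemperatureQueries()
alerts = [q.process(t) for t in [20.0, 26.5, 23.0, 24.5]]
print(alerts)       # [False, True, False, False]
print(q.maximum)    # 26.5
print(q.average())  # 23.5
```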
Ad-hoc Queries
• A question asked once about the current state of a stream or streams.
• If we do not store all streams in their entirety, then we cannot answer arbitrary queries about streams.
• If we have some idea what kinds of queries will be asked, we can prepare by storing appropriate parts or summaries of streams.
• To satisfy a wide variety of ad-hoc queries, a common approach is to store a sliding window of each stream in the working store.
Sliding Windows
• A useful model of stream processing is that queries are about a window of length N – the N most recent elements received.
• Interesting case: N is so large that it cannot be stored in memory, or even on disk.
• Or, the window can be all the elements that arrived within the last t time units, e.g., one day.
• Or, there are so many streams that windows for all of them cannot be stored.
[Figure: a window of the N most recent elements sliding over a stream of characters; elements to the left of the window are the past, elements to the right are the future.]
Eg:
Web sites often like to report the number of unique users over the past month.
• If we think of each login as a stream element, we can maintain a window that is all logins in the most recent month.
• We must associate the arrival time with each login, so we know when it no longer belongs to the window.
• If we think of the window as a relation Logins(name, time), then it is simple to get the number of unique users over the past month. The SQL query is:

SELECT COUNT(DISTINCT(name))
FROM Logins
WHERE time >= t;

Here, t is a constant that represents the time one month before the current time.
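A minimal in-memory sketch of such a window, keeping only each name's most recent login time (the names, times, and the one-month constant here are illustrative):

```python
MONTH = 30 * 24 * 3600  # window length in seconds (illustrative)

last_login = {}  # name -> most recent login time seen

def log_in(name, time):
    last_login[name] = time

def unique_users(now):
    """Count distinct users whose latest login falls within the last month."""
    expired = [n for n, t in last_login.items() if t < now - MONTH]
    for name in expired:            # evict logins that left the window
        del last_login[name]
    return len(last_login)

log_in("alice", 100)
log_in("bob", 200)
log_in("alice", MONTH + 50)         # alice logs in again much later
print(unique_users(MONTH + 300))    # 1: only alice's recent login remains
```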
Issues in stream processing
• Streams often deliver elements very rapidly.
• We must process elements in real time, or we lose the opportunity to process them at all without accessing the archival storage.
• It is important that the stream-processing algorithm executes in main memory, without access to secondary storage or with only rare accesses to secondary storage.
Issues in stream processing – contd…
• Even when streams are “slow,” as in the sensor-data example, there may be many such streams.
• Even if each stream by itself can be processed using a small amount of main memory, the requirements of all the streams together can easily exceed the amount of available main memory.
Issues in stream processing – contd…
• Many problems about streaming data would be easy to solve if we had enough memory.
• New techniques are required in order to execute them at a realistic rate on a machine of realistic size.
• Two generalizations about stream algorithms:
 • It is often much more efficient to get an approximate answer to our problem than an exact solution.
 • A variety of techniques related to hashing are useful – they introduce useful randomness into the algorithm’s behavior, producing an approximate answer that is very close to the true result.
Sampling Data in a Stream
• Problem: extracting reliable samples from a stream.
• If we know what queries are to be asked, then there are a number of methods for sampling.
• We are looking for a technique that will allow ad-hoc queries on the sample.
• Eg: A search engine receives a stream of queries, and it would like to study the behavior of typical users. Assume that the stream consists of tuples (user, query, time).
Sampling Data in a Stream – Example
• Suppose that we want to answer queries such as “What fraction of the typical user’s queries were repeated over the past month?”
• Assume also that we wish to store only 1/10th of the stream elements.
• The obvious approach would be to generate a random number, say an integer from 0 to 9, in response to each search query.
• Store the tuple if and only if the random number is 0.
• If we do so, each user has, on average, 1/10th of their queries stored.
Obtaining a Representative sample
• The previous query cannot be answered correctly by taking a sample of each user’s search queries.
• Instead, we should strive to pick 1/10th of the users, and take all their searches for the sample.
• If we can store a list of all users, and whether or not they are in the sample, then we could do the following:
• Each time a search query arrives in the stream, we look up the user to see whether or not they are in the sample.
Obtaining a Representative sample – contd…
• That method works as long as we can keep the list of all users and their in/out decision in main memory (because there isn’t time to go to disk for every search that arrives).
• By using a hash function, one can avoid keeping the list of users.
• I.e., we hash each user name to one of ten buckets, 0 through 9.
• If the user hashes to bucket 0, then accept this search query for the sample, and if not, then not.
Obtaining a Representative sample – contd…
• More generally, we can obtain a sample consisting of any rational fraction a/b of the users by hashing user names to b buckets, 0 through b − 1.
• Add the search query to the sample if the hash value is less than a.
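A sketch of this hash-based sampling. Using SHA-256 as the hash function is an arbitrary implementation choice; any well-mixing hash of the user name works:

```python
import hashlib

def in_sample(user, a, b):
    """Keep this user's queries iff the user name hashes to one of
    buckets 0..a-1 out of b, i.e. sample a fraction a/b of the users."""
    digest = hashlib.sha256(user.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % b
    return bucket < a

# The decision depends only on the user name, so either ALL of a
# user's queries enter the sample or none do:
stream = [("alice", "q1"), ("bob", "q2"), ("alice", "q3")]
sample = [(u, q) for (u, q) in stream if in_sample(u, 1, 10)]  # ~1/10 of users
print(in_sample("alice", 10, 10), in_sample("alice", 0, 10))   # True False
```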
Varying Sample Size

• The sample will grow as more of the stream enters.
• In our running example, we retain all the search queries of the selected 1/10th of the users, forever.
• As time goes on, more searches for the same users will be accumulated, and new users that are selected for the sample will appear in the stream.
Exercise Problem

• Suppose we have a stream of tuples with the schema

Grades(university, courseID, studentID, grade)

Assume universities are unique, but a courseID is unique only within a university (i.e., different universities may have different courses with the same ID, e.g., “CS101”), and likewise studentIDs are unique only within a university (different universities may assign the same ID to different students).

Suppose we want to answer certain queries approximately from a 1/20th sample of the data.
Exercise Problem – contd…

For each of the queries below, indicate how you would construct the sample. That is, tell what the key attributes should be.

• For each university, estimate the average number of students in a course.
• Estimate the fraction of students who have a GPA of 3.5 or more.
• Estimate the fraction of courses where at least half the students got “A.”
Filtering streams
• A common process on streams is selection, or filtering.
• Accept those tuples in the stream that meet a criterion.
• Accepted tuples are passed to another process as a stream, while other tuples are dropped.
• If the selection criterion is a property of the tuple that can be calculated (e.g., the first component is less than 10), then the selection is easy to do.
• The problem becomes harder when the criterion involves lookup for membership in a set.
Filtering streams – contd…

• It is especially hard when that set is too large to store in main memory.
• A technique known as “Bloom filtering” gives a way to eliminate most of the tuples that do not meet the criterion.
Bloom Filtering – Example

• Suppose we have a set S of one billion allowed email addresses – believed not to be spam.
• The stream consists of pairs: an email address and the email itself.
• Since a typical email address is 20 bytes or more, it is not reasonable to store S in main memory.
• We could use disk accesses to determine whether or not to let through any given stream element.
• Or we can devise a method that requires no more main memory than we have available, and yet filters most of the undesired stream elements.
Bloom Filtering – Example
• Suppose one gigabyte of main memory is available.
• In Bloom filtering, that main memory is used as a bit array.
• There is room for eight billion bits, since one byte equals eight bits.
• Devise a hash function h from email addresses to eight billion buckets.
• Hash each member of S to a bit, and set that bit to 1; all other bits of the array remain 0.
Bloom Filtering – Example
• Since there are one billion members of S, approximately 1/8th of the bits will be 1.
• The exact fraction of bits set to 1 will be slightly less than 1/8th, because it is possible that two members of S hash to the same bit.
• When a stream element arrives, we hash its email address.
• If the bit to which that email address hashes is 1, then we let the email through.
• But if the email address hashes to a 0, we can drop this stream element.
Bloom Filtering – Example
• Unfortunately, some spam email will get through.
• Approximately 1/8th of the stream elements whose email address is not in S will happen to hash to a bit whose value is 1, and will be let through.
• Since the majority of emails are spam (about 80% according to some reports), eliminating 7/8th of the spam is a significant benefit.
Bloom Filtering – Example
• If we want to eliminate all spam, we need only check for membership in S those good and bad emails that get through the filter.
• Those checks will require the use of secondary memory to access S itself.
• Alternatively, we could use a cascade of filters, each of which would eliminate 7/8th of the remaining spam.
Bloom Filter
A Bloom filter consists of:
1. An array of n bits, initially all 0’s.
2. A collection of hash functions h1, h2, . . . , hk. Each hash function maps “key” values to n buckets, corresponding to the n bits of the bit array.
3. A set S of m key values.

• The purpose of the Bloom filter is to allow through all stream elements whose keys are in S, while rejecting most of the stream elements whose keys are not in S.
[Figures: a Bloom filter bit array with n = 10 bits and k = 3 hash functions; each key in S is hashed by all three functions, and the corresponding bits are set to 1.]
Bloom Filter – contd…
• To initialize the bit array, begin with all bits 0.
• Take each key value in S and hash it using each of the k hash functions.
• Set to 1 each bit that is hi(K) for some hash function hi and some key value K in S.
• To test a key K that arrives in the stream, check that all of h1(K), h2(K), . . . , hk(K) are 1’s in the bit array.
• If all are 1’s, then let the stream element through. If one or more of these bits are 0, then K could not be in S, so reject the stream element.
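A small sketch of such a filter. Deriving the k hash functions from salted SHA-256 is one implementation choice, not part of the definition above; the sizes are toy values:

```python
import hashlib

class BloomFilter:
    def __init__(self, n_bits, k):
        self.n = n_bits
        self.k = k
        self.bits = bytearray((n_bits + 7) // 8)  # array of n bits, all 0's

    def _positions(self, key):
        # k bucket numbers for this key, one per (salted) hash function.
        for i in range(self.k):
            d = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(d[:8], "big") % self.n

    def add(self, key):
        """Hash a member of S with all k functions; set those bits to 1."""
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        """All k bits 1 => let the element through; any 0 => K is not in S."""
        return all(self.bits[p // 8] >> (p % 8) & 1 for p in self._positions(key))

bf = BloomFilter(n_bits=10_000, k=3)
for addr in ["good@example.com", "ok@example.com"]:
    bf.add(addr)
print(bf.might_contain("good@example.com"))  # True: no false negatives
print(bf.might_contain("spam@example.com"))  # very likely False (small false-positive rate)
```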
Counting Distinct elements in a stream
• Consider the problem of counting distinct elements in a stream, e.g., counting the number of unique users logged in.
• We can use efficient search structures such as hashing.
• But if the number of distinct elements is too high, then we need more main memory or more machines.
Counting Distinct elements in a stream
• Consider a generalization of the problem of counting distinct elements in a stream.
• The problem, called computing “moments,” involves the distribution of frequencies of different elements in the stream.
Moments
• Suppose a stream consists of elements chosen from a universal set, and let mi be the number of occurrences of the ith element.
• The kth-order moment of the stream is the sum over all elements i of (mi)^k.

Moments – contd…
• The 0th moment is the number of distinct elements in the stream.
• The 1st moment is the length of the stream.
• The 2nd moment, also called the surprise number, measures how uneven the distribution of elements is.
Surprise number

Consider a stream with 100 elements, of which 11 are distinct.

i) 10 elements occur 9 times each, and one element occurs 10 times:
Surprise number (2nd moment) = 9² × 10 + 10² × 1 = 910

ii) 10 elements occur 1 time each, and one element occurs 90 times:
Surprise number (2nd moment) = 1² × 10 + 90² × 1 = 8110
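When the counts fit in memory, moments can be computed exactly; the following reproduces both surprise numbers above:

```python
from collections import Counter

def moment(stream, k):
    """kth moment: sum of (occurrence count)**k over distinct elements."""
    return sum(m ** k for m in Counter(stream).values())

# Case (i): 10 elements occurring 9 times each, one occurring 10 times.
case1 = [f"e{i}" for i in range(10) for _ in range(9)] + ["x"] * 10
print(len(case1), moment(case1, 2))  # 100 910

# Case (ii): 10 elements occurring once each, one occurring 90 times.
case2 = [f"e{i}" for i in range(10)] + ["x"] * 90
print(len(case2), moment(case2, 2))  # 100 8110
```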
Alon-Matias-Szegedy (AMS) algorithm
• Suppose we do not have enough space to count all the mi’s for all the elements of the stream.
• We can still estimate the second moment of the stream using a limited amount of space.
• The more space we use, the more accurate the estimate will be.
• We compute some number of variables X.
Alon-Matias-Szegedy (AMS) algorithm
Let us define
X = (element, value)
X.element: an element of the universal set
X.value: a count of X.element in the stream, starting at a randomly chosen position

Example:
n = 15
Stream: a b c b d a c d a b d c a a b
ma² + mb² + mc² + md² = 5² + 4² + 3² + 3² = 59
AMS algorithm – contd…

To determine the value of a variable X:
• Choose a position in the stream between 1 and n, at random.
• Set X.element to be the element found there and initialize X.value to 1.
• As we read the stream, add 1 to X.value each time we encounter another occurrence of X.element.
AMS algorithm – contd…

(1) Randomly pick 3 positions in the stream (3 variables from which to compute the 2nd-order moment):

a b c b d a c d a b d c a a b

Say positions 3, 8 and 13 are picked: X1 = (c, 1), X2 = (d, 1), X3 = (a, 1).

(2) Process the stream one element at a time, incrementing a variable’s value at each later occurrence of its element:

X1 = (c, 2), then X1 = (c, 3); X2 = (d, 2); X3 = (a, 2).
AMS algorithm – contd…
• Estimate of the second-order moment from any X = (element, value):
Estimate = n × (2 × X.value − 1)

• Estimate from X1: 15 × (2 × 3 − 1) = 75
• Estimate from X2: 15 × (2 × 2 − 1) = 45
• Estimate from X3: 15 × (2 × 2 − 1) = 45

• Average(X1, X2, X3) = 55
• True value for our stream: 59
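The worked example can be reproduced directly (positions are 1-based, matching the slides):

```python
def ams_estimate(stream, position):
    """AMS estimate of the 2nd moment from one variable whose position
    (1-based) was chosen in advance: n * (2 * X.value - 1)."""
    n = len(stream)
    element = stream[position - 1]                 # X.element
    value = sum(1 for x in stream[position - 1:] if x == element)  # X.value
    return n * (2 * value - 1)

stream = list("abcbdacdabdcaab")                   # n = 15
estimates = [ams_estimate(stream, p) for p in (3, 8, 13)]  # X1, X2, X3
print(estimates)                                   # [75, 45, 45]
print(sum(estimates) / len(estimates))             # 55.0, vs. true value 59
```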
Higher order moments
• The same technique estimates kth moments for k > 2: if v = X.value, use the estimate n × (v^k − (v − 1)^k).
• For k = 2 this reduces to the formula above, since v² − (v − 1)² = 2v − 1.
Exercise
Compute the surprise number (second moment) for the stream
3, 1, 4, 1, 3, 4, 2, 1, 2. What is the third moment of this stream?

Solution for Exercise

Stream: 3, 1, 4, 1, 3, 4, 2, 1, 2
Occurrences of 3 = 2 times
Occurrences of 1 = 3 times
Occurrences of 4 = 2 times
Occurrences of 2 = 2 times
Surprise number = 2² × 3 + 3² × 1 = 21
Third-order moment = 2³ + 3³ + 2³ + 2³ = 51
Dealing with infinite streams
• The estimate we used for second and higher moments assumes that n, the stream length, is a constant.
• In practice, n grows with time.
• That fact doesn’t cause problems, since we store only the values of variables and multiply some function of each value by n when it is time to estimate the moment.
• We can count the number of stream elements seen and store this value (which only requires log n bits).
Dealing with infinite streams – contd…
 Problem: we must be careful how we select the positions for the variables.
 If we do this selection once and for all, then as the stream gets longer:
 – we are biased in favour of early positions,
 – the estimate of the moment will be too large.
 If we wait too long to pick positions, then:
 – early in the stream we do not have many variables,
 – we will get an unreliable estimate.
Dealing with infinite streams – contd…
 Solution:
• Maintain as many variables as we can store at all times, and throw some out as the stream grows.
• The discarded variables are replaced by new ones, in such a way that at all times the probability of picking any one position for a variable is the same as that of picking any other position.
• Suppose we have space to store s variables; the first s positions are each picked as the position of one of the s variables.
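This is reservoir sampling applied to stream positions: keep the first s positions, then, when the nth element arrives, keep position n with probability s/n, evicting a uniformly chosen stored variable. A sketch of the position-selection step:

```python
import random

def sample_positions(stream_length, s, rng=random):
    """Choose s positions uniformly from 1..stream_length in one pass."""
    reservoir = []
    for n in range(1, stream_length + 1):
        if n <= s:
            reservoir.append(n)              # first s positions are all kept
        elif rng.random() < s / n:           # keep position n with prob. s/n...
            reservoir[rng.randrange(s)] = n  # ...evicting a random variable
    return reservoir

random.seed(42)
positions = sample_positions(10_000, 5)
print(len(positions))                            # 5
print(all(1 <= p <= 10_000 for p in positions))  # True
```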
Counting Bits – (1)

• Problem: given a stream of 0’s and 1’s, be prepared to answer queries of the form “how many 1’s are in the last k bits?” where k ≤ N.
• Obvious solution: store the most recent N bits.
• When a new bit comes in, discard the (N + 1)st bit.
Counting Bits – (2)

• You can’t get an exact answer without storing the entire window.
• Real problem: what if we cannot afford to store N bits?
• E.g., we are processing 1 billion streams and N = 1 billion.
• But we’re happy with an approximate answer.
Something That Doesn’t (Quite) Work

• Summarize exponentially increasing regions of the stream, looking backward.
• Drop small regions if they begin at the same point as a larger region.
Example

[Figure: the bit stream 001110001010010001011011011100101011001101, summarized by exponentially increasing regions looking backward, each labeled with its count of 1’s (counts 1, 2, 2, 3, 4, 6, 10, 10, with the newest region marked “?”). We can construct the count of the last N bits, except we’re not sure how many of the last 6 are included.]
What’s Good?

• Stores only O(log² N) bits:
• O(log N) counts of log₂ N bits each.
• Easy to update as more bits enter.
• Error in count no greater than the number of 1’s in the “unknown” area.
What’s Not So Good?

• As long as the 1’s are fairly evenly distributed, the error due to the
unknown region is small – no more than 50%.
• But it could be that all the 1’s are in the unknown area at the end.
• In that case, the error is unbounded.

Fixup

• Instead of summarizing fixed-length blocks, summarize blocks with specific numbers of 1’s.
• Let the block sizes (number of 1’s) increase exponentially.
• When there are few 1’s in the window, block sizes stay small, so errors are small.
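The standard realization of this fixup is the DGIM algorithm. The sketch below is a simplified, unoptimized version: each bucket records the timestamp of its most recent 1 and a power-of-two count of 1’s, with at most two buckets of each size; when a third appears, the two oldest are merged.

```python
class DGIM:
    """Approximate count of 1's in the last `window` bits, O(log^2 N) space."""

    def __init__(self, window):
        self.window = window
        self.t = 0
        self.buckets = []  # [end_timestamp, size]; newest first, sizes non-decreasing

    def add(self, bit):
        self.t += 1
        # Drop the oldest bucket once its most recent 1 leaves the window.
        if self.buckets and self.buckets[-1][0] <= self.t - self.window:
            self.buckets.pop()
        if not bit:
            return
        self.buckets.insert(0, [self.t, 1])
        i = 0
        # Allow at most two buckets of each size: merge the two oldest of three.
        while i + 2 < len(self.buckets) and self.buckets[i + 2][1] == self.buckets[i][1]:
            self.buckets[i + 1][1] *= 2   # merged bucket: double size,
            del self.buckets[i + 2]       # keeping the newer end-timestamp
            i += 1

    def count_ones(self):
        """Estimate of 1's in the window: all buckets, minus half the oldest."""
        if not self.buckets:
            return 0
        total = sum(size for _, size in self.buckets)
        return total - self.buckets[-1][1] // 2

d = DGIM(window=100)
for _ in range(4):
    d.add(1)
print(d.count_ones())  # 3 (true count is 4; error bounded by half the oldest bucket)
```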
