SlideShare a Scribd company logo
Mining Data Streams
1
Mining Complex data
 Stream data
 Massive data, temporally ordered, fast changing and potentially
infinite
 Satellite Images, Data from electric power grids
 Time-Series data
 Sequence of values obtained over time
 Economic and Sales data, natural phenomenon
 Sequence data
 Sequences of ordered elements or events (without time)
 DNA and protein sequences
 Graphs
 Represent various kinds of Networks
 Social Network Analysis
2
Mining Data Streams
 Streams – Temporally ordered, fast changing, massive
and potentially infinite.
 Characteristics
 Huge volumes of continuous data, possibly infinite
 Fast changing and requires fast, real-time response
 Most stream data are at pretty low-level or multi-dimensional
 Random access is expensive—Single scan algorithms are
required
 On-line, multi-level and multi-dimensional
3
4
Architecture: Stream Query Processing
Scratch SpaceScratch Space
(Main memory and/or Disk)(Main memory and/or Disk)
User/ApplicationUser/ApplicationUser/ApplicationUser/Application
Continuous QueryContinuous Query
Stream QueryStream Query
ProcessorProcessor
ResultsResults
Multiple streamsMultiple streams
DSMS (Data Stream Management System)
Stream Queries
 Queries are often continuous
 Evaluated continuously as stream data arrives
 Answer updated over time
 Queries are often complex
 Beyond element-at-a-time processing
 Beyond stream-at-a-time processing
 Beyond relational queries
 Query types
 One-time query vs. continuous query (being evaluated continuously as
stream continues to arrive)
 Predefined query vs. ad-hoc query (issued on-line)
 Unbounded memory requirements
5
Stream Query Processing
Challenges
 Unbounded memory requirements
 For real-time response, main memory algorithm should be used
 Memory requirement is unbounded due to future tuples
 Approximate query answering
 With bounded memory, it is not always possible to produce exact
answers
 High-quality approximate answers are desired
 Data reduction and synopsis construction methods
 Sketches, random sampling, histograms, wavelets, etc.
6
Methodologies
 Impractical to scan more than once
 Even inspection is an issue
 Gigantic size rules out storage
 Size of Universe is also very large
 Even generating summaries is not possible
 New data structures, techniques and algorithms are needed
 Tradeoff between Storage and accuracy
 Synopses
 Summaries of data - support approximate answers
 Use Synopsis data structures
 Algorithms must use poly-logarithmic space O(logk
N)
 Compute an approximate answer within a factor ε of the actual answer –
As approximation factor goes down, space requirements increase
7
Synopsis Data Structures and
Techniques
 Random sampling
 Histograms
 Sliding windows
 Multi-resolution model
 Sketches
 Randomized algorithms
8
 Random Sampling
 Without knowing the total length in advance
 Reservoir sampling
 Maintain a set of s candidates in the reservoir, which form a true random sample of
the element seen so far in the stream.
 As the data stream flow, every new element has a certain probability (s/N) of
replacing an old element in the reservoir.
 Sliding Windows
 Make decisions based only on recent data of sliding window size w
 An element arriving at time t expires at time t + w
9
Synopsis Data Structures and
Techniques
 Histograms
 Approximate the frequency distribution of element values in a stream
 Partition data into a set of buckets
 Equal-width (equal value range for buckets) vs. V-optimal (minimizing
frequency variance within each bucket)
 Multi-resolution methods
 Data reduction through divide and conquer
 Balanced Binary tree – each level provides a different resolution
 Hierarchical clustering of trees
 Micro clusters, Macro clusters and Summary Statistics
 Wavelets – For spatial and multimedia data
 Builds a multi-resolution hierarchy structure of input signal
10
Synopsis Data Structures and
Techniques
Synopsis Data Structures and
Techniques
 Sketches
 Sampling and Sliding window focus on small amount of data
 Histograms and Wavelets require several passes
 Sketches use Frequency moments
 U = {1, 2, … v} A = {a1, a2, …aN}
 F0 – number of distinct elements
 F1 – length of sequence
 F2 – Self-join size / repeat rate / Gini’s index of homogeneity
 Frequency moments – give information about data and degree of skew or
asymmetry
 Given N elements and v values, sketches can approximate F0, F1, F2 in O(log v
+ log N) space
 Each element is hashed onto {-1, +1} – Sketch Partitioning
 Random variable X = ∑i mizi is maintained and can be used to approximate
moments
11
Synopsis Data Structures and
Techniques
 Randomized Algorithms
 Las Vegas – right answer but running time varies
 Monte Carlo – time bound; may not always return the correct
answer
 When random variables are used, its deviation from the expected
value must be bounded
 Chebyshev’s inequality:
 Let X be a random variable with mean μ and standard deviation σ
 Chernoff bound:
 Let X be the sum of independent Poisson trials X1, …, Xn, δ in (0, 1]
 The probability decreases exponentially as we move from the mean
12
2
2
)|(|
k
kXP
σ
µ ≤>−
4/2
|])1([ µδ
µδ −
<+< eXP
Stream OLAP and Stream Data Cubes
 Accommodating Stream data completely in a warehouse is
challenging
 Data is low-level so multi-dimensional analysis has to be done–
OLAP Analysis is needed
 Example: Power Supply Stream data – fluctuation of power usage
will be analyzed at higher levels such as by city/district.. / hourly…
 Can be viewed as a Virtual data cube
 Efficient methods are needed for systematic analysis
 Impossible to store complete stream data and compute a fully
materialized cube
 Compression Techniques are needed
13
Stream OLAP - Compression
 A tilted time frame
 Different time granularities
 second, minute, quarter, hour, day, week, …
 Critical layers
 Minimum interest layer (m-layer)
 Observation layer (o-layer)
 User: watches at o-layer and occasionally needs to drill-down
down to m-layer
 Partial materialization of stream cubes
 Full materialization
 No materialization
 Partial materialization
14
Time Dimension with Compressed
Time Scale
 Tilted time Frame
 Recent changes are recorded at a fine scale and long-
term changes at coarse scale
 Level of coarseness depends on application
requirements and on how old the time point is
 Time Frame models
 Natural tilted time frame model
 Logarithmic tilted time frame model
 Progressive logarithmic tilted time frame model
15
Tilted Time Frame
 Natural tilted time frame:
 Minimal: quarter, then 4 quarters → 1 hour, 24 hours → day, …
 Registers 4 + 24 + 31 +12 = 71 units of time for a year
 Logarithmic tilted time frame:
 Minimal: 1 quarter, then 1, 2, 4, 8, 16, 32, …
 Records log2(365 x 24 x 4)+1 = 16.1 units / year
16
4 q t r s2 4 h o u r s3 1 d a y s1 2 m o n t h s
t i m e
T i m e
t8 t 4 t 2 t t1 6 t3 2 t6 4 t
Tilted Time Frame
 Progressive Logarithmic tilted time frame
 Max_frame and Max_capacity
 If (t mod 2i
) =0 but (t mod 2i+1
) <> 0, t is inserted into frame
number i if i<= max_frame else into max_frame
 Example:
 Suppose there are 5 frames and each takes maximal 3 snapshots
 Given a snapshot number N, if N mod 2d
= 0, insert into the frame
number d. If there are more than 3 snapshots, remove the oldest
one
17
Frame no. Snapshots (by clock time)
0 69 67 65
1 66 62 58
2 68 60 52
3 56 40 24
4 48 16
5 64 32
Critical Layers
 Compute and store only mission-critical cuboids
of the full data cube
 Minimal Interest layer – not feasible to examine
minute details
 Observation layer – layer at which analyst studies the
data
 Example: Power Supply Stream data Cube
 Raw data layer – Individual_user, Street_address, second
 Minimal Interest layer – user_group, street_block and minute
 Observation layer – *(all_user), city, quarter
18
Critical Layers
19
(*, city, quarter)
(user-group, street_block, minute)
m-layer (minimal interest)
(individual-user, street, second)
o-layer (observation)
Partial Materialization
 On-line materialization
 Materialization takes precious space and time
 Only incremental materialization (with tilted time frame)
 Only materialize “cuboids” of the critical layers
 Online computation may take too much time
 Preferred solution:
 Popular-path approach: Materializing those along the popular
drilling paths
 H-tree (Hyperlink tree) structure: Such cuboids can be
computed and stored efficiently using the H-tree structure
 Aggregate Cells in non-leaf nodes
 Based on popular path – all cells are cuboids on popular path – non-
leaf nodes of H-tree; space and computation overheads are reduced
20
Frequent Pattern Mining in Data
Streams
 Existing FP Mining algorithms require to scan the complete data set
 In data streams an infrequent item may become frequent and vice
versa
 Approach I – Keep track of only limited set of items / itemsets
 Approach II – Approximate set of answers
 Example: a router is interested in all flows:
 whose frequency is at least 1% (σ) of the entire traffic stream seen
so far and feels that 1/10 of σ (ε = 0.1%) error is comfortable
 All frequent items with a support of atleast min_support will be output;
some items with a support of min_support – ε will also be output
21
Lossy Counting Algorithm
 Input: min_support threshold σ and error bound ε
 Incoming stream is divided into buckets of width w = 1/ε
 N – Current stream length
 Frequency list data structure is maintained
 For each item – approximate frequency count f, and maximum possible
error ∆ are maintained
 If the item is from the bth
bucket, the maximum possible error ∆ on the
frequency count of the item is b-1
 When a bucket boundary is reached, an item entry is deleted if f+ ∆ <= b the
current bucket number – Frequency list is kept small
 The frequency of an item can be under-estimated by atmost εN
 f+ ∆ <= b; b <= N/w i.e b<= Nε so f + ∆ <= Nε
 All frequent items and some sub-frequent items with frequency σN – εN
will be output
22
Lossy Counting Algorithm
 Properties
 No false Negatives (all frequent items are output)
 False positives have a frequency of atleast σN – εN
 Frequency of a frequent item is under-estimated by atmost εN
 Reasonable Space Requirements – 7/ ε
23
Lossy Counting Algorithm
24
Bucket 1 Bucket 2 Bucket 3
Divide Stream into ‘Buckets’ (bucket size is 1/ ε = 1000)
Lossy Counting Algorithm
25
Empty
(summary)
+
At bucket boundary, decrease all counters by 1
+
At bucket boundary, decrease all counters by 1
Lossy Counting Algorithm
 To find frequent itemsets
 Load as many buckets as possible into main memory - β
 If updated frequency f+∆ <= b where b is the current
bucket number entry can be deleted
 If an itemset has frequency f >= β and does not appear
in the list, it is inserted as a new entry with ∆ set to b - β
26
Lossy Counting Algorithm
27
Divide Stream into ‘Buckets’ as for frequent items
But fill as many buckets as possible in main memory one time
Bucket 1 Bucket 2 Bucket 3
If we put 3 buckets of data into main memory one time,
Then decrease each frequency count by 3
Lossy Counting Algorithm
28
2
2
1
2
1
1
1
summary data 3 buckets data
in memory
4
4
10
2
2
0
+
3
3
9
summary data
Itemset is deleted.
Choosing a large number of buckets helps to delete more
Lossy Counting Algorithm
 Strengths:
 A simple idea
 Can be extended to frequent itemsets
 Weakness:
 Space Bound is not good
 For frequent itemsets, they do scan each record many times
 The output is based on all previous data. But sometimes, we are
only interested in recent data
29
Classification of Dynamic Data Streams
 Traditional Classification techniques scan the training data multiple
times – Off line process
 Not possible in data streams
 Example: Decision trees – at each node best possible split is determined
by considering all the data, for all attributes
 Concept drift occurs in streams
 Changes in the classification model over time
 Stream Classification techniques
 Hoeffding tree Algorithm
 Very Fast Decision Tree (VFDT)
 Concept-adapting Very Fast Decision Tree
 Classifier Ensemble
30
 Decision Tree Learning Algorithm
 Used originally to track Web click streams and predict next
access
 Hoeffding trees – small sample is adequate to choose
optimal splitting attribute
 Uses Hoeffding bound
 r: random variable, R: range of r
 n: # independent observations
 Mean computed from sample – ravg
 True Mean of r is at least ravg – ε, with probability 1 – δ (user-input)
 Determines smallest number of samples needed at a node to
split
31
Hoeffding Tree Algorithm
n
R
2
)/1ln(2
δ
ε =
Hoeffding Tree Algorithm
 Hoeffding Tree Input
S: sequence of examples
X: attributes
G( ): evaluation function
d: desired accuracy
 Hoeffding Tree Algorithm
for each node
retrieve G(Xa) and G(Xb) //two highest G(Xi)
if ( G(Xa) – G(Xb) > ε )
split on Xa
recurse to next node
break
 Complexity : O(ldvc); l- maximum depth; d – number of attributes; v-
maximum number of values for any attribute; c- number of classes
 Disagreement with traditional tree is at-most δ / p
32
Hoeffding Tree Algorithm
33
yes no
Packets > 10
Protocol = http
Protocol = ftp
yes
yes no
Packets > 10
Bytes >
60K
Protocol = http
Data Stream
Data Stream
New data can be integrated as it
streams in
 Strengths
 Scales better than traditional methods
 Sub-linear with sampling
 Very small memory utilization
 Incremental
 Make class predictions in parallel
 New examples are added as they come
 Weakness
 Could spend a lot of time with ties
 Memory used with tree expansion
 Cannot handle concept drift
34
Hoeffding Tree Algorithm
Very Fast Decision Tree
 Modifications to Hoeffding Tree
 Near-ties broken more aggressively
 G computed every nmin
 Deactivates certain leaves to save memory
 Poor attributes dropped
 Compare to Hoeffding Tree: Better time and memory
 Compare to traditional decision tree
 Similar accuracy
 Better runtime
 Still does not handle concept drift
35
Concept-adapting VFDT
 Concept Drift
 Time-changing data streams
 Incorporate new and eliminate old
 Sliding window – sensitive to window size (w)
 CVFDT
 Increments count with new example
 Decrement old example
 Sliding window – but does not construct model from scratch
 Nodes assigned monotonically increasing IDs
 Grows alternate subtrees
 When alternate more accurate => replace old
 O(w) better runtime than VFDT
36
Classifier Ensemble Approach
 Method (derived from the ensemble idea in classification)
 train K classifiers from K chunks
 for each subsequent chunk
train a new classifier
test other classifiers against the chunk
assign weight to each classifier
select top K classifiers
 Instead of discarding old examples – here least accurate
classifiers are discarded
37
Clustering Data Streams
 Methodologies used:
 Compute and store summaries of past data
 Apply a divide-and-conquer strategy
 Incremental Clustering of incoming data streams
 Perform micro-clustering as well as macro-clustering analysis
 Explore multiple time granularity for the analysis of cluster
evolution – data summaries at different time points
 Divide stream clustering into on-line and off-line (or independent)
processes
38
STREAM
 Input: Data stream of points X1, X2…XN with timestamps
T1, T2…TN
 Single pass, k-medians approach
 Clusters data such that Sum Squared Error (SSQ) between the
points and cluster center is minimized
 Data streams are processed in buckets of m points (to fit into
main memory)
 For each bucket – k clusters are formed, weighted centers are
retained and other points are discarded
 When enough centers are collected, these are clustered
 Clusters in limited time and space but does not handle
time granularity
39
CluStream: Clustering Evolving Streams
 Design goal
 High quality for clustering evolving data streams with greater
functionality

One-pass over the original stream data

Limited space usage and high efficiency
 CluStream: A framework for clustering evolving data streams
 Divide the clustering process into online and offline components

Online component: periodically stores summary statistics
about the stream data

Offline component: answers various user questions based on
the stored summary statistics
40
CluStream
 Tilted time frame model adopted
 Micro-cluster
 Statistical information about data locality
 Temporal extension of the cluster-feature vector
 Multi-dimensional points X1, X2…XN with timestamps T1, T2…TN

Each point contains d dimensions

A micro-cluster for n points is defined as a (2.d + 3) tuple
(CF2x
, CF1x
, CF2t
, CF1t
, n)
 Pyramidal time frame
 Decide at what moments the snapshots of the statistical information
are stored away on disk
41
CluStream
 Online micro-cluster maintenance
 Initial creation of q micro-clusters

q is usually significantly larger than the number of natural
clusters
 Online incremental update of micro-clusters

If new point is within max-boundary, insert into the micro-
cluster or create a new cluster

May delete obsolete micro-cluster or merge two closest ones
 Query-based macro-clustering
 Based on a user-specified time-horizon h and the number of
macro-clusters K, compute macroclusters using the k-means
algorithm
42
CluStream
 Evolution Analysis
 Given time horizon h and two clock times t1, t2, cluster analysis
examines data arriving between (t2-h, t2) and (t1-h, t1)
 Check for new clusters
 Loss of original clusters
 How clusters have shifted in position and nature
 Net snapshots of microclusters N(t1, h) and N(t2, h)
 CluStream – accurate and efficient; scalable; flexible
43

More Related Content

What's hot (20)

4.2 spatial data mining
4.2 spatial data mining4.2 spatial data mining
4.2 spatial data mining
Krish_ver2
 
Clustering: Large Databases in data mining
Clustering: Large Databases in data miningClustering: Large Databases in data mining
Clustering: Large Databases in data mining
ZHAO Sam
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
Prashant Gupta
 
Data mining primitives
Data mining primitivesData mining primitives
Data mining primitives
lavanya marichamy
 
Mining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and CorrelationsMining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and Correlations
Justin Cletus
 
Clustering in Data Mining
Clustering in Data MiningClustering in Data Mining
Clustering in Data Mining
Archana Swaminathan
 
Schemaless Databases
Schemaless DatabasesSchemaless Databases
Schemaless Databases
Dan Gunter
 
Classification in data mining
Classification in data mining Classification in data mining
Classification in data mining
Sulman Ahmed
 
Big data unit i
Big data unit iBig data unit i
Big data unit i
Navjot Kaur
 
5.3 mining sequential patterns
5.3 mining sequential patterns5.3 mining sequential patterns
5.3 mining sequential patterns
Krish_ver2
 
Data Preprocessing
Data PreprocessingData Preprocessing
Data Preprocessing
Object-Frontier Software Pvt. Ltd
 
introduction to data mining tutorial
introduction to data mining tutorial introduction to data mining tutorial
introduction to data mining tutorial
Salah Amean
 
UNIT 1 -BIG DATA ANALYTICS Full.pdf
UNIT 1 -BIG DATA ANALYTICS Full.pdfUNIT 1 -BIG DATA ANALYTICS Full.pdf
UNIT 1 -BIG DATA ANALYTICS Full.pdf
vvpadhu
 
NOSQL- Presentation on NoSQL
NOSQL- Presentation on NoSQLNOSQL- Presentation on NoSQL
NOSQL- Presentation on NoSQL
Ramakant Soni
 
Introduction to Recurrent Neural Network
Introduction to Recurrent Neural NetworkIntroduction to Recurrent Neural Network
Introduction to Recurrent Neural Network
Knoldus Inc.
 
introduction to NOSQL Database
introduction to NOSQL Databaseintroduction to NOSQL Database
introduction to NOSQL Database
nehabsairam
 
3.7 outlier analysis
3.7 outlier analysis3.7 outlier analysis
3.7 outlier analysis
Krish_ver2
 
K Nearest Neighbors
K Nearest NeighborsK Nearest Neighbors
K Nearest Neighbors
Tilani Gunawardena PhD(UNIBAS), BSc(Pera), FHEA(UK), CEng, MIESL
 
Text clustering
Text clusteringText clustering
Text clustering
KU Leuven
 
Dimensionality Reduction
Dimensionality ReductionDimensionality Reduction
Dimensionality Reduction
mrizwan969
 
4.2 spatial data mining
4.2 spatial data mining4.2 spatial data mining
4.2 spatial data mining
Krish_ver2
 
Clustering: Large Databases in data mining
Clustering: Large Databases in data miningClustering: Large Databases in data mining
Clustering: Large Databases in data mining
ZHAO Sam
 
Mining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and CorrelationsMining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and Correlations
Justin Cletus
 
Schemaless Databases
Schemaless DatabasesSchemaless Databases
Schemaless Databases
Dan Gunter
 
Classification in data mining
Classification in data mining Classification in data mining
Classification in data mining
Sulman Ahmed
 
5.3 mining sequential patterns
5.3 mining sequential patterns5.3 mining sequential patterns
5.3 mining sequential patterns
Krish_ver2
 
introduction to data mining tutorial
introduction to data mining tutorial introduction to data mining tutorial
introduction to data mining tutorial
Salah Amean
 
UNIT 1 -BIG DATA ANALYTICS Full.pdf
UNIT 1 -BIG DATA ANALYTICS Full.pdfUNIT 1 -BIG DATA ANALYTICS Full.pdf
UNIT 1 -BIG DATA ANALYTICS Full.pdf
vvpadhu
 
NOSQL- Presentation on NoSQL
NOSQL- Presentation on NoSQLNOSQL- Presentation on NoSQL
NOSQL- Presentation on NoSQL
Ramakant Soni
 
Introduction to Recurrent Neural Network
Introduction to Recurrent Neural NetworkIntroduction to Recurrent Neural Network
Introduction to Recurrent Neural Network
Knoldus Inc.
 
introduction to NOSQL Database
introduction to NOSQL Databaseintroduction to NOSQL Database
introduction to NOSQL Database
nehabsairam
 
3.7 outlier analysis
3.7 outlier analysis3.7 outlier analysis
3.7 outlier analysis
Krish_ver2
 
Text clustering
Text clusteringText clustering
Text clustering
KU Leuven
 
Dimensionality Reduction
Dimensionality ReductionDimensionality Reduction
Dimensionality Reduction
mrizwan969
 

Similar to 5.1 mining data streams (20)

Aa31163168
Aa31163168Aa31163168
Aa31163168
IJERA Editor
 
IJCSIT
IJCSITIJCSIT
IJCSIT
Poonam Debnath
 
Introduction to Data streaming - 05/12/2014
Introduction to Data streaming - 05/12/2014Introduction to Data streaming - 05/12/2014
Introduction to Data streaming - 05/12/2014
Raja Chiky
 
data streammining and its applications.ppt
data streammining and its applications.pptdata streammining and its applications.ppt
data streammining and its applications.ppt
ajajkhan16
 
081.ppt
081.ppt081.ppt
081.ppt
amil baba
 
Fp3111131118
Fp3111131118Fp3111131118
Fp3111131118
IJERA Editor
 
Chapter 08 Data Mining Techniques
Chapter 08 Data Mining Techniques Chapter 08 Data Mining Techniques
Chapter 08 Data Mining Techniques
Houw Liong The
 
Clustering for Stream and Parallelism (DATA ANALYTICS)
Clustering for Stream and Parallelism (DATA ANALYTICS)Clustering for Stream and Parallelism (DATA ANALYTICS)
Clustering for Stream and Parallelism (DATA ANALYTICS)
DheerajPachauri
 
Jewei Hans & Kamber Chapter 8
Jewei Hans & Kamber  Chapter 8Jewei Hans & Kamber  Chapter 8
Jewei Hans & Kamber Chapter 8
Houw Liong The
 
Data Streaming in Big Data and Data mining in streaming
Data Streaming in Big Data and Data mining in streamingData Streaming in Big Data and Data mining in streaming
Data Streaming in Big Data and Data mining in streaming
AMSERMAKANITeaching
 
A Deep dive into Big Data Analytics Data mining
A Deep dive into Big Data Analytics Data miningA Deep dive into Big Data Analytics Data mining
A Deep dive into Big Data Analytics Data mining
theniche69
 
1105.1950
1105.19501105.1950
1105.1950
Nhat Tam
 
Paper id 25201431
Paper id 25201431Paper id 25201431
Paper id 25201431
IJRAT
 
Paper Title - Mining Techniques for Streaming Data
Paper Title - Mining Techniques for Streaming DataPaper Title - Mining Techniques for Streaming Data
Paper Title - Mining Techniques for Streaming Data
IJDKP
 
MINING TECHNIQUES FOR STREAMING DATA
MINING TECHNIQUES FOR STREAMING DATAMINING TECHNIQUES FOR STREAMING DATA
MINING TECHNIQUES FOR STREAMING DATA
IJDKP
 
MINING TECHNIQUES FOR STREAMING DATA
MINING TECHNIQUES FOR STREAMING DATAMINING TECHNIQUES FOR STREAMING DATA
MINING TECHNIQUES FOR STREAMING DATA
IJDKP
 
Mining Stream Data using k-Means clustering Algorithm
Mining Stream Data using k-Means clustering AlgorithmMining Stream Data using k-Means clustering Algorithm
Mining Stream Data using k-Means clustering Algorithm
Manishankar Medi
 
CSE545 sp23 (2) Streaming Algorithms 2-4.pdf
CSE545 sp23 (2) Streaming Algorithms 2-4.pdfCSE545 sp23 (2) Streaming Algorithms 2-4.pdf
CSE545 sp23 (2) Streaming Algorithms 2-4.pdf
AlexanderKyalo3
 
CSE545 sp23 (2) Streaming Algorithms 2-4.pdf
CSE545 sp23 (2) Streaming Algorithms 2-4.pdfCSE545 sp23 (2) Streaming Algorithms 2-4.pdf
CSE545 sp23 (2) Streaming Algorithms 2-4.pdf
Gabriel Kamau
 
Uint-4 Mining Data Stream.pdf
Uint-4 Mining Data Stream.pdfUint-4 Mining Data Stream.pdf
Uint-4 Mining Data Stream.pdf
Sitamarhi Institute of Technology
 
Introduction to Data streaming - 05/12/2014
Introduction to Data streaming - 05/12/2014Introduction to Data streaming - 05/12/2014
Introduction to Data streaming - 05/12/2014
Raja Chiky
 
data streammining and its applications.ppt
data streammining and its applications.pptdata streammining and its applications.ppt
data streammining and its applications.ppt
ajajkhan16
 
Chapter 08 Data Mining Techniques
Chapter 08 Data Mining Techniques Chapter 08 Data Mining Techniques
Chapter 08 Data Mining Techniques
Houw Liong The
 
Clustering for Stream and Parallelism (DATA ANALYTICS)
Clustering for Stream and Parallelism (DATA ANALYTICS)Clustering for Stream and Parallelism (DATA ANALYTICS)
Clustering for Stream and Parallelism (DATA ANALYTICS)
DheerajPachauri
 
Jewei Hans & Kamber Chapter 8
Jewei Hans & Kamber  Chapter 8Jewei Hans & Kamber  Chapter 8
Jewei Hans & Kamber Chapter 8
Houw Liong The
 
Data Streaming in Big Data and Data mining in streaming
Data Streaming in Big Data and Data mining in streamingData Streaming in Big Data and Data mining in streaming
Data Streaming in Big Data and Data mining in streaming
AMSERMAKANITeaching
 
A Deep dive into Big Data Analytics Data mining
A Deep dive into Big Data Analytics Data miningA Deep dive into Big Data Analytics Data mining
A Deep dive into Big Data Analytics Data mining
theniche69
 
Paper id 25201431
Paper id 25201431Paper id 25201431
Paper id 25201431
IJRAT
 
Paper Title - Mining Techniques for Streaming Data
Paper Title - Mining Techniques for Streaming DataPaper Title - Mining Techniques for Streaming Data
Paper Title - Mining Techniques for Streaming Data
IJDKP
 
MINING TECHNIQUES FOR STREAMING DATA
MINING TECHNIQUES FOR STREAMING DATAMINING TECHNIQUES FOR STREAMING DATA
MINING TECHNIQUES FOR STREAMING DATA
IJDKP
 
MINING TECHNIQUES FOR STREAMING DATA
MINING TECHNIQUES FOR STREAMING DATAMINING TECHNIQUES FOR STREAMING DATA
MINING TECHNIQUES FOR STREAMING DATA
IJDKP
 
Mining Stream Data using k-Means clustering Algorithm
Mining Stream Data using k-Means clustering AlgorithmMining Stream Data using k-Means clustering Algorithm
Mining Stream Data using k-Means clustering Algorithm
Manishankar Medi
 
CSE545 sp23 (2) Streaming Algorithms 2-4.pdf
CSE545 sp23 (2) Streaming Algorithms 2-4.pdfCSE545 sp23 (2) Streaming Algorithms 2-4.pdf
CSE545 sp23 (2) Streaming Algorithms 2-4.pdf
AlexanderKyalo3
 
CSE545 sp23 (2) Streaming Algorithms 2-4.pdf
CSE545 sp23 (2) Streaming Algorithms 2-4.pdfCSE545 sp23 (2) Streaming Algorithms 2-4.pdf
CSE545 sp23 (2) Streaming Algorithms 2-4.pdf
Gabriel Kamau
 
Ad

More from Krish_ver2 (20)

5.5 back tracking
5.5 back tracking5.5 back tracking
5.5 back tracking
Krish_ver2
 
5.5 back track
5.5 back track5.5 back track
5.5 back track
Krish_ver2
 
5.5 back tracking 02
5.5 back tracking 025.5 back tracking 02
5.5 back tracking 02
Krish_ver2
 
5.4 randomized datastructures
5.4 randomized datastructures5.4 randomized datastructures
5.4 randomized datastructures
Krish_ver2
 
5.4 randomized datastructures
5.4 randomized datastructures5.4 randomized datastructures
5.4 randomized datastructures
Krish_ver2
 
5.4 randamized algorithm
5.4 randamized algorithm5.4 randamized algorithm
5.4 randamized algorithm
Krish_ver2
 
5.3 dynamic programming 03
5.3 dynamic programming 035.3 dynamic programming 03
5.3 dynamic programming 03
Krish_ver2
 
5.3 dynamic programming
5.3 dynamic programming5.3 dynamic programming
5.3 dynamic programming
Krish_ver2
 
5.3 dyn algo-i
5.3 dyn algo-i5.3 dyn algo-i
5.3 dyn algo-i
Krish_ver2
 
5.2 divede and conquer 03
5.2 divede and conquer 035.2 divede and conquer 03
5.2 divede and conquer 03
Krish_ver2
 
5.2 divide and conquer
5.2 divide and conquer5.2 divide and conquer
5.2 divide and conquer
Krish_ver2
 
5.2 divede and conquer 03
5.2 divede and conquer 035.2 divede and conquer 03
5.2 divede and conquer 03
Krish_ver2
 
5.1 greedyyy 02
5.1 greedyyy 025.1 greedyyy 02
5.1 greedyyy 02
Krish_ver2
 
5.1 greedy
5.1 greedy5.1 greedy
5.1 greedy
Krish_ver2
 
5.1 greedy 03
5.1 greedy 035.1 greedy 03
5.1 greedy 03
Krish_ver2
 
4.4 hashing02
4.4 hashing024.4 hashing02
4.4 hashing02
Krish_ver2
 
4.4 hashing
4.4 hashing4.4 hashing
4.4 hashing
Krish_ver2
 
4.4 hashing ext
4.4 hashing  ext4.4 hashing  ext
4.4 hashing ext
Krish_ver2
 
4.4 external hashing
4.4 external hashing4.4 external hashing
4.4 external hashing
Krish_ver2
 
4.2 bst
4.2 bst4.2 bst
4.2 bst
Krish_ver2
 
5.5 back tracking
5.5 back tracking5.5 back tracking
5.5 back tracking
Krish_ver2
 
5.5 back track
5.5 back track5.5 back track
5.5 back track
Krish_ver2
 
5.5 back tracking 02
5.5 back tracking 025.5 back tracking 02
5.5 back tracking 02
Krish_ver2
 
5.4 randomized datastructures
5.4 randomized datastructures5.4 randomized datastructures
5.4 randomized datastructures
Krish_ver2
 
5.4 randomized datastructures
5.4 randomized datastructures5.4 randomized datastructures
5.4 randomized datastructures
Krish_ver2
 
5.4 randamized algorithm
5.4 randamized algorithm5.4 randamized algorithm
5.4 randamized algorithm
Krish_ver2
 
5.3 dynamic programming 03
5.3 dynamic programming 035.3 dynamic programming 03
5.3 dynamic programming 03
Krish_ver2
 
5.3 dynamic programming
5.3 dynamic programming5.3 dynamic programming
5.3 dynamic programming
Krish_ver2
 
5.3 dyn algo-i
5.3 dyn algo-i5.3 dyn algo-i
5.3 dyn algo-i
Krish_ver2
 
5.2 divede and conquer 03
5.2 divede and conquer 035.2 divede and conquer 03
5.2 divede and conquer 03
Krish_ver2
 
5.2 divide and conquer
5.2 divide and conquer5.2 divide and conquer
5.2 divide and conquer
Krish_ver2
 
5.2 divede and conquer 03
5.2 divede and conquer 035.2 divede and conquer 03
5.2 divede and conquer 03
Krish_ver2
 
5.1 greedyyy 02
5.1 greedyyy 025.1 greedyyy 02
5.1 greedyyy 02
Krish_ver2
 
4.4 hashing ext
4.4 hashing  ext4.4 hashing  ext
4.4 hashing ext
Krish_ver2
 
4.4 external hashing
4.4 external hashing4.4 external hashing
4.4 external hashing
Krish_ver2
 
Ad

Recently uploaded (20)

YSPH VMOC Special Report - Measles Outbreak Southwest US 5-30-2025.pptx
YSPH VMOC Special Report - Measles Outbreak  Southwest US 5-30-2025.pptxYSPH VMOC Special Report - Measles Outbreak  Southwest US 5-30-2025.pptx
YSPH VMOC Special Report - Measles Outbreak Southwest US 5-30-2025.pptx
Yale School of Public Health - The Virtual Medical Operations Center (VMOC)
 
How to Configure Add to Cart in Odoo 18 Website
How to Configure Add to Cart in Odoo 18 WebsiteHow to Configure Add to Cart in Odoo 18 Website
How to Configure Add to Cart in Odoo 18 Website
Celine George
 
QUIZ-O-FORCE FINAL SET BY SUDIPTA & SUBHAM.pptx
QUIZ-O-FORCE FINAL SET BY SUDIPTA & SUBHAM.pptxQUIZ-O-FORCE FINAL SET BY SUDIPTA & SUBHAM.pptx
QUIZ-O-FORCE FINAL SET BY SUDIPTA & SUBHAM.pptx
Sourav Kr Podder
 
Dashboard Overview in Odoo 18 - Odoo Slides
Dashboard Overview in Odoo 18 - Odoo SlidesDashboard Overview in Odoo 18 - Odoo Slides
Dashboard Overview in Odoo 18 - Odoo Slides
Celine George
 
CBSE - Grade 11 - Mathematics - Ch 2 - Relations And Functions - Notes (PDF F...
CBSE - Grade 11 - Mathematics - Ch 2 - Relations And Functions - Notes (PDF F...CBSE - Grade 11 - Mathematics - Ch 2 - Relations And Functions - Notes (PDF F...
CBSE - Grade 11 - Mathematics - Ch 2 - Relations And Functions - Notes (PDF F...
Sritoma Majumder
 
Critical Thinking and Bias with Jibi Moses
Critical Thinking and Bias with Jibi MosesCritical Thinking and Bias with Jibi Moses
Critical Thinking and Bias with Jibi Moses
Excellence Foundation for South Sudan
 
Stewart Butler - OECD - How to design and deliver higher technical education ...
Stewart Butler - OECD - How to design and deliver higher technical education ...Stewart Butler - OECD - How to design and deliver higher technical education ...
Stewart Butler - OECD - How to design and deliver higher technical education ...
EduSkills OECD
 
Writing Research Papers: Guidance for Research Community
Writing Research Papers: Guidance for Research CommunityWriting Research Papers: Guidance for Research Community
Writing Research Papers: Guidance for Research Community
Rishi Bankim Chandra Evening College, Naihati, North 24 Parganas, West Bengal, India
 
LDMMIA Bonus GUEST GRAD Student Check-in
LDMMIA Bonus GUEST GRAD Student Check-inLDMMIA Bonus GUEST GRAD Student Check-in
LDMMIA Bonus GUEST GRAD Student Check-in
LDM & Mia eStudios
 
POS Reporting in Odoo 18 - Odoo 18 Slides
POS Reporting in Odoo 18 - Odoo 18 SlidesPOS Reporting in Odoo 18 - Odoo 18 Slides
POS Reporting in Odoo 18 - Odoo 18 Slides
Celine George
 
STUDENT LOAN TRUST FUND DEFAULTERS GHANA
STUDENT LOAN TRUST FUND DEFAULTERS GHANASTUDENT LOAN TRUST FUND DEFAULTERS GHANA
STUDENT LOAN TRUST FUND DEFAULTERS GHANA
Kweku Zurek
 
Order: Odonata Isoptera and Thysanoptera.pptx
Order: Odonata Isoptera and Thysanoptera.pptxOrder: Odonata Isoptera and Thysanoptera.pptx
Order: Odonata Isoptera and Thysanoptera.pptx
Arshad Shaikh
 
How to Create Time Off Request in Odoo 18 Time Off
How to Create Time Off Request in Odoo 18 Time OffHow to Create Time Off Request in Odoo 18 Time Off
How to Create Time Off Request in Odoo 18 Time Off
Celine George
 
How to Use Owl Slots in Odoo 17 - Odoo Slides
How to Use Owl Slots in Odoo 17 - Odoo SlidesHow to Use Owl Slots in Odoo 17 - Odoo Slides
How to Use Owl Slots in Odoo 17 - Odoo Slides
Celine George
 
Exploring Identity Through Colombian Companies
Exploring Identity Through Colombian CompaniesExploring Identity Through Colombian Companies
Exploring Identity Through Colombian Companies
OlgaLeonorTorresSnch
 
Search Engine Optimization (SEO) for Website Success
Search Engine Optimization (SEO) for Website SuccessSearch Engine Optimization (SEO) for Website Success
Search Engine Optimization (SEO) for Website Success
Muneeb Rana
 
How to Create a Stage or a Pipeline in Odoo 18 CRM
How to Create a Stage or a Pipeline in Odoo 18 CRMHow to Create a Stage or a Pipeline in Odoo 18 CRM
How to Create a Stage or a Pipeline in Odoo 18 CRM
Celine George
 
HUMAN SKELETAL SYSTEM ANATAMY AND PHYSIOLOGY
HUMAN SKELETAL SYSTEM ANATAMY AND PHYSIOLOGYHUMAN SKELETAL SYSTEM ANATAMY AND PHYSIOLOGY
HUMAN SKELETAL SYSTEM ANATAMY AND PHYSIOLOGY
DHARMENDRA SAHU
 
প্রত্যুৎপন্নমতিত্ব - Prottutponnomotittwa 2025.pdf
প্রত্যুৎপন্নমতিত্ব - Prottutponnomotittwa 2025.pdfপ্রত্যুৎপন্নমতিত্ব - Prottutponnomotittwa 2025.pdf
প্রত্যুৎপন্নমতিত্ব - Prottutponnomotittwa 2025.pdf
Pragya - UEM Kolkata Quiz Club
 
"Orthoptera: Grasshoppers, Crickets, and Katydids pptx
"Orthoptera: Grasshoppers, Crickets, and Katydids pptx"Orthoptera: Grasshoppers, Crickets, and Katydids pptx
"Orthoptera: Grasshoppers, Crickets, and Katydids pptx
Arshad Shaikh
 
How to Configure Add to Cart in Odoo 18 Website
How to Configure Add to Cart in Odoo 18 WebsiteHow to Configure Add to Cart in Odoo 18 Website
How to Configure Add to Cart in Odoo 18 Website
Celine George
 
QUIZ-O-FORCE FINAL SET BY SUDIPTA & SUBHAM.pptx
QUIZ-O-FORCE FINAL SET BY SUDIPTA & SUBHAM.pptxQUIZ-O-FORCE FINAL SET BY SUDIPTA & SUBHAM.pptx
QUIZ-O-FORCE FINAL SET BY SUDIPTA & SUBHAM.pptx
Sourav Kr Podder
 
Dashboard Overview in Odoo 18 - Odoo Slides
Dashboard Overview in Odoo 18 - Odoo SlidesDashboard Overview in Odoo 18 - Odoo Slides
Dashboard Overview in Odoo 18 - Odoo Slides
Celine George
 
CBSE - Grade 11 - Mathematics - Ch 2 - Relations And Functions - Notes (PDF F...
CBSE - Grade 11 - Mathematics - Ch 2 - Relations And Functions - Notes (PDF F...CBSE - Grade 11 - Mathematics - Ch 2 - Relations And Functions - Notes (PDF F...
CBSE - Grade 11 - Mathematics - Ch 2 - Relations And Functions - Notes (PDF F...
Sritoma Majumder
 
Stewart Butler - OECD - How to design and deliver higher technical education ...
Stewart Butler - OECD - How to design and deliver higher technical education ...Stewart Butler - OECD - How to design and deliver higher technical education ...
Stewart Butler - OECD - How to design and deliver higher technical education ...
EduSkills OECD
 
LDMMIA Bonus GUEST GRAD Student Check-in
LDMMIA Bonus GUEST GRAD Student Check-inLDMMIA Bonus GUEST GRAD Student Check-in
LDMMIA Bonus GUEST GRAD Student Check-in
LDM & Mia eStudios
 
POS Reporting in Odoo 18 - Odoo 18 Slides
POS Reporting in Odoo 18 - Odoo 18 SlidesPOS Reporting in Odoo 18 - Odoo 18 Slides
POS Reporting in Odoo 18 - Odoo 18 Slides
Celine George
 
STUDENT LOAN TRUST FUND DEFAULTERS GHANA
STUDENT LOAN TRUST FUND DEFAULTERS GHANASTUDENT LOAN TRUST FUND DEFAULTERS GHANA
STUDENT LOAN TRUST FUND DEFAULTERS GHANA
Kweku Zurek
 
Order: Odonata Isoptera and Thysanoptera.pptx
Order: Odonata Isoptera and Thysanoptera.pptxOrder: Odonata Isoptera and Thysanoptera.pptx
Order: Odonata Isoptera and Thysanoptera.pptx
Arshad Shaikh
 
How to Create Time Off Request in Odoo 18 Time Off
How to Create Time Off Request in Odoo 18 Time OffHow to Create Time Off Request in Odoo 18 Time Off
How to Create Time Off Request in Odoo 18 Time Off
Celine George
 
How to Use Owl Slots in Odoo 17 - Odoo Slides
How to Use Owl Slots in Odoo 17 - Odoo SlidesHow to Use Owl Slots in Odoo 17 - Odoo Slides
How to Use Owl Slots in Odoo 17 - Odoo Slides
Celine George
 
Exploring Identity Through Colombian Companies
Exploring Identity Through Colombian CompaniesExploring Identity Through Colombian Companies
Exploring Identity Through Colombian Companies
OlgaLeonorTorresSnch
 
Search Engine Optimization (SEO) for Website Success
Search Engine Optimization (SEO) for Website SuccessSearch Engine Optimization (SEO) for Website Success
Search Engine Optimization (SEO) for Website Success
Muneeb Rana
 
How to Create a Stage or a Pipeline in Odoo 18 CRM
How to Create a Stage or a Pipeline in Odoo 18 CRMHow to Create a Stage or a Pipeline in Odoo 18 CRM
How to Create a Stage or a Pipeline in Odoo 18 CRM
Celine George
 
HUMAN SKELETAL SYSTEM ANATAMY AND PHYSIOLOGY
HUMAN SKELETAL SYSTEM ANATAMY AND PHYSIOLOGYHUMAN SKELETAL SYSTEM ANATAMY AND PHYSIOLOGY
HUMAN SKELETAL SYSTEM ANATAMY AND PHYSIOLOGY
DHARMENDRA SAHU
 
প্রত্যুৎপন্নমতিত্ব - Prottutponnomotittwa 2025.pdf
প্রত্যুৎপন্নমতিত্ব - Prottutponnomotittwa 2025.pdfপ্রত্যুৎপন্নমতিত্ব - Prottutponnomotittwa 2025.pdf
প্রত্যুৎপন্নমতিত্ব - Prottutponnomotittwa 2025.pdf
Pragya - UEM Kolkata Quiz Club
 
"Orthoptera: Grasshoppers, Crickets, and Katydids pptx
"Orthoptera: Grasshoppers, Crickets, and Katydids pptx"Orthoptera: Grasshoppers, Crickets, and Katydids pptx
"Orthoptera: Grasshoppers, Crickets, and Katydids pptx
Arshad Shaikh
 

5.1 mining data streams

  • 2. Mining Complex data  Stream data  Massive data, temporally ordered, fast changing and potentially infinite  Satellite Images, Data from electric power grids  Time-Series data  Sequence of values obtained over time  Economic and Sales data, natural phenomenon  Sequence data  Sequences of ordered elements or events (without time)  DNA and protein sequences  Graphs  Represent various kinds of Networks  Social Network Analysis 2
  • 3. Mining Data Streams  Streams – Temporally ordered, fast changing, massive and potentially infinite.  Characteristics  Huge volumes of continuous data, possibly infinite  Fast changing and requires fast, real-time response  Most stream data are at pretty low-level or multi-dimensional  Random access is expensive—Single scan algorithms are required  On-line, multi-level and multi-dimensional 3
  • 4. 4 Architecture: Stream Query Processing Scratch SpaceScratch Space (Main memory and/or Disk)(Main memory and/or Disk) User/ApplicationUser/ApplicationUser/ApplicationUser/Application Continuous QueryContinuous Query Stream QueryStream Query ProcessorProcessor ResultsResults Multiple streamsMultiple streams DSMS (Data Stream Management System)
  • 5. Stream Queries  Queries are often continuous  Evaluated continuously as stream data arrives  Answer updated over time  Queries are often complex  Beyond element-at-a-time processing  Beyond stream-at-a-time processing  Beyond relational queries  Query types  One-time query vs. continuous query (being evaluated continuously as stream continues to arrive)  Predefined query vs. ad-hoc query (issued on-line)  Unbounded memory requirements 5
  • 6. Stream Query Processing Challenges  Unbounded memory requirements  For real-time response, main memory algorithm should be used  Memory requirement is unbounded due to future tuples  Approximate query answering  With bounded memory, it is not always possible to produce exact answers  High-quality approximate answers are desired  Data reduction and synopsis construction methods  Sketches, random sampling, histograms, wavelets, etc. 6
  • 7. Methodologies  Impractical to scan more than once  Even inspection is an issue  Gigantic size rules out storage  Size of Universe is also very large  Even generating summaries is not possible  New data structures, techniques and algorithms are needed  Tradeoff between Storage and accuracy  Synopses  Summaries of data - support approximate answers  Use Synopsis data structures  Algorithms must use poly-logarithmic space O(logk N)  Compute an approximate answer within a factor ε of the actual answer – As approximation factor goes down, space requirements increase 7
  • 8. Synopsis Data Structures and Techniques  Random sampling  Histograms  Sliding windows  Multi-resolution model  Sketches  Randomized algorithms 8
  • 9.  Random Sampling  Without knowing the total length in advance  Reservoir sampling  Maintain a set of s candidates in the reservoir, which form a true random sample of the element seen so far in the stream.  As the data stream flow, every new element has a certain probability (s/N) of replacing an old element in the reservoir.  Sliding Windows  Make decisions based only on recent data of sliding window size w  An element arriving at time t expires at time t + w 9 Synopsis Data Structures and Techniques
  • 10.  Histograms  Approximate the frequency distribution of element values in a stream  Partition data into a set of buckets  Equal-width (equal value range for buckets) vs. V-optimal (minimizing frequency variance within each bucket)  Multi-resolution methods  Data reduction through divide and conquer  Balanced Binary tree – each level provides a different resolution  Hierarchical clustering of trees  Micro clusters, Macro clusters and Summary Statistics  Wavelets – For spatial and multimedia data  Builds a multi-resolution hierarchy structure of input signal 10 Synopsis Data Structures and Techniques
  • 11. Synopsis Data Structures and Techniques  Sketches  Sampling and Sliding window focus on small amount of data  Histograms and Wavelets require several passes  Sketches use Frequency moments  U = {1, 2, … v} A = {a1, a2, …aN}  F0 – number of distinct elements  F1 – length of sequence  F2 – Self-join size / repeat rate / Gini’s index of homogeneity  Frequency moments – give information about data and degree of skew or asymmetry  Given N elements and v values, sketches can approximate F0, F1, F2 in O(log v + log N) space  Each element is hashed onto {-1, +1} – Sketch Partitioning  Random variable X = ∑i mizi is maintained and can be used to approximate moments 11
  • 12. Synopsis Data Structures and Techniques  Randomized Algorithms  Las Vegas – right answer but running time varies  Monte Carlo – time bound; may not always return the correct answer  When random variables are used, its deviation from the expected value must be bounded  Chebyshev’s inequality:  Let X be a random variable with mean μ and standard deviation σ  Chernoff bound:  Let X be the sum of independent Poisson trials X1, …, Xn, δ in (0, 1]  The probability decreases exponentially as we move from the mean 12 2 2 )|(| k kXP σ µ ≤>− 4/2 |])1([ µδ µδ − <+< eXP
  • 13. Stream OLAP and Stream Data Cubes  Accommodating Stream data completely in a warehouse is challenging  Data is low-level so multi-dimensional analysis has to be done– OLAP Analysis is needed  Example: Power Supply Stream data – fluctuation of power usage will be analyzed at higher levels such as by city/district.. / hourly…  Can be viewed as a Virtual data cube  Efficient methods are needed for systematic analysis  Impossible to store complete stream data and compute a fully materialized cube  Compression Techniques are needed 13
  • 14. Stream OLAP - Compression  A tilted time frame  Different time granularities  second, minute, quarter, hour, day, week, …  Critical layers  Minimum interest layer (m-layer)  Observation layer (o-layer)  User: watches at o-layer and occasionally needs to drill-down down to m-layer  Partial materialization of stream cubes  Full materialization  No materialization  Partial materialization 14
  • 15. Time Dimension with Compressed Time Scale  Tilted time Frame  Recent changes are recorded at a fine scale and long- term changes at coarse scale  Level of coarseness depends on application requirements and on how old the time point is  Time Frame models  Natural tilted time frame model  Logarithmic tilted time frame model  Progressive logarithmic tilted time frame model 15
  • 16. Tilted Time Frame  Natural tilted time frame:  Minimal: quarter, then 4 quarters → 1 hour, 24 hours → day, …  Registers 4 + 24 + 31 +12 = 71 units of time for a year  Logarithmic tilted time frame:  Minimal: 1 quarter, then 1, 2, 4, 8, 16, 32, …  Records log2(365 x 24 x 4)+1 = 16.1 units / year 16 4 q t r s2 4 h o u r s3 1 d a y s1 2 m o n t h s t i m e T i m e t8 t 4 t 2 t t1 6 t3 2 t6 4 t
  • 17. Tilted Time Frame  Progressive Logarithmic tilted time frame  Max_frame and Max_capacity  If (t mod 2i ) =0 but (t mod 2i+1 ) <> 0, t is inserted into frame number i if i<= max_frame else into max_frame  Example:  Suppose there are 5 frames and each takes maximal 3 snapshots  Given a snapshot number N, if N mod 2d = 0, insert into the frame number d. If there are more than 3 snapshots, remove the oldest one 17 Frame no. Snapshots (by clock time) 0 69 67 65 1 66 62 58 2 68 60 52 3 56 40 24 4 48 16 5 64 32
  • 18. Critical Layers  Compute and store only mission-critical cuboids of the full data cube  Minimal Interest layer – not feasible to examine minute details  Observation layer – layer at which analyst studies the data  Example: Power Supply Stream data Cube  Raw data layer – Individual_user, Street_address, second  Minimal Interest layer – user_group, street_block and minute  Observation layer – *(all_user), city, quarter 18
  • 19. Critical Layers 19 (*, city, quarter) (user-group, street_block, minute) m-layer (minimal interest) (individual-user, street, second) o-layer (observation)
  • 20. Partial Materialization  On-line materialization  Materialization takes precious space and time  Only incremental materialization (with tilted time frame)  Only materialize “cuboids” of the critical layers  Online computation may take too much time  Preferred solution:  Popular-path approach: Materializing those along the popular drilling paths  H-tree (Hyperlink tree) structure: Such cuboids can be computed and stored efficiently using the H-tree structure  Aggregate Cells in non-leaf nodes  Based on popular path – all cells are cuboids on popular path – non- leaf nodes of H-tree; space and computation overheads are reduced 20
  • 21. Frequent Pattern Mining in Data Streams  Existing FP Mining algorithms require to scan the complete data set  In data streams an infrequent item may become frequent and vice versa  Approach I – Keep track of only limited set of items / itemsets  Approach II – Approximate set of answers  Example: a router is interested in all flows:  whose frequency is at least 1% (σ) of the entire traffic stream seen so far and feels that 1/10 of σ (ε = 0.1%) error is comfortable  All frequent items with a support of atleast min_support will be output; some items with a support of min_support – ε will also be output 21
  • 22. Lossy Counting Algorithm  Input: min_support threshold σ and error bound ε  Incoming stream is divided into buckets of width w = 1/ε  N – Current stream length  Frequency list data structure is maintained  For each item – approximate frequency count f, and maximum possible error ∆ are maintained  If the item is from the bth bucket, the maximum possible error ∆ on the frequency count of the item is b-1  When a bucket boundary is reached, an item entry is deleted if f+ ∆ <= b the current bucket number – Frequency list is kept small  The frequency of an item can be under-estimated by atmost εN  f+ ∆ <= b; b <= N/w i.e b<= Nε so f + ∆ <= Nε  All frequent items and some sub-frequent items with frequency σN – εN will be output 22
  • 23. Lossy Counting Algorithm  Properties  No false Negatives (all frequent items are output)  False positives have a frequency of atleast σN – εN  Frequency of a frequent item is under-estimated by atmost εN  Reasonable Space Requirements – 7/ ε 23
  • 24. Lossy Counting Algorithm 24 Bucket 1 Bucket 2 Bucket 3 Divide Stream into ‘Buckets’ (bucket size is 1/ ε = 1000)
  • 25. Lossy Counting Algorithm 25 Empty (summary) + At bucket boundary, decrease all counters by 1 + At bucket boundary, decrease all counters by 1
  • 26. Lossy Counting Algorithm  To find frequent itemsets  Load as many buckets as possible into main memory - β  If updated frequency f+∆ <= b where b is the current bucket number entry can be deleted  If an itemset has frequency f >= β and does not appear in the list, it is inserted as a new entry with ∆ set to b - β 26
  • 27. Lossy Counting Algorithm 27 Divide Stream into ‘Buckets’ as for frequent items But fill as many buckets as possible in main memory one time Bucket 1 Bucket 2 Bucket 3 If we put 3 buckets of data into main memory one time, Then decrease each frequency count by 3
  • 28. Lossy Counting Algorithm 28 2 2 1 2 1 1 1 summary data 3 buckets data in memory 4 4 10 2 2 0 + 3 3 9 summary data Itemset is deleted. Choosing a large number of buckets helps to delete more
  • 29. Lossy Counting Algorithm  Strengths:  A simple idea  Can be extended to frequent itemsets  Weakness:  Space Bound is not good  For frequent itemsets, they do scan each record many times  The output is based on all previous data. But sometimes, we are only interested in recent data 29
  • 30. Classification of Dynamic Data Streams  Traditional Classification techniques scan the training data multiple times – Off line process  Not possible in data streams  Example: Decision trees – at each node best possible split is determined by considering all the data, for all attributes  Concept drift occurs in streams  Changes in the classification model over time  Stream Classification techniques  Hoeffding tree Algorithm  Very Fast Decision Tree (VFDT)  Concept-adapting Very Fast Decision Tree  Classifier Ensemble 30
  • 31.  Decision Tree Learning Algorithm  Used originally to track Web click streams and predict next access  Hoeffding trees – small sample is adequate to choose optimal splitting attribute  Uses Hoeffding bound  r: random variable, R: range of r  n: # independent observations  Mean computed from sample – ravg  True Mean of r is at least ravg – ε, with probability 1 – δ (user-input)  Determines smallest number of samples needed at a node to split 31 Hoeffding Tree Algorithm n R 2 )/1ln(2 δ ε =
  • 32. Hoeffding Tree Algorithm  Hoeffding Tree Input S: sequence of examples X: attributes G( ): evaluation function d: desired accuracy  Hoeffding Tree Algorithm for each node retrieve G(Xa) and G(Xb) //two highest G(Xi) if ( G(Xa) – G(Xb) > ε ) split on Xa recurse to next node break  Complexity : O(ldvc); l- maximum depth; d – number of attributes; v- maximum number of values for any attribute; c- number of classes  Disagreement with traditional tree is at-most δ / p 32
  • 33. Hoeffding Tree Algorithm 33 yes no Packets > 10 Protocol = http Protocol = ftp yes yes no Packets > 10 Bytes > 60K Protocol = http Data Stream Data Stream New data can be integrated as it streams in
  • 34.  Strengths  Scales better than traditional methods  Sub-linear with sampling  Very small memory utilization  Incremental  Make class predictions in parallel  New examples are added as they come  Weakness  Could spend a lot of time with ties  Memory used with tree expansion  Cannot handle concept drift 34 Hoeffding Tree Algorithm
  • 35. Very Fast Decision Tree  Modifications to Hoeffding Tree  Near-ties broken more aggressively  G computed every nmin  Deactivates certain leaves to save memory  Poor attributes dropped  Compare to Hoeffding Tree: Better time and memory  Compare to traditional decision tree  Similar accuracy  Better runtime  Still does not handle concept drift 35
  • 36. Concept-adapting VFDT  Concept Drift  Time-changing data streams  Incorporate new and eliminate old  Sliding window – sensitive to window size (w)  CVFDT  Increments count with new example  Decrement old example  Sliding window – but does not construct model from scratch  Nodes assigned monotonically increasing IDs  Grows alternate subtrees  When alternate more accurate => replace old  O(w) better runtime than VFDT 36
  • 37. Classifier Ensemble Approach  Method (derived from the ensemble idea in classification)  train K classifiers from K chunks  for each subsequent chunk train a new classifier test other classifiers against the chunk assign weight to each classifier select top K classifiers  Instead of discarding old examples – here least accurate classifiers are discarded 37
  • 38. Clustering Data Streams  Methodologies used:  Compute and store summaries of past data  Apply a divide-and-conquer strategy  Incremental Clustering of incoming data streams  Perform micro-clustering as well as macro-clustering analysis  Explore multiple time granularity for the analysis of cluster evolution – data summaries at different time points  Divide stream clustering into on-line and off-line (or independent) processes 38
  • 39. STREAM  Input: Data stream of points X1, X2…XN with timestamps T1, T2…TN  Single pass, k-medians approach  Clusters data such that Sum Squared Error (SSQ) between the points and cluster center is minimized  Data streams are processed in buckets of m points (to fit into main memory)  For each bucket – k clusters are formed, weighted centers are retained and other points are discarded  When enough centers are collected, these are clustered  Clusters in limited time and space but does not handle time granularity 39
  • 40. CluStream: Clustering Evolving Streams  Design goal  High quality for clustering evolving data streams with greater functionality  One-pass over the original stream data  Limited space usage and high efficiency  CluStream: A framework for clustering evolving data streams  Divide the clustering process into online and offline components  Online component: periodically stores summary statistics about the stream data  Offline component: answers various user questions based on the stored summary statistics 40
  • 41. CluStream  Tilted time frame model adopted  Micro-cluster  Statistical information about data locality  Temporal extension of the cluster-feature vector  Multi-dimensional points X1, X2…XN with timestamps T1, T2…TN  Each point contains d dimensions  A micro-cluster for n points is defined as a (2.d + 3) tuple (CF2x , CF1x , CF2t , CF1t , n)  Pyramidal time frame  Decide at what moments the snapshots of the statistical information are stored away on disk 41
  • 42. CluStream  Online micro-cluster maintenance  Initial creation of q micro-clusters  q is usually significantly larger than the number of natural clusters  Online incremental update of micro-clusters  If new point is within max-boundary, insert into the micro- cluster or create a new cluster  May delete obsolete micro-cluster or merge two closest ones  Query-based macro-clustering  Based on a user-specified time-horizon h and the number of macro-clusters K, compute macroclusters using the k-means algorithm 42
  • 43. CluStream  Evolution Analysis  Given time horizon h and two clock times t1, t2, cluster analysis examines data arriving between (t2-h, t2) and (t1-h, t1)  Check for new clusters  Loss of original clusters  How clusters have shifted in position and nature  Net snapshots of microclusters N(t1, h) and N(t2, h)  CluStream – accurate and efficient; scalable; flexible 43