Data Mining_Unit-V

The document provides an extensive overview of data stream mining, highlighting its characteristics, challenges, and techniques such as clustering, classification, and anomaly detection. It also covers sequential pattern mining, including algorithms and applications in various fields like retail and healthcare. Additionally, it discusses mining object, spatial, multimedia, text, and web data, detailing techniques and examples for each type.

Advanced Concepts: Mining Data Streams

1. Introduction to Data Streams


A data stream is a continuous, rapid flow of data generated in real time. Unlike
traditional static datasets, data streams are dynamic and potentially infinite,
making storage and processing challenging.
Examples of data streams:
 Sensor data from IoT devices
 Clickstream data from websites
 Stock market updates
 Network traffic data
2. Characteristics of Data Streams
 Continuous flow: Data arrives in a steady, ongoing manner.
 High speed: Data is generated at a rapid rate, requiring real-time
processing.
 Potentially unbounded: The total size of the data is unknown and can
grow indefinitely.
 Single pass: Each item is typically processed once and then discarded or
archived, because re-scanning the full stream is rarely feasible.
 Evolving nature: The distribution of data may change over time (concept
drift).

3. Basic Concepts in Mining Data Streams


Mining data streams refers to extracting meaningful patterns, trends, or insights
from a continuous flow of data. Unlike mining static datasets, stream mining
algorithms must operate under strict constraints.
Challenges in Mining Data Streams
 Limited memory: Stream data cannot be stored entirely; only summaries
or models can be kept.
 Real-time processing: Algorithms must process data quickly as it
arrives.
 Handling concept drift: Adapting to changes in the data distribution
over time.
 Accuracy vs. efficiency: Maintaining a balance between processing
speed and the accuracy of the results.

4. Tasks in Data Stream Mining


 Clustering:
o Example: Grouping IoT sensor data into clusters of similar
temperature readings.
 Classification:
o Example: Real-time email spam detection.

 Frequent Pattern Mining:


o Example: Identifying the most common search terms on a website.

 Anomaly Detection:
o Example: Detecting fraudulent credit card transactions.

 Regression:
o Example: Predicting real-time stock prices.

5. Techniques for Mining Data Streams


a) Sliding Window Model
 Keeps a fixed-size window (by time or by item count) over the most
recent part of the stream.
 Only the data inside the window is analyzed; as new data arrives, the
oldest data drops out of the window.
 Example:
o A 10-minute sliding window in network traffic monitoring examines
the last 10 minutes of packets.
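A minimal sketch of a time-based sliding window, assuming timestamps arrive in non-decreasing order. The class name `SlidingWindow` and the packet sizes are hypothetical and serve only to illustrate the 10-minute monitoring example.

```python
from collections import deque


class SlidingWindow:
    """Keep only the items whose timestamp lies within the last `width` seconds."""

    def __init__(self, width):
        self.width = width
        self.items = deque()  # (timestamp, value) pairs, oldest first

    def add(self, timestamp, value):
        self.items.append((timestamp, value))
        # Evict everything that has fallen out of the window.
        while self.items and self.items[0][0] <= timestamp - self.width:
            self.items.popleft()

    def total(self):
        return sum(value for _, value in self.items)


# Hypothetical packet sizes (bytes) keyed by arrival time in seconds.
window = SlidingWindow(width=600)  # 10 minutes
for ts, size in [(0, 100), (200, 250), (500, 50), (700, 400)]:
    window.add(ts, size)
# The packet from t=0 has been evicted; total() covers t=200, 500 and 700.
```

Memory use is bounded by the number of items that fit in one window, not by the length of the stream.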
b) Damped Window Model
 Assigns higher weights to recent data and lower weights to older data.
 Useful for emphasizing recent trends while still considering historical data.
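One common way to realize a damped window is exponential decay, where the weight of an item shrinks by a constant factor per unit of time. The sketch below assumes that form of decay and non-decreasing timestamps; `DecayedAverage` and `decay_rate` are illustrative names, not part of any specific library.

```python
import math


class DecayedAverage:
    """Weighted average in which an item of age `a` has weight exp(-decay_rate * a)."""

    def __init__(self, decay_rate):
        self.decay_rate = decay_rate
        self.weighted_sum = 0.0
        self.weight_total = 0.0
        self.last_time = None

    def add(self, timestamp, value):
        if self.last_time is not None:
            # Age all previously seen items in one multiplication.
            factor = math.exp(-self.decay_rate * (timestamp - self.last_time))
            self.weighted_sum *= factor
            self.weight_total *= factor
        self.weighted_sum += value
        self.weight_total += 1.0
        self.last_time = timestamp

    def value(self):
        if self.weight_total == 0.0:
            raise ValueError("no data seen yet")
        return self.weighted_sum / self.weight_total


# With decay_rate = ln 2, the weight of an item halves every time unit.
avg = DecayedAverage(decay_rate=math.log(2))
avg.add(0, 10.0)
avg.add(1, 20.0)
```

Only two running totals are stored, so older data keeps a (shrinking) influence without being kept in memory.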
c) Synopsis Data Structures
 Summarizes data streams using small, memory-efficient structures.
 Examples:
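o Count-Min Sketch: Estimates the frequency of items in a stream; it
may overestimate a count but never underestimates it.
o Bloom Filter: Tests membership of an element in a set; it may report
false positives but never false negatives.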
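Both structures can be sketched in a few lines. The version below is a simplified illustration: the table sizes are arbitrary, and the helper `_hash` (built on SHA-256 with a seed prefix) is an assumption chosen for determinism, not the hashing scheme of any particular implementation.

```python
import hashlib


def _hash(item, seed, modulus):
    """Deterministic hash of `item` for a given seed, reduced to [0, modulus)."""
    digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % modulus


class CountMinSketch:
    def __init__(self, width=1000, depth=5):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][_hash(item, row, self.width)] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum is the tightest bound.
        return min(self.table[row][_hash(item, row, self.width)]
                   for row in range(self.depth))


class BloomFilter:
    def __init__(self, size=1000, num_hashes=3):
        self.size, self.num_hashes = size, num_hashes
        self.bits = [False] * size

    def add(self, item):
        for seed in range(self.num_hashes):
            self.bits[_hash(item, seed, self.size)] = True

    def might_contain(self, item):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[_hash(item, seed, self.size)]
                   for seed in range(self.num_hashes))


sketch = CountMinSketch()
for term in ["shoes", "phone", "shoes", "laptop", "shoes"]:
    sketch.add(term)

seen = BloomFilter()
seen.add("user-42")
```

In both cases the memory footprint is fixed up front and does not grow with the stream.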

d) Sampling
 Randomly selects a subset of the data stream for processing.
 Example:
o Sampling a portion of website traffic to estimate the total number of
users.
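The text does not say which sampling scheme is meant. One standard option for streams of unknown length is reservoir sampling (Algorithm R), which keeps a uniform random sample of `k` items in a single pass. A minimal sketch, with `reservoir_sample` as an illustrative name:

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return k items chosen uniformly at random from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)  # fill the reservoir first
        else:
            j = rng.randint(0, i)   # inclusive on both ends
            if j < k:
                reservoir[j] = item  # item i is kept with probability k / (i + 1)
    return reservoir


# Hypothetical stream of 10,000 request ids, sampled down to 100.
sample = reservoir_sample(range(10_000), k=100, rng=random.Random(0))
```

If the stream holds fewer than `k` items, the function simply returns all of them.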
e) Incremental Algorithms
 Continuously updates the model with each incoming data point.
 Example:
o Incremental clustering algorithms like CluStream.

6. Mining Techniques with Examples


a) Mining Frequent Patterns
 Identify patterns that occur frequently in a data stream.
 Example:
o In an online shopping platform, identify products frequently
purchased together in real time.
b) Classification
 Train a model to classify incoming data points.
 Example:
o A fraud detection system that labels transactions as "fraudulent" or
"legitimate."
c) Clustering
 Groups data points into clusters based on similarity.
 Example:
o Clustering GPS data from delivery vehicles to identify common
delivery routes.
d) Outlier Detection
 Identifies anomalies or unusual data points in the stream.
 Example:
o Detecting unusual spikes in CPU usage in a data center.
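As one possible illustration of the CPU-usage example, the sketch below flags a reading whose z-score against the running mean and standard deviation exceeds a threshold. The statistics are updated incrementally with Welford's method; the threshold, the warm-up length, and the readings are assumptions chosen for the example.

```python
import math


class RunningZScore:
    def __init__(self, threshold=3.0, warmup=10):
        self.threshold = threshold
        self.warmup = warmup  # readings to observe before flagging anything
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the mean

    def update(self, x):
        """Return True if x looks anomalous, then fold x into the statistics."""
        is_outlier = False
        if self.n >= self.warmup:
            std = math.sqrt(self.m2 / (self.n - 1))
            if std > 0 and abs(x - self.mean) / std > self.threshold:
                is_outlier = True
        # Welford's incremental update of mean and squared deviations.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return is_outlier


# Hypothetical CPU usage percentages; the last reading is a spike.
detector = RunningZScore()
readings = [50, 52, 49, 51, 50, 48, 52, 51, 49, 50, 51, 98]
flags = [detector.update(r) for r in readings]
```

Whether flagged readings should still update the statistics is a design choice; here they do, which keeps the code short but lets repeated spikes gradually look normal.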

7. Example: CluStream for Data Stream Clustering


Scenario:
 A ride-hailing app collects GPS data from its drivers in real-time.
 Goal: Group drivers into clusters based on their locations.
Steps:
1. Online Phase:
o Summarize the incoming data stream into micro-clusters.
o Each micro-cluster stores summary statistics (e.g., centroid,
weight).
2. Offline Phase:
o Use the micro-clusters to generate larger, meaningful clusters.

o Periodically update clusters to reflect new data.
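The online phase can be sketched as follows. This is a deliberately simplified illustration, not the full CluStream algorithm: it omits timestamps, the fixed budget of micro-clusters with merging and deletion, and the offline clustering step. The `max_distance` parameter and the coordinates are hypothetical.

```python
import math


class MicroCluster:
    """Additive summary of absorbed points: count, linear sum, squared sum."""

    def __init__(self, point):
        self.n = 1
        self.linear_sum = list(point)
        self.squared_sum = [x * x for x in point]

    def absorb(self, point):
        self.n += 1
        for i, x in enumerate(point):
            self.linear_sum[i] += x
            self.squared_sum[i] += x * x

    def centroid(self):
        return [s / self.n for s in self.linear_sum]


def online_update(micro_clusters, point, max_distance):
    """Absorb `point` into the nearest micro-cluster, or open a new one."""
    nearest = min(micro_clusters,
                  key=lambda mc: math.dist(mc.centroid(), point),
                  default=None)
    if nearest is not None and math.dist(nearest.centroid(), point) <= max_distance:
        nearest.absorb(point)
    else:
        micro_clusters.append(MicroCluster(point))


# Hypothetical driver positions (arbitrary planar coordinates).
clusters = []
for position in [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 0.1)]:
    online_update(clusters, position, max_distance=1.0)
```

Because the summaries are additive, two micro-clusters can be merged by adding their counts and sums, which is what makes the later offline phase cheap.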

8. Example: Decision Tree for Data Stream Classification


Scenario:
 A financial institution monitors real-time credit card transactions to detect
fraud.
Steps:
1. Train a decision tree model using a small sample of labeled transactions.
2. As new transactions arrive:
o Classify each transaction as "fraudulent" or "legitimate."

o Update the decision tree with new labeled data points (incremental
learning).
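The steps above do not name a specific incremental tree algorithm. One widely used choice is the Hoeffding tree (VFDT), which splits a leaf only once enough examples have been seen that the best attribute is very likely to be the true best. The sketch below shows just that split test, under the assumption that the split criterion is bounded by `value_range` (1.0 for information gain with two classes); the function names and sample numbers are illustrative.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Margin epsilon such that, with probability 1 - delta, the observed mean of
    n samples is within epsilon of the true mean."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_best_gain, value_range, delta, n):
    """Split when the lead of the best attribute exceeds the Hoeffding margin."""
    return best_gain - second_best_gain > hoeffding_bound(value_range, delta, n)


# After 1,000 labeled transactions at a leaf (hypothetical gain values):
clear_winner = should_split(0.30, 0.15, value_range=1.0, delta=1e-7, n=1000)
too_close = should_split(0.30, 0.25, value_range=1.0, delta=1e-7, n=1000)
```

The margin shrinks as `n` grows, so a leaf with two close candidates simply waits for more data before committing to a split.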

9. Tools for Mining Data Streams


 MOA (Massive Online Analysis): A framework for data stream mining.
 Apache Storm: A real-time distributed stream processing system.
 Apache Flink: A platform for processing unbounded streams of data.

1. Time-Series Data
Definition: Time-series data is a sequence of data points collected over time,
typically at uniform intervals. Each data point is associated with a timestamp.
Examples:
 Stock prices over days
 Temperature measurements recorded hourly
 Monthly sales data of a product
Characteristics:
 Temporal Ordering: The sequence of the data matters.
 Continuous or Discrete: Data can be continuously measured or occur at
distinct intervals.
 Trends and Seasonality: Patterns such as growth trends or seasonal
variations are common.

2. Sequence Patterns in Transactional Databases


Definition: Sequence pattern mining involves discovering frequent
subsequences in a dataset where the data is arranged in the form of sequences
(e.g., customer purchases over time).
Transactional Database Example: A transactional database records
sequences of events or items purchased by customers:
Customer A: {Milk → Bread → Butter}
Customer B: {Bread → Butter}
Customer C: {Milk → Butter → Bread}
The goal is to discover patterns like:
 Customers often buy "Milk → Bread".
 "Butter" frequently follows "Bread".

3. Mining Sequence Patterns


Mining sequence patterns focuses on identifying frequent patterns within a
sequence of events in the database.
Key Applications:
 Market Basket Analysis: Identifying the sequence of products frequently
purchased together.
 Web Clickstream Analysis: Understanding the sequence of pages users
visit on a website.
 Medical Data Analysis: Discovering sequences of symptoms leading to a
diagnosis.

4. Algorithms for Mining Sequence Patterns


Several algorithms are used for mining sequence patterns in transactional
databases. Below are some commonly used methods:
a) Apriori-based Sequential Pattern Mining
 Extends the Apriori algorithm to handle sequential data.
 Generates candidate sequences and evaluates their frequency.
Example:
Database:
1. {A → B → C}
2. {A → C → D}
3. {B → C → A}

Pattern: {A → C}
Support: 2 (appears in sequences 1 and 2)
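The support count in this example can be checked with a short sketch. It assumes each sequence is a list of single items and that a pattern matches when its items appear in order, with gaps allowed; the function names are illustrative.

```python
def is_subsequence(pattern, sequence):
    """True if the items of `pattern` appear in `sequence` in order (gaps allowed)."""
    remaining = iter(sequence)
    # `item in remaining` advances the iterator, so order is enforced.
    return all(item in remaining for item in pattern)


def support(pattern, database):
    """Number of sequences that contain `pattern`."""
    return sum(is_subsequence(pattern, seq) for seq in database)


database = [
    ["A", "B", "C"],
    ["A", "C", "D"],
    ["B", "C", "A"],
]
print(support(["A", "C"], database))  # prints 2
```

Sequence 3 contains both A and C but in the wrong order, so it does not count toward the support of {A → C}.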
b) PrefixSpan (Prefix-projected Sequential Pattern Mining)
 Projects the database based on prefixes of sequences to reduce the search
space.
 Generates frequent patterns directly from the projected database.
Example:
Database: {A → B → C}, {A → C}, {B → C → D}
Step 1: Identify prefix (e.g., {A}).
Step 2: Project database: {B → C}, {C}.
Step 3: Mine frequent patterns: {A → C}.
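The three steps can be sketched as a short recursive function. This is a simplified version that assumes every event is a single item (no itemsets within an event); `prefixspan` and `min_support` are illustrative names.

```python
def prefixspan(database, min_support, prefix=()):
    """Yield (pattern, support) for every frequent sequential pattern."""
    # Count each item once per (projected) sequence.
    counts = {}
    for seq in database:
        for item in set(seq):
            counts[item] = counts.get(item, 0) + 1
    for item, count in sorted(counts.items()):
        if count < min_support:
            continue
        pattern = prefix + (item,)
        yield pattern, count
        # Project: keep the suffix after the first occurrence of `item`.
        projected = [seq[seq.index(item) + 1:] for seq in database if item in seq]
        yield from prefixspan(projected, min_support, pattern)


database = [["A", "B", "C"], ["A", "C"], ["B", "C", "D"]]
patterns = dict(prefixspan(database, min_support=2))
```

For the prefix A, the projected database inside the recursion is [["B", "C"], ["C"]], which matches Step 2, and the only frequent extension is C, which gives {A → C}.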
c) SPADE (Sequential Pattern Discovery using Equivalence classes)
 Uses a vertical database format to efficiently mine sequences.
 Stores each item with an id-list of the sequences (and positions within
them) in which it appears; longer patterns are found by joining id-lists.
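A minimal sketch of the vertical format, assuming single-item events and using list positions as event ids. The helper names (`to_vertical`, `temporal_join`, `support`) are illustrative, and the full SPADE algorithm additionally organizes the search by equivalence classes, which is not shown here.

```python
from collections import defaultdict


def to_vertical(database):
    """Map each item to its id-list of (sequence id, position) pairs."""
    id_lists = defaultdict(list)
    for sid, seq in enumerate(database, start=1):
        for pos, item in enumerate(seq):
            id_lists[item].append((sid, pos))
    return id_lists


def temporal_join(first, second):
    """Id-list of `first -> second`: occurrences of `second` after some `first`."""
    joined = []
    for sid_b, pos_b in second:
        if any(sid_a == sid_b and pos_a < pos_b for sid_a, pos_a in first):
            joined.append((sid_b, pos_b))
    return joined


def support(id_list):
    """Number of distinct sequences represented in an id-list."""
    return len({sid for sid, _ in id_list})


database = [["A", "B", "C"], ["A", "C", "D"], ["B", "C", "A"]]
vertical = to_vertical(database)
a_then_c = temporal_join(vertical["A"], vertical["C"])
```

Support is read directly from the joined id-list, so the original database does not have to be re-scanned for each candidate.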

5. Example of Sequential Pattern Mining


Dataset:
Customer Transactions:
1. {Milk → Bread → Butter}
2. {Bread → Butter → Cheese}
3. {Milk → Butter → Bread}
4. {Bread → Cheese → Milk}
Steps (minimum support = 2 sequences):
1. Count individual items:
o Frequent 1-sequences: {Milk}, {Bread}, {Butter}, {Cheese}.
o Support: {Milk: 3, Bread: 4, Butter: 3, Cheese: 2}.
2. Generate 2-sequence patterns:
o {Milk → Bread}, {Milk → Butter}, {Bread → Butter}, {Bread → Cheese}.
o Support: {Milk → Bread: 2, Milk → Butter: 2, Bread → Butter: 2,
Bread → Cheese: 2}.
o {Bread → Butter} is not counted for sequence 3, where Butter comes
before Bread. Every other ordered pair, such as {Butter → Bread},
appears in at most one sequence.
3. Generate 3-sequence patterns:
o {Milk → Bread → Butter}, {Bread → Butter → Cheese}.
o Support: {Milk → Bread → Butter: 1, Bread → Butter → Cheese: 1}.
4. Output frequent patterns:
o Patterns of length 2 or more with support ≥ 2:
 {Milk → Bread}, {Milk → Butter}, {Bread → Butter}, {Bread → Cheese}.
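The level-wise procedure can be reproduced with a short sketch, which is a convenient way to check the support counts by running it. It assumes single-item events and builds candidates by appending one frequent item to each frequent pattern; `frequent_sequences` is an illustrative name, not a library function.

```python
def is_subsequence(pattern, sequence):
    """True if the items of `pattern` appear in `sequence` in order (gaps allowed)."""
    remaining = iter(sequence)
    return all(item in remaining for item in pattern)


def frequent_sequences(database, min_support):
    """Level-wise search: extend frequent k-sequences by one frequent item."""
    def count(pattern):
        return sum(is_subsequence(pattern, seq) for seq in database)

    items = sorted({item for seq in database for item in seq})
    level = {(item,): count((item,)) for item in items}
    level = {p: s for p, s in level.items() if s >= min_support}
    frequent_items = [p[0] for p in level]
    result = {}
    while level:
        result.update(level)
        # Candidates of length k + 1, pruned by counting against the database.
        candidates = {p + (item,) for p in level for item in frequent_items}
        level = {c: count(c) for c in candidates}
        level = {p: s for p, s in level.items() if s >= min_support}
    return result


transactions = [
    ["Milk", "Bread", "Butter"],
    ["Bread", "Butter", "Cheese"],
    ["Milk", "Butter", "Bread"],
    ["Bread", "Cheese", "Milk"],
]
frequent = frequent_sequences(transactions, min_support=2)
for pattern, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), kv[0])):
    print(" → ".join(pattern), count)
```

The loop stops as soon as a level produces no frequent pattern, which is the pruning idea that Apriori-style methods rely on.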

6. Mining Sequence Patterns in Time-Series Data


Sequence patterns can also be extracted from time-series data, where the
timestamps dictate the ordering.
Example:
 Scenario: A retail store records the following sales data over three days:
 Day 1: {Milk → Bread}
 Day 2: {Bread → Butter}
 Day 3: {Milk → Butter → Bread}
 Goal: Find frequent item purchase sequences.
 Result: The pattern {Milk → Bread} occurs on two of the three days
(Days 1 and 3), so it is frequent at a minimum support of 2.

7. Key Metrics in Sequence Pattern Mining


 Support: The fraction of sequences containing the pattern.
o Example: If {Milk → Bread} appears in 2 out of 4 sequences, its
support is 50%.
 Confidence: The likelihood that one event follows another, computed as
support(A → B) / support(A).
o Example: If {Milk → Bread} has confidence 75%, then 75% of the
customers who bought milk went on to buy bread afterwards.
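Both metrics can be computed directly from a sequence database. The sketch below assumes single-item events and reuses the four customer transactions from the worked example in Section 5; the function names are illustrative.

```python
def is_subsequence(pattern, sequence):
    """True if the items of `pattern` appear in `sequence` in order (gaps allowed)."""
    remaining = iter(sequence)
    return all(item in remaining for item in pattern)


def support(pattern, database):
    """Fraction of sequences that contain `pattern`."""
    return sum(is_subsequence(pattern, seq) for seq in database) / len(database)


def confidence(antecedent, consequent, database):
    """Of the sequences containing `antecedent`, the fraction in which
    `consequent` follows it. Assumes `antecedent` occurs at least once."""
    return support(antecedent + consequent, database) / support(antecedent, database)


transactions = [
    ["Milk", "Bread", "Butter"],
    ["Bread", "Butter", "Cheese"],
    ["Milk", "Butter", "Bread"],
    ["Bread", "Cheese", "Milk"],
]
milk_bread_support = support(["Milk", "Bread"], transactions)         # 0.5
milk_bread_confidence = confidence(["Milk"], ["Bread"], transactions)  # 2/3
```

Milk appears in three sequences and is followed by Bread in two of them, which gives the confidence of 2/3.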

8. Applications of Sequence Pattern Mining


1. Retail:
o Optimizing product placement based on purchase patterns.

2. Healthcare:
o Identifying symptom sequences leading to a diagnosis.

3. Finance:
o Detecting sequences of transactions indicating fraud.

4. Web Analytics:
o Analyzing user navigation paths to improve website design.

9. Tools for Sequential Pattern Mining


 SAS Enterprise Miner
 R: Libraries like arulesSequences.
 Weka: Tools for mining sequences in data.
 SPMF: An open-source data mining library specialized in sequence pattern
mining.
Sequential pattern mining is an essential technique for uncovering hidden trends
in temporal and transactional data, offering valuable insights for decision-making
in diverse fields.

Mining Object, Spatial, Multimedia, Text, and Web Data


Data mining can be applied to different types of data sources, such as spatial,
multimedia, text, and web data. Below is a detailed explanation of each type,
along with examples and key concepts.

1. Mining Object Data


Definition:
Object data refers to data that has inherent structures, such as objects in
object-oriented databases or geospatial systems.
Examples:
 Product catalogs in e-commerce.
 Customer records in a relational database.
Techniques:
 Object-Oriented Data Mining: Analyzes attributes and relationships of
objects.
Applications:
 Fraud detection.
 Customer profiling.

2. Spatial Data Mining


Definition:
Spatial data mining is the process of discovering interesting patterns and
relationships in spatial datasets, such as geographic or spatial objects.
Examples:
 Weather data analysis: Detecting patterns in rainfall distribution.
 Urban planning: Identifying areas with high traffic congestion.

Key Techniques:
 Spatial Clustering: Identifies groups of similar spatial objects (e.g.,
k-Means for geographic data).
 Spatial Classification: Assigns a label to spatial data (e.g., land type
classification: forest, urban, water).
 Spatial Association Rules: Finds relationships among spatial data (e.g.,
"If a region is near a river, it is likely to have fertile soil").
Applications:
 Satellite image analysis.
 Geographic Information Systems (GIS).
 Crime mapping.

3. Multimedia Data Mining


Definition:
Multimedia data mining focuses on discovering meaningful patterns in
multimedia data, such as images, audio, video, and animations.
Examples:
 Video surveillance: Identifying unusual behavior.
 Image analysis: Detecting defects in industrial images.
Key Techniques:
 Content-Based Retrieval: Identifies multimedia content based on
features (e.g., color, texture, shape).
 Pattern Recognition: Detects recurring patterns in multimedia (e.g.,
voice or facial recognition).
 Clustering: Groups similar images or audio clips.
Applications:
 Healthcare: Analyzing X-rays and MRIs.
 Entertainment: Recommending movies or music.
4. Text Mining
Definition:
Text mining involves extracting useful information and patterns from
unstructured textual data.
Examples:
 Sentiment analysis: Analyzing customer reviews for positive or negative
sentiment.
 Topic modeling: Categorizing documents into topics.
Key Techniques:
 Natural Language Processing (NLP): Processes and analyzes textual
data (e.g., tokenization, stemming).
 TF-IDF (Term Frequency-Inverse Document Frequency): Measures
the importance of words in a document relative to a collection of
documents.
 Text Clustering: Groups similar documents (e.g., news articles).
 Text Classification: Categorizes text into predefined classes (e.g., spam
vs. non-spam emails).
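TF-IDF has several variants. The sketch below assumes one simple form: term frequency as count divided by document length, and inverse document frequency as the natural log of N divided by the number of documents containing the term. The tiny corpus is hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    "great product quality".split(),
    "delivery was delayed".split(),
    "great delivery".split(),
]
scores = tf_idf(docs)
```

In the first document, "product" scores higher than "great" because "great" also appears in another document and is therefore less distinctive.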
Applications:
 Social media analysis.
 Legal document processing.
 Fraud detection (e.g., analyzing textual patterns in contracts).

5. Mining the World Wide Web


Definition:
Web mining refers to the application of data mining techniques to extract
insights from web data, including content, structure, and user behavior.
Types of Web Mining:
1. Web Content Mining:
o Extracts patterns from web pages' content.
o Example: Extracting product descriptions from e-commerce sites.
o Techniques: NLP, semantic analysis.
2. Web Structure Mining:
o Discovers relationships between web pages based on hyperlinks.
o Example: Google's PageRank algorithm.
o Techniques: Graph theory, link analysis.
3. Web Usage Mining:
o Analyzes user behavior data (e.g., clickstreams, browsing history).
o Example: Recommending products based on browsing patterns.
o Techniques: Sequential pattern mining, association rule mining.

Applications:
 Search engine optimization (SEO).
 E-commerce recommendations.
 User behavior analysis.

6. Spatial Data Mining: Example


Scenario:
A city government wants to analyze accident-prone areas. They use a spatial
dataset containing the coordinates of accidents.
Steps:
1. Spatial Clustering: Identify clusters of accident-prone areas using
algorithms like DBSCAN.
2. Spatial Association Rules: Analyze factors contributing to accidents
(e.g., "High traffic volume near schools increases accident risk").
3. Visualization: Create heatmaps to visualize high-risk zones.
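Step 1 can be illustrated with a compact pure-Python version of DBSCAN. This is a sketch for small datasets (the neighbor search is quadratic); the accident coordinates and the `eps` and `min_points` values are hypothetical.

```python
import math

NOISE = -1


def dbscan(points, eps, min_points):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    def neighbors(i):
        # Includes the point itself, as in the usual DBSCAN definition.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_points:
            labels[i] = NOISE  # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_points:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels


# Hypothetical accident locations on an arbitrary planar grid.
accidents = [(0, 0), (0, 1), (1, 0), (1, 1),
             (10, 10), (10, 11), (11, 10),
             (50, 50)]
labels = dbscan(accidents, eps=1.5, min_points=3)
```

Unlike k-Means, the number of clusters is not fixed in advance, and isolated accidents are reported as noise instead of being forced into a cluster.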

7. Multimedia Data Mining: Example


Scenario:
A streaming platform wants to recommend movies.
Steps:
1. Content-Based Mining: Analyze features like genre, actors, and ratings.
2. Pattern Recognition: Identify patterns in user preferences.
3. Clustering: Group similar movies for personalized recommendations.

8. Text Mining: Example


Scenario:
A company analyzes customer feedback to identify trends.
Steps:
1. Data Preprocessing: Tokenize and clean the text.
2. Sentiment Analysis: Classify feedback as positive, negative, or neutral.
3. Topic Modeling: Identify frequently mentioned topics.
Output:
 Positive Feedback: "Great product quality."
 Negative Feedback: "Delivery was delayed."

9. Web Mining: Example


Scenario:
An e-commerce site wants to improve product recommendations.
Steps:
1. Web Usage Mining: Analyze user browsing and purchase history.
2. Web Content Mining: Extract product descriptions for keyword
matching.
3. Web Structure Mining: Identify relationships between similar products
using hyperlinks.
Output:
 Recommendations for "Smartphones" include cases, chargers, and screen
protectors.
