Unit-5 - Outliers and Statistical Approaches in Data Mining
Introduction to outliers, Challenges in detecting Outliers, Outlier Detection Methods -
Supervised, Semi-supervised, Unsupervised - Statistical Data Mining approaches - Data mining
in Recommender Systems, Data mining for Intrusion Detection, Data Mining for Financial
Analysis
Outlier:
A data object that deviates significantly from the normal objects, as if it were generated
by a different mechanism. Ex.: an unusual credit card purchase.
Outliers are different from noise data. Noise is random error or variance in a
measured variable, and it should be removed before outlier detection.
Outlier detection vs. novelty detection: at an early stage an object may be flagged as an
outlier, but it may later be merged into the model as a new normal pattern.
Applications: credit card fraud detection, telecom fraud detection, customer
segmentation, medical analysis.
Example: If a customer who usually spends $50 suddenly makes a $5,000 purchase, that could
be flagged as an outlier, potentially indicating credit card fraud.
There are three types of outliers: global, contextual, and collective.
1. Global Outlier (or Point Anomaly):
○ A global outlier is an object that significantly deviates from the entire dataset.
○ Example: In network intrusion detection, an unusual data packet that doesn't
match any typical behavior can be considered a global outlier.
2. Contextual Outlier (or Conditional Outlier):
○ A contextual outlier deviates significantly when considering a specific context, like
time or location.
○ Example: A temperature of 80°F in Urbana could be an outlier in winter, but not
in summer.
○ Attributes are divided into:
■ Contextual attributes: Define the context (e.g., time, location).
■ Behavioral attributes: Characteristics of the object used to evaluate the
outlier (e.g., temperature).
○ This can be viewed as a generalization of local outliers, where the object’s
density significantly deviates from its local area.
3. Collective Outlier:
○ This type involves a group of data objects that, together, deviate significantly from
the rest of the dataset (but may not be outliers individually).
Issues: For contextual outliers, the challenge lies in defining a meaningful context for
evaluation.
Collective Outliers occur when a group of data objects collectively deviate significantly from the
rest of the dataset, even if the individual objects within the group aren't outliers themselves. The
outlier behavior is exhibited when considered as a group.
Example: In intrusion detection systems, multiple computers in a network may individually send
harmless data packets, but when several computers simultaneously start sending
denial-of-service (DoS) packets to each other, they form a collective outlier. Individually, the
computers might not be performing anything suspicious, but together, their behavior indicates
an anomaly.
Detection of Collective Outliers:
● To detect collective outliers, it's important to consider the behavior of groups of objects,
rather than just individual ones.
● Background knowledge, such as relationships or similarities between data objects (e.g.,
distance metrics or similarity measures), is essential to identifying collective outliers.
Multiple Outlier Types:
● A dataset may contain different types of outliers (e.g., global, contextual, collective), and
an object can belong to more than one type of outlier depending on the analysis context.
This makes collective outlier detection more complex, as it requires understanding the
relationships among data objects and how they interact within the group.
Challenges of Outlier Detection:
1. Modeling Normal Objects and Outliers Properly:
○ It's difficult to accurately define what constitutes "normal" behavior because it can
vary significantly across different contexts.
○ Example: In a clinic setting, a slight deviation in a patient's vital signs (e.g.,
heart rate) may be considered an outlier, as even small changes could signal a
health issue. In contrast, in marketing analysis, larger fluctuations in customer
behavior (e.g., spending) might be considered normal, so smaller variations could
be flagged as outliers.
2. The Border Between Normal and Outlier Objects is Often a Gray Area:
○ Determining the exact threshold between normal data and outliers can be
ambiguous. The distinction is not always clear-cut.
○ Example: A temperature of 26.67°C (80°F) might be an outlier in a winter
context in Delhi but normal during the summer. The line between normal and
outlier depends heavily on the time of year, making it difficult to define with
precision.
3. Application-Specific Outlier Detection:
○ The choice of distance measures and relationships among objects depends on
the specific application, meaning that outlier detection techniques must be
tailored to each scenario.
○ Example: In fraud detection (e.g., credit card fraud), a sudden, large purchase
far from the customer's usual spending pattern might be flagged as an outlier.
However, in sports analysis, a player's performance like Michael Jordan’s
high-scoring games might be considered normal for a legend, even though the
same performance in a different context could be flagged as an outlier.
4. Handling Noise in Outlier Detection:
○ Noise—random errors or fluctuations in data—can interfere with the detection
process, making it hard to distinguish between actual outliers and random
variance.
○ Example: If you are tracking website traffic and there's a sudden spike due to a
server glitch (noise), this might obscure genuine outlier data like a real
marketing campaign driving a surge in visitors. The noise can make it harder to
identify the real outliers (the successful marketing campaign).
5. Understandability:
○ Once an outlier is detected, it’s important to explain why it is considered an
outlier. The rationale for the detection must be clear and justifiable.
○ Example: If an employee’s sales in one month are drastically lower than in
others, we need to explain whether this is due to personal issues, seasonal
fluctuations, or an actual error in reporting. The outlier should be explained in
terms of how likely it is to have occurred within the expected normal range.
6. Specify the Degree of an Outlier:
○ Sometimes, it's necessary to quantify how much of an outlier an object is, i.e.,
how unlikely it is that the object was generated by the normal mechanism.
○ Example: In bank fraud detection, a purchase of $10,000 when the user
typically spends $100 might not just be an outlier—it could be an extreme outlier.
Specifying the degree of deviation from normal behavior helps prioritize actions,
such as flagging it as "highly suspicious" for further investigation.
These challenges highlight the complexity of outlier detection, where context, application, noise,
and interpretability all play significant roles.
Statistical Model for Outlier Detection
A Gaussian distribution (or normal distribution) is often used in detecting outliers, as many
datasets tend to follow a normal distribution or can be approximated as normal. In this context,
outliers are typically defined as data points that fall far away from the mean, often measured
using the Z-score or by calculating a threshold based on the standard deviation of the Gaussian
distribution.
Example of Outlier Detection using a Gaussian Distribution
Let's say you have a dataset and you want to detect outliers assuming that the data follows a
Gaussian distribution.
1. Data and Assumption
Assume you have the following dataset of values representing some measurement (e.g.,
heights in cm):
[160, 162, 158, 170, 155, 165, 172, 180, 152, 175]
For this example, let's assume the data follows a Gaussian distribution (or normal distribution),
and we'll detect any outliers using the Z-score method.
2. Calculate the Mean and Standard Deviation
To apply a Gaussian-based method for detecting outliers, we first calculate the mean and
standard deviation of the dataset.
3. Compute the Z-score of Each Point
The Z-score measures how many standard deviations a value lies from the mean:
z = (x − μ) / σ, where μ is the mean and σ the standard deviation.
4. Define Outliers
Typically, in a Gaussian distribution, values that fall more than 2 or 3 standard deviations
from the mean are considered outliers; with the common threshold of 3, this corresponds to
Z-scores greater than +3 or less than -3.
● Outlier Threshold (Z-score): Use a Z-score threshold of ±3 to flag outliers. Any
point with a Z-score below -3 or above +3 is considered an outlier.
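A minimal sketch of this Z-score procedure in Python (NumPy is assumed; for this sample the mean is 164.9 and no point actually exceeds the ±3 cutoff):

import numpy as np

# Heights in cm from the example above
data = np.array([160, 162, 158, 170, 155, 165, 172, 180, 152, 175])

mean = data.mean()
std = data.std()  # population standard deviation

z_scores = (data - mean) / std
print("Mean:", mean, "Std:", round(std, 2))

# Flag points whose |z| exceeds the threshold
threshold = 3
print("Outliers:", data[np.abs(z_scores) > threshold])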
Proximity-Based Outlier Detection:
Proximity-based outlier detection methods identify outliers by measuring how
"far" a data point is from its neighbors. The basic idea is:
Normal points are close to many other points.
Outliers are far from most other points.
These methods are especially useful when the data doesn't follow a specific distribution
(non-parametric) or when you're working with multidimensional data.
Distance Based - e.g., k-Nearest Neighbors (k-NN) Distance
This method identifies outliers based on the distance to their k-th nearest neighbor.
● For each point, calculate the distance to its k-th nearest neighbor.
● If that distance is large compared to others, it's likely an outlier.
Pros: Simple and intuitive.
Cons: Sensitive to the choice of k; can be expensive on large datasets.
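A minimal sketch of the k-NN distance method, assuming scikit-learn and a small hypothetical 2-D dataset:

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical 2-D data with one far-away point
X = np.array([[1, 2], [2, 2], [2, 3], [1, 3], [8, 9]])

k = 2
# k + 1 neighbors because each point counts itself as its nearest neighbor
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
distances, _ = nn.kneighbors(X)

# Distance to the k-th nearest neighbor (last column)
kth_dist = distances[:, -1]
print(kth_dist)  # [8, 9] has a much larger k-NN distance than the rest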
Density Based
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
Although primarily a clustering algorithm, DBSCAN naturally identifies outliers as noise points.
● Groups together points that are closely packed (based on ε radius and minPts).
● Points that don’t belong to any cluster are considered outliers.
Great for:
● Data with clusters of different shapes.
● Automatically finding outliers without labeling.
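A minimal sketch with scikit-learn's DBSCAN; the eps (ε) and min_samples (minPts) values are assumptions that must be tuned per dataset:

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 2], [2, 2], [2, 3], [1, 3], [8, 9]])

labels = DBSCAN(eps=1.5, min_samples=2).fit_predict(X)

# DBSCAN labels noise points (outliers) as -1
print(labels)           # [0 0 0 0 -1]
print(X[labels == -1])  # the outlier [8, 9]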
Clustering-based outlier detection methods rely on the assumption that:
Normal data points belong to a cluster,
while outliers either don't belong to any cluster or belong to small, sparse clusters.
These methods are especially useful when your data naturally forms groups or patterns.
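As a sketch of the clustering-based idea, one common variant measures each point's distance to its cluster centroid and flags the farthest points (k-means with a single cluster is an assumption chosen for this toy data; with more clusters, points in very small clusters are also outlier candidates):

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [2, 2], [2, 3], [1, 3], [8, 9]])

km = KMeans(n_clusters=1, n_init=10, random_state=0).fit(X)

# Distance of each point to the centroid of its assigned cluster
dists = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
print(dists)  # [8, 9] lies far from the centroid and is the outlier candidate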
PARAMETRIC METHODS:
Grubbs' Test (Maximum Normed Residual Test)
Grubbs' test detects a single outlier in a univariate, approximately normal dataset. Compute
the test statistic G = max|x_i − x̄| / s, where x̄ is the sample mean and s the sample standard
deviation, then compare this G value with the critical value from Grubbs' distribution (based
on the t-distribution and sample size). If G exceeds the critical value, the suspect point is
declared an outlier.
When to Use:
● You have one suspect outlier.
● Data is normally distributed.
● Small to moderate sample size.
Example data: [12, 12.5, 13, 13.1, 13.2, 13.4, 14, 29] (29 is the suspect value).
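A minimal sketch of Grubbs' test on this data, assuming SciPy for the t-distribution quantile:

import numpy as np
from scipy import stats

data = np.array([12, 12.5, 13, 13.1, 13.2, 13.4, 14, 29])

n = len(data)
mean, s = data.mean(), data.std(ddof=1)  # sample standard deviation

# Grubbs' statistic for the most extreme point
G = np.max(np.abs(data - mean)) / s

# Two-sided critical value at significance alpha
alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / (2 * n), n - 2)
G_crit = ((n - 1) / np.sqrt(n)) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))

print(f"G = {G:.3f}, critical = {G_crit:.3f}")  # G ~ 2.44 exceeds ~2.13
if G > G_crit:
    print("Outlier:", data[np.argmax(np.abs(data - mean))])  # 29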
Detection of Multivariate Outliers
Given Data:
[4, 2], [2, 4], [2, 6], [5, 3], [8, 7], [50, 60]
Steps:
1. Compute the Mahalanobis distance, which accounts for correlation between the variables.
2. Under normality, the squared Mahalanobis distance follows a Chi-square distribution.
3. Any point whose squared Mahalanobis distance exceeds the critical Chi-square
value is flagged as an outlier.
Step 1: Compute the Mean Vector
Step 2: Compute the Covariance Matrix
Step 3: Invert the Covariance Matrix
Step 4: Compute Mahalanobis Distance for Observation 5: (8, 7)
Step 5: Compare with Chi-square Threshold
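A minimal sketch of these five steps, assuming NumPy and SciPy. Note that with only six observations the extreme point inflates the covariance estimate (the "masking" effect), so this toy sample also shows why robust covariance estimators are preferred in practice:

import numpy as np
from scipy import stats

X = np.array([[4, 2], [2, 4], [2, 6], [5, 3], [8, 7], [50, 60]])

mean = X.mean(axis=0)          # Step 1: mean vector
cov = np.cov(X, rowvar=False)  # Step 2: covariance matrix
cov_inv = np.linalg.inv(cov)   # Step 3: inverse covariance

# Step 4: squared Mahalanobis distance of every observation
diff = X - mean
d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)
print(np.round(d2, 2))  # [50, 60] has the largest value (~4.1)

# Step 5: compare with the Chi-square critical value (df = 2 variables)
threshold = stats.chi2.ppf(0.95, df=2)  # ~5.99

# Caveat: in a sample this small the extreme point inflates the covariance,
# capping every d2 below (n-1)^2/n ~ 4.17, so nothing crosses the cutoff here;
# with more data, or a robust estimate such as sklearn.covariance.MinCovDet,
# [50, 60] is flagged clearly.
print(X[d2 > threshold])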
Using Mixture of Parametric Distributions
Use a Mixture of Parametric Distributions (like Gaussians) to model the data and
identify low-probability points (i.e., outliers).
Steps:
A Gaussian Mixture Model (GMM) assumes that:
● Your data is generated from K Gaussian distributions (clusters),
● Each cluster has its own mean, variance, and weight,
● An outlier is a point that has low probability under all components of the mixture.
Eg: Data: [4, 5, 5, 6, 7, 100]
Step 1: Fit a Gaussian Mixture Model (GMM)
G1: captures the normal data (centered around 5-6)
G2: could try to capture extreme points like 100
(The model can be fit using Expectation-Maximization (EM).)
Step 2: Compute Likelihood for Each Data Point
Identify the Outliers.
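A minimal sketch with scikit-learn's GaussianMixture. The raised reg_covar is an assumption for this tiny dataset: it stops a component from collapsing onto the single extreme point with near-zero variance, which would give 100 a deceptively high density. The cutoff quantile is likewise an assumption to tune:

import numpy as np
from sklearn.mixture import GaussianMixture

data = np.array([4, 5, 5, 6, 7, 100], dtype=float).reshape(-1, 1)

# Step 1: fit a 2-component GMM via Expectation-Maximization
gmm = GaussianMixture(n_components=2, reg_covar=1.0, random_state=0)
gmm.fit(data)

# Step 2: log-likelihood of each point under the whole mixture
scores = gmm.score_samples(data)
print(np.round(scores, 2))

# Step 3: flag the lowest-likelihood point(s); 100 scores lowest here
cutoff = np.quantile(scores, 0.1)
print(data[scores <= cutoff].ravel())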
Non-Parametric Methods: Detection Using Histogram
Use a histogram to estimate the density of a dataset. Outliers are then defined as values that
fall into low-frequency (rare) bins.
This method:
● Requires no assumption about the underlying distribution
● Is easy to implement and visualize
● Works best for 1D (univariate) data
Data = [5, 6, 6, 7, 8, 5, 6, 7, 6, 100]
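A minimal sketch with NumPy; the bin count and the rarity cutoff are assumptions to tune:

import numpy as np

data = np.array([5, 6, 6, 7, 8, 5, 6, 7, 6, 100])

# Estimate density with a histogram
counts, edges = np.histogram(data, bins=10)

# Map each value to its bin (clip so the maximum lands in the last bin)
bin_idx = np.clip(np.digitize(data, edges) - 1, 0, len(counts) - 1)

# Values falling in low-frequency (rare) bins are flagged as outliers
print(data[counts[bin_idx] <= 1])  # [100]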
Major Statistical Data Mining Methods
Regression is a statistical method used to model the relationship between a dependent
variable and one or more independent variables.
When to apply?
Regression analysis is applied when the goal is to predict a continuous outcome, understand
the strength of the relationships between variables, or make forecasts based on historical data.
There are different types of regression, including linear, polynomial, and logistic regression; a minimal linear regression sketch is shown below.
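A minimal linear regression sketch, assuming scikit-learn and hypothetical advertising-spend data:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: advertising spend (x) vs. sales (y)
X = np.array([[10], [20], [30], [40], [50]])
y = np.array([25, 45, 62, 85, 105])

model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)  # fitted slope and intercept

# Predict a continuous outcome for a new observation
print(model.predict([[60]]))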
A Generalized Linear Model (GLM) is used when your data does not meet the
assumptions of standard linear regression — especially when the dependent variable is not
normally distributed.
When to Apply
● When the dependent variable is not continuous or not normally distributed.
● When modeling count data, binary outcomes, or rates.
● When there is a non-linear relationship between the response and the predictors.
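A minimal GLM sketch using statsmodels, fitting a logistic regression (Binomial family) to hypothetical pass/fail data:

import numpy as np
import statsmodels.api as sm

# Hypothetical binary outcome (pass/fail) vs. hours studied
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8])
passed = np.array([0, 0, 0, 1, 0, 1, 1, 1])

X = sm.add_constant(hours)
model = sm.GLM(passed, X, family=sm.families.Binomial()).fit()
print(model.summary())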
Analysis of Variance (ANOVA) is a statistical method used to determine whether
there are significant differences between the means of three or more groups. It helps you
test if at least one group mean is different from the others, without running multiple t-tests
(which increases error rate).
When to Apply
● When comparing 3 or more group means.
● The dependent variable is continuous (e.g., test scores, weight).
● The independent variable is categorical (e.g., different treatments, groups).
● Assumes normally distributed data, equal variances, and independent samples.
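A minimal one-way ANOVA sketch with SciPy, comparing hypothetical test scores from three teaching methods:

from scipy import stats

group_a = [85, 88, 90, 86, 87]
group_b = [78, 82, 80, 79, 81]
group_c = [91, 93, 90, 92, 94]

# One-way ANOVA: are the three group means all equal?
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# A small p-value (< 0.05) suggests at least one group mean differs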
A Mixed Effect Model is used when your data has:
● Fixed effects: factors you’re mainly interested in (e.g., treatment, time, gender).
● Random effects: natural groupings in your data that might affect the outcome (e.g.,
people, schools, cities).
Mixed models help account for variation within and between groups.
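A minimal sketch with statsmodels, fitting a random intercept per school to hypothetical score data:

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: test scores over time, grouped by school
df = pd.DataFrame({
    "score":  [70, 75, 80, 65, 68, 72, 88, 90, 93],
    "time":   [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "school": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
})

# Fixed effect: time; random effect: a separate intercept for each school
model = smf.mixedlm("score ~ time", df, groups=df["school"]).fit()
print(model.summary())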
Factor Analysis is a statistical method used to identify underlying relationships or
latent variables (factors) that explain the correlations among observed variables. It’s commonly
used in psychology, social sciences, and other fields to reduce the number of variables in a
dataset by grouping them into fewer factors that still capture most of the information.
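A minimal sketch with scikit-learn's FactorAnalysis; the synthetic data and the choice of two factors are assumptions:

import numpy as np
from sklearn.decomposition import FactorAnalysis

# Synthetic stand-in for survey responses: 100 respondents, 6 observed variables
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))

# Reduce the six observed variables to two latent factors
fa = FactorAnalysis(n_components=2, random_state=0)
scores = fa.fit_transform(X)

print(fa.components_.shape)  # (2, 6): loadings of each variable on each factor
print(scores.shape)          # (100, 2): factor scores per respondent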
Discriminant Analysis is a statistical technique used to classify a set of
observations into predefined categories. It is particularly useful when you want to predict the
category or class of a new observation based on its features.
The goal of discriminant analysis is to find a combination of predictors that best separates the
different classes in the data.
● Can be used for both classification and dimensionality reduction.
● Works well when the data is normally distributed and classes are well-separated.
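A minimal Linear Discriminant Analysis sketch with scikit-learn on hypothetical two-class data:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical two-feature observations with known class labels
X = np.array([[1, 2], [2, 3], [3, 3], [8, 8], [9, 10], [10, 9]])
y = np.array([0, 0, 0, 1, 1, 1])

lda = LinearDiscriminantAnalysis().fit(X, y)

# Classify a new observation
print(lda.predict([[4, 4]]))  # -> [0], closer to the first class

# The same model can project onto at most (classes - 1) discriminant axes
print(lda.transform(X).shape)  # (6, 1)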
Survival Analysis is a branch of statistics focused on analyzing the time until an event of
interest occurs.
The key challenge in survival analysis is handling censored data, where the event has not yet
occurred for some subjects during the study period.
Applications of Survival Analysis:
● Medical Research: To study the time to death, disease progression, or recovery from a
treatment.
● Reliability Engineering: To analyze the time to failure of machinery or products.
● Customer Churn: To predict how long customers will stay before leaving a service (e.g.,
telecom, subscription services).
● Economics: For modeling the time until an event like bankruptcy or employment.
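A minimal Kaplan-Meier sketch, assuming the lifelines library and hypothetical follow-up data:

from lifelines import KaplanMeierFitter

# Hypothetical follow-up times (months) and event flags
# (1 = event occurred, 0 = censored: still event-free at last contact)
durations = [5, 8, 12, 12, 15, 20, 22, 30]
events    = [1, 1, 1, 0, 1, 0, 1, 0]

kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed=events)

# Estimated probability of surviving past each observed time point
print(kmf.survival_function_)
print(kmf.median_survival_time_)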
Supervised Learning-Based Outlier Detection Methods:
1. Classification-Based Outlier Detection:
○ In supervised learning, outlier detection can be framed as a classification
problem, where the task is to classify data points as either "normal" (inlier) or
"outlier."
○ The model is trained on a labeled dataset where some data points are marked as
outliers and others as normal.
Examples: One-Class SVM, k-Nearest Neighbors (k-NN)
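A minimal sketch of the classification framing, using a k-NN classifier on hypothetical labeled transaction amounts:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Labeled training data: 0 = normal, 1 = outlier (hypothetical amounts)
X_train = np.array([[50], [52], [48], [51], [49], [500], [450]])
y_train = np.array([0, 0, 0, 0, 0, 1, 1])

clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# Classify new transactions as normal or outlier
print(clf.predict([[47], [5000]]))  # -> [0 1]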
Semi-Supervised Outlier Detection: Overview + Example
Semi-supervised outlier detection lies between supervised and unsupervised learning. It is
used when you have a small amount of labeled data (typically normal instances) and a
larger amount of unlabeled data (which may include outliers).
Example: One-Class SVM, trained on the labeled normal instances and then applied to the unlabeled data.
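A minimal sketch: a One-Class SVM is trained only on the labeled normal instances (hypothetical amounts), then used to score unlabeled data; nu and gamma are assumptions to tune:

import numpy as np
from sklearn.svm import OneClassSVM

# Labeled normal transactions only
X_normal = np.array([[50], [52], [48], [51], [49], [50], [53]])

# Learn the boundary of "normal" from normal data alone
oc_svm = OneClassSVM(nu=0.05, gamma="scale").fit(X_normal)

# Score unlabeled data: +1 = normal, -1 = outlier
X_new = np.array([[50], [52], [5000]])
print(oc_svm.predict(X_new))  # typically [ 1  1 -1]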