Unit-5 - Outliers and Statistical Approaches in Data Mining
Introduction to outliers, Challenges in detecting Outliers, Outlier Detection Methods -
Supervised, Semi-supervised, Unsupervised - Statistical Data Mining approaches - Data mining
in Recommender Systems, Data mining for Intrusion Detection, Data Mining for Financial
Analysis
Outlier:
A data object that deviates significantly from the normal objects, as if it were generated
by a different mechanism. Ex.: an unusual credit card purchase.
Outliers are different from noise data. Noise is random error or variance in a
measured variable, and it should be removed before outlier detection.
Outlier detection vs. novelty detection: at an early stage an object may be flagged as an
outlier, but it may later be merged into the model as a new normal pattern.
Applications: credit card fraud detection, telecom fraud detection, customer
segmentation, medical analysis.
Example: If a customer who usually spends $50 suddenly makes a $5,000 purchase, that could
be flagged as an outlier, potentially indicating credit card fraud.
There are three types of outliers: global, contextual, and collective.
1. Global Outlier (or Point Anomaly):
○ A global outlier is an object that significantly deviates from the entire dataset.
○ Example: In network intrusion detection, an unusual data packet that doesn't
match any typical behavior can be considered a global outlier.
2. Contextual Outlier (or Conditional Outlier):
○ A contextual outlier deviates significantly when considering a specific context, like
time or location.
○ Example: A temperature of 80°F in Urbana could be an outlier in winter, but not
in summer.
○ Attributes are divided into:
■ Contextual attributes: Define the context (e.g., time, location).
■ Behavioral attributes: Characteristics of the object used to evaluate the
outlier (e.g., temperature).
○ This can be viewed as a generalization of local outliers, where the object’s
density significantly deviates from its local area.
3. Collective Outlier:
○ This type involves a group of data objects that, together, deviate significantly from
the rest of the dataset (but may not be outliers individually).
Issues: For contextual outliers, the challenge lies in defining a meaningful context for
evaluation.
Collective Outliers occur when a group of data objects collectively deviate significantly from the
rest of the dataset, even if the individual objects within the group aren't outliers themselves. The
outlier behavior is exhibited when considered as a group.
Example: In intrusion detection systems, multiple computers in a network may individually send
harmless data packets, but when several computers simultaneously start sending
denial-of-service (DoS) packets to each other, they form a collective outlier. Individually, the
computers might not be performing anything suspicious, but together, their behavior indicates
an anomaly.
Detection of Collective Outliers:
● To detect collective outliers, it's important to consider the behavior of groups of objects,
rather than just individual ones.
● Background knowledge, such as relationships or similarities between data objects (e.g.,
distance metrics or similarity measures), is essential to identifying collective outliers.
Multiple Outlier Types:
● A dataset may contain different types of outliers (e.g., global, contextual, collective), and
an object can belong to more than one type of outlier depending on the analysis context.
This makes collective outlier detection more complex, as it requires understanding the
relationships among data objects and how they interact within the group.
Challenges of Outlier Detection:
1. Modeling Normal Objects and Outliers Properly:
○ It's difficult to accurately define what constitutes "normal" behavior because it can
vary significantly across different contexts.
○ Example: In a clinic setting, a slight deviation in a patient's vital signs (e.g.,
heart rate) may be considered an outlier, as even small changes could signal a
health issue. In contrast, in marketing analysis, larger fluctuations in customer
behavior (e.g., spending) might be considered normal, so smaller variations could
be flagged as outliers.
2. The Border Between Normal and Outlier Objects is Often a Gray Area:
○ Determining the exact threshold between normal data and outliers can be
ambiguous. The distinction is not always clear-cut.
○ Example: A temperature of 26.67°C (80°F) might be an outlier in a winter
context in Delhi but normal during the summer. The line between normal and
outlier depends heavily on the time of year, making it difficult to define with
precision.
3. Application-Specific Outlier Detection:
○ The choice of distance measures and relationships among objects depends on
the specific application, meaning that outlier detection techniques must be
tailored to each scenario.
○ Example: In fraud detection (e.g., credit card fraud), a sudden, large purchase
far from the customer's usual spending pattern might be flagged as an outlier.
However, in sports analysis, a player's performance like Michael Jordan’s
high-scoring games might be considered normal for a legend, even though the
same performance in a different context could be flagged as an outlier.
4. Handling Noise in Outlier Detection:
○ Noise—random errors or fluctuations in data—can interfere with the detection
process, making it hard to distinguish between actual outliers and random
variance.
○ Example: If you are tracking website traffic and there's a sudden spike due to a
server glitch (noise), this might obscure genuine outlier data like a real
marketing campaign driving a surge in visitors. The noise can make it harder to
identify the real outliers (the successful marketing campaign).
5. Understandability:
○ Once an outlier is detected, it’s important to explain why it is considered an
outlier. The rationale for the detection must be clear and justifiable.
○ Example: If an employee’s sales in one month are drastically lower than in
others, we need to explain whether this is due to personal issues, seasonal
fluctuations, or an actual error in reporting. The outlier should be explained in
terms of how likely it is to have occurred within the expected normal range.
6. Specify the Degree of an Outlier:
○ Sometimes, it's necessary to quantify how much of an outlier an object is, i.e.,
how unlikely it is that the object was generated by the normal mechanism.
○ Example: In bank fraud detection, a purchase of $10,000 when the user
typically spends $100 might not just be an outlier—it could be an extreme outlier.
Specifying the degree of deviation from normal behavior helps prioritize actions,
such as flagging it as "highly suspicious" for further investigation.
These challenges highlight the complexity of outlier detection, where context, application, noise,
and interpretability all play significant roles.
Statistical Model for Outlier Detection
A Gaussian distribution (or normal distribution) is often used in detecting outliers, as many
datasets tend to follow a normal distribution or can be approximated as normal. In this context,
outliers are typically defined as data points that fall far away from the mean, often measured
using the Z-score or by calculating a threshold based on the standard deviation of the Gaussian
distribution.
Example of Outlier Detection using a Gaussian Distribution
Let's say you have a dataset and you want to detect outliers assuming that the data follows a
Gaussian distribution.
1. Data and Assumption
Assume you have the following dataset of values representing some measurement (e.g.,
heights in cm):
[160, 162, 158, 170, 155, 165, 172, 180, 152, 175]
For this example, let's assume the data follows a Gaussian distribution (or normal distribution),
and we'll detect any outliers using the Z-score method.
2. Calculate the Mean and Standard Deviation
To apply a Gaussian-based method for detecting outliers, we first calculate the mean and
standard deviation of the dataset.
3. Compute the Z-score of Each Point
The Z-score measures how many standard deviations a value lies from the mean:
z = (x − μ) / σ, where μ is the mean and σ the standard deviation.
4. Define Outliers
Typically, in a Gaussian distribution, values that fall more than 2 or 3 standard deviations
from the mean are considered outliers; with the common threshold of 3, this corresponds to
Z-scores greater than +3 or less than -3.
● Outlier Threshold (Z-score): Use a Z-score threshold of ±3 to flag outliers. Any
point with a Z-score below -3 or above +3 is considered an outlier.
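A minimal sketch of this Z-score procedure in Python (NumPy is assumed; for this sample the mean is 164.9 and no point actually exceeds the ±3 cutoff):

import numpy as np

# Heights in cm from the example above
data = np.array([160, 162, 158, 170, 155, 165, 172, 180, 152, 175])

mean = data.mean()
std = data.std()  # population standard deviation

z_scores = (data - mean) / std
print("Mean:", mean, "Std:", round(std, 2))

# Flag points whose |z| exceeds the threshold
threshold = 3
print("Outliers:", data[np.abs(z_scores) > threshold])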
Proximity-Based Outlier Detection:
Proximity-based outlier detection methods identify outliers by measuring how
"far" a data point is from its neighbors. The basic idea is:
Normal points are close to many other points.
Outliers are far from most other points.
These methods are especially useful when the data doesn't follow a specific distribution
(non-parametric) or when you're working with multidimensional data.
Distance Based - e.g., k-Nearest Neighbors (k-NN) Distance
This method identifies outliers based on the distance to their k-th nearest neighbor.
● For each point, calculate the distance to its k-th nearest neighbor.
● If that distance is large compared to others, it's likely an outlier.
Pros: Simple and intuitive.
Cons: Sensitive to the choice of k; can be expensive on large datasets.
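A minimal sketch of the k-NN distance method, assuming scikit-learn and a small hypothetical 2-D dataset:

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical 2-D data with one far-away point
X = np.array([[1, 2], [2, 2], [2, 3], [1, 3], [8, 9]])

k = 2
# k + 1 neighbors because each point counts itself as its nearest neighbor
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
distances, _ = nn.kneighbors(X)

# Distance to the k-th nearest neighbor (last column)
kth_dist = distances[:, -1]
print(kth_dist)  # [8, 9] has a much larger k-NN distance than the rest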
Density Based
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
Although primarily a clustering algorithm, DBSCAN naturally identifies outliers as noise points.
● Groups together points that are closely packed (based on ε radius and minPts).
● Points that don’t belong to any cluster are considered outliers.
Great for:
● Data with clusters of different shapes.
● Automatically finding outliers without labeling.
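A minimal sketch with scikit-learn's DBSCAN; the eps (ε) and min_samples (minPts) values are assumptions that must be tuned per dataset:

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 2], [2, 2], [2, 3], [1, 3], [8, 9]])

labels = DBSCAN(eps=1.5, min_samples=2).fit_predict(X)

# DBSCAN labels noise points (outliers) as -1
print(labels)           # [0 0 0 0 -1]
print(X[labels == -1])  # the outlier [8, 9]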
Clustering-based outlier detection methods rely on the assumption that:
Normal data points belong to a cluster,
while outliers either don't belong to any cluster or belong to small, sparse clusters.
These methods are especially useful when your data naturally forms groups or patterns.
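As a sketch of the clustering-based idea, one common variant measures each point's distance to its cluster centroid and flags the farthest points (k-means with a single cluster is an assumption chosen for this toy data; with more clusters, points in very small clusters are also outlier candidates):

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [2, 2], [2, 3], [1, 3], [8, 9]])

km = KMeans(n_clusters=1, n_init=10, random_state=0).fit(X)

# Distance of each point to the centroid of its assigned cluster
dists = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
print(dists)  # [8, 9] lies far from the centroid and is the outlier candidate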
PARAMETRIC METHODS:
Grubbs' Test (Maximum Normed Residual Test)
Grubbs' test detects a single outlier in a univariate, approximately normal dataset. Compute
the test statistic G = max|x_i − x̄| / s, where x̄ is the sample mean and s the sample standard
deviation, then compare this G value with the critical value from Grubbs' distribution (based
on the t-distribution and sample size). If G exceeds the critical value, the suspect point is
declared an outlier.
When to Use:
● You have one suspect outlier.
● Data is normally distributed.
● Small to moderate sample size.
Example data: [12, 12.5, 13, 13.1, 13.2, 13.4, 14, 29] (29 is the suspect value).
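A minimal sketch of Grubbs' test on this data, assuming SciPy for the t-distribution quantile:

import numpy as np
from scipy import stats

data = np.array([12, 12.5, 13, 13.1, 13.2, 13.4, 14, 29])

n = len(data)
mean, s = data.mean(), data.std(ddof=1)  # sample standard deviation

# Grubbs' statistic for the most extreme point
G = np.max(np.abs(data - mean)) / s

# Two-sided critical value at significance alpha
alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / (2 * n), n - 2)
G_crit = ((n - 1) / np.sqrt(n)) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))

print(f"G = {G:.3f}, critical = {G_crit:.3f}")  # G ~ 2.44 exceeds ~2.13
if G > G_crit:
    print("Outlier:", data[np.argmax(np.abs(data - mean))])  # 29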
Detection of Multivariate Outliers
Given Data:
[4, 2], [2, 4], [2, 6], [5, 3], [8, 7], [50, 60]
Steps:
1. Compute the Mahalanobis distance, which accounts for correlation between the variables.
2. Under normality, the squared Mahalanobis distance follows a Chi-square distribution.
3. Any point whose squared Mahalanobis distance exceeds the critical Chi-square
value is flagged as an outlier.
Step 1: Compute the Mean Vector
Step 2: Compute the Covariance Matrix
Step 3: Invert the Covariance Matrix
Step 4: Compute Mahalanobis Distance for Observation 5: (8, 7)
Step 5: Compare with Chi-square Threshold
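A minimal sketch of these five steps, assuming NumPy and SciPy. Note that with only six observations the extreme point inflates the covariance estimate (the "masking" effect), so this toy sample also shows why robust covariance estimators are preferred in practice:

import numpy as np
from scipy import stats

X = np.array([[4, 2], [2, 4], [2, 6], [5, 3], [8, 7], [50, 60]])

mean = X.mean(axis=0)          # Step 1: mean vector
cov = np.cov(X, rowvar=False)  # Step 2: covariance matrix
cov_inv = np.linalg.inv(cov)   # Step 3: inverse covariance

# Step 4: squared Mahalanobis distance of every observation
diff = X - mean
d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)
print(np.round(d2, 2))  # [50, 60] has the largest value (~4.1)

# Step 5: compare with the Chi-square critical value (df = 2 variables)
threshold = stats.chi2.ppf(0.95, df=2)  # ~5.99

# Caveat: in a sample this small the extreme point inflates the covariance,
# capping every d2 below (n-1)^2/n ~ 4.17, so nothing crosses the cutoff here;
# with more data, or a robust estimate such as sklearn.covariance.MinCovDet,
# [50, 60] is flagged clearly.
print(X[d2 > threshold])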
Using Mixture of Parametric Distributions
Use a Mixture of Parametric Distributions (like Gaussians) to model the data and
identify low-probability points (i.e., outliers).
Steps:
A Gaussian Mixture Model (GMM) assumes that:
● Your data is generated from K Gaussian distributions (clusters),
● Each cluster has its own mean, variance, and weight,
● An outlier is a point that has low probability under all components of the mixture.
Eg: Data: [4, 5, 5, 6, 7, 100]
Step 1: Fit a Gaussian Mixture Model (GMM)
G1: captures the normal data (centered around 5-6)
G2: could try to capture extreme points like 100
(The model can be fit using Expectation-Maximization (EM).)
Step 2: Compute Likelihood for Each Data Point
Identify the Outliers.
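A minimal sketch with scikit-learn's GaussianMixture. The raised reg_covar is an assumption for this tiny dataset: it stops a component from collapsing onto the single extreme point with near-zero variance, which would give 100 a deceptively high density. The cutoff quantile is likewise an assumption to tune:

import numpy as np
from sklearn.mixture import GaussianMixture

data = np.array([4, 5, 5, 6, 7, 100], dtype=float).reshape(-1, 1)

# Step 1: fit a 2-component GMM via Expectation-Maximization
gmm = GaussianMixture(n_components=2, reg_covar=1.0, random_state=0)
gmm.fit(data)

# Step 2: log-likelihood of each point under the whole mixture
scores = gmm.score_samples(data)
print(np.round(scores, 2))

# Step 3: flag the lowest-likelihood point(s); 100 scores lowest here
cutoff = np.quantile(scores, 0.1)
print(data[scores <= cutoff].ravel())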
Non-Parametric Methods: Detection Using Histogram
Use a histogram to estimate the density of a dataset. Outliers are then defined as values that
fall into low-frequency (rare) bins.
This method:
● Requires no assumption about the underlying distribution
● Is easy to implement and visualize
● Works best for 1D (univariate) data
Data = [5, 6, 6, 7, 8, 5, 6, 7, 6, 100]
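A minimal sketch with NumPy; the bin count and the rarity cutoff are assumptions to tune:

import numpy as np

data = np.array([5, 6, 6, 7, 8, 5, 6, 7, 6, 100])

# Estimate density with a histogram
counts, edges = np.histogram(data, bins=10)

# Map each value to its bin (clip so the maximum lands in the last bin)
bin_idx = np.clip(np.digitize(data, edges) - 1, 0, len(counts) - 1)

# Values falling in low-frequency (rare) bins are flagged as outliers
print(data[counts[bin_idx] <= 1])  # [100]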
Major Statistical Data Mining Methods
Regression is a statistical method used to model the relationship between a dependent
variable and one or more independent variables.
When to apply?
Regression analysis is applied when the goal is to predict a continuous outcome, understand
the strength of the relationships between variables, or make forecasts based on historical data.
There are different types of regression, including linear, polynomial, and logistic regression; a minimal linear regression sketch is shown below.
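A minimal linear regression sketch, assuming scikit-learn and hypothetical advertising-spend data:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: advertising spend (x) vs. sales (y)
X = np.array([[10], [20], [30], [40], [50]])
y = np.array([25, 45, 62, 85, 105])

model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)  # fitted slope and intercept

# Predict a continuous outcome for a new observation
print(model.predict([[60]]))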
A Generalized Linear Model (GLM) is used when your data does not meet the
assumptions of standard linear regression — especially when the dependent variable is not
normally distributed.
When to Apply
● When the dependent variable is not continuous or not normally distributed.
● When modeling count data, binary outcomes, or rates.
● When there is a non-linear relationship between the response and the predictors.
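A minimal GLM sketch using statsmodels, fitting a logistic regression (Binomial family) to hypothetical pass/fail data:

import numpy as np
import statsmodels.api as sm

# Hypothetical binary outcome (pass/fail) vs. hours studied
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8])
passed = np.array([0, 0, 0, 1, 0, 1, 1, 1])

X = sm.add_constant(hours)
model = sm.GLM(passed, X, family=sm.families.Binomial()).fit()
print(model.summary())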
Analysis of Variance (ANOVA) is a statistical method used to determine whether
there are significant differences between the means of three or more groups. It helps you
test if at least one group mean is different from the others, without running multiple t-tests
(which increases error rate).
When to Apply
● When comparing 3 or more group means.
● The dependent variable is continuous (e.g., test scores, weight).
● The independent variable is categorical (e.g., different treatments, groups).
● Assumes normally distributed data, equal variances, and independent samples.
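A minimal one-way ANOVA sketch with SciPy, comparing hypothetical test scores from three teaching methods:

from scipy import stats

group_a = [85, 88, 90, 86, 87]
group_b = [78, 82, 80, 79, 81]
group_c = [91, 93, 90, 92, 94]

# One-way ANOVA: are the three group means all equal?
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# A small p-value (< 0.05) suggests at least one group mean differs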
A Mixed Effect Model is used when your data has:
● Fixed effects: factors you’re mainly interested in (e.g., treatment, time, gender).
● Random effects: natural groupings in your data that might affect the outcome (e.g.,
people, schools, cities).
Mixed models help account for variation within and between groups.
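A minimal sketch with statsmodels, fitting a random intercept per school to hypothetical score data:

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: test scores over time, grouped by school
df = pd.DataFrame({
    "score":  [70, 75, 80, 65, 68, 72, 88, 90, 93],
    "time":   [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "school": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
})

# Fixed effect: time; random effect: a separate intercept for each school
model = smf.mixedlm("score ~ time", df, groups=df["school"]).fit()
print(model.summary())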
Factor Analysis is a statistical method used to identify underlying relationships or
latent variables (factors) that explain the correlations among observed variables. It’s commonly
used in psychology, social sciences, and other fields to reduce the number of variables in a
dataset by grouping them into fewer factors that still capture most of the information.
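A minimal sketch with scikit-learn's FactorAnalysis; the synthetic data and the choice of two factors are assumptions:

import numpy as np
from sklearn.decomposition import FactorAnalysis

# Synthetic stand-in for survey responses: 100 respondents, 6 observed variables
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))

# Reduce the six observed variables to two latent factors
fa = FactorAnalysis(n_components=2, random_state=0)
scores = fa.fit_transform(X)

print(fa.components_.shape)  # (2, 6): loadings of each variable on each factor
print(scores.shape)          # (100, 2): factor scores per respondent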
Discriminant Analysis is a statistical technique used to classify a set of
observations into predefined categories. It is particularly useful when you want to predict the
category or class of a new observation based on its features.
The goal of discriminant analysis is to find a combination of predictors that best separates the
different classes in the data.
● Can be used for both classification and dimensionality reduction.
● Works well when the data is normally distributed and classes are well-separated.
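A minimal Linear Discriminant Analysis sketch with scikit-learn on hypothetical two-class data:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical two-feature observations with known class labels
X = np.array([[1, 2], [2, 3], [3, 3], [8, 8], [9, 10], [10, 9]])
y = np.array([0, 0, 0, 1, 1, 1])

lda = LinearDiscriminantAnalysis().fit(X, y)

# Classify a new observation
print(lda.predict([[4, 4]]))  # -> [0], closer to the first class

# The same model can project onto at most (classes - 1) discriminant axes
print(lda.transform(X).shape)  # (6, 1)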
Survival Analysis is a branch of statistics focused on analyzing the time until an event of
interest occurs.
The key challenge in survival analysis is handling censored data, where the event has not yet
occurred for some subjects during the study period.
Applications of Survival Analysis:
● Medical Research: To study the time to death, disease progression, or recovery from a
treatment.
● Reliability Engineering: To analyze the time to failure of machinery or products.
● Customer Churn: To predict how long customers will stay before leaving a service (e.g.,
telecom, subscription services).
● Economics: For modeling the time until an event like bankruptcy or employment.
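A minimal Kaplan-Meier sketch, assuming the lifelines library and hypothetical follow-up data:

from lifelines import KaplanMeierFitter

# Hypothetical follow-up times (months) and event flags
# (1 = event occurred, 0 = censored: still event-free at last contact)
durations = [5, 8, 12, 12, 15, 20, 22, 30]
events    = [1, 1, 1, 0, 1, 0, 1, 0]

kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed=events)

# Estimated probability of surviving past each observed time point
print(kmf.survival_function_)
print(kmf.median_survival_time_)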
Supervised Learning-Based Outlier Detection Methods:
1. Classification-Based Outlier Detection:
○ In supervised learning, outlier detection can be framed as a classification
problem, where the task is to classify data points as either "normal" (inlier) or
"outlier."
○ The model is trained on a labeled dataset where some data points are marked as
outliers and others as normal.
Examples: One-Class SVM, k-Nearest Neighbors (k-NN)
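A minimal sketch of the classification framing, using a k-NN classifier on hypothetical labeled transaction amounts:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Labeled training data: 0 = normal, 1 = outlier (hypothetical amounts)
X_train = np.array([[50], [52], [48], [51], [49], [500], [450]])
y_train = np.array([0, 0, 0, 0, 0, 1, 1])

clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# Classify new transactions as normal or outlier
print(clf.predict([[47], [5000]]))  # -> [0 1]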
Semi-Supervised Outlier Detection: Overview + Example
Semi-supervised outlier detection lies between supervised and unsupervised learning. It is
used when you have a small amount of labeled data (typically normal instances) and a
larger amount of unlabeled data (which may include outliers).
Example: One-Class SVM, trained on the labeled normal instances and then applied to the unlabeled data.
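A minimal sketch: a One-Class SVM is trained only on the labeled normal instances (hypothetical amounts), then used to score unlabeled data; nu and gamma are assumptions to tune:

import numpy as np
from sklearn.svm import OneClassSVM

# Labeled normal transactions only
X_normal = np.array([[50], [52], [48], [51], [49], [50], [53]])

# Learn the boundary of "normal" from normal data alone
oc_svm = OneClassSVM(nu=0.05, gamma="scale").fit(X_normal)

# Score unlabeled data: +1 = normal, -1 = outlier
X_new = np.array([[50], [52], [5000]])
print(oc_svm.predict(X_new))  # typically [ 1  1 -1]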