Clustering-Based approaches for outlier detection in data mining
Last Updated: 10 Sep, 2021
Cluster analysis is the process of partitioning a set of data objects into subsets. Each subset is a cluster, such that the objects in a cluster are similar to one another and dissimilar to objects in other clusters. The set of clusters resulting from a cluster analysis can be referred to as a clustering. For example: segmenting the customers of a retail market into groups such as frequent customers and new customers.
Basic approaches in Clustering:
Partition Methods:
Used to find mutually exclusive, spherical clusters. These methods are distance-based. They use an iterative relocation technique to improve the partitioning by moving objects from one cluster to another. The center of a cluster can be represented by the mean (as in k-means) or by a medoid, a central object of the cluster (as in k-medoids). They are effective for small to medium-sized data sets.
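As a rough illustration of iterative relocation, here is a minimal k-means sketch in plain Python. The function name `kmeans`, the fixed iteration count, the hand-picked starting centers and the sample points are all assumptions made for this example; real implementations usually pick starting centers more carefully and stop when the assignments no longer change.

```python
import math

def kmeans(points, centers, iterations=10):
    """Alternate between assigning points to the nearest center and recomputing means."""
    groups = []
    for _ in range(iterations):
        # Relocation step: each point moves to the cluster with the nearest center.
        groups = [[] for _ in centers]
        for p in points:
            k = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            groups[k].append(p)
        # Update step: each center becomes the mean of its members
        # (an empty cluster keeps its previous center).
        centers = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else old
            for g, old in zip(groups, centers)
        ]
    return centers, groups

points = [(0, 0), (0, 2), (2, 0), (2, 2), (10, 10), (10, 12), (12, 10), (12, 12)]
centers, groups = kmeans(points, centers=[(0, 0), (12, 12)])
print(centers)
```

Each object ends up in exactly one cluster, which is what "mutually exclusive" means here.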
Hierarchical Methods:
Creates a hierarchical decomposition of the given set of data objects. These methods can be distance-based or density- and continuity-based. They are classified as agglomerative (bottom-up, merging clusters) or divisive (top-down, splitting clusters).
Density-Based Methods:
These methods take a density-based approach and can find arbitrarily shaped clusters. The general idea is to continue growing a given cluster as long as the density (the number of objects in the neighborhood) exceeds some threshold. They mainly consider exclusive clusters only, not fuzzy clusters. They can be extended from full-space to subspace clustering.
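The idea of growing a cluster while the neighborhood stays dense can be illustrated with a small DBSCAN-style sketch. The article does not name a specific algorithm, so the choice of DBSCAN, the parameter names `eps` and `min_pts`, and the sample points are assumptions made for illustration. Points that never fall into a dense neighborhood keep the label -1 and belong to no cluster.

```python
import math

def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE            # not dense enough to start a cluster
            continue
        labels[i] = cluster_id
        queue = list(neighbors)
        while queue:                     # keep growing while density holds
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id   # border point: joins, but is not expanded
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)
        cluster_id += 1
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

In this toy data set the isolated point (5, 5) is the only one left without a cluster.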
Grid-Based Methods:
Here the object space is quantized into a finite number of cells that form a grid structure. All the clustering operations are performed on the grid structure only. The main advantage of this method is its fast processing time, which is typically independent of the number of data objects and depends only on the number of cells.
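A minimal sketch of the quantization step, assuming equally sized square cells. The helper names `cell_of` and `cell_counts`, the cell size, and the rule of flagging cells with fewer than two points are hypothetical choices for this example, not part of any particular grid-based algorithm.

```python
from collections import Counter

def cell_of(point, cell_size):
    """Map a point to the integer index of the grid cell that contains it."""
    return tuple(int(coord // cell_size) for coord in point)

def cell_counts(points, cell_size):
    """Count how many points fall into each occupied cell."""
    return Counter(cell_of(p, cell_size) for p in points)

points = [(0.1, 0.2), (0.3, 0.4), (0.5, 0.1), (0.2, 0.8), (5.5, 5.5)]
counts = cell_counts(points, cell_size=1.0)

# Later steps only look at the per-cell counts, not at the individual points.
# Assumption for this example: a cell holding fewer than 2 points is "sparse".
sparse = [p for p in points if counts[cell_of(p, 1.0)] < 2]
print(sparse)
```

Once the counts are built, the work depends on the number of occupied cells, which is why the running time does not grow with the number of objects.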
Cluster-Based Approaches for detecting Outliers:
Clustering-based outlier detection methods assume that normal data objects belong to large and dense clusters, whereas outliers belong to small or sparse clusters, or do not belong to any cluster. Clustering-based approaches detect outliers by examining the relationship between objects and clusters. Three questions decide whether an object is an outlier:
- Does the object belong to any cluster? If not, then it is identified as an outlier.
- Is there a large distance between the object and the cluster to which it is closest? If yes, it is an outlier.
- Is the object part of a small or sparse cluster? If yes, then all the objects in that cluster are outliers.
Checking an outlier:
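When k-means is used for clustering, the distance-based check can be made with the ratio dist(o, co)/x
where,
co is the center closest to object o,
dist(o, co) is the distance between o and co, and
x is the average distance between co and the objects assigned to co.
The larger the ratio, the farther o lies from its center relative to the other members of the cluster, and the more likely it is that o is an outlier.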
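The ratio can be computed as in the following sketch, which takes x to be the average distance from a center to the objects assigned to it. The cluster centers are assumed to come from an earlier k-means run; here they are written by hand, and the function name `outlier_ratios` and the sample points are hypothetical.

```python
import math
from statistics import mean

def outlier_ratios(points, centers):
    """For each point o, return dist(o, co) divided by the average
    distance between co and all points assigned to co."""
    # Index of the closest center for every point.
    closest = [
        min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
        for p in points
    ]
    dists = [math.dist(p, centers[c]) for p, c in zip(points, closest)]
    # Average member-to-center distance per cluster.
    avg = {
        k: mean(d for d, c in zip(dists, closest) if c == k)
        for k in set(closest)
    }
    return [d / avg[c] if avg[c] > 0 else 0.0 for d, c in zip(dists, closest)]

centers = [(1.0, 1.0), (10.0, 10.0)]           # assumed output of k-means
points = [(0, 1), (2, 1), (1, 0), (1, 2),      # tight group around (1, 1)
          (1, 6),                              # far from its closest center
          (10, 9), (10, 11)]                   # group around (10, 10)
ratios = outlier_ratios(points, centers)
print(max(range(len(points)), key=lambda i: ratios[i]))  # index of the most suspicious point
```

A threshold on the ratio is still needed to turn the scores into a yes/no decision; choosing it depends on the data.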
Note that each of the approaches seen so far detects only individual objects as outliers, because it compares objects one at a time against the clusters in the data set. However, in a large data set, some outliers may be similar to one another and form a small cluster. The approaches discussed so far can be deceived by such outliers.
To overcome this problem, the third approach to cluster-based outlier detection identifies small or sparse clusters and declares the objects in those clusters to be outliers as well. An example of this approach is the FindCBLOF algorithm, which works as follows.
1. Find the clusters in a data set and sort them by decreasing size. The algorithm assumes that most of the data points are not outliers. It uses a parameter α (0 ≤ α ≤ 1) to distinguish large from small clusters. Any cluster that contains at least a percentage α (e.g., α = 90%) of the data set is considered a “large cluster.” The remaining clusters are referred to as “small clusters.”
2. To each data point, assign a cluster-based local outlier factor (CBLOF). For a point belonging to a large cluster, its CBLOF is the product of the cluster’s size and the similarity between the point and the cluster. For a point belonging to a small cluster, its CBLOF is the product of the size of the small cluster and the similarity between the point and the closest large cluster. CBLOF defines the similarity between a point and a cluster in a statistical way that represents the probability that the point belongs to the cluster. The larger the value, the more similar the point and the cluster are. The CBLOF score can detect outlier points that are far from any cluster. In addition, small clusters that are far from any large cluster are considered to consist of outliers. The points with the lowest CBLOF scores are suspected outliers.
In short, to detect outliers in small clusters, compute the cluster-based local outlier factor with the following steps:
- Find the clusters and sort them in decreasing order of size.
- Assign a local outlier factor to each point of every cluster.
- If object p belongs to a large cluster, CBLOF = (size of that cluster) × (similarity between p and that cluster).
- If object p belongs to a small cluster, CBLOF = (size of the small cluster) × (similarity between p and the closest large cluster).
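The steps above can be sketched as follows. This is an illustration under several assumptions, not the reference FindCBLOF implementation: cluster labels are taken as given, the largest clusters that together cover a fraction `alpha` of the data are treated as "large" (one common reading of step 1), and the similarity between a point and a cluster is a hypothetical stand-in, 1 / (1 + distance to the cluster centroid).

```python
import math
from collections import defaultdict

def centroid(pts):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(c) / len(pts) for c in zip(*pts))

def similarity(p, center):
    # Hypothetical similarity measure: closer to the centroid means more similar.
    return 1.0 / (1.0 + math.dist(p, center))

def cblof_scores(points, labels, alpha=0.9):
    clusters = defaultdict(list)
    for p, lab in zip(points, labels):
        clusters[lab].append(p)
    # Step 1: sort clusters by decreasing size and split large from small.
    ordered = sorted(clusters, key=lambda lab: len(clusters[lab]), reverse=True)
    large, covered = [], 0
    for lab in ordered:
        if covered >= alpha * len(points):
            break
        large.append(lab)
        covered += len(clusters[lab])
    centers = {lab: centroid(pts) for lab, pts in clusters.items()}
    # Step 2: score = cluster size * similarity (own cluster if large,
    # otherwise the closest, i.e. most similar, large cluster).
    scores = []
    for p, lab in zip(points, labels):
        if lab in large:
            sim = similarity(p, centers[lab])
        else:
            sim = max(similarity(p, centers[big]) for big in large)
        scores.append(len(clusters[lab]) * sim)
    return scores

points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 20)]
labels = ["a"] * 5 + ["b"] * 4 + ["c"]
scores = cblof_scores(points, labels, alpha=0.8)
print(scores.index(min(scores)))  # position of the lowest-scoring point
```

With these sample points, clusters "a" and "b" are large and the single point in cluster "c" receives the lowest score, so it is the prime outlier suspect.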
Clustering-based approaches may incur high computational costs if they have to find clusters before detecting outliers. Several techniques have been developed for improved efficiency. For example, fixed-width clustering is a linear-time technique that is used in some outlier detection methods. The idea is simple yet efficient. A point is assigned to a cluster if the center of the cluster is within a predefined distance threshold from the point. If a point cannot be assigned to any existing cluster, a new cluster is created.
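A minimal sketch of fixed-width clustering, assuming a single pass in which the first point of each cluster serves as its fixed center and a point joins the first center found within the threshold. The function name `fixed_width_clustering`, the threshold value and the sample points are hypothetical; some variants update the center as members arrive or pick the nearest center instead.

```python
import math

def fixed_width_clustering(points, width):
    """Single pass: join the first cluster whose center is within `width`,
    otherwise start a new cluster centered on the point itself."""
    centers, members = [], []
    for p in points:
        for k, c in enumerate(centers):
            if math.dist(p, c) <= width:
                members[k].append(p)
                break
        else:
            # No existing center is close enough: create a new cluster.
            centers.append(p)
            members.append([p])
    return centers, members

points = [(0, 0), (0.5, 0), (0, 0.5), (10, 10), (10.2, 10), (50, 50)]
centers, members = fixed_width_clustering(points, width=1.0)
print([len(m) for m in members])
```

Each point is compared only against the current centers, so for a bounded number of clusters the cost grows linearly with the number of points. Clusters that end up with very few members, like the one around (50, 50) here, are natural outlier candidates.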
Strengths and weaknesses of cluster-based outlier detection:
Advantages: Cluster-based outlier detection methods have the following advantages. First, they can detect outliers without requiring labeled data, that is, they are unsupervised. Second, they work for many types of data. Third, clusters can be regarded as summaries of the data: once the clusters are obtained, a cluster-based method only needs to compare an object against the clusters to determine whether the object is an outlier. This step is usually fast because the number of clusters is usually small compared with the total number of objects.
Disadvantages: The effectiveness of clustering-based outlier detection depends heavily on the clustering method used, and such methods may not be optimized for outlier detection. In addition, clustering methods are often costly for large data sets, which can become a bottleneck.