Data Cleaning
Data cleaning is the process of identifying and correcting issues within a dataset. To assess your
skillset, the interviewer may ask you about a scenario or present you with a sample dataset to
evaluate. In a statistics interview round, you’re expected to verbally walk through the data
cleaning approach you’d take. In contrast, you’re more likely to write code to clean the dataset in
a technical/coding interview round.
What to Expect
Assessing the missing data mechanism minimizes bias and ensures the validity of your analysis.
Missing Completely at Random (MCAR)
• Description: The probability of data being missing is random and unrelated to any observed or unobserved variable.
• Example: A survey on customer satisfaction where some respondents accidentally skip a question.
• Handling Strategy: Remove observations with missing values (if <20% of total observations), or use any imputation method (simple: mean/median/mode imputation; advanced: regression, K-Nearest Neighbors (KNN), multiple imputation).
Missing at Random (MAR)
• Description: The probability of data being missing may depend on observed variables but not on the missing values themselves.
• Example: A study on income where younger respondents are less likely to disclose their earnings; because age is observed, the missingness can be explained by the observed data.
• Handling Strategy: Multiple imputation or model-based imputation methods can be used, incorporating information from other observed variables to impute missing values. Incorporate domain knowledge to guide the choice of imputation method and assess the suitability of imputed values. Create indicator variables to flag which values were imputed.
Missing Not at Random (MNAR)
• Description: The probability of data being missing depends on the missing values themselves, even after accounting for observed variables.
• Example: A clinical trial where participants with severe side effects from a medication are more likely to drop out of the study.
• Handling Strategy: MNAR data are more challenging to handle, and advanced techniques such as pattern mixture models or selection models may be required to account for the missingness mechanism.
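As a minimal sketch of simple imputation with a missingness indicator (assuming pandas is available; the dataset and column names are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical survey data with missing satisfaction scores.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "satisfaction": [4.0, np.nan, 3.0, np.nan, 5.0, 4.0],
})

# Flag missingness before imputing, so downstream models can still "see" it.
df["satisfaction_missing"] = df["satisfaction"].isna().astype(int)

# Simple imputation: fill with the observed median (reasonable under MCAR).
median = df["satisfaction"].median()
df["satisfaction_imputed"] = df["satisfaction"].fillna(median)
```

For MAR data, the same pattern extends to model-based imputation: fit a model of the incomplete column on the observed columns (here, `age`) and predict the missing entries.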
Outlier Detection
Outlier detection is the process of identifying observations or data points that deviate
significantly from the majority of the data in a dataset. Outliers can arise due to various reasons
such as measurement errors, data entry mistakes, natural variability, or rare events. Outlier
detection is an essential step in data analysis and modeling that ensures the accuracy, reliability,
and robustness of insights derived from the data.
Common Types of Outliers:
• Univariate
• Multivariate
• Contextual
• Collective
Univariate Outliers
Detection Methods:
1. Visual inspection:
◦ Box plots
◦ Histograms
◦ Scatter plots
2. Statistical Methods:
◦ Z-score: Flag values more than about three standard deviations from the mean.
◦ Interquartile range (IQR): Flag values beyond 1.5 × IQR below the first quartile or above the third quartile.
Treatment Methods:
1. Data Transformation:
◦ Winsorization: Replace outliers with the nearest non-outlier value.
◦ Trimming: Remove extreme values beyond a certain percentile.
◦ Logarithmic Transformation: Useful for reducing the influence of large outliers on skewed data distributions.
2. Imputation: For datasets with missing values, outliers can be treated as missing data and
imputed using appropriate techniques mentioned above.
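The IQR rule, winsorization, and trimming above can be sketched as follows (the sample data and the 1.5 × IQR fences are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.append(rng.normal(50, 5, 100), [120.0, -40.0])  # two injected outliers

# IQR fences: the common 1.5 * IQR rule for univariate outliers.
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Winsorization: clip outliers to the nearest fence instead of dropping them.
x_winsorized = np.clip(x, lo, hi)

# Trimming: drop values outside the fences entirely.
x_trimmed = x[(x >= lo) & (x <= hi)]
```

Winsorization keeps the sample size intact, which matters when downstream methods are sensitive to the number of observations; trimming is simpler but discards data.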
Multivariate Outliers
Detection Methods:
1. Distance-based methods:
◦ Mahalanobis Distance: Measures the distance of each point from the centroid of the data, accounting for correlations between variables.
◦ K-Nearest Neighbors (KNN): Identify points with unusually large distances from their neighbors.
2. Machine Learning Algorithms:
◦ Isolation Forest: Constructs an ensemble of decision trees to isolate outliers efficiently by partitioning the feature space. Outliers are expected to require fewer partitions to be isolated.
◦ One-class SVM (Support Vector Machine): Trains a model on the majority class (normal data) to define the region of normality. Observations lying outside this region are considered outliers.
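As a hedged sketch of Mahalanobis-distance screening (the simulated data and the chi-square cutoff of 7.378, the 97.5th percentile with 2 degrees of freedom, are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=200)
X = np.vstack([X, [[3.0, -3.0]]])  # jointly unusual, though not extreme per axis

mu = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mu
# Squared Mahalanobis distance of each row from the centroid.
d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

# Flag points beyond the chi-square(2 dof) 97.5th percentile (~7.378).
outliers = np.where(d2 > 7.378)[0]
```

Note how the injected point is flagged even though each of its coordinates alone is unremarkable: the covariance term penalizes points that break the correlation structure, which is exactly what univariate methods miss.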
Treatment Methods:
1. Clustering: Group similar data points together to isolate outliers, then perform the analysis separately for each cluster.
2. Robust Methods: Use algorithms less sensitive to outliers.
Contextual Outliers
Outliers based on context or domain knowledge (e.g., a sudden surge in website traffic).
Detection Methods:
1. Contextual comparison: Compare each observation against the expected value for its context (e.g., the same hour of day or season) and flag large deviations.
Collective Outliers
Detection Methods:
1. Clustering: Identify clusters of data points deviating from the norm. Example: DBSCAN
(Density-Based Spatial Clustering of Applications with Noise) identifies clusters of
data points based on density. Outliers are data points that do not belong to any cluster.
2. Association Rule Mining: Identify patterns of co-occurring outliers.
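The DBSCAN approach above can be sketched as follows (assuming scikit-learn is available; the simulated clusters and the `eps`/`min_samples` settings are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(7)
# Two dense clusters plus two scattered points that belong to neither.
cluster_a = rng.normal([0, 0], 0.3, size=(50, 2))
cluster_b = rng.normal([5, 5], 0.3, size=(50, 2))
noise = np.array([[2.5, 2.5], [8.0, -1.0]])
X = np.vstack([cluster_a, cluster_b, noise])

labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)
# DBSCAN assigns the label -1 to points that belong to no cluster.
outlier_idx = np.where(labels == -1)[0]
```

Because DBSCAN never forces every point into a cluster, the noise label -1 gives outlier detection for free; the trade-off is that `eps` and `min_samples` must be tuned to the data's density.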
Treatment Methods:
1. Investigate Root Cause: Determine if outliers are due to data collection errors or
genuine anomalies.