Data Cleaning
Data cleaning is the process of identifying and correcting issues within a dataset. To assess your
skillset, the interviewer may ask you about a scenario or present you with a sample dataset to
evaluate. In a statistics interview round, you’re expected to verbally walk through the data
cleaning approach you’d take. In contrast, you’re more likely to write code to clean the dataset in
a technical/coding interview round.
What to Expect
Assessing the missing data mechanism minimizes bias and ensures the validity of your analysis.
Missing Completely at Random (MCAR)
• Description: The probability of data being missing is random and unrelated to any observed or unobserved variable.
• Example: A survey on customer satisfaction where some respondents accidentally skip a question.
• Handling Strategy: Remove observations with missing values (if <20% of total observations), or use any imputation method (simple: mean/median/mode imputation; advanced: regression, K-Nearest Neighbors (KNN), multiple imputation).
Missing at Random (MAR)
• Description: The probability of data being missing may depend on observed variables but not on the missing values themselves.
• Example: A study on income where younger respondents are less likely to disclose their earnings; because age is observed, the missingness can be explained by the observed data.
• Handling Strategy: Multiple imputation or model-based imputation methods can be used, incorporating information from other observed variables to impute missing values. Incorporate domain knowledge to guide the choice of imputation method and assess the suitability of imputed values. Create indicator variables to flag which values were imputed.
Missing Not at Random (MNAR)
• Description: The probability of data being missing depends on the missing values themselves, even after accounting for observed variables.
• Example: A clinical trial where participants with severe side effects from a medication are more likely to drop out of the study.
• Handling Strategy: MNAR data are more challenging to handle, and advanced techniques such as pattern mixture models or selection models may be required to account for the missingness mechanism.
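As a minimal sketch of simple imputation with a missingness indicator (assuming pandas is available; the dataset and column names are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical survey data with missing satisfaction scores.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "satisfaction": [4.0, np.nan, 3.0, np.nan, 5.0, 4.0],
})

# Flag missingness before imputing, so downstream models can still "see" it.
df["satisfaction_missing"] = df["satisfaction"].isna().astype(int)

# Simple imputation: fill with the observed median (reasonable under MCAR).
median = df["satisfaction"].median()
df["satisfaction_imputed"] = df["satisfaction"].fillna(median)
```

For MAR data, the same pattern extends to model-based imputation: fit a model of the incomplete column on the observed columns (here, `age`) and predict the missing entries.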
Outlier Detection
Outlier detection is the process of identifying observations or data points that deviate
significantly from the majority of the data in a dataset. Outliers can arise due to various reasons
such as measurement errors, data entry mistakes, natural variability, or rare events. Outlier
detection is an essential step in data analysis and modeling that ensures the accuracy, reliability,
and robustness of insights derived from the data.
Common Types of Outliers:
• Univariate
• Multivariate
• Contextual
• Collective
Univariate Outliers
Detection Methods:
1. Visual inspection:
◦ Box plots
◦ Histograms
◦ Scatter plots
2. Statistical Methods:
◦ Z-score: Flag values more than about three standard deviations from the mean.
◦ Interquartile range (IQR): Flag values beyond 1.5 × IQR below the first quartile or above the third quartile.
Treatment Methods:
1. Data Transformation:
◦ Winsorization: Replace outliers with the nearest non-outlier value.
◦ Trimming: Remove extreme values beyond a certain percentile.
◦ Logarithmic Transformation: Useful for reducing the influence of large outliers on skewed data distributions.
2. Imputation: For datasets with missing values, outliers can be treated as missing data and
imputed using appropriate techniques mentioned above.
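The IQR rule, winsorization, and trimming above can be sketched as follows (the sample data and the 1.5 × IQR fences are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.append(rng.normal(50, 5, 100), [120.0, -40.0])  # two injected outliers

# IQR fences: the common 1.5 * IQR rule for univariate outliers.
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Winsorization: clip outliers to the nearest fence instead of dropping them.
x_winsorized = np.clip(x, lo, hi)

# Trimming: drop values outside the fences entirely.
x_trimmed = x[(x >= lo) & (x <= hi)]
```

Winsorization keeps the sample size intact, which matters when downstream methods are sensitive to the number of observations; trimming is simpler but discards data.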
Multivariate Outliers
Detection Methods:
1. Distance-based methods:
◦ Mahalanobis Distance: Measures the distance of each point from the centroid of the data, accounting for correlations between variables.
◦ K-Nearest Neighbors (KNN): Identify points with unusually large distances from their neighbors.
2. Machine Learning Algorithms:
◦ Isolation Forest: Constructs an ensemble of decision trees to isolate outliers efficiently by partitioning the feature space. Outliers are expected to require fewer partitions to be isolated.
◦ One-class SVM (Support Vector Machine): Trains a model on the majority class (normal data) to define the region of normality. Observations lying outside this region are considered outliers.
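As a hedged sketch of Mahalanobis-distance screening (the simulated data and the chi-square cutoff of 7.378, the 97.5th percentile with 2 degrees of freedom, are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=200)
X = np.vstack([X, [[3.0, -3.0]]])  # jointly unusual, though not extreme per axis

mu = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mu
# Squared Mahalanobis distance of each row from the centroid.
d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

# Flag points beyond the chi-square(2 dof) 97.5th percentile (~7.378).
outliers = np.where(d2 > 7.378)[0]
```

Note how the injected point is flagged even though each of its coordinates alone is unremarkable: the covariance term penalizes points that break the correlation structure, which is exactly what univariate methods miss.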
Treatment Methods:
1. Clustering: Group similar data points together to isolate outliers, then perform the analysis separately for each cluster.
2. Robust Methods: Use algorithms less sensitive to outliers.
Contextual Outliers
Outliers based on context or domain knowledge (e.g., a sudden surge in website traffic).
Detection Methods:
1. Contextual comparison: Compare each observation against the expected value for its context (e.g., the same hour of day or season) and flag large deviations.
Collective Outliers
Detection Methods:
1. Clustering: Identify clusters of data points deviating from the norm. Example: DBSCAN
(Density-Based Spatial Clustering of Applications with Noise) identifies clusters of
data points based on density. Outliers are data points that do not belong to any cluster.
2. Association Rule Mining: Identify patterns of co-occurring outliers.
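The DBSCAN approach above can be sketched as follows (assuming scikit-learn is available; the simulated clusters and the `eps`/`min_samples` settings are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(7)
# Two dense clusters plus two scattered points that belong to neither.
cluster_a = rng.normal([0, 0], 0.3, size=(50, 2))
cluster_b = rng.normal([5, 5], 0.3, size=(50, 2))
noise = np.array([[2.5, 2.5], [8.0, -1.0]])
X = np.vstack([cluster_a, cluster_b, noise])

labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)
# DBSCAN assigns the label -1 to points that belong to no cluster.
outlier_idx = np.where(labels == -1)[0]
```

Because DBSCAN never forces every point into a cluster, the noise label -1 gives outlier detection for free; the trade-off is that `eps` and `min_samples` must be tuned to the data's density.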
Treatment Methods:
1. Investigate Root Cause: Determine if outliers are due to data collection errors or
genuine anomalies.