
Data Cleaning

Data cleaning is the process of identifying and correcting issues within a dataset. To assess your
skillset, the interviewer may ask you about a scenario or present you with a sample dataset to
evaluate. In a statistics interview round, you’re expected to verbally walk through the data
cleaning approach you’d take. In contrast, you’re more likely to write code to clean the dataset in
a technical/coding interview round.

What to Expect

Example questions include:

1. Imagine you're analyzing a dataset about customer transactions for an e-commerce platform. You discover that some of the records have missing values for key variables such as purchase amount, product category, and customer demographics. How would you handle these missing values?
2. Suppose you're analyzing a dataset about financial transactions for a banking institution. During your analysis, you notice some transactions with unusually large or small amounts compared to the rest of the data. How would you identify these outliers, and what strategies would you use to address them effectively?
This lesson will discuss:

• Handling missing data
• Outlier detection

For each topic, we'll provide a brief description of the issue and list common mitigation methods.

Handling Missing Data

Assessing the missing data mechanism minimizes bias and ensures the validity of your analysis.

The main types of missing data are:

Missing Completely at Random (MCAR)
• Description: The probability of data being missing is random and unrelated to any observed or unobserved variable.
• Example: A survey on customer satisfaction where some respondents accidentally skip questions.
• Handling Strategy: Remove observations with missing values (if they make up less than 20% of total observations) or use any imputation method (simple: mean/median/mode imputation; advanced: regression, K-Nearest Neighbors (KNN), multiple imputation).

Missing at Random (MAR)
• Description: The probability of data being missing may depend on observed variables but not on the missing values themselves.
• Example: A study on income where respondents with higher income levels are less likely to disclose their earnings.
• Handling Strategy: Multiple imputation or model-based imputation methods can be used, incorporating information from other observed variables to impute missing values. Incorporate domain knowledge to guide the choice of imputation method and assess the suitability of imputed values. Create indicator variables to flag which values were imputed.

Missing Not at Random (MNAR)
• Description: The probability of data being missing depends on the missing values themselves, even after accounting for observed variables.
• Example: A clinical trial where participants with severe side effects from a medication are more likely to drop out of the study.
• Handling Strategy: MNAR data are more challenging to handle, and advanced techniques such as pattern mixture models or selection models may be required to account for the missingness mechanism.
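
For a coding round, a minimal sketch of these strategies in pandas/scikit-learn might look like the following. The DataFrame and column names (purchase_amount, customer_age, product_category) are hypothetical, and the choice between dropping rows, simple imputation, and KNN imputation depends on the missingness mechanism described above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical e-commerce transactions with missing values
df = pd.DataFrame({
    "purchase_amount": [25.0, np.nan, 40.5, 12.0, np.nan],
    "customer_age": [34.0, 41.0, np.nan, 29.0, 52.0],
    "product_category": ["books", None, "toys", "books", "toys"],
})

# Keep an indicator of missingness before imputing anything
df["amount_was_missing"] = df["purchase_amount"].isna()

# Option 1 (MCAR, small fraction missing): drop incomplete rows
dropped = df.dropna()

# Option 2: simple imputation (mode for the categorical column)
df["product_category"] = df["product_category"].fillna(df["product_category"].mode()[0])

# Option 3 (MAR): KNN imputation using the other observed numeric variable
knn = KNNImputer(n_neighbors=2)
df[["purchase_amount", "customer_age"]] = knn.fit_transform(
    df[["purchase_amount", "customer_age"]]
)
```
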
Outlier Detection

Outlier detection is the process of identifying observations or data points that deviate
significantly from the majority of the data in a dataset. Outliers can arise due to various reasons
such as measurement errors, data entry mistakes, natural variability, or rare events. Outlier
detection is an essential step in data analysis and modeling that ensures the accuracy, reliability,
and robustness of insights derived from the data.

Common Types of Outliers:

• Univariate
• Multivariate
• Contextual
• Collective
Univariate Outliers

Outliers in a single variable (e.g., anomaly in temperature sensor readings).

Detection Methods:

1. Visual inspection:

◦ Box plots
◦ Histograms
◦ Scatter plots
2. Statistical Methods (see the sketch after this list):

◦ Z-score: Calculates the number of standard deviations an observation is away from the mean. Observations with a z-score above a certain threshold (e.g., 3) are considered outliers.
◦ Modified Z-score: Similar to the z-score but more robust to outliers. It uses the median and median absolute deviation (MAD) instead of the mean and standard deviation.
◦ IQR (Interquartile Range): Defines the range between the first quartile (Q1) and the third quartile (Q3). Observations outside a certain multiple of the IQR (e.g., 1.5 times the IQR) from the quartiles are considered outliers.
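
A minimal sketch of these three rules applied to a single numeric column; the sample readings and the thresholds (3, 3.5, 1.5 × IQR) are the conventional defaults mentioned above, not fixed requirements.

```python
import pandas as pd

# Hypothetical sensor readings with one unusually large value
x = pd.Series([21.5, 22.0, 21.8, 22.3, 21.9, 35.0, 22.1])

# Z-score: flag points more than 3 standard deviations from the mean
z = (x - x.mean()) / x.std()
z_outliers = x[z.abs() > 3]

# Modified z-score: use the median and MAD, more robust to the outliers themselves
mad = (x - x.median()).abs().median()
modified_z = 0.6745 * (x - x.median()) / mad
mz_outliers = x[modified_z.abs() > 3.5]

# IQR rule: flag points beyond 1.5 * IQR outside [Q1, Q3]
q1, q3 = x.quantile(0.25), x.quantile(0.75)
iqr = q3 - q1
iqr_outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]
```
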
Treatment Methods:

1. Data Transformation (see the sketch after this list):

◦ Winsorization: Replace outliers with the nearest non-outlier value.
◦ Trimming: Remove extreme values beyond a certain percentile.
◦ Logarithmic Transformation: Useful for reducing the influence of large outliers in skewed data distributions.
2. Imputation: Outliers can be treated as missing data and imputed using the appropriate techniques mentioned above.
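
A short sketch of the transformation options on the same kind of one-dimensional numeric series; the 5th/95th percentile cutoffs are illustrative choices, not prescribed values.

```python
import numpy as np
import pandas as pd

x = pd.Series([21.5, 22.0, 21.8, 22.3, 21.9, 35.0, 22.1])  # hypothetical readings

# Winsorization: clip values to the 5th and 95th percentiles
winsorized = x.clip(lower=x.quantile(0.05), upper=x.quantile(0.95))

# Trimming: drop values outside the 5th-95th percentile range entirely
trimmed = x[(x >= x.quantile(0.05)) & (x <= x.quantile(0.95))]

# Log transform: compress the influence of large values in right-skewed data
logged = np.log1p(x)  # log1p handles zeros safely
```
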

Multivariate Outliers

Outliers involving multiple variables (e.g., anomaly in credit card transactions).

Detection Methods:

1. Distance-based methods:

◦ Mahalanobis Distance: Measures each point's distance from the centroid, accounting for the covariance between variables.
◦ K-nearest Neighbors (KNN): Identify points with unusually large distances from their neighbors.
2. Machine Learning Algorithms (see the sketch after this list):

◦ Isolation Forest: Constructs an ensemble of decision trees to isolate outliers efficiently by partitioning the feature space. Outliers are expected to require fewer partitions to be isolated.
◦ One-class SVM (Support Vector Machine): Trains a model on the majority class (normal data) to define the region of normality. Observations lying outside this region are considered outliers.
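
As an illustration, the sketch below applies scikit-learn's IsolationForest to a hypothetical two-feature transaction matrix; the generated data and the contamination rate are assumptions you would replace and tune for real data.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Hypothetical transactions: (amount, items per basket), plus a few anomalies
normal = rng.normal(loc=[50, 3], scale=[10, 1], size=(200, 2))
anomalies = np.array([[500, 1], [5, 40]])
X = np.vstack([normal, anomalies])

# Isolation Forest: outliers tend to be isolated with fewer random partitions
model = IsolationForest(contamination=0.01, random_state=42)
labels = model.fit_predict(X)   # -1 = outlier, 1 = inlier
outliers = X[labels == -1]
```
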
Treatment Methods:

1. Clustering: Group similar data points together to isolate outliers, then perform the analysis separately for each cluster.
2. Robust Methods: Use algorithms that are less sensitive to outliers (e.g., tree-based models or robust loss functions).

Contextual Outliers

Outliers based on context or domain knowledge (e.g., a sudden surge in website traffic).

Detection Methods:

1. Expert Judgment: Consult domain experts to identify unusual data points.
2. Time-series Analysis: Detect anomalies based on temporal patterns (see the sketch after this list).
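
One simple, generic time-series approach (not a specific library routine) is to compare each point against a rolling baseline of recent values; the traffic numbers, window size, and threshold below are all illustrative assumptions.

```python
import pandas as pd

# Hypothetical daily website traffic with a sudden surge
traffic = pd.Series(
    [1000, 1020, 990, 1010, 1005, 998, 5000, 1015, 1002],
    index=pd.date_range("2024-01-01", periods=9, freq="D"),
)

# Rolling baseline built from the preceding points only (exclude the current value)
baseline = traffic.shift(1).rolling(window=5, min_periods=3)
rolling_z = (traffic - baseline.mean()) / baseline.std()

# Flag contextual anomalies that fall far outside the local pattern
anomalies = traffic[rolling_z.abs() > 3]
```
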
Treatment Methods:

1. Domain-Specific Treatment: Handle outliers based on domain-specific rules or requirements.

In some cases, outliers may be valid data points that represent rare or extreme events. If the outliers are genuine observations and do not significantly affect the analysis, it may be appropriate to leave them in the dataset. One straightforward alternative is to remove outliers from the dataset. However, this approach should be used with caution, as removing outliers can lead to a loss of valuable information and can bias the analysis if the outliers are not truly erroneous.

Collective Outliers

Groups of outliers occurring together (e.g., a cluster of defective products).

Detection Methods:

1. Clustering: Identify clusters of data points deviating from the norm. Example: DBSCAN (Density-Based Spatial Clustering of Applications with Noise) identifies clusters of data points based on density; outliers are data points that do not belong to any cluster (see the sketch after this list).
2. Association Rule Mining: Identify patterns of co-occurring outliers.
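
A sketch using scikit-learn's DBSCAN; points labelled -1 fall outside every dense cluster, and the eps, min_samples, and synthetic data below are assumptions to adapt to the scale of real data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)

# Hypothetical measurements: two dense clusters plus a small deviant group
cluster_a = rng.normal(loc=[0, 0], scale=0.3, size=(100, 2))
cluster_b = rng.normal(loc=[5, 5], scale=0.3, size=(100, 2))
deviant_group = rng.normal(loc=[10, 0], scale=0.1, size=(5, 2))
X = np.vstack([cluster_a, cluster_b, deviant_group])

# DBSCAN labels points that do not belong to any dense cluster as -1 (noise)
labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(X)
collective_outliers = X[labels == -1]
```
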

Treatment Methods:

1. Investigate Root Cause: Determine if outliers are due to data collection errors or
genuine anomalies.