Missing Data Handling

The document discusses the challenges and methods for handling missing data in datasets, including reasons for missing values and various imputation techniques. It emphasizes the importance of understanding the type of missing data (MCAR, MAR) and provides examples of how to handle missing values using Python's pandas library. Additionally, it includes exercises for practical application and cautions against potential pitfalls in imputation methods.

Uploaded by

Shreya Parekh

Missing Data

Handling
Incomplete (Missing) Data

• Data is not always available

• E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
• Missing data may be due to
• equipment malfunction
• data that was inconsistent with other recorded data and thus deleted
• data not entered due to misunderstanding
• certain data not being considered important at the time of entry
• failure to register history or changes of the data
• Missing data may need to be inferred
How to Handle Missing Data?
• Ignore the tuple: usually done when the class label is missing (when doing classification); not effective when the % of missing values per attribute varies considerably

• Class Exercise: Suppose there are 20 attributes (columns) in a data set and in each column, 10% of the
values are missing. What % of the records will have at least one missing attribute?
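
One way to approach this exercise is sketched below. It assumes that values go missing independently across the 20 columns, which the exercise does not state; if missingness is correlated across columns, the answer will differ.

```python
# Assumption: each value is missing independently with probability 0.10.
n_attributes = 20
p_missing = 0.10

# A record is complete only if all 20 attributes are present.
p_complete = (1 - p_missing) ** n_attributes

# Complement: at least one attribute is missing.
p_at_least_one_missing = 1 - p_complete

print(f"{p_at_least_one_missing:.1%}")  # prints 87.8%
```

Under this assumption, ignoring every incomplete tuple would discard roughly 88% of the records, which is why the "ignore the tuple" strategy can be very costly.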

• Fill in the missing value manually: tedious + infeasible?

• Fill it in automatically with

• a global constant : e.g., “unknown”, a new class?!
• the attribute mean
• the attribute mean for all samples belonging to the same class: smarter
• the most probable value: inference-based such as Bayesian formula or decision tree (hot deck imputation)

• Class Exercise: Suppose we decide to fill in with the global mean. Find a scenario where this is an acceptable solution, and one in which it is not.
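
The difference between the global mean and the class-wise mean can be sketched as follows. The data frame, the column names segment and income, and all the values are hypothetical and chosen only to make the contrast visible.

```python
import numpy as np
import pandas as pd

# Hypothetical data: income differs strongly between the two classes.
df = pd.DataFrame({
    "segment": ["student", "student", "student",
                "executive", "executive", "executive"],
    "income": [10.0, 12.0, np.nan, 100.0, np.nan, 110.0],
})

# Global mean: every gap receives the same value, regardless of class.
global_filled = df["income"].fillna(df["income"].mean())

# Class mean: each gap receives the mean of its own class.
class_means = df.groupby("segment")["income"].transform("mean")
class_filled = df["income"].fillna(class_means)

print(pd.DataFrame({"global": global_filled, "by_class": class_filled}))
```

In this toy case the global mean (58) is a poor guess for both a student and an executive, while the class means (11 and 105) stay close to the observed values in each group. When the classes have similar distributions, the two strategies give nearly the same result.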
Missing Data Handling
• How to replace missing data in attributes like Age, Income, Gender, Daily Rainfall amount, Population of country/state/city, Revenue of Company, etc.?

• One technique is to use the average for numeric values. However, the average cannot/should not be used for Age (why?).

• What to do in such a case? -> Missing value imputation using a regression / ML model on:
• Salary
• Or better, years of experience
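
A minimal sketch of regression-based imputation for Age, using years of experience as the predictor. The data frame and its values are hypothetical (and deliberately exactly linear so the result is easy to check), and LinearRegression is just one possible model choice.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: age is missing for the last person.
df = pd.DataFrame({
    "years_experience": [1, 3, 5, 10, 15, 20, 8],
    "age": [23, 25, 27, 32, 37, 42, np.nan],
})

known = df[df["age"].notna()]    # rows used to fit the model
unknown = df[df["age"].isna()]   # rows whose age will be predicted

model = LinearRegression()
model.fit(known[["years_experience"]], known["age"])

# Fill only the missing ages with the model's predictions.
df.loc[unknown.index, "age"] = model.predict(unknown[["years_experience"]])

print(df)
```

With real data the fit will not be exact, so the imputed ages carry prediction error; it is worth checking how well the model fits the known rows before trusting the filled values.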
Missing Data Handling in Python
• In pandas, missing values are indicated as NaN (not a number)

• Class exercise: Read the csv file Cost_of_power in pandas, then count the number of missing values:
• In each column
• In each row
• Across the dataset

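You can use the pandas function isnull().sum() or isna().sum() on the created data frame to answer these questions: by default it counts per column, passing axis=1 to sum() counts per row, and chaining a second .sum() gives the total across the dataset.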
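
The three counts can be sketched on a small stand-in data frame. The columns and values below are hypothetical; with the real file, the data frame would come from pd.read_csv instead.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Cost of Power data.
df = pd.DataFrame({
    "Month": [1, 2, 3, 4],
    "Cost": [10.5, np.nan, 12.0, np.nan],
    "Units": [100.0, 110.0, np.nan, 130.0],
})

missing_per_column = df.isna().sum()         # one count per column
missing_per_row = df.isna().sum(axis=1)      # one count per row
missing_total = int(df.isna().sum().sum())   # single number for the dataset

print(missing_per_column)
print(missing_per_row)
print(missing_total)
```

isnull() and isna() are aliases in pandas, so either name gives the same result.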
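Missing Data Handling in Python – Full Code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn
import numpy as np

# Raw string (r'...'): backslashes in the Windows path are written once, not doubled
cost_of_power = pd.read_csv(r'C:\Users\IFIM\Documents\Academic\IFIM\Teaching\Predictive Analytics in Business Apr 22\Cost of Power.csv')

# Count the missing values in the Cost column
print(cost_of_power['Cost'].isnull().sum())

# Option 1: drop every row that has a missing value in any column,
# then renumber the index so that it runs 0, 1, 2, ... again
cost_of_power = cost_of_power.dropna()
cost_of_power = cost_of_power.reset_index(drop=True)

# Option 2 (alternative to Option 1): drop only the rows where Cost is missing
cost_of_power.dropna(subset=['Cost'], inplace=True)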
Missing Data Handling in Python
• Check index values in the data frame

• To reset the index, use reset_index(drop=True)

• Check index values again. Are the indices correct now?
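
The index check can be sketched on a small hypothetical data frame: dropna() keeps the original row labels, so the index has gaps until it is reset.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing Cost values.
df = pd.DataFrame({"Cost": [10.5, np.nan, 12.0, np.nan, 14.0]})

dropped = df.dropna()
index_before = list(dropped.index)   # original labels survive: gaps remain

clean = dropped.reset_index(drop=True)
index_after = list(clean.index)      # consecutive labels starting at 0

print(index_before)  # [0, 2, 4]
print(index_after)   # [0, 1, 2]
```

drop=True discards the old index instead of keeping it as a new column.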

Missing Data Handling in Python
• The full code above counts the missing values for the Cost column only
• Get the missing values for all columns

• For full worked out code, refer to

https://www.geeksforgeeks.org/how-to-count-the-number-of-nan-values-in-pandas/
Missing Data Imputation in Python
Missing values can be imputed with a provided constant value, or using the statistics (mean, median or most frequent) of each column in which the missing values are located.

import numpy as np
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit([[1, 2], [np.nan, 3], [7, 6]])
X = [[np.nan, 2], [6, np.nan], [7, 6]]
print(imp.transform(X))
Y = imp.transform(X)

For categorical data, use most_frequent or constant strategy

import pandas as pd
df = pd.DataFrame([["a", "x"],
                   [np.nan, "y"],
                   ["a", np.nan],
                   ["b", "y"]], dtype="category")
imp = SimpleImputer(strategy="most_frequent")
print(imp.fit_transform(df))
Class Exercise
• Use the SimpleImputer on Cost of Power dataset to impute the
missing values.

• What happens to the statistical properties of the distribution if mean imputation is used? For example, think of variance.
(Ref: https://www150.statcan.gc.ca/n1/edu/power-pouvoir/ch3/imputation/5214784-eng.htm)
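
A small sketch of the variance question, using a hypothetical cost series rather than the real dataset. Every imputed value sits exactly at the mean, so it adds nothing to the sum of squared deviations while increasing the number of observations.

```python
import numpy as np
import pandas as pd

# Hypothetical cost values with two gaps.
cost = pd.Series([10.0, 12.0, np.nan, 14.0, np.nan, 16.0, 18.0])

imputed = cost.fillna(cost.mean())

mean_before, mean_after = cost.mean(), imputed.mean()
var_before = cost.var()     # sample variance of the 5 observed values
var_after = imputed.var()   # sample variance of all 7 values after imputation

print(mean_before, mean_after)  # the mean is unchanged
print(var_before, var_after)    # the variance shrinks
```

Here the mean stays at 14, but the sample variance drops from 10 to about 6.67. The more values are imputed with the mean, the more the spread of the variable is understated.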
SimpleImputer – Cost of Power
• SimpleImputer works on array-like input (a dataframe is accepted too) and returns a numpy array. Here we first convert the dataframe into an array:

cost_of_power = pd.read_csv(r'C:\Users\IFIM\Documents\Academic\IFIM\Teaching\Predictive Analytics in Business Apr 22\Cost of Power.csv')

Z = cost_of_power[["Month","Cost"]].to_numpy()

• Then we can call the SimpleImputer on this array. Use fit_transform so that the means are learned from this dataset, not from the toy array the imputer was fitted on earlier:

cost_of_power_new = imp.fit_transform(Z)
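
The steps above can be sketched end to end on a stand-in data frame. The values are hypothetical, and Month is assumed to be numeric; if the real Month column holds text, the mean strategy will not work on it and only the Cost column should be passed.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for the Cost of Power data.
cost_of_power = pd.DataFrame({
    "Month": [1, 2, 3, 4, 5],
    "Cost": [100.0, np.nan, 120.0, np.nan, 140.0],
})

imp = SimpleImputer(missing_values=np.nan, strategy="mean")

# Convert to an array, then learn the column means and fill the gaps in one step.
Z = cost_of_power[["Month", "Cost"]].to_numpy()
Z_imputed = imp.fit_transform(Z)

# The result is a plain numpy array; wrap it to get the column names back.
cost_of_power_new = pd.DataFrame(Z_imputed, columns=["Month", "Cost"])

print(cost_of_power_new)
```

The two missing Cost values are replaced by 120, the mean of the three observed costs.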
How Did the Missing Value Imputation Perform?
https://machinelearningmastery.com/statistical-imputation-for-missing-values-in-machine-learning/
Missing Data Mechanisms
• Types of missing values –
• Missing Completely at Random (MCAR)
• Missing at Random (MAR)
• Missingness that depends on unobserved predictors
• Missingness that depends on the missing value itself

• Self-study and presentation:

http://www.stat.columbia.edu/~gelman/arm/missing.pdf

Missing value imputation can be dangerous! Do it with caution.

Missing Data Mechanisms
Case | Explanation | Example | Possible Way to Handle
Missing Completely at Random (MCAR) | Probability of a data value being missing is the same for all cases | - | Remove cases with missing values
Missing at Random (MAR) | Missing not completely at random: the probability of a data value being missing depends only on available information (all observed variables) | If sex, race, education, and age are recorded for all the people in the survey, then “earnings” is missing at random if the probability of nonresponse to this question depends only on these other, fully recorded variables. | Remove cases with missing values, OR model the missing values with Regression / Logistic Regression
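
The difference between MCAR and MAR can be illustrated with a small simulation. Everything below is a hypothetical setup: a binary education variable that is always observed, earnings that depend on it, and made-up missingness probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical population: education (0/1) is fully observed,
# and earnings are higher on average when education == 1.
education = rng.integers(0, 2, size=n)
earnings = 30 + 20 * education + rng.normal(0, 5, size=n)

# MCAR: every earnings value has the same 30% chance of being missing.
mcar_missing = rng.random(n) < 0.3

# MAR: the chance of being missing depends only on the observed education.
p_mar = np.where(education == 1, 0.6, 0.1)
mar_missing = rng.random(n) < p_mar

true_mean = earnings.mean()
mcar_mean = earnings[~mcar_missing].mean()   # complete cases under MCAR
mar_mean = earnings[~mar_missing].mean()     # complete cases under MAR

print(true_mean, mcar_mean, mar_mean)
```

Under MCAR the complete-case mean stays close to the true mean. Under MAR it is pulled toward the low-education group, because high-education respondents drop out more often; a model that uses education (for example a regression) can correct for this, since education is recorded for everyone.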
Missing Data Mechanisms
Case | Explanation | Example | Possible Way to Handle
Missingness that depends on unobserved predictors | Missingness is no longer “at random” if it depends on information that has not been recorded and this information also predicts the missing values. | Suppose that people with college degrees are less likely to reveal their earnings, having a college degree is predictive of earnings, and there is also some nonresponse to the education question. | Must be explicitly modeled
Missingness that depends on the missing value itself | The probability of missingness depends on the (potentially missing) variable itself. | In the extreme case (for example, all persons earning more than $100,000 refuse to respond), this is called censoring | Must be explicitly modeled
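
A minimal sketch of the censoring example, with hypothetical earnings drawn from a normal distribution. It shows why simple fixes fail when missingness depends on the missing value itself.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical earnings; the distribution and its parameters are made up.
earnings = rng.normal(80_000, 20_000, size=5_000)

# Censoring: everyone earning more than $100,000 refuses to respond.
missing = earnings > 100_000
observed = earnings[~missing]

true_mean = earnings.mean()
observed_mean = observed.mean()

# Mean imputation fills the gaps with the observed mean,
# so the overall mean stays at the (too low) observed mean.
imputed = np.where(missing, observed_mean, earnings)
imputed_mean = imputed.mean()

print(true_mean, observed_mean, imputed_mean)
```

Both the complete-case mean and the mean after imputation underestimate the true mean, because no observed value carries information about the earnings above the threshold. This is why such missingness has to be modeled explicitly.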
Caveat!
When is it NORMAL to have a missing value in a dataset?
Caveat - discussion
• For example, if the survey questionnaire asks about the Name of
Spouse – but the person is unmarried / divorced etc – then Name of
Spouse would be blank or missing
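
One possible way to keep such "normal" blanks apart from genuinely missing answers is to record the reason in a separate column. The survey data, column names, and labels below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical survey responses.
survey = pd.DataFrame({
    "marital_status": ["Married", "Unmarried", "Divorced", "Married"],
    "spouse_name": ["Asha", np.nan, np.nan, np.nan],
})

# The question only applies to respondents who are married.
not_applicable = survey["marital_status"] != "Married"

# Label each row: answered, blank by design, or truly missing.
survey["spouse_status"] = np.where(
    survey["spouse_name"].notna(),
    "recorded",
    np.where(not_applicable, "not applicable", "missing"),
)

print(survey)
```

Only the last row is a real gap; the two blanks for the unmarried and divorced respondents are expected and should not be imputed.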
Missing Value Handling Methods –
from LinkedIn post of Danny
Butnivik
Self Practice
Kaggle tutorial using Melbourne Housing data set:

https://www.kaggle.com/code/alexisbcook/missing-values
