
Missing Data Handling

The document discusses the challenges and methods for handling missing data in datasets, including reasons for missing values and various imputation techniques. It emphasizes the importance of understanding the type of missing data (MCAR, MAR) and provides examples of how to handle missing values using Python's pandas library. Additionally, it includes exercises for practical application and cautions against potential pitfalls in imputation methods.

Uploaded by

Shreya Parekh

Missing Data Handling
Incomplete (Missing) Data

• Data is not always available
  • E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
• Missing data may be due to
  • equipment malfunction
  • values that were inconsistent with other recorded data and thus deleted
  • data not entered due to misunderstanding
  • certain data not being considered important at the time of entry
  • history or changes of the data not being registered
• Missing data may need to be inferred
How to Handle Missing Data?
• Ignore the tuple: usually done when the class label is missing (when doing classification); not effective when the % of missing values per attribute varies considerably

• Class Exercise: Suppose there are 20 attributes (columns) in a data set and in each column, 10% of the
values are missing. What % of the records will have at least one missing attribute?
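
One way to reason about the exercise above, as a sketch: a record is complete only if none of its 20 attributes is missing. The calculation below assumes that values go missing independently across columns, which the exercise does not state; if missingness is concentrated in the same rows, the share of affected records is lower.

```python
# Assumption (not stated in the exercise): missingness is independent across columns.
n_attributes = 20
p_missing = 0.10

# A record is complete only if every one of the 20 attributes is present.
p_complete_record = (1 - p_missing) ** n_attributes
p_at_least_one_missing = 1 - p_complete_record

print(f"Complete records: {p_complete_record:.1%}")
print(f"Records with at least one missing value: {p_at_least_one_missing:.1%}")
```

Under this assumption roughly 88% of the records are affected, which is why ignoring every incomplete tuple can discard most of the data.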

• Fill in the missing value manually: tedious + infeasible?

• Fill it in automatically with
  • a global constant: e.g., “unknown”, a new class?!
  • the attribute mean
  • the attribute mean for all samples belonging to the same class: smarter
  • the most probable value: inference-based, such as a Bayesian formula or decision tree (hot deck imputation)

• Class Exercise: Suppose we decide to fill in with the global mean. Find a scenario where this is an acceptable solution, and one in which it is not.
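
The difference between the global mean and the class-wise mean can be sketched with pandas. The data frame, the column names `segment` and `income`, and the values are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'segment' plays the role of the class label.
sales = pd.DataFrame({
    "segment": ["retail", "retail", "retail", "corporate", "corporate", "corporate"],
    "income": [30.0, np.nan, 34.0, 90.0, 110.0, np.nan],
})

# Global mean: every gap receives the same value, regardless of class.
global_fill = sales["income"].fillna(sales["income"].mean())

# Class mean: each gap receives the mean of its own segment
# (the group mean skips NaN values).
class_means = sales.groupby("segment")["income"].transform("mean")
class_fill = sales["income"].fillna(class_means)

print(pd.DataFrame({
    "segment": sales["segment"],
    "global_mean_fill": global_fill,
    "class_mean_fill": class_fill,
}))
```

In this toy data the two segments have very different incomes, so the global mean lands between them and fits neither; that is the kind of scenario the class exercise asks about.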
Missing Data Handling
• How to replace missing data in attributes like Age, Income, Gender, Daily Rainfall amount, Population of country/state/city, Revenue of Company, etc.?

• One technique is to use the average for numeric values. However, the average cannot/should not be used for Age (why?).

• What to do in such a case? -> Missing value imputation using regression or an ML model on:
  • Salary
  • Or better, years of experience
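
A minimal sketch of regression-based imputation for Age, using years of experience as the predictor. The data frame, its column names and its values are hypothetical, and a plain linear regression is only one possible model choice.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: age is missing for the last two employees.
staff = pd.DataFrame({
    "years_experience": [1, 3, 5, 8, 12, 20, 6, 15],
    "age": [23, 25, 28, 31, 35, 44, np.nan, np.nan],
})

known = staff[staff["age"].notna()]
unknown = staff[staff["age"].isna()]

# Fit the model only on rows where age is observed.
model = LinearRegression()
model.fit(known[["years_experience"]], known["age"])

# Predict age for the rows where it is missing and write the values back.
staff.loc[unknown.index, "age"] = model.predict(unknown[["years_experience"]])
print(staff)
```

Unlike the average, the imputed age now differs from person to person and follows the predictor.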
Missing Data Handling in Python
• In pandas, missing values are indicated as NaN (not a number)

• Class exercise: Read the csv file Cost_of_power in pandas, then count the number of missing values:
• In each column
• In each row
• Across the dataset

You can use the pandas function isnull().sum() or isna().sum() on the created data frame to answer
these questions.
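
The three counts can be sketched on a small stand-in data frame. The values below are hypothetical (the contents of the Cost of Power file are not shown in the slides), and the `Units` column is invented for the illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Cost of Power data.
df = pd.DataFrame({
    "Month": [1, 2, 3, 4],
    "Cost": [120.5, np.nan, 98.0, np.nan],
    "Units": [300.0, 280.0, np.nan, 310.0],
})

missing_per_column = df.isna().sum()        # one count per column
missing_per_row = df.isna().sum(axis=1)     # one count per row
missing_total = int(df.isna().sum().sum())  # single count for the whole dataset

print(missing_per_column)
print(missing_per_row)
print(missing_total)
```

`isnull()` is an alias of `isna()` in pandas, so either spelling gives the same counts.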
Missing Data Handling in Python – Full Code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn
import numpy as np

# In a raw string (r'...') single backslashes are enough
cost_of_power = pd.read_csv(r'C:\Users\IFIM\Documents\Academic\IFIM\Teaching\Predictive Analytics in Business Apr 22\Cost of Power.csv')

# Number of missing values in the Cost column
print(cost_of_power['Cost'].isnull().sum())
# Drop every row that has at least one missing value, then renumber the index
cost_of_power = cost_of_power.dropna()
cost_of_power = cost_of_power.reset_index(drop=True)

OR

# Drop only the rows where Cost is missing
cost_of_power.dropna(subset=['Cost'], inplace=True)
Missing Data Handling in Python
• Check index values in the data frame

• To reset the index, use reset_index(drop=True)

• Check index values again. Are the indices correct now?
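
The index check can be sketched as follows, on a small hypothetical data frame: after `dropna()` the surviving rows keep their old labels, so the index has gaps until it is reset.

```python
import numpy as np
import pandas as pd

# Hypothetical data: rows 1 and 3 have no Cost value.
df = pd.DataFrame({"Cost": [10.0, np.nan, 12.0, np.nan, 15.0]})

dropped = df.dropna()
index_after_drop = dropped.index.tolist()
print(index_after_drop)   # the old labels survive, with gaps

reset = dropped.reset_index(drop=True)
index_after_reset = reset.index.tolist()
print(index_after_reset)  # consecutive labels starting at 0
```

Without `drop=True`, `reset_index()` would keep the old labels as an extra column named `index`.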


Missing Data Handling in Python
• cost_of_power['Cost'].isnull().sum() gives the number of missing values for the Cost column
• Get the missing values for all columns

• For the full worked-out code, refer to
https://www.geeksforgeeks.org/how-to-count-the-number-of-nan-values-in-pandas/
Missing Data Imputation in Python
Missing values can be imputed with a provided constant value, or using the statistics (mean, median or most frequent) of each column in which the missing values are located.

import numpy as np
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit([[1, 2], [np.nan, 3], [7, 6]])
X = [[np.nan, 2], [6, np.nan], [7, 6]]
print(imp.transform(X))
Y = imp.transform(X)

For categorical data, use the most_frequent or constant strategy.

import pandas as pd
df = pd.DataFrame([["a", "x"],
                   [np.nan, "y"],
                   ["a", np.nan],
                   ["b", "y"]], dtype="category")
imp = SimpleImputer(strategy="most_frequent")
print(imp.fit_transform(df))
Class Exercise
• Use the SimpleImputer on Cost of Power dataset to impute the
missing values.

• What happens to the statistical properties of the distribution if mean imputation is used? For example, think of variance.
(Ref: https://www150.statcan.gc.ca/n1/edu/power-pouvoir/ch3/imputation/5214784-eng.htm)
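
The effect on variance can be explored with a small simulation. This is a sketch on synthetic data: the normal distribution, the 30% missing rate and the random seed are arbitrary assumptions.

```python
import numpy as np
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(0)

# Synthetic "true" data, then 30% of the values removed completely at random.
complete = rng.normal(loc=100.0, scale=15.0, size=1000)
observed = complete.copy()
missing_idx = rng.choice(observed.size, size=300, replace=False)
observed[missing_idx] = np.nan

# Mean imputation: every gap is filled with the mean of the observed values.
imputed = SimpleImputer(strategy="mean").fit_transform(observed.reshape(-1, 1)).ravel()

var_observed = np.nanvar(observed, ddof=1)  # variance of the observed values only
var_imputed = np.var(imputed, ddof=1)       # variance after mean imputation

print(f"mean of observed values: {np.nanmean(observed):.2f}, mean after imputation: {imputed.mean():.2f}")
print(f"variance of observed values: {var_observed:.1f}, variance after imputation: {var_imputed:.1f}")
```

The mean is unchanged, but the variance shrinks: the 300 imputed points sit exactly at the mean and add nothing to the sum of squared deviations, while the sample size grows from 700 to 1000.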
SimpleImputer – Cost of Power
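• SimpleImputer works on 2-D array-like input, so here we first convert the relevant dataframe columns into a numpy array:

cost_of_power = pd.read_csv(r'C:\Users\IFIM\Documents\Academic\IFIM\Teaching\Predictive Analytics in Business Apr 22\Cost of Power.csv')

Z = cost_of_power[["Month","Cost"]].to_numpy()

• Then we can fit the SimpleImputer on this array and transform it in one step:

# Fit on Z itself, so the fill values are the column means of this dataset
# (the mean strategy requires numeric columns)
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
cost_of_power_new = imp.fit_transform(Z)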
How Did the Missing Value Imputation Perform?
https://machinelearningmastery.com/statistical-imputation-for-missing-values-in-machine-learning/
Missing Data Mechanisms
• Types of missing values –
• Missing Completely at Random (MCAR)
• Missing at Random (MAR)
• Missingness that depends on unobserved predictors
• Missingness that depends on the missing value itself

• Self-study and presentation:
http://www.stat.columbia.edu/~gelman/arm/missing.pdf

Missing value imputation can be dangerous! Do it with caution.


Missing Data Mechanisms
| Case | Explanation | Example | Possible Way to Handle |
|---|---|---|---|
| Missing Completely at Random (MCAR) | Probability of a data point being missing is the same for all cases | | Remove cases with missing values |
| Missing at Random (MAR) | Missing, but not completely at random: the probability of a data point being missing depends only on available information (all observed variables) | If sex, race, education, and age are recorded for all the people in the survey, then “earnings” is missing at random if the probability of nonresponse to this question depends only on these other, fully recorded variables. | Remove cases with missing values OR model the missing values with Regression / Logistic Regression |
Missing Data Mechanisms
| Case | Explanation | Example | Possible Way to Handle |
|---|---|---|---|
| Missingness that depends on unobserved predictors | Missingness is no longer “at random” if it depends on information that has not been recorded and this information also predicts the missing values. | Suppose that people with college degrees are less likely to reveal their earnings, having a college degree is predictive of earnings, and there is also some nonresponse to the education question. | Must be explicitly modeled |
| Missingness that depends on the missing value itself | The probability of missingness depends on the (potentially missing) variable itself. | In the extreme case (for example, all persons earning more than $100,000 refuse to respond), this is called censoring. | Must be explicitly modeled |
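
The practical difference between the mechanisms can be sketched with a simulation. Everything below is an assumption made for illustration: the synthetic `education` and `earnings` variables, the logistic missingness probabilities and the seed. The case of unobserved predictors is not simulated.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Synthetic population: education is fully observed and predicts earnings.
education = rng.normal(size=n)
earnings = 50 + 10 * education + rng.normal(scale=5, size=n)
true_mean = earnings.mean()

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Probability that "earnings" is missing under each mechanism.
missing_prob = {
    "MCAR": np.full(n, 0.3),                          # same for all cases
    "MAR": sigmoid(education),                        # depends only on an observed variable
    "depends on value itself": sigmoid((earnings - 50) / 10),
}

complete_case_mean = {}
regression_imputed_mean = {}
for name, p in missing_prob.items():
    missing = rng.random(n) < p
    observed = ~missing

    # Option 1: remove cases with missing values.
    complete_case_mean[name] = earnings[observed].mean()

    # Option 2: model the missing values with a regression on education.
    slope, intercept = np.polyfit(education[observed], earnings[observed], deg=1)
    filled = earnings.copy()
    filled[missing] = intercept + slope * education[missing]
    regression_imputed_mean[name] = filled.mean()

    print(f"{name}: true {true_mean:.2f}, "
          f"complete cases {complete_case_mean[name]:.2f}, "
          f"regression-imputed {regression_imputed_mean[name]:.2f}")
```

In this setup, dropping incomplete cases recovers the mean earnings only under MCAR. Under MAR the regression on the observed predictor corrects the estimate, while under missingness that depends on the value itself neither option does, which is consistent with the table's advice that this case must be explicitly modeled.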
Caveat!
When is it NORMAL to have a missing value in a dataset?
Caveat - discussion
• For example, if a survey questionnaire asks for the Name of Spouse but the respondent is unmarried, divorced, etc., then Name of Spouse would legitimately be blank or missing
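
A sketch of how such "normal" gaps can be kept apart from genuinely missing values before any imputation. The survey frame, the column names and the "Not applicable" label are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical survey extract.
survey = pd.DataFrame({
    "marital_status": ["married", "single", "married", "divorced"],
    "spouse_name": ["Asha", np.nan, np.nan, np.nan],
})

# A blank spouse name is expected whenever the respondent is not married.
not_applicable = survey["marital_status"].ne("married") & survey["spouse_name"].isna()
survey.loc[not_applicable, "spouse_name"] = "Not applicable"

# What remains as NaN is genuinely missing (a married respondent with no name recorded).
truly_missing = survey["spouse_name"].isna()
print(survey)
print(int(truly_missing.sum()))
```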
Missing Value Handling Methods – from LinkedIn post of Danny Butnivik
Self Practice
Kaggle tutorial using Melbourne Housing data set:

https://www.kaggle.com/code/alexisbcook/missing-values
