Missing Data Handling
Handling Incomplete (Missing) Data
• Class Exercise: Suppose there are 20 attributes (columns) in a data set and in each column, 10% of the
values are missing. What % of the records will have at least one missing attribute?
• Class Exercise: Suppose we decide to fill in missing values with the global mean. Find one scenario
where this is an acceptable solution, and one in which it is not.
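For the first exercise, the calculation can be sketched as follows. It assumes that values go missing independently across columns, which the exercise does not state. Under that assumption a record is complete with probability 0.9^20, so roughly 88% of records have at least one missing attribute. The simulation uses synthetic data only as a sanity check.

```python
import numpy as np

n_attributes = 20      # columns in the data set
p_missing = 0.10       # fraction of missing values in each column

# Assumption: missingness is independent across columns.
p_complete = (1 - p_missing) ** n_attributes
p_at_least_one = 1 - p_complete
print(f"analytic:  {p_at_least_one:.4f}")

# Sanity check on synthetic data (hypothetical 100,000 records).
rng = np.random.default_rng(0)
mask = rng.random((100_000, n_attributes)) < p_missing   # True = missing
simulated = mask.any(axis=1).mean()
print(f"simulated: {simulated:.4f}")
```

If missingness is concentrated in the same records (for example, a few respondents skip most questions), the share of affected records can be much lower than this figure.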
• How should missing data be replaced in attributes such as Age, Income, Gender, daily
rainfall amount, population of a country/state/city, or revenue of a company?
• One technique is to use the average for numeric values. However, the average
cannot (or should not) be used for Age (why?).
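As a starting point for that discussion, here is a minimal sketch of three simple fill-in choices: mean, median, and most frequent category. The toy values and column contents are hypothetical and are not taken from the course data. It illustrates one common concern with the plain average, namely that a single extreme value can pull it far from a typical value; it is not meant as the full answer to the question about Age.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; not from the course data set.
df = pd.DataFrame({
    "Income": [30.0, 35.0, 40.0, np.nan, 500.0],   # one extreme value
    "Gender": ["F", "M", None, "F", "F"],
})

mean_income = df["Income"].mean()       # pulled up by the extreme value
median_income = df["Income"].median()   # robust to the extreme value
mode_gender = df["Gender"].mode()[0]    # most frequent category

# Fill the numeric column with the median and the categorical one with the mode.
filled = df.copy()
filled["Income"] = filled["Income"].fillna(median_income)
filled["Gender"] = filled["Gender"].fillna(mode_gender)
print(mean_income, median_income, mode_gender)
```

A mean has no meaning for a categorical attribute such as Gender, so the most frequent value is used there instead.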
• Class exercise: Read the CSV file Cost_of_power into pandas, then count the number of missing values:
• In each column
• In each row
• Across the dataset
You can use the pandas functions isnull().sum() or isna().sum() on the resulting data frame to answer
these questions (pass axis=1 to count per row, and call sum() twice for the whole dataset).
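The three counts can be sketched as follows. The small data frame is a hypothetical stand-in, since the contents of the real file are not shown here; the "Units" column in particular is invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Cost_of_power data; the real file has its own columns.
df = pd.DataFrame({
    "Month": [1, 2, 3, 4],
    "Cost": [10.5, np.nan, 12.0, np.nan],
    "Units": [100.0, 110.0, np.nan, np.nan],
})

missing_per_column = df.isna().sum()          # one count per column
missing_per_row = df.isna().sum(axis=1)       # one count per row
missing_total = int(df.isna().sum().sum())    # single number for the data set

print(missing_per_column)
print(missing_per_row)
print(missing_total)
```

isnull() and isna() are aliases in pandas, so either one gives the same counts.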
Missing Data Handling in Python: Full Code
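import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn
import numpy as np
# cost_of_power is the data frame read from the Cost of Power CSV file with pd.read_csv
# Count the missing values in the Cost column
print(cost_of_power['Cost'].isnull().sum())
# Option 1: drop every row that has a missing value in any column, then renumber the index
cost_of_power = cost_of_power.dropna()
cost_of_power = cost_of_power.reset_index(drop=True)
# OR
# Option 2: drop only the rows where Cost is missing (modifies the data frame in place)
cost_of_power.dropna(subset=['Cost'], inplace=True)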
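The two dropna variants above are not equivalent, which a small sketch on hypothetical data makes visible. The column names and values below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one gap in Cost and one gap in Units.
df = pd.DataFrame({
    "Month": [1, 2, 3, 4],
    "Cost": [10.5, np.nan, 12.0, 13.0],
    "Units": [100.0, 110.0, np.nan, 130.0],
})

# Drops a row if ANY column is missing in it.
dropped_any = df.dropna().reset_index(drop=True)

# Drops a row only if Cost is missing; the gap in Units is kept.
dropped_cost = df.dropna(subset=["Cost"]).reset_index(drop=True)

print(len(dropped_any), len(dropped_cost))
```

Calling reset_index(drop=True) renumbers the remaining rows from 0, so later positional code does not trip over gaps in the index.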
Missing Data Handling in Python
• Check index values in the data frame
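cost_of_power = pd.read_csv(r'C:\Users\IFIM\Documents\Academic\IFIM\Teaching\Predictive Analytics in Business Apr 22\Cost of Power.csv')
Z = cost_of_power[["Month","Cost"]].to_numpy()
# imp must be a fitted imputer; its definition is not shown in the original code.
# Assumption: scikit-learn's SimpleImputer with the mean strategy (requires numeric columns).
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='mean').fit(Z)
cost_of_power_new = imp.transform(Z)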
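To see what a fit/transform imputer such as imp does, here is a minimal sketch on a toy array. It assumes scikit-learn's SimpleImputer with the mean strategy; the strategy actually used in the lecture is not stated, and the array values are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric array: first column complete, second column has one gap.
Z = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [3.0, 14.0]])

# Assumption: mean imputation; "median" and "most_frequent" are other built-in strategies.
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
imp.fit(Z)                 # learns one fill value per column from the observed entries
Z_new = imp.transform(Z)   # replaces each NaN with its column's fill value

print(imp.statistics_)     # learned fill values, one per column
print(Z_new)
```

The result of transform is a NumPy array, so column names have to be reattached if a data frame is needed afterwards.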
How Did the Missing Value Imputation Perform?
https://machinelearningmastery.com/statistical-imputation-for-missing-values-in-machine-learning/
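One simple way to approach this question, sketched here under stated assumptions, is to hide some values that are actually known, impute them, and measure how far the imputed values are from the hidden ones. The data below is synthetic, the 10% masking rate is arbitrary, and mean imputation stands in for whichever method is being evaluated; this is not necessarily the evaluation used in the linked article.

```python
import numpy as np

rng = np.random.default_rng(0)
true_values = rng.normal(loc=50.0, scale=10.0, size=1_000)  # synthetic column

# Hide about 10% of the known values at random.
hidden = rng.random(true_values.size) < 0.10
observed = true_values.copy()
observed[hidden] = np.nan

# Impute with the mean of the values that are still observed.
fill_value = np.nanmean(observed)
imputed = np.where(np.isnan(observed), fill_value, observed)

# Mean absolute error on the entries that were hidden.
mae = np.mean(np.abs(imputed[hidden] - true_values[hidden]))
print(f"MAE on hidden entries: {mae:.2f}")
```

Repeating the same procedure with a different imputation method gives a second error figure to compare against.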
Missing Data Mechanisms
• Types of missing values:
• Missing Completely at Random (MCAR)
• Missing at Random (MAR)
• Missingness that depends on unobserved predictors
• Missingness that depends on the missing value itself
https://www.kaggle.com/code/alexisbcook/missing-values
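The practical difference between these mechanisms can be sketched with a small simulation. The data is synthetic, and the "income" variable, the 30% rates, and the threshold rule are all assumptions chosen for illustration. It contrasts MCAR with missingness that depends on the missing value itself.

```python
import numpy as np

rng = np.random.default_rng(0)
income = rng.lognormal(mean=10.0, sigma=0.5, size=50_000)  # synthetic values

# MCAR: every value has the same 30% chance of being missing.
mcar_missing = rng.random(income.size) < 0.30

# Missingness depends on the value itself: the top 30% of incomes are withheld.
self_missing = income > np.quantile(income, 0.70)

true_mean = income.mean()
mcar_mean = income[~mcar_missing].mean()   # stays close to the true mean
self_mean = income[~self_missing].mean()   # biased downward

print(f"true: {true_mean:.0f}  MCAR: {mcar_mean:.0f}  value-dependent: {self_mean:.0f}")
```

Under MCAR the observed values remain a representative sample, so dropping rows or filling with the observed mean loses information but does not systematically distort the mean. When missingness depends on the value itself, both approaches inherit the bias.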