Intermediate Machine Learning
Step 2: Missing Values
1. Introduction:
Importance of Handling Missing Values:
o Many datasets have missing values, and many scikit-learn estimators raise an
error when fitted on data that contains them.
o Left unhandled, missing values cause errors at fit time; handled carelessly, they
can bias predictions.
2. Three Approaches to Handling Missing Values:
Approach 1: Drop Columns with Missing Values
Approach 2: Imputation
Approach 3: Imputation with an Extension (Add a Missing Indicator)
3. Investigating Missing Values:
Check for Missing Values: Use pandas functions to identify missing values in the
dataset.
Example Code:
import pandas as pd
# Load data
data = pd.read_csv('train.csv')
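# Select target and features
y = data.SalePrice
# To keep things simple, use only numeric predictors
# (median imputation and the random forest used below need numeric input)
X = data.drop(['SalePrice'], axis=1).select_dtypes(exclude=['object'])
# Break off validation set from training data
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, train_size=0.8, test_size=0.2, random_state=0)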
# Shape of training data (num_rows, num_columns)
print(X_train.shape)
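The snippet above loads and splits the data but does not count the missing values themselves. A minimal sketch of that check follows. The toy DataFrame and its values are made up for illustration and are not taken from train.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for X_train
toy = pd.DataFrame({
    "LotArea": [8450, 9600, np.nan, 9550],
    "GarageYrBlt": [2003, np.nan, np.nan, 1998],
    "YearBuilt": [2003, 1976, 2001, 1915],
})

# Number of missing entries in each column
missing_val_count_by_column = toy.isnull().sum()

# Show only the columns that actually have missing entries
print(missing_val_count_by_column[missing_val_count_by_column > 0])
```

Running the same two lines on X_train shows which columns the three approaches below will have to deal with.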
4. Approach 1: Drop Columns with Missing Values:
When to Use:
o When a column has many missing values.
o When the column is not critical for analysis.
Example Code:
# Get names of columns with missing values
cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]
# Drop columns in training and validation data
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)
# Check the shape of reduced data
print(reduced_X_train.shape)
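The code above drops every column that has at least one missing value. To drop only columns with many missing values, one possible variant filters by missing fraction. This is a sketch: the 0.5 threshold and the toy data are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for X_train
toy = pd.DataFrame({
    "A": [1.0, np.nan, np.nan, np.nan],  # 75% missing
    "B": [1.0, 2.0, np.nan, 4.0],        # 25% missing
    "C": [1.0, 2.0, 3.0, 4.0],           # complete
})

threshold = 0.5  # assumed cut-off; tune per dataset

# Fraction of missing entries per column
missing_fraction = toy.isnull().mean()

# Columns whose missing fraction exceeds the threshold
cols_to_drop = missing_fraction[missing_fraction > threshold].index.tolist()

reduced = toy.drop(columns=cols_to_drop)
print(cols_to_drop)
print(reduced.columns.tolist())
```

Columns that remain can still contain missing values, so this variant would normally be followed by imputation. The same columns must be dropped from the validation data.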
5. Approach 2: Imputation:
Definition:
o Imputation is the process of filling in missing values with substituted values.
Common Strategies:
o Mean Imputation: Replace missing values with the mean of the column.
o Median Imputation: Replace missing values with the median of the column.
o Most Frequent Imputation: Replace missing values with the most frequent
value in the column.
Example Code:
from sklearn.impute import SimpleImputer
# Imputation
my_imputer = SimpleImputer(strategy='median')
# Imputation on training and validation data
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))
# Imputation removed column names; put them back
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns
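To see how the three strategies listed above differ, the sketch below fills the same missing entry with each of them. The toy column is made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy column: values 1, 2, 2, 10 and one missing entry (last row)
col = np.array([[1.0], [2.0], [2.0], [10.0], [np.nan]])

filled = {}
for strategy in ["mean", "median", "most_frequent"]:
    imputer = SimpleImputer(strategy=strategy)
    # Record the value that replaced the missing entry
    filled[strategy] = imputer.fit_transform(col)[-1, 0]

print(filled)
```

The outlier (10) pulls the mean up to 3.75, while the median and the most frequent value are both 2. This is one reason the median is often preferred for skewed columns.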
6. Approach 3: Imputation with an Extension (Add a Missing Indicator):
Extension of Imputation:
o Combine imputation with an additional indicator column that shows where the
missing values were.
Why Use It:
o It allows the model to account for the fact that certain values were missing,
which might be informative.
Example Code:
from sklearn.impute import SimpleImputer
# Make copy to avoid changing original data (when imputing)
X_train_plus = X_train.copy()
X_valid_plus = X_valid.copy()
# Make new columns indicating what will be imputed
for col in cols_with_missing:
    X_train_plus[col + '_was_missing'] = X_train_plus[col].isnull()
    X_valid_plus[col + '_was_missing'] = X_valid_plus[col].isnull()
# Imputation
my_imputer = SimpleImputer(strategy='median')
imputed_X_train_plus = pd.DataFrame(my_imputer.fit_transform(X_train_plus))
imputed_X_valid_plus = pd.DataFrame(my_imputer.transform(X_valid_plus))
# Imputation removed column names; put them back
imputed_X_train_plus.columns = X_train_plus.columns
imputed_X_valid_plus.columns = X_valid_plus.columns
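As an alternative to building the indicator columns by hand, SimpleImputer can append them itself through its add_indicator parameter. This is a sketch on made-up toy data, not the method used in the notes above.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix: column 0 has a missing value, column 1 does not
X_toy = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, 30.0],
])

# add_indicator=True appends one 0/1 column for each feature that had
# missing values when the imputer was fitted
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_out = imputer.fit_transform(X_toy)

print(X_out)
```

The result has three columns: the two imputed features, followed by one indicator column for column 0. Column 1 gets no indicator because it had no missing values during fitting.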
7. Scoring the Approaches:
Scoring Approach: Use Mean Absolute Error (MAE) to compare the different
approaches.
Example Code:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
# Function to compare MAE with different approaches
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)
# Score for Approach 1 (Drop Columns with Missing Values)
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)
print("MAE (Drop columns with missing values):")
print(score_dataset(reduced_X_train, reduced_X_valid, y_train,
y_valid))
# Score for Approach 2 (Imputation)
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns
print("MAE (Imputation):")
print(score_dataset(imputed_X_train, imputed_X_valid, y_train,
y_valid))
# Score for Approach 3 (Imputation with Extension)
imputed_X_train_plus = pd.DataFrame(my_imputer.fit_transform(X_train_plus))
imputed_X_valid_plus = pd.DataFrame(my_imputer.transform(X_valid_plus))
imputed_X_train_plus.columns = X_train_plus.columns
imputed_X_valid_plus.columns = X_valid_plus.columns
print("MAE (Imputation with Extension):")
print(score_dataset(imputed_X_train_plus, imputed_X_valid_plus,
y_train, y_valid))
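For reference, MAE is the average of the absolute differences between true and predicted values, so lower is better. The sketch below computes it by hand and checks it against scikit-learn. The prices are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up true values and predictions, for illustration only
y_true = np.array([200000, 150000, 320000])
y_pred = np.array([210000, 140000, 300000])

# MAE is the mean of the absolute errors
mae_manual = np.mean(np.abs(y_true - y_pred))
mae_sklearn = mean_absolute_error(y_true, y_pred)

print(mae_manual, mae_sklearn)
```

Because MAE is in the same units as the target, a value of about 13,333 here means the predictions are off by roughly that many dollars on average.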
8. Conclusion:
Key Takeaways:
o Approach 1 (Drop Columns with Missing Values): Simple but may lose
important information.
o Approach 2 (Imputation): Retains data, but the choice of imputation strategy
can affect model performance.
o Approach 3 (Imputation with Extension): Combines the benefits of
imputation with added indicators for missing values, which can provide
additional information to the model.
Final Thoughts: Handling missing values effectively is crucial for building accurate
and robust machine learning models. Choose the appropriate method based on the
nature of your data and the specific requirements of your analysis.
Exercise (full code)
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
# Load data
data = pd.read_csv('train.csv')
# Select target and features
y = data.SalePrice
# To keep things simple, use only numeric predictors
X = data.drop(['SalePrice'], axis=1).select_dtypes(exclude=['object'])
# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, train_size=0.8, test_size=0.2, random_state=0)
# Shape of training data (num_rows, num_columns)
print(X_train.shape)
# Define function to measure quality of each approach
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)
# Approach 1: Drop columns with missing values
# Get names of columns with missing values
cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]
# Drop columns in training and validation data
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)
# Score dataset
print("MAE (Drop columns with missing values):")
print(score_dataset(reduced_X_train, reduced_X_valid, y_train, y_valid))
# Approach 2: Imputation
my_imputer = SimpleImputer(strategy='median')
# Imputation on training and validation data
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))
# Imputation removed column names; put them back
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns
# Score dataset
print("MAE (Imputation):")
print(score_dataset(imputed_X_train, imputed_X_valid, y_train, y_valid))
# Approach 3: Imputation with an Extension
# Make copy to avoid changing original data (when imputing)
X_train_plus = X_train.copy()
X_valid_plus = X_valid.copy()
# Make new columns indicating what will be imputed
for col in cols_with_missing:
    X_train_plus[col + '_was_missing'] = X_train_plus[col].isnull()
    X_valid_plus[col + '_was_missing'] = X_valid_plus[col].isnull()
# Imputation
my_imputer = SimpleImputer(strategy='median')
imputed_X_train_plus = pd.DataFrame(my_imputer.fit_transform(X_train_plus))
imputed_X_valid_plus = pd.DataFrame(my_imputer.transform(X_valid_plus))
# Imputation removed column names; put them back
imputed_X_train_plus.columns = X_train_plus.columns
imputed_X_valid_plus.columns = X_valid_plus.columns
# Score dataset
print("MAE (Imputation with Extension):")
print(score_dataset(imputed_X_train_plus, imputed_X_valid_plus, y_train,
y_valid))
Explanation:
1. Loading Data: Load the dataset from a CSV file.
2. Selecting Target and Features: Define the target variable y and the feature variables
X.
3. Splitting Data: Split the data into training and validation sets using
train_test_split.
4. Defining the Scoring Function: Define a function to measure the mean absolute
error (MAE) for each approach.
5. Approach 1 - Drop Columns with Missing Values: Identify columns with missing
values, drop them, and score the dataset.
6. Approach 2 - Imputation: Use SimpleImputer to impute missing values with the
median and score the dataset.
7. Approach 3 - Imputation with an Extension: Add indicators for missing values,
impute missing values, and score the dataset.
Step 3: Categorical Variables
1. Introduction:
Definition: Categorical variables are variables that contain label values rather than
numeric values.
Importance: Many machine learning models require all input features to be numeric,
so categorical variables need to be converted to a suitable numeric format.
2. Methods to Handle Categorical Variables:
Method 1: Drop Categorical Variables
Method 2: Label Encoding
Method 3: One-Hot Encoding
3. Investigating Categorical Variables:
Check for Categorical Variables: Use pandas functions to identify categorical
variables in the dataset.
Example Code:
import pandas as pd
# Load data
data = pd.read_csv('train.csv')
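# Select target and features
y = data.SalePrice
X = data.drop(['SalePrice'], axis=1)
# To keep things simple, drop columns with missing values
cols_with_missing = [col for col in X.columns if X[col].isnull().any()]
X = X.drop(cols_with_missing, axis=1)
# Break off validation set from training data
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, train_size=0.8, test_size=0.2, random_state=0)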
# Get list of categorical variables
s = (X_train.dtypes == 'object')
object_cols = list(s[s].index)
print("Categorical variables:")
print(object_cols)
4. Method 1: Drop Categorical Variables:
When to Use:
o When categorical variables are not critical for the analysis.
Example Code:
# Drop categorical variables
drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_valid = X_valid.select_dtypes(exclude=['object'])
# Define function to measure quality of each approach
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)
print("MAE (Drop categorical variables):")
print(score_dataset(drop_X_train, drop_X_valid, y_train, y_valid))
5. Method 2: Label Encoding:
Definition:
o Label Encoding assigns each unique value in a categorical column an integer
value.
When to Use:
o When the categorical variable has an ordinal relationship (e.g., 'low', 'medium',
'high').
Example Code:
from sklearn.preprocessing import LabelEncoder
# Make copy to avoid changing original data
label_X_train = X_train.copy()
label_X_valid = X_valid.copy()
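# Apply a label encoder to each column with categorical data.
# LabelEncoder works on one column at a time, so fit a fresh encoder per column.
# This assumes every category in the validation data also appears in the training data.
for col in object_cols:
    label_encoder = LabelEncoder()
    label_X_train[col] = label_encoder.fit_transform(X_train[col])
    label_X_valid[col] = label_encoder.transform(X_valid[col])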
print("MAE (Label Encoding):")
print(score_dataset(label_X_train, label_X_valid, y_train, y_valid))
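One caveat for ordinal data: by default the encoders number categories in sorted (alphabetical) order, so 'high' would come before 'low'. When the order matters, it can be spelled out explicitly. This is a minimal sketch using OrdinalEncoder on a hypothetical "quality" column.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal column, for illustration only
toy = pd.DataFrame({"quality": ["low", "high", "medium", "low"]})

# Spell out the intended order; without this, categories are sorted
# alphabetically (high, low, medium)
encoder = OrdinalEncoder(categories=[["low", "medium", "high"]])
encoded = encoder.fit_transform(toy[["quality"]])

print(encoded.ravel())
```

With the explicit list, low maps to 0, medium to 1, and high to 2, so the integers follow the real ranking.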
6. Method 3: One-Hot Encoding:
Definition:
o One-Hot Encoding creates new binary columns indicating the presence of each
possible value in the original column.
When to Use:
o When the categorical variable does not have an ordinal relationship and has a
relatively low number of unique values.
Example Code:
We set handle_unknown='ignore' to avoid errors when the validation data
contains classes that aren't represented in the training data, and
setting sparse_output=False (named sparse before scikit-learn 1.2) ensures
that the encoded columns are returned as a numpy array (instead of a sparse matrix).
from sklearn.preprocessing import OneHotEncoder
# Apply one-hot encoder to each column with categorical data
# (sparse_output was named sparse before scikit-learn 1.2)
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[object_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[object_cols]))
# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index
# Remove categorical columns (will replace with one-hot encoding)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)
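# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)
# Ensure all column names are strings (the one-hot columns are named by integer)
OH_X_train.columns = OH_X_train.columns.astype(str)
OH_X_valid.columns = OH_X_valid.columns.astype(str)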
print("MAE (One-Hot Encoding):")
print(score_dataset(OH_X_train, OH_X_valid, y_train, y_valid))
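To make the effect of handle_unknown='ignore' concrete, the sketch below encodes a validation column that contains a category never seen in training. The toy values are made up. The dense/sparse constructor argument is left out and .toarray() is used instead, so the sketch does not depend on the scikit-learn version.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training and validation columns
train = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"]})
valid = pd.DataFrame({"Street": ["Grvl", "Dirt"]})  # "Dirt" is unseen

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# The default output is a sparse matrix; .toarray() makes it dense
encoded_valid = encoder.transform(valid).toarray()

print(encoder.categories_)
print(encoded_valid)
```

The unseen category "Dirt" becomes a row of all zeros rather than raising an error.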
7. Conclusion:
Key Takeaways:
o Dropping Categorical Variables: Simple but may lose important
information.
o Label Encoding: Suitable for ordinal categorical variables.
o One-Hot Encoding: Suitable for nominal categorical variables with relatively
few unique values.
Final Thoughts: Choose the appropriate method for handling categorical variables
based on the nature of your data and the specific requirements of your analysis.
Exercise and code with notes of this step:
Dropping Categorical Columns
Objective: Remove columns with categorical data and evaluate model performance.
# Import necessary libraries and load data
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
# Read the data
X = pd.read_csv('../input/train.csv', index_col='Id')
X_test = pd.read_csv('../input/test.csv', index_col='Id')
# Remove rows with missing target, separate target from predictors
X.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X.SalePrice
X.drop(['SalePrice'], axis=1, inplace=True)
# To keep things simple, drop columns with missing values
cols_with_missing = [col for col in X.columns if X[col].isnull().any()]
X.drop(cols_with_missing, axis=1, inplace=True)
X_test.drop(cols_with_missing, axis=1, inplace=True)
# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(X, y,
train_size=0.8,
test_size=0.2,
random_state=0)
# Function to score the dataset using Random Forest Regressor
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)
# Drop categorical columns in training and validation sets
drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_valid = X_valid.select_dtypes(exclude=['object'])
# Check MAE from dropping categorical columns
print("MAE from Approach 1 (Drop categorical variables):")
print(score_dataset(drop_X_train, drop_X_valid, y_train, y_valid))
Result: MAE from Approach 1 (Drop categorical variables): 17837.83
Ordinal Encoding
Objective: Use ordinal encoding for categorical variables and evaluate model performance.
from sklearn.preprocessing import OrdinalEncoder
# Identify categorical columns
object_cols = [col for col in X_train.columns if X_train[col].dtype ==
"object"]
# Identify categorical columns that can be safely ordinal encoded
good_label_cols = [col for col in object_cols if
set(X_valid[col]).issubset(set(X_train[col]))]
# Identify problematic categorical columns that will be dropped
bad_label_cols = list(set(object_cols) - set(good_label_cols))
# Print categorical columns for ordinal encoding and columns to be dropped
print('Categorical columns that will be ordinal encoded:', good_label_cols)
print('\nCategorical columns that will be dropped from the dataset:',
bad_label_cols)
# Drop categorical columns that will not be encoded
label_X_train = X_train.drop(bad_label_cols, axis=1)
label_X_valid = X_valid.drop(bad_label_cols, axis=1)
# Apply ordinal encoder
ordinal_encoder = OrdinalEncoder()
label_X_train[good_label_cols] = ordinal_encoder.fit_transform(X_train[good_label_cols])
label_X_valid[good_label_cols] = ordinal_encoder.transform(X_valid[good_label_cols])
# Check MAE from ordinal encoding approach
print("MAE from Approach 2 (Ordinal Encoding):")
print(score_dataset(label_X_train, label_X_valid, y_train, y_valid))
Result: MAE from Approach 2 (Ordinal Encoding): 17098.02
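The exercise drops columns whose validation data contains categories unseen in training. A possible alternative, assuming scikit-learn 0.24 or later, is to keep those columns and map unseen categories to a sentinel value. This sketch uses made-up data and is not what produced the MAE reported above.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical training and validation columns
train = pd.DataFrame({"RoofMatl": ["CompShg", "Tar&Grv", "CompShg"]})
valid = pd.DataFrame({"RoofMatl": ["Tar&Grv", "Metal"]})  # "Metal" is unseen

# Unseen categories are encoded as -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train)

encoded_valid = encoder.transform(valid)
print(encoded_valid.ravel())
```

Whether this beats dropping the column depends on the data, so it is worth scoring both variants with score_dataset.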
Investigating Cardinality
Objective: Understand the cardinality of categorical variables.
# Get number of unique entries in each column with categorical data
object_nunique = list(map(lambda col: X_train[col].nunique(), object_cols))
d = dict(zip(object_cols, object_nunique))
# Print number of unique entries by column, in ascending order
sorted(d.items(), key=lambda x: x[1])
Output:
[('Street', 2), ('Utilities', 2), ('CentralAir', 2), ('LandSlope', 3),
('PavedDrive', 3), ('LotShape', 4), ('LandContour', 4), ('ExterQual', 4),
('KitchenQual', 4), ('MSZoning', 5), ('LotConfig', 5), ('BldgType', 5),
('ExterCond', 5), ('HeatingQC', 5), ('Condition2', 6), ('RoofStyle', 6),
('Foundation', 6), ('Heating', 6), ('Functional', 6), ('SaleCondition', 6),
('RoofMatl', 7), ('HouseStyle', 8), ('Condition1', 9), ('SaleType', 9),
('Exterior1st', 15), ('Exterior2nd', 16), ('Neighborhood', 25)]
Observations:
Categorical variables have varying numbers of unique entries (cardinality).
Some variables have high cardinality (>10), which may impact model performance and
dataset size if one-hot encoded.
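To put a number on the dataset-size concern: with default settings, one-hot encoding adds one column per unique value. The sketch below totals the cardinalities copied from the output above. The cut-off of 10 is an assumption that matches the threshold used in the next step.

```python
# Cardinalities copied from the output above
cardinality = {
    'Street': 2, 'Utilities': 2, 'CentralAir': 2, 'LandSlope': 3,
    'PavedDrive': 3, 'LotShape': 4, 'LandContour': 4, 'ExterQual': 4,
    'KitchenQual': 4, 'MSZoning': 5, 'LotConfig': 5, 'BldgType': 5,
    'ExterCond': 5, 'HeatingQC': 5, 'Condition2': 6, 'RoofStyle': 6,
    'Foundation': 6, 'Heating': 6, 'Functional': 6, 'SaleCondition': 6,
    'RoofMatl': 7, 'HouseStyle': 8, 'Condition1': 9, 'SaleType': 9,
    'Exterior1st': 15, 'Exterior2nd': 16, 'Neighborhood': 25,
}

threshold = 10  # assumed cut-off between low and high cardinality

low = {col: n for col, n in cardinality.items() if n < threshold}
high = {col: n for col, n in cardinality.items() if n >= threshold}

# Number of columns in each group and the one-hot columns they would create
print(len(low), sum(low.values()))
print(len(high), sum(high.values()))
```

The 24 low-cardinality columns expand to 122 one-hot columns, while the 3 high-cardinality columns alone would add another 56.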
One-Hot Encoding
Objective: Apply one-hot encoding to categorical variables with low cardinality and evaluate
model performance.
from sklearn.preprocessing import OneHotEncoder
# Identify columns for one-hot encoding (low cardinality)
low_cardinality_cols = [col for col in object_cols if
X_train[col].nunique() < 10]
# Identify columns to be dropped (high cardinality)
high_cardinality_cols = list(set(object_cols) - set(low_cardinality_cols))
# Print columns for one-hot encoding and columns to be dropped
print('Categorical columns that will be one-hot encoded:',
low_cardinality_cols)
print('\nCategorical columns that will be dropped from the dataset:',
high_cardinality_cols)
# Initialize one-hot encoder and apply to low cardinality columns
# (sparse_output was named sparse before scikit-learn 1.2)
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[low_cardinality_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[low_cardinality_cols]))
# Indexing back to original indices
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index
# Drop categorical columns and concatenate with one-hot encoded columns
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)
# Ensure all columns have string type
OH_X_train.columns = OH_X_train.columns.astype(str)
OH_X_valid.columns = OH_X_valid.columns.astype(str)
# Check MAE from one-hot encoding approach
print("MAE from Approach 3 (One-Hot Encoding):")
print(score_dataset(OH_X_train, OH_X_valid, y_train, y_valid))
Result: MAE from Approach 3 (One-Hot Encoding): 17525.35