Class Assignment on Decision Trees
Name: Ansari Mohammed Shanouf Valijan
Class: B.E. Computer Engineering, Semester - VII
UID: 2021300004
Batch: Monday
Aim:
To implement decision trees for regression analysis on a healthcare dataset.
Dataset Description:
Here, in order to construct the decision tree, the Body Mass Index Detection dataset was
utilized.
(https://www.kaggle.com/datasets/sayanroy058/body-mass-index-detection)
The idea was to predict the BMI of a person given his/her age, weight, bio-impudence and
gender. The dataset has about 741 records.
Implementation:
Following is a step-by-step implementation of the task at hand-
Link to Notebook -> DecisionTreeRegression
Importing the necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.preprocessing import LabelEncoder
import seaborn as sns
Importing the dataset
df = pd.read_csv('/content/Body Mass Index.csv')
Dropping irrelevant columns and encoding the categorical columns
df = df.drop(columns=['BmiClass'])
label_encoder = LabelEncoder()
df['Gender_encoded'] = label_encoder.fit_transform(df['Gender'])
df = df.drop(columns=['Gender'])
Visualizing the various features of the dataset to better understand it
numeric_columns = df.select_dtypes(include=['float64', 'int64']).columns
for col in numeric_columns:
plt.figure(figsize=(8, 4))
sns.histplot(df[col], kde=True, bins=30)
plt.title(f'Distribution of {col}')
plt.show()
categorical_columns = df.select_dtypes(include=['object']).columns
for col in categorical_columns:
plt.figure(figsize=(8, 4))
sns.countplot(data=df, x=col)
plt.title(f'Count of {col}')
plt.show()
Viewing the correlation among different features present in the dataset
corr_matrix = df.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix')
plt.show()
The above plot clearly depicts a high dependence of BMI on weight, which is quite logical.
Further, height shows a correlation almost half as strong as weight, still an important factor
to take into consideration. Age seems to have the least positive correlation with the BMI.
Viewing pair-wise plots
sns.pairplot(df, hue='Bmi')
plt.show()
In the above plots, darker hues (purple in colour) depict higher BMI values and as can be
observed, almost all features with values towards higher end are pointing towards a high
BMI value. An exception to this is the Bio Impudence v/s Height plot where high BMI values
seem to be scattered.
Splitting the processed and analysed dataset into train and test sets
X = df.drop(columns='Bmi')
y = df['Bmi']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
Defining the decision tree regressor model and training it (parameters were chosen after
experimenting with different configurations and choosing the ones that avoided overfitting)
regressor = DecisionTreeRegressor(
max_depth=25,
min_samples_split=40,
min_samples_leaf=15,
max_features='sqrt',
random_state=10
)
regressor.fit(X_train, y_train)
Evaluating the model
y_pred = regressor.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print(f"Mean Absolute Error (MAE): {mae}")
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")
print(f"R-squared (R^2): {r2}")
Following performance parameters were obtained on training dataset-
Mean Absolute Error (MAE): 1.85
Mean Squared Error (MSE): 10.16
Root Mean Squared Error (RMSE): 3.19
R-squared (R^2): 0.89
Following performance parameters were obtained on test dataset-
Mean Absolute Error (MAE): 2.1160518106723467
Mean Squared Error (MSE): 10.597756621559329
Root Mean Squared Error (RMSE): 3.255419576883958
R-squared (R^2): 0.8517373327150053
Printing the decision tree as hypothesized
plt.figure(figsize=(20, 10))
plot_tree(regressor,
feature_names=X.columns,
filled=True,
rounded=True,)
plt.title('Decision Tree Visualization')
plt.show()
Decision tree that was hypothesized for the regression task is as follows-
Conclusion:
By implementing the assigned task, I was able to brush up on the basic concepts associated
with building a decision tree. I was able to build, train and test the tree in python and was
able to come up with the following inferences-
For the assigned regression task, the analysis, logically, entailed a heavy dependence
on weight and height as features for the prediction of body mass index of an
individual.
The model trained initially had a test r-square value of 0.98 which was identified as
overfitting. The rectified model, then, had the test r-square value of around 0.8517
while the r-square value on training data was approximately 0.89.