Excited to share my recent hands-on machine learning practice with K-Nearest Neighbors (KNN) regression. 🚀
I implemented KNN regression on the Diamonds dataset, focusing on:
Data cleaning and preprocessing (scaling & splitting data)
Hyperparameter tuning with GridSearchCV
Evaluating the model with R², MAE, and RMSE
This exercise helped me understand how distance-based algorithms work for regression tasks and how much parameter tuning affects performance (the short manual-prediction sketch at the end of the code below makes the distance-based idea concrete).
Looking forward to exploring more ML algorithms and sharing my journey here!
#MachineLearning #KNN #Regression #Python #Learning
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
""" Load the dataset """""
df = pd.read_csv("diamonds.csv")
print("Initial Data:")
print(df.head())
""" Data Cleaning"""
# Drop unwanted columns (like unnamed index column if present)
if "Unnamed: 0" in df.columns:
df.drop("Unnamed: 0", axis=1, inplace=True)
""" Check for missing values"""
print("\nMissing values before cleaning:")
print(df.isnull().sum())
""" Fill or drop missing values (here we drop rows with null values)"""
df.dropna(inplace=True)
# Encode categorical columns (cut, color, clarity are categorical in diamonds dataset)
categorical_cols = ['cut', 'color', 'clarity']
df = pd.get_dummies(df, columns=categorical_cols, drop_first=True)
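# Note: cut, color, and clarity are ordinal (e.g. Fair < Good < Very Good < Premium < Ideal),
# so an alternative sketch could encode that order explicitly instead of one-hot encoding:
# from sklearn.preprocessing import OrdinalEncoder
# enc = OrdinalEncoder(categories=[['Fair', 'Good', 'Very Good', 'Premium', 'Ideal'],
#                                  list('JIHGFED'),  # color: worst to best
#                                  ['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF']])
# df[categorical_cols] = enc.fit_transform(df[categorical_cols])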
print("\nData after cleaning:")
print(df.head())
""" Feature & Target Split"""
X = df.drop("price", axis=1) # Features
y = df["price"] # Target variable (regression)
""" Train-Test Split """
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Feature scaling: KNN is distance-based, so features must be on comparable scales
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
""" Model Building (KNN Regression)"""
knn = KNeighborsRegressor()
# Hyperparameter tuning using GridSearchCV
param_grid = {'n_neighbors': [3, 5, 7, 9, 11]}
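# A broader (hypothetical) grid could also tune the weighting scheme and distance metric:
# param_grid = {'n_neighbors': [3, 5, 7, 9, 11],
#               'weights': ['uniform', 'distance'],
#               'p': [1, 2]}  # p=1 Manhattan, p=2 Euclidean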
grid = GridSearchCV(knn, param_grid, cv=5, scoring='r2')  # 5-fold cross-validation
grid.fit(X_train_scaled, y_train)
print("\nBest Parameters:", grid.best_params_)
# Final model with best parameters
best_knn = grid.best_estimator_
""" Model Evaluation"""
y_pred = best_knn.predict(X_test_scaled)
print("\nModel Performance:")
print("R2 Score:", r2_score(y_test, y_pred))
print("Mean Absolute Error:", mean_absolute_error(y_test, y_pred))
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))