ICS322 Machine Learning
ML mini project assignment
Name: CV KUSHAL KUMAR
Roll no: 2022BCS0098
Batch: 2
Machine Learning Datasets and Algorithms
This notebook explores suitable datasets from popular repositories and applies different ML algorithms to them:
1. Simple Linear Regression with Iris Dataset
2. Multiple Linear Regression with Iris Dataset
3. Decision Tree with Social network ads Dataset
4. Logistic Regression with Social network ads Dataset
5. Support Vector Machine (SVM) with Social network ads Dataset
6. K-Nearest Neighbors (KNN) with Social network ads Dataset
7. K-Means Clustering with Mall customers Dataset
8. Hierarchical Clustering with Mall customers Dataset
Below, we implement these algorithms with preprocessing, evaluation, and visualizations.
Implementation and comparison
Importing required packages
In [30]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.cluster import KMeans
import scipy.cluster.hierarchy as sch
from sklearn.cluster import AgglomerativeClustering
The dataset contains five numerical columns (SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm,
Id) and one categorical column (Species). I'll perform:
Simple Linear Regression: Predict PetalLengthCm using SepalLengthCm.
Multiple Linear Regression: Predict PetalLengthCm using SepalLengthCm, SepalWidthCm, and PetalWidthCm.
Iris.csv → Regression Models (Simple & Multiple Linear Regression)
Relevance:
Although the Iris dataset is usually used for classification, here a numerical feature (Petal Length) is treated as the
dependent variable so that the dataset can be used for regression analysis.
Models Used & Justification:
Simple Linear Regression: predicts one feature from a single other feature.
Multiple Linear Regression: predicts one feature from several other features.
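For reference, the two fitted models have the standard linear forms (writing ŷ for the predicted PetalLengthCm):
$$\hat{y} = \beta_0 + \beta_1\,\text{SepalLengthCm}$$
$$\hat{y} = \beta_0 + \beta_1\,\text{SepalLengthCm} + \beta_2\,\text{SepalWidthCm} + \beta_3\,\text{PetalWidthCm}$$
where the coefficients $\beta_i$ are estimated by ordinary least squares, which is what sklearn's LinearRegression fits.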
Preprocessing
In [2]:
# Load the Iris dataset
df = pd.read_csv("Iris.csv")
# Drop missing values
df_cleaned = df.dropna()
# Define features and target
X_simple = df_cleaned[['SepalLengthCm']]
X_multiple = df_cleaned[['SepalLengthCm', 'SepalWidthCm', 'PetalWidthCm']]
y = df_cleaned['PetalLengthCm']
In [3]:
# Split data into training and testing sets
X_train_simple, X_test_simple, y_train, y_test = train_test_split(X_simple, y, test_size=0.2, random_state=42)
X_train_multiple, X_test_multiple, _, _ = train_test_split(X_multiple, y, test_size=0.2, random_state=42)
Simple Linear Regression
In [4]:
# Train simple linear regression model
model_simple = LinearRegression()
model_simple.fit(X_train_simple, y_train)
y_pred_simple = model_simple.predict(X_test_simple)
Multiple Linear Regression
In [5]:
# Train multiple linear regression model
model_multiple = LinearRegression()
model_multiple.fit(X_train_multiple, y_train)
y_pred_multiple = model_multiple.predict(X_test_multiple)
In [6]:
# Evaluate models
mse_simple = mean_squared_error(y_test, y_pred_simple)
r2_simple = r2_score(y_test, y_pred_simple)
mse_multiple = mean_squared_error(y_test, y_pred_multiple)
r2_multiple = r2_score(y_test, y_pred_multiple)
# Print results
print("Simple Linear Regression:")
print(f"MSE: {mse_simple:.4f}, R² Score: {r2_simple:.4f}")
print("\n Multiple Linear Regression:")
print(f"MSE: {mse_multiple:.4f}, R² Score: {r2_multiple:.4f}")
Simple Linear Regression:
MSE: 0.8372, R² Score: 0.6969
Multiple Linear Regression:
MSE: 0.1464, R² Score: 0.9470
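For reference, the two metrics reported above follow the standard definitions
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2, \qquad R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$
so a lower MSE and an R² closer to 1 both indicate a better fit.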
In [7]:
# Visualization for simple linear regression
plt.figure(figsize=(8, 6))
sns.scatterplot(x=X_test_simple['SepalLengthCm'], y=y_test, label='Actual')
sns.scatterplot(x=X_test_simple['SepalLengthCm'], y=y_pred_simple, color='red', label='Predicted')
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Petal Length (cm)')
plt.title('Simple Linear Regression')
plt.legend()
plt.show()
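The multiple regression model is not visualized above because it has three predictors; a minimal sketch of an equivalent check, plotting actual against predicted petal length (assuming y_test and y_pred_multiple from the earlier cells are still in scope), could look like this:
# Actual vs. predicted petal length for the multiple regression model
# (assumes y_test and y_pred_multiple from the earlier cells)
plt.figure(figsize=(6, 6))
plt.scatter(y_test, y_pred_multiple, alpha=0.7)
lims = [y_test.min(), y_test.max()]
plt.plot(lims, lims, 'r--', label='Perfect prediction')  # 45-degree reference line
plt.xlabel('Actual Petal Length (cm)')
plt.ylabel('Predicted Petal Length (cm)')
plt.title('Multiple Linear Regression: Actual vs Predicted')
plt.legend()
plt.show()
Points lying close to the dashed line correspond to accurate predictions.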
Comparison of Simple and Multiple Linear Regression:
Simple Linear Regression:
Mean Squared Error (MSE): 0.837
R² Score: 0.697 (SepalLengthCm alone explains ~70% of the variance in PetalLengthCm)
Multiple Linear Regression:
Mean Squared Error (MSE): 0.146
R² Score: 0.947 (SepalLengthCm, SepalWidthCm, and PetalWidthCm together explain ~95% of the variance in PetalLengthCm)
Hence, multiple linear regression clearly outperforms simple linear regression on this task.
Classification:
Decision Tree
Logistic Regression
SVM
KNN
Social_Network_Ads.csv → Classification Models
Relevance:
This dataset contains features like Age and Estimated Salary, with a target variable Purchased (0 or 1), making it
ideal for binary classification.
Models Used & Justification:
Decision Tree Classifier → Good for interpretable models and handling non-linear data.
Logistic Regression → Best for linearly separable data.
Support Vector Machine (SVM) → Works well for complex decision boundaries.
K-Nearest Neighbors (KNN) → Useful when data has local patterns.
In [9]:
# Load dataset
df_classification = pd.read_csv("Social_Network_Ads.csv")
# Define features and target variable
X = df_classification[['Age', 'EstimatedSalary']]
y = df_classification['Purchased']
In [10]:
# Split data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
In [11]:
# Standardize the features for models that rely on distance calculations
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
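As an aside not in the original notebook, a single 80/20 split can be sensitive to which rows land in the test set; a rough sketch of 5-fold cross-validation over the same four models (assuming X and y from the cell above, and wrapping each model in a pipeline so scaling is refit per fold) might look like this:
# Rough 5-fold cross-validation comparison of the four classifiers
# (assumes X and y defined above; scaling is refit inside each fold via a Pipeline)
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

candidates = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(),
    "SVM": SVC(kernel='linear'),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}
for name, clf in candidates.items():
    pipe = make_pipeline(StandardScaler(), clf)
    scores = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')
    print(f"{name}: mean accuracy = {scores.mean():.3f} (std {scores.std():.3f})")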
In [15]:
# Decision Tree Classifier
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
dt_pred = dt.predict(X_test)
print("Decision Tree Classification Report:\n ", classification_report(y_test, dt_pred))
Decision Tree Classification Report:
precision recall f1-score support
0 0.87 0.88 0.88 52
1 0.78 0.75 0.76 28
accuracy 0.84 80
macro avg 0.82 0.82 0.82 80
weighted avg 0.84 0.84 0.84 80
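Since interpretability is the main argument for using a decision tree here, one way to inspect the fitted splits is sklearn's plot_tree; this is only a sketch and assumes the dt model from the cell above:
# Inspect the fitted decision tree (assumes `dt` from the cell above)
from sklearn.tree import plot_tree

plt.figure(figsize=(14, 8))
plot_tree(dt, feature_names=['Age', 'EstimatedSalary'],
          class_names=['No Purchase', 'Purchase'],
          filled=True, max_depth=3, fontsize=8)  # show only the top levels for readability
plt.title('Decision Tree (top 3 levels)')
plt.show()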
In [16]:
# Logistic Regression
lr = LogisticRegression()
lr.fit(X_train_scaled, y_train)
lr_pred = lr.predict(X_test_scaled)
print("Logistic Regression Classification Report:\n ", classification_report(y_test, lr_pred))
Logistic Regression Classification Report:
precision recall f1-score support
0 0.85 0.96 0.90 52
1 0.90 0.68 0.78 28
accuracy 0.86 80
macro avg 0.88 0.82 0.84 80
weighted avg 0.87 0.86 0.86 80
In [17]:
# Support Vector Machine (SVM)
svm = SVC(kernel='linear')
svm.fit(X_train_scaled, y_train)
svm_pred = svm.predict(X_test_scaled)
print("SVM Classification Report:\n ", classification_report(y_test, svm_pred))
SVM Classification Report:
precision recall f1-score support
0 0.85 0.96 0.90 52
1 0.90 0.68 0.78 28
accuracy 0.86 80
macro avg 0.88 0.82 0.84 80
weighted avg 0.87 0.86 0.86 80
In [18]:
# K-Nearest Neighbors (KNN)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)
knn_pred = knn.predict(X_test_scaled)
print("KNN Classification Report:\n ", classification_report(y_test, knn_pred))
KNN Classification Report:
precision recall f1-score support
0 0.94 0.92 0.93 52
1 0.86 0.89 0.88 28
accuracy 0.91 80
macro avg 0.90 0.91 0.90 80
weighted avg 0.91 0.91 0.91 80
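The value n_neighbors=5 is simply the sklearn default; a quick, hedged sketch of how sensitive KNN is to k on this particular split (assuming the scaled train/test arrays from the cells above) is:
# Check how test accuracy varies with the number of neighbors k
# (assumes the scaled splits from the cells above)
k_values = list(range(1, 21))
k_scores = []
for k in k_values:
    knn_k = KNeighborsClassifier(n_neighbors=k)
    knn_k.fit(X_train_scaled, y_train)
    k_scores.append(knn_k.score(X_test_scaled, y_test))

plt.figure(figsize=(6, 4))
plt.plot(k_values, k_scores, marker='o')
plt.xlabel('Number of Neighbors (k)')
plt.ylabel('Test Accuracy')
plt.title('KNN Test Accuracy vs k')
plt.show()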
In [19]:
# Accuracy comparison
results = {
"Decision Tree": accuracy_score(y_test, dt_pred),
"Logistic Regression": accuracy_score(y_test, lr_pred),
"SVM": accuracy_score(y_test, svm_pred),
"KNN": accuracy_score(y_test, knn_pred)
}
print("Model Accuracy Comparison:", results)
Model Accuracy Comparison: {'Decision Tree': 0.8375, 'Logistic Regression': 0.8625, 'SVM': 0.8625, 'KNN': 0.9125}
In [14]:
# Compute and plot confusion matrices for the four classifiers
conf_matrices = {
    "Decision Tree": confusion_matrix(y_test, dt_pred),
    "Logistic Regression": confusion_matrix(y_test, lr_pred),
    "SVM": confusion_matrix(y_test, svm_pred),
    "KNN": confusion_matrix(y_test, knn_pred)
}
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.flatten()
for i, (name, cm) in enumerate(conf_matrices.items()):
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=['No Purchase', 'Purchase'],
                yticklabels=['No Purchase', 'Purchase'], ax=axes[i])
    axes[i].set_title(f'{name} Confusion Matrix')
    axes[i].set_xlabel('Predicted Label')
    axes[i].set_ylabel('True Label')
plt.tight_layout()
plt.show()
Classification Model Comparison:
Decision Tree: 83.75% accuracy
Logistic Regression: 86.25% accuracy
SVM: 86.25% accuracy
KNN: 91.25% accuracy
Observations:
KNN performed best (91.25% accuracy), likely because it exploits local, neighborhood-based structure in the data.
Logistic Regression and SVM performed identically (86.25% accuracy), suggesting that a linear decision boundary fits the data reasonably well.
The Decision Tree performed slightly worse (83.75% accuracy), indicating it may be overfitting to some degree.
Unsupervised Learning:
Clustering:
K-means clustering
Hierarchical clustering
Mall_Customers.csv → Clustering Models
Relevance:
This dataset contains Annual Income (k$) and Spending Score (1-100), making it ideal for customer segmentation.
Models Used & Justification:
K-Means Clustering → Best for finding distinct customer groups.
Hierarchical Clustering → Helps in understanding relationships between customers.
In [22]:
# Load dataset
df1 = pd.read_csv("Mall_Customers.csv")
# Select features for clustering
X = df1.iloc[:, [3, 4]].values
In [23]:
# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
K-means Clustering
In [24]:
# K-Means Clustering
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(X_scaled)
    wcss.append(kmeans.inertia_)
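The quantity collected in this loop is the within-cluster sum of squares (WCSS), exposed by sklearn as inertia_:
$$\mathrm{WCSS} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2$$
where $\mu_k$ is the centroid of cluster $C_k$. WCSS always decreases as $K$ grows, so the elbow method looks for the $K$ beyond which the decrease flattens out.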
In [25]:
# Plot Elbow Method
plt.figure(figsize=(6, 4))
plt.plot(range(1, 11), wcss, marker='o', linestyle='--')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.title('Elbow Method for Optimal K')
plt.show()
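As a complementary, hedged check on the elbow method, the silhouette score can be computed for each candidate number of clusters; the sketch below assumes X_scaled from the preprocessing cell above:
# Silhouette scores for candidate cluster counts (assumes X_scaled from above)
from sklearn.metrics import silhouette_score

for k in range(2, 11):
    labels = KMeans(n_clusters=k, init='k-means++', random_state=42).fit_predict(X_scaled)
    print(f"k={k}: silhouette score = {silhouette_score(X_scaled, labels):.3f}")
Values closer to 1 indicate better-separated clusters.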
In [26]:
# Apply K-Means with optimal clusters
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)
kmeans_labels = kmeans.fit_predict(X_scaled)
# Plot K-Means Clusters
plt.figure(figsize=(6, 4))
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=kmeans_labels, palette='viridis', legend='full')
# Map centroids back to original units so they line up with the unscaled points
centroids = scaler.inverse_transform(kmeans.cluster_centers_)
plt.scatter(centroids[:, 0], centroids[:, 1], s=300, c='red', marker='X', label='Centroids')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.title('K-Means Clustering')
plt.legend()
plt.show()
Hierarchical Clustering
In [27]:
# Hierarchical Clustering Dendrogram
plt.figure(figsize=(6, 4))
dendrogram = sch.dendrogram(sch.linkage(X_scaled, method='ward'))
plt.title('Dendrogram for Hierarchical Clustering')
plt.xlabel('Customers')
plt.ylabel('Euclidean Distances')
plt.show()
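The dendrogram itself does not assign cluster labels; if labels are wanted directly from the linkage, scipy's fcluster can cut the tree into a chosen number of flat clusters. A small sketch (assuming X_scaled from above) is:
# Cut the ward-linkage tree into 5 flat clusters (assumes X_scaled from above)
from scipy.cluster.hierarchy import fcluster

linkage_matrix = sch.linkage(X_scaled, method='ward')
dendro_labels = fcluster(linkage_matrix, t=5, criterion='maxclust')  # labels are 1..5
print(np.unique(dendro_labels, return_counts=True))
This mirrors in spirit what AgglomerativeClustering with linkage='ward' does in the next cell.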
In [29]:
# Apply Hierarchical Clustering
hc = AgglomerativeClustering(n_clusters=5, metric='euclidean', linkage='ward')
hc_labels = hc.fit_predict(X_scaled)
# Plot Hierarchical Clustering
plt.figure(figsize=(6, 4))
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=hc_labels, palette='viridis', legend='full')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.title('Hierarchical Clustering')
plt.legend()
plt.show()
If speed & scalability matter → K-Means is better.
If understanding relationships & structure is important → Hierarchical Clustering works well.
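As a final hedged check, not part of the original notebook, the agreement between the two label assignments can be quantified with the adjusted Rand index (assuming kmeans_labels and hc_labels from the cells above):
# Quantify agreement between the K-Means and hierarchical cluster assignments
from sklearn.metrics import adjusted_rand_score

ari = adjusted_rand_score(kmeans_labels, hc_labels)
print(f"Adjusted Rand Index between K-Means and hierarchical labels: {ari:.3f}")
An ARI near 1 means the two methods segment the customers almost identically, while values near 0 mean the agreement is no better than chance.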