ICS322 Machine Learning

ML mini project assignment

Name: CV KUSHAL KUMAR

Roll no: 2022BCS0098

Batch: 2

Machine Learning Datasets and Algorithms

This notebook explores suitable datasets from popular repositories and applies different ML algorithms to them:

1. Simple Linear Regression with Iris Dataset


2. Multiple Linear Regression with Iris Dataset
3. Decision Tree with Social network ads Dataset
4. Logistic Regression with Social network ads Dataset
5. Support Vector Machine (SVM) with Social network ads Dataset
6. K-Nearest Neighbors (KNN) with Social network ads Dataset
7. K-Means Clustering with Mall customers Dataset
8. Hierarchical Clustering with Mall customers Dataset

Below, we implement these algorithms with preprocessing, evaluation, and visualizations.

Implementation and comparison

Importing required packages

In [30]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.cluster import KMeans
import scipy.cluster.hierarchy as sch
from sklearn.cluster import AgglomerativeClustering

The dataset contains five numerical columns (SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm, Id) and one categorical column (Species). I'll perform:

Simple Linear Regression: Predict PetalLengthCm using SepalLengthCm.

Multiple Linear Regression: Predict PetalLengthCm using SepalLengthCm, SepalWidthCm, and PetalWidthCm.

Iris.csv → Regression Models (Simple & Multiple Linear Regression)

Relevance:

Though the Iris dataset is usually used for classification, we treat a numerical feature (Petal Length) as the dependent variable for regression analysis.

Models Used & Justification:

Simple Linear Regression → Used to predict one feature from a single other feature.

Multiple Linear Regression → Used to predict one feature from several other features.

Preprocessing

In [2]:
df = pd.read_csv("Iris.csv")
# Drop missing values
df_cleaned = df.dropna()
# Define features and target
X_simple = df_cleaned[['SepalLengthCm']]
X_multiple = df_cleaned[['SepalLengthCm', 'SepalWidthCm', 'PetalWidthCm']]
y = df_cleaned['PetalLengthCm']
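
As a quick optional check, we could verify the column names and confirm there are no missing values before modelling; a minimal sketch using the loaded df:

# Optional sanity check: column names and missing-value counts
print(df.columns.tolist())
print(df.isna().sum())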

In [3]:
# Split data into training and testing sets
X_train_simple, X_test_simple, y_train, y_test = train_test_split(X_simple, y, test_size=0.2, random_state=42)
X_train_multiple, X_test_multiple, _, _ = train_test_split(X_multiple, y, test_size=0.2, random_state=42)

Simple Linear Regression

In [4]:
# Train simple linear regression model
model_simple = LinearRegression()
model_simple.fit(X_train_simple, y_train)
y_pred_simple = model_simple.predict(X_test_simple)

Multiple Linear Regression

In [5]:
# Train multiple linear regression model
model_multiple = LinearRegression()
model_multiple.fit(X_train_multiple, y_train)
y_pred_multiple = model_multiple.predict(X_test_multiple)
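
Optionally, we could also inspect the fitted parameters to see how each feature contributes; a small sketch using the two model objects above:

# Optional: inspect intercepts and coefficients of both fitted models
print("Simple LR:  intercept =", model_simple.intercept_, " coef =", model_simple.coef_)
print("Multiple LR: intercept =", model_multiple.intercept_)
for name, coef in zip(['SepalLengthCm', 'SepalWidthCm', 'PetalWidthCm'], model_multiple.coef_):
    print(f"  {name}: {coef:.4f}")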

In [6]:
# Evaluate models
mse_simple = mean_squared_error(y_test, y_pred_simple)
r2_simple = r2_score(y_test, y_pred_simple)

mse_multiple = mean_squared_error(y_test, y_pred_multiple)
r2_multiple = r2_score(y_test, y_pred_multiple)

# Print results
print("Simple Linear Regression:")
print(f"MSE: {mse_simple:.4f}, R² Score: {r2_simple:.4f}")
print("\n Multiple Linear Regression:")
print(f"MSE: {mse_multiple:.4f}, R² Score: {r2_multiple:.4f}")

Simple Linear Regression:
MSE: 0.8372, R² Score: 0.6969

Multiple Linear Regression:
MSE: 0.1464, R² Score: 0.9470

In [7]:
# Visualization for simple linear regression
plt.figure(figsize=(8, 6))
sns.scatterplot(x=X_test_simple['SepalLengthCm'], y=y_test, label='Actual')
sns.lineplot(x=X_test_simple['SepalLengthCm'], y=y_pred_simple, color='red', label='Predicted')
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Petal Length (cm)')
plt.title('Simple Linear Regression')
plt.legend()
plt.show()

Comparison of Simple and Multiple Linear Regression:

Simple Linear Regression:

Mean Squared Error (MSE): 0.837

R² Score: 0.697 (indicates that SepalLengthCm alone explains ~70% of the variance in PetalLengthCm)

Multiple Linear Regression:

Mean Squared Error (MSE): 0.146

R² Score: 0.947 (shows that using SepalLengthCm, SepalWidthCm, and PetalWidthCm together explains ~95% of the variance in PetalLengthCm)

Hence, Multiple Linear Regression performs better than Simple Linear Regression.
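
Since this comparison is based on a single 80/20 split, an optional cross-validated check can confirm that the gap holds across folds; a minimal sketch using X_simple, X_multiple, and y from above:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated R² for both feature sets
cv_r2_simple = cross_val_score(LinearRegression(), X_simple, y, cv=5, scoring='r2')
cv_r2_multiple = cross_val_score(LinearRegression(), X_multiple, y, cv=5, scoring='r2')
print(f"Simple LR   mean CV R²: {cv_r2_simple.mean():.4f}")
print(f"Multiple LR mean CV R²: {cv_r2_multiple.mean():.4f}")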

Classification:

Decision Tree

Logistic regression

SVM

KNN

Social_Network_Ads.csv → Classification Models

Relevance:

This dataset contains features like Age and Estimated Salary, with a target variable Purchased (0 or 1), making it
ideal for binary classification.

Models Used & Justification:

Decision Tree Classifier → Good for interpretable models and handling non-linear data.

Logistic Regression → Best for linearly separable data.

Support Vector Machine (SVM) → Works well for complex decision boundaries.

K-Nearest Neighbors (KNN) → Useful when data has local patterns.

In [9]:
# Load dataset
df_classification = pd.read_csv("Social_Network_Ads.csv")
# Define features and target variable
X = df_classification[['Age', 'EstimatedSalary']]
y = df_classification['Purchased']

In [10]:
# Split data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [11]:
# Standardize the features for models that rely on distance calculations
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [15]:
# Decision Tree Classifier
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
dt_pred = dt.predict(X_test)
print("Decision Tree Classification Report:\n", classification_report(y_test, dt_pred))

Decision Tree Classification Report:
               precision    recall  f1-score   support

           0       0.87      0.88      0.88        52
           1       0.78      0.75      0.76        28

    accuracy                           0.84        80
   macro avg       0.82      0.82      0.82        80
weighted avg       0.84      0.84      0.84        80

In [16]:
# Logistic Regression
lr = LogisticRegression()
lr.fit(X_train_scaled, y_train)
lr_pred = lr.predict(X_test_scaled)
print("Logistic Regression Classification Report:\n", classification_report(y_test, lr_pred))

Logistic Regression Classification Report:
               precision    recall  f1-score   support

           0       0.85      0.96      0.90        52
           1       0.90      0.68      0.78        28

    accuracy                           0.86        80
   macro avg       0.88      0.82      0.84        80
weighted avg       0.87      0.86      0.86        80

In [17]:
# Support Vector Machine (SVM)
svm = SVC(kernel='linear')
svm.fit(X_train_scaled, y_train)
svm_pred = svm.predict(X_test_scaled)
print("SVM Classification Report:\n", classification_report(y_test, svm_pred))

SVM Classification Report:
               precision    recall  f1-score   support

           0       0.85      0.96      0.90        52
           1       0.90      0.68      0.78        28

    accuracy                           0.86        80
   macro avg       0.88      0.82      0.84        80
weighted avg       0.87      0.86      0.86        80

In [18]:
# K-Nearest Neighbors (KNN)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)
knn_pred = knn.predict(X_test_scaled)
print("KNN Classification Report:\n", classification_report(y_test, knn_pred))

KNN Classification Report:
               precision    recall  f1-score   support

           0       0.94      0.92      0.93        52
           1       0.86      0.89      0.88        28

    accuracy                           0.91        80
   macro avg       0.90      0.91      0.90        80
weighted avg       0.91      0.91      0.91        80

In [19]:
# Accuracy comparison
results = {
    "Decision Tree": accuracy_score(y_test, dt_pred),
    "Logistic Regression": accuracy_score(y_test, lr_pred),
    "SVM": accuracy_score(y_test, svm_pred),
    "KNN": accuracy_score(y_test, knn_pred)
}

print("Model Accuracy Comparison:", results)

Model Accuracy Comparison: {'Decision Tree': 0.8375, 'Logistic Regression': 0.8625, 'SVM': 0.8625, 'KNN': 0.9125}
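
Optionally, a small bar chart makes this comparison easier to read at a glance; a sketch using the results dictionary above:

# Optional: visualize the accuracy comparison
plt.figure(figsize=(6, 4))
plt.bar(list(results.keys()), list(results.values()), color='steelblue')
plt.ylim(0, 1)
plt.ylabel('Accuracy')
plt.title('Classification Model Accuracy Comparison')
plt.show()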

In [14]:
# Confusion matrices for all four models (defined here since the original cell is not shown)
conf_matrices = {
    "Decision Tree": confusion_matrix(y_test, dt_pred),
    "Logistic Regression": confusion_matrix(y_test, lr_pred),
    "SVM": confusion_matrix(y_test, svm_pred),
    "KNN": confusion_matrix(y_test, knn_pred)
}

# Plot confusion matrices
fig, axes = plt.subplots(3, 3, figsize=(15, 15))
axes = axes.ravel()
for i, (name, cm) in enumerate(conf_matrices.items()):
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['No Purchase', 'Purchase'], yticklabels=['No Purchase', 'Purchase'], ax=axes[i])
    axes[i].set_title(f'{name} Confusion Matrix')
    axes[i].set_xlabel('Predicted Label')
    axes[i].set_ylabel('True Label')

# Hide the unused subplot axes
for j in range(len(conf_matrices), len(axes)):
    axes[j].axis('off')

plt.tight_layout()
plt.show()
Classification Model Comparison:

Decision Tree: 83.75% accuracy

Logistic Regression: 86.25% accuracy

SVM: 86.25% accuracy

KNN: 91.25% accuracy

Observations:

KNN performed the best with 91.25% accuracy, likely because it leverages neighborhood-based classification.

Logistic Regression and SVM performed similarly (86.25% accuracy), suggesting a linear decision boundary fits the data well.

Decision Tree performed slightly worse (83.75% accuracy), indicating it may be overfitting to some degree.
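
Since the KNN result depends on the choice of n_neighbors (fixed at 5 above), an optional cross-validated sweep over k would show whether a different neighbourhood size does better; a minimal sketch using the scaled training data from above:

from sklearn.model_selection import cross_val_score

# Optional: cross-validated accuracy for a range of k values
for k in range(1, 16, 2):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train_scaled, y_train, cv=5)
    print(f"k={k:2d}  mean CV accuracy: {scores.mean():.4f}")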

Unsupervised Learning:

Clustering:

K-means clustering

Hierarchical clustering

Mall_Customers.csv → Clustering Models

Relevance:

This dataset contains Annual Income (k$) and Spending Score (1-100), making it ideal for customer segmentation.

Models Used & Justification:

K-Means Clustering → Best for finding distinct customer groups.

Hierarchical Clustering → Helps in understanding relationships between customers.

In [22]:
# Load dataset
df1 = pd.read_csv("Mall_Customers.csv")

# Select features for clustering: Annual Income (k$) and Spending Score (1-100)
X = df1.iloc[:, [3, 4]].values
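
The positional indices [3, 4] assume that Annual Income (k$) and Spending Score (1-100) are the fourth and fifth columns; an equivalent, slightly more explicit sketch selects them by name, assuming those header labels:

# Optional alternative: select the clustering features by (assumed) column name
print(df1.columns.tolist())  # verify the actual headers first
# X = df1[['Annual Income (k$)', 'Spending Score (1-100)']].values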

In [23]:
# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

K-means Clustering

In [24]:
# K-Means Clustering: compute within-cluster sum of squares (WCSS) for k = 1..10
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(X_scaled)
    wcss.append(kmeans.inertia_)

In [25]:
# Plot Elbow Method
plt.figure(figsize=(6, 4))
plt.plot(range(1, 11), wcss, marker='o', linestyle='--')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.title('Elbow Method for Optimal K')
plt.show()
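
Because reading the elbow is somewhat subjective, the choice of k = 5 can optionally be cross-checked with silhouette scores; a minimal sketch using X_scaled from above:

from sklearn.metrics import silhouette_score

# Optional: mean silhouette score for candidate cluster counts
for k in range(2, 9):
    labels = KMeans(n_clusters=k, init='k-means++', random_state=42).fit_predict(X_scaled)
    print(f"k={k}  silhouette score: {silhouette_score(X_scaled, labels):.4f}")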

In [26]:
# Apply K-Means with optimal clusters
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)
kmeans_labels = kmeans.fit_predict(X_scaled)

# Plot K-Means Clusters (centroids are mapped back to the original feature scale so they overlay X correctly)
centroids = scaler.inverse_transform(kmeans.cluster_centers_)
plt.figure(figsize=(6, 4))
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=kmeans_labels, palette='viridis', legend='full')
plt.scatter(centroids[:, 0], centroids[:, 1], s=300, c='red', marker='X', label='Centroids')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.title('K-Means Clustering')
plt.legend()
plt.show()

Hierarchical Clustering

In [27]:
# Hierarchical Clustering Dendrogram
plt.figure(figsize=(6, 4))
dendrogram = sch.dendrogram(sch.linkage(X_scaled, method='ward'))
plt.title('Dendrogram for Hierarchical Clustering')
plt.xlabel('Customers')
plt.ylabel('Euclidean Distances')
plt.show()
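
The dendrogram can also be cut programmatically to obtain cluster labels for a chosen number of clusters; an optional sketch using SciPy's fcluster on the same Ward linkage:

from scipy.cluster.hierarchy import fcluster

# Optional: cut the Ward-linkage tree into 5 clusters and report cluster sizes
linkage_matrix = sch.linkage(X_scaled, method='ward')
cut_labels = fcluster(linkage_matrix, t=5, criterion='maxclust')
print(np.bincount(cut_labels)[1:])  # labels start at 1, so drop the unused index 0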

In [29]:
# Apply Hierarchical Clustering
hc = AgglomerativeClustering(n_clusters=5, metric='euclidean', linkage='ward')
hc_labels = hc.fit_predict(X_scaled)

# Plot Hierarchical Clustering
plt.figure(figsize=(6, 4))
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=hc_labels, palette='viridis', legend='full')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.title('Hierarchical Clustering')
plt.legend()
plt.show()

If speed & scalability matter → K-Means is better.

If understanding relationships & structure is important → Hierarchical Clustering works well.
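
To put a number on this trade-off, we could optionally compare the two partitions on the same scaled data; a minimal sketch using kmeans_labels and hc_labels from above:

from sklearn.metrics import silhouette_score, adjusted_rand_score

# Optional: cluster quality of each method, and agreement between the two partitions
print(f"K-Means silhouette:      {silhouette_score(X_scaled, kmeans_labels):.4f}")
print(f"Hierarchical silhouette: {silhouette_score(X_scaled, hc_labels):.4f}")
print(f"Adjusted Rand index (K-Means vs Hierarchical): {adjusted_rand_score(kmeans_labels, hc_labels):.4f}")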
