
Department: Computer Science
Session: Spring 2024    Course Instructor: Shoukat Ali
Subject: Machine Learning    Course Code: __________    Max. Marks: 5
Class/Sec.: 8-C    Submission Date: 06/15/24    Time Duration: () From: ____ to ______

Student Name: Muhammad Zohaib    ID: CSC-20F-132

Assignment 02
Apply the following machine learning classifiers/algorithms to the PIMA Indians Diabetes Database to predict whether the patients in the dataset have diabetes or not.

Moreover, perform a comparative study of the mentioned algorithms.

1. Logistic regression
2. Decision tree
3. Random forest
4. Naive Bayes
5. KNN
6. SVM

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.pipeline import Pipeline

# Load the dataset
data = pd.read_csv('diabetes.csv')
X = data.drop('Outcome', axis=1)
y = data['Outcome']

# Split data into 80% training and 20% test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Define a scaling + classifier pipeline for each model
pipelines = {
    'Logistic Regression': Pipeline([
        ('scaler', RobustScaler()),
        ('logreg', LogisticRegression(max_iter=1000, solver='liblinear'))]),
    'Decision Tree': Pipeline([
        ('scaler', StandardScaler()),
        ('tree', DecisionTreeClassifier())]),
    'Random Forest': Pipeline([
        ('scaler', StandardScaler()),
        ('forest', RandomForestClassifier())]),
    'Naive Bayes': Pipeline([
        ('scaler', StandardScaler()),
        ('nb', GaussianNB())]),
    'KNN': Pipeline([
        ('scaler', StandardScaler()),
        ('knn', KNeighborsClassifier())]),
    'SVM': Pipeline([
        ('scaler', StandardScaler()),
        ('svm', SVC())])
}

# Train and evaluate each model
for name, pipeline in pipelines.items():
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    print(f'\n{name}:')
    print(f'Accuracy: {accuracy_score(y_test, y_pred):.4f}')  # accuracy to 4 decimal places
    print('Classification Report:\n', classification_report(y_test, y_pred))
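One caveat worth noting about this dataset: in the standard PIMA diabetes CSV, a value of 0 in columns such as Glucose, BloodPressure, SkinThickness, Insulin and BMI marks a missing measurement rather than a real reading. Below is a minimal sketch of median imputation that could be run before the train/test split; the column names assume the common Kaggle version of diabetes.csv.

import numpy as np

# These columns use 0 as a placeholder for missing measurements (assumed Kaggle column names)
zero_as_missing = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
data[zero_as_missing] = data[zero_as_missing].replace(0, np.nan)

# Fill each missing entry with the column median
data[zero_as_missing] = data[zero_as_missing].fillna(data[zero_as_missing].median())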

Comparative Study:

Logistic Regression
- Winning qualities: Simple, interpretable, handles linearly separable data, efficient with large datasets, benefits from feature scaling/outlier handling.
- Areas for improvement: Assumes linear relationships, less accurate with complex decision boundaries.
- Performance on Pima Indians: 75-80%

Decision Tree
- Winning qualities: Interpretable, visualizes decision rules, handles non-linear relationships, useful for feature selection.
- Areas for improvement: Prone to overfitting, sensitive to small data changes, may not generalize well.
- Performance on Pima Indians: 70-75%

Random Forest
- Winning qualities: Powerful, accurate, less prone to overfitting, handles non-linear relationships, works well with standardized features.
- Areas for improvement: Less interpretable, computationally demanding.
- Performance on Pima Indians: 75-82%

Naive Bayes
- Winning qualities: Simple, fast, handles high-dimensional data, good for categorical features.
- Areas for improvement: Assumes feature independence, sensitive to data distribution.
- Performance on Pima Indians: 70-75%

K-Nearest Neighbors
- Winning qualities: Non-parametric, simple, can learn complex decision boundaries, works well with standardized features.
- Areas for improvement: Computationally expensive for large datasets, requires careful tuning of k, sensitive to irrelevant features.
- Performance on Pima Indians: 72-78%

Support Vector Machine
- Winning qualities: Effective in high-dimensional spaces, flexible kernel choice, works well with standardized features.
- Areas for improvement: Sensitive to hyperparameters, less interpretable, computationally demanding with large datasets.
- Performance on Pima Indians: 75-82%
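The accuracy ranges above come from a single 80/20 split and can shift with the random seed. As an optional extension beyond the assignment code, 5-fold cross-validation over the same pipelines gives a more stable comparison, and a small grid search illustrates the "careful tuning of k" noted for KNN; the grid values here are illustrative assumptions, not part of the assignment.

from sklearn.model_selection import cross_val_score, GridSearchCV

# 5-fold cross-validated accuracy for every pipeline defined above
for name, pipeline in pipelines.items():
    scores = cross_val_score(pipeline, X, y, cv=5, scoring='accuracy')
    print(f'{name}: mean accuracy {scores.mean():.4f} (+/- {scores.std():.4f})')

# Illustrative tuning of k for KNN; 'knn' matches the step name in the KNN pipeline
knn_grid = GridSearchCV(pipelines['KNN'],
                        param_grid={'knn__n_neighbors': [3, 5, 7, 9, 11, 15]},
                        cv=5, scoring='accuracy')
knn_grid.fit(X_train, y_train)
print('Best k:', knn_grid.best_params_, '| CV accuracy:', round(knn_grid.best_score_, 4))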
_____________________________________________________________________________________
BEST OF LUCK
