Machine Learning (BCSL606) Lab Manual

Machine Learning Lab (BCSL606)

COURSE INFORMATION
PROGRAMME COMPUTER SCIENCE AND ENGINEERING

DEGREE BACHELOR OF ENGINEERING (BE)

COURSE TITLE MACHINE LEARNING LAB

SEMESTER VI

COURSE CODE BCSL606

COURSE TYPE (CONTENT) PROFESSIONAL CORE COURSE LABORATORY

REGULATION/SCHEME VTU BE 2022

CREDITS 01

L-T-P-S 0-0-2-0

CONTACT HOURS/WEEK 2

TOTAL CONTACT HOURS 30

COURSE CATEGORY (ASSESSMENT) Skill Centric

SEE & CIE MARKS 50, 50

COURSE SYLLABUS:

1. Develop a program to create histograms for all numerical features and analyze the distribution of each feature. Generate box plots for all numerical features and identify any outliers. Use the California Housing dataset. (2 hours)
2. Develop a program to compute the correlation matrix to understand the relationships between pairs of features. Visualize the correlation matrix using a heatmap to know which variables have strong positive/negative correlations. Create a pair plot to visualize pairwise relationships between features. Use the California Housing dataset. (2 hours)
3. Develop a program to implement Principal Component Analysis (PCA) for reducing the dimensionality of the Iris dataset from 4 features to 2. (2 hours)
4. For a given set of training data examples stored in a .CSV file, implement and demonstrate the Find-S algorithm to output a description of the set of all hypotheses consistent with the training examples. (2 hours)
5. Develop a program to implement the k-Nearest Neighbour algorithm to classify 100 randomly generated values of x in the range [0,1]. Perform the following on the generated dataset: (a) label the first 50 points {x1,…,x50} as follows: if (xi ≤ 0.5), then xi ∊ Class1, else xi ∊ Class2; (b) classify the remaining points, x51,…,x100, using KNN for k = 1, 2, 3, 4, 5, 20, 30. (2 hours)
6. Implement the non-parametric Locally Weighted Regression algorithm in order to fit data points. Select an appropriate dataset for your experiment and draw graphs. (2 hours)
7. Develop a program to demonstrate the working of Linear Regression and Polynomial Regression. Use the Boston Housing dataset for Linear Regression and the Auto MPG dataset (for vehicle fuel efficiency prediction) for Polynomial Regression. (3 hours)
8. Develop a program to demonstrate the working of the decision tree algorithm. Use the Breast Cancer dataset for building the decision tree and apply this knowledge to classify a new sample. (2 hours)
9. Develop a program to implement the Naive Bayesian classifier considering the Olivetti Face dataset for training. Compute the accuracy of the classifier, considering a few test data sets. (3 hours)
10. Develop a program to implement k-means clustering using the Wisconsin Breast Cancer dataset and visualize the clustering result. (3 hours)

TOTAL HOURS 28

TEXT, REFERENCE BOOKS & E-RESOURCES:

T-1 S Sridhar and M Vijayalakshmi, “Machine Learning”, Oxford University Press, 2021.
T-2 M N Murty and Ananthanarayana V S, “Machine Learning: Theory and Practice”, Universities Press (India) Pvt. Limited, 2024.
E-1 https://www.drssridhar.com/?page_id=1053
E-2 https://www.universitiespress.com/resources?id=9789393330697
E-3 https://onlinecourses.nptel.ac.in/noc23_cs18/preview

T - Text Book, R - Reference Book, E - e-Resource

COURSE PRE-REQUISITES:

MATHEMATICS FOR COMPUTER SCIENCE (BCS301, Sem III): Students should have knowledge of probability distributions, specific discrete and continuous distributions, statistical inference, and the basics of hypothesis testing, essential to understand the working of ML models.
DATA STRUCTURES AND APPLICATIONS (BCS304, Sem III): Students should have a basic understanding of data structures and representations.
R PROGRAMMING / PYTHON (BCS358B/BCS358D, Sem III): Students must have the programming knowledge needed for practical implementation.

COURSE DESCRIPTION:
This Machine Learning lab provides hands-on experience in data visualization, supervised and
unsupervised learning, and dimensionality reduction techniques. Students will analyze datasets like
California Housing, Iris, Boston Housing, and Breast Cancer using histograms, box plots, correlation
heatmaps, and pair plots. Key ML algorithms, including Find-S, k-NN, Decision Trees, Naïve Bayes,
Linear & Polynomial Regression, PCA, and k-Means Clustering, will be implemented to develop a
strong foundation in model building and evaluation.


COURSE OBJECTIVES:
This course will enable the students to gain practical experience in the design, development, implementation, analysis, and evaluation/testing of machine learning techniques:
1. To become familiar with data and to visualize univariate, bivariate, and multivariate data using statistical techniques and dimensionality reduction.
2. To understand various machine learning algorithms such as similarity-based learning, regression, decision trees, and clustering.
3. To become familiar with learning theories and probability-based models, and to develop the skills required for decision-making in dynamic environments.

COURSE OUTCOMES (COs):

After completion of the course, the students will be able to:
CO:1 Apply exploratory data analysis techniques, including histograms, box plots, correlation matrices, and dimensionality reduction, to interpret patterns in datasets.
CO:2 Demonstrate machine learning algorithms, including Find-S, k-Nearest Neighbors, and Locally Weighted Regression, for classification and regression tasks.
CO:3 Develop regression and classification models using Linear Regression, Polynomial Regression, and Decision Trees for a given dataset.
CO:4 Build Bayesian models for probabilistic learning using the Naïve Bayes classifier on the Olivetti Face Dataset and examine its accuracy on test data.
CO:5 Apply k-Means clustering on the Wisconsin Breast Cancer Dataset to group data and visualize the clustering results.


Experiment-1
Problem statement
Develop a program to create histograms for all numerical features and analyze the distribution of
each feature. Generate box plots for all numerical features and identify any outliers. Use California
Housing dataset.

Program
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the California Housing dataset


from sklearn.datasets import fetch_california_housing

# Fetch data
"""Imports the California Housing dataset, which contains housing data for different regions
in California."""
housing_data = fetch_california_housing(as_frame=True)
data = housing_data.frame

# Display the first few rows of the dataset


print(data.head())

# Check for numerical features


numerical_features = data.select_dtypes(include=['float64', 'int64']).columns
print(f"Numerical features: {list(numerical_features)}")

# Histograms for all numerical features


"""Loops through each numerical feature and plots a histogram:

Bins = 30: Divides the data range into 30 intervals.


Color = 'skyblue': Sets the bar color.
Edge color = 'black': Highlights bar edges for better visibility.
Grid lines improve readability."""

for feature in numerical_features:
    plt.figure(figsize=(8, 5))
    plt.hist(data[feature], bins=30, color='skyblue', edgecolor='black')
    plt.title(f"Histogram of {feature}")
    plt.xlabel(feature)
    plt.ylabel("Frequency")
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    plt.show()


# Box plots for all numerical features

for feature in numerical_features:
    plt.figure(figsize=(8, 5))
    sns.boxplot(x=data[feature], color='lightgreen')
    plt.title(f"Box Plot of {feature}")
    plt.xlabel(feature)
    plt.show()

# Identify outliers using the IQR (Interquartile Range) method

"""Q1 (25th percentile): The first quartile (lower boundary of the middle 50% of data).
Q3 (75th percentile): The third quartile (upper boundary of the middle 50% of data).
IQR (Interquartile Range): Measures the spread of the middle 50% of data.
Lower & Upper Bounds: Define the range of typical values. Any value outside this range
is an outlier."""

for feature in numerical_features:
    Q1 = data[feature].quantile(0.25)  # 25th percentile
    Q3 = data[feature].quantile(0.75)  # 75th percentile
    IQR = Q3 - Q1                      # Interquartile range
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = data[(data[feature] < lower_bound) | (data[feature] > upper_bound)]

    print(f"Outliers in {feature}:")
    print(outliers[feature].sort_values())
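As a quick sanity check, the IQR rule used above can be verified on a small hand-made sample. This is an illustrative sketch only, not part of the lab exercise:

import numpy as np

sample = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 100])   # 100 is an obvious outlier
q1, q3 = np.percentile(sample, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(sample[(sample < lower) | (sample > upper)])     # prints [100]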

Output

(Histograms and box plots are displayed for each numerical feature, followed by the printed outlier values per feature.)

Questions:

1. What is the purpose of the fetch_california_housing(as_frame=True) function in


the program?
2. How does the program identify numerical features in the dataset?
3. What is the role of the histogram plots in the program?
4. How does the program detect outliers in numerical features?
5. What is the purpose of the sns.boxplot() function in the program?


Experiment-2
Problem statement
Develop a program to compute the correlation matrix to understand the relationships between pairs of
features. Visualize the correlation matrix using a heatmap to know which variables have strong
positive/negative correlations. Create a pair plot to visualize pairwise relationships between features. Use
the California Housing dataset.

Program

# Import necessary libraries


import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing

# Load the California Housing dataset


housing_data = fetch_california_housing(as_frame=True)
data = housing_data.frame

# Display the first few rows of the dataset


print(data.head())

# Compute the correlation matrix
# The .corr() method computes the pairwise correlation between all features.
# Strong positive correlations are close to +1; strong negative correlations are close to -1.
correlation_matrix = data.corr()

# Print the correlation matrix


print("\nCorrelation Matrix:")
print(correlation_matrix)

# Visualize the correlation matrix using a heatmap


plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title("Correlation Matrix Heatmap")
plt.show()

# Create a pair plot for pairwise relationships


sns.pairplot(data, diag_kind='kde', plot_kws={'alpha': 0.7})
plt.show()
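For intuition, the Pearson correlation that .corr() computes by default can be reproduced with NumPy on two toy series; a minimal illustrative sketch:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2 * x + np.array([0.1, -0.2, 0.0, 0.3, -0.1])  # almost perfectly linear in x

# Pearson r = cov(x, y) / (std(x) * std(y)); np.corrcoef returns the 2x2 matrix
r = np.corrcoef(x, y)[0, 1]
print(f"Pearson correlation: {r:.3f}")  # close to +1.0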


OUTPUT:

(The correlation matrix is printed, and the heatmap and pair plot are displayed.)

Questions:

1. What is the purpose of computing the correlation matrix in the program?


2. How does the heatmap created using sns.heatmap() help in analyzing the dataset?
3. What does the parameter annot=True do in the sns.heatmap() function?
4. Why does the program use diag_kind='kde' in the sns.pairplot() function?
5. How can the pair plot (sns.pairplot()) be useful for identifying relationships
between features?


Experiment-3
Problem statement
Develop a program to implement Principal Component Analysis (PCA) for reducing the dimensionality of the
Iris dataset from 4 features to 2.

Program
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Load the Iris dataset


iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# Apply PCA with 2 components


pca = PCA(n_components=2)
principalComponents = pca.fit_transform(df)

# Create a new DataFrame with the principal components


principalDf = pd.DataFrame(data=principalComponents, columns=['principal component 1',
'principal component 2'])

# Concatenate the DataFrame with the class labels


finalDf = pd.concat([principalDf, pd.DataFrame(data=iris.target, columns=['target'])], axis=1)

# Visualize the data


fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(1, 1, 1)
ax.set_xlabel('Principal Component 1', fontsize=15)
ax.set_ylabel('Principal Component 2', fontsize=15)
ax.set_title('2 Component PCA', fontsize=20)

targets = [0, 1, 2]
colors = ['r', 'g', 'b']
for target, color in zip(targets, colors):
    indicesToKeep = finalDf['target'] == target
    ax.scatter(finalDf.loc[indicesToKeep, 'principal component 1'],
               finalDf.loc[indicesToKeep, 'principal component 2'], c=color, s=50)
ax.legend(iris.target_names)
ax.grid()
plt.show()
# Print explained variance ratio
print('Explained variance ratio:', pca.explained_variance_ratio_)
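The PCA result can be cross-checked against an eigendecomposition of the covariance matrix. The sketch below is illustrative and should reproduce the explained variance ratios printed above (component signs may differ):

import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data
Xc = X - X.mean(axis=0)                 # PCA centers the data first
cov = np.cov(Xc, rowvar=False)          # 4x4 covariance matrix
eigvals = np.linalg.eigh(cov)[0]        # eigenvalues in ascending order
ratios = eigvals[::-1] / eigvals.sum()  # sort descending and normalize
print(ratios[:2])                       # approximately [0.9246, 0.0531]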


OUTPUT:

Explained variance ratio: [0.92461872 0.05306648]

Questions:

1. What is the purpose of using PCA in this program?


2. How many principal components are extracted in this program?
3. What do the two columns 'principal component 1' and 'principal component 2'
in the final DataFrame represent?
4. If n_components=3 was used instead of 2, what would change in the program?
5. What does explained_variance_ratio_ tell us?

Prepared by: Mr. Janardhana Bhat K, Ms. Babitha Ganesh,and Ms. Kavya A M, Dept. of CSE,
Canara Engineering College, Mangalore
Machine Learning Lab (BCSL606)

Experiment-4
Problem statement
For a given set of training data examples stored in a .CSV file, implement and demonstrate the Find-S
algorithm to output a description of the set of all hypotheses consistent with the training examples.

Program
import csv

attributes = [['Sunny', 'Rainy'],
              ['Warm', 'Cold'],
              ['Normal', 'High'],
              ['Strong', 'Weak'],
              ['Warm', 'Cool'],
              ['Same', 'Change']]
num_attributes = len(attributes)  # 6

print("\n The most general hypothesis : ['?','?','?','?','?','?']\n")
print("\n The most specific hypothesis : ['0','0','0','0','0','0']\n")

a = []
print("\n The Given Training Data Set \n")
with open('ws.csv', 'r') as csvFile:
    reader = csv.reader(csvFile)
    for row in reader:
        a.append(row)
        print(row)

print("\n The initial value of hypothesis: ")
hypothesis = ['0'] * num_attributes
print(hypothesis)

# Initialize the hypothesis with the first training example
for j in range(num_attributes):
    hypothesis[j] = a[0][j]
    print("Hypothesis:", hypothesis[j])

# Compare with the remaining training examples; generalize attributes that differ
print("\n Find S: Finding a Maximally Specific Hypothesis\n")
for i in range(len(a)):
    if a[i][num_attributes] == 'Yes':  # only positive examples update the hypothesis
        for j in range(num_attributes):
            if a[i][j] != hypothesis[j]:
                hypothesis[j] = '?'
            else:
                hypothesis[j] = a[i][j]
    print(" For Training Example No :{0} the hypothesis is ".format(i), hypothesis)

print("\n The Maximally Specific Hypothesis for a given Training Examples :\n")
print(hypothesis)
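The program expects a file named ws.csv in the working directory. A small helper (rows taken from the sample output below) can create it:

import csv

rows = [
    ['Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same', 'Yes'],
    ['Sunny', 'Warm', 'High', 'Strong', 'Warm', 'Same', 'Yes'],
    ['Rainy', 'Cold', 'High', 'Strong', 'Warm', 'Change', 'No'],
    ['Sunny', 'Warm', 'High', 'Strong', 'Cool', 'Change', 'Yes'],
]
with open('ws.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows)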


OUTPUT:
The most general hypothesis : ['?','?','?','?','?','?']
The most specific hypothesis : ['0','0','0','0','0','0']
The Given Training Data Set

['Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same', 'Yes']


['Sunny', 'Warm', 'High', 'Strong', 'Warm', 'Same', 'Yes']
['Rainy', 'Cold', 'High', 'Strong', 'Warm', 'Change', 'No']
['Sunny', 'Warm', 'High', 'Strong', 'Cool', 'Change', 'Yes']

The initial value of hypothesis:


['0', '0', '0', '0', '0', '0']
Hypothesis: Sunny
Hypothesis: Warm
Hypothesis: Normal
Hypothesis: Strong
Hypothesis: Warm
Hypothesis: Same

Find S: Finding a Maximally Specific Hypothesis

For Training Example No :0 the hypothesis is ['Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same']
For Training Example No :1 the hypothesis is ['Sunny', 'Warm', '?', 'Strong', 'Warm', 'Same']
For Training Example No :2 the hypothesis is ['Sunny', 'Warm', '?', 'Strong', 'Warm', 'Same']
For Training Example No :3 the hypothesis is ['Sunny', 'Warm', '?', 'Strong', '?', '?']

The Maximally Specific Hypothesis for a given Training Examples :

['Sunny', 'Warm', '?', 'Strong', '?', '?']

Questions:

1. What is the main purpose of the Find-S algorithm?


2. What does the initial hypothesis look like in the Find-S algorithm?
3. In this program, when is a ? assigned to a position in the hypothesis?
4. Which examples from the training dataset are used to update the hypothesis?
5. What does the final hypothesis represent in the Find-S algorithm?
6. What will happen if all training examples are negative? What will the hypothesis look
like?


Experiment-5
Problem statement
Develop a program to implement the k-Nearest Neighbour algorithm to classify 100 randomly generated
values of x in the range [0,1]. Perform the following on the generated dataset:
a. Label the first 50 points {x1,…,x50} as follows: if (xi ≤ 0.5), then xi ∊ Class1, else xi ∊ Class2
b. Classify the remaining points, x51,…,x100, using KNN. Perform this for k = 1, 2, 3, 4, 5, 20, 30

Program
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Generate 100 random values in the range [0, 1]


data = np.random.rand(100)

# Label the first 50 points


labels = np.zeros(100)
labels[:50] = np.where(data[:50] <= 0.5, 1, 2)

# Separate data into training and testing sets


train_data = data[:50].reshape(-1, 1)
train_labels = labels[:50]
test_data = data[50:].reshape(-1, 1)

# Perform KNN classification for different values of k


k_values = [1, 2, 3, 4, 5, 20, 30]
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(train_data, train_labels)
    predicted_labels = knn.predict(test_data)

    print(f"K = {k}")
    print("Test Value\tPredicted Label")
    for val, label in zip(test_data.flatten(), predicted_labels):
        print(f"{val:.3f}\t\t{int(label)}")
    print()
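Since the true labelling rule (x ≤ 0.5 gives Class 1) is known, the predictions can also be scored. A minimal sketch that could be appended at the end of the loop body (it reuses k and predicted_labels from the loop):

from sklearn.metrics import accuracy_score

true_test_labels = np.where(test_data.flatten() <= 0.5, 1, 2)
print(f"Accuracy for k={k}: {accuracy_score(true_test_labels, predicted_labels):.2f}")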


OUTPUT:
K = 1                   K = 2                   K = 3                   K = 4
Test Value  Predicted   Test Value  Predicted   Test Value  Predicted   Test Value  Predicted
0.524 2 0.524 2 0.524 2 0.524 2
0.607 2 0.607 2 0.607 2 0.607 2
0.196 1 0.196 1 0.196 1 0.196 1
0.469 1 0.469 1 0.469 2 0.469 2
0.276 1 0.276 1 0.276 1 0.276 1
0.890 2 0.890 2 0.890 2 0.890 2
0.932 2 0.932 2 0.932 2 0.932 2
0.749 2 0.749 2 0.749 2 0.749 2
0.111 1 0.111 1 0.111 1 0.111 1
0.904 2 0.904 2 0.904 2 0.904 2
0.677 2 0.677 2 0.677 2 0.677 2
0.399 1 0.399 1 0.399 1 0.399 1
0.327 1 0.327 1 0.327 1 0.327 1
0.386 1 0.386 1 0.386 1 0.386 1
0.380 1 0.380 1 0.380 1 0.380 1
0.981 2 0.981 2 0.981 2 0.981 2
0.506 2 0.506 2 0.506 2 0.506 2
0.336 1 0.336 1 0.336 1 0.336 1
0.521 2 0.521 2 0.521 2 0.521 2
0.294 1 0.294 1 0.294 1 0.294 1
0.987 2 0.987 2 0.987 2 0.987 2
0.946 2 0.946 2 0.946 2 0.946 2
0.860 2 0.860 2 0.860 2 0.860 2
0.707 2 0.707 2 0.707 2 0.707 2
0.343 1 0.343 1 0.343 1 0.343 1
0.822 2 0.822 2 0.822 2 0.822 2
0.655 2 0.655 2 0.655 2 0.655 2
0.443 1 0.443 1 0.443 1 0.443 1
0.816 2 0.816 2 0.816 2 0.816 2
0.778 2 0.778 2 0.778 2 0.778 2
0.248 1 0.248 1 0.248 1 0.248 1
0.176 1 0.176 1 0.176 1 0.176 1
0.251 1 0.251 1 0.251 1 0.251 1
0.726 2 0.726 2 0.726 2 0.726 2
0.551 2 0.551 2 0.551 2 0.551 2
0.209 1 0.209 1 0.209 1 0.209 1
0.059 1 0.059 1 0.059 1 0.059 1
0.769 2 0.769 2 0.769 2 0.769 2
0.907 2 0.907 2 0.907 2 0.907 2
0.998 2 0.998 2 0.998 2 0.998 2
0.352 1 0.352 1 0.352 1 0.352 1
0.173 1 0.173 1 0.173 1 0.173 1
0.293 1 0.293 1 0.293 1 0.293 1
0.125 1 0.125 1 0.125 1 0.125 1
0.859 2 0.859 2 0.859 2 0.859 2
0.861 2 0.861 2 0.861 2 0.861 2
0.742 2 0.742 2 0.742 2 0.742 2
0.985 2 0.985 2 0.985 2 0.985 2
0.840 2 0.840 2 0.840 2 0.840 2
0.214 1 0.214 1 0.214 1 0.214 1


K = 5                   K = 20                  K = 30
Test Value  Predicted   Test Value  Predicted   Test Value  Predicted
0.524 2 0.524 2 0.524 2
0.607 2 0.607 2 0.607 2
0.196 1 0.196 1 0.196 1
0.469 2 0.469 2 0.469 2
0.276 1 0.276 1 0.276 1
0.890 2 0.890 2 0.890 2
0.932 2 0.932 2 0.932 2
0.749 2 0.749 2 0.749 2
0.111 1 0.111 1 0.111 1
0.904 2 0.904 2 0.904 2
0.677 2 0.677 2 0.677 2
0.399 1 0.399 1 0.399 1
0.327 1 0.327 1 0.327 1
0.386 1 0.386 1 0.386 1
0.380 1 0.380 1 0.380 1
0.981 2 0.981 2 0.981 2
0.506 2 0.506 2 0.506 2
0.336 1 0.336 1 0.336 1
0.521 2 0.521 2 0.521 2
0.294 1 0.294 1 0.294 1
0.987 2 0.987 2 0.987 2
0.946 2 0.946 2 0.946 2
0.860 2 0.860 2 0.860 2
0.707 2 0.707 2 0.707 2
0.343 1 0.343 1 0.343 1
0.822 2 0.822 2 0.822 2
0.655 2 0.655 2 0.655 2
0.443 1 0.443 2 0.443 1
0.816 2 0.816 2 0.816 2
0.778 2 0.778 2 0.778 2
0.248 1 0.248 1 0.248 1
0.176 1 0.176 1 0.176 1
0.251 1 0.251 1 0.251 1
0.726 2 0.726 2 0.726 2
0.551 2 0.551 2 0.551 2
0.209 1 0.209 1 0.209 1
0.059 1 0.059 1 0.059 1
0.769 2 0.769 2 0.769 2
0.907 2 0.907 2 0.907 2
0.998 2 0.998 2 0.998 2
0.352 1 0.352 1 0.352 1
0.173 1 0.173 1 0.173 1
0.293 1 0.293 1 0.293 1
0.125 1 0.125 1 0.125 1
0.859 2 0.859 2 0.859 2
0.861 2 0.861 2 0.861 2
0.742 2 0.742 2 0.742 2
0.985 2 0.985 2 0.985 2
0.840 2 0.840 2 0.840 2
0.214 1 0.214 1 0.214 1


Questions:

1. What is the role of the KNeighborsClassifier in this program?


2. How are the training and testing data sets separated in this code?
3. How are labels assigned to the training data points in this program?
4. What happens when k is changed in the loop? What effect does it have on predictions?
5. What is the purpose of reshaping the data using reshape(-1, 1) before classification?


Experiment-6
Problem statement
Implement the non-parametric Locally Weighted Regression algorithm in order to fit data points. Select
appropriate data set for your experiment and draw graphs.

Program
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

def kernel(point, xmat, k):
    # Gaussian kernel: weight_j = exp(-||point - x_j||^2 / (2 k^2))
    m, n = np.shape(xmat)
    weights = np.mat(np.eye(m))
    for j in range(m):
        diff = point - xmat[j]
        weights[j, j] = np.exp(diff * diff.T / (-2.0 * k**2))
    return weights


def localWeight(point, xmat, ymat, k):
    # Solve the weighted normal equation: W = (X^T w X)^-1 (X^T w y)
    wei = kernel(point, xmat, k)
    W = (xmat.T * (wei * xmat)).I * (xmat.T * (wei * ymat.T))
    return W


def localWeightRegression(xmat, ymat, k):
    # Fit a separate locally weighted model at every query point
    m, n = np.shape(xmat)
    ypred = np.zeros(m)
    for i in range(m):
        weight = localWeight(xmat[i], xmat, ymat, k)
        ypred[i] = xmat[i] * weight
    return ypred

# load data points


data = pd.read_csv('10-dataset.csv')
bill = np.array(data.total_bill)
tip = np.array(data.tip)

#preparing and add 1 in bill


mbill = np.mat(bill) #converts bill into a matrix
mtip = np.mat(tip) #converts tip into a matrix
m= np.shape(mbill)[1] # returns the number of columns
one = np.mat(np.ones(m))
X = np.hstack((one.T,mbill.T)) #include a bias term (intercept) in the model
# set the bandwidth k here
ypred = localWeightRegression(X, mtip, 0.5)

# Convert X to a NumPy array to avoid matrix indexing issues
X_array = np.array(X)
SortIndex = X_array[:, 1].argsort(0)  # indices that sort total_bill in ascending order
xsort = X_array[SortIndex, 1]         # the sorted total_bill values

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.scatter(bill, tip, color='green')                        # raw data points
ax.plot(xsort, ypred[SortIndex], color='red', linewidth=5)  # fitted regression curve
plt.xlabel('Total bill')
plt.ylabel('Tip')
plt.show()
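The program reads 10-dataset.csv, which is expected to contain total_bill and tip columns (the familiar restaurant tips data). If that file is unavailable, a synthetic stand-in can be generated; a minimal sketch, with the 15% tip rate and noise level chosen arbitrarily:

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
total_bill = np.sort(rng.uniform(3, 50, 200))
tip = 0.15 * total_bill + rng.normal(0, 1.0, 200)  # roughly 15% tips plus noise
pd.DataFrame({'total_bill': total_bill, 'tip': tip}).to_csv('10-dataset.csv', index=False)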

OUTPUT:

(A scatter plot of total bill vs. tip is displayed with the fitted locally weighted regression curve.)

Questions:

1. What is the role of the Gaussian kernel in Locally Weighted Regression?


2. Why do we include a column of ones when constructing matrix X?
3. What does the parameter k represent in the Gaussian kernel function? What happens
when k is very small or very large?
4. How does Locally Weighted Regression differ from Ordinary Least Squares (OLS)?
5. Why do we sort X before plotting the predicted curve?


Experiment-7
Problem statement
Develop a program to demonstrate the working of Linear Regression and Polynomial Regression. Use Boston
Housing Dataset for Linear Regression and Auto MPG Dataset (for vehicle fuel efficiency prediction) for
Polynomial Regression.

Program
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.datasets import fetch_california_housing

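# Note: the Boston Housing dataset was removed from recent versions of scikit-learn,
# so the California Housing dataset is used for the Linear Regression part instead.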
california = fetch_california_housing()
X_california = california.data
y_california = california.target

# Split the dataset


X_train_california, X_test_california, y_train_california, y_test_california = train_test_split(
    X_california, y_california, test_size=0.2, random_state=42)

# Train the Linear Regression model


linear_model = LinearRegression()
linear_model.fit(X_train_california, y_train_california)

# Predictions using Linear Regression


y_pred_california = linear_model.predict(X_test_california)

# Evaluate Linear Regression model


mse_california = mean_squared_error(y_test_california, y_pred_california)
print(f"Linear Regression Mean Squared Error (California Housing): {mse_california:.2f}")

# Visualization for Linear Regression


plt.figure(figsize=(10, 5))
plt.scatter(y_test_california, y_pred_california, color='blue', alpha=0.5,
            label='Predicted vs Actual')
plt.plot([min(y_test_california), max(y_test_california)],
         [min(y_test_california), max(y_test_california)],
         color='red', linestyle='--', label='Ideal Fit')
plt.xlabel('Actual Prices')
plt.ylabel('Predicted Prices')
plt.title('Linear Regression: California Housing')
plt.legend()
plt.show()

# Load the Auto MPG dataset from a local CSV file
df = pd.read_csv("auto_mpg.csv")

# Features and target


X_auto = df[['displacement', 'horsepower', 'weight', 'acceleration']]
y_auto = df['mpg']

# Train-test split
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(X_auto, y_auto, test_size=0.2,
random_state=42)

# Scaling
scaler_a = StandardScaler()
X_train_a_scaled = scaler_a.fit_transform(X_train_a)
X_test_a_scaled = scaler_a.transform(X_test_a)

# Polynomial Features (degree=2)


poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train_a_scaled)
X_test_poly = poly.transform(X_test_a_scaled)

# Polynomial Regression
lr_poly = LinearRegression()
lr_poly.fit(X_train_poly, y_train_a)
y_pred_poly = lr_poly.predict(X_test_poly)

# Evaluation
mse_poly = mean_squared_error(y_test_a, y_pred_poly)
print(f"Auto MPG - Polynomial Regression (degree=2) MSE: {mse_poly:.2f}")

# Visualization
plt.figure(figsize=(10, 5))
plt.scatter(X_test_a['weight'], y_test_a, color='blue', label='Actual MPG')
plt.scatter(X_test_a['weight'], y_pred_poly, color='red', alpha=0.6, label='Predicted MPG (Poly)')
plt.xlabel("Weight")
plt.ylabel("MPG")
plt.title("Polynomial Regression (Degree=2) - Auto MPG")
plt.legend()
plt.tight_layout()
plt.show()
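Note that in the original UCI Auto MPG data, missing horsepower values are recorded as '?'. If the auto_mpg.csv export keeps that convention (an assumption about the file, not something the program above handles), a cleaning step is needed right after the read_csv call; a minimal sketch:

# Coerce non-numeric horsepower entries ('?') to NaN, then drop incomplete rows
df['horsepower'] = pd.to_numeric(df['horsepower'], errors='coerce')
df = df.dropna(subset=['displacement', 'horsepower', 'weight', 'acceleration', 'mpg'])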


OUTPUT:

(The Linear Regression and Polynomial Regression MSE values are printed, and the two plots are displayed.)

Questions:

1. Why is train_test_split() important before model training?


2. What does a lower MSE indicate about model performance?
3. Why might scaling be skipped in Linear Regression but not in Polynomial Regression?
4. What is the role of PolynomialFeatures in model training?
5. Why is it important to transform both training and testing data using the same scaler?


Experiment-8
Problem statement
Develop a program to demonstrate the working of the decision tree algorithm. Use Breast Cancer Data set for
building the decision tree and apply this knowledge to classify a new sample.

Program
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Step 1: Load the Breast Cancer Dataset


data = load_breast_cancer()
X = data.data # Features
y = data.target # Labels (0: malignant, 1: benign)
# Step 2: Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3: Create and train the Decision Tree classifier


clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Step 4: Make predictions on the test set


y_pred = clf.predict(X_test)

# Step 5: Evaluate the model


accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Step 6: Classify a new sample


# Example: Create a new sample (you can replace these values with actual data)
new_sample = np.array([[17.99, 10.38, 122.8, 1001.0, 0.1184, 0.2776, 0.3001, 0.1471, 0.2419,
0.07871, 1.095, 0.9053, 8.589, 153.4, 0.006399, 0.04904, 0.05373, 0.01587, 0.03003, 0.006193,
25.38, 17.33, 184.6, 2019.0, 0.1622, 0.6656, 0.7119, 0.2654, 0.4601, 0.1189]])

# Predict the class of the new sample


prediction = clf.predict(new_sample)
print("\nNew Sample Prediction:")
print("Class:", data.target_names[prediction][0])


OUTPUT:

Accuracy: 0.9473684210526315

New Sample Prediction:


Class: malignant

Questions:

1. How many features are present in the Breast Cancer dataset used in this program?
2. How does splitting the data prevent overfitting?
3. Which algorithm is used in this program for classification?
4. What do the target values 0 and 1 represent in the dataset?
5. Why is it important to evaluate the model on data that was not seen during training?


Experiment-9
Problem statement
Develop a program to implement the Naive Bayesian classifier considering Olivetti Face Data set for training.
Compute the accuracy of the classifier, considering a few test data sets.

Program
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Step 1: Manually Load the Olivetti Faces dataset


X = np.load("olivetti_faces.npy") # Shape: (400, 64, 64)
y = np.load("olivetti_faces_target.npy") # Shape: (400,)

# Step 2: Flatten the images (reshape from 3D to 2D)


X = X.reshape(X.shape[0], -1) # Now shape becomes (400, 4096)

# Step 3: Split the dataset into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Step 4: Train the Naive Bayesian Classifier


gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Step 5: Predict on the test set


y_pred = gnb.predict(X_test)

# Step 6: Evaluate the model


accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

# Step 7: Visualize some test images with predictions


n_images = 5
plt.figure(figsize=(10, 4))
for i in range(n_images):
    plt.subplot(1, n_images, i + 1)
    plt.imshow(X_test[i].reshape(64, 64), cmap='gray')  # Reshape back for visualization
    plt.title(f"True: {y_test[i]}\nPred: {y_pred[i]}")
    plt.axis('off')
plt.show()
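If the .npy files are not available locally, scikit-learn can download the same data; a sketch of the alternative loading step (it replaces Steps 1 and 2 above):

from sklearn.datasets import fetch_olivetti_faces

faces = fetch_olivetti_faces(shuffle=True, random_state=42)
X = faces.data    # shape (400, 4096), already flattened
y = faces.target  # shape (400,)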


OUTPUT:

(The classification accuracy is printed, and five test faces are shown with their true and predicted labels.)

Questions:

1. What is the shape of the original images, and how does it change after flattening?
2. Why do we flatten the image data before feeding it into the Naive Bayes classifier?
3. What are the limitations of using Naive Bayes for image classification?
4. What will happen if you change the test_size from 0.2 to 0.5?
5. How does the Naive Bayes classifier work, and why might it struggle with image data?


Experiment-10
Problem statement
Develop a program to implement k-means clustering using Wisconsin Breast Cancer data set and visualize the
clustering result.

Program
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Load the dataset


data = load_breast_cancer()
X = data.data # Features
y = data.target # Labels (not used in clustering)

# Standardize the features


scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply K-Means clustering


kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
kmeans.fit(X_scaled)
labels = kmeans.labels_

# Reduce dimensions using PCA for visualization


pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Plot the clusters


plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap='viridis', alpha=0.7)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('K-Means Clustering of Breast Cancer Dataset')
plt.colorbar(label='Cluster')
plt.show()
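Although clustering is unsupervised, the known diagnosis labels allow an external check of cluster quality; a minimal sketch using the Adjusted Rand Index:

from sklearn.metrics import adjusted_rand_score

# 1.0 means perfect agreement with the true labels, 0.0 means chance-level agreement
print(f"Adjusted Rand Index: {adjusted_rand_score(y, labels):.3f}")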


OUTPUT:

(A scatter plot of the first two principal components, colored by cluster assignment, is displayed.)

Questions:

1. What type of learning is demonstrated in this code: supervised or unsupervised? Why?


2. What is the purpose of using the StandardScaler before clustering?
3. Why might we choose n_clusters=2 in K-Means for this dataset?
4. Why is PCA used in this code, and what does it help with?
5. What would happen if you skipped the standardization step before applying K-Means?

Prepared by: Mr. Janardhana Bhat K, Ms. Babitha Ganesh,and Ms. Kavya A M, Dept. of CSE,
Canara Engineering College, Mangalore
