Lab Manual
Machine Learning Lab
B.Tech R23 III-I SEM
(Computer Science & Engineering(AI & ML))
CS408PC: MACHINE LEARNING LAB
B.Tech. II Year I Sem.
L T P C
0 0 2 1
Course Objective:
· The objective of this lab is to give an overview of the various machine learning techniques and to
demonstrate them using Python.
List of Experiments:
1. Write a Python program to compute Central Tendency Measures (Mean, Median, Mode) and
Measures of Dispersion (Variance, Standard Deviation)
2. Study of Python Basic Libraries such as Statistics, Math, Numpy and Scipy
3. Study of Python Libraries for ML application such as Pandas and Matplotlib
4. Write a Python program to implement Simple Linear Regression
5. Implementation of Multiple Linear Regression for House Price Prediction using sklearn
6. Implementation of Decision tree using sklearn and its parameter tuning
7. Implementation of KNN using sklearn
8. Implementation of Logistic Regression using sklearn
9. Implementation of K-Means Clustering
10. Performance analysis of Classification Algorithms on a specific dataset (Mini Project)
ADDITIONAL EXPERIMENTS :
1. Write a Python program to implement Logistic Regression for iris using sklearn and plot
the confusion matrix.
2. Consider a dataset, use Random Forest to predict the output class. Vary the number of
trees as follows and compare the results: i.20 ii.50 iii.100
PROGRAMS
Week 1
1. Central Tendency and Dispersion Measures in Python
Aim
To write a Python program to compute:
· Measures of Central Tendency: Mean, Median, Mode
· Measures of Dispersion: Variance, Standard Deviation
Software Requirements
o Python 3.x
o statistics module (inbuilt)
o Any IDE (VS Code, Jupyter Notebook, etc.)
Hardware Requirements
o 2 GB RAM minimum
o 1 GHz Processor
o Windows/Linux/Mac OS
Source Code :
import statistics as stats
# Sample dataset
data = [5, 10, 15, 10, 20, 10, 25]
# Central Tendency Measures
mean = stats.mean(data)
median = stats.median(data)
mode = stats.mode(data)
# Measures of Dispersion (sample statistics: the divisor is n - 1)
variance = stats.variance(data)
std_dev = stats.stdev(data)
# Display results (dispersion values rounded to 4 decimal places)
print("Data:", data)
print("Mean:", mean)
print("Median:", median)
print("Mode:", mode)
print("Variance:", round(variance, 4))
print("Standard Deviation:", round(std_dev, 4))
Output
Data: [5, 10, 15, 10, 20, 10, 25]
Mean: 13.571428571428571
Median: 10
Mode: 10
Variance: 47.619
Standard Deviation: 6.9007
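The statistics module offers both sample and population versions of the dispersion measures. The following is a minimal sketch, using the same dataset, that contrasts the two and checks the sample variance against the textbook formula. The variable names are illustrative and not part of the original program.

```python
import statistics as stats

data = [5, 10, 15, 10, 20, 10, 25]

# Sample variance: sum of squared deviations divided by (n - 1)
sample_var = stats.variance(data)
# Population variance: sum of squared deviations divided by n
population_var = stats.pvariance(data)

# Manual check of the sample variance formula
mean = sum(data) / len(data)
squared_deviations = sum((x - mean) ** 2 for x in data)
manual_sample_var = squared_deviations / (len(data) - 1)

print("Sample variance:", round(sample_var, 4))
print("Population variance:", round(population_var, 4))
print("Manual sample variance:", round(manual_sample_var, 4))
```

The sample version is the usual choice when the data is a subset drawn from a larger population; the population version applies when the list contains every value of interest.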
Viva Questions:
1. What is the difference between mean and median?
2. When is mode preferred over mean?
3. Define variance in simple terms.
4. How is standard deviation related to variance?
5. Which Python module provides functions to compute statistical values?
Week 2 :
2. Study of Python Basic Libraries: Statistics, Math, NumPy, and SciPy
Aim
To explore and understand basic Python libraries: statistics, math, numpy, and scipy.
Software Requirements
o Python 3.x
o Libraries: math, statistics, numpy, scipy
Hardware Requirements
o 2 GB RAM minimum
o 1 GHz Processor
o Windows/Linux/Mac OS
Source Code:
import math
import statistics as stats
import numpy as np
from scipy import stats as scipy_stats
# Math module demo
print("Square root of 16:", math.sqrt(16))
# Statistics module demo
data = [1, 2, 3, 4, 5]
print("Mean:", stats.mean(data))
# NumPy demo
np_array = np.array([1, 2, 3, 4, 5])
print("NumPy Array Mean:", np.mean(np_array))
# SciPy demo
print("Mode using SciPy:", scipy_stats.mode(data, keepdims=False).mode)
Output
Square root of 16: 4.0
Mean: 3
NumPy Array Mean: 3.0
Mode using SciPy: 1
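One practical difference between NumPy arrays and native Python lists is how arithmetic operators behave. The short sketch below (illustrative names only) shows that multiplying a list repeats it, while multiplying an array acts on every element.

```python
import numpy as np

values = [1, 2, 3, 4, 5]

# Multiplying a list by an integer repeats the list
repeated_list = values * 2

# Element-wise doubling of a list needs an explicit loop or comprehension
doubled_list = [v * 2 for v in values]

# Multiplying a NumPy array applies the operation to each element
arr = np.array(values)
doubled_arr = arr * 2

print("List * 2:", repeated_list)
print("Comprehension:", doubled_list)
print("Array * 2:", doubled_arr)
```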
Viva Questions
1. What is the purpose of the math module?
2. How is NumPy different from native Python lists?
3. What does the scipy.stats module offer?
4. Which module would you use for scientific computing?
5. How do you calculate mean using NumPy?
Week 3 :
3. Study of Python Libraries for ML Applications: Pandas and
Matplotlib
Aim:
To study Python libraries used in machine learning applications, namely pandas and matplotlib.
Software Requirements
o Python 3.x
o Libraries: pandas, matplotlib
Hardware Requirements
o 4 GB RAM recommended
o 1.5 GHz Processor
o Windows/Linux/Mac OS
Source Code :
import pandas as pd
import matplotlib.pyplot as plt
# Creating a DataFrame
data = {'Name': ['A', 'B', 'C', 'D'],
        'Marks': [88, 92, 79, 85]}
df = pd.DataFrame(data)
# Display the DataFrame
print("DataFrame:")
print(df)
# Plotting a bar chart
plt.bar(df['Name'], df['Marks'], color='skyblue')
plt.title('Student Marks')
plt.xlabel('Name')
plt.ylabel('Marks')
plt.show()
Output
DataFrame:
Name Marks
0 A 88
1 B 92
2 C 79
3 D 85
(A bar chart is displayed showing marks of students.)
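In ML work, data usually arrives as a CSV file rather than a hand-written dictionary. The sketch below reads the same marks table with pd.read_csv; to keep it self-contained it reads from an in-memory string (an assumption made for illustration) instead of a file on disk.

```python
import io

import pandas as pd

# CSV text standing in for a file such as "marks.csv" (hypothetical file name)
csv_text = """Name,Marks
A,88
B,92
C,79
D,85
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

# Simple summaries that are common before model building
mean_marks = df["Marks"].mean()
top_student = df.sort_values("Marks", ascending=False).iloc[0]

print(df)
print("Mean marks:", mean_marks)
print("Top scorer:", top_student["Name"])
```

With a real file, the io.StringIO(...) argument would be replaced by the path to the CSV file.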
Viva Questions
1. What is a DataFrame in pandas?
2. How do you read CSV files using pandas?
3. How can matplotlib be used in ML visualization?
4. Which function is used to display plots in matplotlib?
5. What are common use cases for pandas in ML?
Week 4:
4. Simple Linear Regression in Python
Aim
To implement Simple Linear Regression using sklearn
Software Requirements
• Python 3.x
• scikit-learn, matplotlib, pandas
Hardware Requirements
• 4 GB RAM
• 1.5 GHz Processor or higher
• Windows/Linux/Mac OS
Source Code
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# Sample dataset
data = {'Experience': [1, 2, 3, 4, 5], 'Salary': [30000, 35000, 40000, 45000, 50000]}
df = pd.DataFrame(data)
X = df[['Experience']] # Feature
y = df['Salary'] # Target
model = LinearRegression()
model.fit(X, y)
# Predict salary for 6 years of experience
# (a DataFrame with the same column name avoids a feature-name warning)
pred = model.predict(pd.DataFrame({'Experience': [6]}))
print("Predicted salary for 6 years experience:", pred[0])
# Plotting
plt.scatter(X, y, color='blue')
plt.plot(X, model.predict(X), color='red')
plt.title("Experience vs Salary")
plt.xlabel("Years of Experience")
plt.ylabel("Salary")
plt.show()
Output
Predicted salary for 6 years experience: 55000.0
(A graph showing the regression line)
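The fitted line can also be obtained directly from the least-squares formulas, slope = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² and intercept = ȳ - slope · x̄. The following is a minimal sketch on the same data using NumPy only; the variable names are illustrative.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([30000, 35000, 40000, 45000, 50000], dtype=float)

# Deviations from the means
x_dev = x - x.mean()
y_dev = y - y.mean()

# Least-squares estimates for y = intercept + slope * x
slope = np.sum(x_dev * y_dev) / np.sum(x_dev ** 2)
intercept = y.mean() - slope * x.mean()

# Prediction for 6 years of experience
prediction = intercept + slope * 6

print("Slope:", slope)
print("Intercept:", intercept)
print("Predicted salary for 6 years:", prediction)
```

For this dataset the slope works out to 5000 per year of experience and the intercept to 25000, which agrees with the prediction of 55000 for 6 years.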
Viva Questions
1. What is simple linear regression?
2. What is the equation of a straight line in regression?
3. Which method is used to fit the regression model?
4. What does the slope represent?
5. How do we make predictions using the model?
Week 5:
5. Multiple Linear Regression for House Price Prediction
Aim
To implement Multiple Linear Regression using sklearn for predicting house prices.
Software Requirements
• Python 3.x
• pandas, sklearn
Hardware Requirements
• 4 GB RAM
• 1.5 GHz Processor
• Windows/Linux/Mac OS
Source Code
import pandas as pd
from sklearn.linear_model import LinearRegression
# Dataset
data = {
    'Area': [1000, 1500, 2000, 2500, 3000],
    'Bedrooms': [2, 3, 4, 4, 5],
    'Price': [300000, 400000, 500000, 550000, 600000]
}
df = pd.DataFrame(data)
X = df[['Area', 'Bedrooms']] # Features
y = df['Price'] # Target
model = LinearRegression()
model.fit(X, y)
# Predict price for a 2800 sq. ft. house with 4 bedrooms
# (a DataFrame with the same column names avoids a feature-name warning)
prediction = model.predict(pd.DataFrame({'Area': [2800], 'Bedrooms': [4]}))
print("Predicted house price:", prediction[0])
Output
Predicted house price: 554000.0 (approximately; the last digits may differ due to floating-point rounding)
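A multiple regression model is the equation Price = intercept + b1 · Area + b2 · Bedrooms. The sketch below, which assumes the same five-row dataset, reads the fitted coefficients from the model and rebuilds the prediction by hand to show where the number comes from.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "Area": [1000, 1500, 2000, 2500, 3000],
    "Bedrooms": [2, 3, 4, 4, 5],
    "Price": [300000, 400000, 500000, 550000, 600000],
})
X = df[["Area", "Bedrooms"]]
y = df["Price"]

model = LinearRegression().fit(X, y)

# One coefficient per feature, in column order, plus the intercept
coef_area, coef_bedrooms = model.coef_
intercept = model.intercept_

# Rebuild the prediction for a 2800 sq. ft., 4-bedroom house by hand
manual = intercept + coef_area * 2800 + coef_bedrooms * 4
library = model.predict(pd.DataFrame({"Area": [2800], "Bedrooms": [4]}))[0]

print("Coefficient for Area:", round(coef_area, 2))
print("Coefficient for Bedrooms:", round(coef_bedrooms, 2))
print("Intercept:", round(intercept, 2))
print("Manual prediction:", round(manual, 2))
print("model.predict:", round(library, 2))
```

On this data the least-squares solution is roughly 80 per unit of area, 50000 per bedroom, and an intercept of 130000.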
Viva Questions
1. What is the difference between simple and multiple linear regression?
2. What is multicollinearity?
3. How many independent variables can be used?
4. What are features and labels?
5. Which library is used for regression in Python?
Week 6:
6. Decision Tree Implementation with Parameter Tuning
Aim
To implement a Decision Tree Classifier using sklearn with parameter tuning.
Software Requirements
• Python 3.x
• sklearn, pandas
Hardware Requirements
• 4 GB RAM
• 1.5 GHz Processor
• Windows/Linux/Mac OS
Source Code
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load dataset
iris = load_iris()
X = iris.data
y = iris.target
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Create model with tuning
model = DecisionTreeClassifier(max_depth=3, criterion='entropy')
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
Output
Accuracy: 0.9333 (may vary)
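The program above fixes max_depth and criterion by hand. One common way to tune such hyperparameters systematically is a cross-validated grid search. The following is a minimal sketch using sklearn's GridSearchCV; the parameter grid, the random_state values, and the stratified split are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Candidate values for each hyperparameter (illustrative choices)
param_grid = {
    "max_depth": [2, 3, 4, 5],
    "criterion": ["gini", "entropy"],
    "min_samples_split": [2, 5, 10],
}

# Try every combination with 5-fold cross-validation on the training set
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# The best estimator is refit on the full training set automatically
test_accuracy = search.score(X_test, y_test)

print("Best parameters:", search.best_params_)
print("Best cross-validation accuracy:", round(search.best_score_, 4))
print("Test accuracy:", round(test_accuracy, 4))
```

Tuning on cross-validation folds of the training set keeps the test set untouched until the final evaluation.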
Viva Questions
1. What is a decision tree?
2. What is max_depth in a tree?
3. What are gini and entropy?
4. What is overfitting in decision trees?
5. How do you tune hyperparameters?
Week 7:
7. K-Nearest Neighbors (KNN) using Sklearn
Aim
To implement the K-Nearest Neighbors (KNN) algorithm using sklearn.
Software Requirements
• Python 3.x
• sklearn
Hardware Requirements
• 4 GB RAM
• 1.5 GHz Processor
Source Code
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load dataset
iris = load_iris()
X = iris.data
y = iris.target
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Create KNN model
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
# Prediction
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
Output
Accuracy: 1.0 (may vary)
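The choice of K affects the result: a small K follows the training data closely, while a large K smooths the decision boundary. The sketch below compares several values of K with 5-fold cross-validation; the list of K values is an assumption chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Mean cross-validated accuracy for each candidate K
k_scores = {}
for k in [1, 3, 5, 7, 9, 11]:
    model = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(model, X, y, cv=5)
    k_scores[k] = scores.mean()

# K with the highest mean accuracy
best_k = max(k_scores, key=k_scores.get)

for k, score in k_scores.items():
    print(f"K = {k}: mean accuracy = {score:.4f}")
print("Best K:", best_k)
```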
Viva Questions
1. What is KNN?
2. How does KNN work?
3. What is the effect of changing K?
4. Is KNN supervised or unsupervised?
5. What is the distance metric used in KNN?
Week 8:
8. Logistic Regression using Sklearn
Aim
To implement Logistic Regression for classification using sklearn.
Software Requirements
• Python 3.x
• sklearn
Hardware Requirements
• 4 GB RAM
• 1.5 GHz Processor
Source Code
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load data
iris = load_iris()
X = iris.data
y = (iris.target == 0).astype(int) # Binary classification
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Train logistic regression
model = LogisticRegression()
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
Output
Accuracy: 1.0 (may vary)
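For binary classification, logistic regression passes a linear score z through the sigmoid function 1 / (1 + e^(-z)) to obtain a probability between 0 and 1. The sketch below, written for the same binary Iris task, applies a hand-written sigmoid to the model's linear scores and compares the result with predict_proba. The helper name sigmoid is illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression


def sigmoid(z):
    """Map any real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


X, y = load_iris(return_X_y=True)
y_binary = (y == 0).astype(int)  # 1 for setosa, 0 for the other species

model = LogisticRegression(max_iter=200).fit(X, y_binary)

# Linear scores (w . x + b) for the first five samples
z = model.decision_function(X[:5])

# Probability of class 1, computed by hand and by the library
manual_proba = sigmoid(z)
library_proba = model.predict_proba(X[:5])[:, 1]

print("Linear scores:", np.round(z, 3))
print("Manual probabilities:", np.round(manual_proba, 4))
print("predict_proba:", np.round(library_proba, 4))
```

A sample is assigned to class 1 when this probability exceeds 0.5, which is the same as the linear score being positive.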
Viva Questions
1. What is logistic regression used for?
2. Is logistic regression a classification algorithm?
3. What is the sigmoid function?
4. What is the range of output of logistic regression?
5. How does logistic regression differ from linear regression?
Week 9:
9. Implementation of K-Means Clustering
Aim:
To implement the K-Means Clustering algorithm using sklearn and visualize the clusters.
Software Requirements:
• Python 3.x
• numpy
• matplotlib
• sklearn
Hardware Requirements:
• 4 GB RAM
• 1.5 GHz processor or higher
• OS: Windows/Linux/macOS
Source Code:
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
# Generate sample data
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
# Create KMeans model
kmeans = KMeans(n_clusters=4, random_state=0)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
# Plotting the clusters
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.75, marker='X')
plt.title("K-Means Clustering Result")
plt.show()
Output
A scatter plot with 4 colored clusters and red “X” markers for centroids.
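When the number of clusters is not known in advance, one widely used heuristic is the elbow method: fit K-Means for several values of K and look for the point where inertia (the within-cluster sum of squared distances) stops dropping sharply. The following is a minimal sketch on the same generated data; the range of K values and n_init=10 are assumptions chosen for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Inertia for each candidate number of clusters
inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(f"K = {k}: inertia = {inertia:.2f}")
```

Plotting these values against K typically shows a bend near the true number of groups, after which adding clusters gives only small improvements.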
Viva Questions
1. What is the K in K-Means?
2. Is K-Means supervised or unsupervised?
3. How do you choose the number of clusters?
4. What does inertia mean in K-Means?
5. How are centroids updated?
Week 10:
10. Performance Analysis of Classification Algorithms on a Dataset
(Mini Project)
Aim:
To compare the performance of multiple classification algorithms (Logistic Regression, KNN, Decision
Tree) on the Iris dataset.
Software Requirements:
• Python 3.x
• sklearn
• pandas
• matplotlib
• seaborn (optional)
Hardware Requirements:
• 4 GB RAM
• 1.5 GHz processor
• Windows/Linux/macOS
Source Code:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
# Load dataset
iris = load_iris()
X = iris.data
y = iris.target
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Models
models = {
    'Logistic Regression': LogisticRegression(max_iter=200),
    'Decision Tree': DecisionTreeClassifier(),
    'KNN': KNeighborsClassifier()
}
# Train, predict, and evaluate
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f'{name} Accuracy: {accuracy:.2f}')
Output
Example (varies slightly):
Logistic Regression Accuracy: 1.00
Decision Tree Accuracy: 1.00
KNN Accuracy: 1.00
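A single train-test split on a small dataset can make all models look identical. One way to get a steadier comparison is k-fold cross-validation, which averages accuracy over several splits. The sketch below applies 5-fold cross-validation to the same three classifiers; max_iter=1000 and random_state=42 are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "KNN": KNeighborsClassifier(),
}

# Mean and standard deviation of accuracy over 5 folds for each model
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    results[name] = (scores.mean(), scores.std())

for name, (mean_acc, std_acc) in results.items():
    print(f"{name}: {mean_acc:.4f} (+/- {std_acc:.4f})")
```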
Viva Questions
1. Why do we use train-test split?
2. What evaluation metric is used here?
3. Which model performed the best?
4. What is overfitting and how can you prevent it?
5. Can accuracy be misleading? If yes, when?
ADDITIONAL EXPERIMENTS:
1. Write a Python program to implement Logistic Regression for iris using sklearn and plot the
confusion matrix.
Aim:
To implement Logistic Regression on the Iris dataset using sklearn and plot the confusion matrix of its
predictions.
Software Requirements:
• Python 3.x
• sklearn
• pandas
• matplotlib
• seaborn (optional)
Hardware Requirements:
• 4 GB RAM
• 1.5 GHz processor
• Windows/Linux/macOS
Source Code:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Initialize and train the Logistic Regression model
logreg = LogisticRegression(max_iter=200)
logreg.fit(X_train, y_train)
# Predict the labels for the test set
y_pred = logreg.predict(X_test)
# Compute the confusion matrix
cm = confusion_matrix(y_test, y_pred)
# Display the confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=iris.target_names)
disp.plot(cmap=plt.cm.Blues)
plt.show()
Output :
The confusion matrix will be a 3x3 grid, corresponding to the three Iris species:
Rows: True classes (actual species)
Columns: Predicted classes (predicted species)
Each cell in the matrix gives the number of test samples whose true class is the row label and whose
predicted class is the column label.
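Accuracy, precision, and recall can all be read off a confusion matrix laid out this way. The sketch below uses a hypothetical 3x3 matrix (the counts are made up for illustration and are not the output of the program above) to show the arithmetic with NumPy.

```python
import numpy as np

# Hypothetical confusion matrix: rows = true class, columns = predicted class
cm = np.array([
    [5, 0, 0],
    [0, 4, 1],
    [0, 2, 3],
])

# Correct predictions lie on the diagonal
correct = np.diag(cm)

# Accuracy: correct predictions over all samples
accuracy = correct.sum() / cm.sum()

# Precision per class: correct over everything predicted as that class (column sum)
precision = correct / cm.sum(axis=0)

# Recall per class: correct over everything truly in that class (row sum)
recall = correct / cm.sum(axis=1)

print("Accuracy:", round(accuracy, 4))
print("Precision per class:", np.round(precision, 4))
print("Recall per class:", np.round(recall, 4))
```

The same three lines of arithmetic apply to the matrix cm returned by confusion_matrix(y_test, y_pred).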
2. Consider a dataset, use Random Forest to predict the output class. Vary the number of trees as
follows and compare the results: i.20 ii.50 iii.100
Aim:
To predict the output class on the Iris dataset using a Random Forest classifier and compare the
accuracy obtained with 20, 50, and 100 trees.
Software Requirements:
• Python 3.x
• sklearn
• pandas
• matplotlib
• seaborn (optional)
Hardware Requirements:
• 4 GB RAM
• 1.5 GHz processor
• Windows/Linux/macOS
Source Code:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# List of different numbers of trees to evaluate
n_estimators_list = [20, 50, 100]
accuracies = []
# Train and evaluate a Random Forest classifier for each number of trees
for n_estimators in n_estimators_list:
    clf = RandomForestClassifier(n_estimators=n_estimators, random_state=42)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
# Plot the results
plt.figure(figsize=(8, 6))
plt.plot(n_estimators_list, accuracies, marker='o', linestyle='-', color='b')
plt.title('Random Forest Accuracy vs. Number of Trees')
plt.xlabel('Number of Trees')
plt.ylabel('Accuracy')
plt.grid(True)
plt.show()
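Output (a line plot of accuracy against the number of trees is displayed; the values below are rounded
and may vary with the scikit-learn version):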
n_estimators_list = [20, 50, 100]
accuracies = [0.9556, 0.9778, 0.9778]