ML Lab Manual

The document outlines various practice programs focused on data analysis and machine learning using Python, including tasks such as creating histograms, computing correlation matrices, implementing PCA, and using algorithms like k-Nearest Neighbour and Naive Bayesian classifier. Each program includes specific datasets, such as the California Housing and Iris datasets, and provides source code examples for implementation. Additionally, it covers data visualization techniques using libraries like Matplotlib and Seaborn.


Contents:

1. Practice Programs
2. Develop a program to create histograms for all numerical features and analyze the distribution of each feature. Generate box plots for all numerical features and identify any outliers. Use the California Housing dataset.
3. Develop a program to compute the correlation matrix to understand the relationships between pairs of features. Visualize the correlation matrix using a heatmap to know which variables have strong positive/negative correlations. Create a pair plot to visualize pairwise relationships between features. Use the California Housing dataset.
4. Develop a program to implement Principal Component Analysis (PCA) for reducing the dimensionality of the Iris dataset from 4 features to 2.
5. For a given set of training data examples stored in a .CSV file, implement and demonstrate the Find-S algorithm to output a description of the set of all hypotheses consistent with the training examples.
6. Develop a program to implement the k-Nearest Neighbour algorithm to classify 100 randomly generated values of x in the range [0,1]: (a) label the first 50 points {x1,…,x50} as Class1 if xi ≤ 0.5, else Class2; (b) classify the remaining points x51,…,x100 using KNN for k = 1, 2, 3, 4, 5, 20, 30.
7. Implement the non-parametric Locally Weighted Regression algorithm in order to fit data points. Select an appropriate data set for your experiment and draw graphs.
8. Develop a program to demonstrate the working of Linear Regression and Polynomial Regression. Use the Boston Housing dataset for Linear Regression and the Auto MPG dataset (for vehicle fuel efficiency prediction) for Polynomial Regression.
9. Develop a program to demonstrate the working of the decision tree algorithm. Use the Breast Cancer dataset for building the decision tree and apply this knowledge to classify a new sample.
10. Develop a program to implement the Naive Bayesian classifier considering the Olivetti Face dataset for training. Compute the accuracy of the classifier, considering a few test data sets.
11. Develop a program to implement k-means clustering using the Wisconsin Breast Cancer dataset and visualize the clustering result.
12. Viva Questions
Practice Programs:

1. Write a Python Script to Create a DataFrame

import pandas as pd

# Creating a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)

# Display the DataFrame
print(df)

2. Write a Python Script to Read and Write CSV Files

# Save DataFrame to a CSV file
df.to_csv('data.csv', index=False)

# Read DataFrame from a CSV file
df_read = pd.read_csv('data.csv')
print(df_read)

3. Write a Python Script to Perform Basic DataFrame Operations

# Show first 2 rows
print(df.head(2))

# Show last 2 rows
print(df.tail(2))

# Get summary statistics
print(df.describe())

# Get column names
print(df.columns)

# Get DataFrame shape (rows, columns)
print(df.shape)

# Get data types of each column
print(df.dtypes)

4. Write a Python Script for Selecting and Filtering Data

# Select a single column
print(df['Name'])

# Select multiple columns
print(df[['Name', 'Age']])

# Filter rows based on a condition
print(df[df['Age'] > 30])
5. Write a Python Script for Adding and Modifying Columns

# Add a new column
df['Salary'] = [50000, 60000, 70000, 80000]

# Modify an existing column
df['Age'] = df['Age'] + 1  # Increase age by 1
print(df)

6. Write a Python Script for Sorting and Grouping Data

# Sort DataFrame by Age in ascending order
print(df.sort_values(by='Age'))

# Group data by City and find the mean Age
print(df.groupby('City')['Age'].mean())

7. Write a Python Script for Handling Missing Values

import numpy as np

# Introduce missing values
df.loc[1, 'Age'] = np.nan

# Check for missing values
print(df.isnull().sum())

# Fill missing values with the column mean (plain assignment avoids the
# chained-assignment warning that inplace=True triggers in recent pandas)
df['Age'] = df['Age'].fillna(df['Age'].mean())
print(df)

8. Write a Python Script for Applying Functions to a DataFrame

# Apply a function to a column
df['Age_Category'] = df['Age'].apply(lambda x: 'Young' if x < 30 else 'Old')
print(df)
9. Write a Python Script to Plot a Line Plot (Trends over Time)
import matplotlib.pyplot as plt
import numpy as np

# Sample Data
x = np.arange(1, 11)
y = np.sin(x)

# Line Plot
plt.plot(x, y, marker='o', linestyle='-', color='b', label='Sine Wave')
plt.xlabel("X Values")
plt.ylabel("Y Values")
plt.title("Simple Line Plot")
plt.legend()
plt.grid(True)
plt.show()
10. Write a Python Script to Plot a Bar Chart (Category Comparison)

import matplotlib.pyplot as plt

# Sample Data
categories = ['A', 'B', 'C', 'D', 'E']
values = [10, 25, 15, 30, 20]

# Bar Plot
plt.bar(categories, values, color=['red', 'blue', 'green', 'purple', 'orange'])
plt.xlabel("Categories")
plt.ylabel("Values")
plt.title("Bar Chart Example")
plt.show()

11. Write a Python Script to Plot a Histogram (Distribution of Data)

import numpy as np
import matplotlib.pyplot as plt

# Generate Random Data
data = np.random.randn(1000)

# Histogram
plt.hist(data, bins=30, color='skyblue', edgecolor='black')
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.title("Histogram of Random Data")
plt.show()

12. Write a Python Script to Plot a Scatter Plot (Relationship between Two Variables)

import numpy as np
import matplotlib.pyplot as plt

# Generate Data
x = np.random.rand(100)
y = np.random.rand(100)

# Scatter Plot
plt.scatter(x, y, c='red', alpha=0.6)
plt.xlabel("X Values")
plt.ylabel("Y Values")
plt.title("Scatter Plot Example")
plt.show()

13. Write a Python Script to Plot a Box Plot (Detecting Outliers)

import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

# Generate Random Data
data = np.random.randn(100)

# Box Plot
sns.boxplot(data=data, color='lightblue')
plt.title("Box Plot Example")
plt.show()

14. Write a Python Script to Plot a Pair Plot (Multiple Feature Relationships - Iris Dataset)

import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Load Iris Dataset
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target

# Pair Plot
sns.pairplot(df, hue='species', palette='coolwarm')
plt.show()

15. Write a Python Script to Plot a Heatmap (Correlation Matrix - Titanic Dataset)

import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

# Load Sample Dataset
df = sns.load_dataset("titanic").dropna()

# Compute Correlation (numeric_only=True skips the string columns,
# which recent pandas versions no longer drop silently)
corr_matrix = df.corr(numeric_only=True)

# Heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Heatmap")
plt.show()

Program 1: Develop a program to create histograms for all numerical features and analyze
the distribution of each feature. Generate box plots for all numerical features and identify
any outliers. Use California Housing dataset.

Source Code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing

# Load California Housing dataset
data = fetch_california_housing()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Display basic dataset information
print("Dataset Overview:")
print(df.info())
print("\nSummary Statistics:")
print(df.describe())

# Set plot style
sns.set_style("whitegrid")

# Create histograms for all numerical features
# (df.hist creates its own figure, so no separate plt.figure call is needed)
df.hist(bins=30, figsize=(12, 8), edgecolor='black')
plt.suptitle("Histograms of Numerical Features in California Housing Dataset", fontsize=14)
plt.show()

# Create box plots for all numerical features to identify outliers
plt.figure(figsize=(14, 8))
for i, col in enumerate(df.columns):
    plt.subplot(3, 3, i + 1)
    sns.boxplot(x=df[col], color="skyblue", width=0.6, fliersize=3)
    plt.title(col, fontsize=12)
plt.tight_layout()
plt.suptitle("Box Plots of Numerical Features", fontsize=14, y=1.02)
plt.show()

# Identify outliers using the IQR method
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
outliers = (df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))
print("\nOutlier Detection:")
print(outliers.sum())

Output:

Dataset Overview:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   MedInc      20640 non-null  float64
 1   HouseAge    20640 non-null  float64
 2   AveRooms    20640 non-null  float64
 3   AveBedrms   20640 non-null  float64
 4   Population  20640 non-null  float64
 5   AveOccup    20640 non-null  float64
 6   Latitude    20640 non-null  float64
 7   Longitude   20640 non-null  float64
dtypes: float64(8)

Summary Statistics:
             MedInc      HouseAge      AveRooms     AveBedrms    Population
count  20640.000000  20640.000000  20640.000000  20640.000000  20640.000000
mean       3.870671     28.639486      5.429000      1.096675   1425.476744
std        1.899822     12.585558      2.474173      0.473911   1132.462122
min        0.499900      1.000000      0.846154      0.333333      3.000000
25%        2.563400     18.000000      4.440716      1.006079    787.000000
50%        3.534800     29.000000      5.229129      1.048780   1166.000000
75%        4.743250     37.000000      6.052381      1.099526   1725.000000
max       15.000100     52.000000    141.909091     34.066667  35682.000000

           AveOccup      Latitude     Longitude
count  20640.000000  20640.000000  20640.000000
mean       3.070655     35.631861   -119.569704
std       10.386050      2.135952      2.003532
min        0.692308     32.540000   -124.350000
25%        2.429741     33.930000   -121.800000
50%        2.818116     34.260000   -118.490000
75%        3.282261     37.710000   -118.010000
max     1243.333333     41.950000   -114.310000



Outlier Detection:
MedInc         681
HouseAge         0
AveRooms       511
AveBedrms     1424
Population    1196
AveOccup       711
Latitude         0
Longitude        0
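
The same IQR bounds can also be used to filter out the flagged rows. A minimal sketch continuing from the Q1, Q3 and IQR variables in the source code above (the 1.5 multiplier is the conventional cutoff, not a requirement of the program statement):

lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
# Keep only rows where every feature lies inside [lower, upper]
df_filtered = df[~((df < lower) | (df > upper)).any(axis=1)]
print("Rows before:", len(df), "- after removing outliers:", len(df_filtered))
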
Program 2: Develop a program to compute the correlation matrix to understand the
relationships between pairs of features. Visualize the correlation matrix using a heatmap to
know which variables have strong positive/negative correlations. Create a pair plot to
visualize pairwise relationships between features. Use California Housing dataset.

Source Code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing

# Load California Housing dataset
data = fetch_california_housing()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Set plot style
sns.set_style("whitegrid")

# Compute and visualize the correlation matrix
plt.figure(figsize=(10, 6))
corr_matrix = df.corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title("Feature Correlation Heatmap", fontsize=14)
plt.show()

# Create pair plot to visualize pairwise relationships between features
sns.pairplot(df, diag_kind='kde', plot_kws={'alpha': 0.5})
plt.suptitle("Pair Plot of Features", fontsize=14, y=1.02)
plt.show()

# Identify skewness of numerical features
skew_values = df.skew()
print("\nSkewness of Features:")
print(skew_values)

Output:

Skewness of Features:
MedInc         1.646657
HouseAge       0.060331
AveRooms      20.697869
AveBedrms     31.316956
Population     4.935858
AveOccup      97.639561
Latitude       0.465953
Longitude     -0.297801
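
Strongly right-skewed features such as AveRooms, AveBedrms, Population and AveOccup are often log-transformed before modelling. A minimal sketch continuing from the df above (np.log1p is one common choice of transform, not part of the program statement):

skewed_cols = ['AveRooms', 'AveBedrms', 'Population', 'AveOccup']
df_log = df.copy()
df_log[skewed_cols] = np.log1p(df_log[skewed_cols])  # log(1 + x), finite at zero
print(df_log[skewed_cols].skew())  # recompute skewness after the transform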


Program 3: Develop a program to implement Principal Component Analysis (PCA) for
reducing the dimensionality of the Iris dataset from 4 features to 2.

Source Code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Load Iris dataset
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Standardize the data
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)

# Apply PCA to reduce dimensionality from 4 to 2
pca = PCA(n_components=2)
principal_components = pca.fit_transform(df_scaled)

# Create a new DataFrame with principal components
pca_df = pd.DataFrame(principal_components, columns=['PC1', 'PC2'])
pca_df['Target'] = data.target

# Visualize the PCA results
plt.figure(figsize=(8, 6))
for target, label in enumerate(data.target_names):
    subset = pca_df[pca_df['Target'] == target]
    plt.scatter(subset['PC1'], subset['PC2'], label=label, alpha=0.7)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA of Iris Dataset')
plt.legend()
plt.grid(True)
plt.show()

# Print explained variance ratio
print("Explained Variance Ratio:", pca.explained_variance_ratio_)

Output:

Explained Variance Ratio: [0.72962445 0.22850762]
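
To see how the original four features contribute to each principal component, the fitted loadings can be inspected. A minimal sketch continuing from the pca object above:

loadings = pd.DataFrame(pca.components_, columns=data.feature_names, index=['PC1', 'PC2'])
print(loadings)  # each row gives the weight of every original feature in that component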


Program 4: For a given set of training data examples stored in a .CSV file, implement and
demonstrate the Find-S algorithm to output a description of the set of all hypotheses
consistent with the training examples.

Source Code:

import csv

num_attributes = 6
a = []

print("\n The Given Training Data Set \n")
with open('data.csv', 'r') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        a.append(row)
        print(row)

print("\n The initial value of hypothesis: ")
hypothesis = ['0'] * num_attributes
print(hypothesis)

# Initialise the hypothesis with the first training instance
for j in range(0, num_attributes):
    hypothesis[j] = a[0][j]

print("\n Find S: Finding a Maximally Specific Hypothesis\n")
for i in range(0, len(a)):
    if a[i][num_attributes] == 'yes':  # generalise only on positive examples
        for j in range(0, num_attributes):
            if a[i][j] != hypothesis[j]:
                hypothesis[j] = '?'  # attribute value differs: generalise it
            else:
                hypothesis[j] = a[i][j]
print(" For Training instance No:{0} the hypothesis is ".format(i), hypothesis)

print("\n The Maximally Specific Hypothesis for a given Training Examples :\n")
print(hypothesis)

Output:

The Given Training Data Set

['sunny', 'warm', 'normal', 'strong', 'warm', 'same', 'yes']
['sunny', 'warm', 'high', 'strong', 'warm', 'same', 'yes']
['rainy', 'cold', 'high', 'strong', 'warm', 'change', 'no']
['sunny', 'warm', 'high', 'strong', 'cool', 'change', 'yes']

The initial value of hypothesis:
['0', '0', '0', '0', '0', '0']

Find S: Finding a Maximally Specific Hypothesis

For Training instance No:3 the hypothesis is ['sunny', 'warm', '?', 'strong', '?', '?']

The Maximally Specific Hypothesis for a given Training Examples :

['sunny', 'warm', '?', 'strong', '?', '?']


Program 5: Develop a program to implement k-Nearest Neighbour algorithm to classify the
randomly generated 100 values of x in the range of [0,1]. Perform the following based on
dataset generated.

a. Label the first 50 points {x1,……,x50} as follows: if (xi ≤ 0.5), then xi ε Class1, else xi ε Class2

b. Classify the remaining points, x51,……,x100 using KNN. Perform this for k=1,2,3,4,5,20,30

Source Code:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

# Generate 100 random values in the range [0,1]
x = np.random.rand(100, 1)

# Label the first 50 points based on the given condition
labels = np.array([1 if xi[0] <= 0.5 else 2 for xi in x[:50]])

# Prepare training and test sets
X_train, y_train = x[:50], labels  # First 50 for training
X_test = x[50:]  # Remaining 50 for classification

# Test for different values of k
k_values = [1, 2, 3, 4, 5, 20, 30]

plt.figure(figsize=(10, 6))
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    # Visualization of classification results
    plt.scatter(X_test, y_pred, label=f'k={k}', alpha=0.7)

# Mark training points for reference
plt.scatter(X_train, y_train, color='red', marker='x', label='Training Data')
plt.xlabel('X values')
plt.ylabel('Predicted Class')
plt.title('KNN Classification for Different k-values')
plt.legend()
plt.show()

# Print classification results for each k
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    print(f'Predictions for k={k}:', y_pred)

Output:

Predictions for k=1: [1 2 2 2 2 1 1 2 2 1 1 2 1 1 1 1 2 1 2 1 1 1 1 2 2 1 1 1 2 1 2 2 1 2 1 2 1 1 1 1 1 2 1 1 1 1 2 2 1]
Predictions for k=2: [1 2 2 2 2 1 1 2 2 1 1 2 1 1 1 1 2 1 2 1 1 1 1 2 2 1 1 1 2 1 2 2 1 2 1 2 1 1 1 1 1 2 1 1 1 1 2 2 1]
Predictions for k=3: [1 2 2 2 2 1 1 2 2 1 1 2 1 1 1 1 2 1 2 1 1 1 1 2 2 1 1 1 2 1 2 2 1 2 1 2 1 1 1 2 1 2 1 1 1 1 2 2 1]
Predictions for k=4: [1 2 2 2 2 1 1 2 2 1 1 2 1 1 1 1 2 1 2 1 1 1 1 2 2 1 1 1 2 1 2 2 1 2 1 2 1 1 1 1 1 2 1 1 1 1 2 2 1]
Predictions for k=5: [1 2 2 2 2 1 1 2 2 1 1 2 1 1 1 1 2 1 2 1 1 1 1 2 2 1 1 1 2 1 2 2 1 2 1 2 1 1 1 1 1 2 1 1 1 1 2 2 1]
Predictions for k=20: [1 2 2 2 2 1 1 2 2 1 1 2 1 1 1 1 2 1 2 1 1 1 1 2 2 1 1 1 2 1 1 2 1 2 1 2 1 1 1 1 1 1 1 1 1 1 2 2 1]
Predictions for k=30: [1 2 2 2 2 1 1 2 2 1 1 2 1 1 1 1 2 1 2 1 1 1 1 2 2 1 1 1 2 1 1 2 1 2 1 2 1 1 1 1 1 1 1 1 1 1 2 2 1]
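
Because the test points come from the same rule used to label the training points (Class1 if xi ≤ 0.5, else Class2), their true classes are known and the accuracy for each k can be checked directly. A minimal sketch continuing from the variables above:

from sklearn.metrics import accuracy_score

y_true = np.array([1 if xi[0] <= 0.5 else 2 for xi in X_test])  # ground truth from the generating rule
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    print(f'k={k}: accuracy = {accuracy_score(y_true, knn.predict(X_test)):.2f}')
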
Program 6: Implement the non-parametric Locally Weighted Regression algorithm in order
to fit data points. Select an appropriate data set for your experiment and draw graphs.
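
For reference, the weight the source code assigns to a training point x_i for a query point x is the standard Gaussian kernel of Locally Weighted Regression,

    w_i = exp(-(x_i - x)^2 / (2 * tau^2)),

and the local parameters solve a weighted least-squares problem, theta = (X^T W X)^(-1) X^T W y with W = diag(w_1, ..., w_m). The bandwidth tau controls locality: a small tau follows the data closely, while a large tau approaches ordinary linear regression.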

Source Code:

import numpy as np
import matplotlib.pyplot as plt

# Generate synthetic dataset
np.random.seed(42)
X = np.linspace(0, 10, 100)
y = np.sin(X) + np.random.normal(0, 0.1, 100)  # Sinusoidal data with noise

# Define Locally Weighted Regression function
def locally_weighted_regression(x_query, X, y, tau):
    # Diagonal weight matrix: Gaussian kernel centred on the query point
    W = np.diag(np.exp(-((X[:, 1] - x_query[1]) ** 2) / (2 * tau ** 2)))
    theta = np.linalg.pinv(X.T @ W @ X) @ X.T @ W @ y  # Weighted least-squares solution
    return x_query @ theta

# Fit Locally Weighted Regression for different values of tau
tau_values = [0.1, 0.5, 1, 5]
X_ones = np.c_[np.ones(X.shape[0]), X]  # Add bias term

plt.figure(figsize=(10, 6))
plt.scatter(X, y, label='Data', color='blue', alpha=0.5)
for tau in tau_values:
    y_pred = np.array([locally_weighted_regression(np.array([1, x_i]), X_ones, y, tau)
                       for x_i in X])
    plt.plot(X, y_pred, label=f'tau={tau}')
plt.xlabel('X values')
plt.ylabel('Y values')
plt.title('Locally Weighted Regression with Different Bandwidths')
plt.legend()
plt.show()

Output: (figure only: the noisy sinusoidal data with one fitted LWR curve per tau value)
Program 7: Develop a program to demonstrate the working of Linear Regression and
Polynomial Regression. Use Boston Housing Dataset for Linear Regression and Auto MPG
Dataset (for vehicle fuel efficiency prediction) for Polynomial Regression.

Source Code:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import mean_squared_error

# The Boston Housing dataset was removed from scikit-learn (1.2+), so the
# California Housing dataset is used here as a substitute for Linear Regression
housing = fetch_california_housing()
X_housing = housing.data[:, :2]  # Selecting first two features for simplicity
y_housing = housing.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X_housing, y_housing, test_size=0.2, random_state=42)

# Train Linear Regression Model
linear_reg = LinearRegression()
linear_reg.fit(X_train, y_train)
y_pred = linear_reg.predict(X_test)

# Evaluate Model
mse = mean_squared_error(y_test, y_pred)
print(f'Linear Regression MSE: {mse}')

# Plot Predictions vs Actual
plt.scatter(y_test, y_pred, color='blue', alpha=0.5)
plt.xlabel('Actual Prices')
plt.ylabel('Predicted Prices')
plt.title('Linear Regression: Actual vs Predicted Prices')
plt.show()

# Load Auto MPG Dataset for Polynomial Regression
auto_mpg = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/mpg.csv").dropna()
X_auto = auto_mpg[['horsepower']].values
y_auto = auto_mpg['mpg'].values

# Split data
X_train, X_test, y_train, y_test = train_test_split(X_auto, y_auto, test_size=0.2, random_state=42)

# Train Polynomial Regression Model
degree = 3  # Choosing a cubic polynomial model
poly_model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
poly_model.fit(X_train, y_train)
y_pred_poly = poly_model.predict(X_test)

# Evaluate Model
mse_poly = mean_squared_error(y_test, y_pred_poly)
print(f'Polynomial Regression MSE: {mse_poly}')

# Plot Polynomial Regression Results
X_sorted = np.sort(X_test, axis=0)
y_sorted = poly_model.predict(X_sorted)
plt.scatter(X_test, y_test, color='blue', alpha=0.5, label='Actual')
plt.plot(X_sorted, y_sorted, color='red', label=f'Polynomial Degree {degree}')
plt.xlabel('Horsepower')
plt.ylabel('MPG')
plt.title('Polynomial Regression: Horsepower vs MPG')
plt.legend()
plt.show()

Output:

Linear Regression MSE: 0.6629874283048177

Polynomial Regression MSE: 18.460267222145088


Program 8: Develop a program to demonstrate the working of the decision tree algorithm.
Use Breast Cancer Data set for building the decision tree and apply this knowledge to
classify a new sample.

Source Code:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, classification_report

# Load Breast Cancer Dataset
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Decision Tree Model
decision_tree = DecisionTreeClassifier(random_state=42)
decision_tree.fit(X_train, y_train)

# Predict on test data
y_pred = decision_tree.predict(X_test)

# Evaluate Model
accuracy = accuracy_score(y_test, y_pred)
print(f'Decision Tree Accuracy: {accuracy}')
print(classification_report(y_test, y_pred))

# Classify a new sample
new_sample = np.array([X_test[0]])  # Using first test sample as an example
predicted_class = decision_tree.predict(new_sample)
print(f'Predicted class for new sample: {cancer.target_names[predicted_class[0]]}')

Output:

Decision Tree Accuracy: 0.9473684210526315
              precision    recall  f1-score   support

           0       0.93      0.93      0.93        43
           1       0.96      0.96      0.96        71

    accuracy                           0.95       114
   macro avg       0.94      0.94      0.94       114
weighted avg       0.95      0.95      0.95       114

Predicted class for new sample: benign
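
The fitted tree can also be drawn to show which features drive the splits. A minimal sketch using scikit-learn's plot_tree on the decision_tree object above (max_depth=2 is an arbitrary limit chosen only to keep the figure readable):

from sklearn.tree import plot_tree

plt.figure(figsize=(12, 6))
plot_tree(decision_tree, max_depth=2, feature_names=list(cancer.feature_names),
          class_names=list(cancer.target_names), filled=True, fontsize=8)
plt.title('Top Levels of the Breast Cancer Decision Tree')
plt.show()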


Program 9: Develop a program to implement the Naive Bayesian classifier considering
Olivetti Face Data set for training. Compute the accuracy of the classifier, considering a few
test data sets.

Source Code:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import fetch_olivetti_faces
from sklearn.metrics import accuracy_score, classification_report

# Load Olivetti Faces Dataset
faces = fetch_olivetti_faces(shuffle=True, random_state=42)
X = faces.data
y = faces.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Naive Bayes Classifier
naive_bayes = GaussianNB()
naive_bayes.fit(X_train, y_train)

# Predict on test data
y_pred = naive_bayes.predict(X_test)

# Evaluate Model
accuracy = accuracy_score(y_test, y_pred)
print(f'Naive Bayes Accuracy: {accuracy}')
print(classification_report(y_test, y_pred))

# Classify a new sample
new_sample = np.array([X_test[0]])  # Using first test sample as an example
predicted_class = naive_bayes.predict(new_sample)
print(f'Predicted class for new sample: {predicted_class[0]}')

Output:

Naive Bayes Accuracy: 0.775
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         2
           1       1.00      1.00      1.00         1
           2       0.33      1.00      0.50         1
           3       0.00      0.00      0.00         3
           4       1.00      0.50      0.67         4
           5       1.00      1.00      1.00         2
           7       1.00      1.00      1.00         3
           8       1.00      0.67      0.80         3
           9       0.50      1.00      0.67         2
          10       1.00      1.00      1.00         1
          11       1.00      1.00      1.00         1
          12       0.50      0.67      0.57         3
          13       1.00      0.50      0.67         2
          14       0.00      0.00      0.00         4
          15       1.00      1.00      1.00         1
          16       0.67      1.00      0.80         2
          17       1.00      1.00      1.00         2
          18       1.00      1.00      1.00         3
          19       0.40      1.00      0.57         2
          20       1.00      1.00      1.00         3
          21       1.00      0.50      0.67         2
          22       1.00      0.40      0.57         5
          23       1.00      0.50      0.67         2
          24       1.00      1.00      1.00         1
          25       0.67      1.00      0.80         2
          26       1.00      1.00      1.00         1
          27       1.00      1.00      1.00         4
          28       0.00      0.00      0.00         0
          29       1.00      1.00      1.00         2
          30       1.00      1.00      1.00         1
          31       1.00      0.67      0.80         3
          32       1.00      1.00      1.00         1
          34       0.00      0.00      0.00         0
          35       1.00      1.00      1.00         2
          36       1.00      1.00      1.00         2
          38       1.00      1.00      1.00         3
          39       0.57      1.00      0.73         4

    accuracy                           0.78        80
   macro avg       0.80      0.79      0.77        80
weighted avg       0.82      0.78      0.76        80

Predicted class for new sample: 18
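
GaussianNB treats each of the 4,096 pixels as an independent feature, which partly explains the modest accuracy. One common variation is to compress and decorrelate the pixels with PCA before the classifier; a minimal sketch continuing from the split above (n_components=100 is an untuned, illustrative choice, and the effect on accuracy depends on the split):

from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA

pipeline = make_pipeline(PCA(n_components=100, random_state=42), GaussianNB())
pipeline.fit(X_train, y_train)
print('PCA + Naive Bayes Accuracy:', accuracy_score(y_test, pipeline.predict(X_test)))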


Program 10: Develop a program to implement k-means clustering using Wisconsin Breast
Cancer data set and visualize the clustering result.

Source Code:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA

# Load Breast Cancer Dataset
cancer = load_breast_cancer()
X = cancer.data

# Apply K-Means Clustering (n_init set explicitly to avoid the changed-default
# warning in recent scikit-learn versions)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
kmeans.fit(X)
labels = kmeans.labels_

# Reduce dimensions for visualization using PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Scatter plot of the clusters
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap='viridis', alpha=0.5)
plt.title('K-Means Clustering on Breast Cancer Data')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.colorbar(label='Cluster Label')
plt.show()

Output: (figure only: the two K-Means clusters plotted in the space of the first two principal components)
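
Since the Wisconsin dataset also ships with true benign/malignant labels, the unsupervised clusters can be compared against them. A minimal sketch using the adjusted Rand index, which is invariant to how the cluster numbers are permuted:

from sklearn.metrics import adjusted_rand_score

ari = adjusted_rand_score(cancer.target, labels)
print(f'Adjusted Rand Index vs. true diagnosis labels: {ari:.3f}')
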
Viva Questions:

1. What is the difference between supervised and unsupervised learning?
2. What are the key assumptions of the Naive Bayes classifier?
3. How does the k-Nearest Neighbors (k-NN) algorithm work?
4. What is the curse of dimensionality, and how does PCA help mitigate it?
5. What is the significance of the correlation matrix in data analysis?
6. How does the Find-S algorithm work for hypothesis learning?
7. What is the difference between parametric and non-parametric regression?
8. Why is feature scaling important in machine learning?
9. How do you evaluate the performance of a clustering algorithm?
10. What is the difference between K-Means clustering and hierarchical clustering?
11. How does Locally Weighted Regression differ from traditional regression models?
12. How does k-NN classify a new data point?
13. What are the advantages and disadvantages of Decision Trees?
14. How does the Naive Bayes classifier handle continuous data?
15. What is the role of the Gaussian assumption in Naive Bayes?
16. What are the hyperparameters in K-Means clustering, and how do they affect results?
17. What is the role of eigenvalues and eigenvectors in PCA?
18. How does polynomial regression differ from linear regression?
19. Why do we use test-train splits in machine learning models?
20. What are some real-world applications of K-Means clustering?
