ML Lab Manual
Contents

1. Practice Programs
2. Develop a program to create histograms for all numerical features and analyze the distribution of each feature. Generate box plots for all numerical features and identify any outliers. Use California Housing dataset.
3. Develop a program to compute the correlation matrix to understand the relationships between pairs of features. Visualize the correlation matrix using a heatmap to know which variables have strong positive/negative correlations. Create a pair plot to visualize pairwise relationships between features. Use California Housing dataset.
4. Develop a program to implement Principal Component Analysis (PCA) for reducing the dimensionality of the Iris dataset from 4 features to 2.
5. For a given set of training data examples stored in a .CSV file, implement and demonstrate the Find-S algorithm to output a description of the set of all hypotheses consistent with the training examples.
6. Develop a program to implement the k-Nearest Neighbour algorithm to classify 100 randomly generated values of x in the range [0, 1]. Perform the following based on the dataset generated:
   a. Label the first 50 points {x1, ..., x50} as follows: if xi ≤ 0.5 then xi ∈ Class1, else xi ∈ Class2
   b. Classify the remaining points x51, ..., x100 using KNN. Perform this for k = 1, 2, 3, 4, 5, 20, 30
7. Implement the non-parametric Locally Weighted Regression algorithm in order to fit data points. Select an appropriate data set for your experiment and draw graphs.
8. Develop a program to demonstrate the working of Linear Regression and Polynomial Regression. Use the Boston Housing Dataset for Linear Regression and the Auto MPG Dataset (for vehicle fuel efficiency prediction) for Polynomial Regression.
9. Develop a program to demonstrate the working of the decision tree algorithm. Use the Breast Cancer data set for building the decision tree and apply this knowledge to classify a new sample.
10. Develop a program to implement the Naive Bayesian classifier considering the Olivetti Face data set for training. Compute the accuracy of the classifier, considering a few test data sets.
11. Develop a program to implement k-means clustering using the Wisconsin Breast Cancer data set and visualize the clustering result.
12. Viva Questions
Practice Programs:
import pandas as pd

# Sample data (the original values were lost in extraction; these are assumed)
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 27, 22, 32],
    'City': ['Bangalore', 'Mumbai', 'Delhi', 'Chennai']
}
df = pd.DataFrame(data)
print(df)

# First and last two rows
print(df.head(2))
print(df.tail(2))

# Summary statistics and column information
print(df.describe())
print(df.columns)

# Get DataFrame shape (rows, columns)
print(df.shape)
print(df.dtypes)
print(df)
import numpy as np

# Introduce a missing value to demonstrate NaN handling
df.loc[1, 'Age'] = np.nan
print(df.isnull().sum())

# Replace the missing age with the column mean
df['Age'] = df['Age'].fillna(df['Age'].mean())
print(df)
import matplotlib.pyplot as plt

# Sample Data
x = np.arange(1, 11)
y = np.sin(x)
# Line Plot
plt.plot(x, y, marker='o', linestyle='-', color='b', label='Sine Wave')
plt.xlabel("X Values")
plt.ylabel("Y Values")
plt.title("Simple Line Plot")
plt.legend()
plt.grid(True)
plt.show()
10. Write a Python Script to Plot a Bar Chart (Category Comparison)
# Sample Data (assumed categories and values)
categories = ['A', 'B', 'C', 'D']
values = [23, 45, 12, 36]

# Bar Plot
plt.bar(categories, values, color='skyblue')
plt.xlabel("Categories")
plt.ylabel("Values")
plt.show()
11. Write a Python Script to Plot a Histogram (Distribution of Data)
import numpy as np
# 1000 samples from a standard normal distribution
data = np.random.randn(1000)

# Histogram
plt.hist(data, bins=30, edgecolor='black')
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()
12. Write a Python Script to Plot a Scatter Plot (Relationship between Two Variables)
import numpy as np

# Generate Data
x = np.random.rand(100)
y = np.random.rand(100)

# Scatter Plot
plt.scatter(x, y, alpha=0.7)
plt.xlabel("X Values")
plt.ylabel("Y Values")
plt.show()
13. Write a Python Script to Plot a Box Plot (Outlier Detection)
import numpy as np
import seaborn as sns
data = np.random.randn(100)

# Box Plot
sns.boxplot(data=data, color='lightblue')
plt.show()
14. Write a Python Script to Plot a Pair Plot (Multiple Feature Relationships - Iris
Dataset)
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target

# Pair Plot colored by species
sns.pairplot(df, hue='species')
plt.show()
15. Write a Python Script to Plot a Heatmap (Correlation Matrix - Titanic Dataset)
import pandas as pd
import seaborn as sns
df = sns.load_dataset("titanic").dropna()

# Compute Correlation (numeric columns only)
corr_matrix = df.corr(numeric_only=True)

# Heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title("Correlation Heatmap")
plt.show()
Program 1: Develop a program to create histograms for all numerical features and analyze
the distribution of each feature. Generate box plots for all numerical features and identify
any outliers. Use California Housing dataset.
Source Code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing

data = fetch_california_housing()
df = pd.DataFrame(data.data, columns=data.feature_names)

print("Dataset Overview:")
print(df.info())
print("\nSummary Statistics:")
print(df.describe())

sns.set_style("whitegrid")

# Histograms for all numerical features to analyze their distributions
df.hist(figsize=(12, 8), bins=30, edgecolor='black')
plt.tight_layout()
plt.show()

# Box plots for all numerical features to identify outliers
plt.figure(figsize=(14, 8))
for i, column in enumerate(df.columns):
    plt.subplot(3, 3, i + 1)
    sns.boxplot(y=df[column], color='lightblue')
plt.tight_layout()
plt.show()
# Flag outliers using the 1.5 * IQR rule
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
outliers = ((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR)))
print("\nOutlier Detection:")
print(outliers.sum())
Output:
Dataset Overview:
<class 'pandas.core.frame.DataFrame'>
dtypes: float64(8)

Summary Statistics:
(summary table and plots omitted)

Outlier Detection:
MedInc         681
HouseAge         0
AveRooms       511
AveBedrms     1424
Population    1196
AveOccup       711
Latitude         0
Longitude        0
dtype: int64
Program 2: Develop a program to compute the correlation matrix to understand the relationships between pairs of features. Visualize the correlation matrix using a heatmap to know which variables have strong positive/negative correlations. Create a pair plot to visualize pairwise relationships between features. Use California Housing dataset.
Source Code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing

data = fetch_california_housing()
df = pd.DataFrame(data.data, columns=data.feature_names)

sns.set_style("whitegrid")

# Correlation matrix visualized as a heatmap
plt.figure(figsize=(10, 6))
corr_matrix = df.corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title("Correlation Matrix Heatmap")
plt.show()

# Pair plot of pairwise feature relationships
sns.pairplot(df)
plt.show()

# Identify skewness of numerical features
skew_values = df.skew()
print("\nSkewness of Features:")
print(skew_values)
Output:
Skewness of Features:
MedInc 1.646657
HouseAge 0.060331
AveRooms 20.697869
AveBedrms 31.316956
Population 4.935858
AveOccup 97.639561
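The very large skew values for AveRooms, AveBedrms, Population and AveOccup indicate heavy right tails driven by a few extreme values. As a minimal sketch (not part of the original program), a log transform is one common way to compress such a tail before further analysis:

import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing

data = fetch_california_housing()
df = pd.DataFrame(data.data, columns=data.feature_names)

# log1p compresses the long right tail; the skew of AveOccup drops from
# roughly 97 to a much smaller value after the transform
print(df['AveOccup'].skew(), np.log1p(df['AveOccup']).skew())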
Program 3: Develop a program to implement Principal Component Analysis (PCA) for reducing the dimensionality of the Iris dataset from 4 features to 2.
Source Code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Standardize features before applying PCA
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)

# Reduce the 4 features to 2 principal components
pca = PCA(n_components=2)
principal_components = pca.fit_transform(df_scaled)
pca_df = pd.DataFrame(principal_components, columns=['PC1', 'PC2'])
pca_df['Target'] = data.target

# Scatter plot of the two components, one color per species
plt.figure(figsize=(8, 6))
for target, label in enumerate(data.target_names):
    subset = pca_df[pca_df['Target'] == target]
    plt.scatter(subset['PC1'], subset['PC2'], label=label)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend()
plt.grid(True)
plt.show()
Output: (scatter plot of the Iris samples projected onto the first two principal components, one color per species)
Program 4: For a given set of training data examples stored in a .CSV file, implement and demonstrate the Find-S algorithm to output a description of the set of all hypotheses consistent with the training examples.
Source Code:
import csv

num_attributes = 6
a = []

# Read the training examples from the CSV file (file name assumed)
with open('enjoysport.csv', 'r') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        a.append(row)
        print(row)

# Start with the most specific hypothesis, then copy the first example into it
hypothesis = ['0'] * num_attributes
print("\nThe initial hypothesis is:")
print(hypothesis)
for j in range(0, num_attributes):
    hypothesis[j] = a[0][j]

# Generalize over every positive ('yes') example: keep attribute values that
# match, replace mismatches with the "don't care" symbol '?'
for i in range(0, len(a)):
    if a[i][num_attributes] == 'yes':
        for j in range(0, num_attributes):
            if a[i][j] != hypothesis[j]:
                hypothesis[j] = '?'
        print("For Training instance No:{0} the hypothesis is {1}".format(i, hypothesis))

print("\nThe Maximally Specific Hypothesis for the given Training Examples:\n")
print(hypothesis)
Output:
For Training instance No:3 the hypothesis is ['sunny', 'warm', '?', 'strong', '?', '?']
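The training file is assumed to be Mitchell's classic EnjoySport data, which is consistent with the hypothesis printed above; a minimal enjoysport.csv would look like:

sunny,warm,normal,strong,warm,same,yes
sunny,warm,high,strong,warm,same,yes
rainy,cold,high,strong,warm,change,no
sunny,warm,high,strong,cool,change,yes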
Program 5: Develop a program to implement the k-Nearest Neighbour algorithm to classify 100 randomly generated values of x in the range [0, 1]. Perform the following based on the dataset generated:
a. Label the first 50 points {x1, ..., x50} as follows: if xi ≤ 0.5 then xi ∈ Class1, else xi ∈ Class2
b. Classify the remaining points x51, ..., x100 using KNN. Perform this for k = 1, 2, 3, 4, 5, 20, 30
Source Code:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

# 100 random values in [0, 1]
x = np.random.rand(100, 1)

# Label the first 50 points: Class1 (1) if xi <= 0.5, else Class2 (2)
y_train = np.where(x[:50] <= 0.5, 1, 2).ravel()
X_train, X_test = x[:50], x[50:]

k_values = [1, 2, 3, 4, 5, 20, 30]

# Classify the remaining 50 points for each k, plotting and printing results
plt.figure(figsize=(10, 6))
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    plt.scatter(X_test, y_pred, alpha=0.6, label=f'k={k}')
    print(f"Predictions for k={k}: {y_pred}")
plt.xlabel('X values')
plt.ylabel('Predicted Class')
plt.legend()
plt.show()
Output:
Predictions for k=1: [1 2 2 2 2 1 1 2 2 1 1 2 1 1 1 1 2 1 2 1 1 1 1 2 2 1 1 1 2 1 2 2 1 2 1 2 1
 1 1 1 1 1 2 1 1 1 1 2 2 1]
(predictions for k = 2, 3, 4, 5, 20 and 30 are printed in the same format)
Program 6: Implement the non-parametric Locally Weighted Regression algorithm in order
to fit data points. Select appropriate data set for your experiment and draw graphs.
Source Code:
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
# Noisy sine data (assumed; the original data-generation lines were lost)
X = np.linspace(0, 2 * np.pi, 100)
y = np.sin(X) + 0.1 * np.random.randn(100)
m = X.shape[0]
Xb = np.c_[np.ones(m), X]  # design matrix with a bias column

def lwr(x_query, tau=0.5):
    # Gaussian weights centered on the query point
    w = np.exp(-(X - x_query) ** 2 / (2 * tau ** 2))
    # Weighted least squares: theta = (Xb' W Xb)^-1 Xb' W y
    theta = np.linalg.pinv(Xb.T @ np.diag(w) @ Xb) @ (Xb.T @ (w * y))
    return np.array([1.0, x_query]) @ theta

y_pred = np.array([lwr(xq) for xq in X])

plt.figure(figsize=(10, 6))
plt.scatter(X, y, alpha=0.5, label='Data points')
plt.plot(X, y_pred, color='red', label='LWR fit (tau = 0.5)')
plt.xlabel('X values')
plt.ylabel('Y values')
plt.legend()
plt.show()
Output: (scatter of the data points with the locally weighted regression curve overlaid)
Program 7: Develop a program to demonstrate the working of Linear Regression and
Polynomial Regression. Use Boston Housing Dataset for Linear Regression and Auto MPG
Dataset (for vehicle fuel efficiency prediction) for Polynomial Regression.
Source Code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

# Linear Regression (the Boston Housing dataset was removed from scikit-learn,
# so the California Housing dataset is used in its place)
boston = fetch_california_housing()
X_boston = boston.data
y_boston = boston.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X_boston, y_boston, test_size=0.2, random_state=42)

linear_reg = LinearRegression()
linear_reg.fit(X_train, y_train)
y_pred = linear_reg.predict(X_test)

# Evaluate Model
mse = mean_squared_error(y_test, y_pred)
print(f"Linear Regression MSE: {mse:.4f}")

plt.scatter(y_test, y_pred, alpha=0.5)
plt.xlabel('Actual Prices')
plt.ylabel('Predicted Prices')
plt.show()

# Polynomial Regression on the Auto MPG dataset
auto_mpg = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/mpg.csv").dropna()
X_auto = auto_mpg[['horsepower']].values
y_auto = auto_mpg['mpg'].values

# Split data
X_train, X_test, y_train, y_test = train_test_split(X_auto, y_auto, test_size=0.2, random_state=42)

# Degree-2 polynomial model
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X_train, y_train)
y_pred_poly = poly_model.predict(X_test)

# Evaluate Model
mse_poly = mean_squared_error(y_test, y_pred_poly)
print(f"Polynomial Regression MSE: {mse_poly:.4f}")

# Plot the fitted curve over sorted horsepower values
X_sorted = np.sort(X_auto, axis=0)
y_sorted = poly_model.predict(X_sorted)
plt.scatter(X_auto, y_auto, alpha=0.5, label='Data')
plt.plot(X_sorted, y_sorted, color='red', label='Polynomial fit (degree 2)')
plt.xlabel('Horsepower')
plt.ylabel('MPG')
plt.legend()
plt.show()
Output: (Linear Regression MSE with an actual-vs-predicted scatter plot; Polynomial Regression MSE with the fitted horsepower-MPG curve)
Program 8: Develop a program to demonstrate the working of the decision tree algorithm. Use the Breast Cancer data set for building the decision tree and apply this knowledge to classify a new sample.
Source Code:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

cancer = load_breast_cancer()
X = cancer.data
y = cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

decision_tree = DecisionTreeClassifier(random_state=42)
decision_tree.fit(X_train, y_train)
y_pred = decision_tree.predict(X_test)

# Evaluate Model
print(classification_report(y_test, y_pred))

# Classify a new sample (the first test sample is reused for illustration)
new_sample = X_test[0].reshape(1, -1)
predicted_class = decision_tree.predict(new_sample)
print("Predicted class:", cancer.target_names[predicted_class][0])
Output: (classification report on the test set and the predicted class for the new sample)
Program 9: Develop a program to implement the Naive Bayesian classifier considering the Olivetti Face data set for training. Compute the accuracy of the classifier, considering a few test data sets.
Source Code:
import numpy as np
from sklearn.datasets import fetch_olivetti_faces
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

faces = fetch_olivetti_faces()
X = faces.data
y = faces.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

naive_bayes = GaussianNB()
naive_bayes.fit(X_train, y_train)
y_pred = naive_bayes.predict(X_test)

# Evaluate Model (the report includes overall accuracy)
print(classification_report(y_test, y_pred))

# Classify a new sample (the first test image is reused for illustration)
new_sample = X_test[0].reshape(1, -1)
predicted_class = naive_bayes.predict(new_sample)
print("Predicted person ID:", predicted_class[0])
Output:
accuracy 0.78 80
Program 10: Develop a program to implement k-means clustering using the Wisconsin Breast Cancer data set and visualize the clustering result.
Source Code:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

cancer = load_breast_cancer()
X = cancer.data

# Two clusters, matching the benign/malignant structure of the data
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
kmeans.fit(X)
labels = kmeans.labels_

# Project to 2-D with PCA for visualization
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap='viridis', alpha=0.6)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.colorbar(label='Cluster Label')
plt.show()
Output: (2-D PCA scatter of the data, colored by k-means cluster label)
Viva Questions:
4. What is the curse of dimensionality, and how does PCA help mitigate it?
10. What is the difference between K-Means clustering and hierarchical clustering?
11. How does Locally Weighted Regression differ from traditional regression models?
14. How does the Naive Bayes classifier handle continuous data?
16. What are the hyperparameters in K-Means clustering, and how do they affect results?
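As a hint for question 16: scikit-learn's KMeans exposes n_clusters, init, n_init and max_iter as its main hyperparameters. A minimal sketch (assuming the Breast Cancer features X from Program 10) of how the choice of n_clusters affects the result:

from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import silhouette_score

X = load_breast_cancer().data

# Inertia always decreases as n_clusters grows, so the silhouette score is
# the more useful guide when comparing choices of k
for k in range(2, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(k, round(km.inertia_, 1), round(silhouette_score(X, km.labels_), 3))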