
Iris Dataset Analysis with KNN & K-Means

The document implements several classification and clustering algorithms on the iris flower dataset. Functions are defined to read the dataset into a matrix, compute distances between data points (Manhattan, Euclidean, Chebyshev, and Mahalanobis), perform k-nearest-neighbors classification, and perform k-means clustering. KNN classification is demonstrated with each distance metric, and k-means clustering with the same metrics groups the iris data into k = 3 clusters.


dsm-3

December 19, 2023

0.1 1. Write a function to read a data set and store it as a matrix

[201]: import pandas as pd
import numpy as np
from scipy.spatial import distance
from collections import Counter

def read_dataset(filename):
    # read the CSV and convert the DataFrame to a NumPy matrix
    df = pd.read_csv(filename)
    matrix = df.to_numpy()
    return matrix

csv_file = "iris.csv"
dataset = read_dataset(csv_file)

[202]: # Print the matrix
print(dataset)

[[5.1 3.5 1.4 0.2 'Setosa']
[4.9 3.0 1.4 0.2 'Setosa']
[4.7 3.2 1.3 0.2 'Setosa']
[5.0 3.4 1.5 0.2 'Setosa']
[5.7 3.8 1.7 0.3 'Setosa']
[5.1 3.8 1.5 0.3 'Setosa']
[5.5 4.2 1.4 0.2 'Setosa']
[4.9 3.1 1.5 0.2 'Setosa']
[5.0 3.2 1.2 0.2 'Setosa']
[5.0 3.3 1.4 0.2 'Setosa']
[7.0 3.2 4.7 1.4 'Versicolor']
[6.9 3.1 4.9 1.5 'Versicolor']
[5.5 2.3 4.0 1.3 'Versicolor']
[6.5 2.8 4.6 1.5 'Versicolor']
[6.3 2.5 4.9 1.5 'Versicolor']
[6.0 3.4 4.5 1.6 'Versicolor']
[6.7 3.1 4.7 1.5 'Versicolor']
[6.3 2.3 4.4 1.3 'Versicolor']
[5.6 3.0 4.1 1.3 'Versicolor']
[5.1 2.5 3.0 1.1 'Versicolor']
[6.3 3.3 6.0 2.5 'Virginica']
[5.8 2.7 5.1 1.9 'Virginica']
[7.1 3.0 5.9 2.1 'Virginica']
[6.3 2.9 5.6 1.8 'Virginica']
[6.5 3.0 5.8 2.2 'Virginica']
[7.6 3.0 6.6 2.1 'Virginica']
[4.9 2.5 4.5 1.7 'Virginica']
[7.3 2.9 6.3 1.8 'Virginica']
[6.7 2.5 5.8 1.8 'Virginica']
[7.2 3.6 6.1 2.5 'Virginica']]

0.2 2.a Calculate the data mean for each attribute and represent it as a vector
[216]: def calculate_data_mean(filename):
    # per-attribute mean; numeric_only=True skips the label column
    df = pd.read_csv(filename)
    mean_vector = df.mean(numeric_only=True)
    return mean_vector

csv_file = "iris.csv"
mean_vector = calculate_data_mean(csv_file)
print("Mean Vector:")
print(mean_vector)

Mean Vector:
sepal.length    5.95
sepal.width     3.07
petal.length    3.86
petal.width     1.22
dtype: float64
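
The same vector can also be obtained from the matrix returned by read_dataset in question 1; a minimal sketch, assuming the first four columns hold the numeric attributes:

numeric = dataset[:, :4].astype(float)  # drop the 'variety' label column
print(numeric.mean(axis=0))             # [5.95 3.07 3.86 1.22]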

0.3 2.b Calculate Manhattan distance between two data objects

[204]: def manhattan_distance(vec1, vec2):
    # sum of absolute coordinate differences (the L1 norm)
    dist = np.sum(np.abs(np.array(vec1) - np.array(vec2)))
    return dist

0.4 2.c Calculate Euclidean distance between two data objects

[205]: # calculating Euclidean distance using np.linalg.norm()
def euclidean_distance(vec1, vec2):
    dist = np.linalg.norm(vec1 - vec2)
    return dist

0.5 2.d Calculate Chebyshev distance between two data objects
[206]: def chebyshev_distance(vec1, vec2):
    # maximum absolute coordinate difference (the L-infinity norm)
    dist = np.max(np.abs(np.array(vec1) - np.array(vec2)))
    return dist
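
As a quick sanity check, the three metrics can be compared on the first two rows of the printed matrix; a minimal sketch using the functions just defined:

a = np.array([5.1, 3.5, 1.4, 0.2])
b = np.array([4.9, 3.0, 1.4, 0.2])
print(manhattan_distance(a, b))  # |0.2| + |0.5| = 0.7
print(euclidean_distance(a, b))  # sqrt(0.2**2 + 0.5**2) = 0.539 (approx.)
print(chebyshev_distance(a, b))  # max(0.2, 0.5) = 0.5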

0.6 2.e Calculate Mahalanobis distance

[207]: def mahalanobis_distance(data, x):
    # distance from point x to the mean of `data`, scaled by the
    # inverse covariance matrix of `data`
    mean_vector = data.mean().values
    cov_matrix = data.cov().values
    inv_cov_matrix = np.linalg.inv(cov_matrix)
    x_minus_mean = x - mean_vector
    mahalanobis_sq = np.dot(np.dot(x_minus_mean, inv_cov_matrix), x_minus_mean.T)
    return np.sqrt(mahalanobis_sq)

iris_data = pd.read_csv('iris.csv')
columns = ['sepal.length', 'sepal.width', 'petal.length', 'petal.width']
iris_subset = iris_data[columns]

point = np.array([5.0, 3.2, 1.4, 0.2])  # Example point

maha_dist = mahalanobis_distance(iris_subset, point)
print("Mahalanobis Distance:", maha_dist)

# ref [Link]

Mahalanobis Distance: 1.357839356712021
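
The value can be cross-checked against SciPy's built-in implementation; a minimal sketch, assuming the from scipy.spatial import distance line above is available:

inv_cov = np.linalg.inv(iris_subset.cov().values)
mean_vec = iris_subset.mean().values
# scipy computes sqrt((u - v) VI (u - v)^T), the same quantity as above
print(distance.mahalanobis(point, mean_vec, inv_cov))  # should equal the value above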

0.7 Write a separate function to implement the K-Nearest Neighbors classification method using all the functions implemented in question (2) above
[209]: def knn_classify(data, labels, query_point, k, distance_metric):
    distances = []
    for i, row in data.iterrows():
        if distance_metric == 'manhattan':
            dist = manhattan_distance(row, query_point)
        elif distance_metric == 'chebyshev':
            dist = chebyshev_distance(row, query_point)
        elif distance_metric == 'euclidean':
            dist = euclidean_distance(row, query_point)
        elif distance_metric == 'mahalanobis':
            # note: this measures the query point's distance to the data mean,
            # so it is the same for every row (see the note after the output)
            dist = mahalanobis_distance(data, query_point)
        else:
            raise ValueError("Invalid distance metric. Supported options are "
                             "'manhattan', 'chebyshev', 'euclidean', and 'mahalanobis'.")
        distances.append((dist, labels[i]))

    # sort by distance and take a majority vote among the k nearest labels
    distances.sort()
    k_nearest = distances[:k]
    k_nearest_labels = [label for (_, label) in k_nearest]

    most_common = Counter(k_nearest_labels).most_common(1)
    predicted_label = most_common[0][0]

    return predicted_label

iris_data = pd.read_csv('iris.csv')
feature_columns = ['sepal.length', 'sepal.width', 'petal.length', 'petal.width']
iris_features = iris_data[feature_columns]
iris_labels = iris_data['variety']
random_point = np.array([6.1, 2.9, 4.7, 1.3])
k = 5  # Number of nearest neighbors to consider

distance_metrics = ['manhattan', 'chebyshev', 'euclidean', 'mahalanobis']

for metric in distance_metrics:
    predicted_label = knn_classify(iris_features, iris_labels, random_point, k, metric)
    print(f"Predicted variety using {metric.capitalize()} distance: {predicted_label}")

Predicted variety using Manhattan distance: Versicolor
Predicted variety using Chebyshev distance: Versicolor
Predicted variety using Euclidean distance: Versicolor
Predicted variety using Mahalanobis distance: Setosa

Note that the Mahalanobis prediction differs from the others: mahalanobis_distance(data, query_point) does not depend on the row being compared, so every neighbor receives the same distance. The (distance, label) tuples then sort alphabetically by label, and the first k entries are all Setosa.
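
For reference, the Euclidean run can be checked against a library classifier; a minimal sketch, assuming scikit-learn is installed (it is not used elsewhere in this notebook):

from sklearn.neighbors import KNeighborsClassifier

# 5-NN with the default Euclidean (Minkowski, p=2) metric
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(iris_features.values, iris_labels)
print(knn.predict(random_point.reshape(1, -1)))  # expected to agree: ['Versicolor']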

0.8 Write a separate function to implement the K-means clustering method using all the functions implemented in question (2) above
[214]: def initialize_centroids(data, k):
    # pick k distinct rows at random as starting centroids
    # (no random seed is set, so results vary between runs)
    centroids = data[np.random.choice(range(data.shape[0]), k, replace=False)]
    return centroids

def assign_clusters(data, centroids, distance_metric):
    cluster_labels = np.zeros(data.shape[0], dtype=int)
    if distance_metric == 'mahalanobis':
        # the covariance matrix is constant, so compute it once
        covariance_matrix = np.cov(data.T)
    for i, point in enumerate(data):
        distances = []
        if distance_metric == 'mahalanobis':
            for centroid in centroids:
                distances.append(mahalanobis_distance(point, centroid, covariance_matrix))
        elif distance_metric == 'manhattan':
            for centroid in centroids:
                distances.append(manhattan_distance(point, centroid))
        elif distance_metric == 'chebyshev':
            for centroid in centroids:
                distances.append(chebyshev_distance(point, centroid))
        elif distance_metric == 'euclidean':
            for centroid in centroids:
                distances.append(euclidean_distance(point, centroid))
        # assign the point to its nearest centroid
        cluster_labels[i] = np.argmin(distances)
    return cluster_labels

def update_centroids(data, cluster_labels, k):
    # each new centroid is the mean of the points assigned to its cluster
    centroids = []
    for i in range(k):
        cluster_data = data[cluster_labels == i]
        centroid = np.mean(cluster_data, axis=0)
        centroids.append(centroid)
    centroids = np.array(centroids)
    return centroids

def kmeans(data, k, distance_metric='euclidean', max_iterations=100):
    # alternate assignment and update steps until the centroids stop moving
    centroids = initialize_centroids(data, k)
    for _ in range(max_iterations):
        cluster_labels = assign_clusters(data, centroids, distance_metric)
        new_centroids = update_centroids(data, cluster_labels, k)
        if np.array_equal(centroids, new_centroids):
            break
        centroids = new_centroids
    return cluster_labels, centroids

def euclidean_distance(vec1, vec2):
    dist = np.linalg.norm(vec1 - vec2)
    return dist

def manhattan_distance(vec1, vec2):
    dist = np.sum(np.abs(vec1 - vec2))
    return dist

def chebyshev_distance(vec1, vec2):
    dist = np.max(np.abs(vec1 - vec2))
    return dist

def mahalanobis_distance(vec1, vec2, covariance_matrix):
    # point-to-point Mahalanobis distance under a given covariance matrix
    diff = vec1 - vec2
    inv_covariance = np.linalg.inv(covariance_matrix)
    dist = np.sqrt(np.dot(np.dot(diff, inv_covariance), diff.T))
    return dist

iris_data = np.genfromtxt('iris.csv', delimiter=',', skip_header=1, usecols=(0, 1, 2, 3))

k = 3
distance_metrics = ['mahalanobis', 'manhattan', 'chebyshev', 'euclidean']
for metric in distance_metrics:
    cluster_labels, centroids = kmeans(iris_data, k, distance_metric=metric)
    print(f"Distance Metric: {metric}")
    print("Cluster Labels:")
    print(cluster_labels)
    print("Centroids:")
    print(centroids)
    print()

Distance Metric: mahalanobis
Cluster Labels:
[2 0 2 2 2 2 2 2 0 2 1 1 0 1 1 2 1 0 2 0 1 1 1 2 1 1 2 1 1 1]
Centroids:
[[5.36       2.66       2.8        0.82      ]
 [6.76153846 2.97692308 5.49230769 1.86923077]
 [5.31666667 3.34166667 2.53333333 0.68333333]]

Distance Metric: manhattan
Cluster Labels:
[0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 2 1 2 2 2 2 1 2 2 2]
Centroids:
[[5.09090909 3.36363636 1.57272727 0.3       ]
 [6.13636364 2.80909091 4.58181818 1.5       ]
 [6.875      3.025      6.0125     2.1       ]]

Distance Metric: chebyshev
Cluster Labels:
[0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 2 1 2 2 2 2 1 2 2 2]
Centroids:
[[5.09090909 3.36363636 1.57272727 0.3       ]
 [6.13636364 2.80909091 4.58181818 1.5       ]
 [6.875      3.025      6.0125     2.1       ]]

Distance Metric: euclidean
Cluster Labels:
[0 0 0 0 0 0 0 0 0 0 2 2 2 2 2 2 2 2 2 0 1 2 1 1 1 1 2 1 1 1]
Centroids:
[[5.09090909 3.36363636 1.57272727 0.3       ]
 [6.875      3.025      6.0125     2.1       ]
 [6.13636364 2.80909091 4.58181818 1.5       ]]
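
The Manhattan, Chebyshev, and Euclidean runs above recover the same partition of the 30 points; only the cluster indices differ, since k-means labels are arbitrary. As an optional cross-check of the Euclidean run, a minimal sketch assuming scikit-learn is installed (it is not used elsewhere in this notebook):

from sklearn.cluster import KMeans

# n_init=10 random restarts; random_state pins the seed for reproducibility
km = KMeans(n_clusters=3, n_init=10, random_state=0)
print(km.fit_predict(iris_data))  # grouping should match the Euclidean run, up to relabeling
print(km.cluster_centers_)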
