clustering_assignment
July 13, 2020
1 Peer-graded Assignment: Clustering
In [3]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans

banknote_data = pd.read_csv('banknote_authentication_dataset.csv')
# Drop rows with missing values (dropna returns a new DataFrame, so reassign it)
banknote_data = banknote_data.dropna()
# Calculate V1/V2 mean
value_one = [banknote_data['V1'].mean(), banknote_data['V2'].mean()]
print("Mean:", value_one)
# Calculate V1/V2 standard deviation
value_two = [banknote_data['V1'].std(), banknote_data['V2'].std()]
print("Standard Deviation:", value_two)
banknote_data.describe()
Mean: [0.43373525728862977, 1.9223531209912539]
Standard Deviation: [2.8427625862451658, 5.869046743580378]
Out[3]:
                 V1           V2
count   1372.000000  1372.000000
mean       0.433735     1.922353
std        2.842763     5.869047
min       -7.042100   -13.773100
25%       -1.773000    -1.708200
50%        0.496180     2.319650
75%        2.821475     6.814625
max        6.824800    12.951600
1.0.1 Data Analysis
We used the seaborn.pairplot function to plot pairwise relationships in the provided banknote
authentication dataset. The resulting scatterplots of the joint relationships and histograms of the
univariate distributions are shown below:
In [4]: sns.pairplot(banknote_data)
Out[4]: <seaborn.axisgrid.PairGrid at 0x7f529070a400>
Looking at the scatterplots, we can deduce that the values are slightly higher for fake banknotes.
1.0.2 Dataset Normalization
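We normalize the dataset with min-max scaling so that V1 and V2 both lie between 0 and 1, then
recompute the mean and standard deviation on the normalized values before separating fake and
authentic banknotes.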
In [6]: # Normalize the dataset (min-max scaling to the range [0, 1])
normalize_banknote_data = (banknote_data - banknote_data.min()) / (banknote_data.max() - banknote_data.min())
normalize_mean = [normalize_banknote_data['V1'].mean(), normalize_banknote_data['V2'].mean()]
normalize_std = [normalize_banknote_data['V1'].std(), normalize_banknote_data['V2'].std()]
print("Mean:", normalize_mean)
print("Standard Deviation:", normalize_std)
Mean: [0.5391136632764807, 0.5873013774145737]
Standard Deviation: [0.20500346769971411, 0.2196113237409729]
In [7]: plt.xlabel('V1')
plt.ylabel('V2')
plt.scatter(normalize_banknote_data['V1'], normalize_banknote_data['V2'], alpha=0.25)
plt.scatter(normalize_mean[0], normalize_mean[1], label="Mean")
plt.title("OpenML Banknote Authentication Dataset")
plt.legend()
plt.show()
In [11]: # K-means algorithm
from sklearn.cluster import KMeans
# Fit on V1 and V2 only, so re-running the cell never clusters on the 'Class' column
features = normalize_banknote_data[['V1', 'V2']]
kmeans = KMeans(n_clusters=2).fit(features)
# Centres of our clusters
clusters = kmeans.cluster_centers_
# Classification of the elements
y_kmeans = kmeans.predict(features)
# Create a column with the labels
normalize_banknote_data['Class'] = y_kmeans
class_1 = normalize_banknote_data[normalize_banknote_data['Class'] == 0]
class_2 = normalize_banknote_data[normalize_banknote_data['Class'] == 1]
# Plot the two clusters and their centres
plt.xlabel('V1')
plt.ylabel('V2')
plt.scatter(class_1['V1'], class_1['V2'], label="Class 1", alpha=0.5)
plt.scatter(class_2['V1'], class_2['V2'], label="Class 2", alpha=0.5)
plt.scatter(clusters[:, 0], clusters[:, 1], c='blue', s=10000, alpha=0.2)
plt.title("Fake and Authentic Banknotes")
plt.legend()
plt.show()
1.0.3 Discussion
We used the K-means algorithm to identify the two clusters and to determine which element belongs
to which class after normalizing the dataset. Additionally, we added a column with the cluster
labels and plotted the result to visualize the fake and authentic banknote classes.
As the task instructed, I ran K-means several times to check its stability and
noticed that the positions of the centroids did not change much.
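The stability check described above can be sketched as follows. This is a minimal illustration, not the code that was run for the assignment: the points are synthetic stand-ins for the normalized V1/V2 columns, and the helper name sorted_centroids, the seeds 0 to 9, and n_init=10 are all assumptions chosen for the example. Because K-means numbers its clusters arbitrarily, the centroids are sorted before runs are compared.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the normalized V1/V2 columns (NOT the banknote data):
# two well-separated blobs inside the unit square.
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=[0.3, 0.4], scale=0.05, size=(200, 2)),
    rng.normal(loc=[0.7, 0.7], scale=0.05, size=(200, 2)),
])

def sorted_centroids(data, seed):
    """Fit 2-cluster K-means and return the centroids ordered by first coordinate."""
    model = KMeans(n_clusters=2, n_init=10, random_state=seed).fit(data)
    centers = model.cluster_centers_
    return centers[np.argsort(centers[:, 0])]

# One fit per seed; the result has shape (runs, clusters, features).
runs = np.array([sorted_centroids(points, seed) for seed in range(10)])

# Largest distance of any run's centroid from the average centroid position.
mean_centers = runs.mean(axis=0)
max_shift = np.linalg.norm(runs - mean_centers, axis=2).max()
print("Average centroids:\n", mean_centers)
print("Largest centroid shift across runs:", max_shift)
```

A small max_shift relative to the distance between the two centroids suggests the clustering is stable with respect to initialization; on real data the same loop would take the normalized V1 and V2 columns in place of points.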
1.0.4 Recommendation
The algorithm can help financial institutions or independent researchers categorize new inputs into
two classes, making it easier for them to differentiate between real and fake banknotes.
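Categorizing new input could look roughly like the sketch below. It assumes that new measurements must be scaled with the minimum and range of the training data before being passed to the fitted model; the training points are synthetic, and the names train, scale, and new_notes are hypothetical, introduced only for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the raw V1/V2 training columns (NOT the banknote data).
rng = np.random.default_rng(1)
train = np.vstack([
    rng.normal(loc=[-2.0, -3.0], scale=0.5, size=(100, 2)),
    rng.normal(loc=[3.0, 6.0], scale=0.5, size=(100, 2)),
])

# Keep the training minimum and range so new rows are scaled identically.
train_min = train.min(axis=0)
train_range = train.max(axis=0) - train_min

def scale(rows):
    """Apply the training set's min-max scaling to raw V1/V2 rows."""
    return (np.asarray(rows, dtype=float) - train_min) / train_range

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scale(train))

# Two hypothetical new measurements, one near each synthetic blob.
new_notes = [[-2.1, -2.8], [3.2, 5.9]]
labels = model.predict(scale(new_notes))
print(labels)
```

K-means assigns cluster numbers arbitrarily, so the labels only say which group a note falls into. Deciding which cluster corresponds to fake banknotes would still require some notes whose authenticity is already known.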