Name: Harshad Kamble
Roll No : 23
Aim: Assignment on Clustering Techniques
Download the customer dataset from the link below:
Data Set: https://www.kaggle.com/shwetabh123/mall-customers
This dataset records the annual income and spending of customers visiting a shopping mall. It contains Customer ID, Gender, Age, Annual Income, and Spending Score. As the mall owner, you need to find the groups of customers who are most profitable for the mall. Apply at least two clustering algorithms (based on Spending Score) to find these groups.
a. Apply data pre-processing techniques (label encoding, data transformation, ...) if necessary.
b. Perform data preparation (train-test split).
c. Apply a machine learning algorithm.
d. Evaluate the model.
e. Apply cross-validation and evaluate the model.
In [1]: import pandas as pd
In [2]: import matplotlib.pyplot as plt
In [3]: from matplotlib.lines import Line2D
In [4]: from sklearn.preprocessing import StandardScaler
In [5]: from sklearn.decomposition import PCA
In [8]: from sklearn.cluster import KMeans
In [9]: df=pd.read_csv("/home/student/TE52/Mall_Customers.csv")#read the specified csv file into a dataframe
In [15]: df.head()#to show first few rows of dataframe
Out[15]:
CustomerID Gender Age Annual Income (k$) Spending Score (1-100)
0 1 Male 19 15 39
1 2 Male 21 15 81
2 3 Female 20 16 6
3 4 Female 23 16 77
4 5 Female 31 17 40
In [56]: df.rename(columns = {'Annual Income (k$)':'Annual Income'}, inplace=True)#shorten column name
In [55]: df.rename(columns = {'Spending Score (1-100)':'Spending Score'}, inplace=True)#shorten column name
In [57]: df.describe()#summary statistics of the dataframe
Out[57]:
CustomerID Age Annual Income Spending Score
count 200.000000 200.000000 200.000000 200.000000
mean 100.500000 38.850000 60.560000 50.200000
std 57.879185 13.969007 26.264721 25.823522
min 1.000000 18.000000 15.000000 1.000000
25% 50.750000 28.750000 41.500000 34.750000
50% 100.500000 36.000000 61.500000 50.000000
75% 150.250000 49.000000 78.000000 73.000000
max 200.000000 70.000000 137.000000 99.000000
In [60]: df.isnull().sum()#check for null values
Out[60]: CustomerID 0
Gender 0
Age 0
Annual Income 0
Spending Score 0
dtype: int64
In [47]: df.shape#number of rows and columns
Out[47]: (200, 5)
In [49]: df['Gender'].value_counts()#count customers per gender
Out[49]: Gender
Female 112
Male 88
Name: count, dtype: int64
In [61]: print(df.columns.tolist())#check column names
['CustomerID', 'Gender', 'Age', 'Annual Income', 'Spending Score']
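Task (a) calls for label encoding where necessary. Gender is the only categorical column; below is a minimal sketch of encoding it with scikit-learn's LabelEncoder (the Gender_encoded column name is an illustrative choice, not part of the original notebook):

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['Gender_encoded'] = le.fit_transform(df['Gender'])  # Female -> 0, Male -> 1 (alphabetical order)
print(df[['Gender', 'Gender_encoded']].head())

The clustering below uses only the numeric columns, so this step is optional here.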
In [79]: sc = StandardScaler()#create a StandardScaler instance
In [80]: numeric_features = df[['Age', 'Annual Income', 'Spending Score']]
In [81]: print(numeric_features.head())#printing first few values
Age Annual Income Spending Score
0 19 15 39
1 21 15 81
2 20 16 6
3 23 16 77
4 31 17 40
In [82]: numeric_features_scaled = sc.fit_transform(numeric_features)#scale the values
In [83]: df_scaled = pd.DataFrame(numeric_features_scaled, columns=numeric_features.columns)#rebuild a dataframe from the scaled array
In [90]: print(df_scaled)#display scaled dataframe
Age Annual Income Spending Score
0 -1.424569 -1.738999 -0.434801
1 -1.281035 -1.738999 1.195704
2 -1.352802 -1.700830 -1.715913
3 -1.137502 -1.700830 1.040418
4 -0.563369 -1.662660 -0.395980
.. ... ... ...
195 -0.276302 2.268791 1.118061
196 0.441365 2.497807 -0.861839
197 -0.491602 2.497807 0.923953
198 -0.491602 2.917671 -1.250054
199 -0.635135 2.917671 1.273347
[200 rows x 3 columns]
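Task (b) asks for a train-test split. Clustering is unsupervised, so a split is not strictly required, but here is a minimal sketch using scikit-learn's train_test_split (the 80/20 ratio and random_state=42 are assumptions):

from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(df_scaled, test_size=0.2, random_state=42)
print("train shape:", train_data.shape)  # expected (160, 3)
print("test shape:", test_data.shape)    # expected (40, 3)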
In [91]: pca = PCA(n_components = 2)#creating pca object
In [93]: df_pca = pca.fit_transform(df_scaled)#fitting and transforming data
In [95]: print("data shape after PCA :",df_pca.shape)#printing shape of transformed
data shape after PCA : (200, 2)
In [96]: print("data_pca is:",df_pca)#printing transformed data
data_pca is: [[-6.15720019e-01 -1.76348088e+00]
 [-1.66579271e+00 -1.82074695e+00]
 [ 3.37861909e-01 -1.67479894e+00]
 [-1.45657325e+00 -1.77242992e+00]
 [-3.84652078e-02 -1.66274012e+00]
 ...]
(output truncated to the first five of 200 rows)
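Before clustering in the reduced space, it is worth checking how much of the variance the two components retain; a quick sketch using the fitted pca object from above:

print("explained variance ratio:", pca.explained_variance_ratio_)
# a sum close to 1.0 means little information was lost in the reduction
print("total variance retained:", pca.explained_variance_ratio_.sum())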
In [97]: plt_font = {'family':'serif' , 'size':16}#font properties for plot labels
In [99]: wcss_list = []
         for i in range(1, 15):
             kmeans = KMeans(n_clusters=i, init='k-means++', random_state=1)
             kmeans.fit(df_pca)
             wcss_list.append(kmeans.inertia_)#within-cluster sum of squares for each k
In [5]: import matplotlib.pyplot as plt
# Example font properties (customize as needed)
plt_font = {
'fontsize': 12,
'fontweight': 'bold',
'family': 'serif'
}
# Example: Define wcss_list (replace this with your actual WCSS values)
wcss_list = [500, 300, 250, 200, 180, 175, 170, 160, 150, 145, 140, 135]
# Plotting
plt.plot(range(1, len(wcss_list) + 1), wcss_list)
plt.plot([4, 4], [0, max(wcss_list)], linestyle='--', alpha=0.7)  # vertical line marking the elbow at k=4
plt.xlabel('K', fontdict=plt_font)
plt.ylabel('WCSS', fontdict=plt_font)
plt.title('Elbow Method for Optimal k', fontdict=plt_font)
plt.show()
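The elbow plot points to k = 4. Task (d) asks for model evaluation; the silhouette score is a standard internal metric for clustering, sketched here for a range of k values (the range 2-7 is an arbitrary choice, and df_pca is the mall data transformed above):

from sklearn.metrics import silhouette_score

for k in range(2, 8):
    km = KMeans(n_clusters=k, init='k-means++', random_state=1)
    labels = km.fit_predict(df_pca)
    # silhouette ranges from -1 to 1; higher means tighter, better-separated clusters
    print(f"k={k}: silhouette={silhouette_score(df_pca, labels):.3f}")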
In [30]: import numpy as np  # needed for np.random below
# Example: Creating a sample dataset
# Replace this with your actual data
data = pd.DataFrame({
'feature1': np.random.rand(100),
'feature2': np.random.rand(100),
'feature3': np.random.rand(100),
})
# Step 1: Standardize the data
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)
# Step 2: Perform PCA
pca = PCA(n_components=2) # Adjust the number of components as needed
df_pca = pca.fit_transform(data_scaled)
# Step 3: Perform KMeans clustering
kmeans = KMeans(n_clusters=4, init='k-means++', random_state=1)
kmeans.fit(df_pca)
cluster_id = kmeans.predict(df_pca)
# Step 4: Creating a result DataFrame
result_data = pd.DataFrame()
result_data['PC1'] = df_pca[:, 0]
result_data['PC2'] = df_pca[:, 1]
result_data['Cluster'] = cluster_id # Use 'Cluster' as defined earlier
# Print the result data
print(result_data)
# Define colors for clusters
cluster_colors = {0: 'tab:red', 1: 'tab:green', 2: 'tab:blue', 3: 'tab:pink'}
cluster_dict = {
'Centroid': 'tab:orange',
'Cluster0': 'tab:red',
'Cluster1': 'tab:green',
'Cluster2': 'tab:blue',
'Cluster3': 'tab:pink'
}
# Scatter plot for the clusters
plt.scatter(x=result_data['PC1'],
y=result_data['PC2'],
c=result_data['Cluster'].map(cluster_colors))
# Create legend handles
handles = [Line2D([0], [0], marker='o', color='w', markerfacecolor=v,
markersize=8) for k, v in cluster_dict.items()]
plt.legend(title='Color', handles=handles, bbox_to_anchor=(1.05, 1), loc='upper left')
# Scatter plot for centroids
plt.scatter(x=kmeans.cluster_centers_[:, 0],
y=kmeans.cluster_centers_[:, 1],
marker='o',
c='tab:orange',
s=150,
alpha=1)
# Plot settings
plt.title("Clustered by KMeans", fontdict={'fontsize': 14, 'fontweight'
plt.xlabel("PC1", fontdict={'fontsize': 12})
plt.ylabel("PC2", fontdict={'fontsize': 12})
plt.show()
PC1 PC2 Cluster
0 -0.708627 0.160780 0
1 -1.604748 0.467547 0
2 2.637297 0.254471 2
3 -0.265911 -1.520968 3
4 -0.256524 1.008739 0
.. ... ... ...
95 1.335878 -1.062640 1
96 -0.139539 -1.098409 3
97 -0.077331 0.132332 0
98 0.473854 -0.749248 1
99 -0.749690 1.394149 0
[100 rows x 3 columns]
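The assignment asks for at least two clustering algorithms. Below is a sketch of agglomerative (hierarchical) clustering compared against KMeans via silhouette; note that the last cell redefined df_pca on random sample data, so re-run the mall-data PCA cells first (n_clusters=4 follows the elbow result above):

from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

agg_labels = AgglomerativeClustering(n_clusters=4, linkage='ward').fit_predict(df_pca)
km_labels = KMeans(n_clusters=4, init='k-means++', random_state=1).fit_predict(df_pca)
print("Agglomerative silhouette:", silhouette_score(df_pca, agg_labels))
print("KMeans silhouette:", silhouette_score(df_pca, km_labels))

For task (e), clustering has no ground-truth labels to cross-validate against, so a common substitute is a fold-based stability check: fit on each training fold and score the held-out fold. A sketch assuming 5-fold KFold:

from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=1)
fold_scores = []
for train_idx, test_idx in kf.split(df_pca):
    km = KMeans(n_clusters=4, init='k-means++', random_state=1)
    km.fit(df_pca[train_idx])
    # assign held-out points to the clusters learned on the training fold
    test_labels = km.predict(df_pca[test_idx])
    fold_scores.append(silhouette_score(df_pca[test_idx], test_labels))
print("per-fold silhouette:", [round(s, 3) for s in fold_scores])
print("mean silhouette:", sum(fold_scores) / len(fold_scores))

A stable mean silhouette across folds suggests the cluster structure generalizes rather than fitting one particular subset of customers.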