The document describes a Python script that reads a CSV file into a pandas DataFrame and drops specific columns. It then calculates k-means clustering on the numeric features of the DataFrame, determines the optimal number of clusters using the Elbow method, and applies k-means clustering to assign clusters to the data. Finally, it visualizes the clusters using a scatter plot.


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn

df = pd.read_csv('/content/Champo data clustering.csv')

df

[Colab interactive DataFrame output. df has 45 rows and 15 columns: Row Labels (string customer codes such as "T-4", "L-4", "L-5"); the numeric totals Sum of QtyRequired (min 2, max 183,206), Sum of TotalArea (min 1.35, max 209,725.222) and Sum of Amount (min 328.8752, max 11,341,052.51); the per-product quantity columns DURRY, HANDLOOM, DOUBLE BACK, JACQUARD, HAND TUFTED, HAND WOVEN, KNOTTED, GUN TUFTED, Powerloom Jacquard and INDO TEBETAN; and an entirely empty Unnamed: 14 column.]

# prompt: generate a code to drop Row Labels and Unnamed: 14 columns

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
df = pd.read_csv('/content/Champo data clustering.csv')

# Drop 'Row Labels' and 'Unnamed: 14' columns if they exist
if 'Row Labels' in df.columns:
    df = df.drop('Row Labels', axis=1)
if 'Unnamed: 14' in df.columns:
    df = df.drop('Unnamed: 14', axis=1)

df

[Colab interactive DataFrame output. df still has 45 rows but now 13 columns: Row Labels and Unnamed: 14 have been dropped, leaving Sum of QtyRequired, Sum of TotalArea, Sum of Amount and the ten per-product quantity columns.]
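The two if-guards above can also be written in one call: pandas DataFrame.drop accepts errors='ignore', which silently skips labels that are not present. A minimal sketch (df_demo is a made-up stand-in frame, not the Champo data):

```python
import pandas as pd

# Toy frame standing in for the Champo data, with the two unwanted columns.
df_demo = pd.DataFrame({'Row Labels': ['T-4'],
                        'DURRY': [1021],
                        'Unnamed: 14': [float('nan')]})

# errors='ignore' skips any listed column that does not exist,
# so this is safe to re-run after the columns are already gone.
df_demo = df_demo.drop(columns=['Row Labels', 'Unnamed: 14'],
                       errors='ignore')

print(list(df_demo.columns))  # ['DURRY']
```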

# prompt: generate a code to calculate kMeans from the above data

from sklearn.cluster import KMeans

# Select features for clustering (excluding non-numeric columns if any)
features = df.select_dtypes(include=np.number)

# Determine the optimal number of clusters (e.g., using the Elbow method)
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300,
                    n_init=10, random_state=0)
    kmeans.fit(features)
    wcss.append(kmeans.inertia_)

plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.show()
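Note that the script clusters on the raw columns. Because k-means minimizes squared Euclidean distance, columns on very different scales (Sum of Amount reaches into the millions while GUN TUFTED tops out at 195) can dominate the result. A minimal sketch of a common refinement, standardizing features first (this is an addition, not part of the original script; df_demo is a made-up stand-in):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Small synthetic frame standing in for two Champo columns.
df_demo = pd.DataFrame({
    'Sum of QtyRequired': [2466, 131, 18923, 624, 464],
    'Sum of Amount': [185404.1, 6247.46, 1592080.0, 14811.16, 58626.86],
})

# StandardScaler rescales each column to zero mean and unit variance,
# so both columns contribute comparably to the Euclidean distances.
scaler = StandardScaler()
scaled = scaler.fit_transform(df_demo)

print(scaled.mean(axis=0).round(6))  # ~[0, 0]
print(scaled.std(axis=0).round(6))   # ~[1, 1]

# Cluster on the scaled matrix instead of the raw frame.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(scaled)
print(len(labels))  # one label per row: 5
```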

# Based on the Elbow method plot, choose the optimal number of clusters (e.g., k=3)
optimal_k = 3  # Replace with the value you determined from the plot

# Apply k-means clustering
kmeans = KMeans(n_clusters=optimal_k, init='k-means++', max_iter=300,
                n_init=10, random_state=0)
df['cluster'] = kmeans.fit_predict(features)

# Print or visualize the results
print(df.head())

# Visualize the clusters (example for 2D data)
# If you have more than 2 features, you may need to use dimensionality
# reduction techniques like PCA first
if len(features.columns) >= 2:
    plt.scatter(features.iloc[:, 0], features.iloc[:, 1],
                c=df['cluster'], cmap='viridis')
    plt.scatter(kmeans.cluster_centers_[:, 0],
                kmeans.cluster_centers_[:, 1], s=300, c='black', marker='*',
                label='Centroids')
    plt.title('Clusters of customers')
    plt.xlabel('Feature 1')  # Replace with your actual feature name
    plt.ylabel('Feature 2')  # Replace with your actual feature name
    plt.legend()
    plt.grid()
    plt.show()
   Sum of QtyRequired  Sum of TotalArea  Sum of Amount  DURRY  HANDLOOM  \
0                2466          139.5900   1.854041e+05   1021      1445
1                 131         2086.0000   6.247460e+03      0         0
2               18923        53625.6544   1.592080e+06   3585         0
3                 624          202.8987   1.481116e+04    581         0
4                 464         8451.5625   5.862686e+04      0         0

   DOUBLE BACK  JACQUARD  HAND TUFTED  HAND WOVEN  KNOTTED  GUN TUFTED  \
0            0         0            0           0        0           0
1           25       106            0           0        0           0
2          175       714        11716        2116      617           0
3            0         2            0          41        0           0
4          459         5            0           0        0           0

   Powerloom Jacquard  INDO TEBETAN  cluster
0                   0             0        0
1                   0             0        0
2                   0             0        2
3                   0             0        0
4                   0             0        0
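The comment in the cell above suggests PCA when there are more than two features, but the scatter plot actually uses the first two raw columns. A sketch of the PCA route, on synthetic stand-in data (an assumption; the original notebook does not run PCA):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Synthetic stand-in: 45 rows x 13 numeric features, like the Champo frame.
rng = np.random.default_rng(0)
X = rng.normal(size=(45, 13))

# Cluster in the full 13-dimensional feature space.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Project both the points and the fitted centroids to 2-D for plotting.
pca = PCA(n_components=2)
X2 = pca.fit_transform(X)
centers2 = pca.transform(kmeans.cluster_centers_)

print(X2.shape)        # (45, 2)
print(centers2.shape)  # (3, 2)
# Then: plt.scatter(X2[:, 0], X2[:, 1], c=labels, cmap='viridis'),
# plus a second scatter for centers2, as in the original cell.
```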

# prompt: generate a code to plot silhouette_scores

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# ... (Your existing code for data loading and preprocessing) ...

# Determine the optimal number of clusters (e.g., using the Silhouette method)
silhouette_scores = []
for i in range(2, 11):  # Silhouette score is not defined for a single cluster
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300,
                    n_init=10, random_state=0)
    cluster_labels = kmeans.fit_predict(features)
    silhouette_avg = silhouette_score(features, cluster_labels)
    silhouette_scores.append(silhouette_avg)

# Plot silhouette scores
plt.plot(range(2, 11), silhouette_scores, marker='o')
plt.title('Silhouette Scores for Different Cluster Numbers')
plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Score')
plt.grid()
plt.show()

# Find the optimal k based on highest silhouette score
optimal_k = np.argmax(silhouette_scores) + 2  # Add 2 because range starts from 2

# Apply k-means clustering with the optimal k
kmeans = KMeans(n_clusters=optimal_k, init='k-means++', max_iter=300,
                n_init=10, random_state=0)
df['cluster'] = kmeans.fit_predict(features)

# ... (rest of your code for visualization)
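For intuition on what is being plotted: the silhouette score averages (b - a) / max(a, b) over all points, where a is a point's mean distance to its own cluster and b its mean distance to the nearest other cluster, so well-separated clusters score near 1 and overlapping ones near 0. A self-contained toy check (made-up data, not the Champo frame):

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Two tight, well-separated blobs with known labels.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [10.0, 10.0], [10.1, 10.0], [10.0, 10.1]])
labels = np.array([0, 0, 0, 1, 1, 1])

# Intra-cluster distances are ~0.1 while inter-cluster distances are ~14,
# so the mean silhouette should be very close to 1.
score = silhouette_score(X, labels)
print(score > 0.9)  # True
```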
