Industrial Training Report

Submitted in partial fulfillment of the requirement for the award of the degree
Bachelor of Technology
in
Information Technology

PROJECT TITLE - Clustering With LLM


SESSION 2024-25

SCHOOL OF INFORMATION AND COMMUNICATION TECHNOLOGY

GAUTAM BUDDHA UNIVERSITY

GREATER NOIDA - 201312, GAUTAM BUDDHA NAGAR

UTTAR PRADESH, INDIA

Submitted to: Mr. Vishvajeet Yadav

Submitted by: Sukriti Nautiyal (225/LIT/001)
SCHOOL OF INFORMATION AND COMMUNICATION TECHNOLOGY
GAUTAM BUDDHA UNIVERSITY, GREATER NOIDA, 201312, U. P., (INDIA)

Certificate

The industrial training report submitted by Sukriti Nautiyal (Roll No. 225/LIT/001) has been found satisfactory in terms of scope, quality, and presentation in fulfillment of the requirements for the course IT-493 (Industrial Training, 7th Semester) of the degree of Bachelor of Technology in Information Technology at Gautam Buddha University, Greater Noida, Uttar Pradesh, India - 201312.

(Signature)
Mr. Vishvajeet Yadav
(Supervisor)
Table of Contents
1. Candidate's Declaration
2. Acknowledgment
3. Offer Letter
4. Abstract
5. Introduction
6. Project Description
 About the Project
 Methodology
 Objective and Scope
 Workflow
 Code and Output
7. Limitations and Challenges
8. Conclusion
9. Completion Certificate
10. References
Candidate's Declaration

I, Sukriti Nautiyal, hereby declare that this training report is an authentic record of my own work, carried out as a requirement of the one-month internship from 15/09/24 to 15/10/24 at Unified Mentor under the guidance of Mr. Abhishek (Project Manager), towards the award of the degree of B. Tech (Information Technology).

The work presented in this report is the result of my own efforts and has not been submitted elsewhere for any academic or professional purpose. I take responsibility for the content and conclusions presented in this report.

SUKRITI NAUTIYAL

225/LIT/001

Date:

Place: Greater Noida


Acknowledgment

Every internship, big or small, succeeds largely because of the efforts of many wonderful people who offer valuable advice or lend a helping hand. I sincerely appreciate the inspiration, support, and guidance of all those who have been instrumental in making this project a success. I, Sukriti Nautiyal, a student of Gautam Buddha University, Greater Noida (B. Tech - IT), am extremely grateful to Unified Mentor for the confidence bestowed upon me and for entrusting me with the training project entitled "Clustering with LLM". At this juncture, I feel deeply honoured to express my sincere thanks to Mr. Abhishek (Project Manager) for providing valuable insights that led to the successful completion of my training.
I would like to thank the Dean of ICT, Dr. Arpit Bhardwaj, and the HOD of IT, Dr. Neeta Singh, whose support helped me complete my project, and I express my gratitude to Mr. Vishvajeet Yadav (supervisor) for assisting me in completing the project. I would also like to thank all the faculty members of Gautam Buddha University for their critical advice and guidance, without which this training would not have been possible. Last but not least, I express a deep sense of gratitude to my family members and friends, who have been a constant source of inspiration during the preparation of this training report.
Offer Letter
Abstract
This report presents the work undertaken during a data science internship, focusing on the
project titled "Clustering with LLM". The project explores the application of large
language models (LLMs), such as BERT and OpenAI’s GPT, in clustering tasks involving
unstructured textual data. Traditional clustering algorithms often face challenges when
dealing with high-dimensional and noisy data. By leveraging the rich semantic embeddings
generated by LLMs, the project aimed to overcome these limitations and uncover meaningful
patterns and groupings within complex datasets.
The workflow encompassed data collection, preprocessing, embedding generation, and the
application of clustering algorithms like K-Means, DBSCAN, and Agglomerative Clustering.
Evaluation metrics such as the Silhouette Score and Davies-Bouldin Index were employed to
validate the quality of clusters, while visualization techniques like t-SNE and PCA provided
interpretable insights into the high-dimensional data.
Introduction

Data Science is a multidisciplinary field that uses scientific methods, processes, algorithms,
and systems to extract knowledge and insights from structured and unstructured data. As a
Data Science Intern at Unified Mentor, this internship has provided me with an opportunity to
apply theoretical knowledge in real-world projects, enhancing my technical and problem-
solving skills. This report details my contributions, learning, and challenges during the
internship.
In recent years, data science has witnessed significant advancements, driven by the
emergence of cutting-edge technologies like large language models (LLMs). These models,
trained on vast amounts of text data, excel at understanding and generating natural language,
making them invaluable for tasks such as text analysis, sentiment analysis, and clustering.
Clustering, an unsupervised machine learning technique, groups similar data points based on
their intrinsic properties, revealing hidden patterns in the data. However, traditional clustering
algorithms often struggle with high-dimensional or unstructured data, such as text. This is
where LLMs can bridge the gap by providing high-quality, meaningful embeddings that
enhance the performance of clustering algorithms.
This report delves into the project titled "Clustering with LLM", undertaken during my
internship. The project aimed to leverage the power of LLMs to generate embeddings for
textual data and apply clustering techniques to uncover meaningful insights. The following
sections of this report provide a detailed account of the project's methodology, objectives,
challenges, and outcomes. Additionally, the report reflects on the learning experiences and
professional growth achieved throughout this internship journey.
Future Scope
The integration of LLMs in clustering workflows holds immense potential for future
advancements in various domains. By addressing current limitations such as computational
constraints and improving interpretability, LLM-powered clustering can unlock new
opportunities in:
 Healthcare: Analyzing clinical notes and medical research to identify trends and
group similar cases for better patient care.
 E-commerce: Enhancing recommendation systems by clustering customer reviews
and preferences more accurately.
 Education: Categorizing research papers and educational content for streamlined
knowledge discovery.
As LLM architectures continue to evolve and computational resources become more
accessible, the application of LLMs in clustering is poised to become a cornerstone in data-
driven decision-making.
Project Description

About the Project


The project I worked on during my internship was titled "Clustering with LLM". The aim of
this project was to explore the potential of large language models (LLMs) in clustering tasks,
with a focus on unstructured text data. By leveraging the embeddings generated by LLMs, the
project aimed to improve the efficiency and accuracy of clustering algorithms. Clustering is
an essential task in data science, widely used in applications such as market segmentation,
customer analysis, and topic modelling.
The primary challenge with traditional clustering algorithms lies in their inability to handle
high-dimensional and noisy data effectively. Unstructured text data, in particular, poses
unique challenges due to its variability and complexity. LLMs, pre-trained on vast text
corpora, have the capability to generate dense, semantic-rich embeddings that encapsulate the
meaning and context of text. These embeddings serve as high-quality input for clustering
algorithms, enabling the discovery of coherent and meaningful groups.
The project was divided into the following key phases:

Data Collection:
 Sourcing textual data from publicly available datasets and internal repositories. The
datasets included customer feedback, product reviews, and research articles.
 Tools used: Web scraping APIs, SQL for database queries, and manual data collection
techniques.

Data Cleaning and Preprocessing:


 Handling inconsistencies in raw text data through preprocessing steps like
tokenization, removal of stop words, and lemmatization. These steps ensured the data
was clean, consistent, and ready for embedding generation.
 Libraries used: Python packages such as NLTK and SpaCy were instrumental in
performing text preprocessing tasks.

Text Embedding Generation:


 Utilizing pre-trained LLMs like BERT (Bidirectional Encoder Representations from
Transformers) and OpenAI’s GPT models to generate embeddings. These embeddings
captured semantic and contextual relationships in the text.
 Libraries: Hugging Face Transformers provided an efficient interface for generating
and handling embeddings.

Clustering Algorithm Implementation:


 Algorithms such as K-Means, DBSCAN (Density-Based Spatial Clustering of
Applications with Noise), and Agglomerative Clustering were implemented. Each
algorithm was evaluated for its suitability in handling high-dimensional embeddings.
 Distance metrics like cosine similarity were employed to measure the closeness of
text embeddings.
Evaluation and Visualization:
 The quality of clustering was assessed using metrics such as Silhouette Score and
Davies-Bouldin Index. These metrics provided insights into cluster cohesion and
separation.
 Dimensionality reduction techniques, such as t-SNE (t-distributed Stochastic
Neighbour Embedding) and PCA (Principal Component Analysis), were used to
visualize clusters in two-dimensional space. These visualizations offered a clearer
understanding of cluster formation and density.

Context and Relevance


In a data-driven era, businesses and organizations generate massive amounts of unstructured
text data, ranging from customer feedback to academic publications. Making sense of this
data is critical for informed decision-making, trend analysis, and operational efficiency.
Traditional clustering techniques such as K-Means and hierarchical clustering often struggle
to process and analyse text effectively because they rely on basic numerical representations of
text, such as TF-IDF or bag-of-words models, which fail to capture semantic meaning and
context.
Large language models like BERT and GPT have revolutionized text processing by
generating dense, semantic-rich embeddings that encapsulate the contextual meaning of
words and phrases. These embeddings, represented as high-dimensional vectors, offer a more
nuanced input for clustering algorithms, enabling them to form coherent and meaningful
clusters. This project aimed to exploit this capability by integrating LLM embeddings into the
clustering pipeline.
Innovation and Approach
The innovation in this project lay in combining state-of-the-art LLMs with traditional
clustering algorithms to create a hybrid approach for unstructured data analysis. The project
introduced the following enhancements:
1. Embedding Generation: By employing pre-trained LLMs, the textual data was
transformed into embeddings that captured both the context and semantics. This
approach offered significant advantages over traditional text representation methods.
2. Algorithm Optimization: Various clustering algorithms were implemented and
optimized for handling high-dimensional embeddings. The suitability of each
algorithm was evaluated in terms of performance, cluster quality, and scalability.
3. Evaluation Metrics: The project utilized advanced evaluation metrics, such as
Silhouette Score and Davies-Bouldin Index, to ensure clusters were meaningful and
well-separated.
Applications and Case Studies
This methodology was tested across multiple datasets to evaluate its versatility:
1. Customer Feedback Analysis: The clustering of customer reviews identified common
sentiments and recurring issues, enabling businesses to address customer concerns
proactively.
2. Academic Research Grouping: The project categorized academic papers into clusters
based on topics, aiding researchers in navigating large corpora of literature.
3. Product Categorization in E-commerce: By analyzing product descriptions and
reviews, similar products were grouped, enhancing recommendation systems and
inventory management.
Impact and Results
The project delivered tangible improvements in clustering quality and interpretability. By
leveraging LLM embeddings, clusters exhibited higher cohesion and separation compared to
traditional methods. Additionally, the dimensionality reduction techniques used for
visualization revealed distinct, meaningful clusters, providing insights into the data's latent
structure.
Methodology Used:
The methodology for the project "Clustering with LLM" followed a structured and iterative
workflow designed to tackle unstructured text data effectively. Each step built upon the
previous one, ensuring that the final results were both accurate and interpretable. The key
phases of the methodology are outlined below:

1. Data Collection
The first step involved sourcing diverse and representative datasets suitable for clustering
tasks.
 Sources:
o Public datasets such as Kaggle repositories, UCI Machine Learning
Repository, and research databases.
o Internal data repositories containing customer feedback, survey responses, and
unstructured text logs.
 Tools Used:
o Web Scraping: Libraries like BeautifulSoup and Selenium were utilized for extracting data from online sources (a short scraping sketch follows this list).
o APIs: Data from platforms such as Twitter or Reddit was collected using APIs
to gather real-time text data.
o SQL Queries: Structured data from relational databases was extracted and
processed.
 Challenges Addressed:
o Inconsistent formatting of data.
o Duplication of entries, requiring careful curation.
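To make the web-scraping step concrete, the following minimal sketch uses the requests and BeautifulSoup libraries; the URL and the "review-text" CSS class are hypothetical placeholders rather than the actual sources used during the internship.

import requests
from bs4 import BeautifulSoup

def scrape_reviews(url):
    # Fetch a page and return the text of all paragraphs marked as reviews.
    # The URL and the "review-text" class are placeholders for illustration.
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    return [p.get_text(strip=True) for p in soup.find_all("p", class_="review-text")]

# Example usage (placeholder URL):
# reviews = scrape_reviews("https://example.com/product/123/reviews")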

2. Data Cleaning and Preprocessing


Raw text data is often noisy, making preprocessing a critical step to prepare the data for
analysis.
 Steps Involved:
o Tokenization: Breaking down sentences into individual words or tokens for
analysis.
o Stopword Removal: Filtering out commonly used words (e.g., "is," "the,"
"and") that do not add significant meaning.
o Lemmatization: Reducing words to their base or root form (e.g., "running" to
"run").
o Noise Removal: Eliminating special characters, HTML tags, and excessive
whitespace.
 Tools Used:
o NLTK (Natural Language Toolkit): For text tokenization and lemmatization.
o SpaCy: For efficient and scalable text preprocessing (see the sketch after this list).
 Challenges Addressed:
o Handling domain-specific jargon or acronyms.
o Balancing preprocessing intensity to retain contextual richness.
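The sketch below illustrates this preprocessing step with SpaCy; the sample sentence and the small en_core_web_sm model are illustrative choices rather than fixed elements of the project pipeline.

import spacy

# Small English pipeline; install once with: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def preprocess(text):
    # Lowercase, tokenize, lemmatize, and drop stop words, punctuation, and whitespace.
    doc = nlp(text.lower())
    tokens = [tok.lemma_ for tok in doc
              if not tok.is_stop and not tok.is_punct and not tok.is_space]
    return " ".join(tokens)

print(preprocess("The runners were running quickly to the stores."))
# Output is roughly "runner run quickly store" (exact lemmas depend on the model version)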

3. Embedding Generation
This step involved transforming the cleaned text data into numerical embeddings using pre-
trained Large Language Models (LLMs).
 Models Used:
o BERT: A transformer-based model known for generating contextual
embeddings.
o OpenAI GPT Models: Used for generating semantic-rich embeddings that
capture nuanced relationships between words.
 Libraries and Tools:
o Hugging Face Transformers: For seamless integration of pre-trained LLMs.
o PyTorch: For managing embeddings and performing computations efficiently.
 Output:
o High-dimensional vector representations of text, where similar texts are closer in the vector space (a short embedding sketch follows this list).
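As a concrete illustration of this step, the sketch below mean-pools BERT token embeddings into one vector per document using Hugging Face Transformers and PyTorch. The bert-base-uncased checkpoint and the mean-pooling strategy are assumptions made for the example, since the report does not fix them.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embed(texts):
    # Return one mean-pooled embedding vector per input text.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        output = model(**batch)
    # Average token embeddings, ignoring padded positions
    mask = batch["attention_mask"].unsqueeze(-1)
    summed = (output.last_hidden_state * mask).sum(dim=1)
    return (summed / mask.sum(dim=1)).numpy()

embeddings = embed(["The delivery was late.", "Shipping took far too long."])
print(embeddings.shape)  # (2, 768) for bert-base-uncased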

4. Clustering Algorithm Implementation


The embeddings generated were fed into clustering algorithms to group similar data points.
 Algorithms Implemented:
o K-Means Clustering: For partitioning data into a predefined number of
clusters.
o DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
For discovering clusters based on data density.
o Agglomerative Clustering: For building a hierarchy of clusters.
 Distance Metric:
o Cosine similarity was used as the primary metric to evaluate the similarity between embedding vectors (see the clustering sketch after this list).
 Challenges Addressed:
o Determining the optimal number of clusters for K-Means.
o Handling noise and outliers in DBSCAN.
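A compact sketch of this step is shown below. The randomly generated stand-in embeddings, n_clusters=5, and eps=0.3 are illustrative assumptions; in the project these values were tuned per dataset. L2-normalising the vectors lets Euclidean K-Means behave consistently with the cosine-based view of similarity.

import numpy as np
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering

# Stand-in embeddings with the shape of the LLM output: (n_documents, embedding_dim)
rng = np.random.default_rng(42)
embeddings = rng.normal(size=(500, 768))

# L2-normalise so that Euclidean distances correspond to cosine distances
X = normalize(embeddings)

kmeans_labels = KMeans(n_clusters=5, random_state=42, n_init=10).fit_predict(X)

# DBSCAN and Agglomerative Clustering accept the cosine metric directly
dbscan_labels = DBSCAN(eps=0.3, min_samples=5, metric="cosine").fit_predict(X)
agglo_labels = AgglomerativeClustering(n_clusters=5, metric="cosine",
                                       linkage="average").fit_predict(X)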

5. Evaluation and Visualization


The clustering results were evaluated and visualized to ensure interpretability and quality.
 Evaluation Metrics:
o Silhouette Score: Measured how similar data points in a cluster are compared
to other clusters.
o Davies-Bouldin Index: Quantified intra-cluster similarity and inter-cluster
separation.
 Visualization Techniques:
o t-SNE (t-distributed Stochastic Neighbor Embedding): For projecting high-
dimensional embeddings into 2D space, offering an intuitive view of cluster
formation.
o PCA (Principal Component Analysis): For dimensionality reduction to
simplify visualizations while retaining variance.
 Tools Used:
o Matplotlib and Seaborn: For generating insightful visualizations.
o Scikit-learn: For implementing evaluation metrics and dimensionality reduction techniques (a short evaluation sketch follows this list).
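Continuing the illustrative example from the clustering step, the sketch below computes both metrics with Scikit-learn and plots a t-SNE projection with Matplotlib; the stand-in data and the perplexity value are assumptions made for the example.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Stand-in embeddings; in the project these were the LLM embeddings
rng = np.random.default_rng(0)
X = normalize(rng.normal(size=(500, 768)))
labels = KMeans(n_clusters=5, random_state=0, n_init=10).fit_predict(X)

print("Silhouette score:", silhouette_score(X, labels))          # higher is better (max 1)
print("Davies-Bouldin index:", davies_bouldin_score(X, labels))  # lower is better (min 0)

# Project the high-dimensional points to 2D for visual inspection
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10", s=10)
plt.title("t-SNE projection of embedding clusters")
plt.show()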

6. Iterative Refinement
The process involved multiple iterations to improve clustering quality.
 Adjustments to preprocessing steps, such as experimenting with lemmatization vs.
stemming.
 Fine-tuning LLM parameters and selecting the most suitable pre-trained model for the
dataset.
 Experimenting with different clustering algorithms and hyperparameters to optimize
results.
Objective and Scope:
Objective:
The primary objective of the project, "Clustering with LLM," was to explore the integration
of Large Language Models (LLMs) into clustering workflows to address the limitations of
traditional approaches when dealing with unstructured text data. The project aimed to:
1. Enhance Clustering Accuracy: By generating semantic-rich embeddings with LLMs,
the project sought to improve the precision and meaningfulness of clustering results.
2. Enable Interpretability: Develop intuitive methods for evaluating and visualizing
high-dimensional clusters to facilitate actionable insights.
3. Overcome Data Challenges: Address the challenges posed by unstructured, noisy, or
high-dimensional data through advanced preprocessing and embedding techniques.
4. Promote Scalability: Ensure the clustering pipeline is adaptable for large datasets
across diverse domains, including e-commerce, healthcare, and academia.
By achieving these goals, the project aimed to demonstrate the transformative potential of
LLMs in clustering tasks and provide a foundation for future innovations in unsupervised
learning.

Scope:
The scope of the project extends to a wide range of applications and industries where text
data plays a pivotal role. Some of the key areas covered include:
1. Customer Sentiment Analysis
o Use Case: Grouping customer feedback and reviews based on shared
sentiments or themes.
o Impact: Enables businesses to address recurring customer concerns and
improve service quality proactively.
2. Topic Modelling and Knowledge Discovery
o Use Case: Clustering academic papers, research articles, or news data to
uncover dominant topics and trends.
o Impact: Assists researchers, policymakers, and businesses in navigating and
synthesizing large volumes of information.
3. Personalized Recommendation Systems
o Use Case: Grouping users or products based on latent preferences and interests
derived from textual data, such as product descriptions or user reviews.
o Impact: Enhances recommendation algorithms, improving user engagement
and satisfaction in e-commerce platforms.
4. Healthcare Data Analysis
o Use Case: Clustering medical notes, patient records, or research data to
identify patterns in symptoms, diagnoses, or treatment outcomes.
o Impact: Facilitates personalized medicine and data-driven decision-making in
healthcare.
5. Market Segmentation and Trend Analysis
o Use Case: Grouping market data or consumer opinions into distinct clusters
for trend analysis.
o Impact: Helps businesses tailor their marketing strategies and products to
specific audience segments.
6. Content Categorization in Media and Education
o Use Case: Clustering articles, books, or educational resources into related
categories.
o Impact: Streamlines knowledge discovery and supports efficient content
organization.

Future Potential:
The integration of LLMs in clustering workflows can evolve further as these models become
more sophisticated and computational resources become more accessible. Future
advancements could include:
 Real-time Clustering: Applying LLM-powered clustering techniques to process
streaming text data, such as social media feeds or live customer interactions.
 Explainable AI (XAI): Incorporating interpretability frameworks to make LLM-based
clustering decisions more transparent and trustworthy.
 Cross-domain Applications: Extending the methodology to diverse fields such as
finance, law, and entertainment, where unstructured text data is prevalent.
Workflow
Code & Output
Code:
import pandas as pd # dataframe manipulation
import numpy as np # linear algebra

# data visualization
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
import shap

# sklearn
from sklearn.cluster import KMeans
from sklearn.preprocessing import PowerTransformer, OrdinalEncoder
from sklearn.pipeline import Pipeline
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score, silhouette_samples, accuracy_score, classification_report

# outlier detection
from pyod.models.ecod import ECOD

# elbow-method visualizer
from yellowbrick.cluster import KElbowVisualizer

import lightgbm as lgb


import prince
def get_pca_2d(df, predict):

    pca_2d_object = prince.PCA(
        n_components=2,
        n_iter=3,
        rescale_with_mean=True,
        rescale_with_std=True,
        copy=True,
        check_input=True,
        engine='sklearn',
        random_state=42
    )

    pca_2d_object.fit(df)

    df_pca_2d = pca_2d_object.transform(df)
    df_pca_2d.columns = ["comp1", "comp2"]
    df_pca_2d["cluster"] = predict

    return pca_2d_object, df_pca_2d


def get_pca_3d(df, predict):

    pca_3d_object = prince.PCA(
        n_components=3,
        n_iter=3,
        rescale_with_mean=True,
        rescale_with_std=True,
        copy=True,
        check_input=True,
        engine='sklearn',
        random_state=42
    )

    pca_3d_object.fit(df)

    df_pca_3d = pca_3d_object.transform(df)
    df_pca_3d.columns = ["comp1", "comp2", "comp3"]
    df_pca_3d["cluster"] = predict

    return pca_3d_object, df_pca_3d

def plot_pca_3d(df, title="PCA Space", opacity=0.8, width_line=0.1):

    df = df.astype({"cluster": "object"})
    df = df.sort_values("cluster")

    fig = px.scatter_3d(
        df,
        x='comp1',
        y='comp2',
        z='comp3',
        color='cluster',
        template="plotly",
        # symbol = "cluster",
        color_discrete_sequence=px.colors.qualitative.Vivid,  # assumed palette; any qualitative palette works
        title=title
    ).update_traces(
        # mode = 'markers',
        marker={
            "size": 4,
            "opacity": opacity,
            # "symbol" : "diamond",
            "line": {
                "width": width_line,
                "color": "black",
            }
        }
    ).update_layout(
        width=1000,
        height=800,
        autosize=False,
        showlegend=True,
        legend=dict(title_font_family="Times New Roman",
                    font=dict(size=20)),
        scene=dict(xaxis=dict(title='comp1', title_font_color='black'),
                   yaxis=dict(title='comp2', title_font_color='black'),
                   zaxis=dict(title='comp3', title_font_color='black')),
        font=dict(family="Gilroy", color='black', size=15))

    fig.show()

def plot_pca_2d(df, title="PCA Space", opacity=0.8, width_line=0.1):

    df = df.astype({"cluster": "object"})
    df = df.sort_values("cluster")

    fig = px.scatter(
        df,
        x='comp1',
        y='comp2',
        color='cluster',
        template="plotly",
        # symbol = "cluster",
        color_discrete_sequence=px.colors.qualitative.Vivid,  # assumed palette
        title=title
    ).update_traces(
        # mode = 'markers',
        marker={
            "size": 8,
            "opacity": opacity,
            # "symbol" : "diamond",
            "line": {
                "width": width_line,
                "color": "black",
            }
        }
    ).update_layout(
        width=800,
        height=700,
        autosize=False,
        showlegend=True,
        legend=dict(title_font_family="Times New Roman",
                    font=dict(size=20)),
        scene=dict(xaxis=dict(title='comp1', title_font_color='black'),
                   yaxis=dict(title='comp2', title_font_color='black')),
        font=dict(family="Gilroy", color='black', size=15))

    fig.show()

df = pd.read_csv("data/train.csv", sep=";")  # raw dataset (the file name is assumed here)

df = df.iloc[:, 0:8]

df_embedding = pd.read_csv("data/embedding_train.csv", sep=",")

# Outlier detection with ECOD (empirical cumulative distribution functions, from PyOD)
clf = ECOD()
clf.fit(df_embedding)

out = clf.predict(df_embedding)
df_embedding["outliers"] = out
df["outliers"] = out

df_embedding_no_out = df_embedding[df_embedding["outliers"] == 0]
df_embedding_no_out = df_embedding_no_out.drop(["outliers"], axis=1)

df_embedding_with_out = df_embedding.copy()
df_embedding_with_out = df_embedding_with_out.drop(["outliers"], axis=1)

df_embedding_no_out.shape
df_embedding_with_out.shape
# Instantiate the clustering model and visualizer
km = KMeans(init="k-means++", random_state=0, n_init="auto")
visualizer = KElbowVisualizer(km, k=(2, 10), locate_elbow=False)

visualizer.fit(df_embedding_with_out)  # Fit the data to the visualizer
visualizer.show()

# Fit the final K-Means model on the outlier-free embeddings.
# The number of clusters is read off the elbow plot above; 5 is assumed here for illustration.
km = KMeans(n_clusters=5, init="k-means++", random_state=0, n_init="auto")
clusters_predict = km.fit_predict(df_embedding_no_out)

from sklearn.metrics import silhouette_score
from sklearn.metrics import calinski_harabasz_score
from sklearn.metrics import davies_bouldin_score

"""
The Davies-Bouldin index is defined as the average similarity measure
of each cluster with its most similar cluster, where similarity is the
ratio of within-cluster distances to between-cluster distances.
The minimum value of the DB index is 0, and a smaller value (closer to 0)
represents a better model that produces better clusters.
"""
print(f"Davies-Bouldin score: {davies_bouldin_score(df_embedding_no_out, clusters_predict)}")

"""
Calinski-Harabasz Index -> Variance Ratio Criterion.
The Calinski-Harabasz index is defined as the ratio of the sum of between-cluster
dispersion and of within-cluster dispersion.
The higher the index, the more separable the clusters.
"""
print(f"Calinski-Harabasz score: {calinski_harabasz_score(df_embedding_no_out, clusters_predict)}")

"""
The silhouette score is a metric used to calculate the goodness of fit of a
clustering algorithm, but it can also be used as a method for determining an
optimal value of k.
Its value ranges from -1 to 1.
A value of 0 indicates that clusters are overlapping and either the data or
the value of k is incorrect.
1 is the ideal value and indicates that clusters are very dense and nicely separated.
"""
print(f"Silhouette score: {silhouette_score(df_embedding_no_out, clusters_predict)}")

pca_3d_object, df_pca_3d = get_pca_3d(df_embedding_no_out, clusters_predict)
plot_pca_3d(df_pca_3d, title="PCA Space", opacity=1, width_line=0.1)
print("The variability is:", pca_3d_object.eigenvalues_summary)

pca_2d_object, df_pca_2d = get_pca_2d(df_embedding_no_out, clusters_predict)
plot_pca_2d(df_pca_2d, title="PCA Space", opacity=1, width_line=0.2)

sampling_data = df_embedding_no_out.sample(frac=0.5, replace=True, random_state=1)
sampling_clusters = pd.DataFrame(clusters_predict).sample(frac=0.5, replace=True, random_state=1)[0].values

df_tsne_3d = TSNE(
    n_components=3,
    learning_rate=500,
    init='random',
    perplexity=200,
    n_iter=5000).fit_transform(sampling_data)

df_tsne_3d = pd.DataFrame(df_tsne_3d, columns=["comp1", "comp2", "comp3"])
df_tsne_3d["cluster"] = sampling_clusters
plot_pca_3d(df_tsne_3d, title="T-SNE Space", opacity=1, width_line=0.1)
plot_pca_3d(df_tsne_3d, title="T-SNE Space", opacity=0.1, width_line=0.1)

df_tsne_2d = TSNE(
    n_components=2,
    learning_rate=500,
    init='random',
    perplexity=200,
    n_iter=5000).fit_transform(sampling_data)

df_tsne_2d = pd.DataFrame(df_tsne_2d, columns=["comp1", "comp2"])
df_tsne_2d["cluster"] = sampling_clusters

plot_pca_2d(df_tsne_2d, title="T-SNE Space", opacity=0.5, width_line=0.5)
plot_pca_2d(df_tsne_2d, title="T-SNE Space", opacity=1, width_line=0.5)

# Train a LightGBM classifier on the original features to explain the clusters
clf_km = lgb.LGBMClassifier(colsample_bytree=0.8)

# Original (non-embedding) features for the non-outlier rows.
# This step is assumed here, since its definition was not shown above.
df_no_outliers = df[df["outliers"] == 0].drop(["outliers"], axis=1)

for col in ["job", "marital", "education", "housing", "loan", "default"]:
    df_no_outliers[col] = df_no_outliers[col].astype('category')

clf_km.fit(X=df_no_outliers, y=clusters_predict)

# SHAP values
explainer_km = shap.TreeExplainer(clf_km)
shap_values_km = explainer_km.shap_values(df_no_outliers)
shap.summary_plot(shap_values_km, df_no_outliers, plot_type="bar", plot_size=(15, 10))

y_pred = clf_km.predict(df_no_outliers)
accuracy = accuracy_score(y_pred, clusters_predict)
print('Training-set accuracy score: {0:0.4f}'.format(accuracy))
print(classification_report(clusters_predict, y_pred))

df_no_outliers["cluster"] = clusters_predict

df_group = df_no_outliers.groupby('cluster').agg(
    {
        'job': lambda x: x.value_counts().index[0],
        'marital': lambda x: x.value_counts().index[0],
        'education': lambda x: x.value_counts().index[0],
        'housing': lambda x: x.value_counts().index[0],
        'loan': lambda x: x.value_counts().index[0],
        'age': 'mean',
        'balance': 'mean',
        'default': lambda x: x.value_counts().index[0],
    }
).sort_values("job").reset_index()
df_group

Output:
Limitations and Challenges

Limitations:
Despite its promising results, the project faced several limitations that constrained its
potential outcomes:
1. Computational Complexity
One of the primary challenges of this project was the high computational requirements
associated with LLMs.
 Resource Intensity:
Generating embeddings using large language models like BERT or GPT is
computationally expensive, requiring significant memory and processing power.
 Scalability Issues:
As the size of the dataset increased, both embedding generation and clustering
became increasingly time-consuming, limiting scalability for real-time applications.
 Hardware Dependency:
The dependency on GPUs or TPUs for efficient computation posed a challenge for
deploying the solution in resource-constrained environments.
Potential Solutions:
 Using smaller, domain-specific LLMs such as DistilBERT to reduce computational
costs.
 Leveraging cloud-based resources for dynamic scalability.

2. High-dimensional Embeddings
The text embeddings generated by LLMs are typically high-dimensional vectors, which can
pose challenges for clustering algorithms.
 Curse of Dimensionality:
High-dimensional data can degrade the performance of clustering algorithms like K-
Means and make it harder to identify meaningful patterns.
 Visualization Difficulties:
Visualizing high-dimensional clusters for interpretability required additional
dimensionality reduction techniques, such as PCA or t-SNE, which introduced their
own limitations and biases.
Potential Solutions:
 Experimenting with dimensionality reduction methods like UMAP (Uniform Manifold Approximation and Projection) for better performance (a minimal UMAP sketch follows below).
 Adopting specialized clustering algorithms designed for high-dimensional data.
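A minimal sketch of the UMAP idea mentioned above follows; the umap-learn package, n_components=10, and n_clusters=5 are illustrative assumptions rather than settings used in the project.

import numpy as np
import umap  # from the umap-learn package
from sklearn.cluster import KMeans

# Stand-in high-dimensional embeddings; in practice these would be the LLM embeddings
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 768))

# Reduce dimensionality with a cosine metric before clustering
reducer = umap.UMAP(n_components=10, metric="cosine", random_state=42)
reduced = reducer.fit_transform(embeddings)

labels = KMeans(n_clusters=5, random_state=42, n_init=10).fit_predict(reduced)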
3. Model Selection and Fine-tuning
Choosing the appropriate LLM for the task and fine-tuning it posed significant challenges:
 Trade-offs in Pre-trained Models:
Models like BERT prioritize understanding context but may struggle with domain-
specific nuances. GPT models excel at generating embeddings but are
computationally intensive.
 Domain-specific Adaptation:
Pre-trained LLMs often require fine-tuning to perform optimally on domain-specific
text, which necessitates additional labeled data and computational resources.
Potential Solutions:
 Using hybrid models that combine pre-trained embeddings with task-specific fine-
tuning.
 Leveraging open-domain models but augmenting them with domain-specific
vocabulary.

4. Cluster Interpretability
While LLM-powered embeddings enhanced clustering accuracy, interpreting the resulting
clusters remained a challenge:
 Semantic Overlap:
Clusters derived from text embeddings sometimes showed semantic overlap, making
it difficult to delineate clear boundaries between groups.
 Lack of Explainability:
Traditional clustering algorithms like K-Means and DBSCAN do not inherently
provide insights into why specific data points are grouped together.
Potential Solutions:
 Applying explainability techniques, such as SHAP (SHapley Additive exPlanations),
to understand feature contributions.
 Developing domain-specific heuristics to label and interpret clusters.

5. Data-related Challenges
The text data used in the project introduced several challenges during preprocessing and
analysis:
 Noisy and Inconsistent Data:
Text data often contained typos, slang, and inconsistent formatting, which required
extensive preprocessing.
 Handling Rare or Outlier Data:
Outlier data points often disrupted clustering results, particularly for density-based
algorithms like DBSCAN.
Potential Solutions:
 Implementing advanced preprocessing pipelines, including spell-checkers and
synonym mapping.
 Using robust algorithms that can handle noise and outliers effectively.

6. Evaluation Metrics
Evaluating the quality of clustering outcomes presented unique challenges:
 Subjectivity in Clustering:
Unlike supervised learning, clustering lacks predefined labels, making the evaluation
inherently subjective.
 Metric Limitations:
Metrics like Silhouette Score and Davies-Bouldin Index provide quantitative insights
but may not capture the semantic quality of text clusters.
Potential Solutions:
 Incorporating qualitative evaluation, such as manual inspection or user feedback.
 Combining multiple evaluation metrics to ensure a holistic assessment.

7. Real-world Deployment
Transitioning from a proof-of-concept to real-world deployment revealed several practical
challenges:
 Dynamic Data:
In real-world scenarios, data is dynamic and constantly evolving, requiring periodic
retraining and re-clustering.
 Integration with Existing Systems:
Integrating LLM-powered clustering pipelines into existing workflows required
significant customization.
Potential Solutions:
 Automating periodic retraining and clustering workflows.
 Building modular pipelines for seamless integration with enterprise systems.
Challenges:
1. Embedding Generation Time:
o Generating high-quality embeddings for extensive datasets was time-
consuming, leading to delays in subsequent steps like clustering and
evaluation. Optimizing this process required careful consideration of model
parameters and batch sizes.
2. Cluster Validation:
o Ensuring the validity of clusters posed significant challenges. Metrics like
Silhouette Score provided numerical validation but did not always align with
domain-specific insights. Balancing quantitative and qualitative validation
methods was difficult.
3. Data Preprocessing:
o Handling diverse textual data from different sources required customized
preprocessing pipelines. For instance, some datasets contained special
characters or non-English text, which needed additional handling.
4. Model Fine-Tuning:
o While pre-trained LLMs were used, fine-tuning them on domain-specific data
required significant computational resources and expertise. This step, although
beneficial, was largely constrained by the resources available.
5. Visualizing High-Dimensional Data:
o Reducing the dimensions of embeddings for visualization purposes was
challenging. Ensuring that the reduced dimensions captured meaningful
relationships between data points required iterative experimentation with
techniques like t-SNE and UMAP.
The project demonstrated resilience in overcoming many of these challenges, often through
creative problem-solving and leveraging available resources. Despite these hurdles, the
outcomes were insightful and laid the groundwork for future advancements.
Conclusion
The internship project titled "Clustering with LLM" provided a valuable opportunity to
explore the intersection of large language models and unsupervised machine learning
techniques. By leveraging the semantic power of LLM-generated embeddings, this project
successfully demonstrated how clustering could uncover meaningful patterns and insights
within unstructured text data. The integration of advanced clustering methods and LLMs
addressed traditional challenges of dimensionality and data complexity, offering a robust
solution for organizing and interpreting large volumes of textual information.
Key outcomes of the project include:
 A scalable workflow for text clustering that begins with data preprocessing, proceeds
to embedding generation using LLMs, and concludes with clustering and
visualization.
 Insights into the strengths and weaknesses of various clustering algorithms,
particularly in handling high-dimensional embeddings.
 Practical experience in implementing and fine-tuning advanced language models and
integrating them into real-world workflows.
This internship fostered the development of technical and analytical skills, including:
 Hands-on expertise in tools like Hugging Face Transformers, Python libraries for
preprocessing (NLTK, SpaCy), and visualization (t-SNE, PCA).
 Strengthened problem-solving abilities to address computational, interpretative, and
resource-related challenges.
 An enhanced understanding of the potential applications of clustering in domains like
sentiment analysis, recommendation systems, and topic modeling.
While the project showcased promising results, it also highlighted several areas for
improvement. Addressing limitations such as computational constraints, data quality, and
interpretability of clusters can further refine the outcomes. Future work could focus on
optimizing the embedding generation process, exploring alternative LLM architectures, and
developing more intuitive ways to validate and interpret clusters.
Overall, the internship was a transformative learning experience, bridging theoretical
knowledge with practical applications. The insights and skills gained from this project not
only enhanced my technical proficiency but also prepared me for future challenges in the
dynamic field of data science.
Completion Certificate
References
1. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of
Deep Bidirectional Transformers for Language Understanding. arXiv preprint
arXiv:1810.04805.
2. Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving
Language Understanding by Generative Pre-training. OpenAI preprint.
3. Van Der Maaten, L., & Hinton, G. (2008). Visualizing Data using t-SNE. Journal of
Machine Learning Research.
4. Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A Density-Based Algorithm for
Discovering Clusters in Large Spatial Databases with Noise. Proceedings of the 2nd
International Conference on Knowledge Discovery and Data Mining.
5. Pedregosa, F., et al. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research.
6. Hugging Face Transformers Library. Retrieved from https://huggingface.co/docs/transformers
7. SpaCy Natural Language Processing Library. Retrieved from https://spacy.io
8. Bird, S., Klein, E., & Loper, E. (2009). NLTK: Natural Language Toolkit. Retrieved from https://www.nltk.org
