
CHAPTER 5: SYSTEM IMPLEMENTATION, ASSESSMENT RESULTS AND USER INTERFACE

5.1 System implementation


5.1.1 System setup
Python was selected as the programming language for this project because of its widespread popularity, versatility, and extensive ecosystem, particularly in artificial intelligence (AI) and natural language processing (NLP). Python works well with tools such as Visual Studio Code (VSCode) and integrates seamlessly with Linux systems, making it a natural choice for chatbot development. For the chatbot to function correctly, a number of Python libraries and dependencies must be installed; they are typically pinned in a requirements.txt file and installed with pip install -r requirements.txt.
Here is the list of these libraries:
streamlit==1.37.1
sentence_transformers==3.0.1
chromadb==0.5.11
google-generativeai==0.8.0
ipython==8.18.1
tiktoken==0.7.0
nltk==3.8.1
python-dotenv==1.0.1
langchain-experimental==0.3.2
langchain-community==0.3.3
langchain-google-genai==2.0.1
networkx
pyvis==0.3.2
pysqlite3-binary
ngrok
Table 1: Required libraries for the program.
5.1.2 Data collection
Building a chatbot to assist the admission services of the University of Information Technology (UIT) requires preparing a dataset. This section covers the collection of relevant data and the application of thorough cleaning and preprocessing methods, so that the chatbot can provide precise and reliable information.
Sources of Data (UIT Admission Documents for 2023–2024):
The collection comprises official admission documents from UIT for the academic years 2023 and 2024. Because we want the chatbot to answer most questions well, the data coverage must be broad.
These include:
● Admission Information: Detailed information concerning admission criteria,
application procedures, and essential dates.
o School history and reputation
o Admission information
● Program Brochures: Descriptions of undergraduate and postgraduate programs,
including curricula and career options.
o Training quality and curriculum
o Majors and training programs
● Scholarship Details: Details concerning available scholarships, eligibility
criteria, and application processes.
o Scholarship
o Tuition and financial aid
● Common Inquiries (CIs): Standard questions from prospective students and
official responses.
o Mission and vision of UIT
o Student size and research team at UIT
o Opportunities for internships and jobs after graduating from UIT
o Student life
o Facilities and learning environment at UIT
o Student life and learning support at UIT
o Information on dormitories and support houses
o Extracurricular programs and international cooperation at UIT
o Main scientific research activities and financial support at UIT
o Standard scores of majors and faculties at UIT
o ……..
These materials are sourced from UIT's official website and verified publications to
ensure accuracy and relevance.
5.1.3 Data Upload and Pre-Processing
Objective: Prepare clean, chunked data for use in applications such as chatbots or information search.
1. Data Upload: Allows the user to upload CSV files.
The pandas library is used to read the CSV files, with a check that each uploaded file is in fact a CSV.

# --- Data Upload and Processing ---
all_data = []
if uploaded_files:
    for file in uploaded_files:
        if file.name.endswith(".csv"):  # Check that the file is a CSV file
            try:
                df = pd.read_csv(file)  # Read the CSV file into a pandas dataframe
                all_data.append(df)
            except pd.errors.ParserError:
                st.error(f"Error: The file {file.name} is not a valid .csv file.")
    df = pd.concat(all_data, ignore_index=True)  # Merge all uploaded files into one dataframe
Table 2: Load the CSV Files.

2. Data Processing:
Clean the question and answer columns, remove duplicates, and save the processed data to st.session_state.
# --- Preprocessing ---
if not df.empty:
    df["Câu hỏi"] = df["Câu hỏi"].apply(preprocess_text)         # apply the preprocess_text function
    df["Câu trả lời"] = df["Câu trả lời"].apply(preprocess_text)

    df = remove_duplicate_rows(df, "Câu hỏi")                    # apply remove_duplicate_rows
    df = remove_duplicate_rows(df, "Câu trả lời")

    st.session_state.df = df
    st.dataframe(df)

Table 3: Data preprocessing.

3. Chunking:
Split the Answer column into semantic chunks using the split_text function and save the chunking result to st.session_state.chunks_df.

# --- Split in Chunk ---
st.subheader("Chunking")
if not df.empty:
    index_column = "Câu trả lời"  # Set the column name to use for indexing
    st.write(f"Selected column for indexing: {index_column}")

    chunk_records = []
    # Iterate over each row in the dataframe
    for _, row in df.iterrows():
        selected_column_value = row[index_column]
        # Check if the value is a non-empty string
        if isinstance(selected_column_value, str) and selected_column_value:
            chunker = SemanticChunker(embedding_type="tfidf")
            chunks = chunker.split_text(selected_column_value)  # split the answer into chunks
            for chunk in chunks:
                chunk_records.append({**row.to_dict(), 'chunk': chunk})

    st.session_state.chunks_df = pd.DataFrame(chunk_records)

if "chunks_df" in st.session_state and not st.session_state.chunks_df.empty:
    st.write("Number of chunks:", len(st.session_state.chunks_df))
    st.dataframe(st.session_state.chunks_df)

Table 4: Split the Answer column into chunks.

5.1.3.1 Data Preprocessing


Text preprocessing is a crucial step in Natural Language Processing (NLP) to
prepare raw text data for analysis. The aim of preprocessing is to clean and standardize
the text to reduce noise and irrelevant information.
Common techniques include:
● Lowercasing: Converts all characters to lowercase to avoid case-sensitive distinctions between words like “Trường” and “trường”. This is essential to reduce the vocabulary size and improve consistency.
● Removing Special Characters: Symbols and punctuation marks that carry no semantic significance in a sentence (for example # or *) are discarded.
● Removing Extra Whitespaces: Extra spaces are removed to prevent issues
during tokenization and parsing.
def preprocess_text(text):
    text = re.sub(r'<.*?>', ' ', text)                # strip HTML tags
    text = re.sub(r'[^\w\s.!?@]', ' ', text)          # drop special characters, keeping word characters
                                                      # (including Vietnamese letters), whitespace and . ! ? @
    text = re.sub(r'\s+', ' ', text).strip().lower()  # collapse extra whitespace and lowercase
    return text

def remove_duplicate_rows(df, column_name):
    df.drop_duplicates(subset=column_name, keep='first', inplace=True)
    return df

Table 5: Remove special characters and duplicate rows.


5.1.3.2 TF-IDF (Term Frequency-Inverse Document Frequency)
TF-IDF is used to identify important words that distinguish documents. A higher TF-IDF score indicates that a term is frequent in a particular document but rare across the corpus. We use TF-IDF to convert sentences into numerical vectors, which are then compared using cosine similarity to group similar sentences into chunks.
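For reference, the standard weighting (scikit-learn's TfidfVectorizer implements a smoothed variant of the same idea) is:

$$\mathrm{tfidf}(t,d) = \mathrm{tf}(t,d) \times \log\frac{N}{\mathrm{df}(t)}$$

where tf(t, d) is the frequency of term t in document d, N is the total number of documents, and df(t) is the number of documents containing t.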
def embed_function(self, sentences):
    if self.embedding_type == "tfidf":
        # Fit a TfidfVectorizer to the input sentences and return dense vectors
        vectorizer = TfidfVectorizer().fit_transform(sentences)
        return vectorizer.toarray()
    else:
        raise ValueError("Unsupported embedding type")

Table 6: Convert sentences to TF-IDF vectors.
5.1.3.3 Semantic Chunking
Chunking: the process of partitioning text into smaller, semantically meaningful portions (chunks). It facilitates the organization of information for management and analysis. In semantic chunking [1], sentences or phrases are grouped according to their meaning rather than their structure.
Cosine Similarity: cosine similarity quantifies the cosine of the angle formed by two vectors and is used here to calculate the similarity between two sentences.
Thresholding: sentences (chunks) are merged into a new chunk if their cosine similarity exceeds a specified threshold (threshold=0.3). The threshold represents the level of similarity required for phrases to be classified as similar. In the admissions data, the chunks within each answer have fairly low lexical similarity even when they are semantically related, so 0.3 is a suitable value for this dataset.
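Concretely, for two sentence vectors A and B produced by the TF-IDF embedding:

$$\cos(\mathbf{A},\mathbf{B}) = \frac{\mathbf{A}\cdot\mathbf{B}}{\lVert\mathbf{A}\rVert\,\lVert\mathbf{B}\rVert}$$

and two consecutive sentences are placed in the same chunk whenever this value is at least 0.3.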
class SemanticChunker(BaseChunker):
    def __init__(self, threshold=0.3, embedding_type="tfidf",
                 model="keepitreal/vietnamese-sbert"):
        self.threshold = threshold
        self.embedding_type = embedding_type
        self.model = model
        nltk.download("punkt", quiet=True)

    def embed_function(self, sentences):
        if self.embedding_type == "tfidf":
            # Generate TF-IDF embeddings for the sentences
            vectorizer = TfidfVectorizer().fit_transform(sentences)
            return vectorizer.toarray()
        else:
            raise ValueError("Unsupported embedding type")

    def split_text(self, text):
        # Tokenize the text into sentences
        sentences = nltk.sent_tokenize(text)
        sentences = [item for item in sentences if item and item.strip()]
        if not len(sentences):
            return []
        vectors = self.embed_function(sentences)
        # Compute cosine similarity between sentence embeddings
        similarities = cosine_similarity(vectors)

        chunks = [[sentences[0]]]
        for i in range(1, len(sentences)):
            sim_score = similarities[i - 1, i]
            # If similarity with the previous sentence exceeds the threshold,
            # add the sentence to the last chunk; otherwise start a new chunk
            if sim_score >= self.threshold:
                chunks[-1].append(sentences[i])
            else:
                chunks.append([sentences[i]])
        # Join the sentences in each chunk to form coherent paragraphs
        return [' '.join(chunk) for chunk in chunks]

Table 7: Semantic chunking.


5.1.4 Data Saving
Save the vector data to serve semantic queries (e.g., chatbots or information search).
This code saves the chunked data to ChromaDB:
● Data check: if the data is empty, display a warning.

● Create collection: create a new collection in ChromaDB or retrieve an existing one.

● Split chunks into batches: the data is split into batches of 256 records each; we chose a batch size of 256 because it made storing the data smoother and faster than the other batch sizes we tried (the batching helpers are sketched after the code below).

● Save in batches: save each batch to ChromaDB, displaying the progress.

● Notification: report success or error after processing.

# --- Data Saving ---
if st.button("Save Data"):
    # Check if there is no data to save
    if st.session_state.chunks_df.empty:
        st.warning("No data available to process.")
    else:
        try:
            # Initialize collection if not already set
            if st.session_state.collection is None:
                # Define collection name based on uploaded file or default
                collection_name = "rag_collection"
                if uploaded_files:
                    collection_name = f"rag_collection_{clean_collection_name(os.path.splitext(uploaded_files[0].name)[0])}"
                st.session_state.random_collection_name = collection_name
                # Create or get the collection in the Chroma vector store
                st.session_state.collection = st.session_state.client.get_or_create_collection(
                    name=collection_name,
                    metadata={"Chunk ": "", "Question": "", "Answer": ""}
                )

            # Set batch size and divide the dataframe into batches
            batch_size = 256
            df_batches = divide_dataframe(st.session_state.chunks_df, batch_size)
            num_batches = len(df_batches)

            # Set up progress bar
            progress_text = "Saving data to Chroma. Please wait..."
            my_bar = st.progress(0, text=progress_text)

            # Process each batch and update the progress bar
            for i, batch_df in enumerate(df_batches):
                if not batch_df.empty:
                    process_batch(batch_df, st.session_state.embedding_model, st.session_state.collection)
                    progress_percentage = int(((i + 1) / num_batches) * 100)
                    my_bar.progress(progress_percentage, text=f"Processing batch {i + 1}/{num_batches}")
                    time.sleep(0.1)

            # Clear the progress bar and show success message
            my_bar.empty()
            st.success("Data saved to Chroma vector store successfully!")
            st.markdown(f"Collection name: {st.session_state.random_collection_name}")
            st.session_state.data_saved_success = True

        except Exception as e:
            # Show error message if something goes wrong
            st.error(f"Error saving data to Chroma: {e}")

Table 8: Process of saving to the vector database.
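The helpers divide_dataframe and process_batch are called above but are not shown in this excerpt. A minimal sketch consistent with how they are called might look like the following; the ids scheme is our assumption, while the 'chunk' column name comes from the chunking step above:

import uuid

def divide_dataframe(df, batch_size):
    # Split a dataframe into consecutive slices of at most batch_size rows
    return [df[i:i + batch_size] for i in range(0, len(df), batch_size)]

def process_batch(batch_df, embedding_model, collection):
    # Embed each chunk and store it, together with its row metadata, in Chroma
    chunks = batch_df["chunk"].tolist()
    embeddings = embedding_model.encode(chunks)
    collection.add(
        ids=[str(uuid.uuid4()) for _ in chunks],  # assumed: one random id per record
        embeddings=embeddings.tolist(),
        documents=chunks,
        metadatas=batch_df.to_dict(orient="records"),
    )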
5.1.4.1 Sentence Embeddings Utilizing Pre-trained Models
Sentence Embedding: In contrast to conventional word embeddings (e.g., Word2Vec, GloVe), sentence embeddings represent complete sentences in a dense vector space, capturing their contextual meaning.
SBERT (Sentence-BERT): SBERT is a modified version of the BERT model [65] fine-tuned specifically to generate sentence-level embeddings. BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based model engineered to understand the contextual meaning of words within a given sentence.
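As a quick illustration (a sketch only; the two example sentences are invented), the project's Vietnamese SBERT model can compare two paraphrases directly:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("keepitreal/vietnamese-sbert")
emb = model.encode(["Học phí của UIT là bao nhiêu?",
                    "Chi phí học tập tại UIT"])
# Cosine similarity between the two sentence embeddings
print(util.cos_sim(emb[0], emb[1]))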
# Initialize language first
if "language" not in st.session_state:
    st.session_state.language = VIETNAMESE

# Initialize embedding_model based on language
if "embedding_model" not in st.session_state:
    if st.session_state.language == VIETNAMESE:
        st.session_state.embedding_model = SentenceTransformer('keepitreal/vietnamese-sbert')
        st.session_state.embedding_model_name = 'keepitreal/vietnamese-sbert'
        st.success("Using Vietnamese embedding model: keepitreal/vietnamese-sbert")
    else:
        st.session_state.embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
        st.session_state.embedding_model_name = 'all-MiniLM-L6-v2'
        st.success("Using embedding model: all-MiniLM-L6-v2")

Table 9: Selecting the embedding model.


5.1.4.2 Data storage in Chroma database
Vector databases: vector databases are specialized systems intended for the storage and retrieval of high-dimensional vectors. Chroma is such a system, offering efficient similarity search for purposes such as document retrieval, semantic search, and question answering.
Batch Processing: storing embeddings and metadata in batches improves storage efficiency and reduces memory pressure.
# Initialize other session state variables
# Check if the 'client' variable is not already in session state
if "client" not in st.session_state:
# If not, create a new PersistentClient for Chroma with the database location 'db'
st.session_state.client = chromadb.PersistentClient("db")
Table 10: Initialize the database.
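For orientation, a minimal round trip through a Chroma collection looks like this (a standalone sketch with toy data, not the project's exact calls):

import chromadb

client = chromadb.PersistentClient("db")
col = client.get_or_create_collection(name="demo")

# Store one document with a toy 3-dimensional embedding
col.add(ids=["1"], documents=["UIT tuyển sinh năm 2024"],
        embeddings=[[0.1, 0.2, 0.3]])

# Query with the same vector; the stored document is the nearest neighbour
print(col.query(query_embeddings=[[0.1, 0.2, 0.3]], n_results=1))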
5.1.5 Load data from saved collection
Load or delete data from a collection saved in ChromaDB. This serves the search and query operations on saved data, ensuring efficient data reuse.
● Press the button to open the collection list.

● Select a collection to load or delete.

When the load succeeds, the data is stored in a DataFrame (chunks_df) for use in the next steps.
# --- Load from Saved Collection ---
st.subheader("1.2. Or load from saved collection", divider=True)

# Check if the button is clicked to load data
if st.button("Load from saved collection"):
    st.session_state.open_dialog = "LIST_COLLECTION"  # Open the dialog listing collections

# Define the function to load a collection
def load_func(collection_name):
    # Get the specified collection from the Chroma client
    st.session_state.collection = st.session_state.client.get_collection(name=collection_name)
    st.session_state.random_collection_name = collection_name
    st.session_state.data_saved_success = True
    st.session_state.source_data = DB

    # Retrieve documents and metadata from the collection
    data = st.session_state.collection.get(include=["documents", "metadatas"])
    metadatas = data["metadatas"]

    # Extract column names from the metadata
    column_names = []
    if metadatas and metadatas[0].keys():
        column_names.extend(metadatas[0].keys())
    column_names = list(set(column_names))

    # Create a dataframe from the metadata with dynamic column names
    st.session_state.chunks_df = pd.DataFrame(metadatas, columns=column_names)

# Define the function to delete a collection
def delete_func(collection_name):
    # Delete the specified collection from the Chroma client
    st.session_state.client.delete_collection(name=collection_name)

# List collections and provide options to load or delete
list_collection(st.session_state, load_func, delete_func)

# If a valid collection is loaded and the dataframe is not empty
if ("random_collection_name" in st.session_state
        and st.session_state.random_collection_name
        and not st.session_state.chunks_df.empty):
    # Identify columns to answer from (excluding the 'chunk' column)
    st.session_state.columns_to_answer = [
        col for col in st.session_state.chunks_df.columns if col != "chunk"
    ]

Table 11: Managing collections.
5.1.6 Large Language Model (LLM)
Initialize an online LLM (Gemini) with an API key, preparing the model to handle tasks such as response generation and natural language processing (NLP) in the application.
● Check whether llm_model already exists.

● Save the API key to the session state (st.session_state).

● Initialize a Gemini model with a specific version.
# Initialize llm_model
if "llm_model" not in st.session_state:
    api_key = "YOUR_API_KEY"  # Replace with your actual API key
    st.session_state.llm_model = OnlineLLMs(name=GEMINI, api_key=api_key,
                                            model_version="learnlm-1.5-pro-experimental")
    st.session_state.api_key_saved = True
    print("API Key saved successfully!")

Table 12: Calling the Gemini LLM API.


5.1.6.1 Gemini API
Gemini is a generative AI model developed by Google, accessed here through its API and used to generate human-like text responses based on user input. It is built on the transformer architecture, which is the foundation of many modern NLP models. We use Gemini to enhance the chatbot's capabilities by generating hypothetical documents that help provide more accurate and contextually relevant answers to user queries.
The model employs a pre-trained transformer architecture, the same family as models such as GPT and LaMDA, for text generation. It enhances the system by combining dynamically generated content with retrieved information (via RAG, Retrieval-Augmented Generation), thereby ensuring more comprehensive and effective responses [66].
The purpose of Gemini is to improve semantic search and the user experience by generating material that supplements the information retrieved from ChromaDB. This enables the system to provide responses even when exact data is not readily accessible.
Advantages of the Gemini API in the project:
● Accuracy: produces material that enhances the retrieved data, providing accurate and useful answers.
● Flexibility: supports dynamic content generation when the database lacks adequate information.
● Scalability: as the dataset expands, the model continues to give high-quality results.
● Improved user experience: generates natural, contextually relevant responses, enhancing interaction quality [67].
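For orientation, the google-generativeai calls wrapped by the class below reduce to a few lines (a sketch; reading the key from an environment variable is our assumption, not the project's code):

import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel(model_name="learnlm-1.5-pro-experimental")
# Generate a single free-form response
print(model.generate_content("Giới thiệu ngắn về UIT.").text)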

class OnlineLLMs(LLM):
    def __init__(self, name, api_key=None, model_version=None):
        # Initialize the model with the given name, API key, and model version
        self.name = name
        self.model = None
        self.model_version = model_version

        # Configure the model if the name is GEMINI and an API key is provided
        if self.name.lower() == GEMINI and api_key:
            genai.configure(
                api_key=api_key  # Set the API key for the generative AI service
            )
            # Create the GenerativeModel object with the specified model version
            self.model = genai.GenerativeModel(
                model_name=model_version  # Pass the model version
            )

    def set_model(self, model):
        # Set the model to a custom model
        self.model = model

    def parse_message(self, messages):
        # Map the roles ("user" to "user" and "assistant" to "model")
        # and format the message content
        mapping = {
            "user": "user",       # Role mapping for user
            "assistant": "model"  # Role mapping for assistant
        }
        # Iterate through the list of messages and format the roles and content
        return [
            {"role": mapping[mess["role"]], "parts": mess["content"]}
            for mess in messages
        ]

    @backoff.on_exception(backoff.expo, Exception, max_tries=3)
    def create_agentic_chunker_message(self, system_prompt, messages,
                                       max_tokens=1000, temperature=1):
        # Check if the model is "GEMINI"
        if self.name.lower() == GEMINI:
            try:
                # Parse the message roles and content
                messages = self.parse_message(messages)

                # Generate content by passing the system prompt and messages
                response = self.model.generate_content(
                    [
                        {"role": "user", "parts": system_prompt},  # The system's message
                        {"role": "model", "parts": "I understand. I will strictly follow your instruction!"},  # Acknowledgment by the model
                        *messages  # Append the processed messages
                    ],
                    generation_config=genai.types.GenerationConfig(
                        max_output_tokens=max_tokens,  # Maximum number of tokens in the response
                        temperature=temperature        # Control the randomness of the output
                    )
                )
                # Return the model's response text
                return response.text
            except Exception as e:
                # In case of an error, print the error message and retry
                print(f"Error occurred: {e}, retrying...")
                raise e  # Raise the exception to be retried
        else:
            # Raise an error if the model name is unknown
            raise ValueError(f"Unknown model name: {self.name}")

    def generate_content(self, prompt):
        """Generate content based on the specified LLM model."""
        # Check if the model has been set, raise an error if not
        if not self.model:
            raise ValueError("Model is not set. Please set a model using set_model().")

        # Generate content based on the model type
        if self.name.lower() == GEMINI:
            # Generate content using the Gemini model
            response = self.model.generate_content(prompt)
            try:
                # Extract the content from the response (handling possible nested structure)
                content = response.candidates[0].content.parts[0].text
            except (IndexError, AttributeError):
                # Raise an error if the response structure is not as expected
                raise ValueError("Failed to parse the Gemini response structure.")
        else:
            # Raise an error if the model name is unknown
            raise ValueError(f"Unknown model name: {self.name}")

        # Ensure the content is returned as a string
        if not isinstance(content, str):
            content = str(content)

        return content  # Return the generated content

Table 13: Definition of the OnlineLLMs class.
5.1.7 Search Algorithm Setup
Allow the user to choose a search algorithm (HyDE or Vector Search), which determines how data is searched in the next processing steps.
● Display radio options with the two algorithms.

● Save the user's selection to the session state.
# --- Search Algorithm Setup ---


st.header("2. Set up search algorithms")
st.radio(
"Please select one of the options below.",
["Hyde Search", "Vector Search"],
captions=["Search using the HYDE algorithm", "Search using vector similarity"],
key="search_option",
index=0,
)
Table 14: Choosing the search method.

5.1.7.1 Vector Search


Vector search uses sentence embeddings, which are numerical representations of text, to assess semantic similarity between queries and documents. Sentence-BERT produces these embeddings and ChromaDB stores them, enabling rapid similarity searches. Cosine similarity is used to compare vectors and retrieve the most relevant information based on semantic content rather than keywords alone [2].
Advantages:
● Semantic search: ranks results by meaning rather than keywords alone.
● Accuracy: returns more accurate and pertinent results by analyzing meaning rather than exact matches.
● Speed: rapid retrieval of relevant details, which is necessary for real-time applications.
● Scalability: effectively manages extensive datasets as the system grows.
This technique improves the chatbot's ability to respond to questions with greater accuracy and efficiency.
def vector_search(model, query, collection, columns_to_answer, number_docs_retrieval):
    query_embeddings = model.encode([query])

    # Fetch results from the collection
    search_results = collection.query(
        query_embeddings=query_embeddings,
        n_results=number_docs_retrieval
    )
    metadatas = search_results['metadatas']
    scores = search_results['distances']

    # Prepare the search result output
    search_result = ""
    for i, (meta, score) in enumerate(zip(metadatas[0], scores[0]), start=1):  # pair up (metadata, score)
        search_result += f"\n{i}) Distances: {score:.4f}"
        for column in columns_to_answer:
            if column in meta:
                search_result += f" {column}: {meta.get(column)}"
        search_result += "\n"

    return metadatas, search_result

Table 15: Vector search process.

5.1.7.2 HyDE Search


HyDE Search uses Gemini to generate hypothetical documents that enrich the results obtained from ChromaDB. This technique improves query responses by providing further context when the retrieved data is insufficient [3].
Advantages:
● Improved accuracy: integrates real information with AI-generated content for more comprehensive responses.
● Context enrichment: compensates for gaps in the database when information is missing.
● Flexibility: handles complicated or unclear inquiries by producing relevant text.
HyDE increases the chatbot's capacity to provide more comprehensive and precise responses.
def generate_hypothetical_documents(model, query, num_samples=10):
    # Generate multiple hypothetical documents answering the provided query
    hypothetical_docs = []
    for _ in range(num_samples):
        enhanced_prompt = f"Write a paragraph that answers the question: {query}"
        # Generate content using the model
        response = model.generate_content(enhanced_prompt)
        hypothetical_docs.append(response)
    return hypothetical_docs

def encode_hypothetical_documents(documents, encoder_model):
    # Convert the generated documents into embeddings using the encoder model
    embeddings = [encoder_model.encode([doc])[0] for doc in documents]
    # Calculate the average embedding of all documents
    avg_embedding = np.mean(embeddings, axis=0)
    return avg_embedding

def hyde_search(llm_model, encoder_model, query, collection, columns_to_answer,
                number_docs_retrieval, num_samples=10):
    # Generate hypothetical documents based on the query
    hypothetical_documents = generate_hypothetical_documents(llm_model, query, num_samples)

    # Print the generated hypothetical documents for debugging
    print("hypothetical_documents", hypothetical_documents)

    # Encode the documents to get their embeddings and aggregate them
    aggregated_embedding = encode_hypothetical_documents(hypothetical_documents, encoder_model)

    # Perform a search in the collection using the aggregated embedding;
    # wrap the single averaged vector in a list, as Chroma expects a batch of embeddings
    search_results = collection.query(
        query_embeddings=[aggregated_embedding.tolist()],
        n_results=number_docs_retrieval)  # Fetch top results
    search_result = ""

    metadatas = search_results['metadatas']
    for i, meta in enumerate(metadatas[0], start=1):
        search_result += f"\n{i})"
        for column in columns_to_answer:
            if column in meta:
                search_result += f" {column.capitalize()}: {meta.get(column)}"
        search_result += "\n"
    return metadatas, search_result

Table 16: HyDE search process.


5.1.8 Interactive Chatbot
Set up a chatbot interface for users to input questions, query data from ChromaDB, and receive responses from the LLM, helping users find UIT enrollment information with accurate semantic retrieval and answering.
● Show the conversation history.

● Get questions from users.

● Query data using a search algorithm (HyDE or Vector Search).

● Generate responses from the LLM based on the retrieved data.

# --- Chat Interface ---
st.header("Interactive Chatbot")

if "chat_history" not in st.session_state:
    st.session_state.chat_history = []

for message in st.session_state.chat_history:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

if prompt := st.chat_input("How can I assist you today?"):
    # Append the prompt to the chat history
    st.session_state.chat_history.append({"role": USER, "content": prompt})
    with st.chat_message(USER):
        st.markdown(prompt)

    # Save the user's question to a text file
    with open("user_questions.txt", "a", encoding="utf-8") as file:
        file.write(f"Question user: {prompt}\n")

    with st.chat_message(ASSISTANT):
        if st.session_state.collection:
            metadatas, retrieved_data = [], ""
            if st.session_state.columns_to_answer:
                search_func = hyde_search if st.session_state.search_option == "Hyde Search" else vector_search
                model = st.session_state.llm_model if st.session_state.llm_type == ONLINE_LLM else None

                if st.session_state.search_option == "Vector Search":
                    metadatas, retrieved_data = vector_search(
                        st.session_state.embedding_model,
                        prompt,
                        st.session_state.collection,
                        st.session_state.columns_to_answer,
                        st.session_state.number_docs_retrieval
                    )
                    # Prompt (in Vietnamese): politely refuse if the answer is unrelated to
                    # UIT admissions; otherwise answer from the retrieved data
                    enhanced_prompt = """
                    Nếu câu trả lời sau không liên quan đến tuyển sinh trường Đại Học Công Nghệ
                    Thông Tin - Đại Học Quốc Gia Thành Phố HCM (UIT) thì lịch sự từ chối trả lời.
                    Câu hỏi của người dùng là: "{}". Trả lời nó dựa trên dữ liệu được truy xuất
                    sau đây: \n{}""".format(prompt, retrieved_data)

                elif st.session_state.search_option == "Hyde Search":
                    if st.session_state.llm_type == ONLINE_LLM:
                        metadatas, retrieved_data = search_func(
                            model,
                            st.session_state.embedding_model,
                            prompt,
                            st.session_state.collection,
                            st.session_state.columns_to_answer,
                            st.session_state.number_docs_retrieval,
                            num_samples=1 if st.session_state.search_option == "Hyde Search" else None
                        )
                        enhanced_prompt = """
                        Nếu câu trả lời sau không liên quan đến tuyển sinh trường Đại Học Công Nghệ
                        Thông Tin - Đại Học Quốc Gia Thành Phố HCM (UIT) thì lịch sự từ chối trả lời.
                        Câu hỏi của người dùng là: "{}". Trả lời nó dựa trên dữ liệu được truy xuất
                        sau đây: \n{}""".format(prompt, retrieved_data)

                if metadatas:
                    flattened_metadatas = [item for sublist in metadatas for item in sublist]
                    metadata_df = pd.DataFrame(flattened_metadatas)
                    st.sidebar.subheader("Retrieval data")
                    st.sidebar.dataframe(metadata_df)
                    st.sidebar.subheader("Full prompt for LLM")
                    st.sidebar.markdown(enhanced_prompt)
                else:
                    st.sidebar.write("No metadata to display.")

                if st.session_state.llm_model:
                    response = st.session_state.llm_model.generate_content(enhanced_prompt)
                    with open("user_questions.txt", "a", encoding="utf-8") as file:
                        file.write(f"Answer of Chatbot: {response}\n")
                    st.markdown(response)
                    st.session_state.chat_history.append({"role": ASSISTANT, "content": response})
                else:
                    st.warning("Please select a model to run.")
            else:
                st.warning("Please select columns for the chatbot to answer from.")

Table 17: Processing the user query and sending the response.
5.1.9 Login and logout from admin
Allow users to sign out of the app, check authentication and redirect to the login page if the user is not logged in, and secure and manage user sessions in the chatbot.
● Logout button: set the login status to False and go back to the login page.

● Authentication check: make sure only logged-in users stay on the page.
# Logout button
if st.sidebar.button("Logout"):
    st.session_state.authenticated = False
    st.success("Đã đăng xuất thành công!")  # "Logged out successfully!"
    st.switch_page("app.py")  # Back to the login/register page

st.sidebar.title("Docs Retrieved")

if "authenticated" not in st.session_state or not st.session_state.authenticated:
    st.switch_page("app.py")  # Go to the login page

Table 18: Logging out of the system.
Create a user registration and login system with password hashing and secure storage in ChromaDB, managing users in the application while ensuring security and effective retrievability.
Main functions:
● Register:
o Check whether the account already exists.
o Hash the password with bcrypt and save it to ChromaDB.
● Login:
o Find the account.
o Check that the entered password matches the stored hash.
# Connect to ChromaDB and create a collection
def get_chroma_client():
    client = chromadb.PersistentClient(path="db_auth/auth_db")
    return client

# Create or get the user collection (adjusted for consistency)
def get_user_collection():
    client = get_chroma_client()
    return client.get_or_create_collection(name="user_authen")

# Register a user, checking whether the account already exists
def register_user(username, password):
    collection = get_user_collection()
    results = collection.query(query_texts=[username], n_results=1)
    if results['ids'] and len(results['ids'][0]) > 0:
        return False  # Account already exists
    hashed_pw = bcrypt.hashpw(password.encode('utf-8'), bcrypt.gensalt())

    # Add the user to the ChromaDB collection
    collection.add(
        ids=[username],
        documents=[username],  # Add documents to avoid missing-information errors
        metadatas=[{"username": username, "password": hashed_pw.decode('utf-8')}]
    )
    return True

# Authenticate login
def authenticate_user(username, password):
    collection = get_user_collection()
    results = collection.query(query_texts=[username], n_results=1)

    # Check if the result is empty
    if not results['ids'] or len(results['ids'][0]) == 0:
        return False  # Account not found

    # Get the stored password hash and verify with bcrypt
    stored_pw = results['metadatas'][0][0]['password']
    return bcrypt.checkpw(password.encode('utf-8'), stored_pw.encode('utf-8'))

Table 19: Registering and logging into the system.
5.1.10 Assessment from user
Allow users to review the chatbot and save the review to a file, collecting feedback from users to improve the chatbot in the future.
● Create an input interface and save the review.

● Save the review to a user_reviews.txt file.

● Display a response message (success or error) depending on the situation.

st.sidebar.header("Đánh giá Chatbot")  # "Chatbot review"
review = st.sidebar.text_area("Viết đánh giá của bạn tại đây:")  # "Write your review here:"

if st.sidebar.button("Lưu đánh giá"):  # "Save review"
    if review.strip():
        with open("user_reviews.txt", "a", encoding="utf-8") as review_file:
            review_file.write(f"Review of user: {review}\n")
        st.sidebar.success("Đánh giá của bạn đã được lưu!")  # "Your review has been saved!"
    else:
        st.sidebar.error("Vui lòng nhập nội dung đánh giá trước khi lưu!")  # "Please enter a review before saving!"

Table 20: Assessment from the user.
5.2 Assessment results
5.2.1 Comprehensive system assessment
To evaluate the models' ability to answer questions, we tested them directly by grading the answers to the prepared questions, assessing response capability according to the following criteria:
• Total samples: total number of questions asked
• Samples passed: number of questions answered
• Samples not passed: number of questions not answered
• Pass rate: percentage of questions answered
• Non-pass rate: percentage of questions not answered
• Samples passed with high confidence: number of questions answered with high accuracy
• Pass rate with high confidence: percentage of answered questions answered with high accuracy
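The two rates follow directly from these counts (consistent with the figures in the table below):

$$\text{Pass rate} = \frac{\text{Samples passed}}{\text{Total samples}} \times 100\%,\qquad \text{Pass rate with high confidence} = \frac{\text{Samples passed with high confidence}}{\text{Samples passed}} \times 100\%$$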
                                      keepitreal/vietnamese-sbert      sentence-transformers/all-MiniLM-L6-v2
Assessment                            RAG with        RAG with         RAG with        RAG with
                                      Vector search   HYDE search      HYDE search     Vector search
Total samples                         120             120              100             100
Samples passed                        120             117              60              89
Samples not passed                    0               3                40              11
Pass rate                             100%            97.50%           60%             89%
Non-pass rate                         0%              2.50%            40%             11%
Samples passed with high confidence   118             117              60              88
Pass rate with high confidence        98.33%          100%             100%            98.88%

Table 21: Comprehensive system assessment.
Result:
● The keepitreal/vietnamese-sbert model is superior in accuracy and reliability to sentence-transformers/all-MiniLM-L6-v2.

● Vector Search should be preferred because of its high efficiency and stability.

● HYDE Search needs to be improved or optimized to reduce the fail rate, especially with sentence-transformers/all-MiniLM-L6-v2.
5.2.2 Evaluation of the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) score
 Use when: comparing the chatbot output with a reference answer; well suited to text summarization.
 Main metrics: ROUGE-1, ROUGE-2, ROUGE-L.
o Precision = ratio of correct words to the total words generated by the chatbot
o Recall = ratio of correct words to the total words in the reference sentence
o F1-Score = harmonic mean of Precision and Recall
 Measures the word overlap between the chatbot output and the reference sentence.
 Evaluates surface structure only; it does not evaluate the actual semantics of the sentence.
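For ROUGE-N, writing overlap_N for the number of N-grams shared by the chatbot answer and the reference:

$$P = \frac{\mathrm{overlap}_N}{\#\text{N-grams in chatbot answer}},\quad R = \frac{\mathrm{overlap}_N}{\#\text{N-grams in reference}},\quad F_1 = \frac{2PR}{P+R}$$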
import pandas as pd
from rouge import Rouge

def calculate_rouge(file_path):
    print(f"Đang xử lý file: {file_path}")  # "Processing file: ..."
    df = pd.read_csv(file_path)
    if "Câu trả lời" not in df.columns or "Answer" not in df.columns:
        # The CSV file must contain both the 'Câu trả lời' and 'Answer' columns
        raise ValueError("File CSV phải chứa cột 'Câu trả lời' và 'Answer'.")
    df = df.dropna(subset=["Câu trả lời", "Answer"])
    reference_answers = df["Câu trả lời"].astype(str).tolist()
    chatbot_answers = df["Answer"].astype(str).tolist()
    rouge = Rouge()
    rouge_scores = {
        "rouge-1": {"f": [], "p": [], "r": []},
        "rouge-2": {"f": [], "p": [], "r": []},
        "rouge-l": {"f": [], "p": [], "r": []}
    }
    for ref, hyp in zip(reference_answers, chatbot_answers):
        try:
            scores = rouge.get_scores(hyp, ref)[0]
            for metric in rouge_scores:
                for sub_metric in rouge_scores[metric]:
                    rouge_scores[metric][sub_metric].append(scores[metric][sub_metric])
        except Exception as e:
            print(f"Lỗi khi tính ROUGE cho: {hyp} vs {ref}. Lỗi: {e}")  # error while scoring this pair
            continue
    results = {"metric": [], "sub_metric": [], "average": []}
    for metric in rouge_scores:
        for sub_metric in rouge_scores[metric]:
            if len(rouge_scores[metric][sub_metric]) > 0:
                avg_score = sum(rouge_scores[metric][sub_metric]) / len(rouge_scores[metric][sub_metric])
            else:
                avg_score = 0.0
            results["metric"].append(metric)
            results["sub_metric"].append(sub_metric)
            results["average"].append(avg_score)
    results_df = pd.DataFrame(results)
    print("Trung bình cộng điểm ROUGE:")  # "Average ROUGE scores:"
    print(results_df)
    # Export to CSV
    output_file = "rouge_scores_output.csv"
    results_df.to_csv(output_file, index=False)
    print(f"Kết quả điểm ROUGE đã được lưu vào file: {output_file}")  # results saved to file

# Path to the CSV file
file_path = input("Nhập đường dẫn tới file CSV: ")  # "Enter the path to the CSV file:"
calculate_rouge(file_path)

# /home/quang-anh/Bản tải về/all-MiniLM-L6-v2 + Vector - Sheet1.csv
# /home/quang-anh/Bản tải về/all-MiniLM-L6-v2 + HYDE - Sheet1.csv
# /home/quang-anh/Bản tải về/SBert_Vector - Sheet1.csv
# /home/quang-anh/Bản tải về/SBert_HYDE - Sheet1.csv

Table 22: Calculating the ROUGE score.


5.2.2.1 keepitreal/vietnamese-sbert + Vector search + GEMINI "learnlm-1.5-pro-experimental"

Metric     Precision (P)   Recall (R)   F1-Score (F)
ROUGE-1    0.591           0.668        0.546
ROUGE-2    0.341           0.482        0.384
ROUGE-L    0.471           0.641        0.525

Table 23: keepitreal/vietnamese-sbert + Vector search + GEMINI "learnlm-1.5-pro-experimental"
Result:
● Highest efficiency: ROUGE-1 shows good word reproduction and structure.

● Limitations: ROUGE-2 is not optimal; consecutive phrases need to be improved.
5.2.2.2 keepitreal/vietnamese-sbert + HYDE search + GEMINI "learnlm-1.5-pro-experimental"

Metric     Precision (P)   Recall (R)   F1-Score (F)
ROUGE-1    0.385           0.455        0.388
ROUGE-2    0.221           0.273        0.224
ROUGE-L    0.356           0.42         0.359

Table 24: keepitreal/vietnamese-sbert + HYDE search + GEMINI "learnlm-1.5-pro-experimental"
Result:
● Decent performance: ROUGE-1 is decent, but inferior to Vector Search.

● Drawbacks: ROUGE-2 and ROUGE-L are lower; HYDE Search is not optimal.

5.2.2.3 sentence-transformers/all-MiniLM-L6-v2 + Vector search + GEMINI "learnlm-1.5-pro-experimental"

Metric     Precision (P)   Recall (R)   F1-Score (F)
ROUGE-1    0.549           0.65         0.567
ROUGE-2    0.417           0.528        0.445
ROUGE-L    0.53            0.63         0.549

Table 25: sentence-transformers/all-MiniLM-L6-v2 + Vector search + GEMINI "learnlm-1.5-pro-experimental"
Result:
● Good effect: ROUGE-1 shows good coverage but limited precision.

● Limitations: ROUGE-2 is weak.

5.2.2.4 sentence-transformers/all-MiniLM-L6-v2 + HYDE search + GEMINI "learnlm-1.5-pro-experimental"

Metric     Precision (P)   Recall (R)   F1-Score (F)
ROUGE-1    0.382           0.454        0.386
ROUGE-2    0.219           0.273        0.223
ROUGE-L    0.354           0.42         0.358

Table 26: sentence-transformers/all-MiniLM-L6-v2 + HYDE search + GEMINI "learnlm-1.5-pro-experimental"
Result:
● Least efficient: ROUGE-1, ROUGE-2, and ROUGE-L are all low.

● Drawbacks: HYDE Search does not work well with this model.

Overall:
⇒ Best model: keepitreal/vietnamese-sbert + Vector Search.
⇒ Needs much improvement: sentence-transformers/all-MiniLM-L6-v2 + HYDE Search.
5.2.3 Document retrieval evaluation score

Model                                     Search Method   Retrieved
keepitreal/vietnamese-sbert               Vector Search   120/120
                                          HYDE Search     117/120
sentence-transformers/all-MiniLM-L6-v2    Vector Search   87/100
                                          HYDE Search     54/100

Table 27: Document retrieval evaluation score
Result:
● Best fit: keepitreal/vietnamese-sbert with Vector Search.
● Needs improvement: HYDE Search with both models.

5.3 User interface


5.3.1 User interface for admin
5.3.1.1 Admin login page

Figure 16: Admin login page

Action: Admin login and logout


5.3.1.2 Admin registration page

Figure 17: Admin registration page

Action: Register admin


5.3.1.3 Chatbot customization page for admin
Figure 18: Chatbot customization page for admin

Action: Fine-tune the chatbot; the embedding model can be switched between "vietnamese-sbert" and "all-MiniLM-L6-v2", and the search method between Vector search and HYDE search.
5.3.1.4 Collection management for chatbot

Figure 19: Load Collection into Chatbot for use, delete collection
Action: Load collection into chatbot for use, delete collection
Figure 20: Create a collection to use for saving data
Action: Create a collection to use for saving data

5.3.1.5 Interact with Chatbot

Figure 21: Interact with the chatbot


Action: Interact with the chatbot, view the documents retrieved during the retrieval process, and see the prompt transmitted to the Gemini LLM.
Conclusion of Action for Admin
● Action: Admin login and logout

● Action: Register admin

● Action: Fine-tune the chatbot; the embedding model can be switched between "vietnamese-sbert" and "all-MiniLM-L6-v2", and the search method between Vector search and HyDE search.

● Action: Load a collection into the chatbot for use, or delete a collection.

● Action: Create a collection to use for saving data.

● Action: Interact with the chatbot, view the documents retrieved during the retrieval process, and see the prompt transmitted to the Gemini LLM.
5.3.2 User Interface for User
5.3.2.1 Chatbot Interaction and Reviews

Figure 22: Chatbot Interaction and Reviews


Action: Users can interact with the chatbot and write reviews, comments, and
questions to improve the chatbot.
Conclusion of action for user
● Action: Users can interact with the chatbot and write reviews, comments, and
questions to improve the chatbot.
[1] Semantic chunking: https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb
[2] Vector search: https://github.com/currentslab/awesome-vector-search
[3] HyDE search: https://github.com/texttron/hyde/blob/main/hyde-demo.ipynb
[4] ROUGE: https://thepythoncode.com/code/calculate-rouge-score-in-python
