Thesis Final - Pham Dung - Quang Anh - Ver2
2. Data Processing:
Clean the question and answer columns, remove duplicates, and save the processed
data to st.session_state.
# --- Preprocessing ---
if not df.empty:
    # Clean both text columns with the preprocess_text function
    df["Câu hỏi"] = df["Câu hỏi"].apply(preprocess_text)          # "Câu hỏi" = question
    df["Câu trả lời"] = df["Câu trả lời"].apply(preprocess_text)  # "Câu trả lời" = answer
    st.session_state.df = df
    st.dataframe(df)
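The listing above calls preprocess_text but does not show it, and the deduplication step is only described in prose. The sketch below is one possible shape for both steps, not the thesis implementation: the cleaning rules inside preprocess_text and the helper clean_and_deduplicate are assumptions, and plain Python lists stand in for the DataFrame so the example runs without dependencies.

```python
import re
import unicodedata


def preprocess_text(text):
    """Hypothetical cleaner; the real preprocess_text is not shown in the thesis."""
    if not isinstance(text, str):
        return ""
    text = unicodedata.normalize("NFC", text)  # unify Vietnamese diacritic encodings
    text = re.sub(r"<[^>]+>", " ", text)       # drop leftover HTML tags
    text = re.sub(r"\s+", " ", text)           # collapse runs of whitespace
    return text.strip()


def clean_and_deduplicate(rows):
    """Clean both columns, then keep the first copy of each (question, answer) pair."""
    seen = set()
    cleaned = []
    for row in rows:
        question = preprocess_text(row.get("Câu hỏi"))
        answer = preprocess_text(row.get("Câu trả lời"))
        key = (question.lower(), answer.lower())
        if not question or key in seen:
            continue  # skip empty questions and repeated pairs
        seen.add(key)
        cleaned.append({"Câu hỏi": question, "Câu trả lời": answer})
    return cleaned


rows = [
    {"Câu hỏi": "  Học phí   bao nhiêu? ", "Câu trả lời": "<p>Xem thông báo.</p>"},
    {"Câu hỏi": "Học phí bao nhiêu?", "Câu trả lời": "Xem thông báo."},
]
cleaned_rows = clean_and_deduplicate(rows)
print(cleaned_rows)
```

The two sample rows differ only in whitespace and markup, so after cleaning they collapse into a single record.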
3. Chunking:
Split the answer column into semantic chunks with the split_text function
and save the chunking result to st.session_state.chunks_df.
chunk_records = []
# Iterate over each row in the dataframe
for _, row in df.iterrows():
    selected_column_value = row[index_column]
    # Check if the value is a non-empty string
    if isinstance(selected_column_value, str) and selected_column_value:
        chunker = SemanticChunker(embedding_type="tfidf")
        chunks = chunker.split_text(selected_column_value)  # split into semantic chunks
        for chunk in chunks:
            # Keep the original row fields and add the chunk text
            chunk_records.append({**row.to_dict(), 'chunk': chunk})
st.session_state.chunks_df = pd.DataFrame(chunk_records)
if self.embedding_type == "tfidf":
    # Fit a TfidfVectorizer to the input sentences and transform them in one step
    tfidf_matrix = TfidfVectorizer().fit_transform(sentences)
    # Convert the sparse matrix to a dense array (one row per sentence)
    return tfidf_matrix.toarray()
else:
    raise ValueError("Unsupported embedding type")
Table 3: Converting sentences to TF-IDF vectors.
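To make the TF-IDF step concrete, the sketch below computes TF-IDF vectors in plain Python. It assumes the smoothed IDF formula ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, which is what scikit-learn's TfidfVectorizer uses by default; the tokenization here is a simplification, so the numbers are illustrative rather than guaranteed to match the library output. The function name tfidf_vectors is hypothetical.

```python
import math
import re
from collections import Counter


def tfidf_vectors(sentences):
    """Return (vocabulary, rows); each row is an L2-normalized TF-IDF vector."""
    # Simplified tokenizer: lowercase words of at least two characters
    tokenized = [re.findall(r"\w\w+", s.lower()) for s in sentences]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(sentences)
    # Document frequency: in how many sentences each term appears
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF (assumed to mirror scikit-learn's default)
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocabulary}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        row = [counts[tok] * idf[tok] for tok in vocabulary]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocabulary, rows


vocabulary, rows = tfidf_vectors(["tuition fee 2024", "tuition scholarship", "dormitory rules"])
for sentence_vector in rows:
    print([round(v, 3) for v in sentence_vector])
```

Terms that occur in fewer sentences receive a larger weight, so "fee" outweighs "tuition" in the first sentence.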
5.1.3.3 Semantic Chunking
Chunking: The process of partitioning text into smaller, semantically meaningful
portions (chunks), which makes the information easier to manage and analyze. In
semantic chunking [1], sentences or phrases are grouped according to their meaning
rather than their structure.
Cosine Similarity: Cosine similarity is the cosine of the angle between two
vectors. It is used here to measure how similar two sentences are.
Thresholding: Sentences (chunks) are merged into a new chunk if their cosine
similarity exceeds a specified threshold (threshold=0.3). The threshold is the level
of similarity required for sentences to be treated as similar. In the admissions data,
the chunks of each answer have fairly low cosine similarity even when they are
semantically related, so 0.3 works well for this data set.
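The cosine similarity and thresholding ideas can be sketched in a few lines of plain Python. The merge rule used here (compare each sentence with the one before it and merge when the similarity exceeds the threshold) is an assumption chosen for illustration; the thesis code may group sentences differently. The names cosine and group_by_threshold are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def group_by_threshold(sentences, vectors, threshold=0.3):
    """Assumed rule: merge a sentence into the current chunk when its
    similarity to the previous sentence exceeds the threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine(vectors[i - 1], vectors[i]) > threshold:
            chunks[-1].append(sentences[i])
        else:
            chunks.append([sentences[i]])
    return [" ".join(chunk) for chunk in chunks]


sentences = ["A.", "B.", "C."]
vectors = [[1, 0, 0], [1, 1, 0], [0, 0, 1]]
chunks = group_by_threshold(sentences, vectors)
print(chunks)  # ['A. B.', 'C.']
```

The first two vectors have a similarity of about 0.71 and are merged, while the third is orthogonal to the second (similarity 0) and starts a new chunk.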
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

class SemanticChunker(BaseChunker):
    def __init__(self, threshold=0.3, embedding_type="tfidf",
                 model="keepitreal/vietnamese-sbert"):
        self.threshold = threshold
        self.embedding_type = embedding_type
        self.model = model
        nltk.download("punkt", quiet=True)

    def embed_function(self, sentences):
        if self.embedding_type == "tfidf":
            # Generate TF-IDF embeddings for the sentences
            tfidf_matrix = TfidfVectorizer().fit_transform(sentences)
            return tfidf_matrix.toarray()
        else:
            raise ValueError("Unsupported embedding type")

    def split_text(self, text):
        # Tokenize the text into sentences
        sentences = nltk.sent_tokenize(text)
        sentences = [item for item in sentences if item and item.strip()]
        if not sentences:
            return []
        vectors = self.embed_function(sentences)
        # Compute cosine similarity between sentence embeddings
        similarities = cosine_similarity(vectors)
        # Start the first chunk with the first sentence
        chunks = [[sentences[0]]]
● Split into batches: The data is split into batches of 256 records. A batch size
of 256 was chosen because it stored the data more smoothly and faster than
other batch sizes.
● Save in batches: Each batch is saved to ChromaDB while the progress is displayed.
except Exception as e:
    # Show error message if something goes wrong
    st.error(f"Error saving data to Chroma: {e}")
Table 3: Process of saving to the vector database.
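The batching steps described above can be sketched as follows. This is a minimal illustration, not the thesis code: save_batch is a hypothetical callable standing in for the ChromaDB write, and the returned list of fractions stands in for the progress display.

```python
def iter_batches(records, batch_size=256):
    """Yield consecutive slices of at most batch_size records."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


def save_in_batches(records, save_batch, batch_size=256):
    """Save records batch by batch and return the progress after each batch.

    save_batch is a hypothetical stand-in for the call that writes one
    batch to the vector database.
    """
    total = len(records)
    saved = 0
    progress = []
    for batch in iter_batches(records, batch_size):
        save_batch(batch)
        saved += len(batch)
        progress.append(saved / total)  # fraction completed so far
    return progress


stored = []
progress = save_in_batches(list(range(600)), stored.extend)
print([round(p, 2) for p in progress])  # [0.43, 0.85, 1.0]
```

With 600 records the loop produces two full batches of 256 and a final batch of 88.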
5.1.4.1 Sentence Embeddings Utilizing Pre-trained Models
Sentence Embedding: In contrast to conventional word embeddings (e.g.,
Word2Vec, GloVe), sentence embeddings represent complete sentences as vectors in a
dense vector space. These embeddings capture the contextual meaning of sentences.
SBERT (Sentence-BERT): SBERT is a modified version of the BERT model
[65] that has been fine-tuned specifically to generate sentence-level
embeddings. BERT (Bidirectional Encoder Representations from Transformers) is a
transformer-based model designed to capture the contextual meaning of
words within a given sentence.
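One common way SBERT-style models turn per-token vectors into a single sentence vector is mean pooling: the token vectors are averaged while padding positions are ignored. The sketch below illustrates only that pooling step with made-up numbers; whether keepitreal/vietnamese-sbert uses exactly this pooling is an assumption, and mean_pool is a hypothetical helper.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask value is 1 (padding is ignored)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for j, value in enumerate(vector):
                total[j] += value
    if count == 0:
        return total  # no real tokens: return the zero vector
    return [value / count for value in total]


# Three token positions with 2-dimensional vectors; the last one is padding
token_vectors = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
sentence_vector = mean_pool(token_vectors, [1, 1, 0])
print(sentence_vector)  # [2.0, 3.0]
```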
# Initialize language first
if "language" not in st.session_state:
    st.session_state.language = VIETNAMESE
class OnlineLLMs(LLM):
    def __init__(self, name, api_key=None, model_version=None):
        # Initialize the model with the given name, API key, and model version
        self.name = name
        self.model = None
        self.model_version = model_version
        # Configure the model if the name is GEMINI and an API key is provided
        if self.name.lower() == GEMINI and api_key:
            genai.configure(
                api_key=api_key  # Set the API key for the generative AI service
            )
            # Create the GenerativeModel object with the specified model version
            self.model = genai.GenerativeModel(
                model_name=model_version  # Pass the model version
            )

    def set_model(self, model):
        # Set the model to a custom model
        self.model = model
        try:
            # Generate content using the model by passing the system prompt and messages
            response = self.model.generate_content(
                [
                    {"role": "user", "parts": system_prompt},  # The system's message
                    # Acknowledgment by the model
                    {"role": "model", "parts": "I understand. I will strictly follow your instruction!"},
                    *messages  # Append the processed messages
                ],
                generation_config=genai.types.GenerationConfig(
                    max_output_tokens=max_tokens,  # Set the maximum number of tokens for the response
                    temperature=temperature  # Control the randomness of the model's output
                )
            )
            # Return the model's response text
            return response.text
        except Exception as e:
            # In case of an error, print the error message and retry
            print(f"Error occurred: {e}, retrying...")
            raise  # Re-raise the exception so the call can be retried
    else:
        # Raise an error if the model name is unknown
        raise ValueError(f"Unknown model name: {self.name}")
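The listing re-raises the exception "to be retried", but the retry mechanism itself is not shown. A minimal sketch of one way to retry a failing call with exponential backoff is given below; call_with_retry, max_attempts, and base_delay are hypothetical names, and the thesis code may rely on a retry library instead.

```python
import time


def call_with_retry(func, max_attempts=3, base_delay=1.0):
    """Call func(); on failure wait base_delay * 2**(attempt - 1) seconds and try again."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as error:
            last_error = error
            if attempt < max_attempts:
                print(f"Error occurred: {error}, retrying ({attempt}/{max_attempts})...")
                time.sleep(base_delay * 2 ** (attempt - 1))
    # All attempts failed: surface the last error to the caller
    raise last_error


attempts = {"count": 0}


def flaky_generate():
    """Stand-in for an LLM call that fails twice before succeeding."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"


result = call_with_retry(flaky_generate, base_delay=0)
print(result)  # ok
```

Capping the number of attempts keeps a persistent failure, such as an invalid API key, from looping forever.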
metadatas = search_results['metadatas']
i = 0
# Build a numbered, human-readable summary of the retrieved records
for meta in metadatas[0]:
    i += 1
    search_result += f"\n{i})"
    for column in columns_to_answer:
        if column in meta:
            search_result += f" {column.capitalize()}: {meta.get(column)}"
    search_result += "\n"
return metadatas, search_result
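To show what this formatting step produces, the sketch below wraps the same loop in a standalone function and runs it on a small sample. The function name format_search_results and the sample records are illustrative assumptions; the nested-list shape of 'metadatas' follows the metadatas[0] access in the listing.

```python
def format_search_results(search_results, columns_to_answer):
    """Turn retrieved metadata into a numbered text block for the prompt."""
    metadatas = search_results["metadatas"]
    search_result = ""
    for i, meta in enumerate(metadatas[0], start=1):
        search_result += f"\n{i})"
        for column in columns_to_answer:
            if column in meta:
                search_result += f" {column.capitalize()}: {meta.get(column)}"
        search_result += "\n"
    return metadatas, search_result


# Made-up sample in the shape used by the listing: one inner list per query
sample = {
    "metadatas": [[
        {"chunk": "Tuition is announced yearly.", "source": "faq"},
        {"chunk": "Dormitory is available."},
    ]]
}
metadatas, text = format_search_results(sample, ["chunk"])
print(text)
```

Only the columns listed in columns_to_answer appear in the text, so the "source" field of the first record is left out.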
with st.chat_message(ASSISTANT):
    if st.session_state.collection:
        metadatas, retrieved_data = [], ""
        if st.session_state.columns_to_answer:
            search_func = hyde_search if st.session_state.search_option == "Hyde Search" else vector_search
            model = st.session_state.llm_model if st.session_state.llm_type == ONLINE_LLM else None
            # Prompt (in Vietnamese): "If the following question is not related to admissions at the
            # University of Information Technology, VNU-HCM (UIT), politely decline to answer.
            # The user's question is: "{}". Answer it based on the following retrieved data: {}"
            enhanced_prompt = """
            Nếu câu hỏi sau không liên quan đến tuyển sinh trường Đại Học Công Nghệ Thông Tin Đại Học Quốc Gia Thành Phố HCM (UIT) thì lịch sự từ chối trả lời,
            Câu hỏi của người dùng là: "{}". Trả lời nó dựa trên dữ liệu được truy xuất sau đây: \n{} """.format(prompt, retrieved_data)
            if metadatas:
                # Show the retrieved records and the full prompt in the sidebar
                flattened_metadatas = [item for sublist in metadatas for item in sublist]
                metadata_df = pd.DataFrame(flattened_metadatas)
                st.sidebar.subheader("Retrieval data")
                st.sidebar.dataframe(metadata_df)
                st.sidebar.subheader("Full prompt for LLM")
                st.sidebar.markdown(enhanced_prompt)
            else:
                st.sidebar.write("No metadata to display.")
            if st.session_state.llm_model:
                response = st.session_state.llm_model.generate_content(enhanced_prompt)
                # Log the chatbot's answer
                with open("user_questions.txt", "a", encoding="utf-8") as file:
                    file.write(f"Answer of Chatbot: {response}\n")
                st.markdown(response)
                st.session_state.chat_history.append({"role": ASSISTANT, "content": response})
            else:
                st.warning("Please select a model to run.")
        else:
            st.warning("Please select columns for the chatbot to answer from.")
Table 3: Processing the user's query and sending the response.
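The listing chooses between hyde_search and vector_search without showing either function. The sketch below illustrates the difference between the two ideas under stated assumptions: plain vector search embeds the user's question, while HyDE first asks an LLM for a hypothetical answer and embeds that instead. The signatures, the toy embedding, the toy index, and the toy generator are all hypothetical and stand in for the real embedding model, ChromaDB, and Gemini.

```python
def vector_search(query, embed, index_search, n_results=3):
    """Embed the user's question and search with that vector."""
    return index_search(embed(query), n_results)


def hyde_search(query, generate, embed, index_search, n_results=3):
    """Embed an LLM-written hypothetical answer and search with that vector."""
    hypothetical_answer = generate(f"Write a short passage that answers: {query}")
    return index_search(embed(hypothetical_answer), n_results)


# --- Toy stand-ins so the sketch runs without external services ---
VOCAB = ["tuition", "fee", "dormitory", "deadline"]
DOCS = ["tuition fee per semester", "dormitory registration", "application deadline"]


def toy_embed(text):
    """Count how often each vocabulary term occurs in the text."""
    words = text.lower().split()
    return [words.count(term) for term in VOCAB]


def toy_index_search(vector, n_results):
    """Rank documents by dot product with the query vector."""
    scored = [(sum(a * b for a, b in zip(vector, toy_embed(doc))), doc) for doc in DOCS]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:n_results]]


def toy_generate(prompt):
    """Pretend LLM: always writes the same hypothetical answer."""
    return "the tuition fee is paid each semester"


top_vector = vector_search("how much is the tuition fee", toy_embed, toy_index_search, 1)
top_hyde = hyde_search("how much does it cost to study", toy_generate, toy_embed, toy_index_search, 1)
print(top_vector, top_hyde)
```

In the second call the question shares no vocabulary term with any document, yet the hypothetical answer does, which is the situation HyDE is meant to help with. The quality of the result then depends on the generated passage.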
5.1.9 Admin login and logout
Allow users to sign out of the app, check authentication, redirect users who are
not logged in to the login page, and secure and manage user sessions in the
chatbot.
● Logout button: Set the login status to False and go back to the login page.
● Authentication check: Make sure only logged-in users stay on the page.
# Logout button
if st.sidebar.button("Logout"):
    st.session_state.authenticated = False
    st.success("Đã đăng xuất thành công!")  # "Logged out successfully!"
    st.switch_page("app.py")  # Back to the login page
st.sidebar.title("Doc Retrievaled")
# Authenticate login
def authenticate_user(username, password):
    collection = get_user_collection()
    # Look up the closest stored record for this username
    results = collection.query(query_texts=[username], n_results=1)
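The authenticate_user listing is cut off after the lookup, so the comparison step is not visible. Because a similarity query returns the nearest stored record even when the username is not an exact match, a login check generally needs an exact username comparison plus a password check. The sketch below shows one conventional way to do that with salted hashes; the in-memory users dictionary, verify_login, and hash_password are hypothetical and do not describe how the thesis stores credentials.

```python
import hashlib
import hmac
import os


def hash_password(password, salt=None):
    """Return (salt, digest) using PBKDF2-HMAC-SHA256."""
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)
    return salt, digest


def verify_login(username, password, users):
    """users is a hypothetical dict mapping username -> (salt, digest)."""
    record = users.get(username)  # exact-match lookup, not a similarity search
    if record is None:
        return False
    salt, expected = record
    _, candidate = hash_password(password, salt)
    # Constant-time comparison avoids leaking how many bytes matched
    return hmac.compare_digest(candidate, expected)


users = {"admin": hash_password("s3cret")}
print(verify_login("admin", "s3cret", users))  # True
print(verify_login("admin", "wrong", users))   # False
```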
import pandas as pd
from rouge import Rouge

def calculate_rouge(file_path):
    print(f"Đang xử lý file: {file_path}")  # "Processing file: ..."
    df = pd.read_csv(file_path)
    # "Câu trả lời" holds the reference answers, "Answer" the chatbot answers
    if "Câu trả lời" not in df.columns or "Answer" not in df.columns:
        # "The CSV file must contain the columns 'Câu trả lời' and 'Answer'."
        raise ValueError("File CSV phải chứa cột 'Câu trả lời' và 'Answer'.")
    df = df.dropna(subset=["Câu trả lời", "Answer"])
    reference_answers = df["Câu trả lời"].astype(str).tolist()
    chatbot_answers = df["Answer"].astype(str).tolist()
    rouge = Rouge()
    rouge_scores = {
        "rouge-1": {"f": [], "p": [], "r": []},
        "rouge-2": {"f": [], "p": [], "r": []},
        "rouge-l": {"f": [], "p": [], "r": []}
    }
    # Score each chatbot answer against its reference answer
    for ref, hyp in zip(reference_answers, chatbot_answers):
        try:
            scores = rouge.get_scores(hyp, ref)[0]
            for metric in rouge_scores:
                for sub_metric in rouge_scores[metric]:
                    rouge_scores[metric][sub_metric].append(scores[metric][sub_metric])
        except Exception as e:
            # "Error while computing ROUGE for: ... Error: ..."
            print(f"Lỗi khi tính ROUGE cho: {hyp} vs {ref}. Lỗi: {e}")
            continue
    results = {
        "metric": [],
        "sub_metric": [],
        "average": []
    }
    # Average every metric over all scored pairs
    for metric in rouge_scores:
        for sub_metric in rouge_scores[metric]:
            values = rouge_scores[metric][sub_metric]
            avg_score = sum(values) / len(values) if values else 0.0
            results["metric"].append(metric)
            results["sub_metric"].append(sub_metric)
            results["average"].append(avg_score)
    results_df = pd.DataFrame(results)
    print("Trung bình cộng điểm ROUGE:")  # "Average ROUGE scores:"
    print(results_df)
    # Export the averages to CSV
    output_file = "rouge_scores_output.csv"
    results_df.to_csv(output_file, index=False)
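To make the reported precision, recall, and F1 values easier to interpret, the sketch below computes ROUGE-N by hand for one sentence pair. It is a simplified illustration with whitespace tokenization and clipped n-gram counts; the rouge package used in the listing may tokenize and count repeated n-grams differently, so its numbers can deviate from this sketch. The function name rouge_n is hypothetical.

```python
from collections import Counter


def rouge_n(hypothesis, reference, n=1):
    """Precision, recall and F1 of n-gram overlap between two texts."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    hyp, ref = ngrams(hypothesis), ngrams(reference)
    overlap = sum((hyp & ref).values())  # clipped count of shared n-grams
    precision = overlap / sum(hyp.values()) if hyp else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"p": precision, "r": recall, "f": f1}


scores = rouge_n("the fee is due in may", "the tuition fee is due in june")
print({key: round(value, 3) for key, value in scores.items()})
# {'p': 0.833, 'r': 0.714, 'f': 0.769}
```

Five of the six hypothesis words appear in the seven-word reference, giving a precision of 5/6 and a recall of 5/7.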
Metric     Precision (P)   Recall (R)   F1-Score (F)
ROUGE-1    0.385           0.455        0.388
ROUGE-2    0.221           0.273        0.224
ROUGE-L    0.356           0.42         0.359
Table 5: keepitreal/vietnamese-sbert + HYDE search + GEMINI "learnlm-1.5-pro-experimental"
Result:
● Decent performance: ROUGE-1 is decent, but lower than with Vector Search.
● Drawbacks: ROUGE-2 and ROUGE-L are lower, indicating that HYDE Search is not
optimal.
Metric     Precision (P)   Recall (R)   F1-Score (F)
ROUGE-1    0.549           0.65         0.567
ROUGE-2    0.417           0.528        0.445
ROUGE-L    0.53            0.63         0.549
Table 6: sentence-transformers/all-MiniLM-L6-v2 + Vector search + GEMINI "learnlm-1.5-pro-experimental"
Result:
● Good effect: ROUGE-1 shows good coverage (recall 0.65) but lower precision (0.549).
Metric     Precision (P)   Recall (R)   F1-Score (F)
ROUGE-1    0.382           0.454        0.386
ROUGE-2    0.219           0.273        0.223
ROUGE-L    0.354           0.42         0.358
Table 7: sentence-transformers/all-MiniLM-L6-v2 + HYDE search + GEMINI "learnlm-1.5-pro-experimental"
Result:
● Least effective: ROUGE-1, ROUGE-2, and ROUGE-L are all low.
Overall:
⇒ Best model: keepitreal/vietnamese-sbert + Vector Search.
⇒ Most in need of improvement: sentence-transformers/all-MiniLM-L6-v2 +
HYDE Search.
5.2.3 Document retrieval evaluation score
Model                                    Search Method   Retrieved
keepitreal/vietnamese-sbert              Vector Search   120/120
keepitreal/vietnamese-sbert              HYDE Search     117/120
sentence-transformers/all-MiniLM-L6-v2   Vector Search   87/100
sentence-transformers/all-MiniLM-L6-v2   HYDE Search     54/100
Table 8: Document retrieval evaluation score
Result:
● Best fit: keepitreal/vietnamese-sbert with Vector Search.
● Needs improvement: HYDE Search with both models.
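Because the two models were evaluated on different numbers of questions (120 and 100), the raw counts are easier to compare as hit rates. The short sketch below converts the counts from Table 8 into fractions; the variable names are illustrative.

```python
# (retrieved, total) pairs copied from Table 8
retrieval_counts = {
    ("keepitreal/vietnamese-sbert", "Vector Search"): (120, 120),
    ("keepitreal/vietnamese-sbert", "HYDE Search"): (117, 120),
    ("sentence-transformers/all-MiniLM-L6-v2", "Vector Search"): (87, 100),
    ("sentence-transformers/all-MiniLM-L6-v2", "HYDE Search"): (54, 100),
}

# Hit rate = retrieved / total for each model and search method
hit_rates = {key: hits / total for key, (hits, total) in retrieval_counts.items()}
best = max(hit_rates, key=hit_rates.get)

for (model, method), rate in hit_rates.items():
    print(f"{model} + {method}: {rate:.1%}")
print("Best:", best)
```

The rates are 100.0%, 97.5%, 87.0%, and 54.0%, which matches the ranking given in the result list.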
Action: Tune the chatbot by choosing the embedding model ("vietnamese-sbert" or
"all-MiniLM-L6-v2") and the search method (Vector search or HYDE search).
5.3.1.4 Collection management for chatbot
Figure 19: Load a collection into the chatbot for use, or delete a collection
Action: Load a collection into the chatbot for use, or delete a collection.
Figure 20: Create a collection to store data
Action: Create a collection to store data.
● Action: Interact with the chatbot, view the documents retrieved during the
retrieval process, and see the prompt sent to the Gemini LLM.
5.3.2 User Interface for User
5.3.2.1 Chatbot Interaction and Reviews
[3] HYDE search: https://github.com/texttron/hyde/blob/main/hyde-demo.ipynb
[4] ROUGE: https://thepythoncode.com/code/calculate-rouge-score-in-python