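2 Semantic Search with Language Models
Now let's look more closely at the main categories of systems that can upgrade the search capabilities of language models. We start with dense retrieval and then move on to reranking and RAG.
Dense Retrieval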
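Recall that embeddings turn text into numeric representations. As Figure 8-4 shows, we can think of them as points in space. Points that are close together represent texts that are similar. In this example, text 1 and text 2 are more similar to each other (because they are near each other) than either is to text 3 (because it is farther away).
Figure 8-4. The intuition of embeddings: each text is a point, and texts with similar meaning are close to each other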
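This is the property used to build search systems. When a user enters a search query, we embed the query, which projects it into the same space as the text archive. We then simply find the documents nearest to the query in that space, and those documents are the search results (Figure 8-5).
Figure 8-5. Dense retrieval relies on the property that search queries will be close to their relevant results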
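The idea in Figure 8-5 can be sketched without any embedding model. The three-dimensional vectors below are made-up stand-ins for real embeddings (which usually have hundreds or thousands of dimensions), and cosine similarity is one common measure of closeness, assumed here purely for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings; a real model would produce these vectors.
documents = {
    "text 1": [0.9, 0.1, 0.0],
    "text 2": [0.7, 0.6, 0.1],
    "text 3": [0.0, 0.1, 0.9],
}
query = [0.6, 0.7, 0.1]  # hypothetical embedding of the search query

def rank(query_vec, docs):
    """Score every document against the query and sort best-first."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranking = rank(query, documents)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

With these toy vectors the ranking is text 2, then text 1, then text 3, mirroring the layout of the figure.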
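Judging by the distances in Figure 8-5, "text 2" is the best result for this query, followed by "text 1". Two questions can arise here, however:
- Should text 3 be returned as a result at all? That is a decision for the system designer. It is sometimes desirable to set a minimum similarity score (equivalently, a maximum distance) to filter out irrelevant results, in case the corpus contains nothing relevant to the query.
- Are a query and its best result semantically similar? Not always. This is why language models need to be trained on question-answer pairs to become better at retrieval. Chapter 10 explains this process in more detail.
Figure 8-6 shows how we chunk a document before embedding each chunk. The resulting embedding vectors are then stored in the vector database, ready for retrieval.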
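Figure 8-6. Converting an external knowledge base into a vector database. We can then query this vector database for information about the knowledge base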
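The chunk, embed, store, and query flow, together with a score threshold, can be sketched as follows. Note the assumptions: `toy_embed` is a hypothetical stand-in that just counts words (it captures no real semantics), the "vector store" is a plain Python list, and the `min_score` value of 0.3 is arbitrary. A real system would use an embedding model and a vector database instead.

```python
import math
from collections import Counter

def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# 1. Chunk the document (here: one chunk per sentence).
document = ("The film follows a group of astronauts. "
            "Principal photography began in late 2013.")
chunks = [c.strip() for c in document.split(".") if c.strip()]

# 2. Embed each chunk and keep the vector next to its text (the "vector store").
store = [(chunk, toy_embed(chunk)) for chunk in chunks]

def retrieve(query, min_score=0.3):
    """Return chunks scoring at least min_score, best first."""
    query_vec = toy_embed(query)
    scored = [(chunk, cosine_similarity(query_vec, vec)) for chunk, vec in store]
    kept = [(chunk, score) for chunk, score in scored if score >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

relevant = retrieve("astronauts in the film")
unrelated = retrieve("chocolate cake recipe")
print(relevant)
print(unrelated)
```

The second query shares nothing with the corpus, so the threshold leaves an empty result list rather than returning the least-bad chunk.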
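Dense retrieval example
Let's look at a dense retrieval example that uses Cohere to search the Wikipedia page for the film Interstellar. In this example, we will do the following:
- Get the text we want to make searchable and apply some light processing to chunk it into sentences.
- Embed the sentences.
- Build the search index.
- Search and see the results.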
Sign up at https://oreil.ly/GxrQ1 to get a Cohere API key, then paste it into the following code. You do not have to pay anything to run this example.
Import the libraries we need:
import cohere
import numpy as np
import pandas as pd
from tqdm import tqdm
# Paste your API key here. Remember to not share publicly
api_key = ''
# Create and retrieve a Cohere API key from os.cohere.ai
co = cohere.Client(api_key)
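Getting the text archive and chunking it
Let's use the first section of the Wikipedia article on the film Interstellar. We get the text and then split it into sentences: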
text = """
Interstellar is a 2014 epic science fiction film co-written, directed, and produced by Christopher Nolan. It stars Matthew McConaughey, Anne Hathaway, Jessica Chastain, Bill Irwin, Ellen Burstyn, Matt Damon, and Michael Caine. Set in a dystopian future where humanity is struggling to survive, the film follows a group of astronauts who travel through a wormhole near Saturn in search of a new home for mankind.
Brothers Christopher and Jonathan Nolan wrote the screenplay, which had its origins in a script Jonathan developed in 2007. Caltech theoretical physicist and 2017 Nobel laureate in Physics[4] Kip Thorne was an executive producer, acted as a scientific consultant, and wrote a tie-in book, The Science of Interstellar.
Cinematographer Hoyte van Hoytema shot it on 35 mm movie film in the Panavision anamorphic format and IMAX 70 mm. Principal photography began in late 2013 and took place in Alberta, Iceland, and Los Angeles.
Interstellar uses extensive practical and miniature effects and the company Double Negative created additional digital effects. Interstellar premiered on October 26, 2014, in Los Angeles. In the United States, it was first released on film stock, expanding to venues using digital projectors.
The film had a worldwide gross over $677 million (and $773 million with subsequent re-releases), making it the tenth-highest grossing film of 2014.
It received acclaim for its performances, direction, screenplay, musical score, visual effects, ambition, themes, and emotional weight.
It has also received praise from many astronomers for its scientific accuracy and portrayal of theoretical astrophysics. Since its premiere, Interstellar gained a cult following,[5] and now is regarded by many sci-fi experts as one of the best science-fiction films of all time.
Interstellar was nominated for five awards at the 87th Academy Awards, winning Best Visual Effects, and received numerous other accolades"""
# Split into a list of sentences
texts = text.split('.')
# Clean up to remove empty spaces and new lines
texts = [t.strip(' \n') for t in texts]
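Splitting on every period is the simplest possible chunker, but it also cuts inside decimal numbers and leaves an empty string behind when the text ends with a period. As a hedged alternative, the sketch below splits only where sentence-ending punctuation is followed by whitespace. The helper name `split_sentences` and the sample text are illustrative assumptions, and the heuristic still mis-splits abbreviations such as "Dr. Mann".

```python
import re

def split_sentences(text):
    """Split after ., ! or ? only when whitespace follows; drop empty pieces."""
    pieces = re.split(r"(?<=[.!?])\s+", text.strip())
    return [piece.strip() for piece in pieces if piece.strip()]

# Made-up sample containing a decimal number and a newline between sentences.
sample = "The film grossed $677.5 million. It premiered in 2014.\nCritics praised it."

sentences = split_sentences(sample)
naive = sample.split(".")

print(sentences)
print(naive)
```

On this sample the regular-expression version yields three sentences, while the naive split yields five pieces, including a fragment cut at the decimal point and a trailing empty string.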
Embedding the text chunks
Let's now embed the texts. We send them to the Cohere API and get back one vector for each text:
# Get the embeddings
response = co.embed(
    texts=texts,
    input_type="search_document",
).embeddings
embeds = np.array(response)
print(embeds.shape)
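The output is (15, 4096), which indicates that we have 15 vectors, each of size 4096.
Building the search index
Before we can search, we need to build a search index. An index stores the embeddings and is optimized to retrieve the nearest neighbors quickly, even when there is a very large number of points: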
import faiss
dim = embeds.shape[1]
index = faiss.IndexFlatL2(dim)
print(index.is_trained)
index.add(np.float32(embeds))
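`IndexFlatL2` is an exact, brute-force index that compares the query against every stored vector using Euclidean (L2) distance. To make that computation explicit, here is a plain-Python sketch of the same idea on toy two-dimensional vectors. It illustrates the concept only and is not how FAISS is implemented; the function names and data are made up for this example. Squared distances are used because they produce the same ranking as true distances.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def flat_search(vectors, query, k=3):
    """Compare the query with every vector; return (distances, ids) of the k nearest."""
    scored = sorted((squared_l2(vec, query), i) for i, vec in enumerate(vectors))
    top = scored[:k]
    return [dist for dist, _ in top], [i for _, i in top]

# Toy "embeddings"; real ones would have thousands of dimensions.
vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [1.0, 0.0]]

distances, ids = flat_search(vectors, [1.0, 0.5], k=2)
print(ids, distances)
```

Because every vector is visited, the cost grows linearly with the size of the archive. That is fine for 15 sentences; larger collections are where approximate index types become worthwhile.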
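Search the index
We can now search the dataset with any query we want. We simply embed the query and present its embedding to the index, which retrieves the most similar sentences from the Wikipedia article.
Let's define our search function: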
def search(query, number_of_results=3):
    # 1. Get the query's embedding
    query_embed = co.embed(texts=[query],
                           input_type="s