RAG with a map: Multimodal + geospatial in Elasticsearch

Elastic China Community Official Blog

Published on 2025-09-11 09:26:07

License: CC 4.0 BY-SA

This is an original article by the blogger and may not be reproduced without the blogger's permission.

Article link: https://blog.csdn.net/UbuntuTouch/article/details/151552848

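Author: Alexander Dávila, Elastic

Combining multimodal RAG capabilities with core Elasticsearch features such as geospatial queries and lexical search.

Elasticsearch is packed with new features to help you build the best search solution for your use case. Dive into our sample notebooks to learn more, start a free cloud trial, or try Elastic on your local machine now.

When working with RAG systems, Elasticsearch offers a significant advantage by combining hybrid search (vector search + traditional text search) with hard filters, which ensures that the retrieved data is relevant to the user's query. This makes the model less prone to hallucinations and improves overall system quality. In this blog, we explore how to use Elastic's geospatial search capabilities to take a multimodal RAG system to the next level.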

Getting started

The complete source code used in this blog is in the repository cloned in the setup steps below.

Prerequisites

Elasticsearch 8.0.0+
Ollama
- cogito:3b model
Python 3.8+
Python dependencies:
- elasticsearch
- elasticsearch-dsl
- ollama
- clip_processor
  - torch
  - transformers
  - PIL
- streamlit
- json
- os
- typing

Setup

1) Clone the repository:

git clone https://github.com/Alex1795/multimodal_RAG_elasticsearch.git  
cd multimodal_RAG_elasticsearch

2) Install the required libraries:

pip install -r requirements.txt

3) Install and set up Ollama:

Download from https://ollama.com/download/

# Download and start the required model
ollama pull cogito:3b
ollama run cogito:3b

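4) Configure Elasticsearch

Make sure the following environment variables are set:
- ES_INDEX
- ES_HOST
- ES_API_KEY

Set up the index mapping in Elasticsearch, paying particular attention to the geolocation and vector definitions: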

PUT mmrag_blog
{  
  "mappings": {  
    "properties": {  
      "title": {  
        "type": "text",  
        "analyzer": "standard"  
      },  
      "geolocation": {  
        "type": "geo_point"  
      },  
      "image_filename": {  
        "type": "keyword"  
      },  
      "generated_description": {  
        "type": "text",  
        "analyzer": "standard"  
      },  
      "description": {  
        "type": "text",  
        "analyzer": "standard"  
      },  
      "text_embedding": {  
        "type": "dense_vector",  
        "dims": 512,  
        "index": true,  
        "similarity": "cosine"  
      },  
      "image_embedding": {  
        "type": "dense_vector",  
        "dims": 512,  
        "index": true,  
        "similarity": "cosine"  
      },  
      "photo_id": {  
        "type": "keyword"  
      }  
    }  
  }  
}
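The same mapping can also be created from Python. The following is a minimal sketch, not the repository's code: the helper names (dense_vector, text_field, ensure_index) are hypothetical, and it assumes an elasticsearch-py 8.x client object is passed in as es.

```python
def dense_vector(dims=512):
    # 512 dimensions matches the CLIP model used in this post
    return {"type": "dense_vector", "dims": dims, "index": True, "similarity": "cosine"}


def text_field():
    return {"type": "text", "analyzer": "standard"}


# Same fields as the PUT mmrag_blog request above
MAPPINGS = {
    "properties": {
        "title": text_field(),
        "geolocation": {"type": "geo_point"},
        "image_filename": {"type": "keyword"},
        "generated_description": text_field(),
        "description": text_field(),
        "text_embedding": dense_vector(),
        "image_embedding": dense_vector(),
        "photo_id": {"type": "keyword"},
    }
}


def ensure_index(es, index_name):
    """Create the index if it does not exist yet. Returns True when it was created."""
    if es.indices.exists(index=index_name):
        return False
    es.indices.create(index=index_name, mappings=MAPPINGS)
    return True
```

With a real client this would be called as ensure_index(es, "mmrag_blog") once before running the upload script.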

Running the application

1) Generate and index the image vectors and metadata:

python upload_documents.py

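This script runs the data indexing pipeline. It processes the image metadata files and enriches them with multimodal vectors (computed with the CLIP model from both the descriptions and the images themselves). Finally, it uploads the documents to Elasticsearch. After running this command, you should see the mmrag_blog index in Elasticsearch, containing the image metadata, geolocation, and the image and text vectors.

2) Run the Streamlit app and use the UI in your browser: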

streamlit run streamlit_app.py

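After running this command, you can view the project's web page at http://localhost:8501.

This web page is the interface to the RAG application. Here you can ask a question; the assistant extracts the relevant parameters from it, runs an RRF search on Elasticsearch to find relevant images, and generates an answer. It also displays some of the images from the search results.

Implementation overview

To showcase Elastic's RAG capabilities, we will build an assistant that answers questions about national parks using relevant data. The search combines 4 methods, using data inferred from the user's text query:

Image vector search
Text vector search
Lexical text search
Geospatial filtering

This allows our assistant to give answers that are relevant and focused on what the user needs.

So how does geospatial filtering improve the assistant's results? Suppose the user asks: "Where can I find canyons near Salt Lake City?"
Without geospatial filtering, the assistant might suggest: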

Canyonlands National Park - Utah
Grand Canyon National Park - Arizona

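However, since we know the user is specifically looking for places near Salt Lake City, it makes more sense to search only for answers in Utah. The correct option is therefore only Canyonlands National Park.

The implementation in this blog uses a geo_distance query to find results (the images' geo points) within the area of a specific national park. We also use geoshapes to draw the park areas.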

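However, Elastic's geo query capabilities go far beyond this:

geo_bounding_box query: finds documents (geo points or geoshapes) that intersect the specified rectangle
geo_grid query: finds documents that intersect the specified geohash, map tile, or H3 bin
geo_shape query: finds documents related to the specified geoshape (documents that intersect it, are contained by it, are within it, or do not intersect it)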
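As a rough illustration of two of these query types, the sketch below builds their request bodies as plain Python dictionaries. The geolocation field comes from the mapping above; the helper names and the box coordinates (a loose rectangle around Yellowstone) are made up for the example.

```python
def bounding_box_query(field, top, left, bottom, right):
    """geo_bounding_box: match documents whose geo field falls inside the rectangle."""
    return {
        "geo_bounding_box": {
            field: {
                "top_left": {"lat": top, "lon": left},
                "bottom_right": {"lat": bottom, "lon": right},
            }
        }
    }


def envelope_shape_query(field, top, left, bottom, right, relation="intersects"):
    """geo_shape with an inline envelope; GeoJSON-style coordinates are [lon, lat]."""
    return {
        "geo_shape": {
            field: {
                "shape": {
                    "type": "envelope",
                    # upper-left corner first, then lower-right corner
                    "coordinates": [[left, top], [right, bottom]],
                },
                "relation": relation,
            }
        }
    }


# Illustrative rectangle only, not an official park boundary
box = bounding_box_query("geolocation", top=45.1, left=-111.2, bottom=44.1, right=-109.8)
shape = envelope_shape_query("geolocation", top=45.1, left=-111.2, bottom=44.1, right=-109.8)
```

Either dictionary could be used as a filter clause in place of the geo_distance filter used in this post.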

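Dataset

We will use geotagged images of national parks from Flickr. We add descriptions to these images and vectorize both the images and the descriptions with the openai/clip-vit-base-patch32 model.

We merge these vectors with the image metadata, and the resulting documents look like this: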

{
         "title": "Spa Geyser in Yellowstone National Park on a sunny day",
         "geolocation": {
           "lat": 44.45899722222222,
           "lon": -110.82573611111111
         },
         "image_filename": "52631363114_Spa_Geyser_in_Yellowstone_National_Park_on_a_sunny.jpg",
         "generated_description": "A small geyser releases steady streams of hot water and steam into the air on a clear sunny day. Colorful mineral deposits surround the thermal feature, creating vibrant orange and yellow formations. The active geothermal vent demonstrates the underground volcanic activity that powers these natural fountains.",
         "text_embedding": [
           0.02323250286281109,
           …
           -0.17811810970306396
         ],
         "image_embedding": [
           -0.22548234462738037,
		…
		-0.040389999747276306
         ]

       }

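A major benefit of handling geolocation with Elastic is Kibana Maps visualization. In Kibana, we can plot our dataset on a map (note that we also added geoshapes for the national parks).

Zooming in, we can see the same document from earlier in Yellowstone. The (approximate) shape of Yellowstone park is also drawn in the layer beneath it.

System architecture

Indexing pipeline

The indexing pipeline handles the vectorization of images and descriptions. It also adds further metadata to the images to create documents and indexes them into Elastic:

1) The starting point is pairs of images and metadata from national parks. The metadata includes the geolocation, image title, and description.
2) We feed the images and descriptions into the CLIP model (openai/clip-vit-base-patch32) to obtain their vectors in the same 512-dimensional vector space. The complete source code for this step is in the repository.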

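import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
# Assumed setup (not shown in the original snippet): load the CLIP checkpoint named above
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
def create_image_embedding(image_path):
    image = Image.open(image_path).convert('RGB')
    inputs = processor(images=image, return_tensors="pt", padding=True, truncation=True, use_fast=True)
    with torch.no_grad():
        outputs = model.get_image_features(**inputs)
    return outputs.numpy().flatten()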

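The process for generating a vector from an image is as follows:

Load the image in RGB format with Image.open()
Convert the image into tensors, the format the model expects, with processor()
Extract a 512-dimensional dense vector representation of the image with model.get_image_features()
Finally, convert the PyTorch tensor into a flat numpy array with outputs.numpy().flatten()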

def create_text_embedding(text):
    # Process the text
    inputs = processor(text=[text],  return_tensors="pt", padding=True, truncation=True)
    # Generate embedding
    with torch.no_grad():
        text_features = model.get_text_features(**inputs)
        # Normalize the embedding (CLIP embeddings are typically normalized)
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    # Convert to numpy array
    embedding = text_features.numpy().flatten()

    return embedding

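The process for generating a vector from text is as follows:

Process the input text with processor(), which tokenizes it and converts it into tensors
Pass the tokenized text through model.get_text_features() to extract its semantic features. The resulting vector is also 512-dimensional
Normalize the vector with text_features / text_features.norm(), so that a dot product between vectors corresponds to cosine similarity
Finally, convert the vector into a flat numpy array with text_features.numpy().flatten()

We chose this model because it is multimodal and is trained to maximize the similarity between images and their matching text. As a result, the vector of an image and the vector of its description usually end up close together in the vector space.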
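To see why normalization matters, the short sketch below checks that the dot product of two L2-normalized vectors equals the cosine similarity of the original vectors. It uses tiny hand-picked vectors and plain Python rather than real CLIP output, and the helper names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0, 0.0]
b = [4.0, 3.0, 0.0]

# Both expressions measure the same angle between a and b
print(cosine(a, b))
print(dot(l2_normalize(a), l2_normalize(b)))
```

Because the index mapping uses cosine similarity, vector length does not affect ranking there; normalizing mainly keeps the stored vectors consistent if they are compared elsewhere with a plain dot product.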

3) We merge all the metadata, descriptions, geolocation, and the image and description vectors into a JSON file

We index the JSON file into Elastic with:

es.index(document=doc, index=index)

where doc is the metadata for each image.
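For larger datasets, one call per document can be slow. As an alternative sketch (not what the repository does), the documents could be turned into bulk actions and sent with the bulk helper from the elasticsearch package. The build_bulk_actions name is hypothetical, and using photo_id as the document _id is an assumption made for this example.

```python
def build_bulk_actions(docs, index_name):
    """Yield one bulk action per document, ready for elasticsearch.helpers.bulk."""
    for doc in docs:
        action = {"_index": index_name, "_source": doc}
        # Assumption: reuse photo_id as the document ID so re-runs overwrite instead of duplicating
        if "photo_id" in doc:
            action["_id"] = doc["photo_id"]
        yield action


sample_docs = [
    {
        "photo_id": "52631363114",
        "title": "Spa Geyser in Yellowstone National Park on a sunny day",
        "geolocation": {"lat": 44.45899722222222, "lon": -110.82573611111111},
    },
    {"title": "A document without a photo_id"},
]
actions = list(build_bulk_actions(sample_docs, "mmrag_blog"))

# With a real client this would be sent as:
#   from elasticsearch import helpers
#   helpers.bulk(es, build_bulk_actions(sample_docs, "mmrag_blog"))
```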

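Search pipeline

This stage processes the user's query, builds the search, and generates an answer from the search results. The LLM used is cogito:3b running on Ollama, but it can easily be swapped for any remote model, such as Claude or ChatGPT. We chose this model because it is lightweight and performs well on general tasks (as expected of an assistant) compared with similar models such as Llama 3.2 3B. This means we can get correct results when running locally without long waits!

The pipeline workflow is as follows:

1) We receive the user input: Where can I see mountains in Washington State?

2) We give the LLM the user input and a dictionary of parks, states, and geolocations (defined in the same Python file), and instruct it to extract the parameters for the Elastic query. The exact prompt is: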

"""You are going to extract data from a user query for a national parks search system.

Available National Parks:
{parks_info}

Extract the following information and format it as JSON:
- context_search: the main activity or interest (e.g., "hike", "walk dog", or "camping")
- distance_km: estimated search radius in kilometers (default: 100 if not specified)
- location_type: specific state, city, or region mentioned
- reference_location: if a city is mentioned, include it (e.g., "Boston","Denver")
- relevant_parks: list of park IDs that might be relevant based on location (use the exact park IDs from the list above)

Examples:
User query: "Where can I hike in Utah?"
Response: {{"context_search": "hike", "distance_km": 100, "location_type": "Utah", "reference_location": null, "relevant_parks": ["arches_national_park", "canyonlands_national_park"]}}

Only respond with valid JSON. No additional text. If a city is mentioned, use the state that city is in as the location_type.

User query: {query}"""

Here is an example entry from the parks_info dictionary:

 { 
    "mt_rainier_national_park": {
        "coordinates": (46.8523, -121.7603),
        "state": "Washington"
    }
  }

The model extracts the following data from the prompt above:

{
 'context_search': 'mountains', 
 'distance_km': 100, 
 'location_type': 'Washington', 
 'reference_location': None, 
 'relevant_parks': ['mt_rainier_national_park']
}
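Small local models do not always return clean JSON, so it can help to parse the reply defensively. The sketch below is one possible approach, not the repository's code: the function name, the default values, and the strategy of grabbing the outermost braces are all assumptions for illustration.

```python
import json


def default_params():
    # Fresh dict on every call so the list is never shared between results
    return {
        "context_search": "",
        "distance_km": 100,
        "location_type": None,
        "reference_location": None,
        "relevant_parks": [],
    }


def parse_search_params(raw_response):
    """Extract the JSON object from an LLM reply, falling back to defaults on failure."""
    params = default_params()
    start = raw_response.find("{")
    end = raw_response.rfind("}")
    if start == -1 or end <= start:
        return params
    try:
        parsed = json.loads(raw_response[start:end + 1])
    except json.JSONDecodeError:
        return params
    if not isinstance(parsed, dict):
        return params
    for key in params:
        # Keep the default when the model omitted the key or returned null
        if parsed.get(key) is not None:
            params[key] = parsed[key]
    return params


reply = (
    'Sure! {"context_search": "mountains", "distance_km": 100, '
    '"location_type": "Washington", "reference_location": null, '
    '"relevant_parks": ["mt_rainier_national_park"]}'
)
print(parse_search_params(reply))
```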

3) We use the context_search parameter to generate a new vector with the same CLIP model.

4) We extract the coordinates from the parks_info dictionary.
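Step 4 can be sketched as a simple lookup. The helper below is hypothetical and only illustrates the idea: it maps each park ID returned by the LLM to the coordinates stored in parks_info, and skips IDs that are not in the dictionary, since a model may occasionally invent one.

```python
PARKS_INFO = {
    "mt_rainier_national_park": {
        "coordinates": (46.8523, -121.7603),
        "state": "Washington",
    }
}


def resolve_search_centers(search_params, parks_info):
    """Return one search center (lat, lon, distance string) per known park ID."""
    distance = f"{search_params.get('distance_km', 100)}km"
    centers = []
    for park_id in search_params.get("relevant_parks", []):
        park = parks_info.get(park_id)
        if park is None:
            continue  # ignore park IDs that are not in the dictionary
        lat, lon = park["coordinates"]
        centers.append({"park_id": park_id, "lat": lat, "lon": lon, "distance": distance})
    return centers


params = {
    "context_search": "mountains",
    "distance_km": 100,
    "relevant_parks": ["mt_rainier_national_park", "made_up_park"],
}
centers = resolve_search_centers(params, PARKS_INFO)
print(centers)
```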

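5) We use all of these parameters to build the Elasticsearch query. This is the core of the RAG functionality:

Create a geo_distance filter using the coordinates and the distance_km parameter
Create a text match query against the generated_description field
Create a standard retriever that uses the text query from the previous step and the geo_distance filter
Create two knn retrievers that use the vector generated in step 3, matching it against the image vector and the text vector indexed in each document. Each retriever also uses the geo_distance filter
Use an RRF retriever to combine the result sets produced by all the other retrievers

The whole process runs in the rrf_search() function: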

from elasticsearch import Elasticsearch
from elasticsearch_dsl import Q, Search

def rrf_search(index_name, lat, lon, distance, text_query, k=10,
               num_candidates=100):
    """
    Create an RRF search object bound to a specific index. Then executes the search. 

    Args:
        index_name (str): Name of the Elasticsearch index
        lat (float): Latitude for geo filtering
        lon (float): Longitude for geo filtering
        distance (int/str): Distance for geo filtering
        text_query (str): Text to search in description fields
        k (int): Number of top results for KNN search
        num_candidates (int): Number of candidates for KNN search

    Returns:
        list: List of results from Elasticsearch
    """

    embedding = create_text_embedding(text_query).tolist()

    # Create geo distance query
    geo_filter = Q('geo_distance',
                   distance=distance,
                   geolocation={'lat': lat, 'lon': lon})

    # Create text search queries
    text_queries = [
        Q('match', generated_description=text_query),
        Q('match', description=text_query)
    ]

    # Create boolean query for standard search
    standard_query = Q('bool', filter=[geo_filter], should=text_queries)

    # Create search object bound to index
    s = Search(index=index_name)
    s = s.source(["image_filename", "generated_description"])
    # Build RRF configuration
    retrievers = [
        # Standard retriever
        {
            "standard": {
                "query": standard_query.to_dict()
            }
        },
        # Text KNN retriever
        {
            "knn": {
                "filter": geo_filter.to_dict(),
                "field": "text_embedding",
                "query_vector": embedding,
                "k": k,
                "num_candidates": num_candidates
            }
        },
        # Image KNN retriever
        {
            "knn": {
                "filter": geo_filter.to_dict(),
                "field": "image_embedding",
                "query_vector": embedding,
                "k": k,
                "num_candidates": num_candidates
            }
        }
    ]

    # Apply RRF configuration
    s = s.extra(retriever={'rrf': {'retrievers': retrievers}}, size=3)

    #print(s.to_dict())

    es = Elasticsearch(cloud_id=cloud_id, api_key=api_key)

    results = s.using(es).execute()["hits"]["hits"]

    return results

In the end, we get a query like the following:

{
 "retriever": {
   "rrf": {
     "retrievers": [
       {
         "standard": {
           "query": {
             "bool": {
               "filter": [
                 {
                   "geo_distance": {
                     "distance": "100km",
                     "geolocation": {
                       "lat": 46.8523,
                       "lon": -121.7603
                     }
                   }
                 }
               ],
               "should": [
                 {
                   "match": {
                     "generated_description": "mountains"
                   }
                 }
               ]
             }
           }
         }
       },
       {
         "knn": {
           "filter": {
             "geo_distance": {
               "distance": "100km",
               "geolocation": {
                 "lat": 46.8523,
                 "lon": -121.7603
               }
             }
           },
           "field": "text_embedding",
           "query_vector": [
             0.01967986486852169,
             ...
             0.00988344382494688],
           "k": 10,
           "num_candidates": 100
         }
       },
       {
         "knn": {
           "filter": {
             "geo_distance": {
               "distance": "100km",
               "geolocation": {
                 "lat": 46.8523,
                 "lon": -121.7603
               }
             }
           },
           "field": "image_embedding",
           "query_vector": [
             0.01967986486852169,
             ...
             0.00988344382494688],
           "k": 10,
           "num_candidates": 100
         }
       }
     ]
   }
 },
 "size": 3,
 "_source": [
   "image_filename",
   "generated_description"
 ]
}
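To give a feel for how the rrf retriever merges the three result lists, here is a minimal sketch of Reciprocal Rank Fusion in plain Python. Each document receives 1 / (rank_constant + rank) from every list it appears in, and the contributions are summed. The value 60 is used here as the rank constant, which to our knowledge is the Elasticsearch default; the document IDs and rankings are invented for the example.

```python
def rrf_scores(ranked_lists, rank_constant=60):
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank_constant + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical rankings from the three retrievers
lexical = ["a", "b", "c"]
text_knn = ["b", "a", "d"]
image_knn = ["b", "d", "a"]

fused = rrf_scores([lexical, text_knn, image_knn])
for doc_id, score in fused:
    print(doc_id, round(score, 5))
```

Document b ranks first here because it sits near the top of all three lists. RRF only looks at positions, so the raw BM25 and vector similarity scores never need to be put on a common scale.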

6) Next, we feed the documents retrieved from Elastic and the user's original query to the LLM, using the following prompt:

f"""You are a helpful assistant for national parks activities. Based on the search results below, provide a comprehensive and helpful response to the user's original query.

Original User Query: {original_query}

Search Parameters Used:
- Activity/Interest: {search_params.get('context_search', 'N/A')}
- Search Distance: {search_params.get('distance_km', 'N/A')} km
- Location: {search_params.get('location_type', 'N/A')}

Search results: {results_text}

Instructions:
- Provide a natural, conversational response
- Recommend specific activities and locations based on the search results only
- Include practical information when available
- Do not suggest alternatives if no results were found
- Be enthusiastic and helpful about national parks experiences
- Keep the response focused and not too lengthy
- Structure your response separating your suggestions per national park
- Do not include anything about national parks that are not in the results"""
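The results_text placeholder in this prompt has to be built from the search hits. The sketch below shows one way that could look; the format_results name and the output layout are assumptions, and it treats hits as plain dictionaries shaped like a raw Elasticsearch response, with the two _source fields requested by the query.

```python
def format_results(hits):
    """Turn Elasticsearch hits into a numbered text block for the LLM prompt."""
    if not hits:
        return "No results found."
    lines = []
    for position, hit in enumerate(hits, start=1):
        source = hit.get("_source", {})
        filename = source.get("image_filename", "unknown image")
        description = source.get("generated_description", "")
        lines.append(f"{position}. {filename}: {description}")
    return "\n".join(lines)


sample_hits = [
    {
        "_source": {
            "image_filename": "52631363114_Spa_Geyser_in_Yellowstone_National_Park_on_a_sunny.jpg",
            "generated_description": "A small geyser releases steady streams of hot water and steam.",
        }
    }
]
results_text = format_results(sample_hits)
print(results_text)
```

Returning an explicit "No results found." string pairs well with the prompt instruction not to suggest alternatives when the search comes back empty.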

7) Finally, the LLM generates an answer based on the search results:

I'd be happy to help you find mountains in Washington State! Based on the search results, here are some fantastic locations:
Mount Rainier National Park is a must-visit destination for mountain lovers. Paradise Valley offers breathtaking views of the Tatoosh Mountain Range and Mount Rainier itself. The best time to visit is during late spring when the wildflowers bloom.

This location offers incredible opportunities to see mountains up close and personal - whether you're hiking, camping, or simply taking in the breathtaking scenery. Would you like more specific information about this park?

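Web application

The Streamlit-based frontend handles user input, runs the search pipeline to obtain the LLM's final answer, and displays the images from the search results along with their descriptions.

You can find the application source code and usage instructions in the project repository.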

Multimodal RAG and geospatial search usage example

Query: Any places to ride a boat in Oregon?

Here, the LLM extracted these parameters:

{
 'context_search': 'boat ride', 
 'distance_km': 100, 
 'location_type': 'Oregon', 
 'reference_location': None, 
 'relevant_parks': ['crater_lake_national_park']
}

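The search is centered on Crater Lake National Park, so the answer draws only on that national park in Oregon. This way, the system responds within the given constraints and does not mention other parks where boating is possible but that are not in Oregon.

Conclusion

In this article, we saw that integrating multimodal RAG capabilities with Elasticsearch's powerful geospatial features can significantly improve the relevance and accuracy of a RAG system's search results. By combining image and text vector search with lexical search and precise geographic filtering, the system can provide highly contextual answers. This approach not only reduces hallucinations but also takes advantage of Elasticsearch's varied geo query options and Kibana's visualization tools to deliver a comprehensive, user-centered search experience.

Original: https://www.elastic.co/search-labs/blog/multimodal-rag-elasticsearch-geospatial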