Building a multimodal RAG system with Elasticsearch: The story of Gotham City


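By Alex Salgado, Elastic

Learn how to build a multimodal retrieval-augmented generation (RAG) system that integrates text, audio, video, and image data to provide richer, context-aware information retrieval.

In this blog, you will learn how to build a multimodal RAG (Retrieval-Augmented Generation) pipeline with Elasticsearch. We will explore how to use ImageBind to generate embeddings for several data types (text, images, audio, and depth maps), and how to store and retrieve those embeddings efficiently with dense_vector fields and k-NN search. Finally, we will integrate a large language model (LLM) to analyze the retrieved evidence and generate a comprehensive report.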

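How does the pipeline work?

🔍 Collect clues → Extract image, audio, text, and depth map data from the Gotham City crime scene.
📌 Generate embeddings → Convert each file into a vector with the ImageBind multimodal model.
📂 Index into Elasticsearch → Store the vectors for efficient retrieval.
🔎 Similarity search → Given a new clue, retrieve the most similar vectors.
🕵️ LLM analyzes the evidence → GPT-4 synthesizes the findings and identifies the suspect!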

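Technologies used

ImageBind → Generates unified embeddings across modalities.
Elasticsearch → Provides fast, efficient vector search.
LLM (GPT-4, OpenAI) → Analyzes the evidence and generates the final report.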

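Who is this blog for?

✅ Elasticsearch users: developers interested in multimodal vector search.
✅ Developers who want hands-on multimodal RAG: those who want to see how to build it in a real application.
✅ Engineers looking for scalable data analysis: those who need to process and analyze data from multiple sources.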

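Prerequisites: Setting up the environment

Want to crack the Gotham City case? First you need to set up your technical environment. Follow the steps below:

1. Technical requirements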

Component	Specification
Operating system	Linux, macOS, or Windows
Python	3.10 or later
RAM	Minimum 8GB (16GB recommended)
GPU	Optional but recommended for ImageBind

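2. Setting up the project

All the investigation materials are available on GitHub, and we will work through this interactive case in a Jupyter Notebook (Google Colab). Follow these steps to get started:

Setting up with Jupyter Notebook (Google Colab)

1) Access the notebook
Open our ready-to-use Google Colab notebook: Multimodal RAG with Elasticsearch.
The notebook contains all the code and instructions you need to follow along.

2) Clone the repository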

# Clone the repository with the multimodal RAG code
!git clone https://github.com/elastic/elasticsearch-labs.git

# Navigate to the project directory (%cd keeps the change across notebook cells)
%cd elasticsearch-labs/supporting-blog-content/building-multimodal-rag-with-elasticsearch-gotham

3) Install dependencies

# Install PyTorch and related libraries (the quotes stop the shell from treating ">" as a redirect)
!pip install "torch>=2.1.0" "torchvision>=0.16.0" "torchaudio>=2.1.0"

# Install vision processing libraries
!pip install opencv-python-headless pillow numpy

# Install the specific ImageBind fork
!pip install git+https://github.com/hkchengrex/ImageBind.git

# Install Elasticsearch and environment management
!pip install elasticsearch python-dotenv

# This solves the problem: Couldn't find appropriate backend to handle uri data/audios/joker_laugh.wav
!pip install torchaudio soundfile

4) Configure credentials

# Input your credentials securely
import getpass
import os

ELASTICSEARCH_URL = input("Enter the Elasticsearch endpoint url: ")
ELASTICSEARCH_API_KEY = getpass.getpass("Enter the Elasticsearch API key: ")
OPENAI_API_KEY = getpass.getpass("Enter the OpenAI API key: ")

# Configure environment variables
os.environ["ELASTICSEARCH_API_KEY"] = ELASTICSEARCH_API_KEY
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
os.environ["ELASTICSEARCH_URL"] = ELASTICSEARCH_URL

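Note: The ImageBind model (about 2 GB) is downloaded automatically on the first run.

Now that everything is set up, let's dive into the details and solve the case!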

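Introduction: Crime in Gotham City

On a rainy night, a shocking crime takes place in Gotham City. Commissioner Gordon needs your help to unravel the mystery. The clues are scattered across different formats: blurry images, mysterious audio, encrypted text, and even depth maps. Are you ready to crack the case with state-of-the-art AI?

In this blog, we guide you step by step through building a multimodal RAG (Retrieval-Augmented Generation) system that unifies different data types (images, audio, text, and depth maps) in a single search space. We use ImageBind to generate multimodal embeddings, Elasticsearch to store and retrieve them, and a large language model (LLM) to analyze the evidence and generate the final report.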

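Fundamentals: Multimodal RAG architecture

What is multimodal RAG?

The rise of multimodal retrieval-augmented generation (RAG) is changing how we interact with AI models. Traditional RAG systems handle only text: they retrieve relevant information from a database and then generate a response. But the world is not limited to text; images, video, and audio also carry valuable knowledge. That is why multimodal architectures are becoming increasingly important: they let AI systems combine information in different formats to generate richer and more precise responses.

Three main approaches to multimodal RAG

Three strategies are commonly used to implement multimodal RAG. Each has its own strengths and limitations, depending on the use case: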

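1. Shared vector space

Data from different modalities is mapped into a common vector space by a multimodal model such as ImageBind. This lets a text query retrieve images, video, and audio without explicit format conversion.

Advantages:

Enables cross-modal retrieval without explicit format conversion.
Integrates modalities smoothly, allowing direct retrieval across text, images, audio, and video.
Scales to many data types, which suits large-scale retrieval applications.

Disadvantages:

Training requires large multimodal datasets, which are not always available.
A shared embedding space can introduce semantic drift, so relationships between modalities are not fully preserved.
Bias in the multimodal model can affect retrieval accuracy, depending on the dataset distribution.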

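2. Single base modality

All modalities are converted to a single format, usually text, before retrieval. For example, images are described by automatically generated captions, and audio is transcribed to text.

Advantages:

Simplifies retrieval, since everything is converted to a uniform text representation.
Works with existing text-based search engines, with no need for dedicated multimodal infrastructure.
Improves interpretability, since retrieved results are in a human-readable format.

Disadvantages:

Information loss: some details (such as spatial relationships in an image or tone of voice in audio) may not be fully captured in a text description.
Depends on caption and transcription quality: errors in automatic annotation can degrade retrieval.
Not ideal for purely visual or auditory queries, since the conversion can strip out key information.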

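3. Independent retrieval

A separate model is maintained for each modality. The system searches each data type independently and then merges the results.

Advantages:

Allows custom optimization per modality, improving retrieval accuracy for each data type.
Relies less on complex multimodal models, which makes it easier to integrate existing retrieval systems.
Gives fine-grained control over ranking and re-ranking, since results from different modalities can be merged dynamically.

Disadvantages:

Requires result fusion, which makes retrieval and ranking more complex.
May produce inconsistent responses if different modalities return conflicting information.
Higher computational cost, since each modality needs its own search, which increases processing time.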
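To give a feel for what the result-fusion step in independent retrieval involves, here is a minimal sketch using reciprocal rank fusion (RRF), one common way to merge ranked lists. RRF is not something this post's pipeline uses; it is swapped in purely as an illustration. The function name, the evidence IDs, and the constant k=60 (a conventional default) are assumptions for this example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked lists of document IDs; each list is ordered best-first."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            # A document earns more credit the higher it ranks in each list
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID so the order is deterministic
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical per-modality result lists that refer to the same evidence IDs
image_search = ["ev1", "ev2", "ev3"]
audio_search = ["ev3", "ev1"]
text_search = ["ev2", "ev1"]

fused = reciprocal_rank_fusion([image_search, audio_search, text_search])
print(fused)
```

An item that appears near the top of several lists ends up ahead of one that ranks first in only a single list, which is the behavior a fusion step is usually after.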

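Our choice: Shared vector space with ImageBind

Among these approaches, we chose the shared vector space, a strategy well suited to efficient multimodal search. Our implementation is based on ImageBind, a model that represents multiple modalities (text, images, audio, and video) in a common vector space. This lets us:

Run cross-modal searches across media formats without converting everything to text.
Use highly expressive embeddings that capture relationships between modalities.
Keep the system scalable and efficient by storing optimized embeddings for fast retrieval in Elasticsearch.

With this approach, we built a powerful multimodal search pipeline in which a text query can directly retrieve images or audio without additional preprocessing. Practical applications range from intelligent search over large libraries to advanced multimodal recommendation systems.

The diagram below illustrates the data flow in the multimodal RAG pipeline, highlighting indexing, retrieval, and response generation over multimodal data: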

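How does the embedding space work?

Traditionally, text embeddings come from language models (for example BERT or GPT). Now, with natively multimodal models such as Meta AI's ImageBind, we have a backbone that can generate vectors for multiple modalities:

Text: Sentences and paragraphs are converted into vectors of the same dimension.
Images (vision): Pixels are mapped into the same dimensional space as text.
Audio: Sound signals are converted into embeddings comparable to those of images and text.
Depth maps: Depth data is also processed into vectors.

As a result, any clue (text, image, audio, or depth map) can be compared with any other using a vector similarity measure such as cosine similarity. If an audio sample of a laugh and an image of a suspect's face are "close" in this space, we can infer some connection (for example, the same identity).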
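The comparison itself is simple arithmetic. The sketch below computes cosine similarity in plain Python on made-up three-dimensional vectors; real ImageBind embeddings have far more dimensions, and the variable names and values here are invented for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy stand-ins for real embeddings (made-up values)
laugh_audio = [0.9, 0.1, 0.0]
suspect_image = [0.8, 0.2, 0.1]
unrelated_text = [0.0, 0.1, 0.9]

print(cosine_similarity(laugh_audio, suspect_image))   # close to 1: similar direction
print(cosine_similarity(laugh_audio, unrelated_text))  # close to 0: nearly orthogonal
```

Because the measure depends only on direction, two clues can score as similar even when their vectors differ in magnitude.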

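Stage 1 - Collecting crime scene clues

Before analyzing the evidence, we need to collect it. The crime in Gotham City left traces that may be hidden in images, audio, text, and even depth data. Let's organize these clues so we can feed them into the system.

What do we have?

Commissioner Gordon sent us the following files, containing evidence collected from the crime scene in four different modalities:

Clue descriptions by modality:

a) Images (3 photos)

crime_scene1.jpg, crime_scene2.jpg → Photos taken at the crime scene, showing suspicious marks on the ground.
suspect_spotted.jpg → Security camera image showing a figure fleeing the scene.

b) Audio (1 recording)

joker_laugh.wav → A microphone near the crime scene recorded a sinister laugh.

c) Text (2 messages)

riddle.txt, note2.txt → Mysterious notes found at the scene, possibly left by the perpetrator.

d) Depth (2 depth maps)

depth_suspect.png → A security camera with a depth sensor captured a suspect in a nearby alley.
jdancing-depth.png → A security camera with a depth sensor captured a suspect heading toward the subway station.

This evidence comes in different formats and cannot be analyzed directly in the same way. We need to convert it into embeddings, numerical vectors that allow cross-modal comparison.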

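Organizing the files

Before processing, we need to make sure all the clues are organized correctly in the data/ directory so the pipeline runs smoothly.

Expected directory structure: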

data/
├── images/
│   ├── crime_scene1.jpg
│   ├── suspect_spotted.jpg
│   ...
├── audios/
│   ├── joker_laugh.wav
│   ...
├── texts/
│   ├── riddle.txt
│   ... 
├── depths/
│   ├── depth_suspect.png

Code to verify the clue organization

Before continuing, let's make sure all the required files are in the right place.

import os

# Base directory for clues
data_dir = "data"

# List of expected files
evidences = {
    "images": ["crime_scene1.jpg", "crime_scene2.jpg", "joker_alley.jpg"],
    "audios": ["joker_laugh.wav"],
    "texts": ["riddle.txt", "note2.txt"],
    "depths": ["depth_suspect.png", "jdancing-depth.png"],
}

# Create directories if they don't exist and report any missing file
missing = False
for category, files in evidences.items():
    category_path = os.path.join(data_dir, category)
    os.makedirs(category_path, exist_ok=True)
    for file in files:
        file_path = os.path.join(category_path, file)
        if not os.path.exists(file_path):
            missing = True
            print(f"Warning: {file} not found in {category_path}.")

# Report success only when nothing is missing
if not missing:
    print("All files are correctly organized!")

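Run the file:

python stages/01-stage/files_check.py

Expected output (if all files are in place):

All files are correctly organized!

Expected output (if any file is missing):

Warning: joker_laugh.wav not found in data/audios.
Warning: depth_suspect.png not found in data/depths.

This script helps prevent errors before we start generating embeddings and indexing them into Elasticsearch.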

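Stage 2 - Organizing the evidence

Generating embeddings with ImageBind

To unify the clues, we need to convert them into embeddings: vector representations that capture the meaning of each modality. We will use ImageBind, a model from Meta AI that generates embeddings for different data types (images, audio, text, and depth maps) within a shared vector space.

How does ImageBind work?

To compare different types of evidence (images, audio, text, and depth maps), we use ImageBind to convert them into numerical vectors. The model turns any supported input type into the same embedding format, which enables cross-modal search.

Below is the optimized code (src/embedding_generator.py) that generates an embedding for each modality using the appropriate processor: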

import logging

import torch
from imagebind import data
from imagebind.models import imagebind_model

logger = logging.getLogger(__name__)


class EmbeddingGenerator:
    """Class for generating multimodal embeddings using ImageBind."""

    def __init__(self):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model = self._load_model()

    def _load_model(self):
        """Loads the ImageBind model and sets it to inference mode."""
        model = imagebind_model.imagebind_huge(pretrained=True)
        model.eval()
        model.to(self.device)
        return model

    def generate_embedding(self, input_data, modality):
        """Generates embedding for different modalities"""
        # self.process_depth is a method of this class that is not shown in this excerpt
        processors = {
            "vision": lambda x: data.load_and_transform_vision_data(x, self.device),
            "audio": lambda x: data.load_and_transform_audio_data(x, self.device),
            "text": lambda x: data.load_and_transform_text(x, self.device),
            "depth": self.process_depth,
        }
        try:
            # Input type verification
            if not isinstance(input_data, list):
                raise ValueError(f"Input data must be a list. Received: {type(input_data)}")

            # Convert input data to a tensor format that the model can process
            # For images: [batch_size, channels, height, width]
            # For audio: [batch_size, channels, time]
            # For text: [batch_size, sequence_length]
            inputs = {modality: processors[modality](input_data)}
            with torch.no_grad():
                embedding = self.model(inputs)[modality]
            return embedding.squeeze(0).cpu().numpy()
        except Exception as e:
            logger.error(f"Error generating {modality} embedding: {str(e)}", exc_info=True)
            raise

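A tensor is a fundamental data structure in machine learning and deep learning, especially when working with models like ImageBind. In our context:

inputs = {modality: processors[modality](input_data)}

Here, the tensor represents the input data (image, audio, or text) converted into a mathematical format the model can process. Specifically:

For images: The tensor represents the image as a multidimensional matrix whose values are pixels (organized by height, width, and color channels).
For audio: The tensor represents the sound wave as a sequence of amplitudes over time.
For text: The tensor represents words or tokens as numerical vectors.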

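Testing embedding generation:

Let's test our embedding generation with the code below. Save it as stages/02-stage/test_embedding_generation.py and run it with:

python stages/02-stage/test_embedding_generation.py

generator = EmbeddingGenerator()
# generate_embedding expects a list of inputs, even for a single file
image_embedding = generator.generate_embedding(["data/images/crime_scene1.jpg"], "vision")
print(image_embedding.shape)

Expected output:

(1024,)

The image has now been converted into a 1024-dimensional vector.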

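Stage 3 - Storing and searching in Elasticsearch

Now that we have generated embeddings for the evidence, we need to store them in a vector database for efficient search. For this we use Elasticsearch, which supports dense vectors (dense_vector) and similarity search.

This step consists of two main processes:

Indexing the embeddings → Store the generated vectors in Elasticsearch.
Similarity search → Retrieve the records most similar to a new piece of evidence.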
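To make the similarity-search step concrete, here is a minimal sketch of exact (brute-force) k-nearest-neighbor search in plain Python. It only illustrates what a k-NN query returns; it is not how Elasticsearch executes the search, since an indexed dense_vector field is searched approximately through an HNSW graph. The function name top_k_similar and the two-dimensional toy vectors are made up for this example.

```python
import math


def top_k_similar(query, indexed, k=2):
    """Return the k (doc_id, cosine) pairs most similar to the query vector."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.hypot(*a) * math.hypot(*b))

    # Score every stored vector, then keep the k best
    scored = [(doc_id, cosine(query, vector)) for doc_id, vector in indexed.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy "index": file name -> made-up two-dimensional embedding
toy_index = {
    "joker_laugh.wav": [0.9, 0.1],
    "riddle.txt": [0.1, 0.9],
    "crime_scene1.jpg": [0.7, 0.3],
}

for doc_id, score in top_k_similar([1.0, 0.0], toy_index, k=2):
    print(f"{doc_id}: {score:.3f}")
```

Brute force compares the query with every stored vector, which is fine for a handful of clues but becomes slow for large collections; that cost is what an approximate index avoids.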

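Indexing the evidence in Elasticsearch

Every piece of evidence processed by ImageBind (image, audio, text, or depth map) is converted into a 1024-dimensional vector. We need to store these vectors in Elasticsearch for future searches.

The following code (src/elastic_manager.py) creates an index in Elasticsearch and configures the mapping to store the embeddings.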

import base64
import os

from dotenv import load_dotenv
from elasticsearch import Elasticsearch, helpers
...


class ElasticsearchManager:
    """Manages multimodal operations in Elasticsearch"""

    def __init__(self):
        load_dotenv()  # Load variables from .env
        self.es = self._connect_elastic()
        self.index_name = "multimodal_content"
        self._setup_index()

    def _connect_elastic(self):
        """Connects to Elasticsearch"""
        return Elasticsearch(
            os.getenv("ELASTICSEARCH_URL"),  # Elasticsearch endpoint
            api_key=os.getenv("ELASTICSEARCH_API_KEY"),
        )

    def _setup_index(self):
        """Sets up the index if it doesn't exist"""
        if not self.es.indices.exists(index=self.index_name):
            mapping = {
                "mappings": {
                    "properties": {
                        "embedding": {
                            "type": "dense_vector",
                            "dims": 1024,
                            "index": True,
                            "similarity": "cosine",
                        },
                        "modality": {"type": "keyword"},
                        "content": {"type": "binary"},
                        "description": {"type": "text"},
                        "metadata": {"type": "object"},
                        "content_path": {"type": "text"},
                    }
                }
            }
            self.es.indices.create(index=self.index_name, body=mapping)

    def index_content(self, embedding, modality, content=None, description="", metadata=None, content_path=None):
        """Indexes multimodal content"""
        doc = {
            "embedding": embedding.tolist(),
            "modality": modality,
            "description": description,
            "metadata": metadata or {},
            "content_path": content_path,
        }
        if content:
            # Binary content is stored as a base64 string
            doc["content"] = base64.b64encode(content).decode() if isinstance(content, bytes) else content
        return self.es.index(index=self.index_name, document=doc)

    def search_similar(self, query_embedding, modality=None, k=5):
        """Searches for similar contents"""
        query = {
            "knn": {
                "field": "embedding",
                "query_vector": query_embedding.tolist(),
                "k": k,
                "num_candidates": 100,
                "filter": [{"term": {"modality": modality}}] if modality else [],
            }
        }
        try:
            response = self.es.search(index=self.index_name, query=query, size=k)
            # Return both source data and score for each hit
            return [{**hit["_source"], "score": hit["_score"]} for hit in response["hits"]["hits"]]
        except Exception as e:
            print(f"Error: processing search_evidence: {str(e)}")
            return []  # Empty list so callers can still iterate over the result

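Running the indexing

Now let's index one piece of evidence to test the process. Create a file named test_index.py in the project root with the following content: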

# Example: Indexing an image from the crime scene
import sys
import os

# Make the modules in src/ importable
sys.path.append(os.path.join(os.path.dirname(__file__), "src"))

from elastic_manager import ElasticsearchManager
from embedding_generator import EmbeddingGenerator

generator = EmbeddingGenerator()
es_manager = ElasticsearchManager()

image_embedding = generator.generate_embedding(["data/images/crime_scene1.jpg"], "vision")

response = es_manager.index_content(
    embedding=image_embedding,
    modality="vision",
    description="Photo of the crime scene with suspicious traces",
    content_path="data/images/crime_scene1.jpg",
)
print(response)

Expected Elasticsearch output (summary of the indexed document):

{
  "embedding": [0.12, -0.53, 0.89, ...],
  "modality": "vision",
  "description": "Photo of the crime scene with suspicious traces",
  "content_path": "data/images/crime_scene1.jpg"
}

To index all the multimodal evidence, run the following Python command:

python stages/03-stage/index_all_modalities.py

The evidence is now stored in Elasticsearch and can be retrieved whenever needed.
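The contents of index_all_modalities.py are not reproduced in this post. As a rough sketch of how several records could be sent in a single request, the hypothetical helper below builds actions in the format accepted by helpers.bulk from the Elasticsearch Python client. The function name, the record layout, and the short made-up vectors are assumptions for illustration; the field names follow the index mapping used in this post.

```python
def build_bulk_actions(records, index_name="multimodal_content"):
    """Yield one bulk-index action per evidence record."""
    for record in records:
        yield {
            "_index": index_name,
            "_source": {
                "embedding": [float(x) for x in record["embedding"]],
                "modality": record["modality"],
                "description": record.get("description", ""),
                "metadata": record.get("metadata", {}),
                "content_path": record.get("content_path"),
            },
        }


# Made-up two-dimensional vectors keep the example short; real embeddings are much longer
records = [
    {"embedding": [0.1, 0.2], "modality": "audio", "content_path": "data/audios/joker_laugh.wav"},
    {"embedding": [0.3, 0.4], "modality": "text", "content_path": "data/texts/riddle.txt"},
]
actions = list(build_bulk_actions(records))
print(len(actions), actions[0]["_source"]["modality"])

# With a live cluster, the actions could then be sent in one call:
#   from elasticsearch import helpers
#   helpers.bulk(es_manager.es, build_bulk_actions(records))
```

Sending documents in bulk avoids one HTTP round trip per file, which matters more as the evidence collection grows.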

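Verifying the indexing process

After running the indexing script, let's verify that all the evidence was stored correctly in Elasticsearch. You can run a few verification queries in Kibana Dev Tools:

1) First, check that the index was created:

GET _cat/indices/multimodal_content?v

2) Then, verify the number of documents per modality (modality is mapped as a keyword field, so it can be aggregated directly):

GET multimodal_content/_search
{
  "size": 0,
  "aggs": {
    "modalities": {
      "terms": {
        "field": "modality"
      }
    }
  }
}

3) Finally, check the structure of an indexed document:

GET multimodal_content/_search
{
  "size": 1,
  "query": {
    "match_all": {}
  }
}

Expected results:

An index named multimodal_content should exist.
About 7 documents spread across the modalities (vision, audio, text, depth).
Each document should contain the fields embedding, modality, description, metadata, and content_path.

This verification step ensures that our evidence database is set up correctly before we run similarity searches.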

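Searching for similar evidence in Elasticsearch

Now that the evidence is indexed, we can run searches to find the records most similar to a new clue. The search uses vector similarity to return the closest records in the embedding space.

The following code performs this search: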

def search_similar_evidence(self, query_embedding, k=5, modality=None):
    """Performs a kNN search to find the most similar clues."""
    knn_query = {
        "field": "embedding",
        "query_vector": query_embedding.tolist(),
        "k": k,
        "num_candidates": 100,
    }
    query_body = {"knn": knn_query}

    # Optionally restrict the results to a single modality
    if modality:
        query_body = {"bool": {"must": [query_body, {"term": {"modality": modality}}]}}

    try:
        results = self.es.search(
            index=self.index_name,
            query=query_body,
            _source_includes=["description", "modality", "content_path"],
            size=k,
        )
    except Exception as e:
        print(f"Error processing search_evidence: {str(e)}")
        return []  # Empty list so callers can still iterate over the result
    return results["hits"]["hits"]

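Testing the search - Using audio as the query for multimodal results

Now let's test the evidence search with a suspicious audio file. We generate an embedding for the file in the same way and search for similar embeddings: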

python stages/03-stage/search_by_audio.py

# Initialize classes
generator = EmbeddingGenerator()
# ElasticsearchManager takes no arguments; it reads the endpoint and API key from the environment
es_manager = ElasticsearchManager()

# Generate embedding for a suspicious audio (inputs are passed as a list)
audio_embedding = generator.generate_embedding(["data/audios/mysterious_laugh.wav"], "audio")

# Search for similar evidence in Elasticsearch
similar_evidences = es_manager.search_similar_evidence(audio_embedding, k=3)

# Display the retrieved results
print("\n🔎 Similar evidence found:\n")
for i, evidence in enumerate(similar_evidences, start=1):
    description = evidence['_source']['description']
    modality = evidence['_source']['modality']
    score = evidence['_score']
    content_path = evidence['_source'].get('content_path', 'N/A')
    print(f"{i}. {description} ({modality})")
    print(f"   Similarity: {score:.4f}")
    print(f"   File path: {content_path}\n")

Expected output (terminal):

🔎 Similar evidence found:

1. A sinister laugh captured near the crime scene (audio)
   Similarity: 0.9985
   File path: data/audios/joker_laugh.wav

2. The Joker with green hair, white face paint, and a sinister smile in an urban night setting. (vision)
   Similarity: 0.6068
   File path: data/images/joker_laughing.png

3. Suspect dancing (vision)
   Similarity: 0.5591
   File path: data/images/jdancing.png

We can now analyze the retrieved evidence and determine its relevance to the case.
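When reading these numbers, it helps to know how the score relates to the raw cosine similarity. For a dense_vector field whose similarity is set to cosine, Elasticsearch documents the kNN _score as (1 + cosine) / 2, so scores fall between 0 and 1. The small sketch below inverts that formula. It assumes the cosine mapping used in this post and no additional scoring clauses in the query; score_to_cosine is a hypothetical helper name.

```python
def score_to_cosine(score):
    """Invert the cosine kNN scoring formula: _score = (1 + cosine) / 2."""
    return 2.0 * score - 1.0


# Scores taken from the audio search output in this post
for score in (0.9985, 0.6068, 0.5591):
    print(f"_score={score:.4f} -> cosine={score_to_cosine(score):.4f}")
```

Under that assumption, the 0.6068 score of the second hit corresponds to a cosine of roughly 0.21, so it is a much weaker match than the top hit even though both scores sit above 0.5.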

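Beyond audio - Exploring multimodal search

Reversing the roles: Any modality can be the "question"

In our multimodal RAG system, every modality is a potential search query. Let's go beyond the audio example and see how other data types can start an investigation.

1) Search by text (deciphering the criminal's notes)
Scenario: You found an encrypted text message and want to find related evidence.

python stages/03-stage/search_by_text.py

# Generate embedding from text
text = "Why so serious?"
embedding_text = generator.generate_embedding([text], "text")

# Search for related evidence
similar_evidences = es_manager.search_similar(
    query_embedding=embedding_text,
    k=3,
)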

Expected results:

🔎 Similar evidence found:

1. Mysterious note found at the location (text)
   Similarity: 0.7639
   File path: data/texts/riddle.txt

2. The Joker with green hair, white face paint, and a sinister smile in an urban night setting. (vision)
   Similarity: 0.7161
   File path: data/images/joker_laughing.png

3. Why so serious (text)
   Similarity: 0.7132
   File path: data/texts/note2.txt

2) Search by image (tracing a suspicious crime scene)
Scenario: A new crime scene image (crime_scene2.jpg) needs to be compared with the other evidence.

python stages/03-stage/search_by_image.py

# Generate embedding for a suspicious image
vision_embedding = generator.generate_embedding(["data/images/crime_scene2.jpg"], "vision")

# Search for similar evidence in Elasticsearch
similar_evidences = es_manager.search_similar(
    query_embedding=vision_embedding,
    k=3,
)

Output:

🔎 Similar evidence found:

1. Photo of the crime scene: A dark, rain-soaked alley is filled with playing cards, while a sinister graffiti of the Joker laughing stands out on the brick wall. (vision)
   Similarity: 0.8258
   File path: data/images/crime_scene1.jpg

2. The Joker with green hair, white face paint, and a sinister smile in an urban night setting. (vision)
   Similarity: 0.6897
   File path: data/images/joker_laughing.png

3. Suspect dancing (vision)
   Similarity: 0.6588
   File path: data/images/jdancing.png

3) Search by depth map (3D tracking)
Scenario: A depth map (jdancing-depth.png) reveals visual patterns along the escape route.

python stages/03-stage/search_by_depth.py

# Generate embedding for a suspicious depth map
depth_embedding = generator.generate_embedding(["data/depths/jdancing-depth.png"], "depth")

# Search for similar evidence in Elasticsearch, restricted to the vision modality
similar_evidences = es_manager.search_similar(
    query_embedding=depth_embedding,
    modality="vision",
    k=3,
)

Output:

🔎 Similar evidence found:

1. The Joker with green hair, white face paint, and a sinister smile in an urban night setting. (vision)
   Similarity: 0.5329
   File path: data/images/joker_laughing.png

2. Photo of the crime scene: A dark, rain-soaked alley is filled with playing cards, while a sinister graffiti of the Joker laughing stands out on the brick wall. (vision)
   Similarity: 0.5053
   File path: data/images/crime_scene1.jpg

3. Suspect dancing (vision)
   Similarity: 0.4859
   File path: data/images/jdancing.png

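Why does this matter?

Each modality reveals unique connections:

Text → The suspect's language patterns.
Images → Recognition of locations and objects.
Depth → 3D scene reconstruction.

We now have a structured evidence database in Elasticsearch that lets us store and retrieve multimodal evidence efficiently.

Summary of what we did:

Stored multimodal embeddings in Elasticsearch.
Ran similarity searches to find evidence related to a new clue.
Tested the search with a suspicious audio file to confirm the system works.

Next step:

We will use a large language model (LLM) to analyze the retrieved evidence and generate the final report.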

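Stage 4 - Connecting the dots with an LLM

Now that the evidence is indexed in Elasticsearch and can be retrieved by similarity, we need a large language model (LLM) to analyze it and generate a final report for Commissioner Gordon. The LLM is responsible for identifying patterns, connecting clues, and proposing a likely suspect based on the retrieved evidence.

For this task we use GPT-4 Turbo with a detailed prompt that lets the model interpret the results efficiently.

LLM integration

To integrate the LLM into our system, we created the LLMAnalyzer class (src/llm_analyzer.py), which receives the evidence retrieved from Elasticsearch and generates a forensic report using that evidence as prompt context.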

import os
import logging

from dotenv import load_dotenv
from openai import OpenAI

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class LLMAnalyzer:
    """Evidence analyzer using GPT-4"""

    def __init__(self):
        load_dotenv()
        self.client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

    def analyze_evidence(self, evidence_results):
        """
        Analyzes multimodal search results and generates a report

        Args:
            evidence_results: Dict with results by modality
                {'vision': [...], 'audio': [...], 'text': [...], 'depth': [...]}
        """
        # Format evidence for the prompt (_format_evidence is a method not shown in this excerpt)
        evidence_summary = self._format_evidence(evidence_results)

        # final prompt
        prompt = f"""
You are a highly experienced forensic detective specializing in multimodal evidence analysis. Your task is to analyze the collected evidence (audio, images, text, depth maps) and conclusively determine the **prime suspect** responsible for the Gotham Central Bank case.

---

### **Collected Evidence:**
{evidence_summary}

### **Task:**
1. **Analyze all the evidence** and identify cross-modal connections.
2. **Determine the exact identity of the criminal** based on behavioral patterns, visual/auditory/textual clues, and symbolic markers.
3. **Justify your conclusion** by explaining why this suspect is definitively responsible.
4. **Assign a confidence score (0-100%)** to your conclusion.

---

### **Final Output Format (Strictly Follow This Format):**
- **Prime Suspect:** [Full Name or Alias]
- **Evidence Supporting Conclusion:** [Detailed breakdown of visual, auditory, textual, and behavioral evidence]
- **Behavioral Patterns:** [Key actions, motives, and criminal signature]
- **Confidence Level:** [0-100%]
- **Next Steps (if any):** [What additional evidence would further confirm the identity? If none, state "No further evidence required."]

If there is **insufficient evidence**, specify exactly what is missing and suggest what additional data would be needed for a conclusive identification.

This report must be **direct and definitive**--avoid speculation and provide a final, actionable determination of the suspect's identity.
"""
        try:
            response = self.client.chat.completions.create(
                model="gpt-4-turbo-preview",
                messages=[
                    {
                        "role": "system",
                        "content": "You are a forensic detective specialized in multimodal evidence analysis.",
                    },
                    # Pass the prompt built above
                    {"role": "user", "content": prompt},
                ],
                temperature=0.5,
                max_tokens=1000,
            )
            report = response.choices[0].message.content
            logger.info("\n📋 Forensic Report Generated:")
            logger.info("=" * 50)
            logger.info(report)
            logger.info("=" * 50)
            return report
        except Exception as e:
            logger.error(f"Error generating report: {str(e)}")
            return None

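Temperature setting in the LLM analysis:

For our forensic analysis system, we use a moderate temperature of 0.5. We chose this balanced setting because:

It is a middle ground between deterministic (overly rigid) and highly random output;
At 0.5, the model stays structured enough to produce logical, defensible forensic conclusions;
It lets the model recognize patterns and make connections while staying within reasonable bounds for forensic analysis;
It balances the need for consistent, reliable output with the ability to generate insightful analysis.

This moderate temperature helps keep our forensic analysis both reliable and insightful, avoiding conclusions that are either too rigid or too speculative.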
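The LLMAnalyzer class calls a _format_evidence method whose body is not shown in this post. As a rough idea of what such a helper might do, here is a hypothetical standalone sketch that turns the per-modality results into a text block for the prompt. It is not the author's implementation; the function name and output layout are assumptions, and it expects each hit to carry the description, content_path, and score keys returned by search_similar.

```python
def format_evidence(evidence_results):
    """Render {modality: [hit, ...]} as a plain-text block for the prompt."""
    lines = []
    for modality, hits in evidence_results.items():
        lines.append(f"{modality.upper()} EVIDENCE:")
        for i, hit in enumerate(hits, start=1):
            description = hit.get("description", "No description")
            path = hit.get("content_path", "N/A")
            score = hit.get("score", 0.0)
            lines.append(f"  {i}. {description} (file: {path}, similarity: {score:.2f})")
        lines.append("")  # Blank line between modalities
    return "\n".join(lines).strip()


sample = {
    "audio": [
        {
            "description": "A sinister laugh captured near the crime scene",
            "content_path": "data/audios/joker_laugh.wav",
            "score": 0.9985,
        }
    ],
}
print(format_evidence(sample))
```

Keeping the similarity score next to each description gives the model something concrete to cite when it justifies its conclusion.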

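Running the evidence analysis

Now that the LLM integration is complete, we need a script that connects all the system components. The script will:

Search Elasticsearch for similar evidence.
Use the LLM to analyze the retrieved evidence and generate the final report.

Code: Evidence analysis script

python stages/04-stage/rag_crime_analyze.py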

import sys
import os

sys.path.append(os.path.join(os.path.dirname(os.path.dirname(__file__)), 'src'))

from embedding_generator import EmbeddingGenerator
from elastic_manager import ElasticsearchManager
from llm_analyzer import LLMAnalyzer

import json
import logging
from dotenv import load_dotenv

# Setup logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Load environment variables
load_dotenv()

# Initialize classes
generator = EmbeddingGenerator()
es_manager = ElasticsearchManager()
llm = LLMAnalyzer()
logger.info("✅ All components initialized successfully")

try:
    evidence_data = {}

    # Get data for each modality
    test_files = {
        'vision': 'data/images/crime_scene2.jpg',
        'audio': 'data/audios/joker_laugh.wav',
        'text': 'Why so serious?',
        'depth': 'data/depths/jdancing-depth.png',
    }

    logger.info("🔍 Collecting evidence...")
    for modality, test_input in test_files.items():
        try:
            if modality == 'text':
                embedding = generator.generate_embedding([test_input], modality)
            else:
                embedding = generator.generate_embedding([str(test_input)], modality)

            results = es_manager.search_similar(embedding, k=2)
            if results:
                evidence_data[modality] = results
                logger.info(f"✅ Data retrieved for {modality}: {len(results)} results")
            else:
                logger.warning(f"⚠️ No results found for {modality}")
        except Exception as e:
            logger.error(f"❌ Error retrieving {modality} data: {str(e)}")

    if not evidence_data:
        raise ValueError("No evidence data found in Elasticsearch!")

    # Test forensic report generation
    logger.info("\n📝 Generating forensic report...")
    report = llm.analyze_evidence(evidence_data)
    if report:
        logger.info("✅ Forensic report generated successfully")
        logger.info("\n📊 Report Preview:")
        logger.info("+" * 50)
        logger.info(report)
        logger.info("+" * 50)
    else:
        raise ValueError("Failed to generate forensic report")
except Exception as e:
    logger.error(f"❌ Error in analysis : {str(e)}")

Expected LLM output:

**Prime Suspect:** The Joker

**Evidence Supporting Conclusion:**

- **Visual Evidence:**
  - The photo of the crime scene with playing cards scattered around and the graffiti of the Joker laughing matches the Joker's known calling cards and thematic elements. The similarity score of 0.83 indicates a high likelihood that these elements are directly associated with the Joker.
  - The image of the Joker with green hair, white face paint, and a sinister smile in an urban night setting, although with a lower similarity score of 0.69, still supports the presence or recent activity of the Joker in areas consistent with the crime scene's characteristics.
- **Auditory Evidence:**
  - The captured sinister laugh with a similarity score of 1.00 perfectly matches known audio profiles of the Joker, making it a direct auditory signature of his presence at or near the crime scene.
  - Despite the lower similarity score of 0.61, the second audio piece further corroborates the Joker's involvement through thematic consistency.
- **Textual Evidence:**
  - The mysterious note found at the location, with a similarity score of 0.76, likely contains thematic or direct references to the Joker's modus operandi or signature phrases, further implicating him in the crime.
  - The similarity score of 0.72 for the Joker's description in textual evidence reinforces the thematic connection to the crime scene.
- **Depth Evidence:**
  - Depth sensor capture of the suspect with a similarity score of 0.77 suggests a physical presence matching the Joker's known dimensions or characteristic movements.
  - The lower similarity score of 0.53 in the second depth evidence still contributes to the overall pattern of evidence pointing towards the Joker, albeit with less certainty.

**Behavioral Patterns:**
- The Joker is known for his theatrical crimes, often leaving behind a signature trail of chaos, including playing cards, sinister laughter, and thematic graffiti. These elements are not only consistent with his known criminal signature but also directly observed at the crime scene.
- His motives often include creating chaos, drawing attention to his acts, and challenging his arch-nemesis, Batman, making a high-profile bank heist fitting within his behavioral patterns.

**Confidence Level:** 95%

**Next Steps:** No further evidence required.

The combination of visual, auditory, textual, and depth evidence strongly points to the Joker as the prime suspect. The thematic consistency across multiple modes of evidence, combined with known behavioral patterns and criminal signature, leaves little doubt regarding his involvement. While there is always a small margin of uncertainty in forensic analysis, the evidence at hand provides a compelling case against the Joker with a high degree of confidence.
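Because the prompt asks for a strict output format, the report can also be read programmatically. The sketch below pulls the suspect and the confidence level out of the report text with regular expressions. It is a minimal illustration under the assumption that the model follows the bold-label format shown above; parse_report is a hypothetical helper, and a report that deviates from the format simply yields None for the missing field.

```python
import re


def parse_report(report):
    """Extract the prime suspect and confidence percentage from the report text."""
    # The name runs from the label up to the next line break or markdown marker
    suspect = re.search(r"\*\*Prime Suspect:\*\*[ \t]*([^\n*]+)", report)
    confidence = re.search(r"\*\*Confidence Level:\*\*[ \t]*(\d{1,3})[ \t]*%", report)
    return {
        "prime_suspect": suspect.group(1).strip() if suspect else None,
        "confidence": int(confidence.group(1)) if confidence else None,
    }


sample_report = (
    "**Prime Suspect:** The Joker\n"
    "**Behavioral Patterns:** theatrical crimes\n"
    "**Confidence Level:** 95%\n"
)
print(parse_report(sample_report))
```

A downstream step could use the parsed confidence to decide, for example, whether to trigger another round of evidence retrieval before closing the case.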

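Conclusion: Case closed

By collecting and analyzing all the clues, the multimodal RAG system identified the suspect: the Joker.

By using ImageBind to map images, audio, text, and depth maps into a shared vector space, the system detected connections that manual analysis would miss. Elasticsearch provided fast, efficient search, and the LLM synthesized the evidence into a clear, conclusive report.

However, the real potential of this system goes far beyond Gotham City. The multimodal RAG architecture opens the door to many real-world applications:

Urban surveillance: Identifying suspects from image, audio, and sensor data.
Forensic analysis: Correlating evidence from multiple sources to solve complex cases.
Multimedia recommendation: Building recommendation systems that understand multimodal context (for example, recommending music based on an image or text).
Social media trends: Detecting trending topics across different data formats.

Now that you have learned how to build a multimodal RAG system, why not test it with your own clues?

Share your findings and help the community advance multimodal AI!

Special thanks

I would like to thank Adrian Cole for his valuable contributions and review while defining the deployment architecture for this code.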



Original article: Building a Multimodal RAG system with Elasticsearch: The story of Gotham City - Elasticsearch Labs