大模型工程师学习日记(十一):FAISS 高效相似度搜索和密集向量聚类的库

devtools/2025/3/5 5:30:12/

Facebook AI Similarity Search (Faiss /Fez/) 是一个用于高效相似度搜索和密集向量聚类的库。它包含了在任意大小的向量集合中进行搜索的算法,甚至可以处理可能无法完全放入内存的向量集合。它还包含用于评估和参数调整的支持代码。

Faiss 官方文档:Welcome to Faiss Documentation — Faiss documentation

下面展示如何使用与 FAISS 向量数据库相关的功能。它将展示特定于此集成的功能。在学习完这些内容后,探索相关的用例页面可能会很有帮助,以了解如何将这个向量存储作为更大链条的一部分来使用。

设置

该集成位于 langchain-community 包中。我们还需要安装 faiss 包本身。我们还将使用 OpenAI 进行嵌入,因此需要安装这些要求。我们可以使用以下命令进行安装:

pip install -U langchain-community faiss-cpu langchain-openai tiktoken

请注意,如果您想使用启用了 GPU 的版本,也可以安装 faiss-gpu

设置 LangSmith 以获得最佳的可观测性也会很有帮助(但不是必需的)

# os.environ["LANGCHAIN_TRACING_V2"] = "true"
# os.environ["LANGCHAIN_API_KEY"] = ""

导入

在这里,我们将文档导入到向量存储中。

#示例:faiss_search.py
# 如果您需要使用没有 AVX2 优化的 FAISS 进行初始化,请取消下面一行的注释
# os.environ['FAISS_NO_AVX2'] = '1'
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
loader = TextLoader("../../resource/knowledge.txt", encoding="UTF-8")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1500, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
embeddings = OpenAIEmbeddings()
db = FAISS.from_documents(docs, embeddings)
print(db.index.ntotal)

输出结果

16

查询

现在,我们可以查询向量存储。有几种方法可以做到这一点。最常见的方法是使用 similarity_search

#示例:faiss_search.py
query = "Pixar公司是做什么的?"
docs = db.similarity_search(query)
print(docs[0].page_content)

输出结果

During the next five years, I started a company named NeXT, another company named Pixar, and fell in love with an amazing woman who would become my wife. Pixar went on to create the worlds first computer animated feature film, Toy Story, and is now the most successful animation studio in the world. In a remarkable turn of events, Apple bought NeXT, I retuned to Apple, and the technology we developed at NeXT is at the heart of Apple's current renaissance. And Laurene and I have a wonderful family together.
在接下来的五年里, 我创立了一个名叫 NeXT 的公司,还有一个叫Pixar的公司,然后和一个后来成为我妻子的优雅女人相识。Pixar 制作了世界上第一个用电脑制作的动画电影——“”玩具总动员”,Pixar 现在也是世界上最成功的电脑制作工作室。在后来的一系列运转中,Apple 收购了NeXT,然后我又回到了苹果公司。我们在NeXT 发展的技术在 Apple 的复兴之中发挥了关键的作用。我还和 Laurence 一起建立了一个幸福的家庭。

作为检索器

我们还可以将向量存储转换为 Retriever 类。这使我们能够轻松地在其他 LangChain 方法中使用它,这些方法主要用于检索器。

#示例:faiss_retriever.py
retriever = db.as_retriever()
docs = retriever.invoke(query)
print(docs[0].page_content)

输出结果

During the next five years, I started a company named NeXT, another company named Pixar, and fell in love with an amazing woman who would become my wife. Pixar went on to create the worlds first computer animated feature film, Toy Story, and is now the most successful animation studio in the world. In a remarkable turn of events, Apple bought NeXT, I retuned to Apple, and the technology we developed at NeXT is at the heart of Apple's current renaissance. And Laurene and I have a wonderful family together.
在接下来的五年里, 我创立了一个名叫 NeXT 的公司,还有一个叫Pixar的公司,然后和一个后来成为我妻子的优雅女人相识。Pixar 制作了世界上第一个用电脑制作的动画电影——“”玩具总动员”,Pixar 现在也是世界上最成功的电脑制作工作室。在后来的一系列运转中,Apple 收购了NeXT,然后我又回到了苹果公司。我们在NeXT 发展的技术在 Apple 的复兴之中发挥了关键的作用。我还和 Laurence 一起建立了一个幸福的家庭。

带分数的相似度搜索

还有一些 FAISS 特定的方法。其中之一是 similarity_search_with_score,它允许您返回文档以及查询与它们之间的距离分数。返回的距离分数是 L2 距离。因此,得分越低越好。

#示例:faiss_similarity.py
#返回文档以及查询与它们之间的距离分数。返回的距离分数是 L2 距离。因此,得分越低越好。
docs_and_scores = db.similarity_search_with_score(query)
print(docs_and_scores)
#还可以使用`similarity_search_by_vector`来搜索与给定嵌入向量相似的文档,该函数接受一个嵌入向量作为参数,而不是字符串。
embedding_vector = embeddings.embed_query(query)
docs_and_scores = db.similarity_search_by_vector(embedding_vector)
print(docs_and_scores)
[(Document(metadata={'source': '../../resource/knowledge.txt'}, page_content="During the next five years, I started a company named NeXT, another company named Pixar, and fell in love with an amazing woman who would become my wife. Pixar went on to create the worlds first computer animated feature film, Toy Story, and is now the most successful animation studio in the world. In a remarkable turn of events, Apple bought NeXT, I retuned to Apple, and the technology we developed at NeXT is at the heart of Apple's current renaissance. And Laurene and I have a wonderful family together.\n在接下来的五年里, 我创立了一个名叫 NeXT 的公司,还有一个叫Pixar的公司,然后和一个后来成为我妻子的优雅女人相识。Pixar 制作了世界上第一个用电脑制作的动画电影——“”玩具总动员”,Pixar 现在也是世界上最成功的电脑制作工作室。在后来的一系列运转中,Apple 收购了NeXT,然后我又回到了苹果公司。我们在NeXT 发展的技术在 Apple 的复兴之中发挥了关键的作用。我还和 Laurence 一起建立了一个幸福的家庭。"), 0.3155345), (Document(metadata={'source': '../../resource/knowledge.txt'}, page_content='I was lucky – I found what I loved to do early in life. Woz and I started Apple in my parents garage when I was 20. We worked hard, and in 10 years Apple had grown from just the two of us in a garage into a billion company with over 4000 employees. We had just released our finest creation - the Macintosh - a year earlier, and I had just turned 30. And then I got fired. How can you get fired from a company you started? Well, as Apple grew we hired someone who I thought was very talented to run the company with me, and for the first year or so things went well. But then our visions of the future began to diverge and eventually we had a falling out. When we did, our Board of Directors sided with him. So at 30 I was out. And very publicly out. What had been the focus of my entire adult life was gone, and it was devastating.\n我非常幸运,因为我在很早的时候就找到了我钟爱的东西。沃兹和我在二十岁的时候就在父母的车库里面开创了苹果公司。我们工作得很努力,十年之后,这个公司从那两个车库中的穷光蛋发展到了超过四千名的雇员、价值超过二十亿的大公司。在公司成立的第九年,我们刚刚发布了最好的产品,那就是 Macintosh。我也快要到三十岁了。在那一年,我被炒了鱿鱼。你怎么可能被你自己创立的公司炒了鱿鱼呢?嗯,在苹果快速成长的时候,我们雇用了一个很有天分的家伙和我一起管理这个公司,在最初的几年,公司运转的很好。但是后来我们对未来的看法发生了分歧, 最终我们吵了起来。当争吵不可开交的时候,董事会站在了他的那一边。所以在三十岁的时候,我被炒了。在这么多人的眼皮下我被炒了。在而立之年,我生命的全部支柱离自己远去,这真是毁灭性的打击。'), 0.44481623), (Document(metadata={'source': '../../resource/knowledge.txt'}, page_content="I really didn't know what to do for a few months. I felt that I had let the previous generation of entrepreneurs down - that I had dropped the baton as it was being passed to me. I met with David Packard and Bob Noyce and tried to apologize for screwing up so badly. I was a very public failure, and I even thought about running away from the valley. But something slowly began to dawn on me – I still loved what I did. The turn of events at Apple had not changed that one bit. I had been rejected, but I was still in love. And so I decided to start over.\n在最初的几个月里,我真是不知道该做些什么。我把从前的创业激情给丢了,我觉得自己让与我一同创业的人都很沮丧。我和 David Pack 和 Bob Boyce 见面,并试图向他们道歉。我把事情弄得糟糕透顶了。但是我渐渐发现了曙光,我仍然喜爱我从事的这些东西。苹果公司发生的这些事情丝毫的没有改变这些,一点也没有。我被驱逐了,但是我仍然钟爱它。所以我决定从头再来。\n\nI didn't see it then, but it turned out that getting fired from Apple was the best thing that could have ever happened to me. The heaviness of being successful was replaced by the lightness of being a beginner again, less sure about everything. It freed me to enter one of the most creative periods of my life.\n我当时没有觉察,但是事后证明,从苹果公司被炒是我这辈子发生的最棒的事情。因为,作为一个成功者的极乐感觉被作为一个创业者的轻松感觉所重新代替:对任何事情都不那么特别看重。这让我觉得如此自由,进入了我生命中最有创造力的一个阶段。"), 0.46826816), (Document(metadata={'source': '../../resource/knowledge.txt'}, page_content="I'm pretty sure none of this would have happened if I hadn't been fired from Apple. It was awful tasting medicine, but I guess the patient needed it. Sometimes life hits you in the head with a brick. Don't lose faith. I'm convinced that the only thing that kept me going was that I loved what I did. You've got to find what you love. And that is as true for your work as it is for your lovers. Your work is going to fill a large part of your life, and the only way to be truly satisfied is to do what you believe is great work. And the only way to do great work is to love what you do. If you haven't found it yet, keep looking. Don't settle. As with all matters of the heart, you'll know when you find it. And, like any great relationship, it just gets better and better as the years roll on. So keep looking until you find it. Don't settle.\n我可以非常肯定,如果我不被苹果公司开除的话,这其中一件事情也不会发生的。这个良药的味道实在是太苦了,但是我想病人需要这个药。有些时候,生活会拿起一块砖头向你的脑袋上猛拍一下。不要失去信心,我很清楚唯一使我一直走下去的,就是我做的事情令我无比钟爱。你需要去找到你所爱的东西,对于工作是如此,对于你的爱人也是如此。你的工作将会占据生活中很大的一部分。你只有相信自己所做的是伟大的工作,你才能怡然自得。如果你现在还没有找到,那么继续找、不要停下来、全心全意的去找,当你找到的时候你就会知道的。就像任何真诚的关系,随着岁月的流逝只会越来越紧密。所以继续找,直到你找到它,不要停下来。\n\nMy third story is about death.\n我的第三个故事是关于死亡的。"), 0.4740282)]

保存和加载

您还可以保存和加载 FAISS 索引。这样做很有用,因为您不必每次使用时都重新创建它。

#示例:faiss_save.py
#保存索引
db.save_local("faiss_index")
#读取索引
new_db = FAISS.load_local("faiss_index", embeddings,allow_dangerous_deserialization=True)
docs = new_db.similarity_search(query)
docs[0]

输出结果

page_content='During the next five years, I started a company named NeXT, another company named Pixar, and fell in love with an amazing woman who would become my wife. Pixar went on to create the worlds first computer animated feature film, Toy Story, and is now the most successful animation studio in the world. In a remarkable turn of events, Apple bought NeXT, I retuned to Apple, and the technology we developed at NeXT is at the heart of Apple's current renaissance. And Laurene and I have a wonderful family together.
在接下来的五年里, 我创立了一个名叫 NeXT 的公司,还有一个叫Pixar的公司,然后和一个后来成为我妻子的优雅女人相识。Pixar 制作了世界上第一个用电脑制作的动画电影——“”玩具总动员”,Pixar 现在也是世界上最成功的电脑制作工作室。在后来的一系列运转中,Apple 收购了NeXT,然后我又回到了苹果公司。我们在NeXT 发展的技术在 Apple 的复兴之中发挥了关键的作用。我还和 Laurence 一起建立了一个幸福的家庭。' metadata={'source': '../../resource/knowledge.txt'}


http://www.ppmy.cn/devtools/164655.html

相关文章

Python 绘制迷宫游戏,自带最优解路线

1、需要安装pygame 2、上下左右移动,空格实现物体所在位置到终点的路线,会有虚线绘制。 import pygame import random import math# 迷宫单元格类 class Cell:def __init__(self, x, y):self.x xself.y yself.walls {top: True, right: True, botto…

PyCharm接入本地部署DeepSeek 实现AI编程!【支持windows与linux】

今天尝试在pycharm上接入了本地部署的deepseek,实现了AI编程,体验还是很棒的。下面详细叙述整个安装过程。 本次搭建的框架组合是 DeepSeek-r1:1.5b/7b Pycharm专业版或者社区版 Proxy AI(CodeGPT) 首先了解不同版本的deepsee…

【计算机网络入门】初学计算机网络(九)

目录 1.令牌传递协议 2. 局域网&IEEE802 2.1 局域网基本概念和体系结构 3. 以太网&IEEE802.3 3.1 MAC层标准 3.1.1 以太网V2标准 ​编辑 3.2 单播广播 3.3 冲突域广播域 4. 虚拟局域网VLAN 1.令牌传递协议 先回顾一下令牌环网技术,多个主机形成…

MyBatis的关联映射

前言 在实际开发中,对数据库的操作通常会涉及多张表,MyBatis提供了关联映射,这些关联映射可以很好地处理表与表,对象与对象之间的的关联关系。 一对一查询 步骤: 先确定表的一对一关系确定好实体类,添加关…

vue3+nuxt中监听sessionStorage.setItem,数据发生变化动态获取响应

在写项目时,遇见一个未读消息的功能,当消息已读时,减少一条未读消息提示,需要实现当sessionStorage.setItem存储的数据发生改变时,响应式的获取并显示,但是由于sessionStorage 的 setItem 操作本身并不会触…

深度学习-137-LangGraph之应用实例(六)构建RAG问答系统带条件边分支

文章目录 1 大语言模型2 带分支的RAG系统2.1 处理文本构建Document2.2 向量存储2.3 自定义工具2.4 创建图2.4.1 预构建的MessagesState2.4.2 编排图2.4.3 可视化图2.5 测试调用3 参考附录使用langgraph框架构建一个带有条件分支的智能问答系统。创建一个能够从网页文档中提取信…

mac 安装node提示 nvm install v14.21.3 failed可能存在问题

如果你在 macOS 上使用 nvm(Node Version Manager)安装 Node.js 版本 v14.21.3 时遇到安装失败的问题,可以按照以下步骤进行排查和解决: 1. 确认 nvm 安装是否正确 首先,确认你的 nvm 是否正确安装,并且能…

【FL0093】基于SSM和微信小程序的微信点餐系统小程序

🧑‍💻博主介绍🧑‍💻 全网粉丝10W,CSDN全栈领域优质创作者,博客之星、掘金/知乎/b站/华为云/阿里云等平台优质作者、专注于Java、小程序/APP、python、大数据等技术领域和毕业项目实战,以及程序定制化开发…