TypeError: expected string or buffer - Langchain, OpenAI Embeddings

news/2024/9/23 16:58:43/

题意:类型错误:期望字符串或缓冲区 - Langchain,OpenAI Embeddings

问题背景:

I am trying to create RAG using the product manuals in pdf which are splitted, indexed and stored in Chroma persisted on a disk. When I try the function that classifies the reviews using the documents context, below is the error I get:

我正在尝试使用 PDF 格式的产品手册创建 RAG,这些手册被拆分、索引并存储在硬盘上的 Chroma 中。当我尝试使用文档上下文对评论进行分类的函数时,出现了以下错误:

from langchain import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain.embeddings import AzureOpenAIEmbeddings
from langchain.chat_models import AzureChatOpenAI
from langchain.vectorstores import Chromallm = AzureChatOpenAI(azure_deployment="ChatGPT-16K",openai_api_version="2023-05-15",azure_endpoint=endpoint,api_key=result["access_token"],temperature=0,seed = 100)embedding_model = AzureOpenAIEmbeddings(api_version="2023-05-15",azure_endpoint=endpoint,api_key=result["access_token"],azure_deployment="ada002",
)vectordb = Chroma(persist_directory=vector_db_path,embedding_function=embedding_model,collection_name="product_manuals",
)def format_docs(docs):return "\n\n".join(doc.page_content for doc in docs)def classify (review_title, review_text, product_num):template = """You are a customer service AI Assistant that handles responses to negative product reviews. Use the context below and categorize {review_title} and {review_text} into defect, misuse or poor quality categories based only on provided context. If you don't know, say that you do not know, don't try to make up an answer. Respond back with an answer in the following format:poor qualitymisusedefect{context}Category: """rag_prompt = PromptTemplate.from_template(template)retriever = vectordb.as_retriever(search_type="similarity", search_kwargs={'filter': {'product_num': product_num}})retrieval_chain = ({"context": retriever | format_docs, "review_title: RunnablePassthrough(), "review_text": RunnablePassthrough()}| rag_prompt| llm| StrOutputParser())return retrieval_chain.invoke({"review_title": review_title, "review_text": review_text})classify(review_title="Terrible", review_text ="This baking sheet is terrible. It stains so easily and i've tried everything to get it clean", product_num ="8888999")

Error stack:        错误信息:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File <command-3066972537097411>, line 1
----> 1 issue_recommendation(2     review_title="Terrible",3     review_text="This baking sheet is terrible. It stains so easily and i've tried everything to get it clean. I've maybe used it 5 times and it looks like it's 20 years old. The side of the pan also hold water, so when you pick it up off the drying rack, water runs out. I would never purchase these again.",4     product_num="8888999"5    6 )File <command-3066972537097410>, line 44, in issue_recommendation(review_title, review_text, product_num)36 retriever = vectordb.as_retriever(search_type="similarity", search_kwargs={'filter': {'product_num': product_num}})38 retrieval_chain = (39         {"context": retriever | format_docs, "review_text": RunnablePassthrough()}40         | rag_prompt41         | llm42         | StrOutputParser()43 )
---> 44 return retrieval_chain.invoke({"review_title":review_title, "review_text": review_text})File /local_disk0/.ephemeral_nfs/envs/pythonEnv-65a09d8c-062d-4f4f-9c52-1bf534f6511e/lib/python3.10/site-packages/langchain_core/runnables/base.py:1762, in RunnableSequence.invoke(self, input, config)1760 try:1761     for i, step in enumerate(self.steps):
-> 1762         input = step.invoke(1763             input,1764             # mark each step as a child run1765             patch_config(1766                 config, callbacks=run_manager.get_child(f"seq:step:{i+1}")1767             ),1768         )1769 # finish the root run1770 except BaseException as e:File /local_disk0/.ephemeral_nfs/envs/pythonEnv-65a09d8c-062d-4f4f-9c52-1bf534f6511e/lib/python3.10/site-packages/langchain_core/runnables/base.py:2327, in RunnableParallel.invoke(self, input, config)2314     with get_executor_for_config(config) as executor:2315         futures = [2316             executor.submit(2317                 step.invoke,(...)2325             for key, step in steps.items()2326         ]
-> 2327         output = {key: future.result() for key, future in zip(steps, futures)}2328 # finish the root run2329 except BaseException as e:File /local_disk0/.ephemeral_nfs/envs/pythonEnv-65a09d8c-062d-4f4f-9c52-1bf534f6511e/lib/python3.10/site-packages/langchain_core/runnables/base.py:2327, in <dictcomp>(.0)2314     with get_executor_for_config(config) as executor:2315         futures = [2316             executor.submit(2317                 step.invoke,(...)2325             for key, step in steps.items()2326         ]
-> 2327         output = {key: future.result() for key, future in zip(steps, futures)}2328 # finish the root run2329 except BaseException as e:File /usr/lib/python3.10/concurrent/futures/_base.py:451, in Future.result(self, timeout)449     raise CancelledError()450 elif self._state == FINISHED:
--> 451     return self.__get_result()453 self._condition.wait(timeout)455 if self._state in [CANCELLED, CANCELLED_AND_NOTIFIED]:File /usr/lib/python3.10/concurrent/futures/_base.py:403, in Future.__get_result(self)401 if self._exception:402     try:
--> 403         raise self._exception404     finally:405         # Break a reference cycle with the exception in self._exception406         self = NoneFile /usr/lib/python3.10/concurrent/futures/thread.py:58, in _WorkItem.run(self)55     return57 try:
---> 58     result = self.fn(*self.args, **self.kwargs)59 except BaseException as exc:60     self.future.set_exception(exc)File /local_disk0/.ephemeral_nfs/envs/pythonEnv-65a09d8c-062d-4f4f-9c52-1bf534f6511e/lib/python3.10/site-packages/langchain_core/runnables/base.py:1762, in RunnableSequence.invoke(self, input, config)1760 try:1761     for i, step in enumerate(self.steps):
-> 1762         input = step.invoke(1763             input,1764             # mark each step as a child run1765             patch_config(1766                 config, callbacks=run_manager.get_child(f"seq:step:{i+1}")1767             ),1768         )1769 # finish the root run1770 except BaseException as e:File /local_disk0/.ephemeral_nfs/envs/pythonEnv-65a09d8c-062d-4f4f-9c52-1bf534f6511e/lib/python3.10/site-packages/langchain_core/retrievers.py:121, in BaseRetriever.invoke(self, input, config)117 def invoke(118     self, input: str, config: Optional[RunnableConfig] = None119 ) -> List[Document]:120     config = ensure_config(config)
--> 121     return self.get_relevant_documents(122         input,123         callbacks=config.get("callbacks"),124         tags=config.get("tags"),125         metadata=config.get("metadata"),126         run_name=config.get("run_name"),127     )File /local_disk0/.ephemeral_nfs/envs/pythonEnv-65a09d8c-062d-4f4f-9c52-1bf534f6511e/lib/python3.10/site-packages/langchain_core/retrievers.py:223, in BaseRetriever.get_relevant_documents(self, query, callbacks, tags, metadata, run_name, **kwargs)221 except Exception as e:222     run_manager.on_retriever_error(e)
--> 223     raise e224 else:225     run_manager.on_retriever_end(226         result,227         **kwargs,228     )File /local_disk0/.ephemeral_nfs/envs/pythonEnv-65a09d8c-062d-4f4f-9c52-1bf534f6511e/lib/python3.10/site-packages/langchain_core/retrievers.py:216, in BaseRetriever.get_relevant_documents(self, query, callbacks, tags, metadata, run_name, **kwargs)214 _kwargs = kwargs if self._expects_other_args else {}215 if self._new_arg_supported:
--> 216     result = self._get_relevant_documents(217         query, run_manager=run_manager, **_kwargs218     )219 else:220     result = self._get_relevant_documents(query, **_kwargs)File /local_disk0/.ephemeral_nfs/envs/pythonEnv-65a09d8c-062d-4f4f-9c52-1bf534f6511e/lib/python3.10/site-packages/langchain_core/vectorstores.py:654, in VectorStoreRetriever._get_relevant_documents(self, query, run_manager)650 def _get_relevant_documents(651     self, query: str, *, run_manager: CallbackManagerForRetrieverRun652 ) -> List[Document]:653     if self.search_type == "similarity":
--> 654         docs = self.vectorstore.similarity_search(query, **self.search_kwargs)655     elif self.search_type == "similarity_score_threshold":656         docs_and_similarities = (657             self.vectorstore.similarity_search_with_relevance_scores(658                 query, **self.search_kwargs659             )660         )File /local_disk0/.ephemeral_nfs/envs/pythonEnv-65a09d8c-062d-4f4f-9c52-1bf534f6511e/lib/python3.10/site-packages/langchain_community/vectorstores/chroma.py:348, in Chroma.similarity_search(self, query, k, filter, **kwargs)331 def similarity_search(332     self,333     query: str,(...)336     **kwargs: Any,337 ) -> List[Document]:338     """Run similarity search with Chroma.339 340     Args:(...)346         List[Document]: List of documents most similar to the query text.347     """
--> 348     docs_and_scores = self.similarity_search_with_score(349         query, k, filter=filter, **kwargs350     )351     return [doc for doc, _ in docs_and_scores]File /local_disk0/.ephemeral_nfs/envs/pythonEnv-65a09d8c-062d-4f4f-9c52-1bf534f6511e/lib/python3.10/site-packages/langchain_community/vectorstores/chroma.py:437, in Chroma.similarity_search_with_score(self, query, k, filter, where_document, **kwargs)429     results = self.__query_collection(430         query_texts=[query],431         n_results=k,(...)434         **kwargs,435     )436 else:
--> 437     query_embedding = self._embedding_function.embed_query(query)438     results = self.__query_collection(439         query_embeddings=[query_embedding],440         n_results=k,(...)443         **kwargs,444     )446 return _results_to_docs_and_scores(results)File /local_disk0/.ephemeral_nfs/envs/pythonEnv-65a09d8c-062d-4f4f-9c52-1bf534f6511e/lib/python3.10/site-packages/langchain_community/embeddings/openai.py:691, in OpenAIEmbeddings.embed_query(self, text)682 def embed_query(self, text: str) -> List[float]:683     """Call out to OpenAI's embedding endpoint for embedding query text.684 685     Args:(...)689         Embedding for the text.690     """
--> 691     return self.embed_documents([text])[0]File /local_disk0/.ephemeral_nfs/envs/pythonEnv-65a09d8c-062d-4f4f-9c52-1bf534f6511e/lib/python3.10/site-packages/langchain_community/embeddings/openai.py:662, in OpenAIEmbeddings.embed_documents(self, texts, chunk_size)659 # NOTE: to keep things simple, we assume the list may contain texts longer660 #       than the maximum context and use length-safe embedding function.661 engine = cast(str, self.deployment)
--> 662 return self._get_len_safe_embeddings(texts, engine=engine)File /local_disk0/.ephemeral_nfs/envs/pythonEnv-65a09d8c-062d-4f4f-9c52-1bf534f6511e/lib/python3.10/site-packages/langchain_community/embeddings/openai.py:465, in OpenAIEmbeddings._get_len_safe_embeddings(self, texts, engine, chunk_size)459 if self.model.endswith("001"):460     # See: https://github.com/openai/openai-python/461     #      issues/418#issuecomment-1525939500462     # replace newlines, which can negatively affect performance.463     text = text.replace("\n", " ")
--> 465 token = encoding.encode(466     text=text,467     allowed_special=self.allowed_special,468     disallowed_special=self.disallowed_special,469 )471 # Split tokens into chunks respecting the embedding_ctx_length472 for j in range(0, len(token), self.embedding_ctx_length):File /databricks/python/lib/python3.10/site-packages/tiktoken/core.py:116, in Encoding.encode(self, text, allowed_special, disallowed_special)114     if not isinstance(disallowed_special, frozenset):115         disallowed_special = frozenset(disallowed_special)
--> 116     if match := _special_token_regex(disallowed_special).search(text):117         raise_disallowed_special_token(match.group())119 try:TypeError: expected string or buffer

Embeddings seems to work fine when I test. It also works fine when I remove the context and retriever from the chain. It seems to be related to embeddings. Examples on Langchain website instantiates retriver from Chroma.from_documents() whereas I load Chroma vector store from a persisted path. I also tried invoking with review_text only (instead of review title and review text) but the error persists. Not sure why this is happening. These are the package versions I work:

当我测试时,Embeddings 似乎工作正常。当我从链中移除上下文和检索器时,它也能正常工作。问题似乎与 Embeddings 有关。Langchain 网站上的示例是通过 Chroma.from_documents() 实例化检索器,而我是从已保存的路径加载 Chroma 向量存储。我也尝试仅使用 review_text(而不是 review titlereview text),但错误仍然存在。不确定为什么会这样。这是我使用的包版本:

Name: openai Version: 1.6.1

Name: langchain Version: 0.0.354

问题解决:

I've come across the same issue, and turned out that langchain pass a key-value pair as an input to the encoding.code() while it requires str type. A work around is by using itemgetter() to get the direct string input. It might be something like this

我也遇到了同样的问题,发现是由于 langchain 将一个键值对作为输入传递给 encoding.code(),而它需要的是 str 类型。一个解决方法是使用 itemgetter() 来获取直接的字符串输入。可能是这样的:

        retrieval_chain = ({"document": itemgetter("question") | self.retriever,"question": itemgetter("question"),}| prompt| model| StrOutputParser())

You can find the reference here

你可以在这里找到参考资料。


http://www.ppmy.cn/news/1529414.html

相关文章

IDEA Cody 插件实现原理

近年来&#xff0c;智能编程助手 在开发者日常工作中变得越来越重要。IDEA Cody 插件是 JetBrains 生态中一个重要的插件&#xff0c;它可以帮助开发者 快速生成代码、自动补全、并提供智能提示&#xff0c;从而大大提升开发效率。今天我们将深入探讨 Cody 插件的实现原理&…

《黑神话悟空》开发框架与战斗系统解析

本文主要围绕《黑神话悟空》的开发框架与战斗系统解析展开 主要内容 《黑神话悟空》采用的技术栈 《黑神话悟空》战斗系统的实现方式 四种攻击模式 连招系统的创建 如何实现高扩展性的战斗系统 包括角色属性系统、技能配置文件和逻辑节点的抽象等关键技术点 版权声明 本…

Redis技术解析(基础篇)

1.初识Redis Redis是一种键值型的NoSql数据库&#xff0c;这里有两个关键字&#xff1a; 键值型 Redis-server NoSql 其中键值型&#xff0c;是指Redis中存储的数据都是以key、value对的形式存储&#xff0c;而value的形式多种多样&#xff0c;可以是字符串、数值、甚至jso…

Pandas -----------------------基础知识(二)

dataframe读写数据操作 import pandas as pd# 准备数据(字典) data [[1, 张三, 1999-3-10, 18],[2, 李四, 2002-3-10, 15],[3, 王五, 1990-3-10, 33],[4, 隔壁老王, 1983-3-10, 40] ]df pd.DataFrame(data, columns[id, name, birthday, age]) df写到csv文件中 &#xff0c;…

智慧火灾应急救援航拍检测数据集(无人机视角)

智慧火灾应急救援。 无人机&#xff0c;直升机等航拍视角下火灾应急救援检测数据集&#xff0c;数据分别标注了火&#xff0c;人&#xff0c;车辆这三个要素内容&#xff0c;29810张高清航拍影像&#xff0c;共31GB&#xff0c;适合森林防火&#xff0c;应急救援等方向的学术研…

FEAD:fNIRS-EEG情感数据库(视频刺激)

摘要 本文提出了一种可用于训练情绪识别模型的fNIRS-EEG情感数据库——FEAD。研究共记录了37名被试的脑电活动和脑血流动力学反应&#xff0c;以及被试对24种情绪视听刺激的分类和维度评分。探讨了神经生理信号与主观评分之间的关系&#xff0c;并在前额叶皮层区域发现了显著的…

2024 vue3入门教程:02 我的第一个vue页面

1.打开src下的App.vue&#xff0c;删除所有的默认代码 2.更换为自己写的代码&#xff0c; 变量msg&#xff1a;可以自定义为其他&#xff08;建议不要使用vue的关键字&#xff09; 我的的第一个vue&#xff1a;可以更换为其他自定义文字 3.运行命令两步走 下载依赖 cnpm i…

【PyTorch】深入浅出PyTorch

为什么要学习PyTorch Why learn PyTorch PyTorch日益增长的发展速度与深度学习时代的迫切需求 PyTorch实验模型训练 数据 模型 损失函数 优化器 迭代训练 模型应用 如何学习和掌握PyTorch 勤动手 成体系 构建知识体系 熟悉知识分布 对应查缺补漏 多总结