LangChain 入门7 格式化输出

概述：

LangChain 提供的格式化输出功能具有多个优势，这些优势在处理和分析由 AI 生成的内容时尤其有用：

结构化数据：格式化输出允许 AI 的回应以结构化的方式呈现，如 JSON 对象，这使得数据更易于解析和处理。
清晰的信息层次：通过格式化输出，可以清晰地区分不同的信息部分，例如回答、理由、来源引用等，从而提高信息的可读性和可用性。
自动化处理：结构化的输出可以被自动化工具和流程直接使用，无需额外的解析步骤，这提高了效率并减少了人为错误。
易于集成：格式化的输出更容易与其他系统和应用程序集成，因为它们可以被设计为符合特定的 API 响应格式。
增强的分析能力：结构化数据支持更复杂的数据分析和机器学习任务，因为它可以被直接输入到数据分析工具中。
更好的用户体验：对于最终用户而言，格式化的输出可以提供更整洁和直观的展示方式，改善他们的整体体验。
错误处理：当出现错误或异常时，格式化的输出可以帮助快速识别问题所在，因为错误信息和状态码也可以被结构化地展示。
可扩展性：随着业务需求的发展，结构化的输出模式可以更容易地扩展和修改，以适应新的数据字段或响应格式。
多用途：格式化的输出不仅可以用于内部系统，还可以用于生成最终用户可以直接消费的输出，如网页内容或移动应用界面。
一致性：确保所有输出遵循相同的格式可以提高整个系统的一致性，使开发和维护更加容易。

对提取质量的解析方法

将模型温度设置为0。

model = Ollama(model="llama2", temperature=0)

优化提示，切中要害。
记录模式并向LLM提供更多信息，可以使用文档字符串或注释
案例：

from typing import Literal, Optional
from pydantic import BaseModel, Field
"""
属性：
transcript_search（str）：搜索视频记录内容的查询。
title_search（str）：视频标题的搜索查询。
view_count（可选[Tuple[int，int]]）：视频视图计数的范围过滤器。
指定为元组（min_views，max_views）。
publish_date（可选[Tuple[datetime.date，datetime.date]]）：视频发布日期的范围过滤器。
指定为元组（start_date，end_date）。
length（可选[Tuple[int，int]]）：以秒为单位的视频长度范围过滤器。
指定为元组（min_length，max_length）。
"""class VideoQuerySchema(BaseModel):"""属性：transcript_search（str）：搜索视频记录内容的查询。title_search（str）：视频标题的搜索查询。view_count（可选[Tuple[int，int]]）：视频视图计数的范围过滤器。指定为元组（min_views，max_views）。publish_date（可选[Tuple[datetime.date，datetime.date]]）：视频发布日期的范围过滤器。指定为元组（start_date，end_date）。length（可选[Tuple[int，int]]）：以秒为单位的视频长度范围过滤器。指定为元组（min_length，max_length）。"""transcript_search: str = Field(..., description="Search query for video transcript content")title_search: str = Field(..., description="Search query for video title")view_count: Optional[Tuple[int, int]] = Field(None, description="Range filter for video view count")publish_date: Optional[Tuple[datetime.date, datetime.date]] = Field(None, description="Range filter for video publish date")length: Optional[Tuple[int, int]] = Field(None, description="Range filter for video length in seconds")

提供参考示例!不同的示例会有所帮助，包括不需要提取任何内容的示例。
如果有很多示例，可以使用检索器检索最相关的示例

retriever = vectorstore.as_retriever(k=5)#创建检索器：从矢量存储中创建一个检索器实例，指定要检索的相关文档数（k）。

使用可用的最佳LLM/Chat模型（例如，gpt-4、claude-3等）进行基准测试
如果架构非常大，试将其分解为多个较小的架构，运行单独的提取并合并结果。
确保模式允许模型拒绝提取信息。模型可能会自己编造信息。
案例：

class ExtractionResultSchema(BaseModel):extracted_info: Optional[str] = Field(None, description="The extracted information, if any.")reject_extraction: bool = Field(False, description="Whether the model decided to reject the extraction.")rejection_reason: Optional[str] = Field(None, description="The reason for rejecting the extraction, if applicable.")# Example usage
result = ExtractionResultSchema(extracted_info=None,reject_extraction=True,rejection_reason="The input text does not contain any relevant information to extract."
)#reject_extraction：一个布尔字段，指示模型是否决定拒绝提取。
#rejection_reason：一个可选的字符串字段，用于提供拒绝提取的原因。

增加验证/校正步骤（要求LLM校正或验证提取结果）。

verification_result = llm(verification_prompt.format(text=text, extracted_info=extracted_info))
print(verification_result)

代码实现-使用内置的PydanticOutputParser来解析聊天模型的输出

加载模型

from langchain_community.llms import Ollamamodel = Ollama(model="llama2", temperature=0)

加载其他依赖

from typing import List, Optionalfrom langchain.output_parsers import PydanticOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field, validator

创建提示模版

class Person(BaseModel):"""关于个人的信息."""#这里就是记录摸索#它允许您为Pydantic模型中的字段指定元数据和验证规则。#设置默认值：可以使用默认参数为字段设置默认值。例如：x:int=Field（默认值为0）。   name: str = Field(..., description="The name of the person")height_in_meters: float = Field(..., description="The height of the person expressed in meters.")class People(BaseModel):"""识别文本中所有的信息."""people: List[Person]# 设置解析器
parser = PydanticOutputParser(pydantic_object=People)
#PydanticOutputParser允许将Pydantic模型指定为语言模型所需的输出格式。它为语言模型提供了如何根据指定的Pydantic模型模式构建其输出的说明。parser.get_format_instructions()

返回：
在这里插入图片描述

这里不要执行只是做输出对比

#这里不要执行，只是提供一个对比的示例
class Book(BaseModel):title: str = Field(description="The title of the book")author: str = Field(description="The author of the book")genres: List[str] = Field(description="A list of genres the book belongs to")# Create an instance of the PydanticOutputParser
parser = PydanticOutputParser(pydantic_object=Book)
parser.get_format_instructions()

返回：
在这里插入图片描述
从这里看 get_format_instructions
提供相同的提示词模版

'The output should be formatted as a JSON instance that conforms to
the JSON schema below.\n\nAs an example, for the schema {“properties”:
{“foo”: {“title”: “Foo”, “description”: “a list of strings”, “type”:
“array”, “items”: {“type”: “string”}}}, “required”: [“foo”]}\nthe
object {“foo”: [“bar”, “baz”]} is a well-formatted instance of the
schema. The object {“properties”: {“foo”: [“bar”, “baz”]}} is not
well-formatted.\n\nHere is the output schema:\n```\n

翻译：

'输出应格式化为符合以下JSON架构的JSON实例。\例如，对于架构｛“properties”：｛“foo”：{“title”：“foo”，“description”：“字符串列表”、“type”：“array”、“items”：｛。对象｛“properties”：｛“foo”：[“bar”，“baz”]}｝的格式不正确。\n\n这是输出架构：\n```\n

这里告诉大模型需要输出的具体结构正确与错误的案例。后面内容就是给到提示模版的自定义内容

Prompt 从信息中构建提示词模版

# Prompt 从信息中构建提示词模版
prompt = ChatPromptTemplate.from_messages([("system","Answer the user query. Wrap the output in `json` tags\n{format_instructions}",#以json 的形式返回数据),("human", "{query}"),]
).partial(format_instructions=parser.get_format_instructions()) #partial方法用于创建一个新的PromptTemplate，其中一些输入变量已经填充或“partially”。

prompt 的详细内容
在这里插入图片描述

输出

query = "Anna is 23 years old and she is 6 feet tall"
print(prompt.format_prompt(query=query).to_string())#以str 形式查看提示词的模拟

返回：
在这里插入图片描述

#构建LCEL
chain = prompt | model | parser
chain.invoke({"query": query})

People(people=[Person(name=‘Anna’, height_in_meters=1.82)])
#这里看到返回的内容很简洁

其他样式

提示词模版

# Prompt
prompt = ChatPromptTemplate.from_messages([("system","Answer the user query. Output your answer as JSON that  ""matches the given schema: ```json\n{schema}\n```. ""Make sure to wrap the answer in ```json and ```tags",),("human", "{query}"),]
).partial(schema=People.schema()) #LangChain中的.schema（）方法用于获取Pydantic模型的模式或结构。它返回模型模式的字典表示，这对于内省、文档或生成JSON模式表示非常有用。

自定义解析器

# Custom parser
def extract_json(message: AIMessage) -> List[dict]:"""从字符串中提取JSON内容，JSON嵌入在```JSON和```标记之间。参数：text（str）：包含JSON内容的文本。退货：list：提取的JSON字符串的列表。"""text = message#.content# 定义正则表达式模式以匹配JSON块spattern = r"```json(.*?)```"# 在字符串中查找模式的所有非重叠匹配matches = re.findall(pattern, text, re.DOTALL)# 返回匹配的JSON字符串列表，去掉任何开头或结尾的空格try:return [json.loads(match.strip()) for match in matches]except Exception:raise ValueError(f"Failed to parse: {message}")

输出

query = "Anna is 23 years old and she is 6 feet tall"
print(prompt.format_prompt(query=query).to_string())
chain = prompt | model | extract_json
chain.invoke({"query": query})