zerox - Convert PDF to Markdown using vision models

news/2025/1/15 9:38:18/

7900 Stars 478 Forks 39 Issues 17 Contributors MIT License Language: Python

Code: https://github.com/getomni-ai/zerox

Homepage: OmniAI. Automate document workflows


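zerox builds on vision-model API services to convert PDF documents into Markdown. It first converts the source file (e.g. pdf, docx) into images, then sends each image to a vision model, and finally aggregates all the results into a complete Markdown file.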

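Key features

A dead-simple way to OCR a document for AI ingestion. Documents are meant to be a visual representation after all, with odd layouts, tables, charts, and so on, so using a vision model makes sense.

  • Pass in a file (pdf, docx, image, etc.)
  • Convert that file into a series of images
  • Pass each image to GPT and ask nicely for Markdown
  • Aggregate the responses and return Markdown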
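
The four steps above can be sketched in a few lines of Python. This is only an illustration of the flow, not Zerox's actual code: `convert_document` and `fake_vision_model` are hypothetical names, and the model call is replaced by a local stand-in so the example runs without an API key.

```python
from typing import Callable, List


def convert_document(
    page_images: List[bytes],
    image_to_markdown: Callable[[bytes], str],
) -> str:
    """Send each page image to a vision model and join the per-page Markdown."""
    page_markdown = [image_to_markdown(image) for image in page_images]
    return "\n\n".join(page_markdown)


def fake_vision_model(image: bytes) -> str:
    """Hypothetical stand-in for a real vision-model request (no network)."""
    return f"# Page with {len(image)} bytes"


# Two fake "page images"; a real pipeline would rasterize the input file first.
document = convert_document([b"abc", b"defgh"], fake_vision_model)
print(document)
```

In a real pipeline the stand-in would be replaced by a request to a vision model, and the page images would come from the file-to-image conversion step.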

Node Zerox: installation and usage

npm install zerox

Zerox uses graphicsmagick and ghostscript for the pdf => image processing step. These should be pulled in automatically, but you may need to install them manually.

On Linux, use:

sudo apt-get update
sudo apt-get install -y graphicsmagick

Node usage

**With a file URL**

import { zerox } from "zerox";

const result = await zerox({
  filePath: "https://omni-demo-data.s3.amazonaws.com/test/cs101.pdf",
  openaiAPIKey: process.env.OPENAI_API_KEY,
});

**From a local path**

import path from "path";
import { zerox } from "zerox";

const result = await zerox({
  filePath: path.resolve(__dirname, "./cs101.pdf"),
  openaiAPIKey: process.env.OPENAI_API_KEY,
});

Options

const result = await zerox({
  // Required
  filePath: "path/to/file",
  openaiAPIKey: process.env.OPENAI_API_KEY,

  // Optional
  cleanup: true, // Clear images from tmp after run.
  concurrency: 10, // Number of pages to run at a time.
  correctOrientation: true, // True by default, attempts to identify and correct page orientation.
  errorMode: ErrorMode.IGNORE, // ErrorMode.THROW or ErrorMode.IGNORE, defaults to ErrorMode.IGNORE.
  maintainFormat: false, // Slower but helps maintain consistent formatting.
  maxRetries: 1, // Number of retries to attempt on a failed page, defaults to 1.
  maxTesseractWorkers: -1, // Maximum number of tesseract workers. Zerox will start with a lower number and only reach maxTesseractWorkers if needed.
  model: "gpt-4o-mini", // Model to use (gpt-4o-mini or gpt-4o).
  onPostProcess: async ({ page, progressSummary }) => Promise<void>, // Callback function to run after each page is processed.
  onPreProcess: async ({ imagePath, pageNumber }) => Promise<void>, // Callback function to run before each page is processed.
  outputDir: undefined, // Save combined result.md to a file.
  pagesToConvertAsImages: -1, // Page numbers to convert to image as array (e.g. `[1, 2, 3]`) or a number (e.g. `1`). Set to -1 to convert all pages.
  tempDir: "/os/tmp", // Directory to use for temporary files (default: system temp directory).
  trimEdges: true, // True by default, trims pixels from all edges that contain values similar to the given background colour, which defaults to that of the top-left pixel.
});
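
To make the concurrency, maxRetries, and errorMode options more concrete, here is a minimal Python sketch of how bounded concurrency with per-page retries can be arranged. It is an assumption about the general technique, not a description of Zerox's internals: `process_pages` and `flaky_model` are hypothetical names, and a page that keeps failing is recorded as `None`, loosely mirroring an "ignore" error mode.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Optional


async def process_pages(
    pages: List[str],
    process_page: Callable[[str], Awaitable[str]],
    concurrency: int = 10,
    max_retries: int = 1,
) -> List[Optional[str]]:
    """Process pages with at most `concurrency` in flight, retrying failures."""
    semaphore = asyncio.Semaphore(concurrency)

    async def run_one(page: str) -> Optional[str]:
        async with semaphore:
            # One initial attempt plus `max_retries` retries.
            for _attempt in range(max_retries + 1):
                try:
                    return await process_page(page)
                except Exception:
                    continue
            return None  # give up on this page instead of raising

    # gather() keeps results in the same order as the input pages.
    return list(await asyncio.gather(*(run_one(page) for page in pages)))


attempts: Dict[str, int] = {}


async def flaky_model(page: str) -> str:
    """Hypothetical model call: page_2 fails once, page_3 always fails."""
    attempts[page] = attempts.get(page, 0) + 1
    if page == "page_2" and attempts[page] == 1:
        raise RuntimeError("transient failure")
    if page == "page_3":
        raise RuntimeError("permanent failure")
    return f"markdown for {page}"


results = asyncio.run(
    process_pages(["page_1", "page_2", "page_3"], flaky_model, concurrency=2)
)
print(results)
```

Raising the last exception instead of returning `None` would correspond to a "throw" style error mode.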

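The maintainFormat option tries to return the markdown in a consistent format by passing the output of the previous page in as additional context for the next page. This requires the requests to run synchronously, so it is much slower. It is valuable, however, if your documents contain a lot of tabular data or frequently have tables that span pages.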

Request #1 => page_1_image
Request #2 => page_1_markdown + page_2_image
Request #3 => page_2_markdown + page_3_image
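
The request chain above can be written as a short sequential loop. The sketch below is illustrative only: `convert_sequentially` and `fake_model` are hypothetical names, and the model is a local stand-in that records what context it was given.

```python
from typing import Callable, List, Optional, Tuple

# A model call takes the page image and the previous page's Markdown (if any).
PageModel = Callable[[str, Optional[str]], str]


def convert_sequentially(page_images: List[str], image_to_markdown: PageModel) -> List[str]:
    """Process pages one at a time, feeding each result into the next request."""
    results: List[str] = []
    previous_markdown: Optional[str] = None
    for image in page_images:
        markdown = image_to_markdown(image, previous_markdown)
        results.append(markdown)
        previous_markdown = markdown  # becomes context for the next page
    return results


calls: List[Tuple[str, Optional[str]]] = []


def fake_model(image: str, previous_markdown: Optional[str]) -> str:
    """Hypothetical stand-in that logs its inputs instead of calling an API."""
    calls.append((image, previous_markdown))
    return image.replace("_image", "_markdown")


pages = convert_sequentially(["page_1_image", "page_2_image", "page_3_image"], fake_model)
print(calls)
```

Because each request depends on the previous response, the pages cannot be processed in parallel, which is why this mode is slower.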

Example output

{
  completionTime: 10038,
  fileName: 'invoice_36258',
  inputTokens: 25543,
  outputTokens: 210,
  pages: [
    {
      content: '# INVOICE # 36258\n' +
        '**Date:** Mar 06 2012  \n' +
        '**Ship Mode:** First Class  \n' +
        '**Balance Due:** $50.10  \n' +
        '## Bill To:\n' +
        'Aaron Bergman  \n' +
        '98103, Seattle,  \n' +
        'Washington, United States  \n' +
        '## Ship To:\n' +
        'Aaron Bergman  \n' +
        '98103, Seattle,  \n' +
        'Washington, United States  \n' +
        '\n' +
        '| Item                                       | Quantity | Rate   | Amount  |\n' +
        '|--------------------------------------------|----------|--------|---------|\n' +
        "| Global Push Button Manager's Chair, Indigo | 1        | $48.71 | $48.71  |\n" +
        '| Chairs, Furniture, FUR-CH-4421             |          |        |         |\n' +
        '\n' +
        '**Subtotal:** $48.71  \n' +
        '**Discount (20%):** $9.74  \n' +
        '**Shipping:** $11.13  \n' +
        '**Total:** $50.10  \n' +
        '---\n' +
        '**Notes:**  \n' +
        'Thanks for your business!  \n' +
        '**Terms:**  \n' +
        'Order ID : CA-2012-AB10015140-40974  ',
      page: 1,
      contentLength: 747,
      status: 'SUCCESS',
    }
  ],
  summary: {
    failedPages: 0,
    successfulPages: 1,
    totalPages: 1,
  },
}
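
A result of this shape is easy to post-process. The Python sketch below assumes the result has been loaded into a plain dict with the same keys as the example; `combine_pages` is a hypothetical helper, and the "ERROR" status value is an assumption made for illustration (the example output only shows 'SUCCESS').

```python
from typing import Any, Dict


def combine_pages(result: Dict[str, Any]) -> str:
    """Join the content of successful pages in page order."""
    pages = sorted(result["pages"], key=lambda page: page["page"])
    return "\n\n".join(
        page["content"] for page in pages if page.get("status") == "SUCCESS"
    )


# Sample data shaped like the example output; "ERROR" is a hypothetical status.
sample_result = {
    "pages": [
        {"content": "## Page two", "page": 2, "status": "SUCCESS"},
        {"content": "", "page": 3, "status": "ERROR"},
        {"content": "# Page one", "page": 1, "status": "SUCCESS"},
    ],
    "summary": {"failedPages": 1, "successfulPages": 2, "totalPages": 3},
}

combined = combine_pages(sample_result)
print(combined)
```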

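Python Zerox: installation and usage

(Python SDK - supports vision models from different providers such as OpenAI, Azure OpenAI, Anthropic, AWS Bedrock, etc.)

Installation
  • Install **poppler** on the system; it should be available on the PATH. See the pdf2image documentation for platform-specific instructions.
  • Install py-zerox: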
pip install py-zerox

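The pyzerox.zerox function is an asynchronous API that uses vision models to perform OCR (optical character recognition) to markdown. It processes PDF files and converts them into markdown format. Before using this API, make sure the environment variables for the model and the model provider are set.

Refer to the LiteLLM documentation to set up the environment and pass the correct model name.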


Usage

from pyzerox import zerox
import os
import json
import asyncio

### Model Setup (Use only Vision Models) Refer: https://docs.litellm.ai/docs/providers ###

## placeholder for additional model kwargs which might be required for some models
kwargs = {}

## system prompt to use for the vision model
custom_system_prompt = None

# to override
# custom_system_prompt = "For the below pdf page, do something..something..." ## example

# Keep only one of the provider examples below: each assignment to `model`
# overrides the previous one.

###################### Example for OpenAI ######################
model = "gpt-4o-mini" ## openai model
os.environ["OPENAI_API_KEY"] = "" ## your-api-key

###################### Example for Azure OpenAI ######################
model = "azure/gpt-4o-mini" ## "azure/<your_deployment_name>" -> format <provider>/<model>
os.environ["AZURE_API_KEY"] = "" # "your-azure-api-key"
os.environ["AZURE_API_BASE"] = "" # "https://example-endpoint.openai.azure.com"
os.environ["AZURE_API_VERSION"] = "" # "2023-05-15"

###################### Example for Gemini ######################
model = "gemini/gpt-4o-mini" ## "gemini/<gemini_model>" -> format <provider>/<model>
os.environ['GEMINI_API_KEY'] = "" # your-gemini-api-key

###################### Example for Anthropic ######################
model="claude-3-opus-20240229"
os.environ["ANTHROPIC_API_KEY"] = "" # your-anthropic-api-key

###################### Vertex ai ######################
model = "vertex_ai/gemini-1.5-flash-001" ## "vertex_ai/<model_name>" -> format <provider>/<model>
## GET CREDENTIALS
## RUN ##
# !gcloud auth application-default login - run this to add vertex credentials to your env
## OR ##
file_path = 'path/to/vertex_ai_service_account.json'

# Load the JSON file
with open(file_path, 'r') as file:
    vertex_credentials = json.load(file)

# Convert to JSON string
vertex_credentials_json = json.dumps(vertex_credentials)

vertex_credentials=vertex_credentials_json

## extra args
kwargs = {"vertex_credentials": vertex_credentials}

###################### For other providers refer: https://docs.litellm.ai/docs/providers ######################

# Define main async entrypoint
async def main():
    file_path = "https://omni-demo-data.s3.amazonaws.com/test/cs101.pdf" ## local filepath and file URL supported

    ## process only some pages or all
    select_pages = None ## None for all, but could be int or list(int) page numbers (1 indexed)

    output_dir = "./output_test" ## directory to save the consolidated markdown file
    result = await zerox(file_path=file_path, model=model, output_dir=output_dir,
                        custom_system_prompt=custom_system_prompt,select_pages=select_pages, **kwargs)
    return result


# run the main function:
result = asyncio.run(main())

# print markdown result
print(result)
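
Parameters

async def zerox(
    cleanup: bool = True,
    concurrency: int = 10,
    file_path: Optional[str] = "",
    maintain_format: bool = False,
    model: str = "gpt-4o-mini",
    output_dir: Optional[str] = None,
    temp_dir: Optional[str] = None,
    custom_system_prompt: Optional[str] = None,
    select_pages: Optional[Union[int, Iterable[int]]] = None,
    **kwargs
) -> ZeroxOutput:
    ...

  • **cleanup** (bool, optional): Whether to clean up temporary files after processing. Defaults to True.
  • **concurrency** (int, optional): The number of concurrent processes to run. Defaults to 10.
  • **file_path** (Optional[str], optional): The path to the PDF file to process. Defaults to an empty string.
  • **maintain_format** (bool, optional): Whether to maintain the format from the previous page. Defaults to False.
  • **model** (str, optional): The model to use for generating completions. Defaults to "gpt-4o-mini". Refer to LiteLLM Providers for the correct model name, as it may differ between providers.
  • **output_dir** (Optional[str], optional): The directory in which to save the Markdown output. Defaults to None.
  • **temp_dir** (Optional[str], optional): The directory in which to store temporary files. Defaults to a named folder in the system temp directory. If it already exists, its contents are deleted before zerox uses it.
  • **custom_system_prompt** (Optional[str], optional): The system prompt to use for the model; this overrides zerox's default system prompt. It is generally not needed unless you want some specific behaviour. When set, it raises a friendly warning. Defaults to None.
  • **select_pages** (Optional[Union[int, Iterable[int]]], optional): Pages to process; can be a single page number or an iterable of page numbers. Defaults to None.
  • **kwargs** (dict, optional): Additional keyword arguments to pass to the litellm.completion method. Refer to the LiteLLM documentation and Completion Input for details.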
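
Since select_pages accepts None, a single int, or an iterable of ints, a caller-side helper that normalizes these forms can be handy. The sketch below is a hypothetical helper (`normalize_select_pages` is not part of pyzerox) and assumes 1-indexed page numbers, as noted in the usage example.

```python
from typing import Iterable, List, Optional, Union


def normalize_select_pages(
    select_pages: Optional[Union[int, Iterable[int]]],
    total_pages: int,
) -> List[int]:
    """Turn None, an int, or an iterable of ints into a sorted page list."""
    if select_pages is None:
        # None means "all pages".
        return list(range(1, total_pages + 1))
    if isinstance(select_pages, int):
        candidates = [select_pages]
    else:
        candidates = list(select_pages)
    invalid = [
        page for page in candidates
        if not isinstance(page, int) or page < 1 or page > total_pages
    ]
    if invalid:
        raise ValueError(f"page numbers out of range 1..{total_pages}: {invalid}")
    # Drop duplicates and return pages in ascending order.
    return sorted(set(candidates))


print(normalize_select_pages(None, 3))
print(normalize_select_pages(2, 3))
print(normalize_select_pages([3, 1, 3], 3))
```

Validating page numbers before calling the API avoids spending model requests on a selection that cannot be satisfied.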

Returns

  • ZeroxOutput: Contains the Markdown content generated by the model, plus some metadata (see below).

Example output (output from "azure/gpt-4o-mini")

Note: The output is manually wrapped for this documentation for better readability.

ZeroxOutput(
    completion_time=9432.975,
    file_name='cs101',
    input_tokens=36877,
    output_tokens=515,
    pages=[
        Page(
            content='| Type    | Description                          | Wrapper Class |\n' +
                    '|---------|--------------------------------------|---------------|\n' +
                    '| byte    | 8-bit signed 2s complement integer   | Byte          |\n' +
                    '| short   | 16-bit signed 2s complement integer  | Short         |\n' +
                    '| int     | 32-bit signed 2s complement integer  | Integer       |\n' +
                    '| long    | 64-bit signed 2s complement integer  | Long          |\n' +
                    '| float   | 32-bit IEEE 754 floating point number| Float         |\n' +
                    '| double  | 64-bit floating point number         | Double        |\n' +
                    '| boolean | may be set to true or false          | Boolean       |\n' +
                    '| char    | 16-bit Unicode (UTF-16) character    | Character     |\n\n' +
                    'Table 26.2.: Primitive types in Java\n\n' +
                    '### 26.3.1. Declaration & Assignment\n\n' +
                    'Java is a statically typed language meaning that all variables must be declared before you can use ' +
                    'them or refer to them. In addition, when declaring a variable, you must specify both its type and ' +
                    'its identifier. For example:\n\n' +
                    '```java\n' +
                    'int numUnits;\n' +
                    'double costPerUnit;\n' +
                    'char firstInitial;\n' +
                    'boolean isStudent;\n' +
                    '```\n\n' +
                    'Each declaration specifies the variable’s type followed by the identifier and ending with a ' +
                    'semicolon. The identifier rules are fairly standard: a name can consist of lowercase and ' +
                    'uppercase alphabetic characters, numbers, and underscores but may not begin with a numeric ' +
                    'character. We adopt the modern camelCasing naming convention for variables in our code. In ' +
                    'general, variables must be assigned a value before you can use them in an expression. You do not ' +
                    'have to immediately assign a value when you declare them (though it is good practice), but some ' +
                    'value must be assigned before they can be used or the compiler will issue an error.\n\n' +
                    'The assignment operator is a single equal sign, `=` and is a right-to-left assignment. That is, ' +
                    'the variable that we wish to assign the value to appears on the left-hand-side while the value ' +
                    '(literal, variable or expression) is on the right-hand-side. Using our variables from before, ' +
                    'we can assign them values:\n\n' +
                    '> 2 Instance variables, that is variables declared as part of an object do have default values. ' +
                    'For objects, the default is `null`, for all numeric types, zero is the default value. For the ' +
                    'boolean type, `false` is the default, and the default char value is `\\0`, the null-terminating ' +
                    'character (zero in the ASCII table).',
            content_length=2333,
            page=1
        )
    ]
)

Supported file types

We use a combination of libreoffice and graphicsmagick to perform the document => image conversion. For non-image / non-pdf files, we use libreoffice to convert the file to a pdf, and then to images.

[
  "pdf", // Portable Document Format
  "doc", // Microsoft Word 97-2003
  "docx", // Microsoft Word 2007-2019
  "odt", // OpenDocument Text
  "ott", // OpenDocument Text Template
  "rtf", // Rich Text Format
  "txt", // Plain Text
  "html", // HTML Document
  "htm", // HTML Document (alternative extension)
  "xml", // XML Document
  "wps", // Microsoft Works Word Processor
  "wpd", // WordPerfect Document
  "xls", // Microsoft Excel 97-2003
  "xlsx", // Microsoft Excel 2007-2019
  "ods", // OpenDocument Spreadsheet
  "ots", // OpenDocument Spreadsheet Template
  "csv", // Comma-Separated Values
  "tsv", // Tab-Separated Values
  "ppt", // Microsoft PowerPoint 97-2003
  "pptx", // Microsoft PowerPoint 2007-2019
  "odp", // OpenDocument Presentation
  "otp", // OpenDocument Presentation Template
];
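
The routing described above (pdf goes straight to images, other document types go through libreoffice first) can be sketched as a small lookup. This is a hypothetical helper, not Zerox's code: `conversion_steps` is an invented name, and the set of image extensions is an assumption, since the list above does not enumerate image formats.

```python
import os
from typing import List

# Extensions from the list above that need a libreoffice pass (everything but pdf).
OFFICE_EXTENSIONS = {
    "doc", "docx", "odt", "ott", "rtf", "txt", "html", "htm", "xml", "wps",
    "wpd", "xls", "xlsx", "ods", "ots", "csv", "tsv", "ppt", "pptx", "odp", "otp",
}
# Assumption for illustration only: image formats are not listed in the source.
IMAGE_EXTENSIONS = {"png", "jpg", "jpeg"}


def conversion_steps(file_name: str) -> List[str]:
    """Return the conversion steps needed before pages reach the vision model."""
    extension = os.path.splitext(file_name)[1].lower().lstrip(".")
    if extension in IMAGE_EXTENSIONS:
        return []  # already an image
    if extension == "pdf":
        return ["pdf -> images"]
    if extension in OFFICE_EXTENSIONS:
        return [f"libreoffice: {extension} -> pdf", "pdf -> images"]
    raise ValueError(f"unsupported file type: {file_name!r}")


print(conversion_steps("report.docx"))
print(conversion_steps("slides.PDF"))
```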

