zerox - 使用视觉模型将 PDF 转换为 Markdown

devtools/2025/1/15 4:37:16/

7900 Stars 478 Forks 39 Issues 17 贡献者 MIT License Python 语言

代码: https://github.com/getomni-ai/zerox

主页: OmniAI. Automate document workflows

更多AI开源软件:AI开源 - 小众AI

zerox基于视觉模型 API 服务,提供了将 PDF 文档转化为 Markdown 的功能。其原理是先将原文件(如 pdf、docx)转换为图片,然后把图片发给视觉模型处理,最后汇总所有结果生成完整的 Markdown 文件。

主要功能

一种非常简单的 OCR 文档以进行 AI 摄取的方法。毕竟,文档应该是一种视觉表示。带有奇怪的布局、表格、图表等。视觉模型很有意义!

  • 传入文件(pdf、docx、image 等)
  • 将该文件转换为一系列图像
  • 将每张图片传递给 GPT 并很好地请求 Markdown
  • 聚合响应并返回 Markdown

Node Zerox安装和使用

npm install zerox

Zerox 使用 和 用于 pdf => 图像处理步骤。这些应该会自动拉取,但您可能需要手动安装。graphicsmagickghostscript​

在 linux 上使用:

sudo apt-get update
sudo apt-get install -y graphicsmagick
Node 用法

**使用文件 URL**

import { zerox } from "zerox";const result = await zerox({filePath: "https://omni-demo-data.s3.amazonaws.com/test/cs101.pdf",openaiAPIKey: process.env.OPENAI_API_KEY,
});

**从本地路径**

import path from "path";
import { zerox } from "zerox";const result = await zerox({filePath: path.resolve(__dirname, "./cs101.pdf"),openaiAPIKey: process.env.OPENAI_API_KEY,
});
选项
const result = await zerox({// RequiredfilePath: "path/to/file",openaiAPIKey: process.env.OPENAI_API_KEY,// Optionalcleanup: true, // Clear images from tmp after run.concurrency: 10, // Number of pages to run at a time.correctOrientation: true, // True by default, attempts to identify and correct page orientation.errorMode: ErrorMode.IGNORE, // ErrorMode.THROW or ErrorMode.IGNORE, defaults to ErrorMode.IGNORE.maintainFormat: false, // Slower but helps maintain consistent formatting.maxRetries: 1, // Number of retries to attempt on a failed page, defaults to 1.maxTesseractWorkers: -1, // Maximum number of tesseract workers. Zerox will start with a lower number and only reach maxTesseractWorkers if needed.model: "gpt-4o-mini", // Model to use (gpt-4o-mini or gpt-4o).onPostProcess: async ({ page, progressSummary }) => Promise<void>, // Callback function to run after each page is processed.onPreProcess: async ({ imagePath, pageNumber }) => Promise<void>, // Callback function to run before each page is processed.outputDir: undefined, // Save combined result.md to a file.pagesToConvertAsImages: -1, // Page numbers to convert to image as array (e.g. `[1, 2, 3]`) or a number (e.g. `1`). Set to -1 to convert all pages.tempDir: "/os/tmp", // Directory to use for temporary files (default: system temp directory).trimEdges: true, // True by default, trims pixels from all edges that contain values similar to the given background colour, which defaults to that of the top-left pixel.
});

该选项尝试通过将前一页的输出作为下一页的附加上下文传入,以一致的格式返回 markdown。这需要请求同步运行,因此速度要慢得多。但是,如果您的文档包含大量表格数据,或者经常包含跨页的表格,则此属性很有价值。maintainFormat​

Request #1 => page_1_image
Request #2 => page_1_markdown + page_2_image
Request #3 => page_2_markdown + page_3_image
示例输出
{completionTime: 10038,fileName: 'invoice_36258',inputTokens: 25543,outputTokens: 210,pages: [{content: '# INVOICE # 36258\n' +'**Date:** Mar 06 2012  \n' +'**Ship Mode:** First Class  \n' +'**Balance Due:** $50.10  \n' +'## Bill To:\n' +'Aaron Bergman  \n' +'98103, Seattle,  \n' +'Washington, United States  \n' +'## Ship To:\n' +'Aaron Bergman  \n' +'98103, Seattle,  \n' +'Washington, United States  \n' +'\n' +'| Item                                       | Quantity | Rate   | Amount  |\n' +'|--------------------------------------------|----------|--------|---------|\n' +"| Global Push Button Manager's Chair, Indigo | 1        | $48.71 | $48.71  |\n" +'| Chairs, Furniture, FUR-CH-4421             |          |        |         |\n' +'\n' +'**Subtotal:** $48.71  \n' +'**Discount (20%):** $9.74  \n' +'**Shipping:** $11.13  \n' +'**Total:** $50.10  \n' +'---\n' +'**Notes:**  \n' +'Thanks for your business!  \n' +'**Terms:**  \n' +'Order ID : CA-2012-AB10015140-40974  ',page: 1,contentLength: 747,status: 'SUCCESS',}],summary: {failedPages: 0,successfulPages: 1,totalPages: 1,},
}

Python Zerox安装和使用

(Python SDK - 支持来自不同提供商的视觉模型,如 OpenAI、Azure OpenAI、Anthropic、AWS Bedrock 等)

安装
  • 在系统上安装 **poppler**,它应该在 path 变量中可用。请参阅 pdf2image 文档以获取平台说明。
  • 安装 py-zerox:
pip install py-zerox

该函数是一个异步 API,它使用视觉模型执行 OCR(光学字符识别)以降价。它处理 PDF 文件并将其转换为 markdown 格式。在使用此 API 之前,请确保为模型和模型提供程序设置环境变量。pyzerox.zerox​

请参阅 LiteLLM 文档 来设置环境并传递正确的模型名称。

用法
from pyzerox import zerox
import os
import json
import asyncio### Model Setup (Use only Vision Models) Refer: https://docs.litellm.ai/docs/providers ##### placeholder for additional model kwargs which might be required for some models
kwargs = {}## system prompt to use for the vision model
custom_system_prompt = None# to override
# custom_system_prompt = "For the below pdf page, do something..something..." ## example###################### Example for OpenAI ######################
model = "gpt-4o-mini" ## openai model
os.environ["OPENAI_API_KEY"] = "" ## your-api-key###################### Example for Azure OpenAI ######################
model = "azure/gpt-4o-mini" ## "azure/<your_deployment_name>" -> format <provider>/<model>
os.environ["AZURE_API_KEY"] = "" # "your-azure-api-key"
os.environ["AZURE_API_BASE"] = "" # "https://example-endpoint.openai.azure.com"
os.environ["AZURE_API_VERSION"] = "" # "2023-05-15"###################### Example for Gemini ######################
model = "gemini/gpt-4o-mini" ## "gemini/<gemini_model>" -> format <provider>/<model>
os.environ['GEMINI_API_KEY'] = "" # your-gemini-api-key###################### Example for Anthropic ######################
model="claude-3-opus-20240229"
os.environ["ANTHROPIC_API_KEY"] = "" # your-anthropic-api-key###################### Vertex ai ######################
model = "vertex_ai/gemini-1.5-flash-001" ## "vertex_ai/<model_name>" -> format <provider>/<model>
## GET CREDENTIALS
## RUN ##
# !gcloud auth application-default login - run this to add vertex credentials to your env
## OR ##
file_path = 'path/to/vertex_ai_service_account.json'# Load the JSON file
with open(file_path, 'r') as file:vertex_credentials = json.load(file)# Convert to JSON string
vertex_credentials_json = json.dumps(vertex_credentials)vertex_credentials=vertex_credentials_json## extra args
kwargs = {"vertex_credentials": vertex_credentials}###################### For other providers refer: https://docs.litellm.ai/docs/providers ####################### Define main async entrypoint
async def main():file_path = "https://omni-demo-data.s3.amazonaws.com/test/cs101.pdf" ## local filepath and file URL supported## process only some pages or allselect_pages = None ## None for all, but could be int or list(int) page numbers (1 indexed)output_dir = "./output_test" ## directory to save the consolidated markdown fileresult = await zerox(file_path=file_path, model=model, output_dir=output_dir,custom_system_prompt=custom_system_prompt,select_pages=select_pages, **kwargs)return result# run the main function:
result = asyncio.run(main())# print markdown result
print(result)
参数
async def zerox(cleanup: bool = True,concurrency: int = 10,file_path: Optional[str] = "",maintain_format: bool = False,model: str = "gpt-4o-mini",output_dir: Optional[str] = None,temp_dir: Optional[str] = None,custom_system_prompt: Optional[str] = None,select_pages: Optional[Union[int, Iterable[int]]] = None,**kwargs
) -> ZeroxOutput:...

参数

  • **cleanup** (bool, optional): 是否在处理后清理临时文件。默认为 True。
  • **concurrency** (int,可选): 要运行的并发进程数。默认值为 10。
  • **file_path** (可选[str], 可选): 要处理的 PDF 文件的路径。默认为空字符串。
  • **maintain_format** (bool,可选): 是否保留上一页的格式。默认为 False。
  • **model** (str,可选): 用于生成完成项的模型。默认为 “gpt-4o-mini”。 有关正确的模型名称,请参阅 LiteLLM Providers,因为它可能因提供商而异。
  • **output_dir** (Optional[str], optional): 用于保存 Markdown 输出的目录。默认为 None。
  • **temp_dir** (str,可选): 存储临时文件的目录,默认为系统临时目录中的某个命名文件夹。如果已经存在,则内容将在 zerox 使用之前被删除。
  • **custom_system_prompt** (str,可选): 用于模型的系统提示符,这将覆盖默认的系统提示符 zerox。通常,除非你想要一些特定的行为,否则它不是必需的。设置后,它将引发友好警告。默认为 None。
  • **select_pages** (optional[union[int, Iterable[int]]], 可选): 要处理的页面,可以是单个页码或页码的可迭代对象,默认为 None
  • **kwargs** (dict,可选): 要传递给 litellm.completion 方法的其他关键字参数。 有关详细信息,请参阅 LiteLLM 文档 和 完成输入 。

返回

  • 零x输出: 包含模型生成的 Markdown 内容以及一些元数据(请参阅下文)。
示例输出(“azure/gpt-4o-mini”的输出)

​Note: The output is mannually wrapped for this documentation for better readability.​

ZeroxOutput(completion_time=9432.975,file_name='cs101',input_tokens=36877,output_tokens=515,pages=[Page(content='| Type    | Description                          | Wrapper Class |\n' +'|---------|--------------------------------------|---------------|\n' +'| byte    | 8-bit signed 2s complement integer   | Byte          |\n' +'| short   | 16-bit signed 2s complement integer  | Short         |\n' +'| int     | 32-bit signed 2s complement integer  | Integer       |\n' +'| long    | 64-bit signed 2s complement integer  | Long          |\n' +'| float   | 32-bit IEEE 754 floating point number| Float         |\n' +'| double  | 64-bit floating point number         | Double        |\n' +'| boolean | may be set to true or false          | Boolean       |\n' +'| char    | 16-bit Unicode (UTF-16) character    | Character     |\n\n' +'Table 26.2.: Primitive types in Java\n\n' +'### 26.3.1. Declaration & Assignment\n\n' +'Java is a statically typed language meaning that all variables must be declared before you can use ' +'them or refer to them. In addition, when declaring a variable, you must specify both its type and ' +'its identifier. For example:\n\n' +'‍‍```java\n' +'int numUnits;\n' +'double costPerUnit;\n' +'char firstInitial;\n' +'boolean isStudent;\n' +'‍‍```\n\n' +'Each declaration specifies the variable’s type followed by the identifier and ending with a ' +'semicolon. The identifier rules are fairly standard: a name can consist of lowercase and ' +'uppercase alphabetic characters, numbers, and underscores but may not begin with a numeric ' +'character. We adopt the modern camelCasing naming convention for variables in our code. In ' +'general, variables must be assigned a value before you can use them in an expression. You do not ' +'have to immediately assign a value when you declare them (though it is good practice), but some ' +'value must be assigned before they can be used or the compiler will issue an error.\n\n' +'The assignment operator is a single equal sign, `=` and is a right-to-left assignment. That is, ' +'the variable that we wish to assign the value to appears on the left-hand-side while the value ' +'(literal, variable or expression) is on the right-hand-side. Using our variables from before, ' +'we can assign them values:\n\n' +'> 2 Instance variables, that is variables declared as part of an object do have default values. ' +'For objects, the default is `null`, for all numeric types, zero is the default value. For the ' +'boolean type, `false` is the default, and the default char value is `\\0`, the null-terminating ' +'character (zero in the ASCII table).',content_length=2333,page=1)]
)

支持的文件类型

我们使用 和 的组合来执行 document => 图像转换。对于非图像/非 pdf 文件,我们使用 libreoffice 将该文件转换为 pdf,然后再转换为图像。libreofficegraphicsmagick​

["pdf", // Portable Document Format"doc", // Microsoft Word 97-2003"docx", // Microsoft Word 2007-2019"odt", // OpenDocument Text"ott", // OpenDocument Text Template"rtf", // Rich Text Format"txt", // Plain Text"html", // HTML Document"htm", // HTML Document (alternative extension)"xml", // XML Document"wps", // Microsoft Works Word Processor"wpd", // WordPerfect Document"xls", // Microsoft Excel 97-2003"xlsx", // Microsoft Excel 2007-2019"ods", // OpenDocument Spreadsheet"ots", // OpenDocument Spreadsheet Template"csv", // Comma-Separated Values"tsv", // Tab-Separated Values"ppt", // Microsoft PowerPoint 97-2003"pptx", // Microsoft PowerPoint 2007-2019"odp", // OpenDocument Presentation"otp", // OpenDocument Presentation Template
];


http://www.ppmy.cn/devtools/150579.html

相关文章

蓝桥杯嵌入式速通(1)

1.工程准备 创建一文件夹存放自己的代码&#xff0c;并在mdk中include上文件夹地址 把所有自身代码的头文件都放在headfile头文件中&#xff0c;之后只需要在新的文件中引用headfile即可 headfile中先提前可加入 #include "stdio.h" #include "string.h"…

Redis 优化秒杀(异步秒杀)

目录 为什么需要异步秒杀 异步优化的核心逻辑是什么&#xff1f; 阻塞队列的特点是什么&#xff1f; Lua脚本在这里的作用是什么&#xff1f; 异步调用创建订单的具体逻辑是什么&#xff1f; 为什么要用代理对象proxy调用createVoucherOrder方法&#xff1f; 对于代码的详细…

IP层之分片包的整合处理

前言 在上一章节中&#xff0c;笔者就IP层的接收代码逻辑做了简单介绍&#xff0c;并对实现代码进行了逻辑梳理以及仿真测试&#xff0c;并且在上一章节中&#xff0c;就IP层的分片包问题&#xff0c;如何确定分片包是否存在已经进行了简单介绍&#xff0c;并在接收模块中&…

信息时代的消费者行为变迁与应对策略:基于链动2+1模式、AI智能名片及S2B2C商城小程序的分析

摘要&#xff1a;随着信息技术的飞速发展与互联网的全面普及&#xff0c;消费者行为模式正在经历深刻变革。在信息大爆炸的背景下&#xff0c;消费者拥有了前所未有的信息获取能力&#xff0c;他们开始独立思考&#xff0c;追求个性化消费体验&#xff0c;并展现出理性消费的趋…

Github 2025-01-09 Go开源项目日报 Top10

根据Github Trendings的统计,今日(2025-01-09统计)共有10个项目上榜。根据开发语言中项目的数量,汇总情况如下: 开发语言项目数量Go项目10TypeScript项目1Prometheus监控系统和时间序列数据库 创建周期:4149 天开发语言:Go协议类型:Apache License 2.0Star数量:52463 个…

[Android]service命令的使用

在前面的讨论中,我们说到,如果在客户端懒得使用aidl文件生成的接口类进行binder,可以使用IBinder的transcat方法 Parcel dataParcel = Parcel.obtain(); Parcel resultParcel = Parcel.obtain();dataParcel.writeInterfaceToken(DESCRIPTOR);//发起请求 aProxyBinder.trans…

蠕虫病毒会给服务器造成哪些危害?

蠕虫病毒是一种独立的恶意计算机程序&#xff0c;可以进行自我复制来传播到其他的计算机系统当中&#xff0c;蠕虫病毒和传统病毒之间是有着区别的&#xff0c;蠕虫病毒不需要宿主程序就能够自行传播&#xff0c;主要是利用各种操作系统漏洞进行攻击的。 接下来小编就介绍一下蠕…

rabbitmq的三个交换机及简单使用

提前说一下&#xff0c;创建队列&#xff0c;交换机&#xff0c;绑定交换机和队列都是在生产者。消费者只负责监听就行了&#xff0c;不用配其他的。 完成这个场景需要两个服务哦。 1直连交换机-生产者的代码。 在配置类中创建队列&#xff0c;交换机&#xff0c;绑定交换机…