GraphRAG: Study Notes

20250106
Project Git repository: https://github.com/microsoft/graphrag.git
Version: 1.2.0

```yaml
### This config file contains required core defaults that must be set, along with a handful of common optional settings.
### For a full list of available settings, see https://microsoft.github.io/graphrag/config/yaml/

### LLM settings ###
## There are a number of settings to tune the threading and token limits for LLM calls - check the docs.

encoding_model: cl100k_base # this needs to be matched to your model!

llm:
  api_key: <your-api-key> # set this in the generated .env file
  type: openai_chat # or azure_openai_chat
  model: deepseek-chat
  model_supports_json: true # recommended if this is available for your model.
  # audience: "https://cognitiveservices.azure.com/.default"
  api_base: https://api.deepseek.com # https://<instance>.openai.azure.com
  api_version: V3
  # organization: <organization_id>
  deployment_name: maweijun

parallelization:
  stagger: 0.3
  # num_threads: 50

async_mode: threaded # or asyncio

embeddings:
  async_mode: threaded # or asyncio
  vector_store:
    type: lancedb
    db_uri: 'output/lancedb'
    container_name: default
    overwrite: true
  llm:
    api_key: <your-api-key>
    type: openai_embedding # or azure_openai_embedding
    model: embedding-2
    api_base: https://open.bigmodel.cn/api/paas/v4
    # api_base: https://<instance>.openai.azure.com
    # api_version: 2024-02-15-preview
    # audience: "https://cognitiveservices.azure.com/.default"
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>

### Input settings ###

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

chunks:
  size: 1200
  overlap: 100
  group_by_columns: [id]

### Storage settings ###
## If blob storage is specified in the following four sections,
## connection_string and container_name must be provided

cache:
  type: file # one of [blob, cosmosdb, file]
  base_dir: "cache"

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/logs"

storage:
  type: file # one of [blob, cosmosdb, file]
  base_dir: "output/${timestamp}/artifacts"

## only turn this on if running `graphrag index` with custom settings
## we normally use `graphrag update` with the defaults
update_index_storage:
  # type: file # or blob
  # base_dir: "update_output"

### Workflow settings ###

skip_workflows: []

entity_extraction:
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 1

summarize_descriptions:
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  enabled: false
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1

community_reports:
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes (embed_graph must also be enabled)

snapshots:
  graphml: true
  embeddings: false
  transient: false

### Query settings ###
## The prompt locations are required here, but each search method has a number of optional knobs that can be tuned.
## See the config docs: https://microsoft.github.io/graphrag/config/yaml/#query

local_search:
  prompt: "prompts/local_search_system_prompt.txt"

global_search:
  map_prompt: "prompts/global_search_map_system_prompt.txt"
  reduce_prompt: "prompts/global_search_reduce_system_prompt.txt"
  knowledge_prompt: "prompts/global_search_knowledge_system_prompt.txt"

drift_search:
  prompt: "prompts/drift_search_system_prompt.txt"
  reduce_prompt: "prompts/drift_search_reduce_prompt.txt"

basic_search:
  prompt: "prompts/basic_search_system_prompt.txt"
```

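The listing above is the settings file (settings.yaml) that controls how GraphRAG behaves. GraphRAG is Microsoft's graph-based retrieval-augmented generation (RAG) framework: it uses an LLM to build a knowledge graph from unstructured text, then uses that graph to answer queries. Each part of the file is explained below.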

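1. LLM Settings

This part covers the large language model (LLM): which API is called and how the calls are made.

encoding_model: cl100k_base
The tokenizer encoding, which must match the LLM in use. cl100k_base is the tiktoken encoding used by OpenAI models such as GPT-4 and GPT-3.5 Turbo.
llm
API call parameters for the LLM:
- api_key: The API key for the LLM, normally kept in the .env file rather than written into this file.
- type: The kind of LLM endpoint, for example openai_chat or azure_openai_chat.
- model: The model name, here deepseek-chat.
- model_supports_json: Whether the model supports JSON-formatted output; enabling it is recommended when the model does.
- api_base: The base URL of the LLM API.
- api_version: The API version string.
- deployment_name: The deployment name (applies to Azure OpenAI).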
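
To illustrate why the key can live in .env: the generated settings file normally refers to it with a `${...}` placeholder that is filled in from the environment at load time. The sketch below shows that kind of substitution with Python's standard `string.Template`. The variable name `GRAPHRAG_API_KEY` and the helper `expand_env` are assumptions made for illustration, and this is not GraphRAG's own loading code.

```python
from string import Template


def expand_env(text: str, env: dict) -> str:
    """Replace ${NAME} placeholders in text with values from env.

    Raises KeyError if a placeholder has no matching entry.
    """
    return Template(text).substitute(env)


# Hypothetical fragment of a settings file that keeps the key out of the YAML.
raw = "llm:\n  api_key: ${GRAPHRAG_API_KEY}\n  model: deepseek-chat\n"

# In real use the mapping would be os.environ after the .env file is loaded.
expanded = expand_env(raw, {"GRAPHRAG_API_KEY": "sk-example"})
print(expanded)
```

Keeping the placeholder in the YAML means the settings file can be shared or committed without exposing the key.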
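parallelization
Parallelization parameters:
- stagger: The delay, in seconds, between successive API calls, used to stay under rate limits.
- num_threads: The number of parallel threads (commented out in this file, so the default applies).
async_mode
The concurrency mode: threaded (multithreading) or asyncio (asynchronous I/O).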
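
The effect of a stagger delay in threaded mode can be sketched as follows: each request is handed to a thread pool a fixed interval after the previous one, so the calls do not all start at the same instant. `run_staggered` and `fake_llm_call` are hypothetical names, and this only illustrates the idea, not how GraphRAG schedules its calls internally.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_staggered(tasks, stagger: float, num_threads: int = 4):
    """Run zero-argument callables in a thread pool, pausing between submissions."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = []
        for i, task in enumerate(tasks):
            if i > 0:
                time.sleep(stagger)  # space out the start of each request
            futures.append(pool.submit(task))
        # Collect results in submission order.
        return [f.result() for f in futures]


def fake_llm_call(i: int):
    """Stand-in for a real API request."""
    return lambda: f"response {i}"


start = time.monotonic()
results = run_staggered([fake_llm_call(i) for i in range(3)], stagger=0.05)
elapsed = time.monotonic() - start
print(results, round(elapsed, 2))
```

With three tasks and a 0.05 s stagger, the run takes at least two stagger intervals regardless of how fast each call returns.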

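2. Embeddings Settings

This part covers the embedding model, which produces vector representations of text.

async_mode
The concurrency mode for embedding calls.
vector_store
Vector store settings:
- type: The vector store type, here lancedb.
- db_uri: The URI of the database.
- container_name: The container name.
- overwrite: Whether to overwrite existing data.
llm
API call parameters for the embedding model:
- api_key: The API key for the embedding model.
- type: The kind of embedding endpoint, for example openai_embedding or azure_openai_embedding.
- model: The embedding model name, here embedding-2.
- api_base: The base URL of the embedding API.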

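3. Input Settings

This part covers how input data is read.

input
Source and format of the input data:
- type: The input type, for example file (local files) or blob (blob storage).
- file_type: The file type, text or csv.
- base_dir: The root directory of the input files.
- file_encoding: The file encoding, for example utf-8.
- file_pattern: A regular expression that selects input files by name.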
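
As a quick check of what the file_pattern value selects, the sketch below applies the same regular expression to a few made-up file names. In the YAML file the pattern is written as `".*\\.txt$"`; after YAML unescaping that becomes the regex `.*\.txt$`. The helper `select_inputs` and the file names are hypothetical.

```python
import re

# The regex from file_pattern after YAML unescaping: names ending in ".txt".
FILE_PATTERN = r".*\.txt$"


def select_inputs(names, pattern: str = FILE_PATTERN):
    """Return the names that match the pattern."""
    rx = re.compile(pattern)
    return [name for name in names if rx.search(name)]


selected = select_inputs(["book.txt", "notes.md", "data.txt.bak", "sub/report.txt"])
print(selected)  # ['book.txt', 'sub/report.txt']
```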
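chunks
Text chunking parameters:
- size: The size of each chunk, measured in tokens (not characters).
- overlap: The number of tokens shared by consecutive chunks.
- group_by_columns: The columns used to group documents before chunking (here the document id).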
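
How size and overlap interact can be sketched with a sliding window over a token list: each new chunk starts `size - overlap` tokens after the previous one. The function `chunk_tokens` is a hypothetical helper, and integers stand in for real tokens; GraphRAG's own chunker may differ in detail.

```python
def chunk_tokens(tokens, size: int = 1200, overlap: int = 100):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks


# Small numbers make the overlap easy to see.
chunks = chunk_tokens(list(range(10)), size=4, overlap=1)
print(chunks)  # [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

With the values in this file (size 1200, overlap 100), each chunk starts 1100 tokens after the previous one.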

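4. Storage Settings

This part covers the cache, the logs, and the output storage.

cache
Cache storage:
- type: The cache type: file, blob, or cosmosdb.
- base_dir: The root directory of the cache files.
reporting
Log and report output:
- type: The reporting type: file, console, or blob.
- base_dir: The root directory of the log files.
storage
Output storage for the index artifacts:
- type: The storage type: file, blob, or cosmosdb.
- base_dir: The root directory of the stored files.
update_index_storage
Storage for index updates. According to the comments in the file, enable it only when running graphrag index with custom settings; graphrag update normally uses the defaults.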

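5. Workflow Settings

This part covers the indexing workflows.

skip_workflows
The workflows to skip.
entity_extraction
Entity extraction:
- prompt: The path to the entity extraction prompt template.
- entity_types: The entity types to extract, for example organization and person.
- max_gleanings: The maximum number of gleaning passes, that is, extra extraction rounds run after the first pass to pick up entities that were missed. It is not a cap on the number of entities.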
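
The role of max_gleanings can be sketched as a loop: after the first extraction pass, the extractor is asked again, up to max_gleanings times, for anything it missed. Everything here is hypothetical (`extract_with_gleanings`, `fake_extractor`, the entity list); the real workflow prompts an LLM, so this shows only the control flow.

```python
KNOWN_ENTITIES = ["Microsoft", "Seattle", "Satya Nadella"]


def fake_extractor(text, already_found):
    """Stand-in for an LLM call: reports at most two not-yet-found entities."""
    remaining = [e for e in KNOWN_ENTITIES if e in text and e not in already_found]
    return remaining[:2]


def extract_with_gleanings(extract_once, text, max_gleanings: int = 1):
    """Run one extraction pass, then up to max_gleanings follow-up passes."""
    found = list(extract_once(text, []))
    for _ in range(max_gleanings):
        extra = [e for e in extract_once(text, found) if e not in found]
        if not extra:
            break  # nothing new turned up, so stop early
        found.extend(extra)
    return found


text = "Satya Nadella leads Microsoft in Seattle."
no_gleaning = extract_with_gleanings(fake_extractor, text, max_gleanings=0)
one_gleaning = extract_with_gleanings(fake_extractor, text, max_gleanings=1)
print(no_gleaning)   # ['Microsoft', 'Seattle']
print(one_gleaning)  # ['Microsoft', 'Seattle', 'Satya Nadella']
```

A higher value trades extra LLM calls for better recall.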
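summarize_descriptions
Summarization of the descriptions collected for each entity and relationship:
- prompt: The path to the summarization prompt template.
- max_length: The maximum length of a summary.
claim_extraction
Claim extraction (disabled here via enabled: false):
- prompt: The path to the claim extraction prompt template.
- description: A description of the kind of claims to extract.
- max_gleanings: The maximum number of gleaning passes.
community_reports
Community report generation:
- prompt: The path to the report prompt template.
- max_length: The maximum length of a report.
- max_input_length: The maximum length of the input used to generate a report.
cluster_graph
Graph clustering:
- max_cluster_size: The maximum cluster size.
embed_graph
Graph embedding (disabled here). When enabled, it generates node2vec embeddings for the nodes.
umap
UMAP dimensionality reduction (disabled here). It requires embed_graph to be enabled as well.
snapshots
Snapshot output:
- graphml: Whether to write a GraphML snapshot of the graph.
- embeddings: Whether to write a snapshot of the embeddings.
- transient: Whether to write snapshots of transient (intermediate) data.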

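6. Query Settings

This part covers the query methods. The prompt locations are required; each search method also has optional parameters that can be tuned.

local_search
Local search:
- prompt: The path to the local search system prompt.
global_search
Global search:
- map_prompt: The prompt for the map stage of global search.
- reduce_prompt: The prompt for the reduce stage of global search.
- knowledge_prompt: The general-knowledge prompt for global search.
drift_search
DRIFT search:
- prompt: The path to the DRIFT search system prompt.
- reduce_prompt: The prompt for the reduce stage of DRIFT search.
basic_search
Basic search:
- prompt: The path to the basic search system prompt.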
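
Each of these four methods is chosen at query time. The sketch below builds the command line for the `graphrag query` command and prints it without running it; the `--root`, `--method`, and `--query` options follow the 1.x CLI as I understand it, so check `graphrag query --help` for your version. The helper `build_query_command`, the project directory `./ragtest`, and the question are assumptions for illustration.

```python
import shlex

# The four search methods configured in the query settings.
METHODS = {"local", "global", "drift", "basic"}


def build_query_command(root: str, method: str, query: str):
    """Build the argument list for a graphrag query call."""
    if method not in METHODS:
        raise ValueError(f"unknown search method: {method}")
    return ["graphrag", "query", "--root", root, "--method", method, "--query", query]


cmd = build_query_command("./ragtest", "local", "Who is mentioned in the documents?")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```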

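Summary

This settings file defines the core behavior of GraphRAG, including:

API calls to the LLM and the embedding model.
How input data is read and chunked.
Cache, reporting, and storage settings.
Configuration of the indexing workflows.
Prompt templates and parameters for the query methods.

Adjusting these settings adapts GraphRAG to different use cases. For the full list of options, see the official configuration documentation: https://microsoft.github.io/graphrag/config/yaml/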
