Part 2: GLM-4v-9B Data-Loading Source Code Walkthrough

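Contents

Preface
I. GLM-4v-9B model data format
II. GLM-4v-9B data-processing source code
- 1. Training data-processing source (process_batch)
- 2. Evaluation and prediction data-processing source
III. Walkthrough of the GLM-4v-9B training data-processing source (process_batch)
- 1. Getting the raw data: batched_conv
- 2. Initializing the per-batch loop
- 3. The per-message loop (tokenizer.apply_chat_template)
  - 1. The handle_single_conversation(conversation) method
    - 1. Data structure
    - 2. Image processing
    - 3. Getting the text
    - 4. Converting text to tokens
    - 5. self.build_single_message
  - 2. Appending the elements
- 4. End-of-loop handling for each sample
- 5. Return value
IV. What the processed GLM-4v-9B training data looks like
- 1. Sample content
- 2. Processed result for the sample
V. A demo that mimics the GLM-4v-9B data processing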

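Preface

GLM-4v-9b, from Tsinghua and Zhipu AI, is an optimized multimodal large model that is particularly well suited to use cases in China and addresses the weak localization of models built abroad. This column covers environment setup, data processing, and the source code of the vision and language models, and it rebuilds the GLM model on top of Hugging Face, so that readers can understand, modify, and apply the GLM model and put together multimodal large models of their own. This part walks through the GLM-4v-9B data-processing source code, with a focus on what the data looks like after processing.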

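Part 1: GLM-4v-9B installation, inference, and training in detail
Part 2: GLM-4v-9B data-loading source code walkthrough
Part 3: GLM-4v-9B data loading: a tutorial on Hugging Face data-loading methods (a general data-loading example for large models)
Part 4: GLM-4v-9b tokenizer source code walkthrough
Part 5: GLM-4v-9b model-loading source code walkthrough (model parameters and methods)
Part 6: GLM-4v-9b model-loading source code walkthrough (model-loading methods)
Part 7: GLM-4v-9b vision model source code walkthrough
Part 8: GLM-4v-9b language model source code walkthrough (ChatGLMForConditionalGeneration)
Part 9: Tracing the data flow of ChatGLMForConditionalGeneration with the debugger to understand the GLM-4v-9b architecture
Part 10: Tracing the data flow of ChatGLMModel with the debugger to understand how the vision and language models are combined
Part 11: Demo: rebuilding the GLM-4v-9B data-processing code with Hugging Face
Part 12: Demo: rebuilding the GLM-4v-9B training code with Hugging Face
Parts 11 and 12 build on the understanding of GLM-4v-9B from the earlier parts and rebuild the model with Hugging Face, so that readers can freely build multimodal large models of their own.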

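This part shows what input_ids, attention_mask, position_ids, loss_masks, and images contain after GLM's data processing, and how the GLM source code produces them. I also provide demo code that mimics the GLM data processing so that readers can study it. Much of this processing relies on Hugging Face data-handling methods, which I will cover in the next part.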

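I. GLM-4v-9B model data format

The data format is the one given in the official repository; see https://github.com/THUDM/GLM-4/blob/main/finetune_demo/README.md
[Figure: the data format shown in the official README]
For this tutorial we use a simple data example: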

[
  {
    "messages": [
      {"role": "user", "content": "图片中内容是什么？", "image": "/GLM-4V-9B/GLM-4-main/THUDM/data/example/images/000000001.jpg"},
      {"role": "assistant", "content": "图片中有拿着篮球的人。"},
      {"role": "user", "content": "图片中的男人在做什么？"},
      {"role": "assistant", "content": "这个男人正在玩篮球。"}
    ]
  },
  {
    "messages": [
      {"role": "user", "content": "图片中内容是什么？", "image": "/GLM-4V-9B/GLM-4-main/THUDM/data/example/images/000000001.jpg"},
      {"role": "assistant", "content": "图片中有拿着篮球的人。"},
      {"role": "user", "content": "图片中的男人在做什么？"},
      {"role": "assistant", "content": "这个男人正在玩篮球。"}
    ]
  }
]

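Note that the image path must be an absolute path. In English, the four messages read: "What is in the picture?", "There is a person holding a basketball in the picture.", "What is the man in the picture doing?", and "The man is playing basketball."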
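As a quick sanity check on this format, here is a minimal sketch that validates one sample before training. The helper name validate_sample and the exact checks are my own assumptions for illustration and are not part of the GLM repository. It only checks what is described in this part: every message has a known role and a content field, and any image path is absolute.

```python
import os

# Roles accepted by build_single_message, shown later in this part
ALLOWED_ROLES = {"system", "user", "assistant", "observation"}


def validate_sample(sample: dict) -> list[str]:
    """Return a list of problems found in one {"messages": [...]} sample."""
    messages = sample.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]

    problems = []
    for i, msg in enumerate(messages):
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i}: unknown role {msg.get('role')!r}")
        if "content" not in msg:
            problems.append(f"message {i}: missing 'content'")
        image = msg.get("image")
        if image is not None and not os.path.isabs(image):
            problems.append(f"message {i}: image path is not absolute: {image}")
    return problems


# Hypothetical samples: one well-formed, one with two mistakes
good = {"messages": [
    {"role": "user", "content": "图片中内容是什么？", "image": "/data/example/images/000000001.jpg"},
    {"role": "assistant", "content": "图片中有拿着篮球的人。"},
]}
bad = {"messages": [
    {"role": "human", "content": "hi", "image": "images/000000001.jpg"},
]}

print(validate_sample(good))
print(validate_sample(bad))
```

Running such a check over the whole file before calling the training script should catch relative paths early, instead of failing later inside Image.open.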

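II. GLM-4v-9B data-processing source code

Here I first give the GLM source code, which is the core of the data processing, and walk through it afterwards.

1. Training data-processing source (process_batch)

Source: finetune_demo/finetune_vision.py --> process_batch function
The code is as follows: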

def process_batch(
        batch: Mapping[str, Sequence],
        tokenizer: PreTrainedTokenizer,
        max_input_length: int,
        max_output_length: int,
        combine: bool,
) -> dict[str, list]:
    batched_conv = batch['messages']
    batched_input_ids = []
    batched_attention_mask = []
    batched_position_ids = []
    batched_labels = []
    batched_images = []

    max_length = max_input_length + max_output_length

    for conv in batched_conv:
        input_ids = [151331, 151333]
        attention_mask = [1, 1]
        position_ids = list(range(len(input_ids)))
        loss_masks = [False, False]
        images = []

        if conv[0].get('image'):
            conv[0]['image'] = Image.open(conv[0]['image']).convert('RGB')
        else:
            conv[0]['image'] = img  # img is a module-level placeholder image

        for message in conv:
            loss_mask_val = False if message['role'] in ('system', 'user', 'observation') else True
            new_input_ids_all = tokenizer.apply_chat_template(
                [message],
                tokenize=True,
                return_dict=True,
                padding=True
            )
            new_input_ids = new_input_ids_all['input_ids'][0][2:]
            new_attention_mask = new_input_ids_all['attention_mask'][0][2:]
            new_position_ids = list(range(position_ids[-1] + 1, position_ids[-1] + 1 + len(new_input_ids)))
            if message.get('image'):  # Only One Image
                images.append(new_input_ids_all['images'])

            new_loss_masks = [loss_mask_val] * len(new_input_ids)
            input_ids += new_input_ids
            attention_mask += new_attention_mask
            position_ids += new_position_ids
            loss_masks += new_loss_masks

        input_ids.append(151336)  # EOS
        attention_mask.append(1)
        position_ids.append(len(position_ids))
        loss_masks.append(False)

        labels = []
        for input_id, mask in zip(input_ids, loss_masks):
            if mask:
                labels.append(input_id)
            else:
                labels.append(-100)

        batched_input_ids.append(input_ids[:max_length])
        batched_attention_mask.append(attention_mask[:max_length])
        batched_position_ids.append(position_ids[:max_length])
        batched_labels.append(labels[:max_length])
        batched_images.append(images[0][0])

    del batched_conv, conv, input_ids, attention_mask, position_ids, loss_masks, message, new_input_ids, new_loss_masks, labels, input_id, mask
    torch.cuda.empty_cache()

    return {
        'input_ids': batched_input_ids,
        'attention_mask': batched_attention_mask,
        'position_ids': batched_position_ids,
        'labels': batched_labels,
        'images': batched_images
    }

2. Evaluation and prediction data-processing source

Source: finetune_demo/finetune_vision.py --> process_batch_eval function
The code is as follows:

def process_batch_eval(
        batch: Mapping[str, Sequence],
        tokenizer: PreTrainedTokenizer,
        max_input_length: int,
        max_output_length: int,
        combine: bool,
) -> dict[str, list]:
    batched_conv = batch['messages']
    batched_input_ids = []
    batched_attention_mask = []
    batched_position_ids = []
    batched_output_ids = []
    batched_images = []

    for conv in batched_conv:
        if conv[0].get('image'):
            image = Image.open(conv[0]['image']).convert('RGB')
        else:
            image = img  # img is a module-level placeholder image

        conv[0]['image'] = image
        new_input_ids_all = tokenizer.apply_chat_template(
            conv,
            tokenize=True,
            return_dict=True,
            padding=True
        )

        input_ids = new_input_ids_all['input_ids'][0]
        attention_mask = new_input_ids_all['attention_mask'][0]
        position_ids = list(range(len(input_ids)))

        dialogue_parts = [0]
        for idx, token_id in enumerate(input_ids):
            if token_id == 151337:
                dialogue_parts.append(idx + 1)

        if not dialogue_parts or dialogue_parts[-1] != len(input_ids):
            dialogue_parts.append(len(input_ids))

        # Split the conversation into multiple dialogue segments
        for end_idx in range(1, len(dialogue_parts)):
            input_segment = input_ids[:dialogue_parts[end_idx]]
            attention_segment = attention_mask[:dialogue_parts[end_idx]]
            position_segment = position_ids[:dialogue_parts[end_idx]]
            output_segment = input_ids[dialogue_parts[end_idx - 1]:dialogue_parts[end_idx]]
            output_segment.append(151336)  # Add EOS token

            batched_input_ids.append(input_segment[:max_input_length])
            batched_attention_mask.append(attention_segment[:max_input_length])
            batched_position_ids.append(position_segment[:max_input_length])
            batched_output_ids.append(output_segment[:max_output_length])
            batched_images.append(new_input_ids_all['images'][0])

    del batched_conv, input_ids, attention_mask, position_ids, new_input_ids_all, output_segment
    torch.cuda.empty_cache()

    return {
        'input_ids': batched_input_ids,
        'attention_mask': batched_attention_mask,
        'position_ids': batched_position_ids,
        'output_ids': batched_output_ids,
        'images': batched_images
    }
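The part of process_batch_eval that is easiest to misread is how one tokenized conversation is cut at every token with id 151337 (which I take to be the <|assistant|> id). Below is a minimal sketch of just that splitting step on made-up token ids. The function name split_dialogue and the toy ids are assumptions for illustration; only 151337 and 151336 come from the source.

```python
ASSISTANT_ID = 151337  # id that process_batch_eval splits on
EOS_ID = 151336        # id the source appends with the comment "Add EOS token"


def split_dialogue(input_ids: list[int]) -> list[tuple[list[int], list[int]]]:
    """Return (input_segment, output_segment) pairs built the way process_batch_eval does."""
    dialogue_parts = [0]
    for idx, token_id in enumerate(input_ids):
        if token_id == ASSISTANT_ID:
            dialogue_parts.append(idx + 1)
    if dialogue_parts[-1] != len(input_ids):
        dialogue_parts.append(len(input_ids))

    pairs = []
    for end_idx in range(1, len(dialogue_parts)):
        start, end = dialogue_parts[end_idx - 1], dialogue_parts[end_idx]
        input_segment = input_ids[:end]                 # everything up to this cut
        output_segment = input_ids[start:end] + [EOS_ID]  # only the newest segment, plus EOS
        pairs.append((input_segment, output_segment))
    return pairs


# Toy ids: small numbers stand in for ordinary text tokens
toy_ids = [1, 2, 10, 11, ASSISTANT_ID, 20, 21, 3, 30, ASSISTANT_ID, 40]
pairs = split_dialogue(toy_ids)
for inp, out in pairs:
    print(len(inp), "input tokens ->", out)
```

Each cut produces one evaluation row, so a conversation with two assistant turns yields three rows here, and the image is repeated for every row in the real function.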

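III. Walkthrough of the GLM-4v-9B training data-processing source (process_batch)

Source: finetune_demo/finetune_vision.py --> process_batch function

1. Getting the raw data: batched_conv

This step assigns the messages read from the jsonl file to the variable batched_conv and defines the lists that will collect the model inputs. The source is: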

batched_conv = batch['messages']
batched_input_ids = []
batched_attention_mask = []
batched_position_ids = []
batched_labels = []
batched_images = []
max_length = max_input_length + max_output_length

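batched_conv holds the raw data in the same format as the jsonl file; it is simply the content of messages. It is a list in which each element is one conversation, and each conversation is itself a list of message dicts that will be processed below. The format is: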

"messages": [{"role": "user","content": "图片中内容是什么？","image": "/extend_disk/disk3/tj/GLM-4V-9B/GLM-4-main/THUDM/data/example/images/000000001.jpg"},{"role": "assistant","content": "图片中有拿着篮球的人。"},{"role": "user","content": "图片中的男人在做什么？"},{"role": "assistant","content": "这个男人正在玩篮球。"}]

[Figure: the batched_conv data]
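To make the nesting concrete, here is a small sketch of the structure that arrives in batch['messages']. The toy batch dict stands in for what a batched datasets map call would pass to process_batch, and the image path is a hypothetical placeholder.

```python
# One conversation: a list of message dicts; only the first message carries an image
conv_with_image = [
    {"role": "user", "content": "图片中内容是什么？", "image": "/abs/path/to/000000001.jpg"},
    {"role": "assistant", "content": "图片中有拿着篮球的人。"},
]
conv_without_image = [
    {"role": "user", "content": "图片中的男人在做什么？"},
    {"role": "assistant", "content": "这个男人正在玩篮球。"},
]

# A batch maps each column name to a list with one entry per sample
batch = {"messages": [conv_with_image, conv_without_image]}

batched_conv = batch["messages"]
for conv in batched_conv:
    first = conv[0]
    # conv[0].get('image') is what process_batch checks before opening the file
    print(len(conv), "messages, first role:", first["role"], "image:", first.get("image"))
```

The second conversation has no image key, which is the case where process_batch falls back to the placeholder image img.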

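2. Initializing the per-batch loop

This initializes the state for each conversation, in particular input_ids, attention_mask, position_ids, loss_masks, and images, with the values shown in the code below. The image path in the first message of the conversation is also replaced by the loaded image.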

for conv in batched_conv:
    input_ids = [151331, 151333]
    attention_mask = [1, 1]
    position_ids = list(range(len(input_ids)))  # [0, 1]
    loss_masks = [False, False]
    images = []

    if conv[0].get('image'):
        conv[0]['image'] = Image.open(conv[0]['image']).convert('RGB')
    else:
        conv[0]['image'] = img

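3. The per-message loop (tokenizer.apply_chat_template)

Each conversation in the batch goes through this loop one message at a time. The function tokenizer.apply_chat_template is very important here; it is a version customized by Zhipu.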

for message in conv:
    # Decides whether the tokens of this message count toward the labels
    loss_mask_val = False if message['role'] in ('system', 'user', 'observation') else True
    new_input_ids_all = tokenizer.apply_chat_template(
        [message],
        tokenize=True,
        return_dict=True,
        padding=True
    )
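The role test above can be read as a tiny rule: only roles other than system, user, and observation contribute to the loss. A minimal sketch of that rule in isolation (the helper name loss_mask_for_role is mine, not from the source):

```python
def loss_mask_for_role(role: str) -> bool:
    """True when tokens of this role should be kept in the labels."""
    return role not in ("system", "user", "observation")


roles = ["user", "assistant", "user", "assistant"]
masks = [loss_mask_for_role(r) for r in roles]
print(list(zip(roles, masks)))
```

For the sample conversation in this part, that means only the two assistant replies are trained on.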

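1. The handle_single_conversation(conversation) method

apply_chat_template does its work through handle_single_conversation, which was also modified by Zhipu. Its source is: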

# Nested helper: self, add_special_tokens, tokenize and add_generation_prompt come from the enclosing method
def handle_single_conversation(conversation):
    input_ids = self.get_prefix_tokens() if add_special_tokens else []
    input_message = "[gMASK]<sop>" if add_special_tokens else ""
    input_image = None
    transform = transforms.Compose(
        [
            transforms.Resize(
                (self.image_size, self.image_size), interpolation=transforms.InterpolationMode.BICUBIC
            ),
            transforms.ToTensor(),
            transforms.Normalize((0.48145466, 0.4578275, 0.40821073), (0.26862954, 0.26130258, 0.27577711)),
        ]
    )
    for item in conversation:
        if item.get("tools"):
            tools = item["tools"]
            content = "你是一个名为 GLM-4 的人工智能助手。你是基于智谱AI训练的语言模型 GLM-4 模型开发的，你的任务是针对用户的问题和要求提供适当的答复和支持。"
            for tool in tools:
                if tool["type"] == "function":
                    function = tool["function"]
                    content += f"\n\n## {function['name']}\n\n{json.dumps(function, ensure_ascii=False, indent=4)}"
                    content += "\n在调用上述函数时，请使用 Json 格式表示调用的参数。"
                elif tool["type"] == "python":
                    content += "\n\n## python\n\n当你向 `python` 发送包含 Python 代码的消息时，该代码将会在一个有状态的 Jupyter notebook 环境中执行。\n`python` 返回代码执行的输出，或在执行 60 秒后返回超时。\n`/mnt/data` 将会持久化存储你的文件。在此会话中，`python` 无法访问互联网。不要使用 `python` 进行任何网络请求或者在线 API 调用，这些在线内容的访问将不会成功。"
                elif tool["type"] == "simple_browser":
                    content += "\n\n## simple_browser\n\n你可以使用 `simple_browser` 工具。该工具支持以下函数：\n`search(query: str, recency_days: int)`：使用搜索引擎进行查询并显示结果，可以使用 `recency_days` 参数控制搜索内容的时效性。\n`mclick(ids: list[int])`：获取一系列指定 id 的页面内容。每次调用时，须选择3-10个页面。选择多个角度的页面，同时尽可能选择可信任的信息来源。考虑到部分页面是无法加载的，你也可以多打开一些可能有用的页面而不用担心内容过多。\n`open_url(url: str)`：打开指定的 URL。\n\n使用 `【{引用 id}†{引用文本}】` 来引用内容。\n\n操作步骤：1. 使用 `search` 来获得信息列表; 2. 使用 `mclick` 来获取指定 ID 页面的内容; 3. 根据获得的内容进行回复。在回复中应当引用信息来源。\n 如果用户提供了 URL，也可以用 `open_url` 直接打开页面。\n如果初次搜索结果没有找到合适的信息，也可以再次使用 `search` 进行搜索。"
                elif tool["type"] == "cogview":
                    content += "\n\n## cogview\n\n如果用户的请求中包含了对图像的描述，你可以使用 `cogview` 来生成图像并展示给用户。你需要向 `cogview` 发送图像描述，规则：\n- 发送给 `cogview` 的消息必须使用英语。用户的中文描述必须完全翻译为英语。\n- 应当尽可能详细地描述图像生成的需求，需求描述约 100 英文单词。\n- 保持用户原始描述的意图。不要虚构内容或者没见过的人物。\n- 如无特殊说明，所在地为中国，持有中国立场并遵循中国社会主义价值观。"
                else:
                    raise NotImplementedError(f"Unknown tool type {tool['type']}")
            input = self.build_single_message("system", "", content, tokenize=tokenize)
            if tokenize:
                input_ids.extend(input)
            else:
                input_message += input
        message = ""
        message_prefix = None
        if item.get("image"):
            assert input_image is None, "Multiple images are not supported"
            input_image = transform(item["image"])
            message_prefix = self.convert_tokens_to_ids(
                ["<|begin_of_image|>", "<|endoftext|>", "<|end_of_image|>"])
        if item.get("content"):
            message += item["content"]
        if message or message_prefix:
            input = self.build_single_message(
                item["role"],
                item.get("metadata", ""),
                message,
                tokenize=tokenize,
                message_prefix=message_prefix
            )
            if tokenize:
                input_ids.extend(input)
            else:
                input_message += input
    if add_generation_prompt:
        if tokenize:
            input_ids.extend([self.convert_tokens_to_ids("<|assistant|>")])
        else:
            input_message += "<|assistant|>"
    return {"input": input_ids if tokenize else input_message, "image": input_image}

[Figure: handle_single_conversation]

1. Data structure

On entry to the function, the argument has already been wrapped in a list and looks like this:

[{"role": "user", "content": "图片中内容是什么？", "image": <image loaded by PIL>}]

2. Image processing

If the message contains an image, the image is processed. The transform is defined as follows:

transform = transforms.Compose(
    [
        transforms.Resize(
            (self.image_size, self.image_size), interpolation=transforms.InterpolationMode.BICUBIC
        ),
        transforms.ToTensor(),
        transforms.Normalize((0.48145466, 0.4578275, 0.40821073), (0.26862954, 0.26130258, 0.27577711)),
    ]
)

The image-processing source is:

message = ""
message_prefix = None
if item.get("image"):
    assert input_image is None, "Multiple images are not supported"
    input_image = transform(item["image"])
    message_prefix = self.convert_tokens_to_ids(
        ["<|begin_of_image|>", "<|endoftext|>", "<|end_of_image|>"])
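To make the numbers in transforms.Normalize concrete: ToTensor scales 8-bit pixel values to the range [0, 1], and Normalize then computes (x - mean) / std for each channel. The sketch below redoes that arithmetic in plain Python for a single RGB pixel. The helper name normalize_pixel is mine, and the sketch mirrors only the arithmetic, not the resizing or the bicubic interpolation.

```python
# Per-channel mean and std taken from the transform above
MEAN = (0.48145466, 0.4578275, 0.40821073)
STD = (0.26862954, 0.26130258, 0.27577711)


def normalize_pixel(rgb: tuple[int, int, int]) -> list[float]:
    """Apply the ToTensor scaling and then the Normalize step to one 8-bit RGB pixel."""
    scaled = [channel / 255.0 for channel in rgb]             # ToTensor: [0, 255] -> [0, 1]
    return [(x - m) / s for x, m, s in zip(scaled, MEAN, STD)]  # Normalize: (x - mean) / std


black = normalize_pixel((0, 0, 0))
white = normalize_pixel((255, 255, 255))
print([round(v, 4) for v in black])
print([round(v, 4) for v in white])
```

A black pixel ends up negative in every channel and a white pixel positive, which is why the all-black placeholder image still produces a well-defined tensor.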

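3. Getting the text

The text is then picked up by the following code:

if item.get("content"):
    message += item["content"]

message starts as "" and after the concatenation becomes 图片中内容是什么？ ("What is in the picture?").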

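4. Converting text to tokens

The conversion to token ids is done mainly by self.build_single_message, which returns input; input_ids.extend(input) then gives the final input_ids. Here input_ids already holds the two prefix values [151331, 151333], the same two ids the batch loop starts from, so after input_ids.extend(input) its value is [151331, 151333, 151336, 198, 151339, 151329, 151340, 100736, 98322, 99098, 101052, 11314].

if message or message_prefix:
    input = self.build_single_message(
        item["role"],
        item.get("metadata", ""),
        message,
        tokenize=tokenize,
        message_prefix=message_prefix
    )
    if tokenize:
        input_ids.extend(input)
    else:
        input_message += input

[Figure: the resulting input_ids]
The string input_message = "[gMASK]<sop>" does not feed into input_ids here, because it is only extended when tokenize is False. There is no rush; I will come back to it later.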

5. self.build_single_message

This function produces the token ids. Its source is:

def build_single_message(self, role, metadata, message, tokenize=True, message_prefix=None):
    assert role in ["system", "user", "assistant", "observation"], role
    if tokenize:
        role_tokens = [self.convert_tokens_to_ids(f"<|{role}|>")] + self.tokenizer.encode(f"{metadata}\n",
                                                                                          disallowed_special=())
        message_tokens = self.tokenizer.encode(message, disallowed_special=())
        if message_prefix is not None:
            message_tokens = message_prefix + message_tokens
        tokens = role_tokens + message_tokens
        return tokens
    else:
        return str(f"<|{role}|>{metadata}\n{message}")

First, role_tokens holds the token ids of f"<|{role}|>" and f"{metadata}\n"; with the role user and empty metadata its value is [151336, 198].

Next, message_tokens holds the token ids of message = '图片中内容是什么？'; its value is [100736, 98322, 99098, 101052, 11314].

Then, when message_prefix is not None, message_tokens = message_prefix + message_tokens = [151339, 151329, 151340] + [100736, 98322, 99098, 101052, 11314].

Finally, tokens = role_tokens + message_tokens = [151336, 198] + [151339, 151329, 151340, 100736, 98322, 99098, 101052, 11314].

The resulting tokens list is returned.
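The concatenation order can be replayed without loading the tokenizer. The sketch below is a toy stand-in, not the real build_single_message: the two lookup tables are filled with the ids reported in this part, and the name toy_build_single_message is mine. A real tokenizer would compute these ids itself.

```python
# Special-token ids as reported in this part
SPECIAL_IDS = {
    "<|user|>": 151336,
    "<|begin_of_image|>": 151339,
    "<|endoftext|>": 151329,
    "<|end_of_image|>": 151340,
}
# Text encodings as reported in this part (stand-in for tokenizer.encode)
TEXT_IDS = {
    "\n": [198],
    "图片中内容是什么？": [100736, 98322, 99098, 101052, 11314],
}


def toy_build_single_message(role, metadata, message, message_prefix=None):
    """Mirror the order of operations: role token, metadata + newline, optional prefix, message."""
    role_tokens = [SPECIAL_IDS[f"<|{role}|>"]] + TEXT_IDS[f"{metadata}\n"]
    message_tokens = TEXT_IDS[message]
    if message_prefix is not None:
        message_tokens = message_prefix + message_tokens
    return role_tokens + message_tokens


image_prefix = [SPECIAL_IDS[t] for t in ("<|begin_of_image|>", "<|endoftext|>", "<|end_of_image|>")]
tokens = toy_build_single_message("user", "", "图片中内容是什么？", message_prefix=image_prefix)

# handle_single_conversation starts from the two prefix ids and then extends with tokens
input_ids = [151331, 151333] + tokens
print(input_ids)
```

The printed list matches the input_ids value given in the token-conversion step: prefix, role, newline, the three image marker ids, then the text.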

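2. Appending the elements

This part appends the newly tokenized message to the running model inputs; put simply, it is how the turns of a multi-turn conversation are concatenated. The [2:] slices drop the two prefix ids [151331, 151333] that come back with every message, since the batch loop already placed them at the start. I won't explain it further here and will give an example later.

new_input_ids = new_input_ids_all['input_ids'][0][2:]
new_attention_mask = new_input_ids_all['attention_mask'][0][2:]
new_position_ids = list(range(position_ids[-1] + 1, position_ids[-1] + 1 + len(new_input_ids)))
if message.get('image'):  # Only One Image
    images.append(new_input_ids_all['images'])

new_loss_masks = [loss_mask_val] * len(new_input_ids)
input_ids += new_input_ids
attention_mask += new_attention_mask
position_ids += new_position_ids
loss_masks += new_loss_masks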
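A minimal sketch of this stitching step on made-up per-message ids, to show how input_ids, position_ids, and loss_masks grow together. The function name stitch and the toy ids 7, 8, 9 are assumptions for illustration; the prefix and role ids are the ones that appear in this part.

```python
PREFIX = [151331, 151333]  # the two ids every tokenized message starts with


def stitch(per_message_ids: list[list[int]], roles: list[str]):
    """Concatenate per-message ids the way the process_batch loop does."""
    input_ids = list(PREFIX)
    position_ids = list(range(len(input_ids)))
    loss_masks = [False, False]
    for ids, role in zip(per_message_ids, roles):
        new_ids = ids[2:]  # drop the repeated prefix
        start = position_ids[-1] + 1
        position_ids += list(range(start, start + len(new_ids)))
        keep = role not in ("system", "user", "observation")
        loss_masks += [keep] * len(new_ids)
        input_ids += new_ids
    return input_ids, position_ids, loss_masks


# Toy tokenizer outputs: one user turn and one assistant turn
user_ids = PREFIX + [151336, 198, 7, 8]
assistant_ids = PREFIX + [151337, 198, 9]

input_ids, position_ids, loss_masks = stitch([user_ids, assistant_ids], ["user", "assistant"])
print(input_ids)
print(position_ids)
print(loss_masks)
```

The three lists always have the same length, and position_ids simply continues counting from where the previous message stopped.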

4. End-of-loop handling for each sample

This code finishes the processing of each sample. It also appends the id 151336, which the source comments as EOS and which is the same id that build_single_message produces for <|user|>. I won't go into more detail here; the result is shown later.

    input_ids.append(151336)  # EOS
    attention_mask.append(1)
    position_ids.append(len(position_ids))
    loss_masks.append(False)

    labels = []
    for input_id, mask in zip(input_ids, loss_masks):
        if mask:
            labels.append(input_id)
        else:
            labels.append(-100)

    batched_input_ids.append(input_ids[:max_length])
    batched_attention_mask.append(attention_mask[:max_length])
    batched_position_ids.append(position_ids[:max_length])
    batched_labels.append(labels[:max_length])
    batched_images.append(images[0][0])
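The label construction is worth isolating: every position whose loss mask is False becomes -100, which is the default ignore_index of PyTorch's cross-entropy loss, so those tokens do not contribute to training. Below is a minimal sketch on toy ids; the function name finalize_sample is mine and the ids 7, 8, 9 are made up.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss
EOS_ID = 151336      # id appended at the end of every sample in the source


def finalize_sample(input_ids: list[int], loss_masks: list[bool], max_length: int):
    """Append the closing id, build labels from the loss mask, then truncate both."""
    input_ids = input_ids + [EOS_ID]
    loss_masks = loss_masks + [False]  # the appended id is never trained on
    labels = [tok if keep else IGNORE_INDEX for tok, keep in zip(input_ids, loss_masks)]
    return input_ids[:max_length], labels[:max_length]


ids = [151331, 151333, 151336, 198, 7, 8, 151337, 198, 9]
masks = [False] * 6 + [True] * 3

full_ids, full_labels = finalize_sample(ids, masks, max_length=16)
short_ids, short_labels = finalize_sample(ids, masks, max_length=8)
print(full_labels)
print(short_labels)
```

Note that truncation happens after the closing id is appended, so a sample longer than max_length loses that id along with the tail of the conversation.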

5. Return value

This is the value returned by the function; it is what gets fed to the model.

return {
    'input_ids': batched_input_ids,
    'attention_mask': batched_attention_mask,
    'position_ids': batched_position_ids,
    'labels': batched_labels,
    'images': batched_images
}

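IV. What the processed GLM-4v-9B training data looks like

In this section I show the content returned by process_batch directly, which saves further explanation of the processing steps above.

1. Sample content

Suppose a sample contains: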

  "messages": [{"role": "user","content": "图片中内容是什么？","image": "/extend_disk/disk3/tj/GLM-4V-9B/GLM-4-main/THUDM/data/example/images/000000001.jpg"},{"role": "assistant","content": "图片中有拿着篮球的人。"},{"role": "user","content": "图片中的男人在做什么？"},{"role": "assistant","content": "这个男人正在玩篮球。"}]

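2. Processed result for the sample

Running process_batch on one sample gives the result in the following table:

[Figure: table of the process_batch result for the sample]

The first token marks the image position, the turns of the conversation are concatenated into the content above, and the last token is the one appended at the end. This is what the GLM code produces, and seeing it makes the code easier to follow. So as long as we supply conversations in this format, the GLM model has no trouble with them and can slot in the image token ids for embedding.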


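V. A demo that mimics the GLM-4v-9B data processing

To make the GLM data-loading code easier to understand, I extracted it and adapted it so that it is convenient to inspect. The source is: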

# -*- coding: utf-8 -*-
import functools
from collections.abc import Callable, Mapping, Sequence
from typing import Any, Optional

import torch
from datasets import Dataset, DatasetDict, NamedSplit, Split, load_dataset
from PIL import Image
from transformers import AutoTokenizer, PreTrainedTokenizer

# Blank placeholder image used when a conversation has no image
img = Image.new('L', (224, 224), 0).convert('RGB')


def _load_datasets(
        data_dir: str,
        data_format: str,
        data_files: dict[NamedSplit, str],
        num_proc: Optional[int],
) -> DatasetDict:
    if data_format == '.json':
        dataset_dct = load_dataset(
            data_dir,
            data_files=data_files,
            split=None,
            num_proc=num_proc,
        )
    else:
        raise NotImplementedError(f"Cannot load dataset in the '{data_format}' format.")
    return dataset_dct


class DataManager(object):
    def __init__(self, data_dir: str):
        self._num_proc = 1
        # The same file is reused for all three splits in this demo
        data_files = {
            NamedSplit('train'): 'train.json',
            NamedSplit('validation'): 'train.json',
            NamedSplit('test'): 'train.json',
        }
        self._dataset_dct = _load_datasets(
            data_dir,
            '.json',
            data_files,
            self._num_proc,
        )

    def _get_dataset(self, split: NamedSplit) -> Optional[Dataset]:
        return self._dataset_dct.get(split, None)

    def get_dataset(
            self,
            split: NamedSplit,
            process_fn: Callable[[dict[str, Any]], dict[str, Any]],
            batched: bool = True,
            remove_orig_columns: bool = True,
    ) -> Optional[Dataset]:
        orig_dataset = self._get_dataset(split)
        if orig_dataset is None:
            return

        if remove_orig_columns:
            remove_columns = orig_dataset.column_names
        else:
            remove_columns = None
        return orig_dataset.map(
            process_fn,
            batched=batched,
            remove_columns=remove_columns,
            num_proc=self._num_proc,
            # This is default params of  orig_dataset.map, and you can change it smaller
            # https://github.com/THUDM/GLM-4/issues/277
            writer_batch_size=1000,
            batch_size=1000,
        )


def process_batch_check(
        batch: Mapping[str, Sequence],
        tokenizer: PreTrainedTokenizer,
        max_input_length: int,
        max_output_length: int,
        combine: bool,
) -> dict[str, list]:
    # Pass the raw messages through unchanged so that data_check can be stepped through separately
    batched_conv = batch['messages']
    max_length = max_input_length + max_output_length
    return {'input_ids': batched_conv}


def data_check(
        results,
        tokenizer: PreTrainedTokenizer,
        max_input_length: int,
        max_output_length: int,
):
    batched_input_ids = []
    batched_attention_mask = []
    batched_position_ids = []
    batched_labels = []
    batched_images = []

    batched_conv = results['input_ids']
    max_length = max_input_length + max_output_length

    for conv in batched_conv:
        input_ids = [151331, 151333]
        attention_mask = [1, 1]
        position_ids = list(range(len(input_ids)))
        loss_masks = [False, False]
        images = []

        if conv[0].get('image'):
            conv[0]['image'] = Image.open(conv[0]['image']).convert('RGB')
        else:
            conv[0]['image'] = img

        for message in conv:
            loss_mask_val = False if message['role'] in ('system', 'user', 'observation') else True
            new_input_ids_all = tokenizer.apply_chat_template(
                [message],
                tokenize=True,
                return_dict=True,
                padding=True
            )
            new_input_ids = new_input_ids_all['input_ids'][0][2:]
            new_attention_mask = new_input_ids_all['attention_mask'][0][2:]
            new_position_ids = list(range(position_ids[-1] + 1, position_ids[-1] + 1 + len(new_input_ids)))
            if message.get('image'):  # Only One Image
                images.append(new_input_ids_all['images'])

            new_loss_masks = [loss_mask_val] * len(new_input_ids)
            input_ids += new_input_ids
            attention_mask += new_attention_mask
            position_ids += new_position_ids
            loss_masks += new_loss_masks

        input_ids.append(151336)  # EOS
        attention_mask.append(1)
        position_ids.append(len(position_ids))
        loss_masks.append(False)

        labels = []
        for input_id, mask in zip(input_ids, loss_masks):
            if mask:
                labels.append(input_id)
            else:
                labels.append(-100)

        batched_input_ids.append(input_ids[:max_length])
        batched_attention_mask.append(attention_mask[:max_length])
        batched_position_ids.append(position_ids[:max_length])
        batched_labels.append(labels[:max_length])
        batched_images.append(images[0][0])

    del batched_conv, conv, input_ids, attention_mask, position_ids, loss_masks, message, new_input_ids, new_loss_masks, labels, input_id, mask
    torch.cuda.empty_cache()

    return {
        'input_ids': batched_input_ids,
        'attention_mask': batched_attention_mask,
        'position_ids': batched_position_ids,
        'labels': batched_labels,
        'images': batched_images
    }


def main():
    config_file = '/GLM-4V-9B/GLM-4-main/finetune_demo/configs/lora.yaml'
    model_dir = '/GLM-4V-9B/GLM-4-main/THUDM/glm-4v-9b'
    tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)

    data_dir = '/extend_disk/disk3/tj/GLM-4V-9B/GLM-4-main/THUDM/data/example'
    data_manager = DataManager(data_dir)

    train_dataset_try = data_manager.get_dataset(
        Split.TRAIN,
        functools.partial(
            process_batch_check,
            combine=True,  # Not use now
            tokenizer=tokenizer,
            max_input_length=512,
            max_output_length=512,
        ),
        batched=True,
    )
    r = data_check(
        train_dataset_try,
        tokenizer=tokenizer,
        max_input_length=512,
        max_output_length=512,
    )


if __name__ == '__main__':
    main()