Fine-Tuning Large Models in Practice


Experiments in fine-tuning large models, with particular attention to: data cleaning, data synthesis methods, the kinds of SFT tasks, and the amount of SFT data.

Data loading

Users in mainland China are advised to download data from https://modelscope.cn/datasets, but the download does not plug straight into Hugging Face datasets; instead it raises an error:

  • AttributeError: 'MsDataset' object has no attribute 'column_names'

So we can keep downloading data through ModelScope, but convert it into the form that datasets expects, which is also a chance to get a better feel for the whole data pipeline.

The simplest fix, though, is:

dataset = MsDataset.load()
train_dataset = dataset.to_hf_dataset()  # downloaded via ModelScope, now a regular Hugging Face Dataset
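
To make this concrete, here is a minimal end-to-end sketch; the dataset id below is a placeholder, not the one used in these experiments:

from modelscope.msdatasets import MsDataset

# placeholder dataset id, for illustration only
ms_dataset = MsDataset.load('some-namespace/some-dataset', split='train')
train_dataset = ms_dataset.to_hf_dataset()

# the converted object is a plain datasets.Dataset, so the attributes and
# methods missing on MsDataset (column_names, map, ...) are available again
print(train_dataset.column_names)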

For reference, the dataset-loading code in these projects is also worth a look:

  • https://github.com/modelscope/modelscope/blob/a903ec7a898f5dfb44349e2ce15971ec5f08e528/examples/pytorch/llm/utils/dataset.py#L34
  • https://github.com/hiyouga/LLaMA-Factory/blob/6c94305e4746c9a735ff62a6428e295d1a67da52/src/llmtuner/data/loader.py#L83

A few approaches

from datasets import load_dataset

# load the full DatasetDict (used by .map below) and a small slice of the training split
dataset = load_dataset(args.dataset_name)
train_dataset = load_dataset(args.dataset_name, split="train[:1024]")

def preprocess_function(examples):
    # queries: truncate to max_length - 1, then append EOS manually
    queries = examples["sentence"]
    queries = get_detailed_instruct(task, queries)
    batch_dict = tokenizer(queries, max_length=args.max_length - 1,
                           return_attention_mask=False, padding=False, truncation=True)
    batch_dict['input_ids'] = [input_ids + [tokenizer.eos_token_id] for input_ids in batch_dict['input_ids']]
    batch_dict = tokenizer.pad(batch_dict, padding=True, return_attention_mask=True, return_tensors='pt')
    result = {f"sentence_{k}": v for k, v in batch_dict.items()}

    # positive passages, same treatment
    queries = examples["positive"]
    batch_dict = tokenizer(queries, max_length=args.max_length - 1,
                           return_attention_mask=False, padding=False, truncation=True)
    batch_dict['input_ids'] = [input_ids + [tokenizer.eos_token_id] for input_ids in batch_dict['input_ids']]
    batch_dict = tokenizer.pad(batch_dict, padding=True, return_attention_mask=True, return_tensors='pt')
    for k, v in batch_dict.items():
        result[f"positive_{k}"] = v

    # negative passages, same treatment
    queries = examples["negative"]
    batch_dict = tokenizer(queries, max_length=args.max_length - 1,
                           return_attention_mask=False, padding=False, truncation=True)
    batch_dict['input_ids'] = [input_ids + [tokenizer.eos_token_id] for input_ids in batch_dict['input_ids']]
    batch_dict = tokenizer.pad(batch_dict, padding=True, return_attention_mask=True, return_tensors='pt')
    for k, v in batch_dict.items():
        result[f"negative_{k}"] = v

    result["labels"] = [0] * len(examples["sentence"])
    return result

processed_datasets = dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=dataset["train"].column_names,
    desc="Running tokenizer on dataset",
)
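
Two details in this preprocessing are worth noting: the tokenizer is given max_length - 1 so that an EOS token can be appended to every sequence before padding, and the sentence / positive / negative fields are each tokenized and padded separately, then stored under prefixed keys (sentence_*, positive_*, negative_*) so that a single batch carries all three views, presumably for contrastive-style embedding training.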

Data construction

  • Baichuan example

DeepSpeed ZeRO-0

LoRA
1× 80GB A100

{"train_batch_size": "auto","train_micro_batch_size_per_gpu": "auto","gradient_accumulation_steps": "auto","gradient_clipping": "auto","zero_allow_untested_optimizer": true,"fp16": {"enabled": "auto","loss_scale": 0,"loss_scale_window": 1000,"initial_scale_power": 16,"hysteresis": 2,"min_loss_scale": 1},"bf16": {"enabled": "auto"},"zero_optimization": {"stage": 0,"allgather_partitions": true,"allgather_bucket_size": 5e8,"overlap_comm": true,"reduce_scatter": true,"reduce_bucket_size": 5e8,"contiguous_gradients": true,"round_robin_gradients": true}}
{'loss': 1.3997, 'grad_norm': 1.9448336362838745, 'learning_rate': 1.589403973509934e-05, 'epoch': 0.96}                                           
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 156/156 [10:07<00:00,  3.71s/it]
/root/.conda/envs/demo/lib/python3.10/site-packages/peft/utils/save_and_load.py:195: UserWarning: Could not find a config file in /root/share/model_repos/internlm2-chat-7b - will assume that the vocabulary was not modified.
  warnings.warn(
{'train_runtime': 620.0575, 'train_samples_per_second': 16.128, 'train_steps_per_second': 0.252, 'train_loss': 1.4265562815543933, 'epoch': 1.0}   
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 156/156 [10:20<00:00,  3.97s/it]
Training seconds: 638.3791897296906 seconds.
Training minutes: 10.64 minutes.
Peak reserved memory = 60.09 GB.
Peak reserved memory for training = 60.09 GB.
Peak reserved memory % of max memory = 75.918 %.
Peak reserved memory for training % of max memory = 75.918 %.
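
The memory figures above look like they come from torch's allocator statistics; a rough sketch of how such numbers can be produced (not necessarily the exact reporting code used here):

import torch

gpu = torch.cuda.get_device_properties(0)
max_memory = gpu.total_memory / 1024 ** 3                        # total GPU memory, GB
start_reserved = torch.cuda.max_memory_reserved() / 1024 ** 3    # sampled after model load, before training

# ... trainer.train() runs here ...

peak_reserved = torch.cuda.max_memory_reserved() / 1024 ** 3
peak_for_training = peak_reserved - start_reserved
print(f"Peak reserved memory = {peak_reserved:.2f} GB.")
print(f"Peak reserved memory for training = {peak_for_training:.2f} GB.")
print(f"Peak reserved memory % of max memory = {peak_reserved / max_memory * 100:.3f} %.")
print(f"Peak reserved memory for training % of max memory = {peak_for_training / max_memory * 100:.3f} %.")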

DeepSpeed ZeRO-2, no offload

LoRA

{"fp16": {"enabled": "auto","loss_scale": 0,"loss_scale_window": 100,"initial_scale_power": 16,"hysteresis": 2,"min_loss_scale": 1e-10},"zero_optimization": {"stage": 2,"allgather_partitions": true,"allgather_bucket_size": 1e8,"overlap_comm": true,"reduce_scatter": true,"reduce_bucket_size": 1e8,"contiguous_gradients": true},"gradient_accumulation_steps": "auto","gradient_clipping": "auto","steps_per_print": 2000,"train_batch_size": "auto","train_micro_batch_size_per_gpu": "auto","wall_clock_breakdown": false
}
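
Compared with the ZeRO-0 config above, this one turns on stage-2 partitioning of optimizer states and gradients, shrinks the all-gather and reduce buckets from 5e8 to 1e8, and uses a much shorter loss-scale window (100) with a tiny min_loss_scale.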
{'loss': 1.366, 'grad_norm': 2.294084072113037, 'learning_rate': 1.589403973509934e-05, 'epoch': 0.96}                                             
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 156/156 [10:14<00:00,  3.73s/it]
/root/.conda/envs/demo/lib/python3.10/site-packages/peft/utils/save_and_load.py:195: UserWarning: Could not find a config file in /root/share/model_repos/internlm2-chat-7b - will assume that the vocabulary was not modified.
  warnings.warn(
{'train_runtime': 622.2199, 'train_samples_per_second': 16.071, 'train_steps_per_second': 0.251, 'train_loss': 1.4371743569007287, 'epoch': 1.0}   
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 156/156 [10:22<00:00,  3.99s/it]
Training seconds: 631.9961657524109 seconds.
Training minutes: 10.53 minutes.
Peak reserved memory = 59.59 GB.
Peak reserved memory for training = 59.59 GB.
Peak reserved memory % of max memory = 75.286 %.
Peak reserved memory for training % of max memory = 75.286 %.

DeepSpeed ZeRO-2, offload

{"train_batch_size": "auto","train_micro_batch_size_per_gpu": "auto","gradient_accumulation_steps": "auto","gradient_clipping": "auto","zero_allow_untested_optimizer": true,"fp16": {"enabled": "auto","loss_scale": 0,"loss_scale_window": 1000,"initial_scale_power": 16,"hysteresis": 2,"min_loss_scale": 1},"bf16": {"enabled": "auto"},"zero_optimization": {"stage": 2,"offload_optimizer": {"device": "cpu","pin_memory": true},"allgather_partitions": true,"allgather_bucket_size": 5e8,"overlap_comm": true,"reduce_scatter": true,"reduce_bucket_size": 5e8,"contiguous_gradients": true,"round_robin_gradients": true}}

RuntimeError: Error building extension 'cpu_adam'
Exception ignored in: <function DeepSpeedCPUAdam.__del__ at 0x7f087b065750>

  • rm -rf /tmp/torch_extensions/*  (clear the cached JIT builds so cpu_adam gets recompiled)
  • https://github.com/microsoft/DeepSpeed/issues/889#issuecomment-808357696
  • Usually a mismatch between the torch and deepspeed versions

DeepSpeed ZeRO-3, offload

{"train_batch_size": "auto","train_micro_batch_size_per_gpu": "auto","gradient_accumulation_steps": "auto","gradient_clipping": "auto","zero_allow_untested_optimizer": true,"fp16": {"enabled": "auto","loss_scale": 0,"loss_scale_window": 1000,"initial_scale_power": 16,"hysteresis": 2,"min_loss_scale": 1},"bf16": {"enabled": "auto"},"zero_optimization": {"stage": 3,"overlap_comm": true,"contiguous_gradients": true,"sub_group_size": 1e9,"reduce_bucket_size": "auto","stage3_prefetch_bucket_size": "auto","stage3_param_persistence_threshold": "auto","stage3_max_live_parameters": 1e9,"stage3_max_reuse_distance": 1e9,"stage3_gather_16bit_weights_on_model_save": true}}
{'loss': 1.4062, 'grad_norm': 2.122793574276295, 'learning_rate': 1.589403973509934e-05, 'epoch': 0.96}                                            
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 156/156 [20:17<00:00,  7.65s/it]
/root/.conda/envs/demo/lib/python3.10/site-packages/peft/utils/save_and_load.py:195: UserWarning: Could not find a config file in /root/share/model_repos/internlm2-chat-7b - will assume that the vocabulary was not modified.
  warnings.warn(
{'train_runtime': 1225.8007, 'train_samples_per_second': 8.158, 'train_steps_per_second': 0.127, 'train_loss': 1.4307525463593311, 'epoch': 1.0}   
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 156/156 [20:25<00:00,  7.86s/it]
Training seconds: 1227.789188861847 seconds.
Training minutes: 20.46 minutes.
Peak reserved memory = 65.516 GB.
Peak reserved memory for training = 48.928 GB.
Peak reserved memory % of max memory = 82.773 %.
Peak reserved memory for training % of max memory = 61.816 %.
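
Putting the three successful runs side by side: ZeRO-0 and ZeRO-2 without offload behave almost identically here (roughly 10.5 minutes and about 60 GB peak reserved memory for one epoch of LoRA training), while ZeRO-3 roughly doubles the wall-clock time to about 20.5 minutes and only brings the training-attributed memory down to about 49 GB, so for a single-GPU LoRA run of this size the simpler stages are the faster choice.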
