Fine-tuning LLM experiments
Key areas of focus: data cleaning, data synthesis methods, the kinds of SFT tasks, and the scale of SFT data.
Data loading
Users in mainland China are advised to download datasets from https://modelscope.cn/datasets, but the download turns out not to plug seamlessly into Hugging Face `datasets`; instead it raises an error:
- AttributeError: 'MsDataset' object has no attribute 'column_names'
We can therefore keep downloading data through ModelScope but convert it into the form that `datasets` expects, which is also a good way to get more familiar with the whole data pipeline.
The simplest fix, though, is to convert right after loading:

```python
from modelscope.msdatasets import MsDataset

# dataset id is a placeholder; substitute the dataset you actually use
dataset = MsDataset.load('some/dataset-id')  # downloaded from the ModelScope community
train_dataset = dataset.to_hf_dataset()      # convert to a Hugging Face datasets.Dataset
```
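If `to_hf_dataset()` is not available in your modelscope version, rebuilding the dataset by hand also works. A sketch, assuming an `MsDataset` yields one dict per example when iterated:

```python
from datasets import Dataset

# Materialize the ModelScope records and rebuild a Hugging Face Dataset;
# the resulting object exposes .column_names, .map, etc.
records = list(dataset)
train_dataset = Dataset.from_list(records)
```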
For reference, dataset loading in other frameworks:
- https://github.com/modelscope/modelscope/blob/a903ec7a898f5dfb44349e2ce15971ec5f08e528/examples/pytorch/llm/utils/dataset.py#L34
- https://github.com/hiyouga/LLaMA-Factory/blob/6c94305e4746c9a735ff62a6428e295d1a67da52/src/llmtuner/data/loader.py#L83
A few approaches; for example:
```python
from datasets import load_dataset

# args, tokenizer, task and get_detailed_instruct are defined elsewhere in the
# surrounding script. NOTE: split slicing like "train[:1024]" only works with
# load_dataset; load_from_disk (as in the original snippet) has no split argument.
dataset = load_dataset(args.dataset_name)
train_dataset = dataset["train"].select(range(1024))

def preprocess_function(examples):
    # Tokenize the queries, reserving one position for an appended EOS token.
    queries = examples["sentence"]
    queries = get_detailed_instruct(task, queries)
    batch_dict = tokenizer(queries, max_length=args.max_length - 1,
                           return_attention_mask=False, padding=False, truncation=True)
    batch_dict['input_ids'] = [input_ids + [tokenizer.eos_token_id]
                               for input_ids in batch_dict['input_ids']]
    batch_dict = tokenizer.pad(batch_dict, padding=True,
                               return_attention_mask=True, return_tensors='pt')
    result = {f"sentence_{k}": v for k, v in batch_dict.items()}

    # Same treatment for the positive passages.
    queries = examples["positive"]
    batch_dict = tokenizer(queries, max_length=args.max_length - 1,
                           return_attention_mask=False, padding=False, truncation=True)
    batch_dict['input_ids'] = [input_ids + [tokenizer.eos_token_id]
                               for input_ids in batch_dict['input_ids']]
    batch_dict = tokenizer.pad(batch_dict, padding=True,
                               return_attention_mask=True, return_tensors='pt')
    for k, v in batch_dict.items():
        result[f"positive_{k}"] = v

    # And for the negative passages.
    queries = examples["negative"]
    batch_dict = tokenizer(queries, max_length=args.max_length - 1,
                           return_attention_mask=False, padding=False, truncation=True)
    batch_dict['input_ids'] = [input_ids + [tokenizer.eos_token_id]
                               for input_ids in batch_dict['input_ids']]
    batch_dict = tokenizer.pad(batch_dict, padding=True,
                               return_attention_mask=True, return_tensors='pt')
    for k, v in batch_dict.items():
        result[f"negative_{k}"] = v

    result["labels"] = [0] * len(examples["sentence"])
    return result

processed_datasets = dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=dataset["train"].column_names,
    desc="Running tokenizer on dataset",
)
```
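What this does: each of the `sentence` (query), `positive`, and `negative` columns is tokenized separately with one slot reserved for an appended EOS token, then padded into its own `input_ids`/`attention_mask` tensors prefixed with the column name, yielding the (query, positive, negative) triplets used by a contrastive-style embedding objective.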
Data construction
- Baichuan example (a minimal sketch of one possible sample format follows below)
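A minimal sketch of what such SFT samples can look like; the instruction/input/output JSONL schema here is an illustrative assumption, not Baichuan's official format:

```python
import json

# Hypothetical SFT sample schema for illustration; adapt the field names
# to whatever your training framework expects.
samples = [
    {
        "instruction": "Translate the sentence into English.",
        "input": "今天天气不错。",
        "output": "The weather is nice today.",
    },
]

# Write one JSON object per line (JSONL), keeping non-ASCII text readable.
with open("sft_train.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```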
DeepSpeed ZeRO-0
LoRA
1× 80 GB A100
{"train_batch_size": "auto","train_micro_batch_size_per_gpu": "auto","gradient_accumulation_steps": "auto","gradient_clipping": "auto","zero_allow_untested_optimizer": true,"fp16": {"enabled": "auto","loss_scale": 0,"loss_scale_window": 1000,"initial_scale_power": 16,"hysteresis": 2,"min_loss_scale": 1},"bf16": {"enabled": "auto"},"zero_optimization": {"stage": 0,"allgather_partitions": true,"allgather_bucket_size": 5e8,"overlap_comm": true,"reduce_scatter": true,"reduce_bucket_size": 5e8,"contiguous_gradients": true,"round_robin_gradients": true}}
Training log:

```
{'loss': 1.3997, 'grad_norm': 1.9448336362838745, 'learning_rate': 1.589403973509934e-05, 'epoch': 0.96}
100%|████████████████████████████████████████| 156/156 [10:07<00:00,  3.71s/it]
/root/.conda/envs/demo/lib/python3.10/site-packages/peft/utils/save_and_load.py:195: UserWarning: Could not find a config file in /root/share/model_repos/internlm2-chat-7b - will assume that the vocabulary was not modified.
  warnings.warn(
{'train_runtime': 620.0575, 'train_samples_per_second': 16.128, 'train_steps_per_second': 0.252, 'train_loss': 1.4265562815543933, 'epoch': 1.0}
100%|████████████████████████████████████████| 156/156 [10:20<00:00,  3.97s/it]
Training seconds: 638.3791897296906 seconds.
Training minutes: 10.64 minutes.
Peak reserved memory = 60.09 GB.
Peak reserved memory for training = 60.09 GB.
Peak reserved memory % of max memory = 75.918 %.
Peak reserved memory for training % of max memory = 75.918 %.
```
DeepSpeed ZeRO-2 (no offload)
LoRA
{"fp16": {"enabled": "auto","loss_scale": 0,"loss_scale_window": 100,"initial_scale_power": 16,"hysteresis": 2,"min_loss_scale": 1e-10},"zero_optimization": {"stage": 2,"allgather_partitions": true,"allgather_bucket_size": 1e8,"overlap_comm": true,"reduce_scatter": true,"reduce_bucket_size": 1e8,"contiguous_gradients": true},"gradient_accumulation_steps": "auto","gradient_clipping": "auto","steps_per_print": 2000,"train_batch_size": "auto","train_micro_batch_size_per_gpu": "auto","wall_clock_breakdown": false
}
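Relative to the ZeRO-0 config above, stage 2 partitions optimizer states and gradients across ranks, and this config uses smaller 1e8 communication buckets with a tighter loss-scale window (100 vs. 1000). On a single GPU there are no peers to shard across, which is consistent with the near-identical runtime and memory below.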
Training log:

```
{'loss': 1.366, 'grad_norm': 2.294084072113037, 'learning_rate': 1.589403973509934e-05, 'epoch': 0.96}
100%|████████████████████████████████████████| 156/156 [10:14<00:00,  3.73s/it]
/root/.conda/envs/demo/lib/python3.10/site-packages/peft/utils/save_and_load.py:195: UserWarning: Could not find a config file in /root/share/model_repos/internlm2-chat-7b - will assume that the vocabulary was not modified.
  warnings.warn(
{'train_runtime': 622.2199, 'train_samples_per_second': 16.071, 'train_steps_per_second': 0.251, 'train_loss': 1.4371743569007287, 'epoch': 1.0}
100%|████████████████████████████████████████| 156/156 [10:22<00:00,  3.99s/it]
Training seconds: 631.9961657524109 seconds.
Training minutes: 10.53 minutes.
Peak reserved memory = 59.59 GB.
Peak reserved memory for training = 59.59 GB.
Peak reserved memory % of max memory = 75.286 %.
Peak reserved memory for training % of max memory = 75.286 %.
```
DeepSpeed ZeRO-2 (offload)
{"train_batch_size": "auto","train_micro_batch_size_per_gpu": "auto","gradient_accumulation_steps": "auto","gradient_clipping": "auto","zero_allow_untested_optimizer": true,"fp16": {"enabled": "auto","loss_scale": 0,"loss_scale_window": 1000,"initial_scale_power": 16,"hysteresis": 2,"min_loss_scale": 1},"bf16": {"enabled": "auto"},"zero_optimization": {"stage": 2,"offload_optimizer": {"device": "cpu","pin_memory": true},"allgather_partitions": true,"allgather_bucket_size": 5e8,"overlap_comm": true,"reduce_scatter": true,"reduce_bucket_size": 5e8,"contiguous_gradients": true,"round_robin_gradients": true}}
This run failed while JIT-compiling DeepSpeed's CPU optimizer extension:

```
RuntimeError: Error building extension 'cpu_adam'
Exception ignored in: <function DeepSpeedCPUAdam.__del__ at 0x7f087b065750>
```

The fix is to clear the cached extension build and rerun:

```
rm -rf /tmp/torch_extensions/*
```

- https://github.com/microsoft/DeepSpeed/issues/889#issuecomment-808357696
- Root cause: a torch/deepspeed version mismatch.
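Alternatively, prebuilding the op at install time with `DS_BUILD_CPU_ADAM=1 pip install deepspeed` avoids the runtime JIT compilation entirely.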
DeepSpeed ZeRO-3
{"train_batch_size": "auto","train_micro_batch_size_per_gpu": "auto","gradient_accumulation_steps": "auto","gradient_clipping": "auto","zero_allow_untested_optimizer": true,"fp16": {"enabled": "auto","loss_scale": 0,"loss_scale_window": 1000,"initial_scale_power": 16,"hysteresis": 2,"min_loss_scale": 1},"bf16": {"enabled": "auto"},"zero_optimization": {"stage": 3,"overlap_comm": true,"contiguous_gradients": true,"sub_group_size": 1e9,"reduce_bucket_size": "auto","stage3_prefetch_bucket_size": "auto","stage3_param_persistence_threshold": "auto","stage3_max_live_parameters": 1e9,"stage3_max_reuse_distance": 1e9,"stage3_gather_16bit_weights_on_model_save": true}}
Training log:

```
{'loss': 1.4062, 'grad_norm': 2.122793574276295, 'learning_rate': 1.589403973509934e-05, 'epoch': 0.96}
100%|████████████████████████████████████████| 156/156 [20:17<00:00,  7.65s/it]
/root/.conda/envs/demo/lib/python3.10/site-packages/peft/utils/save_and_load.py:195: UserWarning: Could not find a config file in /root/share/model_repos/internlm2-chat-7b - will assume that the vocabulary was not modified.
  warnings.warn(
{'train_runtime': 1225.8007, 'train_samples_per_second': 8.158, 'train_steps_per_second': 0.127, 'train_loss': 1.4307525463593311, 'epoch': 1.0}
100%|████████████████████████████████████████| 156/156 [20:25<00:00,  7.86s/it]
Training seconds: 1227.789188861847 seconds.
Training minutes: 20.46 minutes.
Peak reserved memory = 65.516 GB.
Peak reserved memory for training = 48.928 GB.
Peak reserved memory % of max memory = 82.773 %.
Peak reserved memory for training % of max memory = 61.816 %.
```
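Summary of the three completed runs (LoRA, 1× 80 GB A100):

| Setup | train_runtime (s) | samples/s | Peak reserved memory |
| --- | --- | --- | --- |
| ZeRO-0 | 620.1 | 16.13 | 60.09 GB |
| ZeRO-2 (no offload) | 622.2 | 16.07 | 59.59 GB |
| ZeRO-3 | 1225.8 | 8.16 | 65.52 GB (48.93 GB during training) |

On a single GPU, ZeRO-0 and ZeRO-2 are essentially indistinguishable, while ZeRO-3 roughly doubles the runtime, likely due to the extra parameter partition/gather traffic.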