Finetuning Large Language Models: Data Preparation (Part 4)


1. Data Quality

Data preparation proceeds in a few steps: collect instruction-response pairs, tokenize them (with padding or truncation as needed), and split them into training and test sets; the rest of this post walks through each step.

What is tokenizing? It converts text into the numbers that represent it, typically based on how frequently character sequences occur. Note that encoding and decoding must use the same tokenizer: a trained model generally has its own dedicated tokenizer that matches it.
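To see concretely why tokenizer consistency matters, here is a minimal sketch (my addition, not from the course); it assumes the transformers library is installed and uses gpt2 purely as an arbitrary second tokenizer for contrast:

from transformers import AutoTokenizer

pythia_tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")

ids = pythia_tokenizer("Hi, how are you?")["input_ids"]
# Decoding with the matching tokenizer recovers the original text...
print(pythia_tokenizer.decode(ids))
# ...but decoding the same ids with a different tokenizer's vocabulary
# generally yields unrelated text, since the id-to-token mapping differs.
print(gpt2_tokenizer.decode(ids))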

2. Preparing the Data

2.1 Setting up the environment

python">import pandas as pd
import datasetsfrom pprint import pprint
from transformers import AutoTokenizer

2.2 Tokenizing the text

python"># 调用预训练模型的分词器
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
text = "Hi, how are you?"
encoded_text = tokenizer(text)["input_ids"]
print(encoded_text)

Output:

[12764, 13, 849, 403, 368, 32]
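To make the id-to-text mapping concrete, you can also look at the string piece behind each id (convert_ids_to_tokens is a standard tokenizer method; this check is my addition):

# Each id corresponds to a subword piece; in GPT-NeoX-style BPE vocabularies
# a leading "Ġ" marks a piece that begins with a space.
print(tokenizer.convert_ids_to_tokens(encoded_text))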

Try decoding the encoded data:

python">decoded_text = tokenizer.decode(encoded_text)
print("Decoded tokens back into text: ", decoded_text)

Output:

Decoded tokens back into text:  Hi, how are you?

As you can see, the text is recovered exactly. Next, try encoding and decoding several texts at once:

python">list_texts = ["Hi, how are you?", "I'm good", "Yes"]
encoded_texts = tokenizer(list_texts)
print("Encoded several texts: ", encoded_texts["input_ids"])

Output:

Encoded several texts:  [[12764, 13, 849, 403, 368, 32], [42, 1353, 1175], [4374]]

2.3 Padding and truncation

In practice, it is important to the model that everything in a batch has the same length, because the model operates on fixed-size tensors, so the encoded texts must match in length. A padding strategy handles variable-length encodings: you must specify which token id to pad with. Here we use 0, which for this tokenizer is also the end-of-sequence token. When the tokenizer is run with padding enabled, you can see that the shorter strings are padded on the right with zeros.

python">tokenizer.pad_token = tokenizer.eos_token 
encoded_texts_longest = tokenizer(list_texts, padding=True)
print("Using padding: ", encoded_texts_longest["input_ids"])

Output:

Using padding:  [[12764, 13, 849, 403, 368, 32], [42, 1353, 1175, 0, 0, 0], [4374, 0, 0, 0, 0, 0]]
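Alongside input_ids, the tokenizer also returns an attention_mask that marks which positions are real tokens (1) and which are padding (0), so the model can ignore the padded zeros. A quick check (my addition):

# Real tokens are marked 1, padded positions 0; this should print
# [[1, 1, 1, 1, 1, 1], [1, 1, 1, 0, 0, 0], [1, 0, 0, 0, 0, 0]]
print(encoded_texts_longest["attention_mask"])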

Truncation can also be used to keep a fixed length:

python">encoded_texts_truncation = tokenizer(list_texts, max_length=3, truncation=True)
print("Using truncation: ", encoded_texts_truncation["input_ids"])

Output:

python">Using truncation:  [[12764, 13, 849], [42, 1353, 1175], [4374]]

As shown above, the texts are truncated from the right. You can also truncate from the left:

python">tokenizer.truncation_side = "left"
encoded_texts_truncation_left = tokenizer(list_texts, max_length=3, truncation=True)
print("Using left-side truncation: ", encoded_texts_truncation_left["input_ids"])

Output:

Using left-side truncation:  [[403, 368, 32], [42, 1353, 1175], [4374]]

Truncation and padding can also be combined into one consistent strategy:

python">encoded_texts_both = tokenizer(list_texts, max_length=3, truncation=True, padding=True)
print("Using both padding and truncation: ", encoded_texts_both["input_ids"])

Output:

Using both padding and truncation:  [[403, 368, 32], [42, 1353, 1175], [4374, 0, 0]]

2.4 Preparing the instruction dataset

python">import pandas as pdfilename = "lamini_docs.jsonl"
instruction_dataset_df = pd.read_json(filename, lines=True)
examples = instruction_dataset_df.to_dict()if "question" in examples and "answer" in examples:text = examples["question"][0] + examples["answer"][0]
elif "instruction" in examples and "response" in examples:text = examples["instruction"][0] + examples["response"][0]
elif "input" in examples and "output" in examples:text = examples["input"][0] + examples["output"][0]
else:text = examples["text"][0]prompt_template = """### Question:
{question}### Answer:"""num_examples = len(examples["question"])
finetuning_dataset = []
for i in range(num_examples):question = examples["question"][i]answer = examples["answer"][i]text_with_prompt_template = prompt_template.format(question=question)finetuning_dataset.append({"question": text_with_prompt_template, "answer": answer})from pprint import pprint
print("One datapoint in the finetuning dataset:")
pprint(finetuning_dataset[0])

Output:

One datapoint in the finetuning dataset:
{'answer': 'Lamini has documentation on Getting Started, Authentication, '
           'Question Answer Model, Python Library, Batching, Error Handling, '
           'Advanced topics, and class documentation on LLM Engine available '
           'at https://lamini-ai.github.io/.',
 'question': '### Question:\n'
             'What are the different types of documents available in the '
             'repository (e.g., installation guide, API documentation, '
             "developer's guide)?\n"
             '\n'
             '### Answer:'}

2.5 Tokenizing a single example

python">text = finetuning_dataset[0]["question"] + finetuning_dataset[0]["answer"]
tokenized_inputs = tokenizer(text,return_tensors="np",padding=True
)
print(tokenized_inputs["input_ids"])

Output:

python">[[ 4118 19782    27   187  1276   403   253  1027  3510   273  7177  2130275   253 18491   313    70    15    72   904 12692  7102    13  899010097    13 13722   434  7102  6177   187   187  4118 37741    27    454988    74   556 10097   327 27669 11075   264    13  5271 23058    1319782 37741 10031    13 13814 11397    13   378 16464    13 11759 105351981    13 21798 12989    13   285   966 10097   327 21708    46 107972130   387  5987  1358    77  4988    74    14  2284    15  7280    15900 14206]]

Then cap the length so the data stays within a consistent maximum:

python">max_length = 2048
max_length = min(tokenized_inputs["input_ids"].shape[1],max_length,
)tokenized_inputs = tokenizer(text,return_tensors="np",truncation=True,max_length=max_length
)tokenized_inputs["input_ids"]

Output:

array([[ 4118, 19782,    27,   187,  1276,   403,   253,  1027,  3510,
          273,  7177,  2130,   275,   253, 18491,   313,    70,    15,
           72,   904, 12692,  7102,    13,  8990, 10097,    13, 13722,
          434,  7102,  6177,   187,   187,  4118, 37741,    27,    45,
         4988,    74,   556, 10097,   327, 27669, 11075,   264,    13,
         5271, 23058,    13, 19782, 37741, 10031,    13, 13814, 11397,
           13,   378, 16464,    13, 11759, 10535,  1981,    13, 21798,
        12989,    13,   285,   966, 10097,   327, 21708,    46, 10797,
         2130,   387,  5987,  1358,    77,  4988,    74,    14,  2284,
           15,  7280,    15,   900, 14206]])
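As a quick sanity check (my addition, not part of the original walkthrough), decoding the truncated ids should reproduce the prompt-plus-answer text from above:

# Decode back to text to confirm nothing was lost at this max_length.
print(tokenizer.decode(tokenized_inputs["input_ids"][0]))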

2.6 Tokenizing the instruction dataset

python">def tokenize_function(examples):if "question" in examples and "answer" in examples:text = examples["question"][0] + examples["answer"][0]elif "input" in examples and "output" in examples:text = examples["input"][0] + examples["output"][0]else:text = examples["text"][0]tokenizer.pad_token = tokenizer.eos_tokentokenized_inputs = tokenizer(text,return_tensors="np",padding=True,)max_length = min(tokenized_inputs["input_ids"].shape[1],2048)tokenizer.truncation_side = "left"tokenized_inputs = tokenizer(text,return_tensors="np",truncation=True,max_length=max_length)return tokenized_inputs

Now tokenize the data. Note that tokenize_function only reads the first example of each batch, so map is called with batched=True and batch_size=1 so that each example is processed on its own:

python">finetuning_dataset_loaded = datasets.load_dataset("json", data_files=filename, split="train")tokenized_dataset = finetuning_dataset_loaded.map(tokenize_function,batched=True,batch_size=1,drop_last_batch=True
)print(tokenized_dataset)

Output:

Dataset({
    features: ['question', 'answer', 'input_ids', 'attention_mask'],
    num_rows: 1400
})

Add a labels column to the dataset:

python">tokenized_dataset = tokenized_dataset.add_column("labels", tokenized_dataset["input_ids"])

2.7 Splitting the dataset

python">split_dataset = tokenized_dataset.train_test_split(test_size=0.1, shuffle=True, seed=123)
print(split_dataset)

Output:

DatasetDict({
    train: Dataset({
        features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1260
    })
    test: Dataset({
        features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 140
    })
})
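To reuse the split later without redoing the preprocessing, the DatasetDict can be persisted; this step is my addition and the paths below are hypothetical:

# Save locally (creates a directory of Arrow files)...
split_dataset.save_to_disk("lamini_docs_processed")
# ...or, with a Hugging Face account, push it to the Hub:
# split_dataset.push_to_hub("your-username/lamini_docs_processed")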

2.8 Testing your own dataset loading

python">finetuning_dataset_path = "lamini/lamini_docs"
finetuning_dataset = datasets.load_dataset(finetuning_dataset_path)
print(finetuning_dataset)

Output:

python">DatasetDict({train: Dataset({features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],num_rows: 1260})test: Dataset({features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],num_rows: 140})
})

Try the taylor_swift dataset:

python">taylor_swift_dataset = "lamini/taylor_swift"
bts_dataset = "lamini/bts"
open_llms = "lamini/open_llms"dataset_swiftie = datasets.load_dataset(taylor_swift_dataset)
print(dataset_swiftie["train"][1])

Output:

{'question': 'What is the most popular Taylor Swift song among millennials? How does this song relate to the millennial generation? What is the significance of this song in the millennial culture?', 'answer': 'Taylor Swift\'s "Shake It Off" is the most popular song among millennials. This song relates to the millennial generation as it is an anthem of self-acceptance and embracing one\'s individuality. The song\'s message of not letting others bring you down and to just dance it off resonates with the millennial culture, which is often characterized by a strong sense of individuality and a rejection of societal norms. Additionally, the song\'s upbeat and catchy melody makes it a perfect fit for the millennial generation, which is known for its love of pop music.', 'input_ids': [1276, 310, 253, 954, 4633, 11276, 24619, 4498, 2190, 24933, 8075, 32, 1359, 1057, 436, 4498, 14588, 281, 253, 24933, 451, 5978, 32, 1737, 310, 253, 8453, 273, 436, 4498, 275, 253, 24933, 451, 4466, 32, 37979, 24619, 434, 346, 2809, 640, 733, 5566, 3, 310, 253, 954, 4633, 4498, 2190, 24933, 8075, 15, 831, 4498, 7033, 281, 253, 24933, 451, 5978, 347, 352, 310, 271, 49689, 273, 1881, 14, 14764, 593, 285, 41859, 581, 434, 2060, 414, 15, 380, 4498, 434, 3935, 273, 417, 13872, 2571, 3324, 368, 1066, 285, 281, 816, 11012, 352, 745, 8146, 684, 342, 253, 24933, 451, 4466, 13, 534, 310, 2223, 7943, 407, 247, 2266, 3282, 273, 2060, 414, 285, 247, 18235, 273, 38058, 22429, 15, 9157, 13, 253, 4498, 434, 598, 19505, 285, 5834, 90, 40641, 2789, 352, 247, 3962, 4944, 323, 253, 24933, 451, 5978, 13, 534, 310, 1929, 323, 697, 2389, 273, 1684, 3440, 15], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'labels': [1276, 310, 253, 954, 4633, 11276, 24619, 4498, 2190, 24933, 8075, 32, 1359, 1057, 436, 4498, 14588, 281, 253, 24933, 451, 5978, 32, 1737, 310, 253, 8453, 273, 436, 4498, 275, 253, 24933, 451, 4466, 32, 37979, 24619, 434, 346, 2809, 640, 733, 5566, 3, 310, 253, 954, 4633, 4498, 2190, 24933, 8075, 15, 831, 4498, 7033, 281, 253, 24933, 451, 5978, 347, 352, 310, 271, 49689, 273, 1881, 14, 14764, 593, 285, 41859, 581, 434, 2060, 414, 15, 380, 4498, 434, 3935, 273, 417, 13872, 2571, 3324, 368, 1066, 285, 281, 816, 11012, 352, 745, 8146, 684, 342, 253, 24933, 451, 4466, 13, 534, 310, 2223, 7943, 407, 247, 2266, 3282, 273, 2060, 414, 285, 247, 18235, 273, 38058, 22429, 15, 9157, 13, 253, 4498, 434, 598, 19505, 285, 5834, 90, 40641, 2789, 352, 247, 3962, 4944, 323, 253, 24933, 451, 5978, 13, 534, 310, 1929, 323, 697, 2389, 273, 1684, 3440, 15]}

3. Summary

This section covered the data preparation that precedes fine-tuning a large model. The key pieces are the model's tokenizer, the padding/truncation strategy, and the train/test split. When building your own dataset, just follow the workflow above; the quality of the data itself is up to you.

