Constructing Data Inputs for a Transformer

Contents

- 1. Text content
- 2. Building the dictionary
  - 2.1 Defining a class for the dictionary
  - 2.2 Splitting the text
  - 2.3 Results
- 3. Complete code

1. Text content

Suppose we have the following text:

Optics

It is the branch of physics that studies the behaviour and properties of light .

Optical Science

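This text has five lines: the first is 'Optics', the second is empty, the third is 'It is the branch of physics that studies the behaviour and properties of light .', the fourth is empty, and the fifth is 'Optical Science'.
From this text we can build a dictionary in which every word is assigned a number (its index). Given an index, we can look up the word it stands for.
Save the text as a .txt file inside a data folder. Both the file and the folder can be created by hand. The code in Section 2.2 expects the file to be named demo_text.txt, so the resulting path is data/demo_text.txt.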
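The folder and file can also be created from Python instead of by hand. The following is only a minimal sketch: the helper name write_demo_text and the constant DEMO_LINES are hypothetical and do not appear in the original code, while the file name demo_text.txt is the one the Data class in Section 2.2 reads.

```python
import os

# The five lines of the demo text; lines 2 and 4 are intentionally empty.
DEMO_LINES = [
    "Optics",
    "",
    "It is the branch of physics that studies the behaviour and properties of light .",
    "",
    "Optical Science",
]


def write_demo_text(folder="data", filename="demo_text.txt"):
    """Create `folder` if needed and write the demo text into it (hypothetical helper)."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, filename)
    with open(path, "w", encoding="utf8") as f:
        f.write("\n".join(DEMO_LINES) + "\n")
    return path


# Writes ./data/demo_text.txt relative to the current working directory.
demo_path = write_demo_text()
print(demo_path)
```

Note that the period at the end of the third line is separated from 'light' by a space, which is why it later becomes a token of its own.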

2. Building the dictionary

2.1 Defining a class for the dictionary

import os
from io import open
import torch


class Dictionary(object):
    def __init__(self):
        self.word2idx = {}
        self.idx2word = []

    def add_word(self, word):
        if word not in self.word2idx:
            self.idx2word.append(word)
            self.word2idx[word] = len(self.idx2word) - 1
        return self.word2idx[word]

    def __len__(self):
        return len(self.idx2word)

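self.word2idx = {} creates an empty dictionary that maps each word to its index. self.idx2word = [] creates an empty list that maps each index back to its word.
The second function, add_word, takes one word at a time. If the word is new, self.idx2word.append(word) appends it to the list, and self.word2idx[word] = len(self.idx2word) - 1 assigns it the next free index and stores the word: index pair in the dictionary. A word that is already present keeps its existing index. In both cases the function returns the word's index.
The third function, __len__, returns the total number of entries in the dictionary. Punctuation counts as a word, for example the period '.' in the text above.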
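A short usage sketch may make the behaviour of add_word concrete. The class below is the same Dictionary as above (repeated so the example runs on its own, without the unused imports); the sample word list and the variable name vocab are made up for illustration.

```python
class Dictionary(object):
    def __init__(self):
        self.word2idx = {}   # word -> index
        self.idx2word = []   # index -> word

    def add_word(self, word):
        # Only unseen words receive a new index.
        if word not in self.word2idx:
            self.idx2word.append(word)
            self.word2idx[word] = len(self.idx2word) - 1
        return self.word2idx[word]

    def __len__(self):
        return len(self.idx2word)


vocab = Dictionary()
for word in ["the", "branch", "of", "the", "light"]:
    # The second "the" returns the index assigned the first time.
    print(word, vocab.add_word(word))

print(len(vocab))         # number of distinct words
print(vocab.idx2word[1])  # word stored at index 1
```

Because "the" appears twice, five calls to add_word produce only four entries, and both lookups (word to index and index to word) stay consistent with each other.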

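2.2 Splitting the text

The Dictionary class needs input words in order to build the vocabulary, so the next step is to prepare that input from the text in Section 1. For this we can define the following Data class: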

import os
from io import open
import torch


class Data(object):
    def __init__(self, path):
        self.dictionary = Dictionary()
        self.demo = self.tokenize(os.path.join(path, 'demo_text.txt'))

    def tokenize(self, path):
        """Tokenizes a text file."""
        assert os.path.exists(path)
        # Add words to the dictionary
        with open(path, 'r', encoding="utf8") as f:
            for line in f:
                words = line.split() + ['<eos>']
                for word in words:
                    self.dictionary.add_word(word)

        # Tokenize file content
        with open(path, 'r', encoding="utf8") as f:
            idss = []
            for line in f:
                words = line.split() + ['<eos>']
                ids = []
                for word in words:
                    ids.append(self.dictionary.word2idx[word])
                idss.append(torch.tensor(ids).type(torch.int64))
            ids = torch.cat(idss)

        return ids

self.dictionary = Dictionary() instantiates the dictionary class from Section 2.1 so that it can be used inside Data. self.demo = self.tokenize(os.path.join(path, 'demo_text.txt')) stores the contents of demo_text.txt converted into a sequence of indices; the conversion itself is done by tokenize(self, path).
In tokenize, the first with open(path, 'r', encoding="utf8") as f: block opens the file and loops over it line by line, splitting each line on whitespace into words (punctuation included) and adding every word to the dictionary with self.dictionary.add_word(word). Note that words = line.split() + ['<eos>'] appends the token '<eos>' to every line to mark where the line ends.
The second with open(path, 'r', encoding="utf8") as f: block opens the file again and splits each line in the same way, but this time ids.append(self.dictionary.word2idx[word]) looks up the index of every word and collects the indices of the current line in the list ids.
idss.append(torch.tensor(ids).type(torch.int64)) converts the ids of the current line into an int64 tensor and stores it.
Since each iteration yields one tensor, idss ends up holding one tensor per line. Finally, ids = torch.cat(idss) concatenates them into a single one-dimensional tensor.
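The two passes can be traced on a tiny input. The sketch below is a simplification made under stated assumptions, not the article's implementation: it reads lines from a Python list instead of a file and uses plain lists instead of tensors, so it runs without PyTorch. The function name tokenize_lines is hypothetical, and flattening the per-line lists plays the role that torch.cat plays in tokenize.

```python
def tokenize_lines(lines):
    """Two-pass tokenization over in-memory lines (hypothetical helper)."""
    word2idx = {}

    # Pass 1: give every new word the next free index, in order of appearance.
    for line in lines:
        for word in line.split() + ["<eos>"]:
            if word not in word2idx:
                word2idx[word] = len(word2idx)

    # Pass 2: translate each line into a list of indices.
    per_line = [
        [word2idx[word] for word in line.split() + ["<eos>"]]
        for line in lines
    ]

    # Flatten the per-line lists into one sequence (the job of torch.cat).
    flat = [idx for ids in per_line for idx in ids]
    return word2idx, per_line, flat


lines = ["Optics", "", "Optical Science"]
word2idx, per_line, flat = tokenize_lines(lines)
print(word2idx)  # {'Optics': 0, '<eos>': 1, 'Optical': 2, 'Science': 3}
print(per_line)  # [[0, 1], [1], [2, 3, 1]]
print(flat)      # [0, 1, 1, 2, 3, 1]
```

The empty line shows why '<eos>' matters: splitting an empty string yields no words, so the line contributes only the end-of-line token.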

2.3 Results

The following code prints the dictionary:

data = Data('./data')  # folder that contains demo_text.txt
data_dict = data.dictionary.word2idx
print(f'Dictionary built from the given text:\n{data_dict}')

The output is:

Dictionary built from the given text:
{'Optics': 0, '<eos>': 1, 'It': 2, 'is': 3, 'the': 4, 'branch': 5, 'of': 6, 'physics': 7, 'that': 8, 'studies': 9,
'behaviour': 10, 'and': 11, 'properties': 12, 'light': 13, '.': 14, 'Optical': 15, 'Science': 16}

Comparing this with the original text, every distinct word has its own index, and '<eos>' is the token we added ourselves to mark the end of a line.

The index encoding of the text is printed as follows:

data_demo = data.demo
print(f"Index encoding of the given text:\n{data_demo}")
# Index encoding of the given text:
# tensor([ 0,  1,  1,  2,  3,  4,  5,  6,  7,  8,  9,  4, 10, 11, 12,  6, 13, 14,
#          1,  1, 15, 16,  1])

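The first number, 0, stands for 'Optics', and the second number, 1, is the end-of-line token '<eos>' that follows it.
The third number, 1, is the '<eos>' of the empty second line.
The fourth number, 2, is 'It', the first word of the third line. The rest of the sequence can be matched against the text in the same way, because the dictionary maps every index to exactly one word.
This index sequence is what is used as training data for the Transformer.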
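The correspondence can be checked by decoding the indices back into text. The sketch below copies the index-to-word list and the index sequence from the outputs printed in this section; the function name decode is hypothetical and not part of the original code.

```python
# Index -> word list, in the order given by the printed dictionary.
idx2word = [
    'Optics', '<eos>', 'It', 'is', 'the', 'branch', 'of', 'physics', 'that',
    'studies', 'behaviour', 'and', 'properties', 'light', '.', 'Optical',
    'Science',
]

# Index sequence copied from the printed tensor.
ids = [0, 1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 4, 10, 11, 12, 6, 13, 14,
       1, 1, 15, 16, 1]


def decode(ids, idx2word, eos="<eos>"):
    """Rebuild the text lines from indices, closing a line at every eos token."""
    lines, current = [], []
    for idx in ids:
        word = idx2word[idx]
        if word == eos:
            lines.append(" ".join(current))
            current = []
        else:
            current.append(word)
    return lines


decoded = decode(ids, idx2word)
for line in decoded:
    print(repr(line))
```

With the objects from this article, the same inputs would presumably come from data.dictionary.idx2word and data.demo.tolist(). The decoded result has five lines, two of them empty, matching the original text.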

3. Complete code

# %%
import os
from io import open
import torch


# %% Dictionary
class Dictionary(object):
    def __init__(self):
        self.word2idx = {}
        self.idx2word = []

    def add_word(self, word):
        if word not in self.word2idx:
            self.idx2word.append(word)
            self.word2idx[word] = len(self.idx2word) - 1
        return self.word2idx[word]

    def __len__(self):
        return len(self.idx2word)


# %% Data
class Data(object):
    def __init__(self, path):
        self.dictionary = Dictionary()
        self.demo = self.tokenize(os.path.join(path, 'demo_text.txt'))

    def tokenize(self, path):
        """Tokenizes a text file."""
        assert os.path.exists(path)
        # Add words to the dictionary
        with open(path, 'r', encoding="utf8") as f:
            for line in f:
                words = line.split() + ['<eos>']
                for word in words:
                    self.dictionary.add_word(word)

        # Tokenize file content
        with open(path, 'r', encoding="utf8") as f:
            idss = []
            for line in f:
                words = line.split() + ['<eos>']
                ids = []
                for word in words:
                    ids.append(self.dictionary.word2idx[word])
                idss.append(torch.tensor(ids).type(torch.int64))
            ids = torch.cat(idss)

        return ids


# %%
data = Data('./data')  # folder that contains demo_text.txt
data_dict = data.dictionary.word2idx
print(f'Dictionary built from the given text:\n{data_dict}')
# Dictionary built from the given text:
# {'Optics': 0, '<eos>': 1, 'It': 2, 'is': 3, 'the': 4, 'branch': 5, 'of': 6, 'physics': 7, 'that': 8, 'studies': 9,
# 'behaviour': 10, 'and': 11, 'properties': 12, 'light': 13, '.': 14, 'Optical': 15, 'Science': 16}
data_demo = data.demo
print(f"Index encoding of the given text:\n{data_demo}")
# Index encoding of the given text:
# tensor([ 0,  1,  1,  2,  3,  4,  5,  6,  7,  8,  9,  4, 10, 11, 12,  6, 13, 14,
#          1,  1, 15, 16,  1])