Language Models Based on Recurrent Neural Networks: RNNLM and GRULM


RNN-Based Language Model: RNNLM

RNNLM was first proposed in "Recurrent neural network based language model", a highly influential neural language modeling paper published in 2010. Its main contributions are:

  1. It was the first to propose and implement a language model based on a recurrent neural network (RNN), commonly referred to as the RNN language model.
  2. By introducing recurrent connections in the hidden layer, it captures long-range dependencies in word sequences, giving the model stronger sequence-modeling capacity.
  3. It removes the fixed-length-context restriction of the NNLM: an RNN accepts variable-length context.
  4. It started the trend of using more powerful and complex neural networks for language modeling, such as the later LSTM language models.
  5. RNN-based language models subsequently saw wide practical use and had a major impact.

RNNLM Model Structure

[Figure: RNN language model structure — input x_t, hidden state s_t (fed back as s_{t-1}), output y_t]
  • $x$ : input layer

  • $s$ : hidden / context / state layer

  • $y$ : output layer

  • $x_t$ : input at time $t$

  • $y_t$ : output at time $t$, the probability distribution over the next word

  • $s_t$ : hidden-layer state at time $t$

The model's input vector $x_t$ is the concatenation of the current word vector $w_t$ and the previous state vector $s_{t-1}$:

$$x_t = w_t + s_{t-1} \qquad (+\ \text{denotes concatenation})$$

The forward computation of the model is as follows:

$$s_t^j = f\Big(\sum_i x_t^i\, u_{ji}\Big) \tag{1}$$

$$y_t^k = g\Big(\sum_j s_t^j\, v_{kj}\Big) \tag{2}$$

where $f(z)$ is the sigmoid activation function:

$$f(z) = \frac{1}{1 + e^{-z}} \tag{3}$$

and $g(z)$ is the softmax function:

$$g(z_m) = \frac{e^{z_m}}{\sum_k e^{z_k}} \tag{4}$$
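Equations (1)–(4) map directly onto two weight matrices: $U$ (input to hidden) and $V$ (hidden to output). A minimal sketch of one forward step in PyTorch, with randomly initialized weights; the names U, V and the sizes are illustrative assumptions, not the paper's setup:

import torch

vocab_size, hidden_size = 1000, 64                             # illustrative sizes
U = 0.01 * torch.randn(hidden_size, vocab_size + hidden_size)  # input -> hidden weights
V = 0.01 * torch.randn(vocab_size, hidden_size)                # hidden -> output weights

w_t = torch.zeros(vocab_size); w_t[42] = 1.0   # one-hot encoding of the current word
s_prev = torch.full((hidden_size,), 0.1)       # s_{t-1}; here s_0 initialised to small values

x_t = torch.cat([w_t, s_prev])                 # x_t = [w_t ; s_{t-1}]
s_t = torch.sigmoid(U @ x_t)                   # equations (1) + (3)
y_t = torch.softmax(V @ s_t, dim=0)            # equations (2) + (4): distribution over the next word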

Training Details

  • $s_0$ : the initial state is set to small values, e.g. 0.1; with a sufficiently large corpus this initialization barely matters.
  • $w_t$ : the word is represented as a one-hot vector; in practice the vocabulary size ranges from 30,000 to 200,000.
  • Size of the state (hidden) layer: 30 to 500; experiments show that the larger the corpus, the larger the hidden layer should be.
  • Initial learning rate $\alpha = 0.1$, halved whenever the loss stops decreasing significantly.

Error Function

$$Error_t = desired_t - y_t \tag{5}$$

  • $desired_t$ : the one-hot vector of the true next word at time $t$.
  • $y_t$ : the model's predicted output.
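For intuition, the per-step error in equation (5) is, up to sign, the gradient of a cross-entropy loss with respect to the softmax pre-activations, which is why it can directly drive the weight updates. A tiny self-contained illustration (the index 7 and the random prediction are made up):

import torch

vocab_size = 1000                                         # illustrative
y_t = torch.softmax(torch.randn(vocab_size), dim=0)      # stand-in for the model's prediction
desired_t = torch.zeros(vocab_size); desired_t[7] = 1.0  # one-hot of the true next word
error_t = desired_t - y_t                                 # equation (5), backpropagated through V and U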

Optimization

Preprocessing of the training corpus: all words whose frequency in the training text falls below a threshold are merged into a single special rare token.

$$P(w^i_{t+1} \mid w_t, s_{t-1}) = \begin{cases} \dfrac{y_t^{rare}}{C_{rare}} & \text{if } w^i_{t+1} \text{ is rare} \\[2ex] y_t^i & \text{otherwise} \end{cases}$$
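A hedged sketch of how this case split could be applied when reading a word's probability off the output distribution ($C_{rare}$ is the number of words merged into the rare token; the function and variable names below are illustrative, not from the original implementation):

def next_word_prob(y_t, word_id, rare_ids, rare_token_id):
    """Probability of word_id given the output distribution y_t, where all rare
    words share one <rare> output unit whose mass is split uniformly among them."""
    if word_id in rare_ids:
        C_rare = len(rare_ids)                 # number of words merged into <rare>
        return y_t[rare_token_id] / C_rare     # y_t^rare / C_rare
    return y_t[word_id]                        # y_t^i otherwise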

Model Implementation: PyTorch

  • The model is not a 100% faithful reproduction of RNNLM; for example, the word representation uses the now-common Embedding layer instead of one-hot input vectors.
  • The recurrent connection is implemented in two variants: RNNCell and GRUCell.
  • GRUCell (the recurrence it implements is written out after this list):
    1. The GRU has two gates: a reset gate and an update gate. The reset gate decides how much of the previous hidden state to forget, and the update gate decides how much of it to keep.
    2. The GRU keeps a single hidden state vector per step; unlike the LSTM, there is no separate cell state.
    3. The GRU is structurally simpler, involving only one hidden state vector and two gate vectors, so it is cheaper to compute.
    4. Experiments show that, with the same configuration, the GRU outperforms the vanilla RNN, especially on tasks with long sequences.
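For reference, the recurrence computed by the GRUCell code below can be written compactly as follows (a restatement of the code, not taken from the RNNLM paper; biases are omitted, $[\cdot;\cdot]$ denotes concatenation and $\odot$ element-wise multiplication):

$$r_t = \sigma(W_r\,[x_t; h_{t-1}])$$
$$z_t = \sigma(W_z\,[x_t; h_{t-1}])$$
$$\tilde{h}_t = \tanh(W_h\,[x_t;\, r_t \odot h_{t-1}])$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$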
import os
import time
import pandas as pd
from dataclasses import dataclass

import torch
import torch.nn as nn
from torch.nn import functional as F
from torch.utils.data import Dataset
from torch.utils.data.dataloader import DataLoader
from torch.utils.tensorboard import SummaryWriter
# model hyperparameters
@dataclass
class ModelConfig:
    vocab_size: int = None
    n_embed: int = None
    n_hidden: int = None

RNNcell

[Figure: an RNN cell A unrolled over timesteps t-1 and t — at each step the cell takes x_t and h_{t-1} and produces h_t]
class RNNCell(nn.Module):
    """
    The job of a 'Cell' is to: take input at current time step x_{t} and the
    hidden state at the previous time step h_{t-1} and return the resulting
    hidden state h_{t} at the current timestep.
    """
    def __init__(self, config):
        super().__init__()
        self.xh_to_h = nn.Linear(config.n_embed + config.n_hidden, config.n_hidden)

    def forward(self, xt, hprev):
        xh = torch.cat([xt, hprev], dim=1)
        ht = F.tanh(self.xh_to_h(xh))
        return ht
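A quick smoke test of the cell on random tensors (the batch size and dimensions below are arbitrary, chosen only to check shapes):

cfg = ModelConfig(vocab_size=100, n_embed=64, n_hidden=128)
cell = RNNCell(cfg)
xt = torch.randn(8, cfg.n_embed)       # a batch of 8 embedded tokens
hprev = torch.zeros(8, cfg.n_hidden)   # initial hidden state
ht = cell(xt, hprev)                   # -> shape (8, 128)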

GRUcell

[Figure: GRU cell — the reset gate r_t and update gate z_t are sigmoids over the concatenation [x_t; h_{t-1}] (weights Wr, Wz); the candidate state h't is a tanh over [x_t; r_t * h_{t-1}] (weights Wh); the output is h_t = (1 - z_t) * h_{t-1} + z_t * h't]
class GRUCell(nn.Module):
    """
    Same job as the RNN cell, but a bit more complicated recurrence formula
    that makes the GRU more expressive and easier to optimize.
    """
    def __init__(self, config):
        super().__init__()
        # update gate (z), reset gate (r) and candidate state (hbar) projections
        self.xh_to_z = nn.Linear(config.n_embed + config.n_hidden, config.n_hidden)
        self.xh_to_r = nn.Linear(config.n_embed + config.n_hidden, config.n_hidden)
        self.xh_to_hbar = nn.Linear(config.n_embed + config.n_hidden, config.n_hidden)

    def forward(self, xt, hprev):
        # first use the reset gate to wipe some channels of the hidden state to zero
        xh = torch.cat([xt, hprev], dim=1)
        r = F.sigmoid(self.xh_to_r(xh))
        hprev_reset = r * hprev
        # calculate the candidate new hidden state hbar
        xhr = torch.cat([xt, hprev_reset], dim=1)
        hbar = F.tanh(self.xh_to_hbar(xhr))
        # calculate the switch gate that determines if each channel should be updated at all
        z = F.sigmoid(self.xh_to_z(xh))
        # blend the previous hidden state and the new candidate hidden state
        ht = (1 - z) * hprev + z * hbar
        return ht
class RNN(nn.Module):
    def __init__(self, config, cell_type):
        super().__init__()
        self.vocab_size = config.vocab_size
        self.start = nn.Parameter(torch.zeros(1, config.n_hidden)) # the starting hidden state
        self.wte = nn.Embedding(config.vocab_size, config.n_embed) # token embeddings table
        if cell_type == 'rnn':
            self.cell = RNNCell(config)
        elif cell_type == 'gru':
            self.cell = GRUCell(config)
        self.lm_head = nn.Linear(config.n_hidden, self.vocab_size)

    def forward(self, idx, targets=None):
        device = idx.device
        b, t = idx.size()
        # embed all the integers up front and all at once for efficiency
        emb = self.wte(idx) # (b, t, n_embed)
        # sequentially iterate over the inputs and update the RNN state each tick
        hprev = self.start.expand((b, -1)) # expand out the batch dimension
        hiddens = []
        for i in range(t):
            xt = emb[:, i, :] # (b, n_embed)
            ht = self.cell(xt, hprev) # (b, n_hidden)
            hprev = ht
            hiddens.append(ht)
        # decode the outputs
        hidden = torch.stack(hiddens, 1) # (b, t, n_hidden)
        logits = self.lm_head(hidden)
        # if we are given some desired targets also calculate the loss
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)
        return logits, loss
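A quick shape check on the assembled model (the sizes are arbitrary; this only verifies that the forward pass and the loss computation run):

cfg = ModelConfig(vocab_size=100, n_embed=64, n_hidden=128)
model = RNN(cfg, cell_type='gru')
idx = torch.randint(0, cfg.vocab_size, (4, 10))      # batch of 4 sequences of length 10
targets = torch.randint(0, cfg.vocab_size, (4, 10))
logits, loss = model(idx, targets)                   # logits: (4, 10, 100), loss: scalar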

Test Data

The dataset is a Chinese food-delivery review dataset (waimai_10k) with 10k+ samples:

data = pd.read_csv('./dataset/waimai_10k.csv')
data.dropna(subset='review',inplace=True)
data['review_length'] = data.review.apply(lambda x:len(x))
data.sample(5)
      label  review                                                       review_length
2062      1  价格实惠,值得购买。                                           10
2372      1  很好吃,奶茶还好,天很冷,还没有凉                               17
6399      0  这么好的店、宫保鸡丁竟然拿土豆充当鸡肉!20多的菜,至于吗?         30
3147      1  挺好的,豆浆很好喝~~~                                          12
1248      1  好吃,真的是大肘子肉                                           10

Corpus statistics:

data = data[data.review_length <= 50] # filter out reviews longer than 50 characters
words = data.review.tolist()
chars = sorted(list(set(''.join(words))))    
max_word_length = max(len(w) for w in words)
print(f"number of examples: {len(words)}")
print(f"max word length: {max_word_length}")
print(f"size of vocabulary: {len(chars)}")
number of examples: 10796
max word length: 50
size of vocabulary: 2272

Splitting into Training / Test Sets

test_set_size = min(1000, int(len(words) * 0.1)) 
rp = torch.randperm(len(words)).tolist()
train_words = [words[i] for i in rp[:-test_set_size]]
test_words = [words[i] for i in rp[-test_set_size:]]
print(f"split up the dataset into {len(train_words)} training examples and {len(test_words)} test examples")
split up the dataset into 9796 training examples and 1000 test examples

Building the Character Dataset [tensor]

  • <BLANK> : 0
  • token seqs : [1, 2, 3, 4, 5, 6]
  • x : [0, 1, 2, 3, 4, 5, 6]
  • y : [1, 2, 3, 4, 5, 6, 0]
class CharDataset(Dataset):
    def __init__(self, words, chars, max_word_length):
        self.words = words
        self.chars = chars
        self.max_word_length = max_word_length
        # char --> index --> char
        self.char2i = {ch: i + 1 for i, ch in enumerate(chars)}
        self.i2char = {i: s for s, i in self.char2i.items()}

    def __len__(self):
        return len(self.words)

    def contains(self, word):
        return word in self.words

    def get_vocab_size(self):
        return len(self.chars) + 1

    def get_output_length(self):
        return self.max_word_length + 1

    def encode(self, word):
        # char sequence ---> index sequence
        ix = torch.tensor([self.char2i[w] for w in word], dtype=torch.long)
        return ix

    def decode(self, ix):
        # index sequence ---> char sequence
        word = ''.join(self.i2char[i] for i in ix)
        return word

    def __getitem__(self, idx):
        word = self.words[idx]
        ix = self.encode(word)
        x = torch.zeros(self.max_word_length + 1, dtype=torch.long)
        y = torch.zeros(self.max_word_length + 1, dtype=torch.long)
        x[1:1+len(ix)] = ix
        y[:len(ix)] = ix
        y[len(ix)+1:] = -1 # index -1 will mask the loss
        return x, y
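A tiny illustration of the padding / shifting scheme listed above, using a made-up two-character vocabulary (the example word and characters are hypothetical):

ds = CharDataset(['好吃'], ['吃', '好'], max_word_length=4)
x, y = ds[0]
# x: tensor([0, 2, 1, 0, 0])   -> leading <BLANK>, then the encoded chars, zero-padded
# y: tensor([2, 1, 0, -1, -1]) -> x shifted left; the -1 positions are masked in the loss
print(ds.encode('好吃'), ds.decode([2, 1]))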

Data Loader [DataLoader]

class InfiniteDataLoader:
    def __init__(self, dataset, **kwargs):
        train_sampler = torch.utils.data.RandomSampler(dataset, replacement=True, num_samples=int(1e10))
        self.train_loader = DataLoader(dataset, sampler=train_sampler, **kwargs)
        self.data_iter = iter(self.train_loader)

    def next(self):
        try:
            batch = next(self.data_iter)
        except StopIteration: # this will technically only happen after 1e10 samples... (i.e. basically never)
            self.data_iter = iter(self.train_loader)
            batch = next(self.data_iter)
        return batch

Training the Model

# model evaluation
@torch.inference_mode()
def evaluate(model, dataset, batch_size=10, max_batches=None):
    model.eval()
    loader = DataLoader(dataset, shuffle=True, batch_size=batch_size, num_workers=0)
    losses = []
    for i, batch in enumerate(loader):
        batch = [t.to('cuda') for t in batch]
        X, Y = batch
        logits, loss = model(X, Y)
        losses.append(loss.item())
        if max_batches is not None and i >= max_batches:
            break
    mean_loss = torch.tensor(losses).mean().item()
    model.train() # reset model back to training mode
    return mean_loss

Environment initialization:

torch.manual_seed(seed=12345)
torch.cuda.manual_seed_all(seed=12345)

work_dir = "./Rnn_log"
os.makedirs(work_dir, exist_ok=True)
writer = SummaryWriter(log_dir=work_dir)

Model initialization:

config = ModelConfig(vocab_size=len(chars)+1, n_embed=64, n_hidden=128)
# model = RNN(config, cell_type='rnn')
model = RNN(config, cell_type='gru')
model.to('cuda')
RNN(
  (wte): Embedding(2273, 64)
  (cell): GRUCell(
    (xh_to_z): Linear(in_features=192, out_features=128, bias=True)
    (xh_to_r): Linear(in_features=192, out_features=128, bias=True)
    (xh_to_hbar): Linear(in_features=192, out_features=128, bias=True)
  )
  (lm_head): Linear(in_features=128, out_features=2273, bias=True)
)

Data initialization:

train_dataset = CharDataset(train_words, chars, max_word_length)
test_dataset = CharDataset(test_words, chars, max_word_length)

train_dataset[0][0].shape, train_dataset[0][1].shape
(torch.Size([51]), torch.Size([51]))

Training:

# init optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.01, betas=(0.9, 0.99), eps=1e-8)
# init dataloader
batch_loader = InfiniteDataLoader(train_dataset, batch_size=128, pin_memory=True, num_workers=4)

# training loop
best_loss = None
step = 0
train_losses, test_losses = [],[]
while True:
    t0 = time.time()
    # get the next batch, ship to device, and unpack it to input and target
    batch = batch_loader.next()
    batch = [t.to('cuda') for t in batch]
    X, Y = batch
    # feed into the model
    logits, loss = model(X, Y)
    # calculate the gradient, update the weights
    model.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    # wait for all CUDA work on the GPU to finish then calculate iteration time taken
    torch.cuda.synchronize()
    t1 = time.time()
    # logging
    if step % 1000 == 0:
        print(f"step {step} | loss {loss.item():.4f} | step time {(t1-t0)*1000:.2f}ms")
    # evaluate the model
    if step > 0 and step % 100 == 0:
        train_loss = evaluate(model, train_dataset, batch_size=100, max_batches=10)
        test_loss  = evaluate(model, test_dataset,  batch_size=100, max_batches=10)
        train_losses.append(train_loss)
        test_losses.append(test_loss)
        # save the model to disk if it has improved
        if best_loss is None or test_loss < best_loss:
            out_path = os.path.join(work_dir, "model.pt")
            print(f"test loss {test_loss} is the best so far, saving model to {out_path}")
            torch.save(model.state_dict(), out_path)
            best_loss = test_loss
    step += 1
    # termination conditions
    if step > 10100:
        break
step 0 | loss 7.7387 | step time 84.71ms
test loss 5.455846786499023 is the best so far, saving model to ./Rnn_log/model.pt
test loss 5.085928916931152 is the best so far, saving model to ./Rnn_log/model.pt
test loss 4.722366809844971 is the best so far, saving model to ./Rnn_log/model.pt
test loss 4.451460361480713 is the best so far, saving model to ./Rnn_log/model.pt
test loss 4.261294364929199 is the best so far, saving model to ./Rnn_log/model.pt
test loss 4.121057987213135 is the best so far, saving model to ./Rnn_log/model.pt
test loss 4.0212507247924805 is the best so far, saving model to ./Rnn_log/model.pt
test loss 3.935884475708008 is the best so far, saving model to ./Rnn_log/model.pt
test loss 3.87166166305542 is the best so far, saving model to ./Rnn_log/model.pt
step 1000 | loss 3.7037 | step time 66.99ms
.......
test loss 3.476886749267578 is the best so far, saving model to ./Rnn_log/model.pt
step 4000 | loss 2.9470 | step time 57.79ms
step 5000 | loss 2.8236 | step time 60.15ms
step 6000 | loss 2.7413 | step time 60.07ms
step 7000 | loss 2.6398 | step time 58.10ms
step 8000 | loss 2.5385 | step time 58.41ms
step 9000 | loss 2.3928 | step time 58.49ms
step 10000 | loss 2.2889 | step time 57.82ms

RNNLM vs GRULM

[Figure: RNNLM vs GRULM comparison]


Test: Review Generator

@torch.no_grad()
def generate(model, idx, max_new_tokens, temperature=1.0, do_sample=False, top_k=None):
    for _ in range(max_new_tokens):
        # forward the model to get the logits for the index in the sequence
        logits, _ = model(idx)
        # pluck the logits at the final step and scale by desired temperature
        logits = logits[:, -1, :] / temperature
        # optionally crop the logits to only the top k options
        if top_k is not None:
            v, _ = torch.topk(logits, top_k)
            logits[logits < v[:, [-1]]] = -float('Inf')
        # apply softmax to convert logits to (normalized) probabilities
        probs = F.softmax(logits, dim=-1)
        # either sample from the distribution or take the most likely element
        if do_sample:
            idx_next = torch.multinomial(probs, num_samples=1)
        else:
            _, idx_next = torch.topk(probs, k=1, dim=-1)
        # append sampled index to the running sequence and continue
        idx = torch.cat((idx, idx_next), dim=-1)
    return idx
def print_samples(num=13):
    # initial 0 tokens
    X_init = torch.zeros((num, 1), dtype=torch.long).to('cuda')
    steps = train_dataset.get_output_length() - 1 # -1 because we already start with <START> token (index 0)
    X_samp = generate(model, X_init, steps, top_k=None, do_sample=True).to('cuda')
    new_samples = []
    for i in range(X_samp.size(0)):
        # get the i'th row of sampled integers, as python list
        row = X_samp[i, 1:].tolist() # note: we need to crop out the first <START> token
        # token 0 is the <END> token, so we crop the output sequence at that point
        crop_index = row.index(0) if 0 in row else len(row)
        row = row[:crop_index]
        word_samp = train_dataset.decode(row)
        new_samples.append(word_samp)
    return new_samples
print_samples(num=10)
['不好吃,肥肉煎饼!不松心了!','山药有两次,不过小蛋鱼还不错,肉里的不筋道少','草面不值的煎饼','菜给的不错,服务好,速度快,来了!绝对的是辣','速度很快,不贴心吧','好吃,就是量少,味道不怎么样啊','菜品很喜欢,百度骑士特别棒!','金针菇汉堡我喜欢,面好大馅鲜,很好吃,毕竟糊所有菜品不如以前的菜饼。','蛮生的。送过小哥快被味道真差。','巨好吃~!!']
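If sampling in a fresh session rather than right after the training loop, the best checkpoint saved above can be restored first; a minimal sketch, assuming the same config and work_dir as in the training section:

model = RNN(config, cell_type='gru')
model.load_state_dict(torch.load(os.path.join(work_dir, 'model.pt')))
model.to('cuda')
model.eval()
print(print_samples(num=5))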
