Machine Translation Notes

Neural machine translation systems (Chinese-English and Vietnamese-English)

Download the https://github.com/tensorflow/nmt project, or use the copy at D:\Deep-Learning-21-Examples-master\chapter_16\nmt directly (the version from 25 August 2017, which is not compatible with tf 1.8).
Download the Vietnamese-English parallel corpus from https://nlp.stanford.edu/projects/nmt/data/iwslt15.en-vi/ and save it to chapter_16\nmt_data. The training set has roughly 130,000 sentence pairs, the validation set a little over 1,500, and the test set a little over 1,200.
Change into chapter_16, create chapter_16\nmt_model, and run:
python -m nmt.nmt
--src=vi --tgt=en --vocab_prefix=nmt_data/vocab --train_prefix=nmt_data/train
--dev_prefix=nmt_data/tst2012 --test_prefix=nmt_data/tst2013 --out_dir=nmt_model
--num_train_steps=12000 --steps_per_stats=100 --num_layers=2 --num_units=128
--dropout=0.2 --metrics=bleu
On Windows this fails with UnicodeEncodeError: 'gbk' codec can't encode character. Adding sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf8') at the top of the file fixes it (the file also has to import io and sys).
Commenting out print(out_s, end="", file=sys.stdout) works too, but a lot of log output is lost.
(https://blog.csdn.net/u013066244/article/details/53057411)
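The stdout fix above can be tried in isolation. This is only a minimal sketch: utf8_stdout is a helper name made up for this note (it is not part of nmt), and the in-memory stream stands in for a Windows console that defaults to gbk.

```python
import io

def utf8_stdout(stream):
    """Wrap a text stream's underlying byte buffer so that writes are encoded as UTF-8."""
    return io.TextIOWrapper(stream.buffer, encoding="utf8", line_buffering=True)

# Stand-in for a console whose default codec is gbk (assumption for the demo).
raw = io.BytesIO()
gbk_stream = io.TextIOWrapper(raw, encoding="gbk")

# Re-wrap the same byte buffer with UTF-8 and write Vietnamese text through it.
utf8_stream = utf8_stdout(gbk_stream)
utf8_stream.write("Xin chào")
utf8_stream.flush()
encoded = raw.getvalue()  # the bytes that actually reached the buffer
```

In a real script the equivalent would be sys.stdout = utf8_stdout(sys.stdout), placed before anything is printed.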
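Because CPU usage and memory consumption were too high when training locally, the rest of the training uses the free GPU on Google Colab.
Steps: get past the firewall, sign in with a Gmail account, upload the zipped nmt project to Google Drive (for example into the mount directory), and mount the shared directory (https://blog.csdn.net/cocoaqin/article/details/79184540).
Note: Colab's tf version is 1.10, installed under /usr/local/lib/python3.6/dist-packages.
Run !unzip nmt.zip to extract the archive, change into mount/nmt, and run the training command above. step-time is about 0.2s, roughly 5 times faster than my own CPU, and wps is about 30k, also roughly 5 times faster.
At around step 5400 the run crashed with DataLossError (see above for traceback): file is too short to be an sstable, [[Node: save/RestoreV2 = RestoreV2 ... tensorflow.python.framework.errors_impl.DataLossError: file is too short to be an sstable ... Fortunately, training can be resumed afterwards.
Final results: the validation set reached bleu 5.9 and ppl 32.41, the test set bleu 5.1 and ppl 36.37. The last training log line was step 12000 lr 1 step-time 0.20s wps 30.35K ppl 34.68 gN 3.31 bleu 5.87. The best_bleu directory automatically keeps the checkpoints with the best bleu scores, such as 11900.
Running !tensorboard --logdir nmt_model starts tensorboard (it prints a URL such as http://fac4c1842001:6006), but that URL cannot be reached. The hparams file stores the hyperparameters used for training. The best_bleu directory can be downloaded and viewed on a local machine, but it only shows dev_bleu, dev_ppl, test_bleu and test_ppl, each with a single point for the final step.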
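Note: the vocabulary has 3 special tokens, <unk>, <s> and </s>, which stand for a rare (unknown) word, the start of a sentence and the end of a sentence (the start and end markers are added to sentences automatically).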
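A small sketch of what the three special tokens do during preprocessing. build_vocab and encode are hypothetical helpers written for this note, not nmt functions, and the exact way nmt attaches <s> and </s> in its input pipeline is not reproduced here.

```python
UNK, SOS, EOS = "<unk>", "<s>", "</s>"

def build_vocab(words):
    """Map each token to an integer id, with the three special tokens first."""
    vocab = {}
    for token in [UNK, SOS, EOS] + list(words):
        vocab.setdefault(token, len(vocab))
    return vocab

def encode(sentence, vocab):
    """Wrap a whitespace-tokenized sentence in <s> ... </s> and map it to ids.

    Words that are not in the vocabulary fall back to the id of <unk>.
    """
    tokens = [SOS] + sentence.split() + [EOS]
    return [vocab.get(token, vocab[UNK]) for token in tokens]

vocab = build_vocab(["tôi", "là", "sinh", "viên"])
ids = encode("tôi là giáo viên", vocab)  # "giáo" is out of vocabulary
```

Every out-of-vocabulary word collapses to the same id, which is why a small vocabulary shows up as many <unk> tokens in the translated output.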
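Create my_infer_file.vi containing Vietnamese text (it can be copied straight from tst2013.vi), upload it to the nmt_testdata directory on the cloud drive, and run:
!python -m nmt.nmt --out_dir=nmt_model
--inference_input_file=nmt_testdata/my_infer_file.vi
--inference_output_file=nmt_model/output_infer
Inference is fast, under a minute, and uses nmt_model/translate.ckpt-12000.
Download output_infer and open it to check the result. The translation quality is not good enough: there are more than 3,000 <unk> tokens.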
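Vietnamese-English is the example from the official GitHub repository; next comes Chinese-English translation. The dataset is provided by NiuTrans (100,000 training pairs, 1,000 for the test and validation sets), and the Chinese side is already word-segmented. Upload the data to nmt_data_ch_en on the cloud drive and create an nmt_model_ch_en directory there as well.
Run:
python -m nmt.nmt --src=en --tgt=zh --attention=scaled_luong
--vocab_prefix=nmt_data_ch_en/vocab --train_prefix=nmt_data_ch_en/train
--dev_prefix=nmt_data_ch_en/dev --test_prefix=nmt_data_ch_en/test
--out_dir=nmt_model_ch_en
--num_train_steps=200000 --steps_per_stats=100 --num_layers=2
--num_units=128 --dropout=0.2 --metrics=bleu
Training crashed at step 18000, so I went straight to testing.
Run:
!python -m nmt.nmt --out_dir=nmt_model_ch_en
--inference_input_file=nmt_testdata/my_infer_file.en
--inference_output_file=nmt_model_ch_en/output_infer
This fails with OutOfRangeError (see above for traceback): Read less bytes than requested. Downloading the checkpoint shows that it is a temporary file and that only the last checkpoint exists, which is probably related to the interrupted training.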
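Studying the nmt source code:
The main TensorFlow APIs it relies on include tf.contrib.seq2seq.AttentionWrapper, tf.contrib.seq2seq.LuongAttention, tf.contrib.seq2seq.BahdanauAttention, tf.contrib.seq2seq.BasicDecoder, tf.contrib.seq2seq.dynamic_decode, tf.nn.dynamic_rnn, tf.nn.bidirectional_dynamic_rnn, tf.nn.embedding_lookup and tf.contrib.seq2seq.TrainingHelper.
Open question: Wikipedia says "BLEU's output is always a number between 0 and 1", but the nmt project multiplies the final value by 100, so its scores are simply reported on a 0 to 100 scale. BLEU builds on n-gram matching: 1-gram matches reflect adequacy and longer n-gram matches reflect fluency. It also penalizes translations that are too short.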

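Key points about Colaboratory and Google Drive:
9. The verification string has to be entered twice and is different each time; follow the prompts and press Enter after each one.
10. Linking Colaboratory and Google Drive is simple; the two applications can access and jump to each other (they seem to be linked by default).
11. !mkdir -p drive creates a directory in Colab's current working directory to serve as the mount point, and !google-drive-ocamlfuse drive -o nonempty mounts the root of Google Drive onto that drive directory. After that, all files and directories on Google Drive are visible under drive (ls drive/), and files on Google Drive can be deleted directly with rm -f.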
https://redstonewill.com/2014/
https://zhuanlan.zhihu.com/p/57759598
https://blog.csdn.net/angus_monroe/article/details/79542843

Machine translation with seq2seq

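seq2seq is two RNNs (or LSTMs) combined, i.e. an N:1 network followed by a 1:N network. A single RNN cannot handle inputs and outputs of different lengths (N:N requires equal lengths, and N:1 or 1:N fixes one side at 1, whereas translation needs N:M), so the natural idea is to chain two RNNs: the first encodes the input sequence into a fixed-length vector (taken from its final hidden state), and the second maps that vector to the output sequence. In machine translation, several word vectors go in, the intermediate vector represents the whole sentence (somewhat like doc2vec), and that vector is mapped to word vectors in the other language. Can these three kinds of vectors have different dimensions? The input LSTM and the output LSTM are separate and do not share parameters, which buys better training for a negligible increase in computation. The output LSTM can be viewed as a generative model.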
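To make the N:M idea concrete, here is a toy NumPy sketch with plain tanh RNN cells instead of LSTMs and random, untrained weights. All sizes and names (EMB, HID, VOCAB, init, encode, decode) are assumptions made for illustration; it only shows the shapes and the data flow, not a working translator.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB, HID, VOCAB = 8, 16, 10  # illustrative sizes, not taken from the notes

def init(*shape):
    """Small random weights; a real model would learn these."""
    return rng.normal(scale=0.1, size=shape)

# The encoder and the decoder have separate parameters (nothing is shared).
enc_Wx, enc_Wh = init(EMB, HID), init(HID, HID)
dec_Wx, dec_Wh, dec_Wo = init(EMB, HID), init(HID, HID), init(HID, VOCAB)
tgt_embedding = init(VOCAB, EMB)

def encode(src_vectors):
    """Run the encoder RNN over N input vectors and return its final hidden state."""
    h = np.zeros(HID)
    for x in src_vectors:
        h = np.tanh(x @ enc_Wx + h @ enc_Wh)
    return h

def decode(context, num_steps, start_id=1):
    """Unroll the decoder RNN for M steps, feeding each predicted token id back in."""
    h, token, output_ids = context, start_id, []
    for _ in range(num_steps):
        h = np.tanh(tgt_embedding[token] @ dec_Wx + h @ dec_Wh)
        token = int(np.argmax(h @ dec_Wo))
        output_ids.append(token)
    return output_ids

source = rng.normal(size=(5, EMB))           # N = 5 source word vectors
context = encode(source)                     # one fixed-length vector of size HID
translation = decode(context, num_steps=3)   # M = 3 target token ids
```

In this sketch the word-vector size (EMB) and the sentence-vector size (HID) are independent choices, which suggests the dimensions do not have to match as long as the weight matrices connect them.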
http://karpathy.github.io/2015/05/21/rnn-effectiveness/ (RNN applications)
https://www.youtube.com/watch?v=WCUNPb-5EYI (very good)
https://www.youtube.com/watch?v=QciIcRxJvsM (very good)
https://www.youtube.com/watch?v=8HyCNIVRbSU (excellent)
http://colah.github.io/posts/2015-08-Understanding-LSTMs/ (the LSTM article)
https://www.jianshu.com/p/9dc9f41f0b29 (Chinese translation of the LSTM article)
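The seq2seq training objective is to maximize the mean log conditional probability log P(T|S) over the training pairs; see section 3.2 of the original paper and equation 4 of another paper.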
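Written out, the objective on the previous line looks roughly as follows. This is a sketch in the notation I recall from the original seq2seq paper, where \mathcal{S} is the training set of (target T, source S) pairs; the second line is the corresponding decoding rule once training is done.

```latex
\[
\frac{1}{|\mathcal{S}|} \sum_{(T,S) \in \mathcal{S}} \log p(T \mid S)
\]
\[
\hat{T} = \arg\max_{T} \; p(T \mid S)
\]
```

The arg max cannot be computed exactly over all possible sentences, which is why approximate decoders such as greedy or beam search are used in practice.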
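Using multi-layer LSTMs and reversing the source sentence improves BLEU and lowers perplexity. (Other tricks: make the sentences in a minibatch roughly equal in length; parallelize training across 8 GPUs, placing each LSTM layer on its own GPU and using the other 4 for the softmax.)
Long sentences can be handled with an attention mechanism, and reversing the source sentence also helps. With attention, the decoder gets a different context vector c at every step (a weighted sum of the encoder states), and the weights a_ij are produced by a learned alignment model.
seq2seq is also called Encoder–Decoder (encode into a fixed vector first, then decode). The Encoder–Decoder paper introduced and used the GRU.
Can an Encoder–Decoder also learn word vectors or phrase representations? See section 4.4 and figure 4 of the original paper.
The attention mechanism is in effect an alignment model.
What do Global Attention and Local Attention mean in LuongAttention? Do they show up in tf?
(The difference between Global Attention and Local Attention is whether the "attention" is placed on all source positions or on only a few source positions.)
(Global Attention is close in spirit to soft attention, and Local Attention is a blend of hard and soft attention. Soft attention spreads weight over all regions at once, while hard attention picks a single region at a time.)
(Local Attention is cheaper to compute than Global Attention, and the Luong paper reports somewhat better results with it.)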
BLEU (Bilingual Evaluation Understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another.
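The closer the output is to a professional human translation, the better. BLEU is based on n-gram matching: it measures the proportion of n-grams in the candidate translation that also appear in the reference translation.
https://www.youtube.com/watch?v=DejHQYAGb7Q&list=PLkDaE6sCZn6F6wUI9tvS_Gw1vaFAx6rd6 (Andrew Ng on BLEU)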
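The n-gram matching and the short-sentence penalty can be sketched in a few lines of Python. This is a simplified single-sentence, single-reference version with no smoothing, written for this note; it is not the compute_bleu implementation that nmt ships, and the function names are my own.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_order=4):
    """Single-reference BLEU in [0, 1].

    Geometric mean of clipped n-gram precisions (n = 1..max_order),
    multiplied by a brevity penalty for candidates shorter than the reference.
    """
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_order + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum((cand & ref).values())  # matches, clipped by reference counts
        total = sum(cand.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing: any empty precision zeroes the score
        log_precisions.append(math.log(overlap / total))
    geo_mean = math.exp(sum(log_precisions) / max_order)
    if len(candidate) > len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    return brevity_penalty * geo_mean

reference = "the cat sat on the mat".split()
perfect = bleu(reference, reference)
too_short = bleu("the cat sat on".split(), reference)
```

Multiplying the returned value by 100 gives the 0 to 100 scale that the nmt logs use.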

A second look at the nmt source code:

model.py defines the BaseModel and Model classes; Model is a subclass of BaseModel. BaseModel represents a sequence-to-sequence model and has three modes: TRAIN | EVAL | INFER. The iterator must be an instance of the BatchedInput class from iterator_utils. Variable initializers come from get_initializer in model_helper.py. The main methods of BaseModel are init_embeddings, train, eval, build_graph, _build_encoder_cell, _build_decoder, get_max_time, _build_decoder_cell, _build_encoder, _compute_loss, infer and decode. init_embeddings is implemented with model_helper's create_emb_for_encoder_and_decoder, which lets the encoder and decoder optionally share one embedding. build_graph calls _build_encoder, _build_decoder and _compute_loss. _build_decoder works on the output of _build_encoder and calls _build_decoder_cell, as well as tf.nn.embedding_lookup, tf.contrib.seq2seq.TrainingHelper, tf.contrib.seq2seq.BasicDecoder, tf.contrib.seq2seq.dynamic_decode, tf.contrib.seq2seq.BeamSearchDecoder and tf.contrib.seq2seq.GreedyEmbeddingHelper. _compute_loss calls tf.nn.sparse_softmax_cross_entropy_with_logits.
In TRAIN mode the model is trained with the tf.train.GradientDescentOptimizer or tf.train.AdamOptimizer optimizer, calling tf.gradients, model_helper.gradient_clip and opt.apply_gradients.
_build_encoder_cell is implemented with model_helper.create_rnn_cell (which in turn calls tf.contrib.rnn.MultiRNNCell).

The Model class overrides _build_encoder and _build_decoder_cell. In _build_encoder, encoder_type is either uni or bi, i.e. a unidirectional or a bidirectional RNN; the latter relies on _build_bidirectional_rnn (the two are implemented with tf.nn.dynamic_rnn and tf.nn.bidirectional_dynamic_rnn respectively).
_build_decoder_cell relies on model_helper.create_rnn_cell and tf.contrib.seq2seq.tile_batch.
The dependency order of the main files is roughly: tf, np and other library APIs ==> the scripts modules ==> the utils modules ==> model_helper.py ==> model.py ==> attention_model.py ==> gnmt_model.py ==> inference.py ==> train.py ==> nmt.py
inference.py defines the InferModel class and the functions create_infer_model, load_data, inference, single_worker_inference and multi_worker_inference. The public entry point is inference, which can use three models: nmt_model.Model, attention_model.AttentionModel and gnmt_model.GNMTModel.
attention_model.py defines the AttentionModel class (a subclass of Model), which implements an attention-based decoder following "Effective Approaches to Attention-based Neural Machine Translation".
It overrides _build_decoder_cell; the create_attention_mechanism function it calls uses the two attention types tf.contrib.seq2seq.LuongAttention and tf.contrib.seq2seq.BahdanauAttention.
The cell and the attention_mechanism are then wrapped with tf.contrib.seq2seq.AttentionWrapper, and the result is wrapped with tf.contrib.rnn.DeviceWrapper.
gnmt_model.py defines the GNMTModel class (a subclass of AttentionModel) and the GNMTAttentionMultiCell class (a subclass of tf.nn.rnn_cell.MultiRNNCell). The public class is GNMTModel.
It overrides _build_decoder_cell. The biggest difference from the parent class is that it uses GNMTAttentionMultiCell as the cell instead of using MultiRNNCell directly.
train.py defines the TrainModel and EvalModel classes and the train function; create_train_model returns a TrainModel and create_eval_model returns an EvalModel (these two functions are the counterparts of inference.create_infer_model).
(train is the public entry point and drives train_model, eval_model and infer_model together; the model can be Model, AttentionModel or GNMTModel.)
(The evaluation metrics are perplexity, BLEU and ROUGE scores, computed with model_helper.compute_perplexity and nmt_utils.decode_and_evaluate.)
(nmt_utils.decode_and_evaluate calls evaluation_utils.evaluate, which calls _bleu, _rouge and _accuracy, which finally call compute_bleu in scripts\bleu.py and rouge in scripts\rouge.py.)
nmt.py defines main, which calls run_main with inference.inference and train.train as arguments; whether flags.inference_input_file is set decides whether train or inference runs.

What does time_major=True mean?
time_major determines what the first two dimensions of the inputs Tensor represent:
time_major=False: [batch_size, sequence_length, embedding_size]
time_major=True: [sequence_length, batch_size, embedding_size]
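The two layouts hold the same numbers with the first two axes swapped. A small NumPy sketch (the sizes are arbitrary example values):

```python
import numpy as np

batch_size, sequence_length, embedding_size = 2, 3, 4

# Batch-major layout: [batch_size, sequence_length, embedding_size]
batch_major = np.arange(batch_size * sequence_length * embedding_size).reshape(
    batch_size, sequence_length, embedding_size)

# Time-major layout: swap the first two axes; the embedding axis stays last.
time_major = np.transpose(batch_major, (1, 0, 2))

# The vector for sentence 0 at time step 1 is the same in both layouts,
# only the order of the two leading indices differs.
same_vector = np.array_equal(batch_major[0, 1], time_major[1, 0])
```

In the time-major layout, indexing the first axis gives all sentences at one time step, which is the slice an RNN consumes at each step.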

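Does training an RNN need gradient clipping? Yes, and that holds for LSTMs too. The LSTM's gated, additive cell-state update mitigates vanishing gradients, but it does not prevent exploding gradients (tanh bounds the activations, not the gradients), so clipping is still standard practice; the nmt code itself calls model_helper.gradient_clip in TRAIN mode.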
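A minimal NumPy sketch of clipping by global norm, the rule that tf.clip_by_global_norm applies. The function below is my own re-implementation for illustration, and the gradient values and thresholds are made up.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Scale all gradients by one factor so that their joint L2 norm is at most max_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # If global_norm <= max_norm the factor is 1 and nothing changes.
    scale = max_norm / max(global_norm, max_norm)
    return [g * scale for g in grads], global_norm

grads = [np.array([3.0, 4.0]), np.array([12.0])]    # joint norm is 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=5.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
untouched, _ = clip_by_global_norm(grads, max_norm=20.0)
```

Because every gradient is scaled by the same factor, the update direction is preserved and only its length is limited.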

To translate in the opposite direction (English to Vietnamese), change the arguments to --src=en --tgt=vi.

Decoding methods include greedy, sampling, and beam-search decoding.
(Can beam search and greedy be used at the same time? No. Beam search usually gives better translations than greedy search, at a higher decoding cost.)
(For beam search, see the paper Neural Machine Translation and Sequence-to-sequence Models: A Tutorial.)
(Beam search is similar to greedy search, but instead of considering only the one best hypothesis, we consider b best hypotheses at each time step, where b is the "width" of the beam.)
(The tf APIs for beam search are tf.contrib.seq2seq.tile_batch and tf.contrib.seq2seq.BeamSearchDecoder.)
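The difference between greedy and beam search shows up even on a toy model. The sketch below is plain Python and unrelated to how BeamSearchDecoder is implemented; TABLE is an invented next-token distribution chosen so that the greedy first choice leads to a worse complete sentence.

```python
import math

# Hypothetical next-token probabilities, keyed by the last token of the prefix.
TABLE = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"z": 0.9, "w": 0.1},
    "x": {"</s>": 1.0}, "y": {"</s>": 1.0},
    "z": {"</s>": 1.0}, "w": {"</s>": 1.0},
}

def toy_step(sequence):
    """Return log-probabilities of each possible next token for a prefix."""
    return {tok: math.log(p) for tok, p in TABLE[sequence[-1]].items()}

def beam_search(step_fn, start, end, beam_width, max_len):
    """Keep the beam_width best partial hypotheses per step; return (log_prob, tokens)."""
    beams = [(0.0, [start])]
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            for token, logp in step_fn(seq).items():
                candidates.append((score + logp, seq + [token]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, seq in candidates[:beam_width]:
            if seq[-1] == end:
                finished.append((score, seq))
            else:
                beams.append((score, seq))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[0])

greedy_score, greedy_seq = beam_search(toy_step, "<s>", "</s>", beam_width=1, max_len=5)
beam_score, beam_seq = beam_search(toy_step, "<s>", "</s>", beam_width=2, max_len=5)
```

With beam_width=1 the search is exactly greedy decoding and commits to "a" (probability 0.6), ending with a total probability of 0.3; with beam_width=2 it also keeps "b" alive and finds the sequence with probability 0.36.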

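The decoder runs differently at inference time and at training time. At inference time only the language-A sentence is given: the encoder produces encoder_state, which is passed to the decoder; the decoder is fed the <s> vector and produces an output vector (corresponding to one word); the vector of that predicted word is then fed in as the input at the second position to get the second output, and so on, until the </s> vector is produced. Iteration stops there, and the outputs form the language-B translation. At training time the model is instead given the complete, correct sentence pair at every step (teacher forcing).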
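The two decoding regimes differ only in what is fed in as the next input. Here is a plain-Python sketch with a made-up lookup table standing in for the decoder cell; run_decoder, toy_step and NEXT are hypothetical names, and the state is passed through unchanged to keep the example short.

```python
# Hypothetical "model": the predicted next word depends only on the input word.
NEXT = {"<s>": "i", "i": "am", "am": "here", "here": "</s>", "was": "there"}

def toy_step(token, state):
    """Stand-in for one decoder step: returns (predicted word, new state)."""
    return NEXT.get(token, "</s>"), state

def run_decoder(step_fn, state, gold=None, sos="<s>", eos="</s>", max_len=20):
    """Training mode (gold given) feeds the reference words; inference mode feeds back predictions."""
    inputs_fed, outputs = [], []
    token = sos
    steps = len(gold) if gold is not None else max_len
    for t in range(steps):
        inputs_fed.append(token)
        prediction, state = step_fn(token, state)
        outputs.append(prediction)
        if gold is not None:
            token = gold[t]        # teacher forcing: the next input is the correct word
        elif prediction == eos:
            break                  # inference stops once </s> is produced
        else:
            token = prediction     # inference: the next input is the model's own guess
    return inputs_fed, outputs

infer_inputs, infer_outputs = run_decoder(toy_step, state=None)
train_inputs, train_outputs = run_decoder(
    toy_step, state=None, gold=["i", "was", "here", "</s>"])
```

In the training run the model predicts "am" at the second step, yet the next input is still the reference word "was", so one wrong prediction does not derail the rest of the sequence.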

tf has four main attention mechanisms: LuongAttention, BahdanauAttention, LuongMonotonicAttention and BahdanauMonotonicAttention. (Their parent classes are _BaseMonotonicAttentionMechanism and _BaseAttentionMechanism; the parent of the latter is AttentionMechanism.)
LuongAttention comes from the paper "Effective Approaches to Attention-based Neural Machine Translation"; it is multiplicative, and scale=True gives the scaled form of the attention.
BahdanauAttention comes from the paper "Neural Machine Translation by Jointly Learning to Align and Translate"; it is additive, and normalize=True turns on weight normalization (paper: Weight normalization: A simple reparameterization to accelerate training of deep neural networks).
LuongMonotonicAttention and BahdanauMonotonicAttention are improvements on the two mechanisms above, based on the paper "Online and Linear-Time Attention by Enforcing Monotonic Alignments"; they are incremental forms of attention (they satisfy a monotonic constraint).
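The multiplicative and additive scoring functions can be compared side by side in NumPy. This is a sketch of the formulas as I understand them from the two papers, not the tf.contrib.seq2seq code; the weights are random, the shapes are arbitrary, and the scale argument only mimics the idea of a scalar multiplier on the score.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def luong_score(query, keys, scale=None):
    """Multiplicative score: dot product of the query with each key."""
    scores = keys @ query
    return scores if scale is None else scale * scores

def bahdanau_score(query, keys, W_q, W_k, v):
    """Additive score: v^T tanh(W_q q + W_k k) for each key k."""
    return np.tanh(query @ W_q + keys @ W_k) @ v

def attend(scores, values):
    """Turn scores into weights and return (weights, weighted sum of values)."""
    weights = softmax(scores)
    return weights, weights @ values

rng = np.random.default_rng(0)
src_len, depth, units = 4, 6, 5                 # example sizes
keys = rng.normal(size=(src_len, depth))        # one key per source position
query = rng.normal(size=depth)                  # current decoder state
W_q, W_k = rng.normal(size=(depth, units)), rng.normal(size=(depth, units))
v = rng.normal(size=units)

mul_weights, mul_context = attend(luong_score(query, keys), keys)
add_weights, add_context = attend(bahdanau_score(query, keys, W_q, W_k, v), keys)
```

Both variants produce one weight per source position and a context vector of the same size; they differ only in how the raw score is computed.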

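How Attention, Transformer, BERT, GPT and ELMo relate:
Attention was introduced in two successive papers (corresponding to BahdanauAttention and LuongAttention). A by-product of the attention mechanism is a matrix that visualizes the alignments between source and target sentences.
The motivation for attention is that, for long sentences, the single fixed-size hidden state becomes an information bottleneck.
Attention lets the decoder pick out useful information computed on the encoder side (the source RNN) instead of discarding all of it, which improves translation quality on long sentences.
The core idea of the attention mechanism is to establish direct short-cut connections between the target and the source by paying "attention" to relevant source content as we translate.
One design choice in attention is the scoring function, which can be multiplicative or additive.
Attention comes in three kinds: encoder-decoder attention, encoder self-attention and decoder self-attention.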
https://www.youtube.com/watch?v=QuvRWevJMZ4 (excellent)
https://www.youtube.com/watch?v=W2rWgXJBZhU (attention for images)
https://www.youtube.com/watch?v=SysgYptB198 (Andrew Ng)
https://www.youtube.com/watch?v=quoGRI-1l0A (Andrew Ng)
https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/ (by Jay Alammar)

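The Transformer was proposed in "Attention Is All You Need". It needs neither CNNs nor RNNs, takes less time to train, reached state-of-the-art BLEU at the time, and extends to other NLP tasks.
(Scaled Dot-Product Attention is the multiplicative form, the counterpart of additive attention; multi-head self-attention runs several Scaled Dot-Product Attention heads in parallel and concatenates their outputs.)
(The idea of self-attention is to use the relationships between the words of the input sequence to compute representations of each word and of the sequence. Consider the sentence: The animal didn't cross the street because it was too tired/wide.)
(Self-attention takes a sequence of vectors and outputs a sequence of the same length; its formula is equation 1 in section 3.2.1, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.)
(Q, K and V in self-attention stand for query, key and value. The input word embeddings are usually longer than the Q, K and V vectors. Q, K and V themselves are not parameters: they are computed by multiplying the input by the projection matrices W^Q, W^K and W^V, and those matrices are what gets trained.)
(Is feed forward just an ordinary feed-forward neural network? Yes.)
(The Transformer decoder is similar to the encoder, except that in its encoder-decoder attention K and V come from the encoder output while Q comes from the decoder. Also, each encoder layer has 2 sub-layers, while each decoder layer has 3.)
(The 2 encoder sub-layers are self-attention and feed forward; the 3 decoder sub-layers are self-attention, encoder-decoder attention and feed forward.)
(In the Transformer a vector called the positional encoding is added to each word embedding to encode the word's position in the sequence; it is built from sin and cos functions.)
(The Transformer uses layer normalization and residual connections.)
(Compared with RNNs, self-attention is easier to parallelize; according to the paper, its per-layer cost is also lower than a recurrent layer's when the sequence length is smaller than the representation dimension.)
(https://github.com/tensorflow/tensor2tensor implements models for many tasks, including Transformer-based translation.)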
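Two of the pieces above, the sinusoidal positional encoding and single-head scaled dot-product self-attention, fit into a short NumPy sketch. The weights are random and untrained, the sizes are arbitrary, d_model is assumed to be even, and masking, multiple heads, residual connections and layer normalization are all left out.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal position vectors: sin on even dimensions, cos on odd ones."""
    pos = np.arange(seq_len)[:, None]
    even_dims = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, W_q, W_k, W_v):
    """softmax(Q K^T / sqrt(d_k)) V, with Q, K and V all projected from the same input x."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 8, 4                  # example sizes
embeddings = rng.normal(size=(seq_len, d_model))
x = embeddings + positional_encoding(seq_len, d_model)
# The projection matrices are the trainable parameters; Q, K, V are computed from x.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
output, weights = self_attention(x, W_q, W_k, W_v)
```

Row i of weights says how much position i attends to every position of the same sequence, which is the matrix usually drawn in the "it was too tired" visualizations.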
https://jalammar.github.io/illustrated-transformer/ (by Jay Alammar)
https://blog.csdn.net/qq_41664845/article/details/84969266 (夏目, illustrated Transformer)
https://blog.csdn.net/yujianmin1990/article/details/85221271 (于建民)
https://blog.csdn.net/pipisorry/article/details/84946653 (CSDN author, very good)
https://www.youtube.com/watch?v=rBCqOTEfxvg (by a Transformer author)
https://www.youtube.com/watch?v=iDulhoQ2pro (decent)
https://www.youtube.com/watch?v=S0KakHcj_rs (good)
https://www.youtube.com/watch?v=KMY2Knr4iAs (code review, PyTorch version)
https://www.youtube.com/watch?v=z1xs9jdZnuY (concise)
http://nlp.seas.harvard.edu/2018/04/03/attention.html (walkthrough based on the OpenNMT code)
https://edu.csdn.net/course/play/8959/185470 (Alibaba, 于恒)
https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html

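BERT stands for Bidirectional Encoder Representations from Transformers. Its novelty is that it combines pre-training with fine-tuning and handles many NLP tasks without task-specific architectures; it also set the best results on 11 NLP tasks at the time.
(BERT introduced a new training objective called the masked language model (MLM).)
(MLM randomly masks some of the tokens in the model input, and the goal is to predict the original vocabulary id of each masked token based only on its context.)
(Besides MLM, the authors also introduce a "next sentence prediction" task, which is trained jointly with MLM to pre-train representations of text pairs.)
(PS: the authors argue that many NLP tasks, such as QA and NLI, require understanding the relationship between two sentences, which a language model does not capture directly.)
(BERT comes in two sizes, BERT base and BERT large, which differ in parameter count. The number of layers (i.e. Transformer blocks) is denoted L, the hidden size H, and the number of self-attention heads A.)
(BERT's input is the sum of the token embeddings, segment embeddings and position embeddings.)
(Both BERT and GPT use gelu rather than relu; gelu stands for Gaussian Error Linear Unit.)
(Difference between BERT and GPT: BERT uses a bidirectional Transformer. OpenAI GPT uses a left-to-right Transformer.)
(BERT also uses language modeling as its training task (bidirectional).)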
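The MLM masking step can be sketched in plain Python. The 15% selection rate and the 80/10/10 split between [MASK], a random token and the unchanged token follow the BERT paper as I remember it, not these notes; mask_tokens and the tiny vocabulary are invented for the example.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Return (masked_tokens, labels).

    labels[i] holds the original token wherever a prediction is required
    and None everywhere else.
    """
    rng = random.Random(seed)
    masked, labels = list(tokens), [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                       # position not selected for prediction
        labels[i] = token
        r = rng.random()
        if r < 0.8:
            masked[i] = MASK               # 80%: replace with the mask symbol
        elif r < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels

vocab = ["my", "dog", "is", "hairy", "cat", "runs"]
tokens = ["my", "dog", "is", "hairy"]
masked, labels = mask_tokens(tokens, vocab, seed=0)
```

The loss is computed only at positions where labels is not None, so the model has to reconstruct those tokens from the context on both sides.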
https://github.com/google-research/bert
http://ruder.io/nlp-imagenet/ (excellent)
https://jalammar.github.io/illustrated-bert/ (by Jay Alammar)
https://blog.csdn.net/qq_41664845/article/details/84787969 (夏目, illustrated BERT)
https://gluebenchmark.com/leaderboard (GLUE is a collection of natural language tasks)
https://www.jiqizhixin.com/articles/2018-12-03 (very good)
https://blog.csdn.net/xmxoxo/article/details/89315370 (可西哥)
https://blog.csdn.net/macanv/article/details/85684284 (Macanv's BERT-BiLSTM-CRF-NER)
https://github.com/macanv/BERT-BiLSTM-CRF-NER (Macanv's BERT-BiLSTM-CRF-NER)
https://github.com/hanxiao/bert-as-service (hanxiao of Tencent AI Lab)
https://www.youtube.com/watch?v=BhlOGGzC0Q0 (decent)
https://www.youtube.com/watch?v=-9evrZnBorM (good)
https://www.youtube.com/watch?v=0EtD5ybnh_s (Language Learning with BERT - TensorFlow and Deep Learning Singapore)
https://www.youtube.com/watch?v=ycXWAtm22-w (Language Model Overview: From word2vec to BERT)
https://www.cnblogs.com/d0main/p/10165671.html (very good)
https://cloud.tencent.com/developer/article/1389555 (very good)
https://blog.csdn.net/triplemeng/article/details/83053419 (good)
https://www.cnblogs.com/rucwxb/p/10277217.html (okay)
https://blog.csdn.net/yangfengling1023/article/details/84025313 (mediocre)

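GPT comes from the paper "Improving Language Understanding by Generative Pre-Training"; the accompanying OpenAI blog post is titled "Improving Language Understanding with Unsupervised Learning".
GPT proposes a semi-supervised approach that combines unsupervised pre-training with supervised fine-tuning. (It is also based on the Transformer, and predates BERT.)
GPT's unsupervised pre-training is sentence-level, whereas earlier word embeddings are word-level. (There is also phrase-level.)
GPT also uses language modeling as its training task (unidirectional).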
https://openai.com/blog/language-unsupervised/
https://github.com/openai/finetune-transformer-lm

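ELMo comes from the paper "Deep contextualized word representations"; the name stands for Embeddings from Language Models.
ELMo is a context-dependent embedding that improves on static word embeddings such as word2vec and GloVe (it can handle polysemy), and it uses language modeling as its training task.
ELMo concatenates independently trained left-to-right and right-to-left LSTMs to produce features for downstream tasks.
Weaknesses of ELMo compared with GPT and BERT: 1. the LSTM is a much weaker feature extractor than the Transformer; 2. fusing the two directions by concatenation is relatively weak.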
https://allennlp.org/elmo

Attention is all you need, so attention can replace RNNs in NLP. Can attention also replace CNNs in computer vision? Not at the time of writing: image models still favor CNNs.
