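I first came across the paper in the title while reading a paper on data augmentation with BERT, Conditional BERT Contextual Augmentation. An article analyzing it (see https://zhuanlan.zhihu.com/p/53141568) mentioned this paper by Kobayashi, which happens to have open-source code. As an English major who graduated several years ago and a not-quite-qualified programmer, writing the code myself was not realistic, so I decided to reproduce the experimental results with the open-source code and record the process below for reference.
The original code is on GitHub: <https://github.com/pfnet-research/contextual_augmentation>
Following the order in the README, I ran the commands from top to bottom: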
1. Prepare a label-conditional bi-directional language model
(1) # download wikitext
sh prepare_rawwikitext.sh
(2) # install chainer and spacy
Command: pip install cupy ---> changed to pip install cupy-cuda90
Output:
Collecting cupy-cuda90
Downloading https://files.pythonhosted.org/packages/30/a5/89d64c99a8b17c1ed64fcc0c9207ff6bc70efe90a9c567d616eb910aee34/cupy_cuda90-6.2.0-cp36-cp36m-manylinux1_x86_64.whl (270.4MB)
|████████████████████████████████| 270.4MB 17kB/s
Collecting fastrlock>=0.3 (from cupy-cuda90)
Downloading https://files.pythonhosted.org/packages/b5/93/a7efbd39eac46c137500b37570c31dedc2d31a8ff4949fcb90bda5bc5f16/fastrlock-0.4-cp36-cp36m-manylinux1_x86_64.whl
Requirement already satisfied: numpy>=1.9.0 in /dnn4_added/wanglina/wln_install/python-3.6/lib/python3.6/site-packages (from cupy-cuda90) (1.16.4)
Requirement already satisfied: six>=1.9.0 in /dnn4_added/wanglina/wln_install/python-3.6/lib/python3.6/site-packages (from cupy-cuda90) (1.12.0)
Installing collected packages: fastrlock, cupy-cuda90
Successfully installed cupy-cuda90-6.2.0 fastrlock-0.4
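The switch from cupy to cupy-cuda90 picks a prebuilt wheel instead of a source build. As a rough sketch, the helper below derives such a wheel name from a CUDA version string. The function name cupy_wheel_name is hypothetical, and the cupy-cudaXY naming pattern is an assumption based on the package name in the log above; later CuPy releases may name their wheels differently, so check the CuPy installation guide for your version.

```python
def cupy_wheel_name(cuda_version):
    """Map a CUDA version string such as '9.0' to a wheel name such as 'cupy-cuda90'.

    Assumes the 'cupy-cuda<major><minor>' pattern seen in the log above.
    """
    parts = cuda_version.strip().split(".")
    if len(parts) < 2 or not all(p.isdigit() for p in parts[:2]):
        raise ValueError(f"expected MAJOR.MINOR, got {cuda_version!r}")
    major, minor = parts[0], parts[1]
    return f"cupy-cuda{major}{minor}"


print(cupy_wheel_name("9.0"))  # cupy-cuda90
```

The CUDA version itself can usually be read from the output of nvcc --version on the machine that will run the training.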
Command: pip install chainer
Output:
Collecting chainer
Downloading https://files.pythonhosted.org/packages/2c/5a/86c50a0119a560a39d782c4cdd9b72927c090cc2e3f70336e01b19a5f97a/chainer-6.2.0.tar.gz (873kB)
|████████████████████████████████| 880kB 174kB/s
Requirement already satisfied: setuptools in /dnn4_added/wanglina/wln_install/python-3.6/lib/python3.6/site-packages (from chainer) (28.8.0)
Collecting typing<=3.6.6 (from chainer)
Downloading https://files.pythonhosted.org/packages/4a/bd/eee1157fc2d8514970b345d69cb9975dcd1e42cd7e61146ed841f6e68309/typing-3.6.6-py3-none-any.whl
Collecting typing_extensions<=3.6.6 (from chainer)
Downloading https://files.pythonhosted.org/packages/62/4f/392a1fa2873e646f5990eb6f956e662d8a235ab474450c72487745f67276/typing_extensions-3.6.6-py3-none-any.whl
Collecting filelock (from chainer)
Downloading https://files.pythonhosted.org/packages/93/83/71a2ee6158bb9f39a90c0dea1637f81d5eef866e188e1971a1b1ab01a35a/filelock-3.0.12-py3-none-any.whl
Requirement already satisfied: numpy>=1.9.0 in /dnn4_added/wanglina/wln_install/python-3.6/lib/python3.6/site-packages (from chainer) (1.16.4)
Collecting protobuf<3.8.0rc1,>=3.0.0 (from chainer)
Downloading https://files.pythonhosted.org/packages/5a/aa/a858df367b464f5e9452e1c538aa47754d467023850c00b000287750fa77/protobuf-3.7.1-cp36-cp36m-manylinux1_x86_64.whl (1.2MB)
|████████████████████████████████| 1.2MB 153kB/s
Requirement already satisfied: six>=1.9.0 in /dnn4_added/wanglina/wln_install/python-3.6/lib/python3.6/site-packages (from chainer) (1.12.0)
Building wheels for collected packages: chainer
Building wheel for chainer (setup.py) ... done
Stored in directory: /sunj/wanglina/.cache/pip/wheels/2e/be/c5/6ee506abcaa4a53106f7d7671bbee8b4e5243bc562a9d32ad1
Successfully built chainer
Installing collected packages: typing, typing-extensions, filelock, protobuf, chainer
Found existing installation: protobuf 3.9.0
Uninstalling protobuf-3.9.0:
Successfully uninstalled protobuf-3.9.0
Successfully installed chainer-6.2.0 filelock-3.0.12 protobuf-3.7.1 typing-3.6.6 typing-extensions-3.6.6
Command: pip install spacy
Output:
Collecting spacy
Downloading https://files.pythonhosted.org/packages/4e/f4/3d79c0eeec5d45046d0b1f00b3b78de00f8ce389f56d8b53fbbdd198d90e/spacy-2.1.6-cp36-cp36m-manylinux1_x86_64.whl (30.8MB)
|████████████████████████████████| 30.8MB 76kB/s
Collecting murmurhash<1.1.0,>=0.28.0 (from spacy)
Downloading https://files.pythonhosted.org/packages/a6/e6/63f160a4fdf0e875d16b28f972083606d8d54f56cd30cb8929f9a1ee700e/murmurhash-1.0.2-cp36-cp36m-manylinux1_x86_64.whl
Collecting srsly<1.1.0,>=0.0.6 (from spacy)
Downloading https://files.pythonhosted.org/packages/aa/6c/2ef2d6f4c63a197981f4ac01bb17560c857c6721213c7c99998e48cdda2a/srsly-0.0.7-cp36-cp36m-manylinux1_x86_64.whl (180kB)
|████████████████████████████████| 184kB 1.1MB/s
Collecting thinc<7.1.0,>=7.0.8 (from spacy)
Downloading https://files.pythonhosted.org/packages/18/a5/9ace20422e7bb1bdcad31832ea85c52a09900cd4a7ce711246bfb92206ba/thinc-7.0.8-cp36-cp36m-manylinux1_x86_64.whl (2.1MB)
|████████████████████████████████| 2.1MB 1.2MB/s
Collecting cymem<2.1.0,>=2.0.2 (from spacy)
Downloading https://files.pythonhosted.org/packages/3d/61/9b0520c28eb199a4b1ca667d96dd625bba003c14c75230195f9691975f85/cymem-2.0.2-cp36-cp36m-manylinux1_x86_64.whl
Requirement already satisfied: numpy>=1.15.0 in /dnn4_added/wanglina/wln_install/python-3.6/lib/python3.6/site-packages (from spacy) (1.16.4)
Collecting requests<3.0.0,>=2.13.0 (from spacy)
Downloading https://files.pythonhosted.org/packages/51/bd/23c926cd341ea6b7dd0b2a00aba99ae0f828be89d72b2190f27c11d4b7fb/requests-2.22.0-py2.py3-none-any.whl (57kB)
|████████████████████████████████| 61kB 318kB/s
Collecting wasabi<1.1.0,>=0.2.0 (from spacy)
Downloading https://files.pythonhosted.org/packages/f4/c1/d76ccdd12c716be79162d934fe7de4ac8a318b9302864716dde940641a79/wasabi-0.2.2-py3-none-any.whl
Collecting preshed<2.1.0,>=2.0.1 (from spacy)
Downloading https://files.pythonhosted.org/packages/20/93/f222fb957764a283203525ef20e62008675fd0a14ffff8cc1b1490147c63/preshed-2.0.1-cp36-cp36m-manylinux1_x86_64.whl (83kB)
|████████████████████████████████| 92kB 453kB/s
Collecting blis<0.3.0,>=0.2.2 (from spacy)
Downloading https://files.pythonhosted.org/packages/34/46/b1d0bb71d308e820ed30316c5f0a017cb5ef5f4324bcbc7da3cf9d3b075c/blis-0.2.4-cp36-cp36m-manylinux1_x86_64.whl (3.2MB)
|████████████████████████████████| 3.2MB 1.0MB/s
Collecting plac<1.0.0,>=0.9.6 (from spacy)
Downloading https://files.pythonhosted.org/packages/9e/9b/62c60d2f5bc135d2aa1d8c8a86aaf84edb719a59c7f11a4316259e61a298/plac-0.9.6-py2.py3-none-any.whl
Collecting tqdm<5.0.0,>=4.10.0 (from thinc<7.1.0,>=7.0.8->spacy)
Downloading https://files.pythonhosted.org/packages/9f/3d/7a6b68b631d2ab54975f3a4863f3c4e9b26445353264ef01f465dc9b0208/tqdm-4.32.2-py2.py3-none-any.whl (50kB)
|████████████████████████████████| 51kB 271kB/s
Collecting urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 (from requests<3.0.0,>=2.13.0->spacy)
Downloading https://files.pythonhosted.org/packages/e6/60/247f23a7121ae632d62811ba7f273d0e58972d75e58a94d329d51550a47d/urllib3-1.25.3-py2.py3-none-any.whl (150kB)
|████████████████████████████████| 153kB 1.2MB/s
Collecting certifi>=2017.4.17 (from requests<3.0.0,>=2.13.0->spacy)
Downloading https://files.pythonhosted.org/packages/69/1b/b853c7a9d4f6a6d00749e94eb6f3a041e342a885b87340b79c1ef73e3a78/certifi-2019.6.16-py2.py3-none-any.whl (157kB)
|████████████████████████████████| 163kB 1.2MB/s
Collecting chardet<3.1.0,>=3.0.2 (from requests<3.0.0,>=2.13.0->spacy)
Downloading https://files.pythonhosted.org/packages/bc/a9/01ffebfb562e4274b6487b4bb1ddec7ca55ec7510b22e4c51f14098443b8/chardet-3.0.4-py2.py3-none-any.whl (133kB)
|████████████████████████████████| 143kB 1.2MB/s
Collecting idna<2.9,>=2.5 (from requests<3.0.0,>=2.13.0->spacy)
Downloading https://files.pythonhosted.org/packages/14/2c/cd551d81dbe15200be1cf41cd03869a46fe7226e7450af7a6545bfc474c9/idna-2.8-py2.py3-none-any.whl (58kB)
|████████████████████████████████| 61kB 343kB/s
Installing collected packages: murmurhash, srsly, tqdm, cymem, plac, wasabi, preshed, blis, thinc, urllib3, certifi, chardet, idna, requests, spacy
Successfully installed blis-0.2.4 certifi-2019.6.16 chardet-3.0.4 cymem-2.0.2 idna-2.8 murmurhash-1.0.2 plac-0.9.6 preshed-2.0.1 requests-2.22.0 spacy-2.1.6 srsly-0.0.7 thinc-7.0.8 tqdm-4.32.2 urllib3-1.25.3 wasabi-0.2.2
Command: python -m spacy download en_core_web_sm
Output:
Collecting en_core_web_sm==2.1.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.1.0/en_core_web_sm-2.1.0.tar.gz#egg=en_core_web_sm==2.1.0
Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.1.0/en_core_web_sm-2.1.0.tar.gz (11.1MB)
|████████████████████████████████| 11.1MB 403kB/s
Building wheels for collected packages: en-core-web-sm
Building wheel for en-core-web-sm (setup.py) ... done
Stored in directory: /tmp/pip-ephem-wheel-cache-6kka_3pd/wheels/39/ea/3b/507f7df78be8631a7a3d7090962194cf55bc1158572c0be77f
Successfully built en-core-web-sm
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-2.1.0
WARNING: You are using pip version 19.1.1, however version 19.2.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
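Before moving on, it can help to confirm that the environment matches the versions in the pip logs above. The following is a minimal sketch, not part of the original repository: the helper names are made up for illustration, and it relies on importlib.metadata, which needs Python 3.8 or newer (the logs above come from Python 3.6, where this module is not available).

```python
from importlib import metadata

# Versions reported by the pip logs above.
EXPECTED = {
    "cupy-cuda90": "6.2.0",
    "chainer": "6.2.0",
    "spacy": "2.1.6",
    "en-core-web-sm": "2.1.0",
}


def installed_versions(names):
    """Return {package: version or None} without importing the packages."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def report(expected):
    """Build one status line per package: ok, differs, or missing."""
    found = installed_versions(expected)
    lines = []
    for name, want in expected.items():
        have = found[name]
        if have is None:
            status = "missing"
        elif have == want:
            status = "ok"
        else:
            status = "differs"
        lines.append(f"{name}: expected {want}, found {have} ({status})")
    return lines


for line in report(EXPECTED):
    print(line)
```

A "differs" line is not necessarily a problem, but it is a useful first thing to check if a later step fails.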
(3) # segment text by sentence boundaries (very slowly). The run started at 15:15 and was about half done by 18:30; the whole run was estimated to take about six hours.
Command: PYTHONIOENCODING=utf-8 python preprocess_spacy.py -d datasets/wikitext-103-raw/wiki.train.raw > datasets/wikitext-103-raw/spacy_wikitext-103-raw.train
Output:
0 lines end
100000 lines end
200000 lines end
300000 lines end
400000 lines end
500000 lines end
600000 lines end
700000 lines end
800000 lines end
900000 lines end
1000000 lines end
1100000 lines end
1200000 lines end
1300000 lines end
1400000 lines end
1500000 lines end
1600000 lines end
1700000 lines end
1800000 lines end
Command: PYTHONIOENCODING=utf-8 python preprocess_spacy.py -d datasets/wikitext-103-raw/wiki.valid.raw > datasets/wikitext-103-raw/spacy_wikitext-103-raw.valid
Output:
0 lines end
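The running-time estimate for the training split in step (3) is a simple linear extrapolation, which can be sketched as follows. The function name estimate_total is made up for this note, the calendar date is a placeholder (only the clock times come from the text), and the calculation assumes the processing speed stays roughly constant.

```python
from datetime import datetime


def estimate_total(start, now, fraction_done):
    """Linearly extrapolate the total runtime from the fraction finished so far."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    elapsed = now - start
    return elapsed / fraction_done


# Placeholder date; 15:15 and 18:30 are the times noted in step (3).
start = datetime(2019, 1, 1, 15, 15)
now = datetime(2019, 1, 1, 18, 30)

total = estimate_total(start, now, 0.5)
print(total)          # 6:30:00
print(start + total)  # 2019-01-01 21:45:00
```

This gives about six and a half hours, in line with the rough six-hour estimate above.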
(4) # construct vocabulary on wikitext
Command: python construct_vocab.py --data datasets/wikitext-103-raw/spacy_wikitext-103-raw.train -t 50 --save datasets/wikitext-103-raw/spacy_wikitext-103-raw.train.vocab.t50
Output:
# of words: 49873
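To give a feel for what this step does, here is a small sketch of frequency-thresholded vocabulary construction. It is not the repository's construct_vocab.py: I am assuming that -t 50 is a minimum word frequency, and the function name build_vocab and the special tokens <eos> and <unk> are illustrative choices rather than details confirmed from the source.

```python
from collections import Counter


def build_vocab(sentences, threshold):
    """Keep words that occur at least `threshold` times, most frequent first."""
    counts = Counter(word for sent in sentences for word in sent.split())
    kept = [(w, c) for w, c in counts.most_common() if c >= threshold]
    # Reserve ids for special tokens; these names are illustrative assumptions.
    vocab = {"<eos>": 0, "<unk>": 1}
    for word, _ in kept:
        vocab.setdefault(word, len(vocab))
    return vocab


toy = ["the cat sat", "the dog sat", "the bird flew"]
vocab = build_vocab(toy, threshold=2)
print(vocab)  # {'<eos>': 0, '<unk>': 1, 'the': 2, 'sat': 3}
```

Under this reading, raising the threshold shrinks the vocabulary and maps more rare words to the unknown token; with a threshold of 50 on the full training split, the script reported 49873 words.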
Space is limited, so the record continues in the next post.