Reproducing Data Augmentation Experiments (1): Contextual Augmentation: Data Augmentation by Words with Paradigmatic Relations


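I first came across the paper in the title while reading a paper on data augmentation with BERT, Conditional BERT Contextual Augmentation. An analysis of that paper (see https://zhuanlan.zhihu.com/p/53141568) mentioned this work by Kobayashi, and open-source code happened to be available. As an English major who graduated several years ago and a less-than-qualified woman programmer, writing the code myself was not realistic, so I decided to reproduce the experimental results with the open-source code and record the process below for reference.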

The original code is on GitHub: <https://github.com/pfnet-research/contextual_augmentation>

Run the commands in the order given in the README, from top to bottom:

1. Prepare a label-conditional bi-directional language model

(1) # download wikitext

sh prepare_rawwikitext.sh

(2) # install chainer and spacy

Command: pip install cupy, replaced with: pip install cupy-cuda90

Output:

Collecting cupy-cuda90

  Downloading https://files.pythonhosted.org/packages/30/a5/89d64c99a8b17c1ed64fcc0c9207ff6bc70efe90a9c567d616eb910aee34/cupy_cuda90-6.2.0-cp36-cp36m-manylinux1_x86_64.whl (270.4MB)

     |████████████████████████████████| 270.4MB 17kB/s

Collecting fastrlock>=0.3 (from cupy-cuda90)

  Downloading https://files.pythonhosted.org/packages/b5/93/a7efbd39eac46c137500b37570c31dedc2d31a8ff4949fcb90bda5bc5f16/fastrlock-0.4-cp36-cp36m-manylinux1_x86_64.whl

Requirement already satisfied: numpy>=1.9.0 in /dnn4_added/wanglina/wln_install/python-3.6/lib/python3.6/site-packages (from cupy-cuda90) (1.16.4)

Requirement already satisfied: six>=1.9.0 in /dnn4_added/wanglina/wln_install/python-3.6/lib/python3.6/site-packages (from cupy-cuda90) (1.12.0)

Installing collected packages: fastrlock, cupy-cuda90

Successfully installed cupy-cuda90-6.2.0 fastrlock-0.4

Command: pip install chainer

Output:

Collecting chainer

  Downloading https://files.pythonhosted.org/packages/2c/5a/86c50a0119a560a39d782c4cdd9b72927c090cc2e3f70336e01b19a5f97a/chainer-6.2.0.tar.gz (873kB)

     |████████████████████████████████| 880kB 174kB/s

Requirement already satisfied: setuptools in /dnn4_added/wanglina/wln_install/python-3.6/lib/python3.6/site-packages (from chainer) (28.8.0)

Collecting typing<=3.6.6 (from chainer)

  Downloading https://files.pythonhosted.org/packages/4a/bd/eee1157fc2d8514970b345d69cb9975dcd1e42cd7e61146ed841f6e68309/typing-3.6.6-py3-none-any.whl

Collecting typing_extensions<=3.6.6 (from chainer)

  Downloading https://files.pythonhosted.org/packages/62/4f/392a1fa2873e646f5990eb6f956e662d8a235ab474450c72487745f67276/typing_extensions-3.6.6-py3-none-any.whl

Collecting filelock (from chainer)

  Downloading https://files.pythonhosted.org/packages/93/83/71a2ee6158bb9f39a90c0dea1637f81d5eef866e188e1971a1b1ab01a35a/filelock-3.0.12-py3-none-any.whl

Requirement already satisfied: numpy>=1.9.0 in /dnn4_added/wanglina/wln_install/python-3.6/lib/python3.6/site-packages (from chainer) (1.16.4)

Collecting protobuf<3.8.0rc1,>=3.0.0 (from chainer)

  Downloading https://files.pythonhosted.org/packages/5a/aa/a858df367b464f5e9452e1c538aa47754d467023850c00b000287750fa77/protobuf-3.7.1-cp36-cp36m-manylinux1_x86_64.whl (1.2MB)

     |████████████████████████████████| 1.2MB 153kB/s

Requirement already satisfied: six>=1.9.0 in /dnn4_added/wanglina/wln_install/python-3.6/lib/python3.6/site-packages (from chainer) (1.12.0)

Building wheels for collected packages: chainer

  Building wheel for chainer (setup.py) ... done

  Stored in directory: /sunj/wanglina/.cache/pip/wheels/2e/be/c5/6ee506abcaa4a53106f7d7671bbee8b4e5243bc562a9d32ad1

Successfully built chainer

Installing collected packages: typing, typing-extensions, filelock, protobuf, chainer

  Found existing installation: protobuf 3.9.0

    Uninstalling protobuf-3.9.0:

      Successfully uninstalled protobuf-3.9.0

Successfully installed chainer-6.2.0 filelock-3.0.12 protobuf-3.7.1 typing-3.6.6 typing-extensions-3.6.6

Command: pip install spacy

Output:

Collecting spacy

  Downloading https://files.pythonhosted.org/packages/4e/f4/3d79c0eeec5d45046d0b1f00b3b78de00f8ce389f56d8b53fbbdd198d90e/spacy-2.1.6-cp36-cp36m-manylinux1_x86_64.whl (30.8MB)

     |████████████████████████████████| 30.8MB 76kB/s

Collecting murmurhash<1.1.0,>=0.28.0 (from spacy)

  Downloading https://files.pythonhosted.org/packages/a6/e6/63f160a4fdf0e875d16b28f972083606d8d54f56cd30cb8929f9a1ee700e/murmurhash-1.0.2-cp36-cp36m-manylinux1_x86_64.whl

Collecting srsly<1.1.0,>=0.0.6 (from spacy)

  Downloading https://files.pythonhosted.org/packages/aa/6c/2ef2d6f4c63a197981f4ac01bb17560c857c6721213c7c99998e48cdda2a/srsly-0.0.7-cp36-cp36m-manylinux1_x86_64.whl (180kB)

     |████████████████████████████████| 184kB 1.1MB/s

Collecting thinc<7.1.0,>=7.0.8 (from spacy)

  Downloading https://files.pythonhosted.org/packages/18/a5/9ace20422e7bb1bdcad31832ea85c52a09900cd4a7ce711246bfb92206ba/thinc-7.0.8-cp36-cp36m-manylinux1_x86_64.whl (2.1MB)

     |████████████████████████████████| 2.1MB 1.2MB/s

Collecting cymem<2.1.0,>=2.0.2 (from spacy)

  Downloading https://files.pythonhosted.org/packages/3d/61/9b0520c28eb199a4b1ca667d96dd625bba003c14c75230195f9691975f85/cymem-2.0.2-cp36-cp36m-manylinux1_x86_64.whl

Requirement already satisfied: numpy>=1.15.0 in /dnn4_added/wanglina/wln_install/python-3.6/lib/python3.6/site-packages (from spacy) (1.16.4)

Collecting requests<3.0.0,>=2.13.0 (from spacy)

  Downloading https://files.pythonhosted.org/packages/51/bd/23c926cd341ea6b7dd0b2a00aba99ae0f828be89d72b2190f27c11d4b7fb/requests-2.22.0-py2.py3-none-any.whl (57kB)

     |████████████████████████████████| 61kB 318kB/s

Collecting wasabi<1.1.0,>=0.2.0 (from spacy)

  Downloading https://files.pythonhosted.org/packages/f4/c1/d76ccdd12c716be79162d934fe7de4ac8a318b9302864716dde940641a79/wasabi-0.2.2-py3-none-any.whl

Collecting preshed<2.1.0,>=2.0.1 (from spacy)

  Downloading https://files.pythonhosted.org/packages/20/93/f222fb957764a283203525ef20e62008675fd0a14ffff8cc1b1490147c63/preshed-2.0.1-cp36-cp36m-manylinux1_x86_64.whl (83kB)

     |████████████████████████████████| 92kB 453kB/s

Collecting blis<0.3.0,>=0.2.2 (from spacy)

  Downloading https://files.pythonhosted.org/packages/34/46/b1d0bb71d308e820ed30316c5f0a017cb5ef5f4324bcbc7da3cf9d3b075c/blis-0.2.4-cp36-cp36m-manylinux1_x86_64.whl (3.2MB)

     |████████████████████████████████| 3.2MB 1.0MB/s

Collecting plac<1.0.0,>=0.9.6 (from spacy)

  Downloading https://files.pythonhosted.org/packages/9e/9b/62c60d2f5bc135d2aa1d8c8a86aaf84edb719a59c7f11a4316259e61a298/plac-0.9.6-py2.py3-none-any.whl

Collecting tqdm<5.0.0,>=4.10.0 (from thinc<7.1.0,>=7.0.8->spacy)

  Downloading https://files.pythonhosted.org/packages/9f/3d/7a6b68b631d2ab54975f3a4863f3c4e9b26445353264ef01f465dc9b0208/tqdm-4.32.2-py2.py3-none-any.whl (50kB)

     |████████████████████████████████| 51kB 271kB/s

Collecting urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 (from requests<3.0.0,>=2.13.0->spacy)

  Downloading https://files.pythonhosted.org/packages/e6/60/247f23a7121ae632d62811ba7f273d0e58972d75e58a94d329d51550a47d/urllib3-1.25.3-py2.py3-none-any.whl (150kB)

     |████████████████████████████████| 153kB 1.2MB/s

Collecting certifi>=2017.4.17 (from requests<3.0.0,>=2.13.0->spacy)

  Downloading https://files.pythonhosted.org/packages/69/1b/b853c7a9d4f6a6d00749e94eb6f3a041e342a885b87340b79c1ef73e3a78/certifi-2019.6.16-py2.py3-none-any.whl (157kB)

     |████████████████████████████████| 163kB 1.2MB/s

Collecting chardet<3.1.0,>=3.0.2 (from requests<3.0.0,>=2.13.0->spacy)

  Downloading https://files.pythonhosted.org/packages/bc/a9/01ffebfb562e4274b6487b4bb1ddec7ca55ec7510b22e4c51f14098443b8/chardet-3.0.4-py2.py3-none-any.whl (133kB)

     |████████████████████████████████| 143kB 1.2MB/s

Collecting idna<2.9,>=2.5 (from requests<3.0.0,>=2.13.0->spacy)

  Downloading https://files.pythonhosted.org/packages/14/2c/cd551d81dbe15200be1cf41cd03869a46fe7226e7450af7a6545bfc474c9/idna-2.8-py2.py3-none-any.whl (58kB)

     |████████████████████████████████| 61kB 343kB/s

Installing collected packages: murmurhash, srsly, tqdm, cymem, plac, wasabi, preshed, blis, thinc, urllib3, certifi, chardet, idna, requests, spacy

Successfully installed blis-0.2.4 certifi-2019.6.16 chardet-3.0.4 cymem-2.0.2 idna-2.8 murmurhash-1.0.2 plac-0.9.6 preshed-2.0.1 requests-2.22.0 spacy-2.1.6 srsly-0.0.7 thinc-7.0.8 tqdm-4.32.2 urllib3-1.25.3 wasabi-0.2.2

Command: python -m spacy download en_core_web_sm

Output:

Collecting en_core_web_sm==2.1.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.1.0/en_core_web_sm-2.1.0.tar.gz#egg=en_core_web_sm==2.1.0

  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.1.0/en_core_web_sm-2.1.0.tar.gz (11.1MB)

     |████████████████████████████████| 11.1MB 403kB/s

Building wheels for collected packages: en-core-web-sm

  Building wheel for en-core-web-sm (setup.py) ... done

  Stored in directory: /tmp/pip-ephem-wheel-cache-6kka_3pd/wheels/39/ea/3b/507f7df78be8631a7a3d7090962194cf55bc1158572c0be77f

Successfully built en-core-web-sm

Installing collected packages: en-core-web-sm

Successfully installed en-core-web-sm-2.1.0

WARNING: You are using pip version 19.1.1, however version 19.2.1 is available.

You should consider upgrading via the 'pip install --upgrade pip' command.

✔ Download and installation successful

You can now load the model via spacy.load('en_core_web_sm')
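After the installs above, it can help to confirm which versions actually ended up in the environment, since pip downgraded protobuf along the way. The following is a minimal sketch, not part of the original repository: the helper name `installed_versions` is hypothetical, and it assumes Python 3.8 or newer for `importlib.metadata` (the log above was produced with Python 3.6, where this module is not available in the standard library).

```python
from importlib import metadata


def installed_versions(packages):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


# Distribution names taken from the pip logs above.
for name, version in installed_versions(
    ["cupy-cuda90", "chainer", "spacy", "en-core-web-sm", "protobuf"]
).items():
    print(name, version if version is not None else "NOT INSTALLED")
```

A `None` entry means the package is missing and the corresponding install command should be rerun.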

(3) # segment text by sentence boundaries (very slowly)

The run started at 15:15 and was about half done by 18:30, so the whole job was estimated to take about six hours.

Command: PYTHONIOENCODING=utf-8 python preprocess_spacy.py -d datasets/wikitext-103-raw/wiki.train.raw > datasets/wikitext-103-raw/spacy_wikitext-103-raw.train

Output:

0 lines end

100000 lines end

200000 lines end

300000 lines end

400000 lines end

500000 lines end

600000 lines end

700000 lines end

800000 lines end

900000 lines end

1000000 lines end

1100000 lines end

1200000 lines end

1300000 lines end

1400000 lines end

1500000 lines end

1600000 lines end

1700000 lines end

1800000 lines end
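The running-time estimate for this step (started at 15:15, roughly half done by 18:30) is a simple linear extrapolation. A minimal sketch of that arithmetic follows; the function name `estimate_total` is hypothetical, and it assumes the preprocessing speed stays roughly constant and that both times fall on the same day.

```python
from datetime import datetime, timedelta


def estimate_total(start, checkpoint, fraction_done):
    """Linearly extrapolate the total run time from one progress checkpoint.

    start and checkpoint are "HH:MM" strings on the same day;
    fraction_done is the share of the work finished at the checkpoint.
    """
    fmt = "%H:%M"
    elapsed = datetime.strptime(checkpoint, fmt) - datetime.strptime(start, fmt)
    return elapsed / fraction_done


total = estimate_total("15:15", "18:30", 0.5)
print(total)  # 6:30:00
```

The result, six and a half hours, is in line with the rough six-hour figure noted for this step.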

Command: PYTHONIOENCODING=utf-8 python preprocess_spacy.py -d datasets/wikitext-103-raw/wiki.valid.raw > datasets/wikitext-103-raw/spacy_wikitext-103-raw.valid

Output:

0 lines end

(4) # construct vocabulary on wikitext
Command: python construct_vocab.py --data datasets/wikitext-103-raw/spacy_wikitext-103-raw.train -t 50 --save datasets/wikitext-103-raw/spacy_wikitext-103-raw.train.vocab.t50
Output:

# of words: 49873
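To illustrate what a thresholded vocabulary looks like, here is a minimal sketch. It is not the code in `construct_vocab.py`: it assumes that `-t 50` is a minimum word-frequency threshold, and the function name `build_vocab` and the special tokens `<eos>` and `<unk>` are hypothetical choices made for the example.

```python
from collections import Counter


def build_vocab(sentences, threshold, specials=("<eos>", "<unk>")):
    """Map each whitespace token seen at least `threshold` times to an integer id.

    Special tokens get the first ids; the remaining tokens are ordered by
    descending frequency, with ties broken alphabetically.
    """
    counts = Counter(tok for sent in sentences for tok in sent.split())
    vocab = {tok: idx for idx, tok in enumerate(specials)}
    for tok, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if count >= threshold and tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


toy_corpus = ["a b a", "a c b"]
vocab = build_vocab(toy_corpus, threshold=2)
print(vocab)  # {'<eos>': 0, '<unk>': 1, 'a': 2, 'b': 3}
print("# of words:", len(vocab))  # # of words: 4
```

Under this assumption, raising the threshold shrinks the vocabulary, and words below it would be handled as unknown tokens at training time.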

To keep this post short, the record continues in the next post.

