Hugging Face NLP课程学习记录 - 0. 安装transformers库 1. Transformer 模型

embedded/2024/9/23 8:47:56/

transformers__1_Transformer__0">Hugging Face NLP课程学习记录 - 0. 安装transformers库 & 1. Transformer 模型

说明:

  • 首次发表日期:2024-09-14
  • 官网: https://huggingface.co/learn/nlp-course/zh-CN/chapter1
  • 关于: 阅读并记录一下,只保留重点部分,大多从原文摘录,润色一下原文

transformers_8">0. 安装transformers库

创建conda环境并安装包:

conda create -n hfnlp python=3.12
conda install pytorch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install transformers==4.44.2# More
pip install seqeval
pip install sentencepiece

使用Hugging Face镜像(见 https://hf-mirror.com/ ):

export HF_ENDPOINT=https://hf-mirror.com

或者在python中设置Hugging Face镜像:

import os
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

1. Transformer 模型

Transformers 能做什么?

使用pipelines

Transformers 库中最基本的对象是 pipeline() 函数。它将模型与其必要的预处理和后处理步骤连接起来,使我们能够通过直接输入任何文本并获得最终的答案:

from transformers import pipelineclassifier = pipeline("sentiment-analysis")
classifier("I've been waiting for a HuggingFace course my whole life.")

提示:

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b ....
[{'label': 'POSITIVE', 'score': 0.9598047137260437}]

输出:

[{'label': 'POSITIVE', 'score': 0.9598047137260437}]

输入多个句子:

classifier(["I've been waiting for a HuggingFace course my whole life.", "I hate this so much!"]
)
[{'label': 'POSITIVE', 'score': 0.9598047137260437},{'label': 'NEGATIVE', 'score': 0.9994558691978455}]

将一些文本传递到pipeline时涉及三个主要步骤:

  1. 文本被预处理为模型可以理解的格式。
  2. 预处理的输入被传递给模型。
  3. 模型处理后输出最终人类可以理解的结果。

零样本分类

from transformers import pipelineclassifier = pipeline("zero-shot-classification")
classifier("This is a course about the Transformers library",candidate_labels=["education", "politics", "business"],
)

提示:

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 ([https://hf-mirror.com/facebook/bart-large-mnli](https://hf-mirror.com/facebook/bart-large-mnli)).
Using a pipeline without specifying a model name and revision in production is not recommended.

输出:

{'sequence': 'This is a course about the Transformers library', 'labels': ['education', 'business', 'politics'], 'scores': [0.8445952534675598, 0.11197696626186371, 0.043427806347608566]}

此pipeline称为zero-shot,因为您不需要对数据上的模型进行微调即可使用它

文本分类

现在让我们看看如何使用pipeline来生成一些文本。这里的主要使用方法是您提供一个提示,模型将通过生成剩余的文本来自动完成整段话。

from transformers import pipelinegenerator = pipeline("text-generation")
generator("In this course, we will teach you how to")

提示:

No model was supplied, defaulted to openai-community/gpt2 and revision 6c0e608 (https://hf-mirror.com/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.

输出:

[{'generated_text': 'In this course, we will teach you how to create a simple Python script that uses the default Python scripts for the following tasks, such as adding a linker at the end of a file to a file, editing an array, etc.\n\n'}]

在pipeline中使用 Hub 中的其他模型

前面的示例使用了默认模型,但您也可以从 Hub 中选择特定模型以在特定任务的pipeline中使用 - 例如,文本生成。转到模型中心(hub)并单击左侧的相应标签将会只显示该任务支持的模型。例如这样。

让我们试试 distilgpt2 模型吧!以下是如何在与以前相同的pipeline中加载它:

from transformers import pipelinegenerator = pipeline("text-generation", model="distilgpt2")
generator("In this course, we will teach you how to",max_length=30,num_return_sequences=2
)
[{'generated_text': 'In this course, we will teach you how to make your world better. Our courses focus on how to make an improvement in your life or the things'},{'generated_text': 'In this course, we will teach you how to properly design your own design using what is currently in place and using what is best in place. By'}]

Mask filling

您将尝试的下一个pipeline是 fill-mask。此任务的想法是填充给定文本中的空白:

from transformers import pipelineunmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> models.", top_k=2)
[{'score': 0.19198445975780487,'token': 30412,'token_str': ' mathematical','sequence': 'This course will teach you all about mathematical models.'},{'score': 0.04209190234541893,'token': 38163,'token_str': ' computational','sequence': 'This course will teach you all about computational models.'}]

top_k 参数控制要显示的结果有多少种。请注意,这里模型填充了特殊的< mask >词,它通常被称为掩码标记。其他掩码填充模型可能有不同的掩码标记,因此在探索其他模型时要验证正确的掩码字是什么。

命名实体识别

命名实体识别 (NER) 是一项任务,其中模型必须找到输入文本的哪些部分对应于诸如人员、位置或组织之类的实体。让我们看一个例子:

from transformers import pipelinener = pipeline("ner", grouped_entities=True)
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")
No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english ...
[{'entity_group': 'PER','score': 0.9981694,'word': 'Sylvain','start': 11,'end': 18},{'entity_group': 'ORG','score': 0.9796019,'word': 'Hugging Face','start': 33,'end': 45},{'entity_group': 'LOC','score': 0.9932106,'word': 'Brooklyn','start': 49,'end': 57}]

我们在pipeline创建函数中传递选项 grouped_entities=True 以告诉pipeline将对应于同一实体的句子部分重新组合在一起:这里模型正确地将“Hugging”和“Face”分组为一个组织,即使名称由多个词组成。

命名实体识别(中文)

运行来自 https://huggingface.co/shibing624/bert4ner-base-chinese README的代码

pip install seqeval
import os
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification
from seqeval.metrics.sequence_labeling import get_entitiesos.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("shibing624/bert4ner-base-chinese")
model = AutoModelForTokenClassification.from_pretrained("shibing624/bert4ner-base-chinese")
label_list = ['I-ORG', 'B-LOC', 'O', 'B-ORG', 'I-LOC', 'I-PER', 'B-TIME', 'I-TIME', 'B-PER']sentence = "王宏伟来自北京,是个警察,喜欢去王府井游玩儿。"def get_entity(sentence):tokens = tokenizer.tokenize(sentence)inputs = tokenizer.encode(sentence, return_tensors="pt")with torch.no_grad():outputs = model(inputs).logitspredictions = torch.argmax(outputs, dim=2)char_tags = [(token, label_list[prediction]) for token, prediction in zip(tokens, predictions[0].numpy())][1:-1]print(sentence)print(char_tags)pred_labels = [i[1] for i in char_tags]entities = []line_entities = get_entities(pred_labels)for i in line_entities:word = sentence[i[1]: i[2] + 1]entity_type = i[0]entities.append((word, entity_type))print("Sentence entity:")print(entities)get_entity(sentence)
王宏伟来自北京,是个警察,喜欢去王府井游玩儿。
[('宏', 'B-PER'), ('伟', 'I-PER'), ('来', 'I-PER'), ('自', 'O'), ('北', 'O'), ('京', 'B-LOC'), (',', 'I-LOC'), ('是', 'O'), ('个', 'O'), ('警', 'O'), ('察', 'O'), (',', 'O'), ('喜', 'O'), ('欢', 'O'), ('去', 'O'), ('王', 'O'), ('府', 'B-LOC'), ('井', 'I-LOC'), ('游', 'I-LOC'), ('玩', 'O'), ('儿', 'O')]
Sentence entity:
[('王宏伟', 'PER'), ('北京', 'LOC'), ('王府井', 'LOC')]

或者通过使用nerpy库来使用 shibing624/bert4ner-base-chinese 这个模型。

另外,可以使用的ltp来做中文命名实体识别,其Github仓库 https://github.com/HIT-SCIR/ltp 有4.9K的星

问答系统

from transformers import pipelinequestion_answerer = pipeline("question-answering")
question_answerer(question="Where do I work?",context="My name is Sylvain and I work at Hugging Face in Brooklyn",
)
{'score': 0.6949753761291504, 'start': 33, 'end': 45, 'answer': 'Hugging Face'}

请注意,此pipeline通过从提供的上下文中提取信息来工作;它不会凭空生成答案。

文本摘要

文本摘要是将文本缩减为较短文本的任务,同时保留文本中的主要(重要)信息。下面是一个例子:

from transformers import pipelinesummarizer = pipeline("summarization", device=0)
summarizer("""America has changed dramatically during recent years. Not only has the number of graduates in traditional engineering disciplines such as mechanical, civil, electrical, chemical, and aeronautical engineering declined, but in most of the premier American universities engineering curricula now concentrate on and encourage largely the study of engineering science. As a result, there are declining offerings in engineering subjects dealing with infrastructure, the environment, and related issues, and greater concentration on high technology subjects, largely supporting increasingly complex scientific developments. While the latter is important, it should not be at the expense of more traditional engineering.Rapidly developing economies such as China and India, as well as other industrial countries in Europe and Asia, continue to encourage and advance the teaching of engineering. Both China and India, respectively, graduate six and eight times as many traditional engineers as does the United States. Other industrial countries at minimum maintain their output, while America suffers an increasingly serious decline in the number of engineering graduates and a lack of well-educated engineers.
"""
)

与文本生成一样,您指定结果的 max_lengthmin_length

翻译

对于翻译,如果您在任务名称中提供语言对(例如“translation_en_to_fr”),则可以使用默认模型,但最简单的方法是在模型中心(hub)选择要使用的模型。在这里,我们将尝试从法语翻译成英语:

pip install sentencepiece
from transformers import pipelinetranslator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en", device=0)
translator("Ce cours est produit par Hugging Face.")
[{'translation_text': 'This course is produced by Hugging Face.'}]

将英语翻译成中文:

from transformers import pipelinetranslator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-zh", device=0)
translator("America has changed dramatically during recent years.")
[{'translation_text': '近年来,美国发生了巨大变化。'}]

偏见和局限性

如果您打算在正式的项目中使用经过预训练或经过微调的模型。请注意:虽然这些模型是很强大,但它们也有局限性。其中最大的一个问题是,为了对大量数据进行预训练,研究人员通常会搜集所有他们能找到的内容,中间可能夹带一些意识形态或者价值观的刻板印象。

为了快速解释清楚这个问题,让我们回到一个使用BERT模型的pipeline的例子:

from transformers import pipelineunmasker = pipeline("fill-mask", model="bert-base-uncased", device=0)
result = unmasker("This man works as a [MASK].")
print([r["token_str"] for r in result])result = unmasker("This woman works as a [MASK].")
print([r["token_str"] for r in result])
['carpenter', 'lawyer', 'farmer', 'businessman', 'doctor']
['nurse', 'maid', 'teacher', 'waitress', 'prostitute']

当要求模型填写这两句话中缺少的单词时,模型给出的答案中,只有一个与性别无关(服务员/女服务员)。


http://www.ppmy.cn/embedded/111346.html

相关文章

Linux基础---06压缩打包及解压rar压缩包

三种压缩解压工具的汇总表格如下&#xff0c;方便大家进行比较&#xff1a; 工具解释tar最常用&#xff0c; 有具体的格式要求&#xff0c;压缩后原文件不被覆盖 &#xff0c;文件以tar.gz结尾gzip使用频率较低&#xff0c;格式简单&#xff0c;压缩后原文件会被覆盖&#xff…

使用HTML

1.使用HTML的基本标签创建网页 2.使用相关的标签对文本信息进行排版 3.使用相关的图像标签&#xff0c;将图像和文本排版相结合 4.使用<a>标签创建超链接&#xff0c;锚链接以及功能性链接 使用vs code工具 HTML网络基本结构包括&#xff1a;网页头部 and 主体部分 …

如何正确使用布尔表达式

在Java编程语言中&#xff0c;布尔表达式&#xff08;Boolean Expressions&#xff09;是程序逻辑控制的核心部分。它们是用来表示“真”&#xff08;true&#xff09;或“假”&#xff08;false&#xff09;的逻辑语句&#xff0c;通常用于控制程序的执行流程&#xff0c;比如…

shader 案例学习笔记之将坐标系分成4个象限

代码&#xff1a; _st * 2.0;float index 0.0; index step(1., mod(_st.x,2.0)); index step(1., mod(_st.y,2.0))*2.0; 示意图&#xff1a; 计算左下角 计算右下角 计算左上角 计算右上角 最后结果示意&#xff1a; 坐标系被分成了4个单元格&#xff0c;每个单元格都有…

测试工程师学历路径:从功能测试到测试开发

现在软件从业者越来越多&#xff0c;测试工程师的职位也几近饱和&#xff0c;想要获得竞争力还是要保持持续学习。基本学习路径可以从功能测试-自动化测试-测试开发工程师的路子来走。 功能测试工程师&#xff1a; 1、软件测试基本概念&#xff1a; 学习软件测试的定义、目的…

Android SystemUI组件(05)状态栏-系统状态图标显示管理

该系列文章总纲链接&#xff1a;专题分纲目录 Android SystemUI组件 本章关键点总结 & 说明&#xff1a; 说明&#xff1a;本章节持续迭代之前章节的思维导图&#xff0c;主要关注下方 SystemBars分析中状态栏中的部分-系统状态图标显示&管理 即可。 1 系统状态图标显…

Machine Learning: A Probabilistic Perspective 机器学习:概率视角 PDF免费分享

下载链接在博客最底部&#xff01;&#xff01; 之前需要参考这本书&#xff0c;但是大多数博客都是收费才能下载本书。 在网上找了好久才找到免费的资源&#xff0c;浪费了不少时间&#xff0c;在此分享以节约大家的时间。 链接: https://pan.baidu.com/s/1erFsMcVR0A_xT4fx…

安卓13禁止声音调节对话框 删除音量调节对话框弹出 屏蔽音量对话框 android13

总纲 android13 rom 开发总纲说明 文章目录 1.前言2.问题分析3.代码分析3.1 方法13.2 方法24.代码修改4.1 代码修改方法14.2 代码修改方法25.编译6.彩蛋1.前言 客户需要,调整声音,不显示声音调节对话框了。我们在系统里面隐藏这个对话框。 2.问题分析 android在调整声音的…