Hugging Face NLP课程学习记录 - 0. 安装transformers库 1. Transformer 模型

server/2024/9/22 3:17:53/

transformers__1_Transformer__0">Hugging Face NLP课程学习记录 - 0. 安装transformers库 & 1. Transformer 模型

说明:

  • 首次发表日期:2024-09-14
  • 官网: https://huggingface.co/learn/nlp-course/zh-CN/chapter1
  • 关于: 阅读并记录一下,只保留重点部分,大多从原文摘录,润色一下原文

transformers_8">0. 安装transformers库

创建conda环境并安装包:

conda create -n hfnlp python=3.12
conda install pytorch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install transformers==4.44.2# More
pip install seqeval
pip install sentencepiece

使用Hugging Face镜像(见 https://hf-mirror.com/ ):

export HF_ENDPOINT=https://hf-mirror.com

或者在python中设置Hugging Face镜像:

import os
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

1. Transformer 模型

Transformers 能做什么?

使用pipelines

Transformers 库中最基本的对象是 pipeline() 函数。它将模型与其必要的预处理和后处理步骤连接起来,使我们能够通过直接输入任何文本并获得最终的答案:

from transformers import pipelineclassifier = pipeline("sentiment-analysis")
classifier("I've been waiting for a HuggingFace course my whole life.")

提示:

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b ....
[{'label': 'POSITIVE', 'score': 0.9598047137260437}]

输出:

[{'label': 'POSITIVE', 'score': 0.9598047137260437}]

输入多个句子:

classifier(["I've been waiting for a HuggingFace course my whole life.", "I hate this so much!"]
)
[{'label': 'POSITIVE', 'score': 0.9598047137260437},{'label': 'NEGATIVE', 'score': 0.9994558691978455}]

将一些文本传递到pipeline时涉及三个主要步骤:

  1. 文本被预处理为模型可以理解的格式。
  2. 预处理的输入被传递给模型。
  3. 模型处理后输出最终人类可以理解的结果。

零样本分类

from transformers import pipelineclassifier = pipeline("zero-shot-classification")
classifier("This is a course about the Transformers library",candidate_labels=["education", "politics", "business"],
)

提示:

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 ([https://hf-mirror.com/facebook/bart-large-mnli](https://hf-mirror.com/facebook/bart-large-mnli)).
Using a pipeline without specifying a model name and revision in production is not recommended.

输出:

{'sequence': 'This is a course about the Transformers library', 'labels': ['education', 'business', 'politics'], 'scores': [0.8445952534675598, 0.11197696626186371, 0.043427806347608566]}

此pipeline称为zero-shot,因为您不需要对数据上的模型进行微调即可使用它

文本分类

现在让我们看看如何使用pipeline来生成一些文本。这里的主要使用方法是您提供一个提示,模型将通过生成剩余的文本来自动完成整段话。

from transformers import pipelinegenerator = pipeline("text-generation")
generator("In this course, we will teach you how to")

提示:

No model was supplied, defaulted to openai-community/gpt2 and revision 6c0e608 (https://hf-mirror.com/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.

输出:

[{'generated_text': 'In this course, we will teach you how to create a simple Python script that uses the default Python scripts for the following tasks, such as adding a linker at the end of a file to a file, editing an array, etc.\n\n'}]

在pipeline中使用 Hub 中的其他模型

前面的示例使用了默认模型,但您也可以从 Hub 中选择特定模型以在特定任务的pipeline中使用 - 例如,文本生成。转到模型中心(hub)并单击左侧的相应标签将会只显示该任务支持的模型。例如这样。

让我们试试 distilgpt2 模型吧!以下是如何在与以前相同的pipeline中加载它:

from transformers import pipelinegenerator = pipeline("text-generation", model="distilgpt2")
generator("In this course, we will teach you how to",max_length=30,num_return_sequences=2
)
[{'generated_text': 'In this course, we will teach you how to make your world better. Our courses focus on how to make an improvement in your life or the things'},{'generated_text': 'In this course, we will teach you how to properly design your own design using what is currently in place and using what is best in place. By'}]

Mask filling

您将尝试的下一个pipeline是 fill-mask。此任务的想法是填充给定文本中的空白:

from transformers import pipelineunmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> models.", top_k=2)
[{'score': 0.19198445975780487,'token': 30412,'token_str': ' mathematical','sequence': 'This course will teach you all about mathematical models.'},{'score': 0.04209190234541893,'token': 38163,'token_str': ' computational','sequence': 'This course will teach you all about computational models.'}]

top_k 参数控制要显示的结果有多少种。请注意,这里模型填充了特殊的< mask >词,它通常被称为掩码标记。其他掩码填充模型可能有不同的掩码标记,因此在探索其他模型时要验证正确的掩码字是什么。

命名实体识别

命名实体识别 (NER) 是一项任务,其中模型必须找到输入文本的哪些部分对应于诸如人员、位置或组织之类的实体。让我们看一个例子:

from transformers import pipelinener = pipeline("ner", grouped_entities=True)
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")
No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english ...
[{'entity_group': 'PER','score': 0.9981694,'word': 'Sylvain','start': 11,'end': 18},{'entity_group': 'ORG','score': 0.9796019,'word': 'Hugging Face','start': 33,'end': 45},{'entity_group': 'LOC','score': 0.9932106,'word': 'Brooklyn','start': 49,'end': 57}]

我们在pipeline创建函数中传递选项 grouped_entities=True 以告诉pipeline将对应于同一实体的句子部分重新组合在一起:这里模型正确地将“Hugging”和“Face”分组为一个组织,即使名称由多个词组成。

命名实体识别(中文)

运行来自 https://huggingface.co/shibing624/bert4ner-base-chinese README的代码

pip install seqeval
import os
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification
from seqeval.metrics.sequence_labeling import get_entitiesos.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("shibing624/bert4ner-base-chinese")
model = AutoModelForTokenClassification.from_pretrained("shibing624/bert4ner-base-chinese")
label_list = ['I-ORG', 'B-LOC', 'O', 'B-ORG', 'I-LOC', 'I-PER', 'B-TIME', 'I-TIME', 'B-PER']sentence = "王宏伟来自北京,是个警察,喜欢去王府井游玩儿。"def get_entity(sentence):tokens = tokenizer.tokenize(sentence)inputs = tokenizer.encode(sentence, return_tensors="pt")with torch.no_grad():outputs = model(inputs).logitspredictions = torch.argmax(outputs, dim=2)char_tags = [(token, label_list[prediction]) for token, prediction in zip(tokens, predictions[0].numpy())][1:-1]print(sentence)print(char_tags)pred_labels = [i[1] for i in char_tags]entities = []line_entities = get_entities(pred_labels)for i in line_entities:word = sentence[i[1]: i[2] + 1]entity_type = i[0]entities.append((word, entity_type))print("Sentence entity:")print(entities)get_entity(sentence)
王宏伟来自北京,是个警察,喜欢去王府井游玩儿。
[('宏', 'B-PER'), ('伟', 'I-PER'), ('来', 'I-PER'), ('自', 'O'), ('北', 'O'), ('京', 'B-LOC'), (',', 'I-LOC'), ('是', 'O'), ('个', 'O'), ('警', 'O'), ('察', 'O'), (',', 'O'), ('喜', 'O'), ('欢', 'O'), ('去', 'O'), ('王', 'O'), ('府', 'B-LOC'), ('井', 'I-LOC'), ('游', 'I-LOC'), ('玩', 'O'), ('儿', 'O')]
Sentence entity:
[('王宏伟', 'PER'), ('北京', 'LOC'), ('王府井', 'LOC')]

或者通过使用nerpy库来使用 shibing624/bert4ner-base-chinese 这个模型。

另外,可以使用的ltp来做中文命名实体识别,其Github仓库 https://github.com/HIT-SCIR/ltp 有4.9K的星

问答系统

from transformers import pipelinequestion_answerer = pipeline("question-answering")
question_answerer(question="Where do I work?",context="My name is Sylvain and I work at Hugging Face in Brooklyn",
)
{'score': 0.6949753761291504, 'start': 33, 'end': 45, 'answer': 'Hugging Face'}

请注意,此pipeline通过从提供的上下文中提取信息来工作;它不会凭空生成答案。

文本摘要

文本摘要是将文本缩减为较短文本的任务,同时保留文本中的主要(重要)信息。下面是一个例子:

from transformers import pipelinesummarizer = pipeline("summarization", device=0)
summarizer("""America has changed dramatically during recent years. Not only has the number of graduates in traditional engineering disciplines such as mechanical, civil, electrical, chemical, and aeronautical engineering declined, but in most of the premier American universities engineering curricula now concentrate on and encourage largely the study of engineering science. As a result, there are declining offerings in engineering subjects dealing with infrastructure, the environment, and related issues, and greater concentration on high technology subjects, largely supporting increasingly complex scientific developments. While the latter is important, it should not be at the expense of more traditional engineering.Rapidly developing economies such as China and India, as well as other industrial countries in Europe and Asia, continue to encourage and advance the teaching of engineering. Both China and India, respectively, graduate six and eight times as many traditional engineers as does the United States. Other industrial countries at minimum maintain their output, while America suffers an increasingly serious decline in the number of engineering graduates and a lack of well-educated engineers.
"""
)

与文本生成一样,您指定结果的 max_lengthmin_length

翻译

对于翻译,如果您在任务名称中提供语言对(例如“translation_en_to_fr”),则可以使用默认模型,但最简单的方法是在模型中心(hub)选择要使用的模型。在这里,我们将尝试从法语翻译成英语:

pip install sentencepiece
from transformers import pipelinetranslator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en", device=0)
translator("Ce cours est produit par Hugging Face.")
[{'translation_text': 'This course is produced by Hugging Face.'}]

将英语翻译成中文:

from transformers import pipelinetranslator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-zh", device=0)
translator("America has changed dramatically during recent years.")
[{'translation_text': '近年来,美国发生了巨大变化。'}]

偏见和局限性

如果您打算在正式的项目中使用经过预训练或经过微调的模型。请注意:虽然这些模型是很强大,但它们也有局限性。其中最大的一个问题是,为了对大量数据进行预训练,研究人员通常会搜集所有他们能找到的内容,中间可能夹带一些意识形态或者价值观的刻板印象。

为了快速解释清楚这个问题,让我们回到一个使用BERT模型的pipeline的例子:

from transformers import pipelineunmasker = pipeline("fill-mask", model="bert-base-uncased", device=0)
result = unmasker("This man works as a [MASK].")
print([r["token_str"] for r in result])result = unmasker("This woman works as a [MASK].")
print([r["token_str"] for r in result])
['carpenter', 'lawyer', 'farmer', 'businessman', 'doctor']
['nurse', 'maid', 'teacher', 'waitress', 'prostitute']

当要求模型填写这两句话中缺少的单词时,模型给出的答案中,只有一个与性别无关(服务员/女服务员)。


http://www.ppmy.cn/server/120097.html

相关文章

java和kotlin版本对照表

Java 和 Kotlin 是两种广泛使用的编程语言&#xff0c;特别是在 Android 开发领域。虽然它们有不同的语法和特性&#xff0c;但它们在很多方面是可以互操作的&#xff0c;尤其是在同一个项目中使用时。了解 Java 和 Kotlin 的版本对应关系可以帮助开发者更好地进行跨语言开发和…

React【1】【ref常用法】

文章目录 前言用途1. 储存2. 储存dom句柄ref 前言 react组件每次调用setState的时候都会重新执行函数组件或者class组件 用途 1. 储存 每次调用setState时&#xff0c;组件函数都会重新执行。下面这种情况点击提交后&#xff0c;再点击取消&#xff0c;会发现定时器trimId1仍…

机器翻译之数据处理

目录 1.导包 2.读取本地数据 3.定义函数&#xff1a;数据预处理 4.定义函数&#xff1a;词元化 5.统计每句话的长度的分布情况 6. 获取词汇表 7. 截断或者填充文本序列 8.将机器翻译的文本序列转换成小批量tensor 9.加载数据 10.知识点个人理解 1.导包 #导包 import o…

linux 基础知识 什么是僵尸进程?有什么影响?如何解决?

linux 系统僵尸进程 在Linux系统中&#xff0c;僵尸进程&#xff08;Zombie Process&#xff09;是一种特殊的进程状态&#xff0c;它指的是一个已经完成执行的进程&#xff0c;其父进程尚未通过wait()或waitpid()系统调用来回收其资源和状态信息。 僵尸进程本身并不占用CPU和…

智能自行车码表:基于2605C语音芯片的创新开发方案

一、开发背景 随着科技的飞速发展和人们对健康生活的追求&#xff0c;自行车骑行已成为一种广受欢迎的绿色出行方式。智能自行车码表作为骑行者的得力助手&#xff0c;不仅记录骑行数据&#xff0c;还逐渐融入了更多智能化功能。然而&#xff0c;传统码表在语音提示、多语种支持…

使用python-pptx将PPT转换为图片:将每张幻灯片保存为单独的图片文件

哈喽,大家好,我是木头左! 本文将详细介绍如何使用python-pptx将PPT的每一张幻灯片保存为单独的图片文件。 安装python-pptx库 需要确保已经安装了python-pptx库。可以通过以下命令使用pip进行安装: pip install python-pptx导入所需库 接下来,需要导入一些必要的库,包…

深入解析 JVM 运行时数据区:实战与面试指南

Java 虚拟机 (JVM) 是 Java 开发者的核心工具之一&#xff0c;它不仅负责执行 Java 字节码&#xff0c;而且还管理着应用程序运行时的数据存储。在本文中&#xff0c;我们将继续深入探讨 JVM 的运行时数据区&#xff0c;并通过实际案例和常见面试问题来帮助读者更好地理解和应用…

面试突击-多线程和IO专题(至尊典藏版)

多线程和IO专题 一、多线程专题 1.介绍下进程和线程的关系 进程:一个独立的正在执行的程序 线程:一个进程的最基本的执行单位,执行路径 多进程:在操作系统中,同时运行多个程序 多进程的好处:可以充分利用CPU,提高CPU的使用率 多线程:在同一个进程(应用程序)中同时…