✅ Study notes from an incoming NLP graduate student

Contents

1. Required environment
2. Introduction to pipeline
3. Using pipeline
4. Summary
5. Additional notes

● Previous article: NLP Hands-On Journey (3): Using Evaluation Metrics (Metric, with BLEU and GLUE as examples)

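1. Required environment

● Python 3.7+ and PyTorch 1.10+ are required.

● The library used in this article is Hugging Face Transformers. Official documentation: https://huggingface.co/docs/transformers/index [A very good open-source project that brings together a large collection of transformer-based models; 72.3k ⭐️ on GitHub at the time of writing.]

● To install Hugging Face Transformers with pip, run pip install transformers in a terminal. If you use conda, run conda install -c huggingface transformers instead.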
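
The version requirements above can be checked programmatically. The sketch below is only an illustration: the helper names parse_version and check_environment are made up for this note and are not part of any library. It compares versions numerically, because a plain string comparison would wrongly rank "1.9" above "1.10".

```python
import re
import sys


def parse_version(version):
    """Return the leading (major, minor) pair of a version string as integers."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def check_environment(torch_version):
    """Return True when Python is 3.7+ and the given PyTorch version is 1.10+."""
    python_ok = sys.version_info[:2] >= (3, 7)
    torch_ok = parse_version(torch_version) >= (1, 10)
    return python_ok and torch_ok


# Numeric comparison ranks 1.10 above 1.9, unlike comparing the raw strings.
print(parse_version("1.10.2+cu113") >= (1, 10))
print(parse_version("1.9.0") >= (1, 10))
```

With PyTorch installed, the check would be called as check_environment(torch.__version__).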

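2. Introduction to pipeline

● Hugging Face provides a lightweight, easy-to-use tool called pipeline, which lets us solve simple NLP tasks in a few lines. pipeline offers a simple, task-specific API for many tasks, including named entity recognition, masked language modeling, sentiment analysis, feature extraction, and question answering.

● Learning and using pipeline gives a direct, intuitive feel for what working on NLP tasks is like.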

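3. Using pipeline

3.1 Sentiment classification

● If no model is specified, pipeline automatically downloads distilbert-base-uncased-finetuned-sst-2-english into the local Hugging Face cache (~/.cache/huggingface by default in current releases; very old releases used ~/.cache/torch).

● If installing packages with pip is slow, you can configure the Tsinghua mirror first: pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple. Note that this only speeds up pip package downloads; the model itself is fetched from the Hugging Face Hub and is not affected by the pip mirror.

● Task: binary sentiment classification of a given text.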

from transformers import pipeline

my_classifier = pipeline("sentiment-analysis")
result = my_classifier("This restaurant is good")
print(result)
# The default model is English-only, so its prediction for Chinese input is not reliable.
result = my_classifier("我觉得这家餐馆不好吃")
print(result)

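● On the first run, the sentiment-analysis model is downloaded and then the classification is performed. Each call returns a list containing one dict with the keys label and score.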
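
To avoid relying on whatever default pipeline picks, the model can be named explicitly. The following is a minimal sketch: build_classifier and to_signed_score are hypothetical helper names, and the example prediction is written by hand only to show the result shape; it is not real model output.

```python
DEFAULT_MODEL = "distilbert-base-uncased-finetuned-sst-2-english"


def build_classifier(model_name=DEFAULT_MODEL):
    """Create a sentiment pipeline with an explicitly named checkpoint."""
    # Imported lazily so the helper below also works without transformers installed.
    from transformers import pipeline
    return pipeline("sentiment-analysis", model=model_name)


def to_signed_score(prediction):
    """Map a {'label', 'score'} dict to [-1, 1]; negative values mean NEGATIVE."""
    sign = 1.0 if prediction["label"] == "POSITIVE" else -1.0
    return sign * prediction["score"]


# Hypothetical prediction, written by hand to illustrate the shape only.
example = {"label": "NEGATIVE", "score": 0.9}
print(to_signed_score(example))  # prints -0.9
```

Calling build_classifier() downloads the model on first use, exactly as the default pipeline does.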

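3.2 Fill-mask (cloze)

● If no model is specified, pipeline automatically downloads distilroberta-base into the local Hugging Face cache (~/.cache/huggingface by default in current releases).

● Task: the model fills in the <mask> position; the score is the probability the model assigns to that word.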

from transformers import pipeline
from pprint import pprint

my_unmasker = pipeline("fill-mask")
sentence = 'HuggingFace is creating a <mask> that the community uses to solve NLP tasks.'
result = my_unmasker(sentence)
pprint(result)

Output:
[{'sequence': 'HuggingFace is creating a tool that the community uses to solve NLP tasks.',
  'score': 0.17927534878253937,
  'token': 3944,
  'token_str': ' tool'},
 {'sequence': 'HuggingFace is creating a framework that the community uses to solve NLP tasks.',
  'score': 0.11349416524171829,
  'token': 7208,
  'token_str': ' framework'},
 {'sequence': 'HuggingFace is creating a library that the community uses to solve NLP tasks.',
  'score': 0.05243571847677231,
  'token': 5560,
  'token_str': ' library'},
 {'sequence': 'HuggingFace is creating a database that the community uses to solve NLP tasks.',
  'score': 0.034935351461172104,
  'token': 8503,
  'token_str': ' database'},
 {'sequence': 'HuggingFace is creating a prototype that the community uses to solve NLP tasks.',
  'score': 0.028602460399270058,
  'token': 17715,
  'token_str': ' prototype'}]
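
Because each score is a probability over the whole vocabulary, the five candidates returned do not sum to 1. The short sketch below reuses the scores from the output above (no model is needed) to pick the best candidate and to show how much probability mass the top five cover.

```python
# Scores copied from the fill-mask output above.
candidates = [
    {"token_str": " tool", "score": 0.17927534878253937},
    {"token_str": " framework", "score": 0.11349416524171829},
    {"token_str": " library", "score": 0.05243571847677231},
    {"token_str": " database", "score": 0.034935351461172104},
    {"token_str": " prototype", "score": 0.028602460399270058},
]

# The most likely filler is the candidate with the highest score.
best = max(candidates, key=lambda c: c["score"])
# Probability mass covered by the five returned candidates.
top5_mass = sum(c["score"] for c in candidates)

print(best["token_str"].strip())  # prints tool
print(round(top5_mass, 4))        # prints 0.4087
```

The remaining mass, roughly 0.59, is spread over all the other words in the vocabulary.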

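3.3 Text generation

● If no model is specified, pipeline automatically downloads gpt2 into the local Hugging Face cache (~/.cache/huggingface by default in current releases).

● Task: given a sentence or passage as a prompt, the model continues the text. max_length caps the total length in tokens, prompt included.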

from transformers import pipeline

text_generator = pipeline("text-generation")
result = text_generator(
    "As far as I am concerned, I will",
    max_length=50,    # total length in tokens, prompt included
    do_sample=False,  # deterministic decoding instead of random sampling
)
print(result)

Output:
[{'generated_text': 'As far as I am concerned, I will be the first to admit that I am not a fan of the idea of a "free market." I think that the idea of a free market is a bit of a stretch. I think that the idea'}]
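
With do_sample=False and a single beam, decoding is greedy: the most likely next token is chosen at every step, which is why the output is reproducible and tends to repeat phrases such as "I think that the idea". The toy sketch below contrasts greedy selection with sampling. The next-token distribution is invented purely for illustration and does not come from gpt2; pick_greedy and pick_sampled are hypothetical names.

```python
import random

# Hypothetical next-token distribution, invented for illustration only.
next_token_probs = {" be": 0.5, " go": 0.3, " try": 0.2}


def pick_greedy(probs):
    """Greedy decoding step: always return the most probable token."""
    return max(probs, key=probs.get)


def pick_sampled(probs, rng):
    """Sampling step: draw one token in proportion to its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so repeated runs draw the same samples
greedy_picks = {pick_greedy(next_token_probs) for _ in range(10)}
sampled_picks = [pick_sampled(next_token_probs, rng) for _ in range(10)]

print(greedy_picks)   # a single token, no matter how many times we ask
print(sampled_picks)  # a mix of tokens, weighted by probability
```

In the pipeline call, passing do_sample=True switches from the first behavior to the second.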

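3.4 Named entity recognition

● If no model is specified, pipeline automatically downloads dbmdz/bert-large-cased-finetuned-conll03-english into the local Hugging Face cache (~/.cache/huggingface by default in current releases).

● Task: given a passage, the model identifies the named entities in it, such as person names, place names, city names, and company names.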

from transformers import pipeline

ner_pipe = pipeline("ner")
sequence = """Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO,
therefore very close to the Manhattan Bridge which is visible from the window."""
for entity in ner_pipe(sequence):
    print(entity)

Output:
{'entity': 'I-ORG', 'score': 0.9995786, 'index': 1, 'word': 'Hu', 'start': 0, 'end': 2}
{'entity': 'I-ORG', 'score': 0.9909764, 'index': 2, 'word': '##gging', 'start': 2, 'end': 7}
{'entity': 'I-ORG', 'score': 0.9982225, 'index': 3, 'word': 'Face', 'start': 8, 'end': 12}
{'entity': 'I-ORG', 'score': 0.999488, 'index': 4, 'word': 'Inc', 'start': 13, 'end': 16}
{'entity': 'I-LOC', 'score': 0.9994345, 'index': 11, 'word': 'New', 'start': 40, 'end': 43}
{'entity': 'I-LOC', 'score': 0.9993196, 'index': 12, 'word': 'York', 'start': 44, 'end': 48}
{'entity': 'I-LOC', 'score': 0.9993794, 'index': 13, 'word': 'City', 'start': 49, 'end': 53}
{'entity': 'I-LOC', 'score': 0.98625815, 'index': 19, 'word': 'D', 'start': 79, 'end': 80}
{'entity': 'I-LOC', 'score': 0.9514269, 'index': 20, 'word': '##UM', 'start': 80, 'end': 82}
{'entity': 'I-LOC', 'score': 0.9336589, 'index': 21, 'word': '##BO', 'start': 82, 'end': 84}
{'entity': 'I-LOC', 'score': 0.97616535, 'index': 28, 'word': 'Manhattan', 'start': 114, 'end': 123}
{'entity': 'I-LOC', 'score': 0.9914629, 'index': 29, 'word': 'Bridge', 'start': 124, 'end': 130}
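
The raw output is one entry per word piece (pieces starting with ## continue the previous piece), so a name like "Hugging Face Inc" arrives in four fragments. The sketch below merges adjacent fragments with the same label using the start and end offsets, reusing some of the tokens from the output above. merge_tokens is a hypothetical helper written for this note, not a library function.

```python
# A subset of the tokens from the NER output above (scores omitted).
tokens = [
    {"entity": "I-ORG", "word": "Hu", "start": 0, "end": 2},
    {"entity": "I-ORG", "word": "##gging", "start": 2, "end": 7},
    {"entity": "I-ORG", "word": "Face", "start": 8, "end": 12},
    {"entity": "I-ORG", "word": "Inc", "start": 13, "end": 16},
    {"entity": "I-LOC", "word": "New", "start": 40, "end": 43},
    {"entity": "I-LOC", "word": "York", "start": 44, "end": 48},
    {"entity": "I-LOC", "word": "City", "start": 49, "end": 53},
    {"entity": "I-LOC", "word": "D", "start": 79, "end": 80},
    {"entity": "I-LOC", "word": "##UM", "start": 80, "end": 82},
    {"entity": "I-LOC", "word": "##BO", "start": 82, "end": 84},
]


def merge_tokens(tokens):
    """Merge same-label tokens that touch or are one character apart."""
    merged = []
    for tok in tokens:
        label = tok["entity"].split("-")[-1]  # "I-ORG" -> "ORG"
        word = tok["word"][2:] if tok["word"].startswith("##") else tok["word"]
        last = merged[-1] if merged else None
        if last and last["label"] == label and tok["start"] - last["end"] <= 1:
            # Insert a space only when there is a gap between the two pieces.
            sep = " " if tok["start"] > last["end"] else ""
            last["text"] += sep + word
            last["end"] = tok["end"]
        else:
            merged.append({"label": label, "text": word,
                           "start": tok["start"], "end": tok["end"]})
    return merged


entities = [(e["label"], e["text"]) for e in merge_tokens(tokens)]
print(entities)
# prints [('ORG', 'Hugging Face Inc'), ('LOC', 'New York City'), ('LOC', 'DUMBO')]
```

Recent versions of transformers can do similar grouping inside the pipeline through the aggregation_strategy argument (for example "simple"); whether it is available depends on the installed version.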

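3.5 Summarization

● If no model is specified, pipeline automatically downloads sshleifer/distilbart-cnn-12-6 into the local Hugging Face cache (~/.cache/huggingface by default in current releases).

● Task: condense a long article into a short summary; max_length and min_length bound the length of the summary in tokens.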

from transformers import pipeline

summarizer = pipeline("summarization")
ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney\'s Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s
Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
If convicted, Barrientos faces up to four years in prison.  Her next court appearance is scheduled for May 18.
"""

result = summarizer(ARTICLE, max_length=130, min_length=30, do_sample=False)
print(result)

Output:
[{'summary_text': ' Liana Barrientos, 39, is charged with two counts of "offering a false instrument for filing in the first degree" In total, she has been married 10 times, with nine of her marriages occurring between 1999 and 2002 . At one time, she was married to eight men at once, prosecutors say .'}]

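3.6 Translation

● If no model is specified, pipeline automatically downloads t5-base into the local Hugging Face cache (~/.cache/huggingface by default in current releases).

● Task: translate an English sentence into German.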

from transformers import pipeline

# Translate from English to German
translator = pipeline("translation_en_to_de")
sentence = "Hugging Face is a technology company based in New York and Paris"
result = translator(sentence, max_length=40)
print(result)

Output:
[{'translation_text': 'Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.'}]
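
The task string translation_en_to_de follows the pattern translation_xx_to_yy, where xx is the source language and yy the target. T5 was trained to translate from English into German, French, and Romanian, so the same pattern covers those three targets. The sketch below is an illustration only; translation_task and build_translator are hypothetical helper names.

```python
# Target languages T5 was trained to translate into from English.
T5_TARGETS = {"de", "fr", "ro"}


def translation_task(source, target):
    """Build a pipeline task name of the form translation_xx_to_yy."""
    return f"translation_{source}_to_{target}"


def build_translator(target):
    """Create an English-to-target translator backed by t5-base."""
    if target not in T5_TARGETS:
        raise ValueError(f"t5-base was not trained for en -> {target}")
    # Imported lazily so the checks above run without transformers installed.
    from transformers import pipeline
    return pipeline(translation_task("en", target), model="t5-base")


print(translation_task("en", "fr"))  # prints translation_en_to_fr
```

Other language pairs need a different checkpoint, passed through the model argument.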

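3.7 Reading comprehension (question answering)

● If no model is specified, pipeline automatically downloads distilbert-base-cased-distilled-squad into the local Hugging Face cache (~/.cache/huggingface by default in current releases).

● This code may fail to run in some environments, possibly because of a bug in the default model.

● Task: given a passage and a question about it, the model extracts the corresponding answer from the passage.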

from transformers import pipeline

question_answerer = pipeline("question-answering")
# The r prefix makes this a raw string, so backslashes are kept literally instead of being read as escape sequences.
context = r"""
Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a 
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune 
a model on a SQuAD task, you may leverage the examples/pytorch/question-answering/run_squad.py script.
"""
result = question_answerer(question="What is extractive question answering?",
                           context=context)
print(result)
result = question_answerer(
    question="What is a good example of a question answering dataset?",
    context=context)
print(result)

Output:
{'score': 0.6177279353141785, 'start': 34, 'end': 95, 'answer': 'the task of extracting an answer from a text given a question'}
{'score': 0.5152313113212585, 'start': 148, 'end': 161, 'answer': 'SQuAD dataset'}
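
The start and end values are character offsets into context, so the answer is always a literal slice of the passage. This can be verified without running the model. The sketch below uses only the first sentence of the context and the first result shown above; the leading newline is kept because the offsets are counted from it.

```python
# First sentence of the context above, including its leading newline.
context = (
    "\n"
    "Extractive Question Answering is the task of extracting an answer "
    "from a text given a question."
)

# First result copied from the output above.
result = {
    "score": 0.6177279353141785,
    "start": 34,
    "end": 95,
    "answer": "the task of extracting an answer from a text given a question",
}

# Slicing the context with the returned offsets reproduces the answer text.
span = context[result["start"]:result["end"]]
print(span == result["answer"])  # prints True
```

This is what "extractive" means here: the model selects a span of the passage rather than generating new text.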

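4. Summary

● This installment is not the main focus of the series, but it gives a direct feel for how accessible NLP has become. Deeper exploration follows in later articles. Keep going!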

5. Additional notes