Transformers Quick Start

Author: huggingface | Compiled by: VK | Source: Github

Philosophy

Transformers is an opinionated library built for NLP researchers seeking to use, study, and extend large transformer models.

The library was designed with two strong goals in mind:

  • Be as easy and fast to use as possible:
    • We strongly limit the number of user-facing abstractions to learn; in fact, there are almost none. Each model only requires three standard classes: a configuration, a model and a tokenizer,
    • All of these classes can be initialized from pretrained instances in a simple and unified way using a common from_pretrained() method, which takes care of downloading (from the library or from a location supplied by the user), caching and loading the related class instance, and also lets you save your own models.
    • As a consequence, this library is NOT a modular toolbox of building blocks for neural nets. If you want to extend or build upon it, simply use regular Python/PyTorch modules and inherit from the library's base classes to reuse functionality such as model loading/saving.
  • Provide state-of-the-art models whose performance is as close as possible to the original models:
    • We provide at least one example for each architecture which reproduces a result published by the official authors of that architecture,
    • The code is usually as close as possible to the original code base, which means some PyTorch code may not be as "pytorchic" as it could be, as a result of being converted from TensorFlow code.

A few other goals:

  • Expose the models' internals as consistently as possible:
    • We give access to the full hidden states and attention weights using a single API,
    • The tokenizer and base-model APIs are standardized to make it easy to switch between models.
  • Incorporate a subjective selection of promising tools for fine-tuning and investigating these models:
    • A simple and consistent way to add new tokens to the vocabulary and to the embeddings for fine-tuning (this and the hidden states/attentions access above are illustrated in the short sketch after this list),
    • Simple ways to mask and prune transformer heads.
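
As a minimal sketch of two of the points above, accessing all hidden states and attention weights through a single API and adding new tokens to the vocabulary, here is what this looks like with the BERT classes used in the quick start below (the example sentence and the 'new_tok1'/'new_tok2' tokens are made up for illustration):

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Single API for the internals: ask the model to also return all hidden states and attention weights
model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True, output_attentions=True)
model.eval()

input_ids = torch.tensor([tokenizer.encode("Who was Jim Henson ?")])
with torch.no_grad():
    outputs = model(input_ids)
# The output tuple then contains (last hidden state, pooler output, all hidden states, all attention weights)
all_hidden_states, all_attentions = outputs[-2], outputs[-1]

# Consistent way to add new tokens to the vocabulary and resize the embeddings accordingly
num_added_tokens = tokenizer.add_tokens(['new_tok1', 'new_tok2'])
model.resize_token_embeddings(len(tokenizer))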

Main concepts

The library is built around three types of classes for each model:

  • Model classes, such as BertModel, which are PyTorch models (torch.nn.Module) for the eight model architectures currently provided in the library,
  • Configuration classes, such as BertConfig, which store all the parameters required to build a model. You don't always need to instantiate these yourself; in particular, if you are using a pretrained model without modification, creating the model automatically takes care of instantiating the configuration (which is part of the model),
  • Tokenizer classes, such as BertTokenizer, which store the vocabulary of each model and provide methods for encoding/decoding strings into lists of token embedding indices to be fed to a model.

All of these classes can be instantiated from pretrained instances and saved locally using two methods:

  • from_pretrained() lets you instantiate a model/configuration/tokenizer from a pretrained version, either provided by the library itself (currently 27 models, listed here) or stored locally (or on a server) by the user,
  • save_pretrained() lets you save a model/configuration/tokenizer locally so that it can be reloaded using from_pretrained().
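
As a minimal sketch, here is what this loading/saving round trip looks like with the BERT classes (the './my_saved_model' directory name is arbitrary):

import os
from transformers import BertConfig, BertModel, BertTokenizer

# Instantiate the three classes from a pretrained checkpoint provided by the library
config = BertConfig.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Save the model (with its configuration) and the tokenizer locally
os.makedirs('./my_saved_model', exist_ok=True)
model.save_pretrained('./my_saved_model')
tokenizer.save_pretrained('./my_saved_model')

# Reload them from the local directory with the same from_pretrained() method
tokenizer = BertTokenizer.from_pretrained('./my_saved_model')
model = BertModel.from_pretrained('./my_saved_model')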

We will conclude this quickstart by going through a few simple examples to see how we can instantiate and use these classes. The rest of the documentation is organized in two parts:

  • The main classes section details the common functionality/methods/attributes of the three main classes (configuration, model, tokenizer), as well as some optimization classes provided as training utilities,
  • The package reference section describes in detail all the variants of each class for each model architecture, in particular the inputs they expect and the outputs they return when you call them.

Quick-Start:

Here are two examples showcasing a few of the classes with the BERT and GPT-2 pretrained models.

See the full API reference for examples of each model class.

BERT example

Let's start by preparing a tokenized input (a list of token embedding indices to be fed to BERT) from a text string using BertTokenizer:

import torch
from transformers import BertTokenizer, BertModel, BertForMaskedLM

# OPTIONAL: if you want more information on what's happening under the hood, activate the logger as follows
import logging
logging.basicConfig(level=logging.INFO)

# Load the pretrained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenize the input
text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
tokenized_text = tokenizer.tokenize(text)

# Mask a token that we will try to predict back with `BertForMaskedLM`
masked_index = 8
tokenized_text[masked_index] = '[MASK]'
assert tokenized_text == ['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]', 'jim', '[MASK]', 'was', 'a', 'puppet', '##eer', '[SEP]']

# Convert tokens to vocabulary indices
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
# Define the sentence A and B indices associated with the 1st and 2nd sentences (see the paper)
segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

# Convert inputs to PyTorch tensors
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])

Let's see how to use BertModel to encode our input into hidden states:

# Load the pretrained model (weights)
model = BertModel.from_pretrained('bert-base-uncased')

# Set the model to evaluation mode
# This is important to get reproducible results during evaluation!
model.eval()

# If you have a GPU, put everything on cuda
tokens_tensor = tokens_tensor.to('cuda')
segments_tensors = segments_tensors.to('cuda')
model.to('cuda')

# Predict the hidden-state features for each layer
with torch.no_grad():
    # See the models' docstrings for the details of the inputs
    outputs = model(tokens_tensor, token_type_ids=segments_tensors)
    # Transformers models always output tuples.
    # See the models' docstrings for the details of all the outputs. In our case, the first element is the hidden state of the last layer of the BERT model
    encoded_layers = outputs[0]
# We have encoded our input sequence in a FloatTensor of shape (batch size, sequence length, model hidden dimension)
assert tuple(encoded_layers.shape) == (1, len(indexed_tokens), model.config.hidden_size)

And how to use BertForMaskedLM to predict the masked token:

# Load the pretrained model (weights)
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()

# If you have a GPU, put everything on cuda
tokens_tensor = tokens_tensor.to('cuda')
segments_tensors = segments_tensors.to('cuda')
model.to('cuda')

# Predict all tokens
with torch.no_grad():
    outputs = model(tokens_tensor, token_type_ids=segments_tensors)
    predictions = outputs[0]

# Confirm that we can predict 'henson'
predicted_index = torch.argmax(predictions[0, masked_index]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
assert predicted_token == 'henson'

OpenAI GPT-2

Here is a quick-start example using the GPT2Tokenizer and GPT2LMHeadModel classes with OpenAI's pretrained GPT-2 model to predict the next token in a text prompt.

First, let's encode the text input using GPT2Tokenizer:

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# OPTIONAL: if you want more information on what's happening under the hood, activate the logger as follows
import logging
logging.basicConfig(level=logging.INFO)

# Load the pretrained model tokenizer (vocabulary)
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Encode the input
text = "Who was Jim Henson ? Jim Henson was a"
indexed_tokens = tokenizer.encode(text)

# Convert to a PyTorch tensor
tokens_tensor = torch.tensor([indexed_tokens])

Let's see how to use GPT2LMHeadModel to generate the next token following our text:

# Load the pretrained model (weights)
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Set the model to evaluation mode
# This is important to get reproducible results during evaluation!
model.eval()

# If you have a GPU, put everything on cuda
tokens_tensor = tokens_tensor.to('cuda')
model.to('cuda')

# Predict all tokens
with torch.no_grad():
    outputs = model(tokens_tensor)
    predictions = outputs[0]

# Get the predicted next sub-word (in our case, the word 'man')
predicted_index = torch.argmax(predictions[0, -1, :]).item()
predicted_text = tokenizer.decode(indexed_tokens + [predicted_index])
assert predicted_text == 'Who was Jim Henson? Jim Henson was a man'

Examples for each model class of each model architecture (Bert, GPT, GPT-2, Transformer-XL, XLNet and XLM) can be found in the documentation.

Using the past with GPT-2

GPT-2, as well as some other models (GPT, XLNet, Transfo-XL, CTRL), makes use of a past or mems attribute which can be used to avoid re-computing the key/value pairs of the attention mechanism when using sequential decoding. It is useful when generating sequences, as a large part of the attention mechanism benefits from previous computations.

The following is a fully working example using the past with GPT2LMHeadModel and argmax decoding (which should only be used as an example, since argmax decoding introduces a lot of repetition):

from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained('gpt2')

generated = tokenizer.encode("The Manhattan bridge")
context = torch.tensor([generated])
past = None

for i in range(100):
    print(i)
    # The model returns the logits for the next token and the updated past key/value pairs
    output, past = model(context, past=past)
    token = torch.argmax(output[..., -1, :])

    generated += [token.tolist()]
    # Thanks to past, we only need to feed the newly generated token as input
    context = token.unsqueeze(0)

sequence = tokenizer.decode(generated)

print(sequence)

Since the key/value pairs of all previous tokens are contained in past, the model only needs a single token as input.

Model2Model example

Encoder-decoder architectures require two tokenized inputs: one for the encoder and one for the decoder. Let's assume that we want to use Model2Model for generative question answering, and start by tokenizing the question and the answer that will be fed to the model.

import torch
from transformers import BertTokenizer, Model2Model

# OPTIONAL: if you want more information on what's happening under the hood, activate the logger as follows
import logging
logging.basicConfig(level=logging.INFO)

# Load the pretrained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Encode the input (question)
question = "Who was Jim Henson?"
encoded_question = tokenizer.encode(question)

# Encode the input (answer)
answer = "Jim Henson was a puppeteer"
encoded_answer = tokenizer.encode(answer)

# Convert inputs to PyTorch tensors
question_tensor = torch.tensor([encoded_question])
answer_tensor = torch.tensor([encoded_answer])

Let's see how to use Model2Model to get the loss associated with this (question, answer) pair:

# In order to compute the loss we need to provide language model labels to the decoder (the token ids the model should have produced).
lm_labels =  encoded_answer
labels_tensor = torch.tensor([lm_labels])

# Load the pretrained model (weights)
model = Model2Model.from_pretrained('bert-base-uncased')

# Set the model to evaluation mode
# This is important to get reproducible results during evaluation!
model.eval()

# If you have a GPU, put everything on cuda
question_tensor = question_tensor.to('cuda')
answer_tensor = answer_tensor.to('cuda')
labels_tensor = labels_tensor.to('cuda')
model.to('cuda')

# Get the loss associated with the (question, answer) pair
with torch.no_grad():
    # See the models' docstrings for the details of the inputs
    outputs = model(question_tensor, answer_tensor, decoder_lm_labels=labels_tensor)
    # Transformers models always output tuples.
    # See the models' docstrings for the details of all the outputs
    # In our case, the first element is the value of the LM loss
    lm_loss = outputs[0]

This loss can be used to fine-tune Model2Model on the question answering task.
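
Here is a minimal fine-tuning sketch, assuming the AdamW optimizer provided by transformers and a single optimization step on the tensors prepared above; a real setup would iterate over a full dataset of encoded (question, answer) pairs:

from transformers import AdamW
import os

# Put the model in training mode and set up an optimizer (the learning rate is only illustrative)
model.train()
optimizer = AdamW(model.parameters(), lr=3e-5)

# One optimization step on our single (question, answer) pair
outputs = model(question_tensor, answer_tensor, decoder_lm_labels=labels_tensor)
loss = outputs[0]
loss.backward()
optimizer.step()
optimizer.zero_grad()

# Save the fine-tuned weights so that they can be reloaded below
# (assuming save_pretrained() works for Model2Model as it does for the other classes)
os.makedirs('fine-tuned-weights', exist_ok=True)
model.save_pretrained('fine-tuned-weights')

Assuming the model has been fine-tuned, let's now see how to generate answers: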

# Let's re-use the previous question
question = "Who was Jim Henson?"
encoded_question = tokenizer.encode(question)
question_tensor = torch.tensor([encoded_question])

# This time we try to generate the answer, so we start with an empty sequence
answer = "[CLS]"
encoded_answer = tokenizer.encode(answer, add_special_tokens=False)
answer_tensor = torch.tensor([encoded_answer])

# Load the fine-tuned model (weights)
model = Model2Model.from_pretrained('fine-tuned-weights')
model.eval()

# If you have a GPU, put everything on cuda
question_tensor = question_tensor.to('cuda')
answer_tensor = answer_tensor.to('cuda')
model.to('cuda')

# Predict all tokens
with torch.no_grad():
    outputs = model(question_tensor, answer_tensor)
    predictions = outputs[0]

# Confirm that we can predict 'jim'
predicted_index = torch.argmax(predictions[0, -1]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
assert predicted_token == 'jim'

