Transformers Quick Start (Quick Tour)

First, a brief introduction to the Transformers library. Transformers can be used to download pretrained models for Natural Language Understanding tasks, such as sentiment analysis, as well as pretrained models for Natural Language Generation tasks, such as translation.

Use pipeline for a natural language processing task

pipeline is the quickest way to use a pretrained model.

Transformers provides pipelines for several classic natural language tasks (a short sketch of two of them follows this list):

  1. Sentiment Analysis: Analyze whether a text is positive or negative.
  2. Text Generation: Provide a prompt and the model will generate text that continues it.
  3. Named Entity Recognition: Label each token in the input sentence with the type of entity it represents (person, place, organization, etc.).
  4. Question Answering: Given a piece of text and a question, extract the answer to the question from the text.
  5. Fill in masked text: Given a piece of text in which some words are replaced by [MASK] tokens, the model fills in the masked words.
  6. Summarization: Generate a summary of a long piece of text.
  7. Translation: Translate text from one language to another.
  8. Feature extraction: Get a tensor representation of a piece of text.
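
As a quick illustration, here is a minimal sketch of two of these pipelines. The task names ("text-generation", "fill-mask") are real pipeline identifiers, but the default models they download, and therefore the exact outputs, depend on your transformers version:

from transformers import pipeline

# Text generation: the default model continues the prompt.
generator = pipeline("text-generation")
print(generator("The Transformers library is", max_length=20))

# Fill-mask: use the tokenizer's own mask token, since it differs by model.
unmasker = pipeline("fill-mask")
print(unmasker(f"Paris is the {unmasker.tokenizer.mask_token} of France."))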

Sentiment Analysis Example

Single-sentence input:

from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier("We are very happy to introduce pipeline to the transformers repository."))
print(classifier("I hate you!"))

When this code is executed for the first time, a default pretrained model and its text tokenizer are downloaded from the Internet. The tokenizer preprocesses the text, and the model then makes a prediction on the processed input.

The code output is as follows:

[{'label': 'POSITIVE', 'score': 0.9996980428695679}]
[{'label': 'NEGATIVE', 'score': 0.9987472891807556}]

In addition to single sentence input, a list containing multiple sentences can also be fed into the model to obtain results.

Multiple-sentence input:

from transformers import pipeline

sentences = [
    "I am very happy!",
    "I am not happy."
]

classifier = pipeline("sentiment-analysis")
print(classifier(sentences))

The output is as follows:

[{'label': 'POSITIVE', 'score': 0.9998728632926941},
 {'label': 'NEGATIVE', 'score': 0.9997913241386414}]

Download the desired model

If we do not specify a model name in pipeline, the default model for the task is downloaded.

In the example above, it downloads a model called "distilbert-base-uncased-finetuned-sst-2-english".

If we don't want to use the default, we can browse models at https://huggingface.co/models. On a model's detail page you can try the model and view its import code.
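
For example, a model name from the Hub can be passed directly to pipeline. The model below ("nlptown/bert-base-multilingual-uncased-sentiment") is just one sentiment model on the Hub, chosen here for illustration:

from transformers import pipeline

# Specify a model from the Hub instead of the task default.
classifier = pipeline("sentiment-analysis", model="nlptown/bert-base-multilingual-uncased-sentiment")
print(classifier("We are very happy to show you the Transformers library."))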

Use of Tokenizers

Text tokenizers are tools for preprocessing text.

A tokenizer splits a sentence into individual units; these units are called tokens.

The tokenizer then converts the tokens into numbers, and once they are numbers we can feed them to the model.

To convert tokens into numbers, the tokenizer has a vocabulary, which is downloaded when we instantiate it with a model name. This vocabulary is the same one the model used during pretraining.

Example:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
sentence = "We are very happy to show you the Transformers library"
inputs = tokenizer(sentence)
print(inputs)

Output result:

{'input_ids': [101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 19081, 3075, 102],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

As you can see, the returned value is a dictionary with two key-value pairs.

The first key-value pair, "input_ids", is the input sentence converted into numbers; its length equals the number of tokens in the sentence (including the special tokens added at the start and end).

The second key-value pair, "attention_mask", is all 1s, which tells the model to pay attention to every token.
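
To see which token each id corresponds to, you can map the ids back to token strings (this continues the example above; the commented output is what the distilbert uncased vocabulary is expected to produce):

# Map each id back to its token string; 101 and 102 are special tokens.
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"]))
# ['[CLS]', 'we', 'are', 'very', 'happy', 'to', 'show', 'you', 'the', 'transformers', 'library', '[SEP]']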

If we want to process a batch of sentences at a time, we can pass in a list of sentences.

Example:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
sentence = [
    "We are very happy to show you the Transformers library",
    "I am not happy."
]
inputs = tokenizer(sentence)
print(inputs)

Output result:

{'input_ids': [[101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 19081, 3075, 102],
               [101, 1045, 2572, 2025, 3407, 1012, 102]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1]]}

Padding and truncation

In some cases, we need to truncate or pad sentences so that all model inputs have the same length. To do this, pass the padding, truncation, and max_length parameters to the tokenizer:

  • padding: Boolean indicating whether to pad.
  • truncation: Boolean indicating whether to truncate.
  • max_length: The maximum length of the input. Text longer than this is truncated; text shorter than this is padded.

Example:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
sentence = [
    "We are very happy to show you the Transformers library",
    "I am not happy."
]
inputs = tokenizer(
    sentence,
    padding=True,
    truncation=True,
    max_length=10,
    return_tensors="pt"
)
print(inputs)

Output result:

{'input_ids': tensor([[ 101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996,  102],
        [ 101, 1045, 2572, 2025, 3407, 1012,  102,    0,    0,    0]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]])}

As you can see, all inputs now have length 10. The sentence shorter than max_length is padded at the end with 0s, and the corresponding attention_mask positions are also set to 0, indicating that the model should not attend to those positions.

Note: as the output shows, ids 101 and 102 appear at the start and end of each sentence. These are special tokens added by the tokenizer: 101 is [CLS] and 102 is [SEP].
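
Continuing this example, tokenizer.decode turns a row of ids back into readable text, which makes those special tokens visible:

# Decode the first (truncated) sequence; [CLS] and [SEP] appear in the text.
print(tokenizer.decode(inputs["input_ids"][0]))
# expected: "[CLS] we are very happy to show you the [SEP]"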

Using the model

Once your input text has been preprocessed by the tokenizer, it can be fed directly into the model, since the tokenizer output contains everything the model needs.

Note that with a PyTorch model you cannot pass the inputs dict directly; you need to unpack it with ** (keyword-argument unpacking).

Example:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
sentence = [
    "We are very happy to show you the Transformers library",
    "I am not happy."
]
inputs = tokenizer(
    sentence,
    padding=True,
    truncation=True,
    max_length=10,
    return_tensors="pt"
)

pt_output = model(**inputs)
print(pt_output)

Output result:

(tensor([[-4.2644,  4.6002],
         [ 4.7293, -3.7452]], grad_fn=<AddmmBackward0>),)
Notes:

  • return_tensors="pt" is required; without it an error will be raised.
  • In transformers, all model outputs are tuples (with one or more elements). In this example, the output is a tuple with a single element.
  • All transformers models (whether PyTorch or TensorFlow) return the values produced before the final activation function (such as softmax), because the final activation is often fused with the loss function.

Use softmax to make the result more "pleasing to the eye"

The absolute values of the outputs in the example above are greater than 1, which does not look like the probabilities we saw earlier. We can apply softmax to turn these logits into probabilities. The code is as follows:

import torch.nn.functional as F
pt_predictions = F.softmax(pt_output[0], dim=-1)
print(pt_predictions)

The result is as follows:

tensor([[1.4128e-04, 9.9986e-01],
        [9.9979e-01, 2.0870e-04]], grad_fn=<SoftmaxBackward0>)

Now the results look like proper probabilities.

Add the labels parameter to get loss

In addition, if we know the labels of the inputs, we can pass them to the model as well, and we will get a tuple of the form (loss, outputs). An example follows:

import torch
# assume negative = 0, positive = 1
pt_outputs = model(**inputs, labels = torch.tensor([1, 0]))
print(pt_outputs)

The result is as follows:

(tensor(0.0002, grad_fn=<NllLossBackward0>),
 tensor([[-4.2644,  4.6002],
         [ 4.7293, -3.7452]], grad_fn=<AddmmBackward0>))

Downloaded pretrained models can also be trained

Besides prediction, these pretrained models can also be trained further, because under the hood they are built on torch.nn.Module or tf.keras.Module. The Transformers library also provides the Trainer and TFTrainer classes to help with training, supporting features such as distributed training and mixed precision. These classes will be introduced in detail later; a brief sketch follows here.
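
A minimal sketch of the Trainer API, reusing the model and tokenized inputs from the examples above; ToyDataset and the argument values are illustrative, not part of the library:

import torch
from transformers import Trainer, TrainingArguments

# A toy dataset wrapping the tokenized inputs and labels from above.
class ToyDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: v[idx] for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

train_dataset = ToyDataset(inputs, [1, 0])  # inputs from the padding example

# output_dir is where checkpoints are written.
training_args = TrainingArguments(output_dir="./results", num_train_epochs=1)
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()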

Save and load models

When you're done fine-tuning your model, you can save the model together with its tokenizer:

tokenizer.save_pretrained(save_directory)
model.save_pretrained(save_directory)

When you want to load a model saved locally (rather than downloading it), use the from_pretrained() function. Note that in this case the parameter is the local directory path, not a model name.
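
For example (save_directory is whatever path you used when saving above):

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Pass the local directory path instead of a model name.
tokenizer = AutoTokenizer.from_pretrained(save_directory)
model = AutoModelForSequenceClassification.from_pretrained(save_directory)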

In addition, transformers has a very cool feature: you can easily switch between PyTorch and TensorFlow, because any model saved with transformers can be loaded in either framework. To load a model saved in one framework into the other, just use the model class from the target framework and pass the corresponding from_pt / from_tf flag:

from transformers import AutoTokenizer, AutoModel, TFAutoModel

# Load a PyTorch-saved model in TensorFlow
tokenizer = AutoTokenizer.from_pretrained(save_directory)
model = TFAutoModel.from_pretrained(save_directory, from_pt=True)

# Load a TensorFlow-saved model in PyTorch
tokenizer = AutoTokenizer.from_pretrained(save_directory)
model = AutoModel.from_pretrained(save_directory, from_tf=True)

# If loading in the same framework the model was saved in, the from_xx argument is not needed

Get hidden states and attention weights

If you want to get all hidden states and attention weights, pass the following flags:

pt_outputs = model(**inputs, output_hidden_states=True, output_attentions=True)
all_hidden_states, all_attentions = pt_outputs[-2:]
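
As a quick check of what comes back (for a DistilBERT-style model the hidden states include the embedding output, so there is one more hidden-state tensor than transformer layers):

# Hidden states have shape (batch_size, sequence_length, hidden_size);
# attentions have shape (batch_size, num_heads, sequence_length, sequence_length).
print(len(all_hidden_states), all_hidden_states[0].shape)
print(len(all_attentions), all_attentions[0].shape)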

About the code

The AutoModel and AutoTokenizer classes are shortcuts for using pretrained models. Behind them, the transformers library provides a concrete model class for each architecture, so you can easily find the code and modify it whenever you need to.

Taking the above example, the model is called "distilbert-base-uncased-finetuned-sst-2-english" and its architecture is DistilBERT. The class used to load it is AutoModelForSequenceClassification (TFAutoModelForSequenceClassification in a TensorFlow environment). You can learn more about this model in the documentation, or browse its source code.

If you don't want to use autoloading, you can also import the same model like this:

from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = DistilBertForSequenceClassification.from_pretrained(model_name)
tokenizer = DistilBertTokenizer.from_pretrained(model_name)

Custom models

If you want to change the structure of the model, you can customize its configuration class. Each model architecture has its own configuration; for DistilBERT it is DistilBertConfig. In it you can customize the feature dimensions, the dropout ratio, and so on. Note that if you modify the core of the model, such as the hidden size, you can no longer use the pretrained weights and must train the model from scratch. You can still instantiate the model from the configuration, just without trained weights.

In the following example, we use the predefined DistilBERT vocabulary and use the configuration class to randomly initialize a custom model:

from transformers import DistilBertConfig, DistilBertTokenizer, DistilBertForSequenceClassification
config = DistilBertConfig(n_heads=8, dim=512, hidden_dim=4*512)
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertForSequenceClassification(config)
  • Since the predefined tokenizer is used, it is loaded with from_pretrained.
  • Since the model structure is changed, the pretrained weights cannot be loaded, so the model is instantiated from the config.

For changes that do not modify the core of the model, such as changing the number of labels, we can still use the pretrained model. Below, we use a pretrained model to create a 10-class classifier. We could build a config that keeps all defaults except the number of labels, but we can also pass num_labels directly to from_pretrained, which adjusts the default config for us. The code is as follows:

from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
model_name = "distilbert-base-uncased"
model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=10)
tokenizer = DistilBertTokenizer.from_pretrained(model_name)

sentence = [
    "I am happly"
]

inputs = tokenizer(
    sentence,
    return_tensors="pt"
)

print(model(**inputs))

Output result:

(tensor([[-0.0141, -0.1514,  0.0542,  0.0729,  0.1489, -0.1813,  0.0564, -0.0404,
          -0.0721,  0.0494]], grad_fn=<AddmmBackward0>),)

As you can see, the output now has 10 logits, one per class.

Origin: blog.csdn.net/qq_42464569/article/details/121071756