Principle analysis of AutoTokenizer and AutoModel in transformers (1)

Let's take sentiment analysis as an example to analyze AutoTokenizer and AutoModel in transformers.

Use IPython to inspect which pretrained checkpoint the sentiment-analysis pipeline depends on (the checkpoint is hosted on https://huggingface.co/).

From the output, the pretrained model that the pipeline loads by default is:

distilbert-base-uncased-finetuned-sst-2-english
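
A quick way to confirm this in IPython, as a minimal sketch assuming the default transformers pipeline (model.name_or_path reports the checkpoint that was actually loaded):

from transformers import pipeline

# Build the default sentiment-analysis pipeline and inspect which checkpoint it loaded
classifier = pipeline("sentiment-analysis")
print(classifier.model.name_or_path)
# expected: distilbert-base-uncased-finetuned-sst-2-english (the default checkpoint at the time of writing)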

How AutoTokenizer works:

If

test_sentences = ("today is not that bad", "today is so bad", "so good")

contains three elements, tokenizer.encode_plus() raises an error,

because

inputs_tensor = tokenizer.encode_plus(test_sentences, padding=True, truncation=True, return_tensors="pt")

expects the input to be a single sentence or a sentence pair, i.e. a tuple of at most two elements.

Calling tokenizer() directly, on the other hand, does not raise an error.

Reading the underlying code explains why:

When the input is a single string, tokenizer() behaves like tokenizer.encode_plus();

when the input is a list or tuple of strings, tokenizer() behaves like tokenizer.batch_encode_plus().
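
A minimal sketch of this dispatch behavior, assuming the checkpoint is available locally or on the Hub:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

# A single string goes through the encode_plus() path -> one encoded sequence
single = tokenizer("today is not that bad")
print(single["input_ids"])      # one list of token ids

# A list (or tuple) of strings goes through the batch_encode_plus() path -> one sequence per sentence
batch = tokenizer(["today is not that bad", "today is so bad", "so good"])
print(len(batch["input_ids"]))  # 3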

 Here is an explanation of the difference between padding=True and padding="max_length":

padding=True:

test_sentences = ("today is not that bad", "today is so bad", "so good")

When the sentences in test_sentences have different lengths, the tokenizer pads every sequence with [PAD] (token id 0) up to the length of the longest sentence in the batch.

padding="max_length":

test_sentences = ("today is not that bad", "today is so bad", "so good")

 padding="max_length" is generally used with the max_length=XXXX parameter, and zero-padding is performed with the max_length length.

The return values of tokenizer.batch_encode_plus() / tokenizer.encode_plus() are:

input_ids: the ids of the tokens in the tokenizer's vocabulary.

attention_mask: 1 marks a real (non-padding) token, 0 marks a [PAD] token.
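
For example, encoding a padded batch shows where the 0s appear (a sketch; the exact ids depend on the vocabulary):

enc = tokenizer(["today is so bad", "so good"], padding=True)
print(enc["input_ids"])       # the shorter sentence is padded with the [PAD] id (0)
print(enc["attention_mask"])  # 1 for real tokens, 0 for the [PAD] positions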

Under the hood, tokenizer.batch_encode_plus() / tokenizer.encode_plus() boil down to:

tokenizer.convert_tokens_to_ids(tokenizer.tokenize(test_sentences[0]))
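
A sketch of what this lower-level call produces, compared with tokenizer.encode(), which additionally wraps the ids with the [CLS]/[SEP] special tokens:

tokens = tokenizer.tokenize(test_sentences[0])
ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokens)   # word-piece tokens of "today is not that bad"
print(ids)      # their vocabulary ids, without special tokens

# encode() adds the special tokens around the same ids
print(tokenizer.encode(test_sentences[0]))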

Check the vocabulary of the tokenizer:
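
One way to inspect it (a sketch; the vocabulary of this uncased DistilBERT checkpoint has roughly 30,000 entries):

vocab = tokenizer.get_vocab()   # dict mapping token string -> id
print(len(vocab))               # vocabulary size
print(vocab["[PAD]"], vocab["[CLS]"], vocab["[SEP]"])   # ids of the special tokens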

How AutoModel works:

Here we use AutoModelForSequenceClassification.

The returned logits is a tensor of shape (3, 2); taking the argmax over the last dimension gives [1, 0, 1].

What does [1,0,1] represent?

Let's look at its config:
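
A minimal way to print the relevant part of the config, assuming model has been loaded with AutoModelForSequenceClassification.from_pretrained() as in the complete code below:

print(model.config.num_labels)   # 2
print(model.config.id2label)     # {0: 'NEGATIVE', 1: 'POSITIVE'}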

From the config, 1 represents POSITIVE and 0 represents NEGATIVE.

Complete code:

# ---encoding:utf-8---
# @Time    : 2023/8/1 10:54
# @Author  : CBAiotAigc
# @Email   :[email protected]
# @Site    : 
# @File    : tokenizer_sentiment_analysis.py
# @Project : AI_Review
# @Software: PyCharm
from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification
import torch
import torch.nn as nn

model_name = "../model/distilbert-base-uncased-finetuned-sst-2-english"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

test_sentences = ["today is not that bad", "today is so bad", "so good"]

# inputs_tensor = tokenizer.encode_plus(test_sentences, padding=True, truncation=True, return_tensors="pt")
# print(inputs_tensor)
inputs_tensor = tokenizer(test_sentences, padding=True, truncation=True, return_tensors="pt")

print(inputs_tensor)

inputs_tensor = tokenizer.batch_encode_plus(test_sentences, padding=True, truncation=True, return_tensors="pt")
print(inputs_tensor)

outputs = model(**inputs_tensor)
print(outputs)

model.eval()
with torch.no_grad():
    labels = torch.argmax(outputs.logits, dim=-1)
    print(labels)

    print(model.config.id2label)
    print([model.config.id2label[id] for id in labels.tolist()])
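
With the three test sentences above, the printed labels should come out as tensor([1, 0, 1]), which id2label maps to POSITIVE, NEGATIVE, POSITIVE.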


Origin blog.csdn.net/wtl1992/article/details/132037310