Let's take sentiment analysis as an example to analyze AutoTokenizer and AutoModel in transformers.
Use IPython to inspect which pretrained checkpoint the sentiment-analysis pipeline depends on (the models are hosted on https://huggingface.co/).
From the output, the pretrained model that pipeline mode loads by default is:
distilbert-base-uncased-finetuned-sst-2-english
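One way to verify this from IPython is a quick sketch like the following (name_or_path is a standard attribute of transformers models; the printed string may carry an organization prefix depending on the library version):

from transformers import pipeline

pipe = pipeline("sentiment-analysis")
# the checkpoint the pipeline loaded, e.g. distilbert-base-uncased-finetuned-sst-2-english
print(pipe.model.name_or_path)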
How AutoTokenizer works:
If
test_sentences = ("today is not that bad", "today is so bad", "so good")
contains three elements, then
inputs_tensor = tokenizer.encode_plus(test_sentences, padding=True, truncation=True, return_tensors="pt")
raises an error, because a tuple passed to encode_plus() may hold at most two elements (a single sentence or a sentence pair)!
Calling tokenizer() on the same input, however, raises no error!
After analyzing the underlying code, it turns out that:
When the input is a string, tokenizer() == tokenizer.encode_plus();
When the input is a list or tuple, tokenizer() == tokenizer.batch_encode_plus(), as the sketch below shows.
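Here is a minimal sketch of that dispatch behavior (loading the checkpoint from the Hub; the exact exception type raised by encode_plus() may vary across transformers versions):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

text = "today is not that bad"
batch = ["today is not that bad", "today is so bad", "so good"]

# a single string is dispatched to encode_plus()
assert tokenizer(text)["input_ids"] == tokenizer.encode_plus(text)["input_ids"]

# a list (or tuple) of strings is dispatched to batch_encode_plus()
assert tokenizer(batch)["input_ids"] == tokenizer.batch_encode_plus(batch)["input_ids"]

# but a three-element tuple fed directly to encode_plus() fails,
# because encode_plus() reads a tuple as (sentence, sentence_pair)
try:
    tokenizer.encode_plus(tuple(batch))
except Exception as err:
    print("encode_plus raised:", err)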
Here is an explanation of the difference between padding=True and padding="max_length":
padding=True:
With
test_sentences = ("today is not that bad", "today is so bad", "so good")
whose sentences all differ in length, padding takes the length of the longest sentence in the batch as max_length and fills the shorter ones with [PAD], i.e. zero padding.
padding="max_length":
test_sentences = ("today is not that bad", "today is so bad", "so good")
padding="max_length" is generally used with the max_length=XXXX parameter, and zero-padding is performed with the max_length length.
Next, the return value of tokenizer.batch_encode_plus() or tokenizer.encode_plus():
input_ids: the id of each token in the tokenizer's vocabulary;
attention_mask: 1 marks a real (non-padding) token, 0 marks a [PAD] token.
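For example, with two sentences of different length (again reusing the tokenizer from above), the shorter one is padded and its mask positions become 0:

enc = tokenizer(["today is not that bad", "so good"], padding=True)
print(enc["input_ids"])       # the shorter sentence's ids end in the [PAD] id
print(enc["attention_mask"])  # 1 = real token, 0 = padding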
Under the hood, tokenizer.batch_encode_plus() and tokenizer.encode_plus() boil down to the statement:
tokenizer.convert_tokens_to_ids(tokenizer.tokenize(test_sentences[0]))
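Spelled out step by step (note that encode_plus() additionally wraps the ids in the special tokens [CLS] and [SEP]):

tokens = tokenizer.tokenize(test_sentences[0])  # word pieces of "today is not that bad"
ids = tokenizer.convert_tokens_to_ids(tokens)   # ids without special tokens
print(tokens, ids)
print(tokenizer.encode_plus(test_sentences[0])["input_ids"])  # same ids plus [CLS]/[SEP]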
Inspect the vocabulary held by the tokenizer:
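For instance, via get_vocab() (a token-to-id dict; its size agrees with vocab_size unless extra tokens were added):

vocab = tokenizer.get_vocab()                          # dict mapping token -> id
print(len(vocab), tokenizer.vocab_size)                # vocabulary size
print(vocab["[PAD]"], vocab["[CLS]"], vocab["[SEP]"])  # ids of the special tokens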
How AutoModel works:
Here we use AutoModelForSequenceClassification.
The returned logits is a tensor of shape (3, 2); taking the argmax over its last dimension gives [1, 0, 1].
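A sketch of that post-processing, with outputs as returned by the model's forward pass (the softmax is optional, since argmax over the raw logits yields the same labels):

import torch

probs = torch.softmax(outputs.logits, dim=-1)   # shape (3, 2): per-class probabilities
labels = torch.argmax(outputs.logits, dim=-1)   # tensor([1, 0, 1])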
What does [1,0,1] represent?
Let's look at its config:
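Printing the label mapping stored in this checkpoint's config:

print(model.config.id2label)  # {0: 'NEGATIVE', 1: 'POSITIVE'}
print(model.config.label2id)  # {'NEGATIVE': 0, 'POSITIVE': 1}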
From the above, 1 represents positive and 0 represents negative.
Complete code:
# ---encoding:utf-8---
# @Time : 2023/8/1 10:54
# @Author : CBAiotAigc
# @Email :[email protected]
# @Site :
# @File : tokenizer_sentiment_analysis.py
# @Project : AI_Review
# @Software: PyCharm
from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification
import torch
import torch.nn as nn
model_name = "../model/distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
test_sentences = ["today is not that bad", "today is so bad", "so good"]
# encode_plus() cannot handle a three-element batch, so this call would raise an error:
# inputs_tensor = tokenizer.encode_plus(test_sentences, padding=True, truncation=True, return_tensors="pt")
# print(inputs_tensor)
inputs_tensor = tokenizer(test_sentences, padding=True, truncation=True, return_tensors="pt")
print(inputs_tensor)
inputs_tensor = tokenizer.batch_encode_plus(test_sentences, padding=True, truncation=True, return_tensors="pt")
print(inputs_tensor)
model.eval()  # switch to evaluation mode before inference
with torch.no_grad():  # no gradients needed for inference
    outputs = model(**inputs_tensor)
    print(outputs)
    labels = torch.argmax(outputs.logits, dim=-1)  # index of the higher-scoring class per sentence
    print(labels)
print(model.config.id2label)
print([model.config.id2label[i] for i in labels.tolist()])  # map 0/1 to NEGATIVE/POSITIVE