Using the transformers library to load the t5-small model for text translation

Foreword

The previous blog, a training and prediction tutorial for a Transformer-based Seq2Seq machine translation model, described how to train a Seq2Seq translation model from scratch. This blog covers how to use a pre-trained model from Hugging Face to perform the translation task.

Environment and Model Description

To use a pre-trained model from Hugging Face, first install the transformers, torch, and SentencePiece libraries with the following commands:

pip install transformers
pip install torch
pip install SentencePiece
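
After installing, you can quickly verify the environment by importing the libraries and printing their versions:

import torch
import transformers

# print the installed library versions to confirm the setup works
print("transformers:", transformers.__version__)
print("torch:", torch.__version__)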

Hugging Face hosts many pre-trained NLP models and large language models. For this tutorial I chose the relatively small t5-small.

T5 is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised and supervised tasks, with each task converted into a text-to-text format. T5 handles a variety of tasks well simply by prepending a different prefix to the input for each task, for example: translate English to German: ..., summarize: .... In other words, adding a prefix to the input text specifies which text-to-text task T5 should perform; if you want T5 to translate, the input might be "translate English to German: What is your name?". Appendix D of the original paper lists all the prefixes. For sequence-to-sequence generation, it is recommended to use the generate() function, which feeds the encoded input to the decoder through the cross-attention layers and auto-regressively generates the decoder output. T5 uses relative scalar embeddings, and the encoder input can be padded on either the left or the right.
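
To make the prefix mechanism concrete, here is a small sketch that sends two different tasks, translation and summarization, through the same model just by changing the prefix (the prompts here are made up for illustration):

from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# the same model handles different tasks depending on the prefix
prompts = [
    "translate English to German: What is your name?",
    "summarize: T5 is an encoder-decoder model trained on a mixture of "
    "unsupervised and supervised tasks, all cast into a text-to-text format.",
]

for prompt in prompts:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_length=40)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))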

Code Example

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# AutoModelWithLMHead is deprecated; AutoModelForSeq2SeqLM is the
# recommended class for encoder-decoder models such as T5
tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# the "translate English to German: " prefix tells T5 which task to perform
inputs = tokenizer.encode(
    "translate English to German: Hugging Face is a technology company based in New York and Paris",
    return_tensors="pt")
print(inputs)  # the encoded input token ids

outputs = model.generate(inputs, max_length=40, num_beams=4, early_stopping=True)

print(outputs[0])  # the generated output token ids
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Running the above code will automatically download t5-small from the Hugging Face model hub. If the automatic download fails, you can manually download the model files and place them in the corresponding folder; for details, see the blog post on ReadTimeoutError: HTTPSConnectionPool(host='cdn-lfs.huggingface.co', port=443), which contains a detailed tutorial.
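
If you do download the files manually, from_pretrained() also accepts a local directory path; a minimal sketch, assuming the model files were saved to a folder named ./t5-small:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# "./t5-small" is a placeholder for wherever you placed the downloaded files
tokenizer = AutoTokenizer.from_pretrained("./t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("./t5-small")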

Running the original code example prints the German translation of "Hugging Face is a technology company based in New York and Paris".

The example above predicts a single sentence; prediction also works on a batch of data:

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# T5's tokenizer already defines a pad token, and the encoder consumes the
# whole padded input via the attention mask, so no extra padding setup
# (such as left padding) is needed for this encoder-decoder model

task_prefix = "translate English to German: "
# use sentences of different lengths to test batching
sentences = ["The house is wonderful.", "I like to work in NYC."]
inputs = tokenizer([task_prefix + sentence for sentence in sentences],
                   return_tensors="pt", padding=True)

output_sequences = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    do_sample=False,  # disable sampling to test if batching affects output
)

print(tokenizer.batch_decode(output_sequences, skip_special_tokens=True))

The model outputs the German translation of each sentence in the batch.

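If a GPU is available, batch inference can be sped up by moving the model and inputs onto it; a minimal sketch of the same batch example on GPU:

import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

# pick the GPU when available, otherwise fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small").to(device)

task_prefix = "translate English to German: "
sentences = ["The house is wonderful.", "I like to work in NYC."]
inputs = tokenizer([task_prefix + s for s in sentences],
                   return_tensors="pt", padding=True).to(device)

output_sequences = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    do_sample=False,
)
print(tokenizer.batch_decode(output_sequences, skip_special_tokens=True))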

In addition, if you want to fine-tune this pre-trained model on your own dataset, you can follow the sample code below.

First, unsupervised (denoising) training, where sentinel tokens such as <extra_id_0> mark the spans that were masked out:

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

input_ids = tokenizer('The <extra_id_0> walks in <extra_id_1> park', return_tensors='pt').input_ids
labels = tokenizer('<extra_id_0> cute dog <extra_id_1> the <extra_id_2>', return_tensors='pt').input_ids
# the forward function automatically creates the correct decoder_input_ids
loss = model(input_ids=input_ids, labels=labels).loss
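
The returned loss can drive an ordinary PyTorch training step; a minimal sketch of one fine-tuning step with AdamW (the learning rate is illustrative, not tuned):

import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # illustrative value

input_ids = tokenizer("The <extra_id_0> walks in <extra_id_1> park", return_tensors="pt").input_ids
labels = tokenizer("<extra_id_0> cute dog <extra_id_1> the <extra_id_2>", return_tensors="pt").input_ids

# one optimization step: forward, backward, update
loss = model(input_ids=input_ids, labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(loss.item())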

And this is supervised training on a translation pair:

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

input_ids = tokenizer('translate English to German: The house is wonderful.', return_tensors='pt').input_ids
labels = tokenizer('Das Haus ist wunderbar.', return_tensors='pt').input_ids
# the forward function automatically creates the correct decoder_input_ids
loss = model(input_ids=input_ids, labels=labels).loss
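
When fine-tuning on batches of padded pairs, one detail matters: padding token ids in the labels should be replaced with -100 so the loss function ignores them. A minimal sketch with a made-up two-pair dataset:

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# hypothetical toy parallel data for illustration
sources = ["translate English to German: The house is wonderful.",
           "translate English to German: I like to work in NYC."]
targets = ["Das Haus ist wunderbar.", "Ich arbeite gerne in NYC."]

inputs = tokenizer(sources, return_tensors="pt", padding=True)
labels = tokenizer(targets, return_tensors="pt", padding=True).input_ids

# mask out padding in the labels so it does not contribute to the loss
labels[labels == tokenizer.pad_token_id] = -100

loss = model(input_ids=inputs.input_ids,
             attention_mask=inputs.attention_mask,
             labels=labels).loss
print(loss.item())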

To sum up, if these general-purpose models cover your use case, or if you are just getting started in machine translation, you can play with these pre-trained models first and see how well they work. And if you have a small dataset of your own, you can do transfer learning on top of such a pre-trained model to improve accuracy. The code above should give you a useful starting point.

Origin blog.csdn.net/weixin_42280271/article/details/130708802