Transformers data preprocessing

Preprocessing data

In Transformers, the main tool for data preprocessing is the tokenizer. We can use the tokenizer class that corresponds to a given model, or use the AutoTokenizer class directly.

A tokenizer first splits the text into words, punctuation marks, and so on; these pieces are called tokens. Each token is then converted to a number so the input can be turned into a tensor for training. In addition, some tokenizers add special tokens required by the model, such as [CLS] and [SEP] for BERT.

Note:
If you want to use a pre-trained model, you need to use that model's own tokenizer, because it splits text exactly the way the model saw during training and uses the same vocabulary. Using the wrong tokenizer can badly hurt prediction or fine-tuning: for example, if the word "I" has index 1 in the original vocabulary but index 100 in another tokenizer's vocabulary, the model receives completely different data from what you intended.

To automatically download the tokenizer that was used when the model was trained or fine-tuned, use the from_pretrained() method:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
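
As a quick illustration of the note above (assuming bert-base-uncased as a second, mismatched checkpoint), the same text maps to different ids under different vocabularies:

# Illustration: the same word gets different ids under different vocabularies,
# so a mismatched tokenizer silently corrupts the model's input.
other_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
print(tokenizer("Hello")["input_ids"])        # ids from the bert-base-cased vocabulary
print(other_tokenizer("Hello")["input_ids"])  # different ids for the same text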

basic usage

preprocessing

Tokenizers in Transformers have many methods, but there is only one entry point for preprocessing: __call__. You simply feed the text directly to the tokenizer object, as follows:

encoded_input = tokenizer("Hello, I'm a single sentence!")
print(encoded_input)

Output:
{'input_ids': [101, 8667, 117, 146, 112, 182, 170, 1423, 5650, 106, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

This method returns a dictionary. input_ids holds the index of each token in the input text. The uses of attention_mask and token_type_ids are discussed later.
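
To see which token each id stands for, we can map the ids back with convert_ids_to_tokens (a small sketch on the encoding above):

print(tokenizer.convert_ids_to_tokens(encoded_input["input_ids"]))
# ['[CLS]', 'Hello', ',', 'I', "'", 'm', 'a', 'single', 'sentence', '!', '[SEP]']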

decoding

In addition to encoding text, tokenizers can also decode indices:

print(tokenizer.decode(encoded_input["input_ids"]))

Output:
[CLS] Hello, I'm a single sentence! [SEP]

We can see that the tokenizer automatically added the special tokens BERT requires during preprocessing.

Not all models need special tokens. If we used gpt2-medium instead of bert-base-cased, decoding here would give back exactly the original text.
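
A minimal sketch of that, assuming the gpt2-medium checkpoint is available:

# GPT-2's byte-level tokenizer adds no special tokens,
# so decoding returns the original text unchanged.
gpt2_tokenizer = AutoTokenizer.from_pretrained('gpt2-medium')
ids = gpt2_tokenizer("Hello, I'm a single sentence!")["input_ids"]
print(gpt2_tokenizer.decode(ids))  # Hello, I'm a single sentence!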

When decoding, we can also pass skip_special_tokens=True to decode() to strip the special tokens (when encoding, passing add_special_tokens=False prevents them from being added in the first place).
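
For example, with the BERT encoding from above:

print(tokenizer.decode(encoded_input["input_ids"], skip_special_tokens=True))
# Hello, I'm a single sentence!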

multiple sentences

If you want to process multiple texts at once, combine them into a list and pass them to the tokenizer in a single call, as follows:

batch_sentences = ["Hello I'm a single sentence",
                   "And another sentence",
                   "And the very very last one"]
encoded_inputs = tokenizer(batch_sentences)
print(encoded_inputs)

Output:
{'input_ids': [[101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
               [101, 1262, 1330, 5650, 102],
               [101, 1262, 1103, 1304, 1304, 1314, 1141, 102]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1, 1]]}

Pad, truncate, return specific types

When processing multiple sentences at once, we may also want to:

  • Pad each sentence to the maximum length in the batch
  • Truncate each sentence to the maximum length the model can accept
  • Return tensor-type data (here, PyTorch tensors)

All of these can be achieved with a single call:

batch = tokenizer(batch_sentences, max_length=7, padding=True, truncation=True, return_tensors="pt")
print(batch)

Result:
{'input_ids': tensor([[ 101, 8667,  146,  112,  182,  170,  102],
                      [ 101, 1262, 1330, 5650,  102,    0,    0],
                      [ 101, 1262, 1103, 1304, 1304, 1314,  102]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0],
                           [0, 0, 0, 0, 0, 0, 0],
                           [0, 0, 0, 0, 0, 0, 0]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1],
                           [1, 1, 1, 1, 1, 0, 0],
                           [1, 1, 1, 1, 1, 1, 1]])}

This time a dictionary of strings to torch.Tensor is returned. From this output we can also see what attention_mask is for: it tells the model which tokens should be attended to and which should be ignored, since padding tokens carry no meaning.
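
To make the mask concrete, here is a small sketch that lines up each token of the padded second sentence with its mask value (converting the tensor rows to plain lists first):

tokens = tokenizer.convert_ids_to_tokens(batch["input_ids"][1].tolist())
for token, mask in zip(tokens, batch["attention_mask"][1].tolist()):
    print(token, mask)
# The two trailing [PAD] tokens get mask 0, i.e. the model should ignore them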

Note that the above code will emit a warning if the model in use has no associated maximum length. This is fine and can be ignored, or you can pass verbose=False to stop the tokenizer from emitting such warnings.

processing sentence pairs

Sometimes a pair of sentences needs to be fed into the model, for example to judge whether two sentences are similar, or to feed a passage and a question into a question-answering model. For a BERT model, the sentence pair must be transformed into the form: [CLS] Sequence A [SEP] Sequence B [SEP]

To process a sentence pair with Transformers, pass the two sentences to the tokenizer as two separate arguments (note: they are not merged into one list as before). We then get a corresponding dictionary, as in the following example:

encoded_input = tokenizer("How old are you?", "I'm 6 years old")
print(encoded_input)
print(tokenizer.decode(encoded_input["input_ids"]))
for i in encoded_input["input_ids"]:
    print(tokenizer.decode(i))

Result:
{'input_ids': [101, 1731, 1385, 1132, 1128, 136, 102, 146, 112, 182, 127, 1201, 1385, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
[CLS] How old are you? [SEP] I'm 6 years old [SEP]
[CLS]
How
old
are
you
?
[SEP]
I
'
m
6
years
old
[SEP]

From the results we can see the effect of token_type_ids: they tell the model which part of the input belongs to the first sentence and which to the second. Note that not all models require token_type_ids; by default, a tokenizer only returns the inputs that its model expects. You can pass parameters such as return_token_type_ids or return_length to change what the tokenizer returns.

encoded_input = tokenizer("How old are you?", "I'm 6 years old",
                          return_token_type_ids=False,
                          return_length=True)
print(encoded_input)

Output:
{'input_ids': [101, 1731, 1385, 1132, 1128, 136, 102, 146, 112, 182, 127, 1201, 1385, 102],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'length': 14}

Likewise, if you want to process multiple sentence pairs at once, pass in two lists of texts, as follows:

batch_sentences = ["Hello I'm a single sentence",
                   "And another sentence",
                   "And the very very last one"]
batch_of_second_sentences = ["I'm a sentence that goes with the first sentence",
                             "And I should be encoded with the second sentence",
                             "And I go with the very last one"]
encoded_inputs = tokenizer(batch_sentences, batch_of_second_sentences)
print(encoded_inputs)

Result:
{'input_ids': [[101, 8667, 146, 112, 182, 170, 1423, 5650, 102, 146, 112, 182, 170, 5650, 1115, 2947, 1114, 1103, 1148, 5650, 102],
               [101, 1262, 1330, 5650, 102, 1262, 146, 1431, 1129, 12544, 1114, 1103, 1248, 5650, 102],
               [101, 1262, 1103, 1304, 1304, 1314, 1141, 102, 1262, 146, 1301, 1114, 1103, 1304, 1314, 1141, 102]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

We can check our input by looping over input_ids and decoding each list, as follows:

for ids in encoded_inputs["input_ids"]:
    print(tokenizer.decode(ids))

Result:
[CLS] Hello I'm a single sentence [SEP] I'm a sentence that goes with the first sentence [SEP]
[CLS] And another sentence [SEP] And I should be encoded with the second sentence [SEP]
[CLS] And the very very last one [SEP] And I go with the very last one [SEP]

And you can still pass parameters to pad or truncate the text, or to convert it to a specific tensor type, when encoding:

batch = tokenizer(batch_sentences, batch_of_second_sentences, padding=True, truncation=True, return_tensors="pt")

More about padding and truncation

The usage described above covers most cases, but Transformers also provides finer-grained control, built around three parameters: padding, truncation, and max_length. (A small sketch comparing the padding modes follows the list below.)

  • padding controls padding. It can be a boolean or a string:
    • True or 'longest': pad all sentences to the longest sequence in the batch (does nothing if you provide only one sentence).
    • 'max_length': pad to the length given by the max_length parameter; if max_length is not provided (max_length=None), pad to the maximum length the model can accept. This also works when you provide only one sentence.
    • False or 'do_not_pad': no padding. This is the default.
  • truncation controls truncation. It can be a boolean or a string:
    • True or 'longest_first': truncate to the length given by the max_length parameter; if max_length is not provided (max_length=None), truncate to the maximum length the model can accept. For sentence pairs (or batches of pairs), tokens are removed one at a time from the longer sentence of the pair until the length fits.
    • 'only_first': truncate as above, but only the first sentence of a pair is truncated.
    • 'only_second': truncate as above, but only the second sentence of a pair is truncated.
    • False or 'do_not_truncate': no truncation. This is the default.
  • max_length controls the target length for padding or truncation. It can be an integer or None; it defaults to the maximum length the model can accept. If the model has no specific maximum input length, truncation or padding to max_length is deactivated.
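
As mentioned above, here is a minimal sketch (reusing batch_sentences from earlier) that compares the three padding modes by the resulting sequence lengths:

# Compare padding modes: none, pad to longest in batch, pad to max_length
for pad in (False, True, "max_length"):
    out = tokenizer(batch_sentences, padding=pad, truncation=True, max_length=12)
    print(pad, [len(ids) for ids in out["input_ids"]])
# False        -> unpadded lengths [9, 5, 8]
# True         -> padded to the batch maximum: [9, 9, 9]
# 'max_length' -> padded to max_length: [12, 12, 12]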

Usage summary

In the examples below, if the input is a pair of sentences, truncation=True can be replaced by a STRATEGY chosen from ['only_first', 'only_second', 'longest_first'].

  • No truncation
    • No padding: tokenizer(batch_sentences)
    • Pad to the maximum length of the batch: tokenizer(batch_sentences, padding=True) or tokenizer(batch_sentences, padding='longest')
    • Pad to the maximum acceptable length of the model: tokenizer(batch_sentences, padding='max_length')
    • Pad to a specific length: tokenizer(batch_sentences, padding='max_length', max_length=42)
  • Truncate to the maximum input length of the model
    • No padding: tokenizer(batch_sentences, truncation=True) or tokenizer(batch_sentences, truncation=STRATEGY)
    • Pad to the maximum length of the batch: tokenizer(batch_sentences, padding=True, truncation=True) or tokenizer(batch_sentences, padding=True, truncation=STRATEGY)
    • Pad to the maximum acceptable length of the model: tokenizer(batch_sentences, padding='max_length', truncation=True) or tokenizer(batch_sentences, padding='max_length', truncation=STRATEGY)
    • Pad to a specific length: not possible, because adding a max_length parameter would truncate to that length rather than to the model maximum.
  • Truncate to a specific length
    • No padding: tokenizer(batch_sentences, truncation=True, max_length=42) or tokenizer(batch_sentences, truncation=STRATEGY, max_length=42)
    • Pad to the maximum length of the batch: tokenizer(batch_sentences, padding=True, truncation=True, max_length=42) or tokenizer(batch_sentences, padding=True, truncation=STRATEGY, max_length=42)
    • Pad to the maximum acceptable length of the model: not possible
    • Pad to a specific length: tokenizer(batch_sentences, padding='max_length', truncation=True, max_length=42) or tokenizer(batch_sentences, padding='max_length', truncation=STRATEGY, max_length=42)

Pre-tokenized input

Tokenizers can also accept pre-tokenized input. This is useful for named entity recognition (NER) and part-of-speech (POS) tagging tasks.

Note that pre-tokenized input does not mean indexed input; it only means the text has already been split into words. To use pre-tokenized input, just set the parameter is_split_into_words=True. For example:

encoded_input = tokenizer(["Hello", "I'm", "a", "single", "sentence"], is_split_into_words=True)
print(encoded_input)

Result:
{'input_ids': [101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
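
For NER or POS tagging, the value of pre-tokenized input is that each token can be traced back to the word it came from. A small sketch of that (assuming a fast tokenizer, which AutoTokenizer returns by default) using the word_ids() method:

encoding = tokenizer(["Hello", "I'm", "a", "single", "sentence"],
                     is_split_into_words=True)
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
# ['[CLS]', 'Hello', 'I', "'", 'm', 'a', 'single', 'sentence', '[SEP]']
print(encoding.word_ids())  # index of the source word for each token
# [None, 0, 1, 1, 1, 2, 3, 4, None]  (None marks the special tokens)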

Note: model-specific special tokens are also added to pre-tokenized input, unless the parameter add_special_tokens=False is passed.
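
For example:

encoded_input = tokenizer(["Hello", "I'm", "a", "single", "sentence"],
                          is_split_into_words=True, add_special_tokens=False)
print(encoded_input["input_ids"])  # no [CLS]/[SEP] ids at the ends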

Multiple sentences

Encoding multiple pre-tokenized sentences works exactly as before:

batch_sentences = [["Hello", "I'm", "a", "single", "sentence"],
                   ["And", "another", "sentence"],
                   ["And", "the", "very", "very", "last", "one"]]
encoded_inputs = tokenizer(batch_sentences, is_split_into_words=True)

Sentence pairs

Sentence pairs can also be entered like this:

batch_of_second_sentences = [["I'm", "a", "sentence", "that", "goes", "with", "the", "first", "sentence"],
                             ["And", "I", "should", "be", "encoded", "with", "the", "second", "sentence"],
                             ["And", "I", "go", "with", "the", "very", "last", "one"]]
encoded_inputs = tokenizer(batch_sentences, batch_of_second_sentences, is_split_into_words=True)

padding and truncation

It is also possible to pad and truncate as before:

batch = tokenizer(batch_sentences,
                  batch_of_second_sentences,
                  is_split_into_words=True,
                  padding=True,
                  truncation=True,
                  return_tensors="pt")
