Common special symbols in Transformer models

Let's walk through the common special tokens in Transformer models with a code example.

Sample code:

special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}

This code defines a dictionary, special_tokens, whose keys map to the following tokens:

unk_token: the unknown-word token, which replaces words that are not in the vocabulary.
sep_token: the separator token, which separates sentences.
pad_token: the padding token, which pads sequences to the same length.
cls_token: the classification token, used for classification tasks.
mask_token: the mask token, which hides some words.
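To make this concrete, here is a minimal sketch using a toy vocabulary (the vocabulary and the build_input helper are hypothetical, not part of any real tokenizer library). It shows how out-of-vocabulary words become [UNK] and how [CLS] and [SEP] frame a sentence or a sentence pair:

```python
# Toy vocabulary containing the special tokens plus a few ordinary words.
vocab = {"[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]",
         "the", "cat", "sat", "dog", "ran"}

def build_input(sentence_a, sentence_b=None):
    """Replace out-of-vocabulary words with [UNK] and add [CLS]/[SEP]."""
    def to_tokens(sentence):
        # Any word not in the vocabulary is replaced by [UNK].
        return [w if w in vocab else "[UNK]" for w in sentence.split()]

    # [CLS] starts the sequence; [SEP] ends each sentence.
    tokens = ["[CLS]"] + to_tokens(sentence_a) + ["[SEP]"]
    if sentence_b is not None:
        tokens += to_tokens(sentence_b) + ["[SEP]"]
    return tokens

print(build_input("the cat sat", "the zebra ran"))
# ['[CLS]', 'the', 'cat', 'sat', '[SEP]', 'the', '[UNK]', 'ran', '[SEP]']
```

Note how "zebra", which is not in the toy vocabulary, is replaced by [UNK], while the two sentences are joined with a [SEP] between them.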
The specific meaning and function of each token is:

[UNK] stands for unknown (out-of-vocabulary) words: any word not in the model's vocabulary is replaced by this token.
[SEP] separates sentences, for example marking the boundary between the two sentences in a pair.
[PAD] is the padding token, used to pad sentences in a batch to the same length.
[CLS] is the classification token, used for classification tasks. It is added to the beginning of the sequence, and classification is done through the representation of this token.
[MASK] is the mask token, used to hide some words so that the model can be trained to predict them.
These are the common special tokens in Transformer models. They need to be added when preparing inputs for NLP tasks, because they carry specific structural meaning.
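The remaining two tokens can be sketched the same way. The helpers below (pad_batch and mask_tokens are hypothetical names, and the 15% masking rate is only the conventional BERT-style default) show [PAD] aligning a batch to one length and [MASK] hiding tokens for the model to predict:

```python
import random

def pad_batch(batch, pad_token="[PAD]"):
    """Pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_token] * (max_len - len(seq)) for seq in batch]

def mask_tokens(tokens, mask_prob=0.15, special={"[CLS]", "[SEP]", "[PAD]"}):
    """Randomly replace ordinary tokens with [MASK]; the originals become labels."""
    rng = random.Random(0)  # fixed seed so the example is reproducible
    masked, labels = [], []
    for tok in tokens:
        if tok not in special and rng.random() < mask_prob:
            masked.append("[MASK]")
            labels.append(tok)   # the hidden token is the prediction target
        else:
            masked.append(tok)
            labels.append(None)  # nothing to predict at this position
    return masked, labels

batch = pad_batch([["[CLS]", "the", "cat", "[SEP]"],
                   ["[CLS]", "the", "dog", "ran", "[SEP]"]])
print(batch[0])
# ['[CLS]', 'the', 'cat', '[SEP]', '[PAD]']

masked, labels = mask_tokens(batch[1], mask_prob=0.5)
print(masked)
# ['[CLS]', 'the', 'dog', '[MASK]', '[SEP]']
```

The padding only fills space so sequences line up in a batch, while masking creates the training signal: the model sees [MASK] and must recover the original word ("ran" here) stored in the labels.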

end!

Origin blog.csdn.net/engchina/article/details/132815033