Word-based tokenization
This is a very common tokenization method that splits text into words based on separators (spaces, punctuation, etc.)
Example:
“Is it weird I don’t like coffee?”
If we split only on spaces, we get:
[“Is”, “it”, “weird”, “I”, “don’t”, “like”, “coffee?”]
- We find that coffee? is kept as a single token, punctuation included. If coffee. appears later, the same word gets different representations just because of its punctuation, which is not ideal
So we take the punctuation into account:
[“Is”, “it”, “weird”, “I”, “don”, “’”, “t”, “like”, “coffee”, “?”]
- We find that don’t is now split into three tokens
- A better split would be do and n’t
- That way, the next time the model sees doesn’t, it can split it into does and n’t; since n’t has been seen before, the model can directly reuse what it has learned
- This can be achieved by devising some rules
Space-based, punctuation-based, and rule-based tokenization are all examples of word-based tokenization
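The whitespace-only and punctuation-aware splits described above can be sketched with Python's standard `re` module; the regex here is just an illustrative rule, not any particular tokenizer's:

```python
import re

sentence = "Is it weird I don't like coffee?"

# Split on whitespace only: punctuation stays glued to words ("coffee?")
print(sentence.split())

# Rule-based split that also separates punctuation into its own tokens:
# \w+ grabs runs of word characters, [^\w\s] grabs each punctuation mark
tokens = re.findall(r"\w+|[^\w\s]", sentence)
print(tokens)
```

Note how the second split turns "don't" into the three tokens "don", "'", "t", exactly the behavior discussed above.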
Afterwards, each word is represented by an ID. Each ID carries a lot of information, because a word in a sentence has rich contextual and semantic information
This approach sounds good, but on a large corpus it results in a huge vocabulary
- The SOTA model Transformer-XL, using space and punctuation tokenization, ends up with a vocabulary size of 267,735
- A huge vocabulary produces a huge embedding matrix for the input and output layers, resulting in a large number of model parameters (consuming resources)
To avoid a huge vocabulary, we can limit the number of words added to it:
- For example, only add the 5,000 most common words
- The model generates IDs for these words and marks all remaining words as OOV (Out Of Vocabulary)
- Disadvantage 1: this causes a lot of information loss, because the model never learns OOV words; it learns the same OOV representation for every unknown word
- Disadvantage 2: misspelled words are also marked as OOV
To address these shortcomings, character-based tokenization emerged
Character-based tokenization
Splits the raw text into individual characters
- Because each language has many different words but only a fixed number of characters
- This produces a tiny vocabulary
For example, English uses about 256 different characters (letters, digits, special characters), while there are nearly 170,000 words in the dictionary
Advantages:
- Produces a tiny vocabulary
- There are very few OOV words, so per-character representations can be used to build representations even for words not seen during training
- Misspelled words can still be represented character by character instead of being marked as OOV
Disadvantages:
- Unlike words, individual characters usually don't carry much meaning/information
- The tokenized sequence is much longer than the original word-level sequence
Note: in some languages, single characters carry a lot of information, so this approach is useful there
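Both properties, a tiny vocabulary and much longer sequences, are easy to see in a trivial Python sketch:

```python
def char_tokenize(text):
    # Every character (letters, spaces, punctuation) becomes its own token
    return list(text)

sentence = "Is it weird I don't like coffee?"
tokens = char_tokenize(sentence)

# The sequence is far longer than the handful of word-level tokens,
# but the set of distinct symbols (the vocabulary) stays tiny
print(len(tokens), "tokens,", len(set(tokens)), "distinct characters")
```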
Subword-based tokenization
It sits between word-based and character-based tokenization and aims to solve the problems of both methods
- word-based: the vocabulary is too large, there are too many OOV tokens, and similar words get unrelated representations
- character-based: sequences are too long, and individual tokens carry little meaning
Principles:
- Do not split common words into smaller subwords
- Split rare words into smaller meaningful subwords
For example: boy should not be split, but boys should be split into boy and s
- This helps the model learn that the word "boys" is formed from the word "boy"; they have slightly different meanings but share the same root
Similarly, we split tokenization into token and ization
- token is the root, which helps the model learn that words sharing a root (such as tokens) have similar meanings
- ization is a subword attached as additional information to the root. It helps the model learn that tokenization and modernization are built from different roots but share the same suffix ization and appear in similar syntactic contexts
Another case: surprisingly is split into surprising and ly, because these two subwords each appear more frequently on their own
These algorithms use special symbols to mark whether a token is the start of a word or the continuation of one
- tokenization -> token and ##ization
- Different models use different special symbols; ## is the one BERT uses
- Special symbols can also be placed at the beginning of words instead
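As a sketch (not BERT's actual implementation), a greedy longest-match WordPiece-style splitter over a small hand-built toy vocabulary might look like this:

```python
def wordpiece_split(word, vocab):
    """Greedy longest-match-first split; continuation pieces get a '##' prefix."""
    pieces, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        # Try the longest remaining span first, then shrink from the right
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece   # mark word-internal continuations
            if piece in vocab:
                cur = piece
                break
            end -= 1
        if cur is None:
            return ["[UNK]"]           # no known subword covers this span
        pieces.append(cur)
        start = end
    return pieces

# Toy vocabulary for illustration only (not BERT's real one)
vocab = {"token", "##ization", "modern", "surprising", "##ly"}
print(wordpiece_split("tokenization", vocab))    # ['token', '##ization']
print(wordpiece_split("modernization", vocab))   # ['modern', '##ization']
print(wordpiece_split("surprisingly", vocab))    # ['surprising', '##ly']
```

Note how tokenization and modernization end up sharing the ##ization piece, which is exactly the suffix-sharing behavior described above.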
A few common subword-based tokenization algorithms are WordPiece, used by BERT and DistilBERT; Unigram, used by XLNet and ALBERT; and Byte-Pair Encoding, used by GPT-2 and RoBERTa.
This way the model has a reasonably sized vocabulary and can still learn meaningful context-independent representations. It also handles unseen words, since they can be decomposed into known subwords
Reference: https://towardsdatascience.com/word-subword-and-character-based-tokenization-know-the-difference-ea0976b64e17