Common Word Segmentation Methods

Word-based tokenization

This is a very common word segmentation method: it splits text into words based on separators (spaces, punctuation, etc.)

Example:

“Is it weird I don’t like coffee?”

If we split only on spaces, we get:

[“Is”, “it”, “weird”, “I”, “don’t”, “like”, “coffee?”]

  • We can see that coffee? ends up as a single token with the punctuation attached
  • If the text elsewhere contains coffee., the same word gets different tokens (and therefore different representations) just because of the punctuation, which is not ideal
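A minimal Python sketch of this split-on-spaces step (the sentence is the example above; str.split with no argument splits on whitespace):

```python
# Minimal whitespace tokenization: split the sentence on spaces only.
sentence = "Is it weird I don't like coffee?"

tokens = sentence.split()
print(tokens)
# ['Is', 'it', 'weird', 'I', "don't", 'like', 'coffee?']
```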

So we take the punctuation into account:

[“Is”, “it”, “weird”, “I”, “don”, “’”, “t”, “like”, “coffee”, “?”]

  • We can see that don’t was split into three tokens: don, ’, and t
  • A better split would be do and n’t
  • That way, the next time the tokenizer sees doesn’t, it can split it into does and n’t; since n’t has already been learned, the model can directly reuse that knowledge
  • This can be achieved by devising tokenization rules (a simple rule-based sketch follows this list)
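Such rules can be sketched with a couple of regular-expression substitutions. The pattern below only illustrates the idea; it is not the rule set of any real tokenizer:

```python
import re

def rule_based_tokenize(text):
    # Rule 1: split off the contraction "n't" as its own token (do n't, does n't, ...).
    text = re.sub(r"n't\b", " n't", text)
    # Rule 2: surround remaining punctuation with spaces so it becomes separate tokens.
    text = re.sub(r"([^\w\s'])", r" \1 ", text)
    return text.split()

print(rule_based_tokenize("Is it weird I don't like coffee?"))
# ['Is', 'it', 'weird', 'I', 'do', "n't", 'like', 'coffee', '?']
```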

Spaces, punctuation, and rule-based tokenization are all examples of word-based tokenization

Afterwards, each word is mapped to an ID. Each ID carries a lot of information, because a word in a sentence carries a lot of contextual and semantic information

This approach sounds good, but on a large corpus it results in a very large vocabulary

  • The SOTA model Transformer-XL, for example, uses space and punctuation tokenization, resulting in a vocabulary size of 267,735
  • A huge vocabulary produces a huge embedding matrix at the input and output layers, resulting in a large number of model parameters (and heavy resource usage)
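As a rough back-of-the-envelope check (the embedding dimension of 1024 is an assumed value for illustration, not a quoted figure):

```python
# Rough size of a single word-level embedding matrix.
vocab_size = 267_735      # Transformer-XL vocabulary size quoted above
embedding_dim = 1024      # assumed embedding dimension, for illustration only

params = vocab_size * embedding_dim
print(f"{params:,} embedding parameters")        # 274,160,640
print(f"~{params * 4 / 1e9:.1f} GB in float32")  # ~1.1 GB for this matrix alone
```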

To prevent huge vocabularies, we can limit the number of words that are added to the vocabulary

  • For example, only adding the most common 5000 words

  • The model will generate IDs for these words and mark the remaining words as OOV (Out Of Vocabulary); a minimal sketch of this follows the list

  • Disadvantage 1: this causes a lot of information loss, because the model learns nothing specific about OOV words; it learns the same OOV representation for every unknown word

  • Disadvantage 2: even misspelled words get marked as OOV
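A minimal sketch of a capped vocabulary with a single OOV token (the word cap, the <unk> symbol, and the helper names are placeholders, not any library's API):

```python
from collections import Counter

def build_vocab(corpus_tokens, max_size=5000, oov_token="<unk>"):
    """Keep only the most common words; everything else maps to one OOV id."""
    vocab = {oov_token: 0}
    for word, _ in Counter(corpus_tokens).most_common(max_size):
        vocab[word] = len(vocab)
    return vocab

def encode(tokens, vocab, oov_token="<unk>"):
    # Every out-of-vocabulary word collapses onto the same id, which is
    # exactly where the information loss described above comes from.
    return [vocab.get(tok, vocab[oov_token]) for tok in tokens]

vocab = build_vocab(["like", "coffee", "like", "tea"], max_size=2)
print(encode(["like", "coffee", "cofee"], vocab))  # misspelled "cofee" -> OOV id 0
```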

To address these shortcomings, character-based tokenization was introduced

Character-based tokenization

Split raw text into individual characters

  • Because each language has many different words, but a fixed number of letters
  • This produces a tiny dictionary

For example, English uses around 256 different characters (letters, numbers, special characters), while its dictionary contains nearly 170,000 words
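Character-level tokenization is trivial to sketch; note how much longer the sequence becomes compared with the word-level splits above:

```python
sentence = "Is it weird I don't like coffee?"

char_tokens = list(sentence)   # every character (spaces included) becomes a token
print(char_tokens[:10])        # ['I', 's', ' ', 'i', 't', ' ', 'w', 'e', 'i', 'r']
print(len(char_tokens))        # 32 character tokens vs. 7 space-separated words
```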

Advantages:

  • Produces a very small vocabulary
  • There are very few OOV words, so the per-character representations can be combined to build representations for words not seen during training
  • Misspelled words can still be tokenized character by character instead of being marked as OOV

Shortcomings:

  • Individual characters usually don't carry meaning/information the way words do
  • A sequence tokenized this way is much longer than its word-level counterpart

Note: in some languages, individual characters carry a lot of information, so this approach is still useful there

Subword-based tokenization

It sits between word-based and character-based tokenization and is mainly intended to solve the problems of the two methods above:

  • word-based: the vocabulary is too large, there are too many OOV tokens, and closely related words (like boy and boys) get completely separate representations
  • character-based: the sequences are too long, and individual tokens do not carry much meaning

Principles:

  • Do not split common words into smaller subwords
  • Split rare words into smaller meaningful subwords

For example: boy should not be split, but boys should be split into boy and s

  • This helps the model learn that the word "boys" is formed from the word "boy"; the two have slightly different meanings but share the same root.

We divide tokenization into token and ization

  • token is the root; this helps the model learn that words sharing the same root, such as tokens, have similar meanings

  • ization is a subword attached to the root as additional information. This helps the model learn that tokenization and modernization are built from different roots but share the suffix ization and are used in similar syntactic contexts

Another case is splitting surprisingly into surprising and ly, because these two subwords individually appear more frequently than surprisingly itself
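To illustrate the principle only (this is not BPE, WordPiece, or Unigram, whose vocabularies are learned from data), here is a greedy longest-match segmenter over a tiny hand-made subword vocabulary:

```python
# Toy greedy longest-match segmentation over a hand-made subword vocabulary.
SUBWORDS = {"boy", "s", "token", "ization", "modern", "surprising", "ly"}

def segment(word):
    pieces, start = [], 0
    while start < len(word):
        # Take the longest vocabulary entry that matches at this position.
        for end in range(len(word), start, -1):
            if word[start:end] in SUBWORDS:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No subword matches: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces

print(segment("boys"))           # ['boy', 's']
print(segment("tokenization"))   # ['token', 'ization']
print(segment("modernization"))  # ['modern', 'ization']
print(segment("surprisingly"))   # ['surprising', 'ly']
```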

These algorithms use special symbols to mark whether a token is the start of a word or the continuation of a word that has already been started

  • tokenization -> token and ##ization
  • Different models use different special symbols; ## is the one used by BERT (shown in the snippet after this list)
  • Special symbols can also be placed at the beginning of words
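If the transformers library is available, BERT's pretrained WordPiece tokenizer shows the ## convention directly (the bert-base-uncased checkpoint is downloaded on first use; the split in the comment is the expected output for that checkpoint):

```python
from transformers import AutoTokenizer

# Load BERT's pretrained WordPiece tokenizer.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("tokenization"))
# expected: ['token', '##ization']
```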

A few common subword-based tokenization algorithms are WordPiece, used by BERT and DistilBERT; Unigram, used by XLNet and ALBERT; and Byte-Pair Encoding, used by GPT-2 and RoBERTa.

This way the model has a reasonably sized vocabulary and can still learn meaningful context-independent representations. It also handles unseen words, since they can be decomposed into known subwords

Reference: https://towardsdatascience.com/word-subword-and-character-based-tokenization-know-the-difference-ea0976b64e17
