[NLP] Common tokenization (word segmentation) methods - Byte Pair Encoding (BPE)

The difference between tokenization and embedding

In natural language processing, Tokenization and Embedding are two important concepts.

Tokenization is the process of dividing natural language text into different lexical units (tokens). This process usually involves segmenting words, punctuation marks, numbers, etc. in the text, and removing some noise or redundant information. The result of Tokenization is a sequence containing several lexical units, which can be used to construct a vector representation of the text or input into a deep learning model for processing.

Embedding is the process of mapping vocabulary units to vector space, usually using some pre-trained word vector models (such as Word2Vec, GloVe and FastText, etc.). These models map each lexical unit to a fixed-length vector such that the vector representation of each lexical unit captures its semantic and contextual information. Embedding is often used in tasks such as text classification, machine translation, and language generation. It can convert lexical units in text sequences into vector sequences, which facilitates model processing.

In general, Tokenization and Embedding are two basic operations in natural language processing. Tokenization divides text sequences into lexical units, and Embedding maps lexical units into vector spaces and provides vector representations for text sequences. These operations are essential in many natural language processing tasks.
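
To make the distinction concrete, here is a minimal Python sketch of the two steps, assuming a toy whitespace tokenizer and a random embedding table standing in for a pre-trained model such as Word2Vec:

```python
import numpy as np

# Tokenization step (toy): split the sentence into lexical units (tokens).
sentence = "I love natural language processing"
tokens = sentence.lower().split()          # ['i', 'love', 'natural', 'language', 'processing']

# Embedding step (toy): in practice the table would come from Word2Vec / GloVe / FastText;
# here random fixed-length vectors stand in for learned embeddings.
vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}
embedding_dim = 8
embedding_table = np.random.randn(len(vocab), embedding_dim)

# Map each token id to its vector, producing a vector sequence for the model.
token_ids = [vocab[tok] for tok in tokens]
vectors = embedding_table[token_ids]       # shape: (num_tokens, embedding_dim)
print(vectors.shape)                       # (5, 8)
```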

(Figure: an illustration of the difference between tokenization and embedding.)

Common tokenization methods

There are three main granularities: word granularity (word-level), character granularity (character-level), and subword granularity (subword-level).
The difference between word-level LMs (word-based language models) and byte-level LMs (byte-based language models) is that they use different basic units when processing text.

word granularity

In this section we discuss methods related to word granularity. Word-granularity segmentation follows the same principle as how humans naturally read and understand text. It can be done with tools such as NLTK and SpaCy for English, and jieba and HanLP for Chinese.

First, let's look intuitively at Tokenization at the word granularity. Unsurprisingly, it matches the natural segmentation we use when reading.
(Figure: an example of word-granularity tokenization.)
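
As a quick sketch, the tools mentioned above can do word-granularity tokenization in a couple of lines; this assumes nltk and jieba are installed and NLTK's "punkt" tokenizer data has been downloaded:

```python
# pip install nltk jieba
import nltk
import jieba

nltk.download("punkt")  # NLTK's pre-trained word/sentence tokenizer data

english = "Tokenization preserves word boundaries, doesn't it?"
print(nltk.word_tokenize(english))
# ['Tokenization', 'preserves', 'word', 'boundaries', ',', 'does', "n't", 'it', '?']

chinese = "自然语言处理很有趣"
print(jieba.lcut(chinese))
# e.g. ['自然语言', '处理', '很', '有趣']
```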

The advantage of this method is that it preserves the semantic and boundary information of words well.

For details, please refer to: https://zhuanlan.zhihu.com/p/444774532

In word-level LMs, the basic unit is a word. The input to the model is a sequence of words, and the output is a probability distribution predicting the next word. For example, given the sentence "I am going to the", the task of the model is to predict what the next word might be, such as "store" or "park", etc.

Drawbacks of word-granularity segmentation:

  1. Word-granularity methods require building a very large dictionary, which seriously hurts computational efficiency and consumes a lot of memory.
  2. Even if such a large dictionary did not affect efficiency, OOV (out-of-vocabulary) problems would still arise, because human language keeps evolving and new words keep appearing, for example: Needle No Poking, Niubility, Sixology, etc. (see the sketch after this list).
  3. Low-frequency/sparse words in the vocabulary cannot be trained sufficiently, so the model cannot fully learn their semantics.
  4. A word takes on different surface forms, such as "looks" and "looking" derived from "look"; they have similar meanings, and training every form separately is unnecessary.
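
Here is a minimal sketch of the OOV problem from point 2: with a fixed word-level vocabulary, any newly coined word collapses into a single <UNK> token and its meaning is lost (the tiny vocabulary below is made up for illustration).

```python
# A fixed word-level vocabulary built from training data.
vocab = {"<UNK>": 0, "i": 1, "love": 2, "this": 3, "movie": 4}

def encode(sentence):
    # Any word not seen at training time falls back to <UNK>.
    return [vocab.get(word, vocab["<UNK>"]) for word in sentence.lower().split()]

print(encode("I love this movie"))        # [1, 2, 3, 4]
print(encode("I love this niubility"))    # [1, 2, 3, 0]  <- the new slang word becomes <UNK>
```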

character granularity

Character granularity (also called char-level segmentation) splits text according to the smallest written symbols of a language; character-level modeling was probably first popularized by Karpathy around 2015. Simply put, for English (Latin script) the basic unit is the letter.

Chinese, Japanese, Korean, etc. are segmented in units of characters. For example:
(Figure: an example of character-level segmentation.)
Its advantage is that the vocabulary shrinks dramatically: the 26 English letters can cover almost all English words, and more than 5,000 Chinese characters can be combined to cover most of the Chinese vocabulary. Apart from this, however, the method has mostly disadvantages. Most importantly, it seriously loses the semantic and boundary information of words, which is disastrous for the model. Moreover, splitting text so finely makes the input much longer and increases the computational load: the price of a smaller vocabulary is a greatly increased input length, which makes computation more time-consuming.
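
A minimal sketch of character-granularity segmentation, also showing how much longer the input sequence becomes compared with word-level splitting:

```python
english = "character level tokenization"
chinese = "自然语言处理"

# Character granularity: every letter / Chinese character is its own token.
char_tokens_en = list(english)
char_tokens_zh = list(chinese)
print(char_tokens_zh)   # ['自', '然', '语', '言', '处', '理']

# The vocabulary shrinks, but the sequence length grows sharply.
print(len(english.split()), "word tokens vs", len(char_tokens_en), "character tokens")
# 3 word tokens vs 28 character tokens
```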

There is not much to say about this Tokenization method, because it is generally not used in reality.

Correspondingly, in byte-level LMs the basic unit is a byte. The input to the model is a sequence of bytes, and the output is a probability distribution predicting the next byte. For example, given the byte sequence "01000101 01101110 01100111 01101100 01101001 01110011 01101000" (the ASCII/UTF-8 bytes of "English"), the task of the model is to predict what the next byte will be, such as a space or "!".

It should be noted that byte-level LMs may encounter some problems when dealing with Chinese text, because Chinese characters are usually composed of multiple bytes, and the combinations of these bytes are very diverse. Therefore, in order to process Chinese text, it is usually necessary to use more complex character-level language models, the basic unit of which is a character.
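
A short sketch of the byte view of text: ASCII letters map to single bytes (matching the binary sequence quoted above), whereas a single Chinese character occupies several bytes in UTF-8, which is why byte-level models see much longer and more fragmented sequences for Chinese.

```python
english = "English"
chinese = "中"

print(list(english.encode("utf-8")))    # [69, 110, 103, 108, 105, 115, 104] -> one byte per letter
print([format(b, "08b") for b in english.encode("utf-8")][:3])
# ['01000101', '01101110', '01100111'] -> the start of the binary sequence above

print(list(chinese.encode("utf-8")))    # [228, 184, 173] -> one Chinese character spans 3 bytes
```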

subword granularity

First, let's talk about what an ideal Tokenization would look like. In a nutshell, the vocabulary should be as small as possible while still covering most words and minimizing OOV occurrences. In addition, each token in the vocabulary should carry meaning; in other words, each subword cut out of a word should be meaningful, and the splits should not be too fine-grained.

However, both of the previous methods have shortcomings, and neither can meet all of these requirements at the same time. So is there a way to balance them? Yes: subword segmentation. Note, though, that this method mainly applies to languages like English; for Chinese it is not practical to split a character into radicals and components.

So how does subword Tokenization split words? Another example:
(Figure: an example of subword tokenization.)
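
To make the idea concrete, here is a toy greedy longest-match segmenter over a hand-made subword vocabulary; the vocabulary and the "##" continuation convention are only for illustration, not a trained tokenizer:

```python
# A hand-made subword vocabulary; real ones are learned by algorithms such as BPE.
subword_vocab = {"look", "##s", "##ing", "##ed", "token", "##ize", "##r"}

def segment(word):
    """Greedy longest-match subword segmentation of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in subword_vocab:
                pieces.append(piece)
                break
            end -= 1
        else:
            return ["<UNK>"]   # no subword matched
        start = end
    return pieces

print(segment("looking"))     # ['look', '##ing']
print(segment("looks"))       # ['look', '##s']
print(segment("tokenizer"))   # ['token', '##ize', '##r']
```
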
This raises another question: how do we segment, or rather, how do we construct the subword dictionary? There are currently four main methods, shown below.
(Figure: the four main subword segmentation methods.)

Please refer to the specific four word segmentation methods: https://zhuanlan.zhihu.com/p/444774532

BPE

For BPE related information: https://zhuanlan.zhihu.com/p/424631681
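
As a complement to the links, below is a minimal sketch of the BPE training loop in the spirit of Sennrich et al. (2016): count the frequency of adjacent symbol pairs over a word-frequency dictionary and repeatedly merge the most frequent pair. The toy corpus and the number of merges are made up for illustration.

```python
import re
from collections import Counter

def get_stats(vocab):
    """Count the frequency of every adjacent symbol pair in the vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Merge every occurrence of the given symbol pair into a single symbol."""
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Words are stored as space-separated symbols, with </w> marking the end of a word.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

num_merges = 10
for _ in range(num_merges):
    pairs = get_stats(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
    print(best)   # e.g. ('e', 's'), ('es', 't'), ('est', '</w>'), ...

# The learned merge rules define the subword vocabulary used at tokenization time.
```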

References:
https://zhuanlan.zhihu.com/p/444774532

Origin blog.csdn.net/weixin_42468475/article/details/131264705