Summary of tokens in NLP

A token can be understood as the smallest unit of text. In English, a token can be a word or a punctuation mark; in Chinese, characters or words are usually used as tokens. ChatGPT splits the input text into tokens so that the model can process and understand it.
In natural language processing (NLP), "token" refers to a basic unit of text, which can be a word, a phrase, a punctuation mark, a character, and so on, depending on the needs and methods of text processing.

Dividing text into tokens is the first step in text processing. This process is called "tokenization".
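The following is a minimal sketch in Python of what tokenization looks like in practice, assuming the `tiktoken` package (a BPE tokenizer library used for OpenAI models) is installed. The example text and the word-splitting regex are illustrative; the exact token boundaries always depend on the tokenizer being used.

```python
# A minimal sketch of tokenization, assuming the `tiktoken` package is installed.
import re

import tiktoken

text = "ChatGPT splits the input text into tokens."

# Word-level tokenization: split into words and punctuation marks.
word_tokens = re.findall(r"\w+|[^\w\s]", text)
print(word_tokens)
# ['ChatGPT', 'splits', 'the', 'input', 'text', 'into', 'tokens', '.']

# Subword (BPE) tokenization, as used by ChatGPT-style models.
enc = tiktoken.get_encoding("cl100k_base")
token_ids = enc.encode(text)
print(token_ids)                             # integer token ids
print([enc.decode([t]) for t in token_ids])  # the text piece of each token
```

Note that the subword tokenizer may split a single word into several tokens or merge a leading space into a token, which is why token counts for ChatGPT-style models do not match word counts.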
For details, see: https://blog.csdn.net/David_house/article/details/131065079
