LLM: SentencePiece (an essential tool for vocabulary expansion)

Background

Since ChatGPT went viral, open-source large language models have proliferated. They currently fall into three main families: models derived from ChatGLM (Wenda, ChatSQL, etc.), models derived from LLaMA (Alpaca, Vicuna, BELLE, Phoenix, Chimera, etc.), and models derived from Bloom (Bloomz, BELLE, Phoenix, etc.). Among them, ChatGLM-6B is trained mainly on Chinese and English, LLaMA mainly on Latin-script languages with English dominant, and Bloom on 46 natural languages and 13 programming languages.

| Model | Training data volume | Parameters | Training data languages | Vocabulary size | Tokenization algorithm | Tokenizer backend |
| --- | --- | --- | --- | --- | --- | --- |
| LLaMA | 1T~1.4T tokens (7B/13B use 1T; 33B/65B use 1.4T) | 7B~65B | Latin-script languages, mainly English | 32,000 | BBPE | Built on the SentencePiece library |
| ChatGLM-6B | ~1T tokens | 6B | Chinese and English bilingual | 130,528 | BBPE | Built on the SentencePiece library |
| Bloom | 1.6TB of preprocessed text | 560M~176B | 46 natural languages and 13 programming languages | 250,680 | BBPE | Built on the HuggingFace tokenizers library |
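The table above is the motivation for vocabulary expansion: a byte-level BPE (BBPE) tokenizer never produces out-of-vocabulary errors because it can always fall back to raw UTF-8 bytes, but with a small, English-centric vocabulary like LLaMA's 32,000 pieces, a single Chinese character can decompose into three byte tokens. The toy tokenizer below is a minimal sketch (not LLaMA's or SentencePiece's actual implementation) illustrating this byte-fallback behavior and how an expanded vocabulary shortens the token sequence; the vocabularies shown are hypothetical.

```python
def tokenize_with_fallback(text, vocab):
    """Greedy longest-match tokenizer with UTF-8 byte fallback.

    A toy stand-in for a BBPE tokenizer: spans found in `vocab` become
    single tokens; anything else is emitted one byte at a time, so no
    input is ever out-of-vocabulary.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary match starting at position i.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # Byte fallback: one token per UTF-8 byte of this character.
            tokens.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return tokens


base_vocab = {"Hello", ",", " ", "world"}       # hypothetical English-only vocab
expanded_vocab = base_vocab | {"你好", "世界"}   # after vocabulary expansion

print(tokenize_with_fallback("你好", base_vocab))      # 6 byte tokens (3 per character)
print(tokenize_with_fallback("你好", expanded_vocab))  # 1 token
```

This 6-to-1 difference is why projects like Chinese-LLaMA extend the base vocabulary with tokens trained on Chinese text before continuing pre-training: fewer tokens per sentence means a longer effective context and cheaper inference on Chinese input.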

Origin blog.csdn.net/u013250861/article/details/132248345