Background
As ChatGPT rapidly gained mainstream attention, open-source large language models have blossomed in recent months. At present, they fall into three main families: models derived from ChatGLM (wenda, ChatSQL, etc.), models derived from LLaMA (Alpaca, Vicuna, BELLE, Phoenix, Chimera, etc.), and models derived from Bloom (Bloomz, BELLE, Phoenix, etc.). Among them, ChatGLM-6B is trained mainly on Chinese and English, LLaMA is trained mainly on Latin-script languages with English predominating, and Bloom is trained on 46 natural languages and 13 programming languages.
Model | Training data volume | Model parameters | Training data scope | Vocabulary size | Tokenization algorithm | Tokenizer backend
---|---|---|---|---|---
LLaMA | 1T~1.4T tokens (7B/13B trained on 1T; 33B/65B on 1.4T) | 7B~65B | Latin-script languages, primarily English | 32,000 | BBPE | Implemented with the SentencePiece library
ChatGLM-6B | ~1T tokens | 6B | Bilingual Chinese and English | 130,528 | BBPE | Implemented with the SentencePiece library
Bloom | 1.6TB of preprocessed text | 560M~176B | 46 natural languages and 13 programming languages | 250,680 | BBPE | Implemented with the HuggingFace tokenizers library
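
Since the table largely comes down to tokenizer differences, a quick way to see them is to load each tokenizer and run the same bilingual sentence through it. The sketch below uses HuggingFace `AutoTokenizer`; the hub IDs are assumptions (`huggyllama/llama-7b` is an unofficial mirror, as the official LLaMA weights are gated), and ChatGLM-6B ships a custom tokenizer that requires `trust_remote_code=True`.

```python
# Minimal sketch: compare vocabulary size and Chinese tokenization
# across the three tokenizers from the table above.
from transformers import AutoTokenizer

# Mixed Chinese/English sample sentence.
text = "ChatGPT 火出圈后，开源大模型百花齐放。"

models = [
    ("LLaMA",      "huggyllama/llama-7b", {}),                         # unofficial mirror (assumption)
    ("ChatGLM-6B", "THUDM/chatglm-6b",    {"trust_remote_code": True}),
    ("Bloom",      "bigscience/bloom",    {}),
]

for name, model_id, kwargs in models:
    tok = AutoTokenizer.from_pretrained(model_id, **kwargs)
    pieces = tok.tokenize(text)
    # A larger, more multilingual vocabulary generally splits the
    # Chinese portion into fewer tokens than LLaMA's 32K vocab does.
    print(f"{name}: vocab={len(tok)}, tokens={len(pieces)}")
```

The token counts printed for the Chinese sentence make the training-scope column concrete: a vocabulary built mostly from Latin-script text must fall back to byte-level pieces for Chinese, while the ChatGLM-6B and Bloom vocabularies cover it far more compactly.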