Local deployment of TigerBot, a ChatGPT-style Chinese large language model

Introduction

TigerBot is a multilingual, multitask large language model (LLM). In automatic evaluation on public NLP datasets following the methodology of the OpenAI InstructGPT paper, TigerBot-7B reaches 96% of the comprehensive performance of the OpenAI model of the same size, and this is only our MVP. We are open-sourcing the following results of our exploration:

Model: TigerBot-7B, TigerBot-7B-base, TigerBot-180B (research version).
Code: basic training and inference code, including quantization and inference code for running the 180B model on two GPUs (a minimal local-inference sketch follows this list).
Data: 100 GB of pre-training data, denoised, deduplicated, and cleaned from 2 TB of filtered raw data; 1 GB (about 1 million entries) of supervised fine-tuning data, proportionally covering 10 major categories and 120 subcategories of common user instructions.
API: code-free training and use of your own large models and data.
Domain data: covering finance, law, and encyclopedias. Large-model application developers are invited to help create world-class applications in China.
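
Since local deployment is the point of this release, here is a minimal local-inference sketch using the Hugging Face transformers library (with accelerate installed for device_map). The model ID "TigerResearch/tigerbot-7b-sft" is an assumption about where the checkpoint is published; check the TigerBot repository for the exact model IDs and the recommended prompt format.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical Hugging Face model ID; verify against the TigerBot repository.
MODEL_ID = "TigerResearch/tigerbot-7b-sft"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,  # fp16 halves memory: a 7B model needs ~14 GB
    device_map="auto",          # spread layers across available GPUs (needs accelerate)
)

prompt = "请简要介绍一下 TigerBot。"  # "Please briefly introduce TigerBot."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128, do_sample=True, top_p=0.9)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

For the 180B research model the same loading pattern applies, but you would rely on the released quantization code to fit it onto two cards.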
Based on BLOOM, we have optimized the model architecture and algorithm as follows:

A novel algorithm for instruction-completion supervised fine-tuning that achieves better learnability.
Ensemble and probabilistic-modeling methods that achieve more controllable factuality and generativeness.
In parallel training, breakthroughs on several memory and communication problems in mainstream frameworks such as DeepSpeed, enabling months of uninterrupted training on a thousand-GPU cluster.
Algorithm optimizations, from the tokenizer through the training algorithm, better suited to the more irregular distribution of the Chinese language (see the tokenizer sketch below).
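
To illustrate why the tokenizer matters for Chinese, the sketch below compares the token counts that the base BLOOM tokenizer and a TigerBot tokenizer produce for the same sentence; fewer tokens per sentence roughly means cheaper training and inference. Both IDs are assumptions (the TigerBot ID in particular is hypothetical; check the repository).

```python
from transformers import AutoTokenizer

text = "虎博科技开源了多语言多任务的大规模语言模型。"  # a sample Chinese sentence

# "bigscience/bloom-7b1" is the public BLOOM base; the TigerBot ID is assumed.
for name in ["bigscience/bloom-7b1", "TigerResearch/tigerbot-7b-sft"]:
    tok = AutoTokenizer.from_pretrained(name)
    ids = tok(text)["input_ids"]
    print(f"{name}: {len(ids)} tokens")
```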


Source: blog.csdn.net/artistkeepmonkey/article/details/131111667