Analysis of baichuan-7B, an open-source large model from Baichuan Intelligence

From: Eat jelly without spitting out jelly skin


baichuan-7B is improved mainly with reference to LLaMA, and its model architecture is consistent with LLaMA's. Among open-source large models, LLaMA is undoubtedly the brightest star, but it has the following problems:

  • LLaMA natively targets languages written in Latin or Cyrillic scripts and was trained with only a small amount of Chinese data, so its Chinese support is not ideal.

  • The original LLaMA vocabulary is only 32K and contains few Chinese tokens, so its decoding efficiency for Chinese is low.

The improvements of baichuan-7B are as follows:

Effect improvements: aimed at improving model quality and decoding efficiency.

  • Tokenizer improvement: the vocabulary size is 64K, versus 32K for LLaMA. The tokenizer was trained on a multilingual corpus of about 20 million entries, mainly Chinese and English, which significantly improves the compression rate for Chinese (a quick comparison sketch follows this list).

  • Dataset improvement: about 1.2T Chinese and English tokens were used for training, drawn from cleaned open-source Chinese and English data, self-crawled Chinese internet data, and some high-quality knowledge-oriented data, while LLaMA 7B was trained on 1T English tokens.
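
A quick way to see the tokenizer difference in practice is to count how many tokens each model needs for the same Chinese sentence. The sketch below uses Hugging Face transformers; the checkpoint names are illustrative, it assumes both tokenizers can be downloaded from the Hub, and the exact counts will vary with tokenizer versions.

```python
from transformers import AutoTokenizer

# Illustrative checkpoint names; assumes both are available on the Hugging Face Hub.
llama_tok = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
baichuan_tok = AutoTokenizer.from_pretrained("baichuan-inc/Baichuan-7B", trust_remote_code=True)

text = "百川智能发布了开源可商用的大模型baichuan-7B。"

# A larger, Chinese-aware vocabulary should need noticeably fewer tokens
# for the same sentence, i.e. a better compression rate for Chinese.
print("LLaMA (32K vocab):      ", len(llama_tok.tokenize(text)), "tokens")
print("baichuan-7B (64K vocab):", len(baichuan_tok.tokenize(text)), "tokens")
```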

Technical improvements: aimed at improving training stability and throughput.

  • Operator optimization: use more efficient operators, such as Flash-Attention and the RMSNorm from NVIDIA apex (a plain-PyTorch RMSNorm reference appears after this list).

  • Operator segmentation: split some compute operators to reduce peak memory usage (the checkpointing sketch after this list illustrates the general trade-off).

  • Mixed precision: speeds up computation without losing model accuracy (see the training-step sketch after this list).

  • Training disaster recovery: joint optimization of the training platform and training framework (IaaS + PaaS) to achieve minute-level fault localization and task recovery.

  • Communication optimization, specifically including:

    • Topology-aware collective communication algorithms to avoid network congestion and improve communication efficiency.

    • Adaptively setting the gradient-bucket size according to the number of GPUs to improve bandwidth utilization (the training-step sketch after this list shows the corresponding PyTorch knob).

    • Adjusting when communication primitives are triggered, based on the model and cluster environment, so that computation and communication overlap.
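
As a reference point for the operator-optimization bullet, the RMSNorm used in LLaMA-style models is easy to write out in plain PyTorch; fused kernels such as NVIDIA apex's FusedRMSNorm and Flash-Attention implement the same math with far fewer memory round-trips. A minimal, unfused sketch (not Baichuan's actual kernel):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square layer norm (no mean subtraction, no bias), as used in
    LLaMA-style models. Fused kernels such as apex's FusedRMSNorm compute the
    same thing with fewer memory round-trips."""

    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Accumulate in float32 for numerical stability, then cast back.
        variance = x.float().pow(2).mean(-1, keepdim=True)
        x = (x.float() * torch.rsqrt(variance + self.eps)).to(x.dtype)
        return self.weight * x
```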
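
The write-up does not say exactly how operators are segmented. One common way to trade extra compute for a lower activation-memory peak is to checkpoint each block, sketched below with torch.utils.checkpoint; this is an assumption about the general technique, not a description of Baichuan's implementation.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    """A stand-in transformer-style block used only for illustration."""

    def __init__(self, dim: int):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.ff(x)

class CheckpointedStack(nn.Module):
    def __init__(self, dim: int, depth: int):
        super().__init__()
        self.blocks = nn.ModuleList(Block(dim) for _ in range(depth))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            # Activations inside each block are recomputed during backward
            # instead of being stored, lowering the memory peak at the cost
            # of extra compute.
            x = checkpoint(block, x, use_reentrant=False)
        return x
```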
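
For the mixed-precision and bucket-size bullets, the standard PyTorch knobs are torch.autocast and DistributedDataParallel's bucket_cap_mb. The sketch below only shows where those knobs live; the dtype, the bucket size, and the assumption of a Hugging Face-style model that returns a .loss are illustrative, not Baichuan's training code.

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_ddp(model: torch.nn.Module, local_rank: int) -> DDP:
    # bucket_cap_mb controls how gradients are grouped into buckets for
    # all-reduce; the value here is a placeholder to be tuned per cluster.
    return DDP(model.to(local_rank), device_ids=[local_rank], bucket_cap_mb=100)

def train_step(ddp_model: DDP, batch: dict, optimizer: torch.optim.Optimizer):
    optimizer.zero_grad(set_to_none=True)
    # Mixed precision: forward/backward in bfloat16 while master weights and
    # optimizer state stay in fp32. Assumes the wrapped model returns an
    # output object with a .loss field (Hugging Face style).
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = ddp_model(**batch).loss
    loss.backward()
    optimizer.step()
    return loss.detach()
```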

In addition, the model is open source and available for commercial use, which is another advantage.

It appears that today's large models have little room left for improvement at the algorithmic level; most performance gains now come from engineering and data.

Finally, I hope that domestic large models keep getting better and better~




Origin: blog.csdn.net/qq_27590277/article/details/131266726