The largest Llama open-source community in China releases the first pre-trained Chinese version of Llama2


"
On July 31, the Llama Chinese community took the lead in completing thefirst real Chinese version of the Llama2-13B large model, which greatly optimized and improved the Chinese capabilities of Llama2 from the bottom of the model. Undoubtedly, once the Chinese version of Llama2 is released, it will Open a new era of domestic large-scale model!


| The strongest in the world, but weak in Chinese

Llama2 is currently the most powerful open-source large model in the world, but its Chinese ability needs improvement. In the early morning of July 19, Meta lived up to expectations and released Llama2, an upgraded version of the first-generation LLaMA, in three sizes: 7B, 13B, and 70B, fully open and free for commercial use. As the most powerful open-source large model in the AI field, Llama2 is pre-trained on 2 trillion tokens of data and fine-tuned on 1 million human-annotated examples to obtain its dialogue model. In many benchmarks covering reasoning, programming, dialogue, and knowledge tests, it significantly outperforms open-source large language models such as MPT, Falcon, and the first-generation LLaMA, and it is the first open-source model to be comparable to the commercial GPT-3.5, making it unique among open-source models.

Although Llama2's pre-training data has doubled compared with the first generation, the proportion of Chinese pre-training data is still very small, only 0.13%, which leaves the original Llama2 weak in Chinese. We asked it some questions in Chinese and found that in most cases Llama2 could not answer in Chinese, or answered in a mix of Chinese and English. It is therefore necessary to optimize Llama2 on large-scale Chinese data so that it acquires stronger Chinese ability.
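As a concrete way to reproduce this kind of probe, the snippet below asks the original Llama2 chat model a question in Chinese and prints its reply. It is a minimal sketch that assumes access to the gated meta-llama/Llama-2-13b-chat-hf weights on Hugging Face and a GPU large enough for 13B half-precision inference; the prompt itself is illustrative, not the community's actual test set.

```python
# Minimal sketch: probing the original Llama2's Chinese ability with Hugging Face
# transformers. Assumes access to the gated meta-llama weights and enough GPU
# memory for a 13B model (the 7B variant works the same way).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-13b-chat-hf"  # gated; request access on Hugging Face first
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

# Ask a question in Chinese and check whether the reply stays in Chinese.
# For best quality you would wrap this in Llama2's [INST] chat template;
# a raw prompt is enough for a quick language check.
prompt = "请用中文介绍一下北京的历史。"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```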

For this reason, a team of large-model PhD researchers from top Chinese universities founded the Llama Chinese community and set out on the journey of training a Chinese Llama2 large model.

| The leading Llama Chinese community

The Llama Chinese community is the leading open-source large-model Chinese community in China; its GitHub repository reached 2.4k stars within two weeks. The community is led by PhD teams from Tsinghua University, Jiaotong University, and Zhejiang University, and has gathered more than 60 senior AI engineers and 2,000+ top talents from various industries.


Community link :
https://github.com/FlagAlpha/Llama2-Chinese

 Community History:




| The first pre-trained Chinese version of the Llama2 model is released!

Not a fine-tuned wrapper! It is pre-trained on a 200B-token Chinese corpus!

On July 31, the Llama Chinese community took the lead in completing the first true Chinese version of the 13B Llama2 model in China, Llama2-Chinese-13B, which greatly optimizes and improves Llama2's Chinese ability at the model's foundation.

There are roughly two routes to making Llama2 Chinese-capable:

1. Instruction fine-tuning of the pre-trained model on existing Chinese instruction datasets, so that the base model aligns with Chinese question answering. The advantage of this route is its low cost: the amount of instruction-tuning data is small, the compute required is modest, and a prototype Chinese Llama can be produced quickly. The drawback is equally obvious: fine-tuning can only elicit the Chinese ability the base model already has, and because Llama2 itself saw little Chinese training data, the ability that can be elicited is limited. A real fix still has to start from pre-training.

2. Continued pre-training on a large-scale Chinese corpus. The disadvantage of this route is its high cost: it requires not only large-scale, high-quality Chinese data but also large-scale computing resources. The advantage is equally clear: it optimizes Chinese ability at the bottom layer of the model, truly curing the root cause and injecting strong Chinese capability into the core of the large model.

To build a thoroughly Chinese large model from the kernel, we chose the second route! We collected a batch of high-quality Chinese corpus datasets and optimized the Llama2 large model starting from pre-training. Part of the pre-training data is as follows:

  • Network data: publicly available web data from the Internet, with high-quality Chinese data selected, covering encyclopedias, books, blogs, news, announcements, novels, and other high-quality long-text sources
  • Wudao: 200G of open-source Chinese data
  • Clue: Clue's open Chinese pre-training data, cleaned into high-quality Chinese long-text data
  • Competition datasets: Chinese natural language processing multi-task competition datasets from recent years, about 150 in total
  • MNBVC: a subset of datasets cleaned from MNBVC

The first-stage pre-training data of the Llama2-Chinese-13B model contains 200B tokens. Going forward, we will continue to iterate on Llama2-Chinese and gradually grow the pre-training data to 1T tokens. In addition, we will gradually open a Chinese pre-trained version of the 70B model, so stay tuned!
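To make route 2 concrete, here is a minimal sketch of continued causal-language-model pre-training on a Chinese corpus with Hugging Face transformers. The corpus path (zh_corpus/*.jsonl), the base checkpoint name, and all hyperparameters are illustrative assumptions for a single-node run, not the community's actual training setup, which used 200B tokens and large-scale distributed compute.

```python
# Minimal sketch of "route 2": continued pre-training of Llama2 on a Chinese corpus.
# Paths and hyperparameters are illustrative assumptions, not the community's config.
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base_model = "meta-llama/Llama-2-13b-hf"          # base (non-chat) weights
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.bfloat16)

# Hypothetical local corpus of cleaned Chinese long-text documents, one JSON object per line.
raw = load_dataset("json", data_files={"train": "zh_corpus/*.jsonl"}, split="train")

def tokenize(batch):
    # Standard causal-LM preprocessing: tokenize and truncate to a fixed context length.
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="llama2-zh-cpt",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=64,
        learning_rate=1e-5,
        num_train_epochs=1,
        bf16=True,
        logging_steps=50,
        save_steps=1000,
    ),
    train_dataset=tokenized,
    # mlm=False makes the collator produce next-token (causal) language-modeling labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()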
We tested the model across different aspects such as general knowledge, language understanding, creative ability, logical reasoning, code programming, and work skills, and obtained satisfactory results! The evaluated categories are listed below; a minimal prompting sketch follows the list.
  • general knowledge
  • language understanding
  • creative ability
  • logical reasoning
  • code programming
  • work skills
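The sketch below shows how such category-by-category prompting could be reproduced with transformers once the community checkpoint is downloaded. The Hugging Face model ID and the example prompts are assumptions; check the FlagAlpha/Llama2-Chinese repository for the exact checkpoint name and the recommended prompt format.

```python
# Minimal sketch: querying the community's Chinese Llama2 checkpoint across a few of
# the categories above. The model ID below is an assumption; consult the
# FlagAlpha/Llama2-Chinese repository for the exact name and prompt format.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "FlagAlpha/Llama2-Chinese-13b-Chat"  # hypothetical; verify against the repo
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

prompts = {
    "general knowledge": "中国的四大发明是什么？",
    "logical reasoning": "小明比小红高，小红比小刚高，谁最矮？",
    "code programming": "用Python写一个判断素数的函数。",
}
for category, question in prompts.items():
    inputs = tokenizer(question, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=200, do_sample=False)
    answer = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
    print(f"[{category}]\n{answer}\n")
```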




Origin: blog.csdn.net/zhaomengsen/article/details/132068086