OpenBA, a bilingual asymmetric encoder-decoder model trained from scratch by Soochow University, has been officially open-sourced!
Key highlights include:
Highlight 1: This model contributes a representative encoder-decoder large language model to the Chinese open source community, and its training process (including data collection and cleaning, model construction and training) has been completely open source.
Highlight 2: In terms of data, all of the data used to train OpenBA is publicly available, making the model's capabilities more transparent.
Highlight 3: For Chinese instruction-following capabilities, we constructed a large-scale Chinese Flan data set based on open source annotated data and fully opened its construction method.
Highlight 4: With a training volume of only 380B tokens, OpenBA surpasses many models trained with the same parameter count and larger data volumes on a variety of Chinese and English downstream tasks.
Technical report and project address
Technical report:
OpenBA: An Open-sourced 15B Bilingual Asymmetric seq2seq Model Pre-trained from Scratch
https://arxiv.org/abs/2309.10706
Model:
https://huggingface.co/OpenBA
Project:
https://github.com/OpenNLG/OpenBA.git
Paper overview
The development of large language models is inseparable from the contributions of the open source community. In the field of Chinese open source, although there are excellent works such as GLM, Baichuan, Moss, and BatGPT, there are still the following gaps:
Mainstream open source large language models are mainly based on the decoder-only architecture or its variants, and the encoder-decoder architecture remains under-explored.
Many Chinese open source instruction data sets are generated by ChatGPT or translated from English, raising copyright and quality issues.
To fill these gaps, this work:
Adopts an asymmetric encoder-decoder architecture (shallow encoder, deep decoder) and integrates three training stages: UL2 multi-task training, length-adaptation training, and bilingual Flan training.
Constructs a Chinese Flan data set containing 50 million instructions covering 44 tasks, with the collection and construction methods fully open-sourced.
Pre-training data composition
OpenBA's pre-training data consists of 190B tokens of English data, 190B tokens of Chinese data, and 20B tokens of code data. The English and code data are sampled from The Pile, while the Chinese data mainly comes from a subset of Common Crawl and FudanNLPLAB's CBook-150K data set. The specific pre-training data composition is shown in the figure below:
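The 190B/190B/20B split above implies the following sampling proportions over the 400B-token budget. This is a minimal illustrative calculation, not OpenBA's actual data loader:

```python
# Sampling proportions implied by the 190B/190B/20B token budget described
# above. Illustrative only -- not the OpenBA data pipeline.
token_budget = {
    "english (The Pile)": 190e9,
    "chinese (Common Crawl subset + CBook-150K)": 190e9,
    "code (The Pile)": 20e9,
}
total = sum(token_budget.values())  # 400B tokens in total
weights = {name: n / total for name, n in token_budget.items()}

for name, w in weights.items():
    print(f"{name}: {w:.1%}")  # english 47.5%, chinese 47.5%, code 5.0%
```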
Bilingual Flan data collection
We selected The Flan Collection as the English Flan data set, while the Chinese Flan data set contains 50 million instructions, and its construction method is fully open. The distribution of the entire bilingual Flan data set and the specific composition of the Chinese Flan data set are given below.
Asymmetric Encoder-Decoder model structure
In terms of model structure selection, OpenBA tried three settings: (1) deeper decoder, (2) deeper encoder, (3) encoder and decoder with the same number of layers.
The paper argues that existing large language models are mainly decoder-only models, which excel at generation, and that deeper decoder layers help improve a model's generation capability.
To verify this point, the paper trains models under all three settings with the UL2 training objective and observes their performance on three denoising validation sets; performance on the S-Denoising task can be regarded as a measure of a model's generation capability.
The experiments show that the deeper-decoder setting performs best on the S-Denoising task, confirming the effectiveness of deeper decoders for generation tasks.
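The three depth settings compared above can be sketched as equal-depth configurations; the layer counts below are illustrative placeholders, not OpenBA's published hyperparameters:

```python
# The three depth settings compared in the paper, as simple configs.
# Layer counts are illustrative placeholders (equal total depth), not the
# published OpenBA hyperparameters.
settings = {
    "deep_decoder": {"encoder_layers": 12, "decoder_layers": 36},
    "deep_encoder": {"encoder_layers": 36, "decoder_layers": 12},
    "balanced":     {"encoder_layers": 24, "decoder_layers": 24},
}

# Holding total depth fixed means any gap on the S-Denoising task reflects
# where the capacity sits (encoder vs. decoder), not how much there is.
totals = {name: cfg["encoder_layers"] + cfg["decoder_layers"]
          for name, cfg in settings.items()}
```

Frameworks such as Hugging Face transformers support this kind of asymmetry directly (e.g. `T5Config` takes separate `num_layers` and `num_decoder_layers` arguments).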
Three-stage pre-training integrated with UL2
As shown in the figure above, OpenBA has gone through three stages of pre-training, namely:
UL2 pre-training: this stage mainly involves three tasks: R-Denoising with a small number of random masks, X-Denoising with a large number of random masks, and S-Denoising with sequential masks.
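The three denoising objectives can be illustrated with a toy span-corruption sketch. The mask rates, span lengths, and sentinel format below are illustrative, not OpenBA's exact hyperparameters:

```python
import random

def span_corrupt(tokens, span_len, mask_rate, seed=0):
    """Mask contiguous spans with sentinel tokens; return (input, target)."""
    rng = random.Random(seed)
    n_to_mask = max(1, int(len(tokens) * mask_rate))
    masked = [False] * len(tokens)
    while sum(masked) < n_to_mask:
        start = rng.randrange(len(tokens))
        for i in range(start, min(start + span_len, len(tokens))):
            masked[i] = True
    inp, tgt, sentinel, i = [], [], 0, 0
    while i < len(tokens):
        if masked[i]:
            inp.append(f"<extra_id_{sentinel}>")
            tgt.append(f"<extra_id_{sentinel}>")
            while i < len(tokens) and masked[i]:
                tgt.append(tokens[i])
                i += 1
            sentinel += 1
        else:
            inp.append(tokens[i])
            i += 1
    return inp, tgt

toks = "the quick brown fox jumps over the lazy dog".split()

# R-Denoising: short spans, low mask rate
r_in, r_tgt = span_corrupt(toks, span_len=2, mask_rate=0.15)
# X-Denoising: longer spans, high mask rate
x_in, x_tgt = span_corrupt(toks, span_len=4, mask_rate=0.5)
# S-Denoising: mask the suffix so the model learns sequential continuation
split = len(toks) // 2
s_in, s_tgt = toks[:split] + ["<extra_id_0>"], toks[split:]
```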
Length-adaptation training: at this stage, OpenBA extends the maximum input/output lengths from 570/380 to 1024/1024 and trains only on the continuation task. The purpose of this step is to adapt the model to downstream tasks requiring longer context and to further enhance its generation capability.
Bilingual Flan training: in this stage, OpenBA is fine-tuned on the bilingual Flan data set to give the model a stronger ability to follow instructions.
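A Flan-style instruction example pairs a task template with an input and a target answer. The template and helper below are hypothetical illustrations; the actual Chinese Flan templates are released with the OpenBA project:

```python
# Illustrative Flan-style formatting: render an annotated example into an
# (input, target) pair for seq2seq fine-tuning. The template and fields are
# hypothetical, not the released Chinese Flan templates.
def to_flan_pair(instruction: str, input_text: str, answer: str) -> dict:
    """Concatenate instruction and input as the source; answer is the target."""
    return {"input": f"{instruction}\n{input_text}", "target": answer}

pair = to_flan_pair(
    "判断下面句子的情感倾向(积极/消极):",  # sentiment-classification instruction
    "这部电影太好看了!",
    "积极",
)
```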
Experimental results
OpenBA was evaluated on multiple commonly used Chinese and English benchmarks (MMLU, CMMLU, C-Eval, BBH, SuperGLUE, etc.) under different settings (zero-shot, few-shot, held-in, held-out), covering commonsense reasoning, natural language generation, and natural language understanding tasks.
OpenBA achieves competitive results across tasks and settings. Below are some evaluation results for OpenBA on BELEBELE (natural language understanding), ROC Story (natural language generation), and CMMLU (logical reasoning).
OpenBA's manual evaluation results on ROC Story (story generation):
OpenBA’s automatic indicator results on CMMLU (Chinese logical reasoning):
Summary
Although OpenBA uses only 380B training tokens, it achieves excellent performance on numerous benchmarks, even surpassing models trained on more data. Soochow University has open-sourced the checkpoints from each stage of OpenBA, along with the construction method of the Chinese Flan data set, for researchers to use.
The next phase of OpenBA's work will further explore a general chat model, a tool-calling model, and debiasing and alignment (see the technical report for details).
If you are interested in OpenBA, welcome to cooperate and contribute to the open source community together.