OpenBA: Another member of the open source model family! A 15B Chinese-English bilingual model with an asymmetric encoder-decoder architecture, trained from scratch...


OpenBA, a bilingual asymmetric encoder-decoder model trained from scratch by Soochow University, has been officially open sourced!

Key highlights include:

  • Highlight 1: The model contributes a representative encoder-decoder large language model to the Chinese open source community, and its entire training process (data collection and cleaning, model construction, and training) has been fully open sourced.

  • Highlight 2: All data used by OpenBA is publicly available, making the model's capabilities more transparent.

  • Highlight 3: For Chinese instruction-following capabilities, we constructed a large-scale Chinese Flan data set from open source annotated data and fully opened its construction method.

  • Highlight 4: With a training volume of only 380B tokens, the model surpasses many models of comparable size trained on more data across a variety of Chinese and English downstream tasks.

Technical report and project address

Technical report:
OpenBA: An Open-sourced 15B Bilingual Asymmetric seq2seq Model Pre-trained from Scratch
https://arxiv.org/abs/2309.10706

Model:
https://huggingface.co/OpenBA

Project:
https://github.com/OpenNLG/OpenBA.git
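
For reference, here is a minimal loading sketch with the Hugging Face transformers library. The checkpoint name used below is hypothetical, and whether trust_remote_code is required is an assumption; check the model card on the hub before use.

```python
# Minimal loading sketch (assumes a seq2seq-style checkpoint under the OpenBA
# organization; the exact repo name below is hypothetical).
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "OpenBA/OpenBA-LM"  # hypothetical checkpoint name; see the hub page

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, trust_remote_code=True)

inputs = tokenizer("Suzhou is famous for", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```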

Paper overview

The development of large language models is inseparable from the contributions of the open source community. In the Chinese open source field, despite excellent works such as GLM, Baichuan, Moss, and BatGPT, the following gaps remain:

  1. Mainstream open source large language models are mostly based on the decoder-only architecture or its variants; the encoder-decoder architecture remains under-explored.

  2. Many Chinese open source instruction data sets are generated by ChatGPT or translated from English, raising copyright and quality concerns.

To fill these gaps, this work:

  1. Adopts an asymmetric encoder-decoder architecture (shallow encoder, deep decoder) and combines three training stages: UL2 multi-task training, length adaptation training, and bilingual Flan training.

  2. Constructs a Chinese Flan data set of 50 million instructions covering 44 tasks, with the collection and construction methods fully open.

Pre-training data composition

OpenBA's pre-training data consists of 190B tokens of English data, 190B tokens of Chinese data, and 20B tokens of code data. The English and code data are sampled from The Pile, while the Chinese data mainly comes from a subset of Common Crawl and FudanNLPLAB's CBook-150K data set. The specific composition is shown in the figure below:

[Figure: OpenBA pre-training data composition]
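
As a quick reference alongside the figure, the share of each source can be computed from the token counts given above; the breakdown below is illustrative only and uses exactly the numbers in the text.

```python
# Illustrative only: token budget per source as described in the text (in billions).
data_mix_tokens_b = {
    "English (The Pile)": 190,
    "Chinese (Common Crawl subset + CBook-150K)": 190,
    "Code (The Pile)": 20,
}

total = sum(data_mix_tokens_b.values())  # 400B tokens collected in total
for source, tokens in data_mix_tokens_b.items():
    print(f"{source}: {tokens}B tokens ({tokens / total:.1%})")
```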

Bilingual Flan data collection

We selected the Flan Collection as the English Flan data set, while the Chinese Flan data set consists of 50 million instructions whose construction method is completely open. The distribution of the full bilingual Flan data set and the specific composition of the Chinese Flan data set are shown below.

[Figures: distribution of the bilingual Flan data set and composition of the Chinese Flan data set]
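
For intuition about how such a data set can be built from open source annotated data, here is a minimal, hypothetical sketch of Flan-style template rendering. The task, templates, and field names are illustrative assumptions, not the actual recipe released with OpenBA.

```python
import random

# Hypothetical templates for one of the 44 task types (sentiment classification).
# The wording and field names are illustrative, not the released data set's templates.
SENTIMENT_TEMPLATES = [
    "判断下面这条评论的情感倾向（积极或消极）：{text}",
    "评论：{text}\n这条评论表达的情感是积极还是消极？",
]

def to_flan_example(example: dict) -> dict:
    """Render one annotated (text, label) pair as an instruction/response pair."""
    template = random.choice(SENTIMENT_TEMPLATES)
    return {
        "instruction": template.format(text=example["text"]),
        "response": example["label"],
    }

if __name__ == "__main__":
    print(to_flan_example({"text": "这家餐厅的菜很好吃，服务也周到。", "label": "积极"}))
```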

Asymmetric Encoder-Decoder model structure

In terms of model structure selection, OpenBA tried three settings: (1) deeper decoder, (2) deeper encoder, (3) encoder and decoder with the same number of layers.

The paper argues that existing large language models are mainly decoder-only structures with strong generation capabilities, and that a deeper decoder should therefore help improve a model's generation ability.

To verify this, the authors ran an experiment: models with the three settings above were trained with the UL2 objective and evaluated on three denoising validation sets. Performance on the S-Denoising task can be regarded as a measure of a model's generation capability.

[Figure: performance of the three architecture settings on the three denoising validation sets]

The experiment shows that the deeper-decoder setting performs best on the S-Denoising task, confirming the effectiveness of a deeper decoder for generation.
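
To make the "shallow encoder, deep decoder" setting concrete, here is a minimal sketch of an asymmetric configuration expressed with a standard Hugging Face T5-style config. The layer counts and dimensions are placeholders for illustration, not OpenBA's published hyperparameters.

```python
# Illustrative only: an asymmetric encoder-decoder with a shallow encoder and a
# deep decoder, expressed via a T5-style config. Numbers are placeholders.
from transformers import T5Config, T5ForConditionalGeneration

config = T5Config(
    num_layers=6,           # encoder depth (shallow)
    num_decoder_layers=18,  # decoder depth (deep)
    d_model=512,
    num_heads=8,
    d_ff=2048,
)
model = T5ForConditionalGeneration(config)

print(f"encoder layers: {len(model.encoder.block)}, "
      f"decoder layers: {len(model.decoder.block)}")
```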

Three-stage pre-training integrated with UL2

[Figure: OpenBA's three-stage pre-training process]

As shown in the figure above, OpenBA has gone through three stages of pre-training, namely:

  • UL2 pre-training: This stage mainly involves three tasks: R-Denoising with a small amount of random masking, X-Denoising with a large amount of random masking, and S-Denoising with sequential masking (a sketch of the three denoiser configurations follows this list).

  • Length adaptation training: At this stage, OpenBA extends the maximum input/output length from 570/380 to 1024/1024 and focuses only on the continuation task. The purpose is to adapt the model to downstream tasks that require longer context and to further strengthen its generation capability.

  • Bilingual Flan training: In this phase, OpenBA is fine-tuned on the bilingual Flan data set to strengthen its instruction-following ability.
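
For the UL2 stage referenced above, the following is a schematic sketch of the three denoising objectives. The corruption rates, span lengths, and sentinel tokens follow the spirit of the original UL2 setup and are assumptions here, not OpenBA's exact settings.

```python
# Schematic sketch of R-/X-/S-Denoising. Rates and span lengths are assumptions
# in the spirit of UL2, not OpenBA's published hyperparameters.
import random

DENOISERS = {
    "R": {"corruption_rate": 0.15, "mean_span": 3},    # regular: light, short-span masking
    "X": {"corruption_rate": 0.50, "mean_span": 32},   # extreme: heavy / long-span masking
    "S": {"corruption_rate": 0.25, "mean_span": None}, # sequential: mask a suffix (prefix-LM style)
}

def corrupt(tokens: list[str], mode: str) -> tuple[list[str], list[str]]:
    """Return (corrupted input, target) for one denoising mode."""
    cfg = DENOISERS[mode]
    n = len(tokens)
    if mode == "S":  # sequential denoising: the target is the trailing chunk of the sequence
        cut = int(n * (1 - cfg["corruption_rate"]))
        return [f"<{mode}>"] + tokens[:cut] + ["<extra_id_0>"], tokens[cut:]
    # R/X denoising: mask one span of roughly mean_span tokens at a random position
    # (real span corruption masks many spans; one span keeps the sketch short).
    span = max(1, min(n - 1, cfg["mean_span"]))
    start = random.randrange(0, n - span + 1)
    corrupted = [f"<{mode}>"] + tokens[:start] + ["<extra_id_0>"] + tokens[start + span:]
    target = ["<extra_id_0>"] + tokens[start:start + span]
    return corrupted, target

if __name__ == "__main__":
    toks = "large language models learn from denoising objectives during pretraining".split()
    for mode in ("R", "X", "S"):
        print(mode, *corrupt(toks, mode), sep="\n  ")
```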

Experimental results

OpenBA was evaluated on multiple commonly used Chinese and English benchmarks (MMLU, CMMLU, C-Eval, BBH, SuperGLUE, etc.) under different settings (zero-shot, few-shot, held-in, held-out), covering common sense reasoning, natural language generation, and natural language understanding tasks.

OpenBA achieves competitive results across tasks and settings. Below are some evaluation results on BELEBELE (natural language understanding), ROC Story (natural language generation), and CMMLU (logical reasoning).

[Figure: OpenBA's automatic metric results on BELEBELE (reading comprehension)]

OpenBA's manual evaluation results on ROC Story (story generation):

[Figure: continuity assessment]
[Figure: consistency evaluation]

OpenBA's automatic metric results on CMMLU (Chinese logical reasoning):

[Figure: OpenBA results on CMMLU]

Summary

Although OpenBA uses only 380B training tokens, it achieves excellent performance on numerous benchmarks, even surpassing models trained on more data. Soochow University has open sourced the OpenBA checkpoints from each training stage as well as the construction method of the Chinese Flan data set for researchers to use.

The next phase of OpenBA's work will dig further into a general chat model, a tool-calling model, and debiasing and alignment (please refer to the technical report for details).

If you are interested in OpenBA, you are welcome to collaborate and contribute to the open source community.

