The most powerful 7-billion-parameter large language model so far: the full version of the open-source, commercially usable RedPajama 7B is released!

RedPajama is an open-source, commercially usable large language model released by TOGETHER. On June 6, 2023, TOGETHER officially announced that the model had completed training. According to its evaluations, the model outperforms other 7B-scale models, including LLaMA-7B and Falcon-7B!

TOGETHER is a startup with a star-studded founding team, including former Apple executives and Stanford University professors. Its goal is to provide infrastructure for large generative models.

RedPajama is an open-source large-model project initiated by TOGETHER together with several other organizations. It currently includes an open-source dataset of 1.2 trillion tokens, collected strictly following the method described in the LLaMA paper. It also includes two open-source large models: RedPajama 3B, whose training was announced complete on May 5 and which can run on an RTX 2070, and RedPajama 7B, the subject of this article, whose training was announced complete yesterday and whose performance surpasses all current models of the same size.

  RedPajama 3B model information card:

https://www.datalearner.com/ai-models/pretrained-models/RedPajama-INCITE-3B

1. Introduction to the RedPajama dataset

On April 17, TOGETHER released the RedPajama project, which quickly became widely known. The project aims to build a fully open-source large language model, and its first step was to reproduce the high-quality pre-training dataset described in the LLaMA paper. The team believes that a large, high-quality pre-training dataset is a prerequisite for training strong large models. MetaAI's LLaMA can be regarded as the most capable open model available, but MetaAI released only the pre-trained weights and does not permit commercial use. The RedPajama project therefore collected an equivalent dataset itself, following MetaAI's paper.

They therefore open-sourced the RedPajama dataset of 1.2 trillion tokens, roughly 5TB of data collected as described in the LLaMA paper. It has been downloaded thousands of times and used to train more than 100 models.
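To get a feel for the data, the corpus can be streamed directly from Hugging Face. The sketch below is illustrative rather than official usage: the repo id togethercomputer/RedPajama-Data-1T-Sample (a small sample of the full corpus) and the "text" field name are assumptions based on the project's Hugging Face page and should be verified there.

```python
# Minimal sketch: stream a sample of the RedPajama corpus with the Hugging Face
# `datasets` library. The repo id and field names are assumptions (see note above).
from datasets import load_dataset

# Stream rather than download: the full corpus is roughly 5TB on disk.
ds = load_dataset(
    "togethercomputer/RedPajama-Data-1T-Sample",
    split="train",
    streaming=True,
)

for i, record in enumerate(ds):
    # Each record carries raw text plus metadata about its source slice
    # (CommonCrawl, C4, GitHub, arXiv, books, Wikipedia, StackExchange).
    print(record.get("text", "")[:200])
    if i == 2:
        break
```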

2. Introduction to the RedPajama 7B model

On April 23, one week after the dataset's release, TOGETHER announced that it was training a model called RedPajama-7B on this 1.2-trillion-token dataset. Although only 40% of training had been completed, the model already outperformed Pythia-7B, demonstrating the value of a large, high-quality pre-training dataset.

On May 5, training of RedPajama-7B reached 80%, and the results were already unexpectedly strong, so TOGETHER released version 0.1 of RedPajama 7B in three variants: a base model, a chat fine-tuned model, and an instruction fine-tuned model.

RedPajama-7B v0.1 pre-trained model download links:
RedPajama-INCITE-Base-7B-v0.1: https://huggingface.co/togethercomputer/RedPajama-INCITE-Base-7B-v0.1
RedPajama-INCITE-Chat-7B-v0.1: https://huggingface.co/togethercomputer/RedPajama-INCITE-Chat-7B-v0.1
RedPajama-INCITE-Instruct-7B-v0.1: https://huggingface.co/togethercomputer/RedPajama-INCITE-Instruct-7B-v0.1

The Base model is the foundational large language model; it uses the same architecture as Pythia but is trained on the RedPajama dataset. Chat is the result of instruction fine-tuning the Base model on dialogue data (Dolly 2.0 and OASST), and the Chat model can already be used in OpenChatKit. Instruct is the result of fine-tuning the Base model for few-shot prompting on a large collection of NLP tasks (from P3 and Natural Instructions).

Today, TOGETHER announced that RedPajama 7B has completed all of its training, and the full versions of all three RedPajama models are open source:

RedPajama-7B v1.0 pre-trained model download links:
RedPajama-INCITE-7B-Base: https://huggingface.co/togethercomputer/RedPajama-INCITE-7B-Base
RedPajama-INCITE-7B-Chat: https://huggingface.co/togethercomputer/RedPajama-INCITE-7B-Chat
RedPajama-INCITE-7B-Instruct: https://huggingface.co/togethercomputer/RedPajama-INCITE-7B-Instruct

All of the above models are released under the Apache 2.0 license, which makes them fully open source and commercially usable!
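For reference, these checkpoints can be loaded like any other causal language model through the Hugging Face transformers library. The sketch below is a rough illustration under stated assumptions, not official usage: the "<human>:"/"<bot>:" prompt format and the sampling settings are assumptions based on the model card and should be double-checked against it, and a 7B model in float16 needs roughly 14-16 GB of GPU memory.

```python
# Minimal generation sketch with Hugging Face transformers, using the
# RedPajama-INCITE-7B-Chat checkpoint linked above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "togethercomputer/RedPajama-INCITE-7B-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to fit on a single large GPU
    device_map="auto",          # place layers on the available GPU(s)
)

# Prompt format assumed from the model card; verify before relying on it.
prompt = "<human>: What is the RedPajama dataset?\n<bot>:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    top_p=0.7,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The Base and Instruct checkpoints load the same way; only the model id and the prompt style change.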

3. Performance of the RedPajama 7B model

TOGETHER evaluated the model on the HELM benchmark, using RedPajama-INCITE-7B-Instruct. Its average HELM score is 0.492, exceeding the 0.472 of LLaMA 7B and the 0.407 of Falcon 7B, which until recently was the strongest open-source model.

4. RedPajama second-generation model coming soon

In addition to open-sourcing the RedPajama 7B model, TOGETHER also announced RedPajama V2. RedPajama 2 will be trained on a dataset of 2-3 trillion tokens. The main plans are as follows:

  1. Automatically learn the mixing proportions of the different data sources, using techniques such as DoReMi.

  2. Introduce datasets such as Pile v1 (from Eleuther.ai) and Pile v2 (from CarperAI) to increase the diversity of the training data.

  3. Process more CommonCrawl data.

  4. Use a better data deduplication strategy.

  5. Introduce a code dataset of at least 150 billion tokens.

According to the official announcement, RedPajama 2 will also be fully open source!

5. INCITE supercomputer funding behind RedPajama

The RedPajama project would not have been possible without the U.S. Department of Energy's INCITE program, which lowers the barrier to very large-scale computing by granting applicants time on DOE supercomputers. RedPajama's training used a total of 3,072 V100 GPUs.

As this shows, supercomputer funding programs play an important role in enabling this kind of large-model training. It is not yet clear whether comparable resources are open for application in China; if they are, they could likewise accelerate the development of domestic large models!
