Overview of LLaMA and its offspring models



LLaMA

Differences from the original transformer:

Pre-normalization [GPT3] . In order to improve training stability, the input of each Transformer sub-layer is normalized instead of the output. The RMSNorm normalization function introduced by Zhang and Sennrich (2019) is used.

SwiGLU activation function [PaLM] . The ReLU nonlinearity is replaced with the SwiGLU activation function introduced by Shazeer (2020) to improve performance. The paper uses a hidden dimension of 2/3 · 4d, rather than the 4d used in PaLM.

Rotary embeddings [GPTNeo] . Absolute positional embeddings are removed, and rotary positional embeddings (RoPE), introduced by Su et al. (2021), are added at each layer of the network.
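
For intuition, here is a minimal PyTorch sketch of the RMSNorm and SwiGLU components described above. Dimension handling is simplified and the rotary embeddings are omitted; this is not the official LLaMA implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square normalization (Zhang & Sennrich, 2019): no mean centering, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # scale by the inverse RMS of the activations along the last dimension
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class SwiGLUFeedForward(nn.Module):
    """SwiGLU FFN (Shazeer, 2020): silu(x W1) * (x W3), projected back down with W2.
    The hidden size is 2/3 * 4d so the parameter count roughly matches a standard 4d FFN
    (the real implementation rounds this up to a hardware-friendly multiple)."""
    def __init__(self, dim: int):
        super().__init__()
        hidden = int(2 * 4 * dim / 3)
        self.w1 = nn.Linear(dim, hidden, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden, bias=False)  # up projection
        self.w2 = nn.Linear(hidden, dim, bias=False)  # down projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

# Pre-normalization: the input of each sub-layer is normalized, not its output, e.g.
#   x = x + attention(RMSNorm(x))
#   x = x + feed_forward(RMSNorm(x))
```
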
Paper: LLaMA: Open and Efficient Foundation Language Models
Paper interpretation: LLaMA: Open and Efficient Foundation Language Model (Meta AI-2023)

Related GitHub

Alpaca

Alpaca is a model obtained by Stanford by fine-tuning LLaMA 7B on 52K instruction-following examples. The authors claim that, for single-turn instruction following, Alpaca's response quality is comparable to OpenAI's text-davinci-003, while Alpaca is far smaller and cheaper to produce (fine-tuning a 7B LLaMA takes about 3 hours on 8 A100 80GB GPUs, at a cost of under $100).

Official blog introduction: Alpaca: A Strong, Replicable Instruction-Following Model
Interpretation: Stanford Alpaca: an open-source academic implementation of ChatGPT
(Figure: examples of seed data and generated task samples.)

The whole process of training Alpaca:

  1. Start from 175 human-written instruction-output pairs as the seed set for self-instruct;
  2. Prompt text-davinci-003 with the seed set to generate more instructions;
  3. Optimize self-instruct: simplify the generation pipeline and greatly reduce the cost;
  4. Use the OpenAI API to generate 52K non-repetitive instructions and corresponding outputs, at a cost of less than $500;
  5. Fine-tune the LLaMA model with the Hugging Face framework, using Fully Sharded Data Parallel (FSDP) and mixed-precision training (see the sketch below).
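
The fine-tuning step might look roughly like the sketch below using the Hugging Face Trainer. The checkpoint name, hyperparameters, and data handling are assumptions for illustration, not the actual Alpaca training script.

```python
# Illustrative sketch only -- the real Alpaca repo ships its own train.py.
# Launch with torchrun so FSDP actually shards across GPUs, e.g.:
#   torchrun --nproc_per_node=8 finetune_sketch.py
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "huggyllama/llama-7b"   # assumed checkpoint name, not Alpaca's
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Stand-in for the 52K instruction-output pairs (prompt format follows Alpaca's style).
raw = [{"instruction": "Give three tips for staying healthy.",
        "output": "1. Eat a balanced diet. 2. Sleep enough. 3. Exercise regularly."}]

def to_features(example):
    text = f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"
    toks = tokenizer(text, truncation=True, max_length=512)
    toks["labels"] = toks["input_ids"].copy()
    return toks

train_dataset = Dataset.from_list(raw).map(to_features)

args = TrainingArguments(
    output_dir="alpaca-sketch",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    bf16=True,                                            # mixed-precision training
    fsdp="full_shard auto_wrap",                          # fully sharded data parallel
    fsdp_transformer_layer_cls_to_wrap="LlamaDecoderLayer",
)

Trainer(model=model, args=args, train_dataset=train_dataset).train()
```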

Related GitHub

Vicuna

Vicuna-13B is a model fine-tuned from LLaMA-13B on supervised data. The dataset consists of about 70K user conversations collected from ShareGPT.com, a website where users share ChatGPT conversations they find interesting. Preliminary evaluations using GPT-4 as a judge show that Vicuna-13B achieves more than 90% of the quality of OpenAI's ChatGPT and Google Bard, while outperforming other models such as LLaMA and Stanford Alpaca in more than 90% of cases. The cost to train Vicuna-13B is approximately $300.

Official introduction: Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality
Interpretation: Large models are in an arms race too: a Vicuna training and inference guide, with results that crush Stanford Alpaca

Vicuna's code is based on Stanford Alpaca, with added support for multi-turn dialogue, and it uses hyperparameters similar to Stanford Alpaca's.

Training process:
First, the researchers collected about 70,000 conversations from ShareGPT.com (a website where users share their ChatGPT conversations) and enhanced the training script provided by Alpaca to better handle multi-turn dialogue and long sequences. Training was a full fine-tune with PyTorch FSDP on 8 A100 GPUs and took one day. To provide a demo service, the Vicuna researchers built a lightweight distributed serving system. For evaluation, they created 80 questions across eight categories (e.g., role-playing, coding/math tasks) and used GPT-4 to judge the model outputs, giving a preliminary assessment of model quality. To compare two models, the researchers combined the outputs of both models into a single prompt for each question; the prompt was then sent to GPT-4, which judged which model provided the better response.
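
Below is a rough sketch of the GPT-4-as-judge comparison described above, assuming the OpenAI Python client. The prompt wording and output format here are illustrative guesses, not Vicuna's actual evaluation template (which lives in the FastChat repo).

```python
# Pairwise comparison with GPT-4 as the judge -- illustrative, not Vicuna's exact prompt.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def judge(question: str, answer_a: str, answer_b: str) -> str:
    prompt = (
        "You are a helpful and impartial judge. Compare the two assistant "
        "answers to the user question below and explain which one is better.\n\n"
        f"[Question]\n{question}\n\n"
        f"[Assistant A]\n{answer_a}\n\n"
        f"[Assistant B]\n{answer_b}\n"
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

# Example usage (model outputs would come from Vicuna and Alpaca respectively):
# verdict = judge("Explain FSDP in one paragraph.", vicuna_output, alpaca_output)
```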

Related GitHub

Vicuna limitations
The researchers pointed out that, similar to other large language models, Vicuna also has certain limitations.
For example, Vicuna performed poorly on tasks involving programming, reasoning, mathematics, and factual accuracy.
Furthermore, it is not sufficiently optimized for safety or to mitigate potential toxicity or bias.

Koala

Presumably the camelid names were running out, so this one is named after a different animal.
(Figure: a one-picture overview of Koala.)

Official blog introduction: Koala: A Dialogue Model for Academic Research

Interpretation: 13 billion parameters, trained on 8 A100s: UC Berkeley releases the dialogue model Koala

Similar to Vicuna, Koala also fine-tunes the LLaMA model using conversational data collected from the web, with a focus on publicly available data from conversations with closed-source large models such as ChatGPT.

According to the research team, Koala was implemented in EasyLM using JAX/Flax and trained on a single Nvidia DGX server equipped with 8 A100 GPUs. Two epochs of training take 6 hours, which typically costs less than $100 on public cloud computing platforms.

The research team experimentally compared Koala with ChatGPT and Stanford's Alpaca. The results show that the 13-billion-parameter Koala-13B can respond effectively to a wide range of user queries, that its responses are generally better than Alpaca's, and that in more than half of the cases its performance is comparable to ChatGPT's.

Baize

Paper: Baize: An Open-Source Chat Model with Parameter-Efficient Tuning on Self-Chat Data
Official Github: https://github.com/project-baize/baize-chatbot
Interpretation: Using ChatGPT to train an alpaca: "Baize" is open source, makes it easy to build your own model, and is available to try online

In this work, the authors propose a pipeline for automatically collecting ChatGPT conversations. By sampling "seeds" from specific datasets, ChatGPT is made to talk to itself, generating high-quality multi-turn conversation data in batches. If a domain-specific dataset is used, such as a medical question-answering dataset, a high-quality corpus for that vertical domain can be generated.

Baize's training method uses ChatGPT to automatically generate a high-quality multi-turn chat corpus: ChatGPT talks to itself, simulating both the user's and the AI's responses.

To fine-tune large language models in a resource-constrained setting, the authors employ parameter-efficient tuning methods that make efficient use of compute. This strategy enables state-of-the-art language models to maintain high performance and adaptability. Baize improves on the open-source LLaMA model by fine-tuning it with the newly generated chat corpus. The model runs on a single GPU, making it accessible to a wider range of researchers.

The self-chat process is the basis of the training data. To have ChatGPT generate data effectively, the researchers applied a template defining the format and requirements, letting the ChatGPT API keep generating transcripts for both sides of the conversation until a natural stopping point is reached. Conversations center around a "seed", which can be a question or a key phrase that sets the topic of the chat.
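
A minimal sketch of such a seeded self-chat loop, assuming the OpenAI chat API; the template text and model name below are placeholders rather than Baize's exact prompt.

```python
# Self-chat data collection sketch (not Baize's exact template).
from openai import OpenAI

client = OpenAI()

SELF_CHAT_TEMPLATE = (
    "The following is a conversation between a human and an AI assistant about "
    "the topic: '{seed}'. The human and the AI take turns. Human turns start "
    "with [Human] and AI turns start with [AI]. The conversation should be "
    "complete and end naturally."
)

def self_chat(seed: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": SELF_CHAT_TEMPLATE.format(seed=seed)}],
        temperature=1.0,
    )
    # The response is a multi-turn [Human]/[AI] transcript to be parsed into pairs.
    return resp.choices[0].message.content

# Seeds could be questions sampled from Quora/Stack Overflow or a medical QA set, e.g.:
# transcript = self_chat("How do I fine-tune LLaMA on a single GPU?")
```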

In essence, Baize is LLaMA fine-tuned with LoRA.
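
For reference, a minimal sketch of attaching LoRA adapters to LLaMA with the peft library; the checkpoint name, rank, and target modules are assumptions, not Baize's exact configuration.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")  # assumed checkpoint name
lora_cfg = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()        # only the adapter weights are trainable
```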

Camel(Luotuo)

The project's documentation introduces "Luotuo" (Camel), a Chinese model the team recently trained. It is based on Meta's open-source LLaMA, draws on the Alpaca and Alpaca-LoRA projects, and has been trained on Chinese data with preliminary results.

Official Github: https://github.com/LC1332/Luotuo-Chinese-LLM
Interpretation: [Open Source GPT] Three Chinese developers open-source the Chinese language model "Luotuo" (Camel): training and deployment fit on a single GPU, and it costs only a few hundred yuan to train your own Chinese chat model

Nothing special here: the usual recipe, SFT on top of LLaMA.

BELLE

To promote the development of open-source large language models, we have invested substantial effort in building low-cost models that approach ChatGPT. First, to improve the model's performance and training/inference efficiency in the Chinese domain, we further expanded LLaMA's vocabulary and performed secondary pre-training on 3.4 billion Chinese words.
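
A hedged sketch of the vocabulary-expansion step: the checkpoint name and added tokens below are placeholders, and BELLE's actual merged tokenizer is built differently (e.g. by training a Chinese SentencePiece model and merging its vocabulary).

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "huggyllama/llama-7b"   # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Add new Chinese tokens to the tokenizer (placeholder examples), then grow the
# embedding matrix so the model can learn them during secondary pre-training.
new_tokens = ["你好", "语言模型"]
num_added = tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens; new vocab size = {len(tokenizer)}")
```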

In addition, instruction-tuning data generated with ChatGPT is currently available in several forms: 1) self-instruct data based on GPT-3.5, following Alpaca; 2) self-instruct data based on GPT-4, following Alpaca; 3) data shared by users via ShareGPT. Here, we examine the impact of training-data categories on model performance. Specifically, we studied factors such as the quantity, quality, and language distribution of the training data, along with our own collected Chinese multi-turn conversation data and some publicly available high-quality instruction datasets.

To better evaluate the results, we used an evaluation set of one thousand samples spanning nine real-world scenarios to test the models, and we provide quantitative analysis to offer valuable insights and promote the development of open-source chat models.

Official Github: https://github.com/LianjiaTech/BELLE
Interpretation: The Chinese dialogue model BELLE is fully open source! (with data, model, and a lightweight version)

Guanaco

Guanaco is an instruction-following language model trained on top of the mainstream LLaMA-7B model. On top of the original 52K instructions, an additional 534K+ examples were added, covering English, Japanese, German, Simplified Chinese, Traditional Chinese (Taiwan), Traditional Chinese (Hong Kong), and a variety of linguistic and grammatical tasks. This rich data helps improve and optimize the model, which demonstrates excellent performance and potential in multilingual settings.

GitHub:https://github.com/Guanaco-Model/Guanaco-Model.github.io

Recently, the University of Washington proposed QLoRA, which compresses the pre-trained language model with 4-bit quantization, freezes the large model's parameters, and adds a relatively small number of trainable parameters in the form of Low-Rank Adapters. This greatly reduces the model's memory footprint while barely affecting inference quality. Applied to fine-tuning LLaMA 65B, which would normally require about 780GB of GPU memory, the technique needs only 48GB, greatly reducing training cost.
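
A rough sketch of the QLoRA recipe using transformers, bitsandbytes, and peft; the checkpoint name and hyperparameters are assumptions loosely inspired by the paper's defaults, not the official Guanaco training script.

```python
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # 4-bit NormalFloat quantization
    bnb_4bit_use_double_quant=True,      # double quantization of quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",               # assumed checkpoint name
    quantization_config=bnb_cfg,
    device_map="auto",
)
base = prepare_model_for_kbit_training(base)  # freeze quantized weights, prep for training

lora_cfg = LoraConfig(
    r=64, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)   # only the LoRA adapters are trained
```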

QLoRA interpretation: The open-source Guanaco and the QLoRA technique behind it reduce the GPU memory needed to fine-tune a 65B model from over 780GB to under 48GB, with results close to GPT-4: a detailed technical explanation

There are so many LLaMA descendant models that I won't list them all. While looking for material, I found a GitHub repository that covers most open-source LLMs and is really excellent: https://github.com/chenking2020/FindTheChatGPTer

Source: blog.csdn.net/qq_44193969/article/details/131316050