Enhancing Chat Language Models by Scaling High-quality Instructional Conversations


Paper address: https://arxiv.org/abs/2305.14233

1. Introduction

Many people have realized that the secret behind ChatGPT lies in instruction fine-tuning, pushed to the extreme. After GPT-3, this is another case of massive effort producing miracles. This article covers work from Tsinghua University, published in May 2023, aimed at generating high-quality instruction fine-tuning data.

2. Abstract and Introduction

The effectiveness of instruction fine-tuning has been verified by multiple works, with ChatGPT as one of the representatives. This work aims to raise the performance ceiling of open-source models by providing UltraChat, a systematically designed, diverse, information-rich, large-scale dataset of instructional conversations. UltraChat contains 1.5 million high-quality multi-turn conversations covering a wide range of topics and instructions. Statistical analysis of UltraChat shows its advantages on key metrics such as scale, average length, diversity, and consistency, solidifying its position as a leading open-source dataset.

The paper argues that the quality and diversity of the data used during training play a vital role in further improving the performance of chat language models.

Rather than structuring the data around specific tasks such as Q&A or summarization, this work organizes it into three parts: 1. Questions about the world, 2. Creation and generation, and 3. Assistance with existing materials.

This work uses meta information, in-context expansion, and iterative prompting to scale up the number of instructions.

Two ChatGPT Turbo API instances are used: one plays the user role and generates queries, while the other plays the assistant role and generates responses.
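
A minimal sketch of this dual-role generation loop is shown below, assuming the current OpenAI Python client and gpt-3.5-turbo; the system prompts, the role-flipping trick, and the turn count are illustrative choices, not the paper's exact prompts.

```python
# Minimal sketch of dual-role conversation generation (assumptions:
# gpt-3.5-turbo, the system prompts, and 3 turns are illustrative).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

USER_SYSTEM = ("You are a curious human user. Continue the conversation "
               "with one new question or instruction for the assistant.")
ASSISTANT_SYSTEM = "You are a helpful, detailed assistant."

def chat(system: str, history: list[dict]) -> str:
    """One API call; `history` is a list of {'role', 'content'} dicts."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "system", "content": system}] + history,
    )
    return resp.choices[0].message.content

def generate_dialogue(opening_line: str, num_turns: int = 3) -> list[dict]:
    dialogue = [{"role": "user", "content": opening_line}]
    for _ in range(num_turns):
        # The assistant model answers the latest user message.
        reply = chat(ASSISTANT_SYSTEM, dialogue)
        dialogue.append({"role": "assistant", "content": reply})
        # The user model sees the same history with roles flipped,
        # so its completion becomes the next "user" utterance.
        flipped = [{"role": "assistant" if m["role"] == "user" else "user",
                    "content": m["content"]} for m in dialogue]
        follow_up = chat(USER_SYSTEM, flipped)
        dialogue.append({"role": "user", "content": follow_up})
    return dialogue

print(generate_dialogue("What causes the aurora borealis?"))
```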

Finally, a LLaMA model was fine-tuned on the generated data and evaluated with ChatGPT (which feels somewhat flawed, since ChatGPT was also used to generate the data). The results are shown below; at the time, this was the best performance among open-source models.
[Figure: evaluation results comparing the fine-tuned model with other open-source chat models]

3. Related work

Instruction fine-tuning

This blog post focuses on data generation, so this part is skipped.

Data generation

SelfInstruct (Wang et al., 2022), 
Alpaca (Taori et al., 2023b),
code-alpaca (Chaudhary, 2023),
alpaca-cot (Si et al., 2023), 
GPT4ALL (Anand et al., 2023), 
ShareGPT (Domeccleston, 2023),
Dolly-v2 (Conover et al., 2023), 
BELLE (Ji et al., 2023), 
Vicuna (Chiang et al., 2023), 
Koala (Geng et al., 2023), 
Baize (Xu et al., 2023),
CAMEL (Li et al., 2023)

4. Method

In order to ensure the quality and diversity of data, this work considers two key points.

  1. The opening line directly determines the topic of the conversation. Opening lines should be highly diverse and cover any task a human user might ask a chat model to perform.
  2. The user drives the course of the conversation, so the output should be tailored to the current context, with varying language styles and requested topics.

4.1 Questions about the world

The authors first asked ChatGPT to produce 30 meta-topics, then generated 30-50 subtopics for each meta-topic. For each subtopic, another 10 × 10 = 100 questions are generated.

30 × 30 × 100 = 90,000
30 × 50 × 100 = 150,000

At the same time, the authors also collected 10,000 entities (for example, organic chemistry) from Wikidata, and generated 5 × 30 = 150 questions for each entity.

10,000 × 150 = 1,500,000

In the end, 500,000 questions about the world were kept after filtering.
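
The expansion described above (meta-topics → subtopics → questions) could look roughly like the sketch below; the prompt wording, the line-based parsing, and the helper function are assumptions for illustration, not the paper's actual pipeline, and a full run would be expensive.

```python
# Rough sketch of the meta-topic -> subtopic -> question expansion
# (prompt wording and parsing are assumptions; shown for structure only).
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> list[str]:
    """Single ChatGPT call; returns the non-empty lines of the reply."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return [line.lstrip("-*0123456789. ").strip()
            for line in resp.choices[0].message.content.splitlines()
            if line.strip()]

meta_topics = ask("List 30 broad meta-topics about the world, one per line.")

questions = []
for topic in meta_topics:
    subtopics = ask(f"List 30 to 50 subtopics of '{topic}', one per line.")
    for sub in subtopics:
        seeds = ask(f"Write 10 diverse questions about '{sub}', one per line.")
        for seed in seeds:
            # Each seed question is expanded into 10 related questions,
            # giving roughly 10 x 10 = 100 questions per subtopic.
            questions += ask(f"Write 10 questions related to: {seed}")

print(len(questions))  # on the order of 90k-150k before filtering
```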

4.2 Creation and generation

ChatGPT is used to generate writing instructions of various types, and then the corresponding dialogue data.

4.3 Assistance with existing materials

First, source material was collected from C4, a web-crawl corpus (about 20 TB).

Then, after filtering, 10,000 texts were obtained, and for each text, 5 unique instructions were generated with the help of ChatGPT.

To combine an instruction with its text into the opening statement of a new conversation, the authors designed the templates shown in the figure below; note that the 7 lines correspond to 7 opening templates. Ultimately, 500,000 conversation openers were generated from these templates.

[Figure: the 7 opening-line templates that combine an instruction with a piece of source text]
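
A rough sketch of how an instruction and a filtered text snippet might be combined into an opening line is given below; the template strings are placeholders, not the paper's actual seven templates.

```python
# Rough sketch of turning (instruction, source text) pairs into conversation
# openers; the templates below are placeholders, not the paper's 7 templates.
import random

OPENING_TEMPLATES = [
    "{instruction}\n\n{text}",
    "Here is a passage:\n{text}\n\nNow, {instruction}",
    "{text}\n\nBased on the passage above, {instruction}",
]

def make_opening(instruction: str, text: str) -> str:
    template = random.choice(OPENING_TEMPLATES)
    return template.format(instruction=instruction, text=text)

opener = make_opening(
    instruction="summarize the main argument in two sentences.",
    text="The C4 corpus is a large collection of cleaned web text ...",
)
print(opener)
```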

5. Analysis and evaluation

Analysis

The analysis results are shown in the figure below, based on the following metrics:

  1. Number of dialogue turns
  2. Average conversation length
  3. Length of a single turn
  4. Lexical diversity (MTLD; paper: "MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment"; a simplified sketch appears below the figure)
  5. Topic diversity (sample conversations and measure how many topics are covered and their spread)
  6. Coherence (scored by ChatGPT)

[Figure: statistical comparison of UltraChat with other instruction datasets on the metrics above]
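
For reference, a simplified, forward-only version of the MTLD lexical-diversity measure from point 4 might look like the sketch below; the full measure averages a forward and a backward pass, so this is for illustration only.

```python
# Simplified, forward-only MTLD (McCarthy & Jarvis, 2010).
def mtld(tokens: list[str], ttr_threshold: float = 0.72) -> float:
    factors, types, count = 0.0, set(), 0
    for tok in tokens:
        count += 1
        types.add(tok.lower())
        ttr = len(types) / count
        if ttr <= ttr_threshold:       # a "factor" is complete
            factors += 1.0
            types, count = set(), 0
    if count > 0:                      # partial factor for the remainder
        ttr = len(types) / count
        factors += (1.0 - ttr) / (1.0 - ttr_threshold)
    return len(tokens) / factors if factors > 0 else float(len(tokens))

text = "the quick brown fox jumps over the lazy dog and the fox runs away"
print(round(mtld(text.split()), 2))
```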

Evaluation

The paper first evaluates on a curated evaluation set built by the authors ("Our Evaluation Set").
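
Since responses are scored with ChatGPT (as noted in the introduction), a pairwise-comparison judge could be sketched as follows; the judging prompt and the answer parsing are assumptions, not the paper's exact evaluation setup.

```python
# Rough sketch of ChatGPT-as-judge pairwise comparison on an evaluation set;
# the judging prompt and parsing are assumptions, not the paper's setup.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """Question: {question}

Response A: {a}

Response B: {b}

Which response is more helpful, accurate and detailed?
Answer with a single letter: A, B, or T for a tie."""

def judge(question: str, a: str, b: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, a=a, b=b)}],
        temperature=0,
    )
    verdict = resp.choices[0].message.content.strip().upper()
    return verdict[0] if verdict and verdict[0] in "ABT" else "T"

print(judge("Explain photosynthesis briefly.",
            "Plants convert light, water and CO2 into glucose and oxygen.",
            "Photosynthesis is when plants eat sunlight."))
```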

[Figures: results on the authors' evaluation set]

World-knowledge evaluation set: TruthfulQA (Principle-driven self-alignment of language models from scratch with minimal human supervision).

[Figure: results on the world-knowledge evaluation set]
