Liu Zhiyuan's team asks: how can a model's performance and efficiency be improved by scaling up a high-quality instructional dialogue dataset?


Deep Learning Natural Language Processing Original
Author | Carina Lau

As open-source large language models (LLMs) flourish, a model's performance and efficiency determine the balance between product cost and service experience. So, is there a way to make large language models both more efficient and better?

To further raise the ceiling of open-source models, a research team from Tsinghua University offers an answer: by scaling up high-quality instructional dialogue data, the model's performance and efficiency are significantly improved. As shown in the figure below, UltraLLaMA tops the LLM leaderboard!


As netizens have commented: UltraChat, which contains 1.5 million high-quality, diverse multi-turn conversations, beats Vicuna, the open-source SotA model. Let's read the paper carefully and see what insights it offers.

Paper : Enhancing Chat Language Models by Scaling High-quality Instructional Conversations
Address : https://arxiv.org/pdf/2305.14233.pdf
Code : https://github.com/thunlp/UltraChat


1 Overview of the paper

To further raise the ceiling of open-source models, the paper proposes a new chat language model, UltraLLaMA, obtained by fine-tuning LLaMA on UltraChat, a diverse and high-quality instructional dialogue dataset, and successfully improves chat language model performance.

Figure: Paper structure generated by GPT-4

2 How is the UltraChat dataset constructed?

Design overview: The general idea of UltraChat is to use separate LLMs to generate opening lines, simulate users, and respond to queries. UltraChat's three sectors, questions about the world, writing and creation, and assistance with existing materials, each have their own design, as shown in the figure below:

Figure: Construction process of UltraChat
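
The core of this pipeline is two chat models talking to each other: one is prompted to act as the user and pose queries, the other answers as the assistant. The paper's exact prompts are not quoted here, so the following is only a minimal sketch of that loop, assuming the OpenAI chat completions API; the model name, role prompts, and turn count are illustrative assumptions.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
MODEL = "gpt-3.5-turbo"  # illustrative; the paper refers to the "Turbo API"

def chat(system_prompt, history):
    """One call to the chat completions API with a given role prompt."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "system", "content": system_prompt}] + history,
    )
    return resp.choices[0].message.content

def generate_dialogue(opening_line, num_turns=3):
    """Let a simulated user and an assistant model talk for a few turns."""
    # Hypothetical role prompts -- the actual UltraChat prompts are not reproduced here.
    user_system = ("You are a curious user. Given the conversation so far, "
                   "ask one natural follow-up question.")
    assistant_system = ("You are a helpful assistant. Answer concisely, informatively, "
                        "and consistently with the conversation history.")

    dialogue = [{"role": "user", "content": opening_line}]
    for _ in range(num_turns):
        # The assistant model answers the latest user message.
        answer = chat(assistant_system, dialogue)
        dialogue.append({"role": "assistant", "content": answer})
        # The user-simulator produces the next query; roles are flipped so the
        # simulator sees the assistant's answers as incoming "user" messages.
        flipped = [{"role": "user" if m["role"] == "assistant" else "assistant",
                    "content": m["content"]} for m in dialogue]
        follow_up = chat(user_system, flipped)
        dialogue.append({"role": "user", "content": follow_up})
    return dialogue
```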

2.1 Questions about the world

  • This part of the data focuses on concepts, objects, and entities that exist in the real world.

  • Data collection proceeds from two angles: one centered on topics and concepts, the other on real-world entities.

  • ChatGPT is asked to generate 30 representative and diverse meta-topics covering all aspects of daily life, as shown below:

Table: The 30 meta-topics used to generate the first sector of UltraChat data

Build process :

  • First, more than 1,100 subtopics are generated from these meta-topics; in parallel, the 10,000 most frequently used real-world named entities, such as people, places, and events, are collected from Wikidata.

  • Up to 10 specific questions are generated for each subtopic; for each entity, 5 basic questions, 10 specific questions, and 20 extended questions are generated.

  • The Turbo API is then used to generate further related questions for each of the 10 questions. To turn these questions into conversations, a subset of the roughly 500,000 questions is filtered and sampled as conversation openers.

  • Hand-crafted prompts instruct the model to generate diverse questions covering common concepts and objects, and to answer concisely and meaningfully while taking the conversation history into account.

  • Finally, 200k specific questions, 250k general questions, and 50k meta-questions are sampled, and multi-turn dialogues are generated iteratively from them (a sketch of this seeding step follows below).
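
As a rough illustration of how such opening questions could be seeded, the sketch below asks a chat model to expand one meta-topic into subtopics and then into candidate questions. The prompts, counts, and the `ask` helper are assumptions for illustration, not the paper's exact wording.

```python
from openai import OpenAI

client = OpenAI()

def ask(prompt):
    """Single-turn call to the chat model; returns one item per output line."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return [line.strip("-• ").strip()
            for line in resp.choices[0].message.content.splitlines() if line.strip()]

meta_topic = "Technology"  # one of the 30 meta-topics

# Step 1: expand a meta-topic into subtopics (the paper derives ~1,100 in total).
subtopics = ask(f"List 30 diverse subtopics under the topic '{meta_topic}', one per line.")

# Step 2: turn each subtopic into candidate opening questions (up to 10 each).
questions = []
for sub in subtopics:
    questions += ask(f"Write 10 diverse, specific questions about '{sub}', one per line.")

# Step 3 (not shown): filter and sample these questions as dialogue openers,
# then run the user-simulator / assistant loop above on each of them.
print(len(questions), "candidate opening questions")
```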

2.2 Writing and creation

  • The purpose of this part is to automatically generate different types of written text according to the user's instructions.

  • ChatGPT is used to enumerate 20 different types of written text, such as stories, poems, and essays, that can be generated according to user instructions.

Table: The 20 types of text material used in sectors 2 and 3

Build process :

  • For each type of writing, 200 different instructions are generated that ask the AI assistant to produce such text, and 80% of these instructions are further expanded and refined (see the sketch after this list).

  • Each generated instruction is used as the opening input, from which a dialogue of 2~4 rounds is generated.
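
A minimal sketch of this step might look as follows: for each writing type, the model drafts instructions, then a random 80% of them are refined into more detailed versions. The prompts, the refinement wording, and the `complete` helper are assumed for illustration.

```python
import random
from openai import OpenAI

client = OpenAI()

def complete(prompt):
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

writing_types = ["story", "poem", "essay"]  # the paper uses 20 types

instructions = []
for wtype in writing_types:
    # Draft instructions asking an AI assistant to produce this kind of text.
    draft = complete(f"Write 5 distinct user instructions that ask an AI assistant "
                     f"to write a {wtype}. One instruction per line.")
    instructions += [line.strip() for line in draft.splitlines() if line.strip()]

# Further expand and refine a random 80% of the drafted instructions.
to_refine = random.sample(instructions, k=int(0.8 * len(instructions)))
refined = [complete(f"Make this instruction more specific and detailed:\n{ins}")
           for ins in to_refine]

# Each (refined) instruction then serves as the opening user turn
# of a 2~4 round generated dialogue, as described above.
```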

2.3 Assistance with existing materials

  • The purpose of this part is to generate different types of tasks, such as rewriting, translating, summarizing, etc., based on existing textual materials.

  • The C4 corpus, a large collection of text fragments paired with their source URLs, is used, along with the 20 material types (stories, poems, papers, etc.) listed above.

Build process :

  • About 100,000 different pieces of material are extracted from the C4 dataset.

  • Keywords are devised for each material type, and text fragments are categorized by keyword and source URL to obtain the materials.

  • ChatGPT is used to generate up to 5 questions/instructions per piece of material.

  • Each question/instruction is combined with its material via a set of manually designed templates (see the table and sketch below) to form the user's opening input for a conversation with the AI assistant.

  • In this way, 500,000 dialogue openers are obtained, each containing a text fragment and a task instruction. For each opener, 2~4 rounds of dialogue are generated.

Table: Manually designed templates for linking existing materials and generated instructions
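
A hedged sketch of how a C4 text fragment, a generated instruction, and one of the linking templates could be combined into a dialogue opener. The template strings, the field names, and the helper name are illustrative, not the paper's.

```python
import json
import random

# Illustrative linking templates (the real ones are hand-designed in the paper).
TEMPLATES = [
    "{material}\n\nBased on the passage above, {instruction}",
    "Here is a piece of text: {material}\n\n{instruction}",
]

def build_openers(c4_path, instructions_per_material):
    """Pair each filtered C4 fragment with its generated instructions."""
    openers = []
    with open(c4_path) as f:            # one JSON object per line: {"text": ..., "url": ...}
        for line in f:
            doc = json.loads(line)
            material = doc["text"][:2000]             # keep fragments reasonably short
            for instruction in instructions_per_material.get(doc["url"], []):
                template = random.choice(TEMPLATES)
                openers.append(template.format(material=material,
                                               instruction=instruction))
    return openers
```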

2.4 Dataset Evaluation

The UltraChat dataset is a large-scale instructional dialogue dataset containing more than 1 million dialogues, with an average of about 8 dialogue rounds each.

UltraChat is statistically analyzed and compared with several other instruction datasets; the results are shown in the table below.

Table: Statistics of existing instruction datasets
  • UltraChat leads the other datasets in scale, average number of turns, average length per instance, and lexical diversity, and is one of the largest open-source instruction datasets.

  • UltraChat's topic diversity is slightly lower than GPT4All's, but still higher than the other datasets'. This may be because each UltraChat conversation spans multiple turns and more tokens, whereas GPT4All has only one turn per conversation.

  • When the coherence of the datasets is assessed, the data from UltraChat and Baize rank highest.
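
For readers who want to reproduce rough scale and turn statistics on the released data, here is a small sketch. It assumes each dialogue is stored as one JSON object per line with a "data" field holding a list of utterance strings (the actual UltraChat release format may differ), and it approximates lexical diversity with a simple type-token ratio rather than the metric used in the paper.

```python
import json

def dataset_stats(path):
    """Compute dialogue count, average turns, average length, and a rough lexical diversity."""
    n_dialogues, n_turns, n_tokens = 0, 0, 0
    vocab = set()
    with open(path) as f:
        for line in f:                        # one dialogue per line
            turns = json.loads(line)["data"]  # assumed field name: list of utterance strings
            n_dialogues += 1
            n_turns += len(turns)
            for utt in turns:
                words = utt.split()
                n_tokens += len(words)
                vocab.update(w.lower() for w in words)
    return {
        "dialogues": n_dialogues,
        "avg_turns": n_turns / n_dialogues,
        "avg_tokens_per_dialogue": n_tokens / n_dialogues,
        "rough_lexical_diversity": len(vocab) / n_tokens,
    }

print(dataset_stats("ultrachat.jsonl"))
```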

3 How powerful is the UltraLLaMA dialogue model?

Basics of the model:

  • UltraLLaMA is built on the LLaMA-13B model and fine-tuned to better understand dialogue context.

  • To enable the model to leverage information from earlier parts of the conversation and generate more relevant and coherent responses, the researchers split each conversation into shorter sequences with a maximum length of 2048 tokens, and optimized the loss only on the model's responses (see the masking sketch below).

  • The model is fine-tuned with a cross-entropy loss on 128 A100 GPUs, with a total batch size of 512.
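
The key detail is that the loss is computed only on assistant tokens. Below is a minimal sketch of that masking and of chunking long conversations to 2048 tokens, assuming a Hugging Face tokenizer and the usual convention that positions labeled -100 are ignored by the cross-entropy loss; the checkpoint name and the flat (role, text) dialogue format are illustrative assumptions.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-13b")  # illustrative checkpoint name
MAX_LEN = 2048
IGNORE_INDEX = -100  # positions with this label are skipped by the cross-entropy loss

def build_example(dialogue):
    """dialogue: list of (role, text) pairs in order. Returns chunks of input_ids and labels."""
    input_ids, labels = [], []
    for role, text in dialogue:
        ids = tokenizer.encode(text, add_special_tokens=False) + [tokenizer.eos_token_id]
        input_ids += ids
        if role == "assistant":
            labels += ids                        # learn to produce assistant tokens
        else:
            labels += [IGNORE_INDEX] * len(ids)  # do not train on user tokens
    # Long conversations are split into chunks of at most MAX_LEN tokens.
    chunks = []
    for start in range(0, len(input_ids), MAX_LEN):
        chunks.append({"input_ids": input_ids[start:start + MAX_LEN],
                       "labels": labels[start:start + MAX_LEN]})
    return chunks
```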

Creating the evaluation set

  • An evaluation set of 300 questions/instructions was constructed from the Vicuna benchmark and GPT-4 generations, spanning multiple topics and difficulty levels.

  • The TruthfulQA benchmark is used to evaluate the world knowledge of the model and the baselines, testing whether they can identify true statements and avoid generating or propagating false information.

  • The TruthfulQA benchmark is a challenging test with 38 categories and two assessment tasks: multiple choice and generation.

3.1 Model evaluation

baseline assessment

  • ChatGPT is used to evaluate the responses of UltraLLaMA and other baseline models on each question.

Figure: Prompt for comparison evaluation
  • ChatGPT is given the question and the responses of the two models, and asked to rate each response from 1 to 10 and give a reason.

  • The evaluation prompt uses correctness as the main criterion (a sketch of such a judging call follows below).
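
A hedged sketch of such a pairwise ChatGPT-as-judge call; the judging prompt and the expected "Scores: A=…, B=…" output line are paraphrased from the description above, not copied from the paper.

```python
import re
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """Question: {question}

Response A: {response_a}

Response B: {response_b}

Rate each response on a scale of 1 to 10, using correctness as the main criterion,
and briefly explain your reasoning. End with a line of the form: Scores: A=<x>, B=<y>"""

def judge(question, response_a, response_b):
    """Return (score_a, score_b) as judged by the chat model, or None if unparsable."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, response_a=response_a, response_b=response_b)}],
    )
    text = resp.choices[0].message.content
    match = re.search(r"A\s*=\s*(\d+).*?B\s*=\s*(\d+)", text)
    return (int(match.group(1)), int(match.group(2))) if match else None

# Win/Tie/Lose counts over the evaluation set compare these two scores per question.
```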

Figure: Comparison of responses between UltraLLaMA and other baselines on the curated evaluation set, as judged by ChatGPT
  • The Win/Tie/Lose counts of UltraLLaMA against the other baseline models on the evaluation set are shown in the figure above.

  • UltraLLaMA far outperforms the other open-source models on the evaluation set, with a win rate of 85%.

  • UltraLLaMA has a 13% higher win rate than Vicuna.

independent assessment

Figure: Prompt for independent assessment

The responses of UltraLLaMA and the baseline models are also scored independently by ChatGPT, with quality scores from 1 to 10. In the table below, bold marks the best score and underline the second best.

Table: Overall and segment-level scores for each model on the curated evaluation set

The table above compares the scores of UltraLLaMA and the baseline models. UltraLLaMA outperforms the other open-source models on both the total score and most segments of the evaluation set, demonstrating its strong capabilities.

This breakdown also reflects each model's performance on different types of questions and instructions. In general, all models perform better on simple common-sense and world-knowledge questions, but worse on more complex tasks involving reasoning and creative writing. Interestingly, LLaMA, despite having fewer parameters, is comparable to larger models on common-sense and world-knowledge questions, but lags behind on more demanding tasks. We also note that the Pythia-based Dolly and OpenAssistant models perform worse than the LLaMA-based models even at comparable sizes, which illustrates the importance of the underlying language model.

Q&A accuracy

  • UltraLLaMA and the baseline models are tested on the TruthfulQA multiple-choice task: each model is asked to judge whether each candidate answer is true or false.

  • The following table shows each model's judgment accuracy. Identifying truthful statements remains a difficult task for existing models.

  • UltraLLaMA outperforms Vicuna and the other baselines on this task (a sketch of this evaluation follows the table).

Table: Accuracy of different models on the TruthfulQA benchmark
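
One way to estimate accuracy on the multiple-choice side of TruthfulQA is to ask the fine-tuned model to label each candidate answer as true or false and compare against the gold labels. The sketch below assumes a generic `generate(prompt)` callable wrapping the chat model (a hypothetical helper, not the paper's code) and uses the `truthful_qa` dataset from the Hugging Face hub; the prompt wording and the mc1 split choice are assumptions.

```python
from datasets import load_dataset

dataset = load_dataset("truthful_qa", "multiple_choice")["validation"]

def mc_accuracy(generate):
    """generate: callable that maps a prompt string to the model's text output."""
    correct, total = 0, 0
    for ex in dataset:
        choices = ex["mc1_targets"]["choices"]
        labels = ex["mc1_targets"]["labels"]          # 1 = true statement, 0 = false
        for choice, label in zip(choices, labels):
            prompt = (f"Question: {ex['question']}\n"
                      f"Candidate answer: {choice}\n"
                      f"Is this candidate answer true or false? Answer with one word.")
            pred = generate(prompt).strip().lower().startswith("true")
            correct += int(pred == bool(label))
            total += 1
    return correct / total
```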

Effects of the system prompt

  • System prompts are often used to steer the model toward particular roles and answering styles.

  • System prompts were found to affect the quality of the model's output. When the model is prompted to provide "useful and detailed" responses, it generates more relevant and informative answers.

  • Such prompts, while not necessarily improving accuracy on questions with definite answers, do improve the overall quality of responses because more supplementary information is included.

An example can be seen in the table below: both responses are correct, but the model guided by the system prompt produces a more detailed one.

Table: Comparison of UltraLLaMA with and without a system prompt
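
Concretely, the difference is just a system message prepended to the conversation. A minimal illustration follows; the exact system prompt used for UltraLLaMA is an assumption here.

```python
def build_messages(question, detailed=True):
    """Assemble the chat input with or without a guiding system prompt."""
    messages = []
    if detailed:
        # Hypothetical system prompt, paraphrasing the "useful and detailed" guidance.
        messages.append({"role": "system",
                         "content": "You are a helpful assistant. "
                                    "Provide useful and detailed responses."})
    messages.append({"role": "user", "content": question})
    return messages

# Same question, two configurations; only the system prompt differs.
with_prompt = build_messages("How many planets are in the solar system?", detailed=True)
without_prompt = build_messages("How many planets are in the solar system?", detailed=False)
```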

4 Summary

The results of this paper are of great significance to the development of chat language models. First, the creation of the UltraChat dataset provides a rich resource for training chat language models. Second, by fine-tuning the LLaMA model, the researchers successfully built UltraLLaMA, a dialogue model with superior performance, which offers a strong reference point for further optimization of chat language models.




Origin blog.csdn.net/qq_27590277/article/details/131016407