ChatHome: Development and Evaluation of a Domain-Specific Language Model for Home Renovation

Summary

This article introduces the development and evaluation of ChatHome, a domain-specific language model designed for the complex home decoration domain. Given the mature capabilities of large language models (LLMs) such as GPT-4 and the growing interest in home decoration, this study connects the two by building a specialized model that produces high-fidelity, precise output relevant to the home decoration domain. The novelty of ChatHome lies in its approach, which combines domain-adaptive pre-training and instruction tuning on a broad dataset covering professional articles, standard documents, and web content related to home improvement.

This two-pronged strategy aims to ensure that our models can absorb comprehensive domain knowledge and effectively handle user queries. Through thorough experiments on different datasets, both general and domain-specific, including the newly introduced “EvalHome” domain dataset, we confirm that ChatHome not only amplifies domain-specific capabilities but also retains its versatility.

Original paper: ChatHome: Development and Evaluation of a Domain-Specific Language Model for Home Renovation

If you don’t want to read the full text, just read the conclusion.

Conclusion

  • Using Baichuan-13B-Base as the base model, with both pre-training and SFT stages, a vertical-domain data to general data ratio of 1:5 works best.
  • Using Baichuan-13B-Chat as the base model, with SFT only, a 1:5 ratio of vertical-domain data to general data also works best.
  • Using Baichuan-13B-Base as the base model with the MIP (Multi-Task Instruction Pre-Training) approach, which integrates downstream instruction data into pre-training, a 1:0 ratio outperforms the two schemes above.

Introduction

Less important details omitted…

This study proposes ChatHome, a language model specially designed for home improvement.

Our method consists of two steps:

First, the generic model is continuously pretrained using a wide range of home improvement datasets, including professional articles, standard documents, and web content.
Second, instruction fine-tuning is performed on a dataset of question-answer pairs generated from home improvement prompts.

Note on the question-answer pair generation strategy: the question-answer dataset is generated with GPT-4 from home improvement prompts.

Related work

Less important details omitted…

According to different training stages, professional training methods in the LLM field can be roughly divided into the following categories:

  • Pre-train from scratch directly on domain data; this usually requires a large amount of domain data and has a high training cost.
  • Fine-tune directly on domain instruction data.
  • Continue pre-training a base LLM on domain data, then perform instruction fine-tuning.

Data Sources

Pre-training corpus

Previous research has shown that language models can benefit from knowledge gained through domain-specific corpora. We collect a domain-specific corpus to augment the model with knowledge about home decoration.

In addition, we also include a general corpus, which provides the model with a balance of general knowledge.

National standards: We collected several national standards on residential decoration and construction, notably the "Residential Building Design Code" and the "Housing Decoration Construction Code".
Domain books: We collected books on real estate, home decoration, renovation, architecture, and related fields published in the past ten years.
Domain websites: We crawled home improvement websites and, drawing also on WuDaoCorpora, collected about 30,000 articles on home improvement advice, home equipment purchase tips, and similar topics as samples.

Data preprocessing
The above data is processed through a unified pipeline, including text extraction, quality filtering, and deduplication. During text extraction, we discard irrelevant information such as pictures, tables, and URLs and only store relevant text. Additionally, we minimize the impact of duplicate data on model training by deduplicating data at the article and sentence levels. Finally, we obtained approximately 26.6M tokens from the domain corpus and 276.6M tokens from the general corpus.
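As a rough illustration, a minimal sketch of such a cleaning pipeline might look like the following. The paper does not publish its code, so the regexes, the length threshold, and the exact-hash deduplication here are assumptions, not the authors' actual implementation.

```python
import hashlib
import re

def extract_text(raw: str) -> str:
    """Keep plain text only: drop URLs and residual markup (images and tables
    are assumed to have been removed by the upstream parser)."""
    text = re.sub(r"https?://\S+", "", raw)   # remove URLs
    text = re.sub(r"<[^>]+>", "", text)       # remove leftover HTML tags
    return re.sub(r"\s+", " ", text).strip()

def quality_ok(text: str, min_chars: int = 50) -> bool:
    """Very rough quality filter: discard fragments that are too short."""
    return len(text) >= min_chars

def dedup(articles):
    """Deduplicate at the article level, then at the sentence level,
    using exact hashes (the paper does not specify its criteria)."""
    seen_articles, seen_sentences, cleaned = set(), set(), []
    for article in articles:
        a_hash = hashlib.md5(article.encode("utf-8")).hexdigest()
        if a_hash in seen_articles:
            continue
        seen_articles.add(a_hash)
        kept = []
        for sent in re.split(r"(?<=[。.!?])", article):
            sent = sent.strip()
            s_hash = hashlib.md5(sent.encode("utf-8")).hexdigest()
            if sent and s_hash not in seen_sentences:
                seen_sentences.add(s_hash)
                kept.append(sent)
        cleaned.append("".join(kept))
    return cleaned

def preprocess(raw_docs):
    texts = [extract_text(d) for d in raw_docs]
    texts = [t for t in texts if quality_ok(t)]
    return dedup(texts)
```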

SFT corpus

To alleviate the domain bias problem and improve the model's performance in a specific domain, we constructed about 25k instruction examples from high-quality home improvement books and home improvement website articles to help the model adapt to domain-specific knowledge. Details about these prompts are presented in Table 5 of the Appendix.
Single-turn dialogue: To obtain more questions related to home decoration, we first use GPT-4 to simulate the dual roles of interior designer and customer, generating several question-answer pairs based on the given knowledge. Then, to obtain more detailed responses, we submit those questions directly to GPT-4. This two-step approach lets us obtain more comprehensive and precise data.
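A hedged sketch of this two-step generation flow, using the OpenAI Python client, is shown below. The prompts, the question count, and the `knowledge` input are illustrative assumptions; the paper does not publish its exact prompts or generation settings.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_questions(knowledge: str, n: int = 5) -> list[str]:
    """Step 1: have GPT-4 play both interior designer and customer and
    propose questions grounded in the given knowledge snippet."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "You play two roles, an interior designer and a customer. "
                        "Based on the provided knowledge, write questions a customer might ask."},
            {"role": "user",
             "content": f"Knowledge:\n{knowledge}\n\nList {n} questions, one per line."},
        ],
    )
    lines = resp.choices[0].message.content.splitlines()
    return [line.strip("-• ").strip() for line in lines if line.strip()]

def answer_question(question: str) -> str:
    """Step 2: resubmit each question on its own to obtain a more detailed answer."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

# Usage: build (question, answer) pairs for one knowledge snippet.
# pairs = [(q, answer_question(q)) for q in generate_questions(snippet)]
```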

Multi-turn dialogue: Similar to single-turn dialogue, GPT-4 simulates the roles of interior designer and customer to generate multi-turn dialogues in the home decoration domain. To mitigate hallucinations, we provide GPT-4 with relevant articles so that its dialogue revolves around the provided knowledge. We also instruct GPT-4 to stay on topic and let the conversation progress naturally.

Base models

Baichuan-13B-Base: The parameter size is 13 billion, and the training corpus contains 1.4 trillion tokens.
Baichuan-13B-Chat: Built on the infrastructure of Baichuan-13B-Base, fine-tuned using specialized instructions. Therefore, it demonstrates improved dialogue generation and instruction comprehension capabilities.

We apply the previously mentioned home improvement domain dataset to fine-tune the two base models. To explore the advantages of Domain-Adaptive Pre-training (DAPT) in domain adaptation, we also conduct the same instruction tuning experiments on the models enhanced with DAPT.

DAPT: continue pre-training the base model on domain-specific datasets to obtain a domain-adapted model.

Domain adaptation inevitably faces the problem of catastrophic forgetting, characterized by the loss of previously acquired knowledge when adapting to a new domain. A straightforward approach to alleviate this problem is a rehearsal-based strategy that involves revisiting and relearning previously acquired knowledge. Considering that large language models are pre-trained on a wide range of general-purpose data, it is necessary to achieve a balance between general-purpose data and domain-specific data during domain adaptation. For each experiment, we performed five sets of data ratio tests to determine the most efficient data ratio scheme.
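For intuition, a rehearsal-style training mixture at a given domain-to-general ratio can be assembled with something like the sketch below. This is a simplification under assumptions: the ratio is counted per document here, and the paper does not describe its actual sampling procedure or whether the ratio is measured in documents or tokens.

```python
import random

def mix_corpora(domain_docs, general_docs, general_per_domain=5, seed=0):
    """Build a training mixture with a domain:general ratio of 1:general_per_domain.
    general_per_domain=5 corresponds to the 1:5 scheme, 0 to the 1:0 scheme."""
    rng = random.Random(seed)
    n_general = len(domain_docs) * general_per_domain
    sampled_general = rng.sample(general_docs, min(n_general, len(general_docs)))
    mixture = list(domain_docs) + sampled_general
    rng.shuffle(mixture)
    return mixture
```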

The parameter configurations of the DAPT and SFT stages are shown in Table 1. The only difference in the training hyperparameters between the DAPT and SFT stages is the maximum length, which is set to 1024 for DAPT and 1536 for SFT.
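Only the two maximum sequence lengths are stated explicitly in the text; a hedged configuration sketch is given below, where every value other than the two max lengths is a placeholder rather than a number taken from the paper.

```python
# Values other than max_length are placeholders for illustration only.
dapt_config = {
    "base_model": "Baichuan-13B-Base",
    "max_length": 1024,       # stated for the DAPT stage
    "learning_rate": 1e-5,    # placeholder
    "num_epochs": 1,          # placeholder
}

sft_config = {
    **dapt_config,
    "max_length": 1536,       # stated for the SFT stage
}
```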

Metrics

Evaluation is critical to the success of vertical domain models. For ChatHome, we not only hope to inject domain-related knowledge into the model but also care about the model's general capabilities after domain adaptation, so our evaluation includes two parts: a general capability assessment and a domain capability assessment.

General domain

To evaluate the general capabilities of the model, we adopted C-Eval and CMMLU, which are both benchmarks for evaluating the high-level knowledge and capabilities of the base model in the Chinese context.

Vertical domain

To the best of our knowledge, there are no authoritative exams in the field of home improvement. We construct a domain assessment called EvalHome, which covers three difficulty levels: domain fundamentals, domain expertise and innovative design, from low to high difficulty respectively. Since multiple-choice questions are a simple but good proxy for assessing the potential for advanced capabilities of domain models, we structured all questions in multiple-choice format, 113 in total. Table 2 shows the statistics of EvalHome.

The test set is quite small: only 113 questions in total!
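Since EvalHome is entirely multiple-choice, scoring reduces to plain accuracy. Below is a minimal sketch; the question format and field names are assumptions, as the benchmark itself is not public.

```python
def evalhome_accuracy(questions, predict):
    """Score a model on multiple-choice questions.

    questions: iterable of dicts such as
        {"question": "...", "choices": ["A ...", "B ...", "C ...", "D ..."], "answer": "B"}
    predict:   callable mapping one question dict to a letter such as "B".
    """
    correct = sum(1 for q in questions if predict(q) == q["answer"])
    return correct / len(questions)

# Usage with some model wrapper (hypothetical):
# accuracy = evalhome_accuracy(evalhome_questions, lambda q: my_model_choose(q))
```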

Analysis of results

The experimental results of the DAPT models on the general evaluation sets are shown in Table 3. We report the average scores on C-Eval and CMMLU.

Although the 1:10 scheme adds even more general data, the DAPT model suffers the smallest loss of general ability under the 1:5 ratio: its average scores on the C-Eval and CMMLU evaluation sets are only 2.57 and 3.08 points lower than the base model, respectively.

This model is denoted Baichuan-13B-Base-DAPT(1:5).

Table 4 presents the experimental results of the domain-adapted models on EvalHome and the general evaluation sets. Four groups of experiments were set up, where Baichuan-13B-Base-DAPT(1:0) denotes that the data ratio in the DAPT stage is 1:0. Apart from the Baichuan-13B-Base experiment, the other three, Baichuan-13B-Base-DAPT(1:0), Baichuan-13B-Base-DAPT(1:5), and Baichuan-13B-Chat, all achieve their best results on EvalHome under the 1:5 data ratio scheme.

Combining the results of these two tables, we can initially conclude that, with our current base models and home decoration data, the best performance is achieved at a data ratio of 1:5.

During the instruction tuning phase, we observed a noteworthy phenomenon: as more general instruction data is added, the model's score on the general capability evaluation sets decreases. This may be attributed to the focus of the evaluation benchmarks C-Eval and CMMLU, which primarily measure knowledge that our general instruction data may not cover.

As Table 4 shows, instruction tuning the models after DAPT yields EvalHome scores of 59.29 and 55.75, a slight improvement over the Baichuan-13B-Base model without DAPT, whose highest score is 53.98. However, instruction tuning the Baichuan-13B-Chat model achieves an even higher EvalHome score of 60.17, and compared with the Baichuan-13B-Chat model without updated parameters, the fine-tuned models improve significantly under every data ratio. This shows that, in our current domain scenario, instruction tuning after DAPT does not significantly outperform directly adapting an instruction-aligned model. We speculate that this is because the base model already contains a large amount of decoration-related data from pre-training.

This means that continuing pre-training from the base model and then performing SFT is not as effective as performing SFT directly on the chat model.

Furthermore, inspired by research showing the advantages of integrating downstream supervised datasets during pre-training, we attempt to integrate downstream instruction data during the DAPT stage. This strategy is called MIP (Multi-Task Instruction Pre-Training), and we use this name throughout the article. Due to training resource and time constraints, we did not conduct a detailed analysis of data ratios; in the MIP stage, our training data consists only of domain pre-training data and domain instruction data, without any general corpus. Nonetheless, an unexpected score of 69.03 was obtained on EvalHome, as shown in the last row of Table 4. Even more surprisingly, this model not only achieved the highest score on EvalHome but also scored higher on the two general capability benchmarks, outperforming all other models.

The findings indicate that incorporating downstream instruction data during the DAPT stage is beneficial given the conditions of the current domain dataset and underlying model. Our future plans include more in-depth data ratio experiments during the MIP phase.
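One common way to realize this kind of multi-task instruction pre-training is to serialize the instruction pairs into plain text and mix them with the domain pre-training corpus. The sketch below illustrates that idea; the template and field names are assumptions, not the paper's actual MIP format.

```python
def instruction_to_text(example: dict) -> str:
    """Turn one SFT example into a pre-training-style text sequence.
    The "Question/Answer" template is illustrative only."""
    return f"Question: {example['instruction']}\nAnswer: {example['output']}"

def build_mip_corpus(domain_docs, instruction_data):
    """MIP mixture as described here: domain pre-training text plus serialized
    instruction data, with no general corpus added (the 1:0 setting)."""
    return list(domain_docs) + [instruction_to_text(ex) for ex in instruction_data]
```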


Origin: blog.csdn.net/qq_44193969/article/details/132110110