Pre-training for domain adaptation of LLMs

So far, I've highlighted that when developing an application, you typically work with an existing LLM. This saves you a lot of time and gets you to a working prototype much faster.

However, there is one case where you might find it necessary to pre-train your own model from scratch. If your target domain uses vocabulary and linguistic structures that are not commonly used in everyday language, you may need domain adaptation to achieve good model performance.

For example, imagine you're a developer building an application to help attorneys and paralegals summarize legal documents. Legal writing uses very specific terms, like "mens rea" in the first example and "res judicata" in the second. These terms are rarely used outside the legal world, which means they are unlikely to appear widely in the training text of existing LLMs. As a result, models may have difficulty understanding these terms or using them correctly.
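
One way to see how foreign such terms are to a general-purpose model is to look at how an off-the-shelf tokenizer splits them. The sketch below is purely illustrative; it assumes the Hugging Face transformers library and the GPT-2 tokenizer, neither of which is mentioned in the lecture.

```python
# Illustrative sketch: how a general-purpose tokenizer fragments rare legal terms.
# Assumes the Hugging Face `transformers` package is installed.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

for term in ["mens rea", "res judicata", "consideration"]:
    pieces = tokenizer.tokenize(term)
    print(f"{term!r} -> {pieces} ({len(pieces)} tokens)")

# Rare domain terms tend to split into several subword pieces, a hint that the
# model saw them infrequently (if at all) during its original pre-training.
```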

Another problem is that legal language sometimes uses everyday words in a different context, like "consideration" in the third example. Here it has nothing to do with being kind or thoughtful; it refers to an essential element of a contract that makes an agreement enforceable. For similar reasons, you may run into challenges if you try to use an existing LLM in a medical application.

Medical language contains many uncommon words used to describe medical conditions and procedures, and these may not appear often in training datasets made up of web-scraped text and books. Some domains also use language in highly idiosyncratic ways.

This last example of medical language may look like just a random string of characters, but it's actually shorthand that doctors use to write prescriptions. The text makes perfect sense to a pharmacist: take one tablet by mouth, four times a day, after meals and at bedtime.

Because a model learns its vocabulary and understanding of language through the original pre-training task, pre-training your model from scratch will result in better performance in highly specialized domains such as law, medicine, finance, or science.

Now, let's go back to BloombergGPT, first announced in 2023 in a paper by Bloomberg's Shijie Wu, Steven Lu, and colleagues. BloombergGPT is an example of a large language model that has been pretrained for a specific domain, finance.

Researchers at Bloomberg chose to combine financial data with general-purpose text data to pre-train a model that achieves state-of-the-art results on financial benchmarks while maintaining competitive performance on general LLM benchmarks. They therefore assembled a dataset consisting of 51% financial data and 49% public, general-purpose data.
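
As an illustration of how such a mixture might be realized in a data pipeline (a hypothetical sketch only; the corpora and sampling code below are not from the BloombergGPT paper), you could interleave the two sources with weighted sampling:

```python
# Hypothetical sketch of how a 51% / 49% data mixture could be sampled during training.
# The corpora below are placeholders, not Bloomberg's actual datasets or pipeline.
import random

financial_corpus = ["<financial document 1>", "<financial document 2>"]
general_corpus = ["<public web/book document 1>", "<public web/book document 2>"]

def sample_training_document(rng: random.Random) -> str:
    """Pick a source corpus according to the 51/49 mixture, then a document from it."""
    source = rng.choices(["financial", "general"], weights=[0.51, 0.49])[0]
    corpus = financial_corpus if source == "financial" else general_corpus
    return rng.choice(corpus)

rng = random.Random(0)
mini_batch = [sample_training_document(rng) for _ in range(8)]
```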

In their paper, the Bloomberg researchers describe the model's architecture in more detail. They also discuss how they started with Chinchilla's scaling laws for guidance, and where they had to make trade-offs.

These two graphs compare a number of LLMs, including BloombergGPT, to the scaling laws discussed by researchers.

On the left, the diagonal line traces the compute-optimal model size, in billions of parameters, across a range of compute budgets.

On the right, the line traces the compute-optimal training dataset size, measured in number of tokens.

The dotted pink line on each graph represents the computational budget the Bloomberg team used to train the new model.

The pink shaded region on each graph corresponds to the compute-optimal scaling law identified in the Chinchilla paper.

In terms of model size, you can see that BloombergGPT roughly follows the Chinchilla approach given a computational budget of 1.3 million GPU hours, or about 230 million petaflops. The model is only slightly above the pink shaded area, indicating a near-optimal number of parameters.
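
As a quick sanity check on those two figures (plain arithmetic, not a number taken from the paper), you can work out the effective per-GPU throughput they imply:

```python
# Sanity-check the two quoted numbers: 1.3 million GPU hours and ~230 million petaFLOPs.
gpu_hours = 1.3e6
total_flops = 230e6 * 1e15            # 230 million petaFLOPs ≈ 2.3e23 floating point operations

gpu_seconds = gpu_hours * 3600
implied_throughput = total_flops / gpu_seconds   # FLOPs per GPU per second

print(f"Implied effective throughput ≈ {implied_throughput / 1e12:.0f} teraFLOPs/s per GPU")
# ≈ 49 teraFLOPs/s per GPU, i.e. the two figures are consistent with each other.
```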

However, the actual number of tokens used to pre-train BloombergGPT, 569 billion, is below the Chinchilla-recommended value for the available compute budget. This smaller-than-optimal training dataset reflects the limited availability of data in the financial domain.
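
To make the trade-off concrete, here is a back-of-the-envelope estimate of the compute-optimal point for this budget. It assumes the common C ≈ 6·N·D approximation for training FLOPs and the Chinchilla rule of thumb of roughly 20 training tokens per parameter; both constants are illustrative simplifications, not figures from the BloombergGPT paper.

```python
# Back-of-the-envelope Chinchilla-style estimate (assumptions: C ≈ 6*N*D and D ≈ 20*N).
compute_budget_flops = 230e6 * 1e15   # the ~230 million petaFLOPs quoted above

# With C = 6*N*D and D = 20*N, we get C = 120*N^2, so:
optimal_params = (compute_budget_flops / 120) ** 0.5
optimal_tokens = 20 * optimal_params

print(f"Compute-optimal parameters ≈ {optimal_params / 1e9:.0f}B")
print(f"Compute-optimal tokens     ≈ {optimal_tokens / 1e9:.0f}B")

# Rough output: ~44B parameters and ~876B tokens. BloombergGPT's roughly 50B parameters
# sit close to this estimate, while its 569B training tokens fall short of it,
# consistent with the trade-off described above.
```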

This shows that real-world constraints may force you to make trade-offs when pre-training your own models.

Congratulations on completing your first week. You've covered a lot, so let's take a minute to review what you've seen.

  1. Mike walked you through some common use cases for LLMs, such as writing, summarizing conversations, and translation.

  2. He then detailed the Transformer architecture that powers these models.

  3. He also discussed some of the parameters you can use at inference time to influence the model's output.

  4. He summarized a generative AI project lifecycle that you can use to plan and guide your application development efforts.

  5. Next, you saw how models are trained on large amounts of text data in an initial training phase called pre-training. This is where the model develops its understanding of language.

  6. You explored the computational challenges of training these models, which are substantial.

  7. In practice, you will almost always use some form of quantization when training a model because of GPU memory limitations (see the quick memory estimate after this list).

  8. You concluded the week with a discussion of scaling laws for LLMs and how they can be used to design computationally optimal models.
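
As a quick reminder of why quantization matters, the sketch below estimates how much memory the weights alone occupy at different precisions (a simplification; training also has to store gradients and optimizer states):

```python
# Rough memory needed just to store model weights at different precisions.
# (Training also needs gradients and optimizer states, which add several times more.)
BYTES_PER_PARAM = {"fp32": 4, "fp16/bf16": 2, "int8": 1}

def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Approximate weight storage in gigabytes."""
    return num_params * bytes_per_param / 1e9

for precision, nbytes in BYTES_PER_PARAM.items():
    print(f"1B-parameter model in {precision}: ~{weight_memory_gb(1e9, nbytes):.0f} GB of weights")
```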

If you want to read in more detail, be sure to check out this week's reading exercise.

Reference

https://www.coursera.org/learn/generative-ai-with-llms/lecture/BMxlN/pre-training-for-domain-adaptation
