Exclusive | When to fine-tune a large language model?


Author: Skanda Vivek
Translation: Chen Zhiyan
Proofreading: zrx


This article is about 3,100 words and takes roughly 7 minutes to read.
Fine-tuning open-source large language models is certainly exciting; how do non-open-source large language models compare?

Tags: large language models

I have been asked on LinkedIn how to fine-tune open-source models like LLaMA. Companies are looking for a business case around hosting and deploying LLM solutions and applying LLMs to their specific products. When I asked them why they don't just use a non-open-source model like ChatGPT, they didn't have a good answer. So I decided to write this article about how to use LLMs to solve everyday business problems.

The Case for Non-Open Source APIs

Have you tried building your specific use case with ChatGPT's API? Whether you want to summarize text, answer questions, or just put a chatbot on a website, ChatGPT usually does a good job on these language tasks.

It is commonly argued that non-open-source models are too expensive. But at $0.002 per 1,000 tokens, why not try the API on 100 samples and evaluate whether a large language model is the right choice for your specific application at all? In fact, at the scale of a few thousand API calls per day, the ChatGPT API is far cheaper than hosting and maintaining the open-source models discussed later in this article.
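As a quick sanity check on the economics, here is a back-of-the-envelope estimate at the $0.002 per 1,000 tokens price quoted above; the tokens-per-call and calls-per-day figures are assumptions you would replace with your own measurements:

```python
# Rough daily cost estimate for calling a hosted LLM API.
price_per_1k_tokens = 0.002  # USD, the gpt-3.5-turbo price quoted above
tokens_per_call = 1500       # assumed prompt + completion tokens per request
calls_per_day = 5000         # assumed traffic

daily_cost = calls_per_day * tokens_per_call / 1000 * price_per_1k_tokens
print(f"Estimated daily cost: ${daily_cost:.2f}")  # -> $15.00 per day
```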

A common counterpoint: suppose you need to answer questions over thousands of documents. Wouldn't it be easier to train or fine-tune an open-source model on that data? It turns out this is not as simple as it sounds (for reasons discussed below in the section on fine-tuning with labeled data).

But there is a simple way to have ChatGPT answer questions grounded in thousands of documents: store all of the documents in a database as small chunks of text.

Downloading Documents into a Database for Scaled LLM Queries | Skanda Vivek

Knowledge is offloaded from the model itself into a database of document chunks, which supplies the model with the information it needs to answer questions.

Relevant chunks are found by measuring the similarity between the question and each document chunk: both are converted into embedding vectors, the cosine similarity between each chunk and the question is computed, and only the chunks whose similarity exceeds a chosen threshold are kept as the relevant context.
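A minimal sketch of this retrieval step is shown below; the `embed` callable is a placeholder for whatever embedding model or API you use (it just needs to return one vector per text), and the 0.7 threshold is an arbitrary assumption:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve_chunks(question, chunks, embed, threshold=0.7):
    """Return the document chunks most similar to the question, best first.

    In practice the chunk embeddings would be precomputed and stored in the
    database rather than recomputed on every query.
    """
    q_vec = embed(question)
    scored = [(cosine_similarity(q_vec, embed(chunk)), chunk) for chunk in chunks]
    return [chunk for score, chunk in sorted(scored, reverse=True) if score >= threshold]
```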

Finally, the question and the retrieved context can be combined into a prompt like the one sketched below and sent to a large language model API such as ChatGPT's:
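Here is a rough sketch of that prompt assembly and API call, using the pre-1.0 `openai` Python package; the prompt wording and model choice are illustrative, not the exact format from the original post:

```python
import openai  # pre-1.0 openai package; adapt the call for newer client versions

def answer_with_context(question, context_chunks):
    """Combine the question and retrieved chunks into one prompt and query ChatGPT."""
    context = "\n\n".join(context_chunks)
    prompt = (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response["choices"][0]["message"]["content"]
```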

You might ask: why split documents into small chunks instead of feeding the entire document plus the question into the prompt? The reason is that large language models have a hard limit on combined input and output length. For ChatGPT that limit is 4,096 tokens, roughly 6-7 pages of text.
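Chunking itself can be as simple as the sketch below; splitting on whitespace is a simplifying assumption, and production systems typically split by tokens (for example with the tiktoken library) while keeping some overlap between chunks:

```python
def split_into_chunks(text, max_words=300, overlap=50):
    """Split text into overlapping, word-based chunks."""
    words = text.split()
    step = max_words - overlap
    return [" ".join(words[start:start + max_words])
            for start in range(0, len(words), step)]
```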

Storing documents in a database and querying them through a non-open-source LLM API works well when the answer is clearly present in the documents. But in many cases it produces suboptimal results, for example when you want ChatGPT to provide an expert perspective on a topic, such as speculating on the financial prospects of the US economy.

Domain-Specific LLMs

Maybe you have tried this with ChatGPT or Bard and found the answers a bit off. For example, ask ChatGPT:

"How long is inflation likely to continue in the U.S. economy?"

It returns the following answer:

As an AI language model, I cannot predict the future with certainty, especially with regard to complex economic systems. However, I can offer you some information and perspective to help you understand the current situation.

Inflation is a complex phenomenon that can be caused by various factors such as increased demand, supply constraints, etc….

Unfortunately, answers like this are not good enough. Of course, you could prompt ChatGPT with recent news about Fed Chairman Jerome Powell's statements, but that still does not give you deep domain experience; if you pushed the conversation further, you might conclude: well, I should just ask Jerome Powell himself, or some other specialist.

Consider how a person becomes an expert in a field: reading books about it helps, but so does interacting with experts and learning from experience. ChatGPT has been trained on plenty of finance books, but it has not been coached by top financial experts or experts in other specialized domains. So how do you make an LLM an "expert" in finance? This is where fine-tuning comes in.

Fine-tuning the LLM

Before discussing fine-tuning large language models, let's first talk about fine-tuning smaller language models like BERT, which was the common practice before LLMs arrived. For models like BERT and RoBERTa, fine-tuning means passing in some context and labels for a well-defined task, such as extracting answers from a context or classifying emails as spam or not spam. I have written a few blog posts about these, which might be useful if you are interested in fine-tuning such language models:
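For concreteness, here is a minimal sketch of that kind of task-specific fine-tuning with the Hugging Face transformers Trainer; the tiny in-memory spam dataset and the hyperparameters are placeholders, not a recipe from those posts:

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Toy labeled data (1 = spam, 0 = not spam); substitute your own dataset.
data = Dataset.from_dict({
    "text": [
        "WINNER!! Claim your free prize now",
        "Can we move tomorrow's meeting to 3pm?",
        "URGENT: your account has been suspended, click here",
        "Lunch at the usual place on Friday?",
    ],
    "label": [1, 0, 1, 0],
})

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

def tokenize(batch):
    # Pad/truncate so every example has the same fixed length.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=64)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="spam-bert", num_train_epochs=3,
                           per_device_train_batch_size=2),
    train_dataset=data.map(tokenize, batched=True),
)
trainer.train()
```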

The reason large language models (LLMs) are so popular is that they can perform many tasks seamlessly just by changing the prompt, and the experience feels like talking to a human on the other end. The goal now is to adapt an LLM so that it becomes an expert on a subject and participates in a conversation like a "human" expert. This is quite different from fine-tuning a BERT model on a single, narrow task.

One of the first open-source breakthroughs came from a group of Stanford researchers who fine-tuned the 7B LLaMA model (released by Meta earlier this year) on 52K instructions for less than $600, calling the result Alpaca. Shortly after, the Vicuna team released a 13-billion-parameter model that reportedly achieved roughly 90% of ChatGPT's quality.
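For a sense of what instruction fine-tuning looks like in code, below is a hedged sketch of parameter-efficient (LoRA) tuning in the spirit of Alpaca, using Hugging Face transformers and peft; the base checkpoint, the 52K-instruction dataset, the prompt template, and all hyperparameters are placeholders rather than the exact Alpaca recipe:

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base_model = "huggyllama/llama-7b"  # placeholder: any causal-LM checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA tokenizers ship without a pad token

# Wrap the base model with small trainable LoRA adapters instead of updating all weights.
model = get_peft_model(
    AutoModelForCausalLM.from_pretrained(base_model),
    LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
               target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"),
)

# 52K instruction/response pairs; the dataset's "input" field is ignored in this sketch.
data = load_dataset("tatsu-lab/alpaca", split="train")

def format_and_tokenize(example):
    text = (f"### Instruction:\n{example['instruction']}\n\n"
            f"### Response:\n{example['output']}")
    return tokenizer(text, truncation=True, max_length=512)

tokenized = data.map(format_and_tokenize, remove_columns=data.column_names)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="alpaca-lora", num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=tokenized,
    # mlm=False makes the collator use the input tokens themselves as labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```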

More recently, the MPT-7B transformer was released; it can ingest 65K tokens, 16 times the input size of ChatGPT, and was trained from scratch in 9.5 days at a cost of about $200K. As an example of a domain-specific LLM, Bloomberg released BloombergGPT, a GPT-like model built for finance and also trained from scratch.

Recent advances in training and fine-tuning open-source models mean that small and mid-sized companies can enrich their offerings with custom LLMs. So how do you decide when to fine-tune or train an LLM for a specialized domain?

First, be clear about where closed-source LLM APIs fall short in your domain, and make the case for letting customers chat with a domain expert at a small cost. Fine-tuning a model on 100K or so instructions is not very expensive, but getting the right instructions takes careful thought. This is where you need to be a bit bold: I can't yet point to a fine-tuned model that performs significantly better than ChatGPT in any specialized domain, but I believe an inflection point is coming, and any company that gets there will be rewarded.

That got me thinking about the case for training an LLM fully from scratch, which can easily cost upwards of hundreds of thousands of dollars; still, investors would happily chip in given a compelling reason. In a recent interview with IBM, Hugging Face CEO Clem Delangue commented that before long, custom large language models will be as common as proprietary codebases, and will be just as important a part of what makes a company competitive.

Key takeaways

Domain-specific LLMs are very valuable in industry, and they come in three tiers of increasing cost and customizability:

1. Non-open-source API + document embedding database: this first solution is probably the easiest to implement and, given the high quality of the ChatGPT API, may even provide good enough (if not the best) performance. And it is cheap!

2. Fine-tuning an LLM: recent results from fine-tuning LLaMA models show that it takes around $500 to reach ChatGPT-like baseline performance in certain domains. It is worth a try if you have a database of roughly 50-100K instructions or conversations to fine-tune a base model on.

3. Training from scratch: as LLaMA and the recent MPT-7B show, this costs on the order of $100-200K and takes a week or two.

Now that you understand these trade-offs, go build your own domain-specific LLM application!

Original title:

When Should You Fine-Tune LLMs?

Original link:

https://medium.com/towards-data-science/when-should-you-fine-tune-llms-2dddc09a404a?source=explore---------8-58--------------------bbc182a3_471b_4f78_ad66_68a6b5de2c39-------15

Editor: Yu Tengkai

Proofreading: Lin Yilin

Translator profile


Chen Zhiyan graduated from Beijing Jiaotong University with a master's degree in communication and control engineering. He has worked as an engineer at Great Wall Computer Software and Systems Co. and Datang Microelectronics, and currently provides technical support at Beijing Wuyi Chaoqun Technology Co., Ltd., where he works on the operation and maintenance of an intelligent translation teaching system. He has accumulated experience in deep learning and natural language processing (NLP), and enjoys translation and writing in his spare time. His translated works include IEC-ISO 7816, an Iraqi petroleum engineering project, and the "Declaration of New Fiscalism", the Chinese-English version of which was published in the Global Times. In his spare time he joined the THU Data Pie (Datapi) volunteer translation group, and hopes to exchange ideas, share, and make progress together with everyone.
