LLMs Parameter-Efficient Fine-Tuning (PEFT) Techniques 2: Soft Prompts

With LoRA, the goal is to find an efficient way to update the model's weights without having to train every parameter again. There are also additive methods within PEFT that aim to improve model performance without changing the weights at all. In this video, you will explore a second parameter-efficient fine-tuning method called prompt tuning.

Now, prompt tuning sounds a bit like prompt engineering, but they are quite different from each other. With prompt engineering, you work on the language of your prompt to get the completion you want. This can be as simple as trying different words or phrases, or as complex as including one or several complete examples for one-shot or few-shot inference. The goal is to help the model understand the nature of the task you are asking it to carry out and to generate better completions. However, prompt engineering has some limitations: writing and trying out different prompts can require a lot of manual effort, you are limited by the length of the context window, and in the end you may still not achieve the performance you need for your task.

With prompt tuning, you add additional trainable tokens to your prompt and let the supervised learning process determine their optimal values. This set of trainable tokens is called a soft prompt, and it gets prepended to the embedding vectors that represent your input text. The soft prompt vectors have the same length as the embedding vectors of the language tokens, and including somewhere between 20 and 100 virtual tokens can be sufficient for good performance.
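To make this concrete, here is a minimal PyTorch sketch of the idea, not the paper's exact implementation: the class name `SoftPrompt`, the default of 20 virtual tokens, and the 768-dimensional embedding size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """A block of trainable virtual-token embeddings prepended to the input."""

    def __init__(self, num_virtual_tokens: int = 20, embed_dim: int = 768):
        super().__init__()
        # One trainable vector per virtual token, same width as real token embeddings.
        self.prompt = nn.Parameter(torch.randn(num_virtual_tokens, embed_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, embed_dim) from the model's embedding layer.
        batch_size = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch_size, -1, -1)
        # Result: (batch, num_virtual_tokens + seq_len, embed_dim).
        return torch.cat([prompt, input_embeds], dim=1)
```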

The tokens that represent natural language are hard in the sense that they each correspond to a fixed location in the embedding vector space.

However, soft prompts are not fixed discrete words of natural language. Instead, you can think of them as virtual tokens that can take on any value within the continuous multidimensional embedding space. Through supervised learning, the model learns the values of these virtual tokens that maximize performance for a given task.

In full fine-tuning, the training dataset consists of input prompts and output completions or labels, and the weights of the large language model are updated during supervised learning. With prompt tuning, by contrast, the weights of the large language model are frozen and the underlying model does not get updated.

Instead, the embedding vectors of the soft prompt are updated over time to optimize the model's completion of the prompt.
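In code terms, that amounts to freezing every base-model parameter and handing the optimizer only the soft-prompt embeddings. A rough sketch under the same assumptions as the `SoftPrompt` module above, using an encoder-decoder model such as FLAN-T5 so the labels are decoder-side targets (`base_model`, `dataloader`, and the learning rate are illustrative):

```python
# Freeze the base LLM; only the soft prompt will receive gradient updates.
for param in base_model.parameters():
    param.requires_grad = False

soft_prompt = SoftPrompt(num_virtual_tokens=20, embed_dim=768)
optimizer = torch.optim.AdamW(soft_prompt.parameters(), lr=0.3)

for batch in dataloader:  # supervised (prompt, completion) pairs
    input_embeds = base_model.get_input_embeddings()(batch["input_ids"])
    inputs_embeds = soft_prompt(input_embeds)       # prepend the virtual tokens
    loss = base_model(inputs_embeds=inputs_embeds,
                      labels=batch["labels"]).loss
    loss.backward()                                 # gradients reach only the soft prompt
    optimizer.step()
    optimizer.zero_grad()
```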

Prompt tuning is a very parameter-efficient strategy because only a few parameters are being trained.
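To put a rough number on it: 100 virtual tokens in a model with 768-dimensional embeddings (the illustrative sizes used in the sketches above) amounts to only 100 × 768 = 76,800 trainable values.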

This is similar to what you saw with LoRA, in contrast to the millions to billions of parameters updated in full fine-tuning. You can train a different set of soft prompts for each task and easily swap them out at inference time: train one set of soft prompts for one task and another set for a different task. To use them for inference, you prepend your input prompt with the learned tokens; to switch to another task, you simply change the soft prompt. Soft prompts are very small on disk, so this kind of fine-tuning is extremely efficient and flexible. Notice that the same LLM is used for every task; all you have to do is switch out the soft prompts at inference time.
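Because the only task-specific artifact is that small tensor of virtual-token embeddings, switching tasks is just a matter of loading a different file. A rough sketch, continuing with the hypothetical names from the earlier snippets (`base_model`, `tokenizer`, `SoftPrompt`); the `generate` call assumes an encoder-decoder model that accepts `inputs_embeds`:

```python
# One frozen LLM, several tiny task-specific soft prompts on disk.
torch.save(summarize_prompt.state_dict(), "soft_prompt_summarize.pt")  # roughly a few hundred KB
torch.save(qa_prompt.state_dict(), "soft_prompt_qa.pt")

def infer(text: str, prompt_path: str) -> str:
    soft_prompt = SoftPrompt(num_virtual_tokens=20, embed_dim=768)
    soft_prompt.load_state_dict(torch.load(prompt_path))
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    embeds = base_model.get_input_embeddings()(input_ids)
    output_ids = base_model.generate(inputs_embeds=soft_prompt(embeds),
                                     max_new_tokens=64)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Same base model, different tasks: only the soft prompt file changes.
summary = infer(article_text, "soft_prompt_summarize.pt")
answer = infer(question_text, "soft_prompt_qa.pt")
```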

So how well does prompt tuning perform? The method was explored in the original paper by Brian Lester and collaborators at Google, where the authors compared prompt tuning to several other approaches across a range of model sizes. In this graph from the paper, you can see the model size on the X-axis and the SuperGLUE score on the Y-axis. SuperGLUE is the evaluation benchmark you learned about earlier this week, which scores model performance on a number of different language tasks. The red line shows the scores for models created with full fine-tuning.

And the orange line shows the score of the model created using multi-task fine-tuning. The green line shows the performance for prompt tuning, and finally, the blue line shows the score for prompt engineering only. As you can see, for smaller LLMs, prompt tuning does not perform as well as full fine-tuning.

However, as the model size increases, so does the performance of prompt tuning. Once the model has around 10 billion parameters, prompt tuning can be as effective as full fine-tuning and provides significant performance improvements over using prompt engineering alone.

One potential issue to consider is the interpretability of the learned virtual tokens. Keep in mind that because the soft prompt tokens can take on any value within the continuous embedding vector space, the trained tokens do not correspond to any known token, word, or phrase in the vocabulary of the LLM.

However, an analysis of the nearest-neighbor tokens to the soft prompt locations shows that they form tight semantic clusters. In other words, the words closest to the soft prompt tokens have similar meanings, and the words identified are often related to the task, suggesting that the prompts are learning word-like representations.
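This kind of analysis can be approximated by comparing each learned virtual-token vector against the model's vocabulary embedding matrix, for example with cosine similarity. A small sketch reusing the hypothetical names from the earlier snippets:

```python
import torch.nn.functional as F

# Find the vocabulary tokens closest to each learned virtual token.
vocab_embeds = base_model.get_input_embeddings().weight       # (vocab_size, embed_dim)
for i, virtual_token in enumerate(soft_prompt.prompt.detach()):  # each is (embed_dim,)
    sims = F.cosine_similarity(virtual_token.unsqueeze(0), vocab_embeds, dim=-1)
    top_ids = sims.topk(5).indices
    print(f"virtual token {i}: {tokenizer.convert_ids_to_tokens(top_ids.tolist())}")
```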

You explored two PEFT methods in this lesson: LoRA, which uses rank-decomposition matrices to update the model parameters in an efficient way, and prompt tuning, which adds trainable tokens to your prompt while leaving the model weights unchanged. Both methods enable you to fine-tune models with the potential for improved performance on your tasks, while using much less compute than full fine-tuning.

LoRA is broadly used in practice because of its comparable performance to full fine-tuning for many tasks and datasets, and you'll get to try it out in this week's lab.
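In the lab you'll work with the Hugging Face peft library; configuring LoRA there looks roughly like the sketch below. The specific hyperparameter values and the `["q", "v"]` target modules (T5's attention projections) are illustrative assumptions, so defer to the lab notebook for the exact settings.

```python
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, TaskType, get_peft_model

model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

lora_config = LoraConfig(
    r=32,                        # rank of the low-rank decomposition matrices
    lora_alpha=32,               # scaling factor applied to the LoRA updates
    target_modules=["q", "v"],   # which attention projections to adapt
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM,
)

peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()  # reports the small trainable fraction
```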

So congratulations on completing week two. Let's recap what you saw earlier this week, where Mike walked you through how to adapt the base model through a process called instruction fine-tuning.

Along the way, you saw some of the prompt templates and datasets used to train the FLAN-T5 model. You also learned how to use evaluation metrics and benchmarks such as ROUGE and HELM to measure success during model fine-tuning.

In practice, instruction fine-tuning has proven to be very effective and useful across a wide range of natural language use cases and tasks. It's amazing how well you can fine-tune a model to fit your specific task with just a few hundred examples.

Next, you learned how Parameter-Efficient Fine-Tuning (PEFT) reduces the amount of compute required to fine-tune a model. You learned about two methods you can use for this: LoRA and prompt tuning. By the way, you can also combine LoRA with the quantization techniques you learned about in week one to further reduce your memory footprint; this combination is known as QLoRA. In practice, PEFT is used heavily to minimize compute and memory requirements. This ultimately reduces the cost of fine-tuning, lets you make the most of your compute budget, and speeds up your development process.
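A rough sketch of what that QLoRA combination can look like with the transformers and peft libraries; the model id, quantization settings, and LoRA hyperparameters shown here are illustrative assumptions, and the 4-bit loading additionally requires the bitsandbytes package.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model

# Load the base model with 4-bit quantized weights (requires bitsandbytes).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",          # illustrative model id
    quantization_config=bnb_config,
)

# Attach LoRA adapters; only these small matrices are trained.
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                         task_type=TaskType.CAUSAL_LM)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```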

Reference

https://www.coursera.org/learn/generative-ai-with-llms/lecture/8dnaU/peft-techniques-2-soft-prompts
