Why use top_p for text generation sampling?

The previous article introduced Temperature, a controllable parameter when generating text with a large model (see: Temperature parameter and softmax). Today we continue with another parameter that also shapes the generated output.

You may have come across this parameter when using the OpenAI API: it is top_p.
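For instance, with the official openai Python SDK (the v1-style client is shown here purely as an illustration; the model name and prompt are placeholders), top_p is just another keyword argument:

```python
# Minimal sketch of passing top_p to the OpenAI chat completions API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # placeholder model name
    messages=[{"role": "user", "content": "Write one sentence about whales."}],
    top_p=0.9,  # sample only from the smallest set of tokens whose probabilities sum to 0.9
)
print(response.choices[0].message.content)
```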

In algorithm interviews we are often asked about sorting, and a classic question there is how to extract the largest K values from a pile of data. This is known as the top_k problem.

So when a large model generates text, what is top_p? What is the difference and connection between it and top_k?

1. What is Top-p?

Top_p sampling, also known as nucleus sampling, is a text sampling strategy used in natural language generation.

A sampling strategy here means a rule for deciding which of the model's predicted candidate words is actually emitted as the next output.

Traditional Top-k sampling

In Top_k sampling, the language model only considers the k words with the highest probability when predicting the next word.

However, this method has a shortcoming: it always keeps exactly k of the highest-probability words as candidates and discards everything else.

As a result, the generated text can lack diversity, because sometimes a lower-probability word is actually the right choice for the context.
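As a point of reference, a toy top_k filter over a probability vector might look like the following sketch (NumPy-based; the function name and signature are illustrative only):

```python
import numpy as np

def top_k_sample(probs, k=5, rng=None):
    """Toy top_k sampling: keep the k most probable tokens, renormalize, sample one."""
    rng = rng or np.random.default_rng()
    probs = np.asarray(probs, dtype=float)
    top_idx = np.argsort(probs)[::-1][:k]              # indices of the k largest probabilities
    top_probs = probs[top_idx] / probs[top_idx].sum()  # renormalize over the kept tokens
    return rng.choice(top_idx, p=top_probs)            # return the sampled token index
```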

Top-p sampling

To overcome this limitation, Top_p sampling fixes a probability threshold p and then takes the smallest set of words from the model's predicted distribution whose probabilities sum to at least p.

In other words, all possible words are filtered into a candidate set. The filtering works like this:

Start with the highest-probability word, then the second highest, then the third, and so on, accumulating their probabilities as you go. Once the cumulative probability reaches or exceeds the threshold p, stop; the words selected so far are put into the set as the candidates.
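A toy implementation of this filtering, under the same illustrative assumptions as the top_k sketch above, could look like this:

```python
import numpy as np

def top_p_sample(probs, p=0.75, rng=None):
    """Toy top_p (nucleus) sampling: keep the smallest high-probability prefix
    whose cumulative probability reaches p, renormalize, sample one token."""
    rng = rng or np.random.default_rng()
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]               # token indices, most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1   # first prefix whose sum reaches p
    keep = order[:cutoff]
    keep_probs = probs[keep] / probs[keep].sum()  # renormalize over the kept tokens
    return rng.choice(keep, p=keep_probs)         # return the sampled token index
```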


2. Give an example

Assume that a language model predicts the following probability distribution for the next word. Below, we compare how the two sampling methods, top_k and top_p, each pick their candidates from it.

Word      Probability
the       0.20
of        0.18
and       0.15
whale     0.10
ocean     0.08
aquarium  0.07
to        0.06
in        0.05
a         0.04
an        0.03
...

Top-k sampling (assuming k=5)

In top_k sampling, the model will only select from the k words with the highest probability:

  • the
  • of
  • and
  • whale
  • ocean

This sampling has an obvious disadvantage: it always selects a fixed number of words.

If the probability distribution over these 10 words is fairly flat, a fixed cutoff can miss words that fit a specific context very well. For example, in an article about aquatic life, aquarium may be the more suitable word, yet with k=5 it is never even considered.

Top-p sampling (assuming p=0.75)

In top_p sampling, we select words based on probability accumulation, starting from the word with the highest probability and accumulating until the p value is reached:

  • the (0.20)
  • of (0.18)
  • and (0.15)
  • whale (0.10)
  • ocean (0.08)
  • aquarium (0.07)

After the first five words the cumulative probability is only 0.71, which is still below p, so aquarium is added as well; the total then reaches 0.78, which exceeds the threshold of 0.75.

In top_p sampling, even though aquarium does not have the highest probability, it is included because it is needed for the cumulative probability to reach the threshold.
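To make the comparison concrete, the small script below reproduces the two candidate-set selections on the example distribution above (only the ten listed words are included; the trailing "..." entries are ignored for simplicity):

```python
probs = {
    "the": 0.20, "of": 0.18, "and": 0.15, "whale": 0.10, "ocean": 0.08,
    "aquarium": 0.07, "to": 0.06, "in": 0.05, "a": 0.04, "an": 0.03,
}

ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)

# top_k candidate set (k = 5): always a fixed number of words
k = 5
top_k_words = [word for word, _ in ranked[:k]]

# top_p candidate set (p = 0.75): accumulate until the sum reaches the threshold
p, total, top_p_words = 0.75, 0.0, []
for word, prob in ranked:
    top_p_words.append(word)
    total += prob
    if total >= p:
        break

print(top_k_words)  # ['the', 'of', 'and', 'whale', 'ocean']
print(top_p_words)  # ['the', 'of', 'and', 'whale', 'ocean', 'aquarium']  (sum = 0.78)
```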

3. The core difference between the two

  • top_k is fixed: regardless of the shape of the distribution, top_k always keeps a fixed number of words, which may include low-probability words that do not suit the current context, or exclude suitable ones.
  • top_p is flexible: it selects words based on the cumulative probability of the distribution, so the number of candidates varies with the context, which makes it easier to include the words that actually fit.

Overall, by working with cumulative probabilities instead of a fixed word count, top_p sampling gives finer control over the balance between randomness and determinism, which tends to improve both the coherence and the diversity of the generated text.

This is why in practical applications, top_p is often considered a better choice than top_k.
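In practice the two knobs usually sit side by side in the same API. As one illustration (assuming the Hugging Face transformers library; "gpt2" is only a stand-in model), generate() accepts both parameters, and setting top_k=0 leaves only the top_p filter active:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2" is only a stand-in; any causal language model works the same way.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The ocean is home to", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,    # sample instead of greedy decoding
    top_p=0.75,        # nucleus sampling threshold
    top_k=0,           # 0 disables the fixed-k filter so only top_p applies
    max_new_tokens=20,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```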

This article was first published on the WeChat official account "Dong Dongcan is a siege lion."

Originally published at: blog.csdn.net/dongtuoc/article/details/135042289