Common decoding strategies for generative models | greedy search & beam search & top-k sampling & top-p sampling

1. Greedy search

At each step, directly take the token with the highest probability. This makes it easy to fall into a local optimum.
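A minimal sketch of greedy decoding, assuming a hypothetical `step_logits(prefix)` function (not part of the original post) that returns the model's next-token logits for the current prefix:

```python
import numpy as np

def greedy_decode(step_logits, bos_id, eos_id, max_len=50):
    """Greedy decoding: at every step keep only the single most likely token."""
    tokens = [bos_id]
    for _ in range(max_len):
        logits = step_logits(tokens)      # hypothetical model call: next-token logits
        next_id = int(np.argmax(logits))  # take the highest-probability token directly
        tokens.append(next_id)
        if next_id == eos_id:             # stop once the end-of-sequence token is produced
            break
    return tokens
```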

2. Beam search

At each step, only the beam-size candidates with the highest probability are kept (with beam size = 1 this reduces to greedy search). This leaves more room for exploration during generation and makes it less likely to fall into a local optimum.

Specific procedure:

  • Select the beam-size candidates with the highest current scores and put them into the beam.
  • Then look at each candidate's next-step predictions, and from all of these expansions take the beam-size candidates with the highest scores and put them into the beam.
  • When a candidate generates the end-of-sequence token, that candidate is taken out of the beam and the beam size is reduced by 1.
  • Continue predicting the next step... until the beam size reaches 0; then end the search and take the highest-scoring sequence as the result.
    Note: because a path's score is the sum of log(prob) over all steps, and log(prob) is negative, the model tends to favor short sentences; there are several strategies for getting beam search to produce longer sentences, such as applying a length penalty (a beam-search sketch follows this list).
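A minimal beam-search sketch following the steps above; `step_logits(prefix)` is the same hypothetical model call as before, a path's score is the sum of its log-probabilities, and the final ranking divides by length (raised to a length-penalty exponent) to counteract the short-sentence bias mentioned in the note:

```python
import numpy as np

def log_softmax(x):
    x = x - np.max(x)
    return x - np.log(np.sum(np.exp(x)))

def beam_search(step_logits, bos_id, eos_id, beam_size=3, max_len=50, length_penalty=1.0):
    beams = [([bos_id], 0.0)]                      # (token list, sum of log-probs)
    finished = []
    for _ in range(max_len):
        if beam_size == 0 or not beams:
            break
        candidates = []
        for tokens, score in beams:
            logp = log_softmax(step_logits(tokens))
            # expand each beam with its beam_size best continuations
            for tok in np.argsort(logp)[-beam_size:]:
                candidates.append((tokens + [int(tok)], score + float(logp[tok])))
        # keep only the best beam_size candidates over all expansions
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == eos_id:               # finished path: take it out, shrink the beam
                finished.append((tokens, score))
                beam_size -= 1
            else:
                beams.append((tokens, score))
    finished.extend(beams)                         # paths that hit max_len without EOS
    # length-normalized score mitigates the bias toward short sentences
    return max(finished, key=lambda c: c[1] / (len(c[0]) ** length_penalty))[0]
```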

Disadvantages: because beam search still generates by maximizing probability, the resulting dialogue tends to be too generic and lacks diversity compared with human conversation. For example, a dialogue system decoded with beam search will keep producing uninformative replies such as "Me too" or "Okay".

3. Sampling

To increase the randomness and diversity of generation, responses can instead be produced by sampling from the model's output distribution, which addresses this problem of beam search.

  • However, pure sampling may produce ungrammatical output. This can be mitigated by sharpening the probabilities of the top words and sampling only from the most likely ones, so that randomness increases while obvious errors are largely avoided.

  • To sharpen the top-word probabilities, divide the logits output by the model by a temperature T less than 1 before applying softmax; this makes the post-softmax distribution sharper, so high-probability words become even more likely.

  • Restricting sampling to the top words, selected by probability, directly eliminates the possibility of low-probability words appearing (a temperature-scaling sketch follows this list).
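A minimal sketch of temperature scaling (the logit values are made up for illustration): dividing the logits by T < 1 before the softmax makes the distribution sharper, so the most likely words gain even more probability mass.

```python
import numpy as np

def softmax(x):
    x = x - np.max(x)    # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5, 0.1])   # made-up next-token logits

print(softmax(logits))          # baseline distribution (T = 1)
print(softmax(logits / 0.7))    # T = 0.7 < 1: sharper, top token gets more mass
print(softmax(logits / 1.5))    # T = 1.5 > 1: flatter, more randomness
```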

The two mainstream sampling methods are top-k and top-p sampling.

3.1 Top-k sampling

  • Select the k tokens with the highest probability, renormalize their probabilities with a softmax, sample from this renormalized distribution, then proceed to the next generation step, repeating until done.
  • Top-k has a potential problem: when the model is very confident about the current step, e.g. the most likely token has probability 0.9 while the remaining tokens all have very low probability, naive top-k sampling still keeps those low-probability tokens in the candidate set, so the low-probability samples we wanted to avoid can still occur (see the sketch after this list).
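A minimal top-k sampling sketch for a single generation step, assuming `logits` is the model's next-token logit vector (the function and parameter names are just for this sketch):

```python
import numpy as np

def softmax(x):
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

def sample_top_k(logits, k=40, temperature=1.0):
    """Keep the k most likely tokens, renormalize, then sample one of them."""
    probs = softmax(np.asarray(logits, dtype=float) / temperature)
    top_ids = np.argsort(probs)[-k:]                    # indices of the k largest probabilities
    top_probs = probs[top_ids] / probs[top_ids].sum()   # renormalize over the top k
    return int(np.random.choice(top_ids, p=top_probs))
```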

3.2 Top-p sampling

  • Top-p (nucleus sampling) improves on this shortcoming of top-k. First set a probability threshold, e.g. p = 0.9; then take tokens in descending order of probability, accumulating their probabilities, and stop as soon as the cumulative probability reaches or exceeds p (here 0.9).
  • For example, if the most likely token already has probability 0.9, then only that single token is kept (see the sketch after this list).
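A minimal top-p (nucleus) sampling sketch for a single generation step, under the same assumptions as the top-k sketch above:

```python
import numpy as np

def softmax(x):
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

def sample_top_p(logits, p=0.9, temperature=1.0):
    """Keep the smallest set of top tokens whose cumulative probability reaches p, then sample."""
    probs = softmax(np.asarray(logits, dtype=float) / temperature)
    order = np.argsort(probs)[::-1]               # tokens sorted by descending probability
    cum = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cum, p)) + 1     # first position where the cumulative sum >= p
    keep = order[:cutoff]
    keep_probs = probs[keep] / probs[keep].sum()  # renormalize over the nucleus
    return int(np.random.choice(keep, p=keep_probs))
```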

4. Summary

Generally speaking, sampling works better in dialogue systems. Typical settings are temperature = 0.9, with top-k and top-p used together, k = 40 and p = 0.9.
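A hedged sketch of those recommended settings, combining a top-k filter (k = 40) with a top-p cutoff (p = 0.9) at temperature 0.9; the function name and the toy logits are assumptions for illustration only:

```python
import numpy as np

def softmax(x):
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

def sample_top_k_top_p(logits, k=40, p=0.9, temperature=0.9):
    """Apply temperature, keep the top-k tokens, then the top-p nucleus within them, and sample."""
    probs = softmax(np.asarray(logits, dtype=float) / temperature)
    top_ids = np.argsort(probs)[::-1][:k]          # top-k filter, highest probability first
    cum = np.cumsum(probs[top_ids])
    cutoff = int(np.searchsorted(cum, p)) + 1      # top-p cutoff; keeps all k if their mass < p
    keep = top_ids[:cutoff]
    keep_probs = probs[keep] / probs[keep].sum()   # renormalize over the surviving tokens
    return int(np.random.choice(keep, p=keep_probs))

# toy usage with a made-up 100-token vocabulary
rng = np.random.default_rng(0)
print(sample_top_k_top_p(rng.normal(size=100)))
```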

Origin: blog.csdn.net/weixin_43646592/article/details/131796247