Diverse Text Generation with the transformers generate() Method: Parameter Meanings and Algorithm Principles

1. Introduction

I have recently been working on text generation using the generate() method of the Hugging Face transformers library. It is implemented in the GenerationMixin class (class transformers.generation_utils.GenerationMixin) and brings together all the parameters relevant to autoregressive text generation with pre-trained models. This article therefore explains the meaning of these parameters and the principles of the commonly used algorithms: Greedy Search, Beam Search, and Sampling (Temperature, Top-k, Top-p).

This class provides the generate() method, which can do the following through parameter adjustment (a minimal usage sketch follows the list):

  • greedy decoding : when num_beams=1 and do_sample=False, the greedy_search() method is called; each step generates the word with the highest conditional probability, so a single text is produced.
  • multinomial sampling : when num_beams=1 and do_sample=True, the sample() method is called; the next word is sampled from the vocabulary instead of always taking the word with the highest conditional probability, which increases diversity.
  • beam-search decoding : when num_beams>1 and do_sample=False, the beam_search() method is called; it keeps num_beams beams and greedily keeps the top-N beams at each step.
  • beam-search multinomial sampling : when num_beams>1 and do_sample=True, the beam_sample() method is called, which is equivalent to adding sampling instead of greedily keeping the top-N beams at each step.
  • diverse beam-search decoding : when num_beams>1 and num_beam_groups>1, the group_beam_search() method is called.
  • constrained beam-search decoding : when constraints!=None or force_words_ids!=None, constrained beam search implements controlled text generation.
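As a quick orientation, here is a minimal sketch of how these modes map onto generate() arguments. The model (GPT-2) and the English prompt are placeholders chosen for illustration rather than taken from the original post; the sketches in later sections reuse this setup.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model and prompt, reused by the later sketches.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
input_ids = tokenizer("I enjoy walking with my cute dog", return_tensors="pt").input_ids

# greedy decoding: num_beams=1, do_sample=False (the defaults)
model.generate(input_ids, max_length=50)

# multinomial sampling: num_beams=1, do_sample=True
model.generate(input_ids, do_sample=True, max_length=50)

# beam-search decoding: num_beams>1, do_sample=False
model.generate(input_ids, num_beams=5, max_length=50)

# beam-search multinomial sampling: num_beams>1, do_sample=True
model.generate(input_ids, num_beams=5, do_sample=True, max_length=50)

# diverse beam-search decoding: num_beams>1, num_beam_groups>1
model.generate(input_ids, num_beams=4, num_beam_groups=2, max_length=50)

# constrained beam-search decoding: see the force_words_ids example further below
```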

2. The meaning of each input parameter

Next, let's look at each input parameter (see the source code).

I think the parameters that matter most for the quality of the generated text are: max_length, min_length, do_sample, top_k, top_p, and repetition_penalty. The meaning of each parameter is recorded below.

inputs (torch.Tensor of varying shape depending on the modality, optional) — The sequence used as a prompt for the generation or as model inputs to the encoder. If None the method initializes it with bos_token_id and a batch size of 1. For decoder-only models inputs should be in the format of input_ids. For encoder-decoder models inputs can represent any of input_ids, input_values, input_features, or pixel_values.

inputs: The input prompt. If it is empty, it is initialized from bos_token_id with a batch size of 1. For decoder-only models (the GPT family), the input should be input_ids; for encoder-decoder models (BART, T5, etc.), the input can take more forms.

max_length (int, optional, defaults to model.config.max_length) — The maximum length of the sequence to be generated.

max_length: The maximum length of the generated sequence.

min_length (int, optional, defaults to 10) — The minimum length of the sequence to be generated.

min_length: The shortest length of the generated sequence, the default is 10.

do_sample (bool, optional, defaults to False) — Whether or not to use sampling ; use greedy decoding otherwise.

do_sample: Whether to enable sampling; the default is False, i.e. greedily pick the word with the maximum conditional probability.

early_stopping (bool, optional, defaults to False) — Whether to stop the beam search when at least num_beams sentences are finished per batch or not.

early_stopping: Whether to stop the beam search once at least num_beams finished sentences have been generated per batch; the default is False.

num_beams (int, optional, defaults to 1) — Number of beams for beam search. 1 means no beam search.

num_beams: The default is 1, that is, no beam search is performed.

temperature (float, optional, defaults to 1.0) — The value used to modulate the next token probabilities.

temperature: The default is 1.0. The lower the temperature (below 1), the sharper the softmax output, i.e. the larger the gap between high- and low-probability words; the higher the temperature, the flatter the softmax output.

top_k (int, optional, defaults to 50) — The number of highest probability vocabulary tokens to keep for top-k-filtering.

top_k: How many of the highest-probability words to keep as candidates; the default is 50. See below for details.

top_p (float, optional, defaults to 1.0) — If set to float < 1, only the most probable tokens with probabilities that add up to top_p or higher are kept for generation.

top_p: The probabilities of all candidate words sum to 1 (hence the default of 1.0). If top_p is set below 1, probabilities are accumulated from high to low until the sum reaches top_p, and those first N words are kept as candidates.

typical_p (float, optional, defaults to 1.0) — The amount of probability mass from the original distribution to be considered in typical decoding. If set to 1.0 it takes no effect. See this paper for more details.

typical_p: Typical sampling. With the default value of 1.0 this parameter has no effect. The main idea: instead of always picking words from the high-probability region of the distribution, sample from the set of words whose information content is close to the expected value (i.e. close to the conditional entropy of the model).
Paper: Typical Decoding for Natural Language Generation

repetition_penalty (float, optional, defaults to 1.0) — The parameter for repetition penalty. 1.0 means no penalty. See this paper for more details.

repetition_penalty: Repetition penalty; the default of 1.0 means no penalty.
Paper: CTRL: A CONDITIONAL TRANSFORMER LANGUAGE MODEL FOR CONTROLLABLE GENERATION

pad_token_id (int, optional) — The id of the padding token.
bos_token_id (int, optional) — The id of the beginning-of-sequence token.
eos_token_id (int, optional) — The id of the end-of-sequence token.

pad_token_id / bos_token_id / eos_token_id: The ids of the padding token <PAD>, the beginning-of-sequence token <s>, and the end-of-sequence token </s>.

length_penalty (float, optional, defaults to 1.0) — Exponential penalty to the length that is used with beam-based generation. It is applied as an exponent to the sequence length, which in turn is used to divide the score of the sequence. Since the score is the log likelihood of the sequence (i.e. negative), length_penalty > 0.0 promotes longer sequences, while length_penalty < 0.0 encourages shorter sequences.

length_penalty: Exponential length penalty used with beam search; the default is 1.0. The beam score is the sequence log-likelihood (a negative number) divided by the sequence length raised to length_penalty, so:

  • length_penalty > 0.0: promotes longer sequences (the larger the value, the stronger the effect)
  • length_penalty = 0.0: no length penalty
  • length_penalty < 0.0: encourages shorter sequences

no_repeat_ngram_size (int, optional, defaults to 0) — If set to int > 0, all ngrams of that size can only occur once.

no_repeat_ngram_size: Used to control repetition; the default is 0. If set greater than 0, any n-gram of that size can only occur once in the generated text.

encoder_no_repeat_ngram_size (int, optional, defaults to 0) — If set to int > 0, all ngrams of that size that occur in the encoder_input_ids cannot occur in the decoder_input_ids.

encoder_no_repeat_ngram_size: Also used to control repetition; the default is 0. If set greater than 0, n-grams of that size that appear in encoder_input_ids cannot appear in decoder_input_ids.

bad_words_ids(List[List[int]], optional) — List of token ids that are not allowed to be generated. In order to get the token ids of the words that should not appear in the generated text, use tokenizer(bad_words, add_prefix_space=True, add_special_tokens=False).input_ids.

bad_words_ids: A list of token ids that are forbidden from being generated; the ids can be obtained with tokenizer(bad_words, add_prefix_space=True, add_special_tokens=False).input_ids, as in the sketch below.
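A minimal sketch following the recipe from the docstring above, reusing the earlier setup; the word list is made up, and with some fast tokenizers the add_prefix_space argument may only be accepted by the slow tokenizer class.

```python
# Hypothetical list of words that must not be generated.
bad_words = ["stupid", "ugly"]
bad_words_ids = tokenizer(bad_words, add_prefix_space=True, add_special_tokens=False).input_ids

output = model.generate(
    input_ids,
    max_length=50,
    bad_words_ids=bad_words_ids,  # none of these token sequences may appear in the output
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```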

force_words_ids(List[List[int]] or List[List[List[int]]], optional) — List of token ids that must be generated. If given a List[List[int]], this is treated as a simple list of words that must be included, the opposite to bad_words_ids. If given List[List[List[int]]], this triggers a disjunctive constraint, where one can allow different forms of each word.

force_words_ids: The opposite of bad_words_ids above: a list of token ids that must appear in the generated text. If the ids are given in the format List[List[List[int]]], e.g. [[[1, 2], [3, 4]]], a disjunctive constraint (Disjunctive Positive Constraint Decoding) is triggered, which means that any one of several forms of a word may be generated, e.g. "lonely", "loneliness", etc. (sketch below).
Paper: Guided Generation of Cause and Effect
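A sketch under the same assumptions; the forced phrase is made up, and constrained decoding requires beam search (num_beams > 1):

```python
# Force the phrase "Jay Chou" (a made-up constraint) to appear in the output.
force_words = ["Jay Chou"]
force_words_ids = tokenizer(force_words, add_special_tokens=False).input_ids

output = model.generate(
    input_ids,
    num_beams=5,                      # constrained decoding is built on beam search
    force_words_ids=force_words_ids,
    max_length=50,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```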

num_return_sequences(int, optional, defaults to 1) — The number of independently computed returned sequences for each element in the batch.

num_return_sequences: How many output sequences each input produces, the default is 1.

max_time: After how many seconds to stop generating.

attention_mask: By default it has the same shape as input_ids; 0 means masked and 1 means not masked, and masked tokens do not take part in the attention-weight computation.

decoder_start_token_id: For encoder-decoder models, an int can be specified when decoding should start from a token different from the encoder's start token (e.g. [CLS], <s>).

num_beam_groups (int, optional, defaults to 1): To ensure diversity among beams during beam search, the beams can be divided into groups. See the paper Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models for details.

diversity_penalty (float, optional, defaults to 0.0): If a beam generates the same word as a beam from another group at the same step, this value is subtracted from its score as a penalty. It only takes effect when num_beam_groups > 1.

prefix_allowed_tokens_fn (Callable[[int, torch.Tensor], List[int]], optional): If this function is provided, each step of the beam search is restricted to the allowed tokens; otherwise no constraint is imposed. The function takes two inputs, batch_id and the current input_ids, and returns a list of tokens allowed for the next step. It can be used for conditionally constrained generation; a toy sketch follows. See the paper Autoregressive Entity Retrieval for details.
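A toy sketch of the callback's signature, reusing the earlier setup; the restriction (only token ids below 1000) is made up purely for illustration.

```python
# Only allow token ids below 1000 at every generation step (an arbitrary toy constraint).
def prefix_allowed_tokens_fn(batch_id, input_ids):
    # input_ids holds the sequence generated so far for this batch element
    return list(range(1000))

output = model.generate(
    input_ids,
    num_beams=3,
    max_length=30,
    prefix_allowed_tokens_fn=prefix_allowed_tokens_fn,
)
```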

output_attentions (bool, optional, defaults to False): Whether to return the attention matrices of all attention layers; the default is False.

output_hidden_states (bool, optional, defaults to False): Whether to return the hidden_states of each layer, the default is False.

output_scores (bool, optional, defaults to False): Whether to return prediction scores.

forced_bos_token_id (int, optional): The token id that the decoder is forced to generate right after the token corresponding to decoder_start_token_id. It is used by multilingual models such as mBART, where this token generally indicates the target language.

forced_eos_token_id (int, optional): The token id that is forced as the last generated token when max_length is reached.

remove_invalid_values (bool, optional): Whether to remove nan (not a number) and inf (infinity) values from the model output to prevent crashes; note that it may slow down generation.

exponential_decay_length_penalty (tuple(int, float), optional): After a certain number of tokens have been generated, apply an exponentially increasing length penalty. The tuple has the format (start_index, decay_factor): the former is the index from which the penalty is applied, and the latter is the exponential decay factor.

3. Function output meaning

generate() returns a ModelOutput object (class transformers.utils.ModelOutput) if return_dict_in_generate=True or config.return_dict_in_generate=True; otherwise it returns a torch.FloatTensor.
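A minimal sketch, reusing the earlier setup, of asking for a ModelOutput instead of a plain tensor (the exact output class depends on the decoding mode):

```python
# request a structured output with per-step scores
outputs = model.generate(
    input_ids,
    max_length=30,
    return_dict_in_generate=True,
    output_scores=True,
)
print(outputs.sequences.shape)  # the generated token ids
print(len(outputs.scores))      # one score tensor per generated step
```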

4. Brief description of the principle of each decoding algorithm

This section mainly introduces several of the most commonly used decoding methods for autoregressive text generation, including Greedy search , Beam search , Top-K sampling and Top-p sampling . Autoregressive generation is based on the following formula, which assumes that the probability distribution of a word sequence is equal to the product of the conditional probabilities of each word.
$$P(w_{1:T} \mid W_0) = \prod_{t=1}^{T} P(w_t \mid w_{1:t-1}, W_0), \quad \text{with } w_{1:0} = \emptyset$$

4.1 Greedy Search

Greedy search chooses the word with the highest probability at each time step $t$:

$$w_t = \operatorname{argmax}_{w} P(w \mid w_{1:t-1})$$
[Figure: a search tree over next-word probabilities]
For example, in the figure, the final generated sequence is ("The", "nice", "woman"). A common disadvantage of both this greedy algorithm and beam search is that they easily generate repeated words. Let's try it (see the sketch below):
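A greedy-decoding sketch reusing the GPT-2 model, tokenizer, and input_ids from the setup in the introduction (the original post used a different model and prompt, so its output is not reproduced here):

```python
# greedy decoding: do_sample=False and num_beams=1 are the defaults
greedy_output = model.generate(input_ids, max_length=50)
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))
```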
In addition, greedy search tends to miss high-probability words hidden behind a low-probability word. For example, in the figure above, "The dog has" has probability 0.4 × 0.9 = 0.36, which is higher than the 0.5 × 0.4 = 0.20 of "The nice woman", but because "dog" has a lower probability than "nice" in the first step, greedy search misses the better solution. Beam search can alleviate this problem.

4.2 Beam Search

Beam search keeps the most likely top-num_beams hypotheses at each time step, which reduces the risk of greedy search passing over a better sequence.
[Figure: beam search with num_beams=2 over the same search tree]
As shown in the example with num_beams=2, the first step keeps the two most likely sequences "The nice" (0.5) and "The dog" (0.4), and the second step keeps "The dog has" (0.4 × 0.9 = 0.36) and "The nice woman" (0.5 × 0.4 = 0.20).

Note that although beam search can find a solution with a higher probability than greedy search, it is not guaranteed to be the global optimal solution.

Let's give it a try: set num_beams > 1 and early_stopping=True so that generation stops early once the specified number of beams have produced the end-of-sequence token (see the sketch below).
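A beam-search sketch, again reusing the earlier setup:

```python
# beam search with 5 beams and early stopping
beam_output = model.generate(
    input_ids,
    max_length=50,
    num_beams=5,
    early_stopping=True,
)
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))
```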
The result is better than before, but there are still repetitions. You can add no_repeat_ngram_size=2 to prohibit the model from generating any repeated 2-gram. It needs to be used with caution, though: once the word "like" has been generated it can never appear again, so a phrase such as "like Jay Chou" disappears from the rest of the text.
In addition, the num_return_sequences parameter can be used to return the top-N highest-probability sequences, as in the sketch below.
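A sketch combining the n-gram constraint with multiple returned beams:

```python
# return the 5 highest-scoring beams; num_return_sequences must not exceed num_beams
beam_outputs = model.generate(
    input_ids,
    max_length=50,
    num_beams=5,
    no_repeat_ngram_size=2,
    num_return_sequences=5,
    early_stopping=True,
)
for i, output in enumerate(beam_outputs):
    print(f"{i}: {tokenizer.decode(output, skip_special_tokens=True)}")
```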
It can be seen that the top-5 generated sequences do not differ from each other very much.

There are three common observations about beam search:

  • If the length of the output is roughly predictable in advance, as in summarization and translation, beam search works well; but for open-ended generation, such as dialogue and story generation, the output length varies greatly and beam search is less suitable.
  • Beam search tends to generate repetitive text. Since it takes a lot of experimentation to balance "prohibiting repeated n-grams" against "allowing n-grams that should naturally recur", it is hard to control repetition with this kind of penalty on open-ended generation tasks.
  • When humans speak, they do not always choose the highest-probability word as the next one; their word choices are often surprising, as the comparison figure shows. So beam search still has significant problems.
    [Figure: per-word probability of human-written text vs. beam search output]

4.3 Sampling

The sampling algorithm no longer sticks to the high-probability words, but randomly picks the next word according to the conditional probability distribution. As shown in the figure, a low-probability word such as "car" may also be sampled into the generated text.
[Figure: sampling the next word from the conditional probability distribution]
In the generate() function, set do_sample=True and top_k=0 to temporarily disable top-k sampling and see the raw effect (sketch below).
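A pure-sampling sketch with the same setup:

```python
# pure sampling: draw from the full distribution (top_k=0 disables top-k filtering)
sample_output = model.generate(
    input_ids,
    do_sample=True,
    max_length=50,
    top_k=0,
)
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))
```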
It can be seen that the model produces a bit of gibberish... This is where the temperature parameter comes in handy.

4.3.0 Temperature

The temperature parameter cools down the softmax (for T < 1) so that the gaps between word probabilities widen: compared with the pure random sampling above, high-probability words become more likely and low-probability words less likely. The temperature-scaled softmax is

$$P(w_i) = \frac{\exp(z_i / T)}{\sum_{j} \exp(z_j / T)},$$

where $z_i$ is the logit of word $w_i$ and $T$ is the temperature.

[Figures comparing generation with a lowered temperature against the default temperature of 1.0]

From the comparison, it can be seen that:

  • The smaller T is (closer to 0), the more the probability mass concentrates on the high-probability words, the closer sampling gets to greedy search, and the easier it is to generate repetitive text.
  • The closer T is to 1, the closer the distribution is to the original softmax, and the greater the randomness.
  • The larger T is (even above 1), the more uniform the probability distribution becomes and the more random the sampling.
Try it out, for example with temperature = 0.7 and then temperature = 0.1, and compare the outputs (see the sketch below).
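A sketch reusing the earlier setup; the two temperature values mirror the ones tried in the original post:

```python
# lower temperature sharpens the distribution; compare 0.7 with the much greedier 0.1
for t in (0.7, 0.1):
    sample_output = model.generate(
        input_ids,
        do_sample=True,
        max_length=50,
        top_k=0,          # keep top-k filtering off to isolate the temperature effect
        temperature=t,
    )
    print(f"temperature={t}:", tokenizer.decode(sample_output[0], skip_special_tokens=True))
```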

4.3.1 Top-k sampling

The paper Hierarchical Neural Story Generation proposed Top-K sampling. The principle is to first take the K most likely words and then renormalize the probability distribution over just those K words, as shown in the figure (blue marks the top-K set at each step). GPT-2 adopted this sampling method.
[Figure: the top-K candidate set (blue) over the next-word distribution at steps t=1 and t=2]
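A top-k sampling sketch under the same assumptions as before:

```python
# top-k sampling: restrict sampling to the 50 most likely next words
sample_output = model.generate(
    input_ids,
    do_sample=True,
    max_length=50,
    top_k=50,
)
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))
```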
The result is much better than before, but there are still problems:

  • The K in top-k sampling is fixed and cannot adapt dynamically. In the example above, the distribution at step t=1 (left) is relatively flat, while the distribution at step t=2 (right) is sharply peaked.
  • Given "The", the words kept at t=1 are all reasonable, but at t=2 words such as "down" and "a" are clearly unsuitable yet still end up in the candidate set.

Therefore, limiting the candidate set to a fixed K may let the model generate gibberish under a sharply peaked distribution like the one on the right, while also limiting creativity under a flat distribution like the one on the left. Hence top-p sampling came into being.

4.3.2 Top-p sampling

Top-p (nucleus) sampling is an algorithm proposed by Ari Holtzman et al. (2019). It samples from the smallest candidate set whose cumulative probability exceeds p, and then renormalizes the probability distribution over those words. In this way, unlike top-K, the size of the candidate set grows and shrinks dynamically with the probability distribution of the next word.
[Figure: top-p filtering with p = 0.92 at steps t=1 and t=2]
For example, set p = 0.92: given "The", the combined probability of the first 9 words at t=1 is 0.94, which just exceeds 0.92, so those 9 words become the candidates; at t=2, the first 3 words already reach a combined probability of 0.97.

In other words, when the next word is hard to predict there are more candidates; when the model can tell at a glance what the next word should be, there are fewer.

top_p is a value between 0 and 1; in practice, values close to 1 tend to work well (sketch below).
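A top-p sampling sketch under the same assumptions:

```python
# top-p (nucleus) sampling: keep the smallest word set whose probabilities sum to 0.92
sample_output = model.generate(
    input_ids,
    do_sample=True,
    max_length=50,
    top_p=0.92,
    top_k=0,
)
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))
```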
When p is set relatively high, top-p may keep too many candidate words, so it can be combined with top-k to filter out the words that top-p would keep with only a tiny probability, as in the sketch below, which sets both top_p and top_k.
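A combined sketch, again reusing the earlier setup:

```python
# combine top-k and top-p, and return several independent samples
sample_outputs = model.generate(
    input_ids,
    do_sample=True,
    max_length=50,
    top_k=50,
    top_p=0.95,
    num_return_sequences=3,
)
for i, output in enumerate(sample_outputs):
    print(f"{i}: {tokenizer.decode(output, skip_special_tokens=True)}")
```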

Official documentation: https://huggingface.co/docs/transformers/v4.20.1/en/main_classes/text_generation#transformers.generation_utils.GenerationMixin.generate

References:
https://zhuanlan.zhihu.com/p/115076102
https://zhuanlan.zhihu.com/p/453286395
https://huggingface.co/blog/how-to-generate
