Introduction to Deep Learning (66) Recurrent Neural Networks - Beam Search

foreword

The core content comes from blog link 1 and blog link 2; please support the original authors.
This article is a personal record, kept to guard against forgetting.

Recurrent Neural Networks - Beam Search

Courseware

Greedy search

In seq2seq, we use greedy search to predict the sequence:

  • Output the word with the highest prediction probability at the current moment

But greedy search is not necessarily optimal.

Exhaustive search

Optimal strategy: for all possible sequences, compute their probabilities, then choose the best one.
If the output dictionary size is $n$ and the longest sequence has length $T$, we need to examine $n^T$ sequences:

  • $n = 10000$, $T = 10$: $n^T = 10^{40}$
  • computationally infeasible

Beam search

Keep the best $k$ candidates.
At each time step, extend each candidate with a new token ($n$ possibilities), then select the best $k$ among the $kn$ options.

Summary

Beam search keeps the $k$ best candidates at each time step

  • When $k = 1$, it is greedy search
  • When $k = n$, it is exhaustive search

Textbook

In the previous section, we predicted the output sequence token by token until the special sequence-ending token "<eos>" appeared. This section first introduces the greedy search strategy and explores its problems, then compares it with the alternative strategies: exhaustive search and beam search.

Before formally introducing greedy search, we define the search problem using the same mathematical notation as in the previous section. At any time step $t'$, the probability of the decoder output $y_{t'}$ depends on the output subsequence $y_1, \ldots, y_{t'-1}$ before time step $t'$ and on the context variable $\mathbf{c}$ that encodes the information of the input sequence. To quantify the computational cost, let $\mathcal{Y}$ denote the output vocabulary, which contains "<eos>"; the cardinality $\left|\mathcal{Y}\right|$ of this set is then the vocabulary size. We also specify the maximum number of tokens in an output sequence as $T'$. Our goal is therefore to find an ideal output among all $\mathcal{O}(\left|\mathcal{Y}\right|^{T'})$ possible output sequences. Of course, for any output sequence, the part at and after "<eos>" is discarded in the actual output.

1 Greedy search

First, let's look at a simple strategy: greedy search, which was used for sequence prediction in the previous section. At each time step of the output sequence, greedy search selects the token with the highest conditional probability, namely:
$$y_{t'} = \operatorname*{argmax}_{y \in \mathcal{Y}} P(y \mid y_1, \ldots, y_{t'-1}, \mathbf{c})$$

Once the output sequence contains "<eos>" or reaches its maximum length $T'$, the output is complete.
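As a minimal sketch of this loop (not from the original text; `decode_step(prefix, context)` is a hypothetical function assumed to return the conditional probabilities $P(y \mid y_1, \ldots, y_{t'-1}, \mathbf{c})$ over the vocabulary):

```python
import numpy as np

def greedy_search(decode_step, context, eos_id, max_len):
    """Greedy decoding: pick the most probable token at every time step.

    `decode_step(prefix, context)` is assumed to return a 1-D array of
    conditional probabilities over the output vocabulary.
    """
    prefix = []
    for _ in range(max_len):
        probs = decode_step(prefix, context)
        y = int(np.argmax(probs))   # token with the highest conditional probability
        prefix.append(y)
        if y == eos_id:             # stop once "<eos>" is emitted
            break
    return prefix
```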
(Figure 1: conditional probabilities over "A", "B", "C", "<eos>" at each time step; greedy search picks the highest at every step)
As shown in Figure 1, suppose there are four tokens "A", "B", "C" and "<eos>" in the output vocabulary. The four numbers under each time step represent the conditional probabilities of generating "A", "B", "C" and "<eos>" at that time step. At each time step, greedy search selects the token with the highest conditional probability, so the predicted output sequence is "A", "B", "C", "<eos>". The conditional probability of this output sequence is $0.5 \times 0.4 \times 0.4 \times 0.6 = 0.048$.
So what is the problem with greedy search? In reality, the optimal sequence should maximize $\prod_{t'=1}^{T'} P(y_{t'} \mid y_1, \ldots, y_{t'-1}, \mathbf{c})$, the conditional probability of generating the output sequence given the input sequence. Greedy search cannot guarantee this optimum.
(Figure 2: at time step 2, the token "C" with the second highest conditional probability is selected instead)
Figure 2 illustrates the problem with another example. Unlike Figure 1, at time step 2 we select the token "C", which has the second highest conditional probability. Since the output subsequences that time steps 3 and 4 condition on have changed from "A", "B" in Figure 1 to "A", "C" in Figure 2, the conditional probability of each token at time step 3 also changes. Suppose we select the token "B" at time step 3; then time step 4 is conditioned on the output subsequence "A", "C", "B" of the first three time steps, which differs from "A", "B", "C" in Figure 1, so the conditional probability of generating each token at time step 4 in Figure 2 also differs from that in Figure 1. As a result, the conditional probability of the output sequence "A", "C", "B", "<eos>" in Figure 2 is $0.5 \times 0.3 \times 0.6 \times 0.6 = 0.054$, which is greater than that of the greedy search result in Figure 1. This example shows that the output sequence "A", "B", "C", "<eos>" obtained by greedy search is not necessarily the best sequence.
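The two sequence probabilities are just products of the per-step conditional probabilities; a quick check in Python:

```python
import math

greedy = math.prod([0.5, 0.4, 0.4, 0.6])  # "A" "B" "C" "<eos>" from Figure 1
other  = math.prod([0.5, 0.3, 0.6, 0.6])  # "A" "C" "B" "<eos>" from Figure 2
print(greedy, other)  # 0.048 0.054 -> the greedy sequence is not optimal
```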

2 Exhaustive search

If the goal is to obtain the optimal sequence, we can consider using exhaustive search: enumerate all possible output sequences together with their conditional probabilities, then output the one with the highest conditional probability.

Although exhaustive search does obtain the optimal sequence, its computational cost $\mathcal{O}(\left|\mathcal{Y}\right|^{T'})$ can be prohibitively high. For example, when $|\mathcal{Y}| = 10000$ and $T' = 10$, we would need to evaluate $10000^{10} = 10^{40}$ sequences, a number so large that existing computers can hardly handle it. By contrast, the computational cost of greedy search is $\mathcal{O}(\left|\mathcal{Y}\right| T')$, significantly smaller than that of exhaustive search. For example, when $|\mathcal{Y}| = 10000$ and $T' = 10$, we only need to evaluate $10000 \times 10 = 10^5$ sequences.
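These counts follow directly from the two cost formulas:

```python
n, T = 10_000, 10   # vocabulary size |Y| and maximum length T'
print(n ** T)       # exhaustive search: 10**40 candidate sequences
print(n * T)        # greedy search: 10**5 token evaluations
```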

3 Beam search

So which sequence search strategy should we choose? If accuracy matters most, then obviously exhaustive search; if computational cost matters most, then obviously greedy search. In practice, beam search lies between these two extremes.

Beam search is an improved version of greedy search. It has a hyperparameter called the beam size, $k$. At time step 1, we select the $k$ tokens with the highest conditional probabilities; each of them becomes the first token of one of the $k$ candidate output sequences. At each subsequent time step, based on the $k$ candidate output sequences from the previous time step, we pick the $k$ candidate output sequences with the highest conditional probabilities from the $k\left|\mathcal{Y}\right|$ possible choices.
(figure: the process of beam search with beam size 2 and maximum output sequence length 3)

The figure above demonstrates the process of beam search. Suppose the output vocabulary contains only five elements, $\mathcal{Y} = \{A, B, C, D, E\}$, one of which is "<eos>". Set the beam size to 2 and the maximum length of the output sequence to 3. At time step 1, suppose the tokens with the highest conditional probabilities $P(y_1 \mid \mathbf{c})$ are $A$ and $C$. At time step 2, for all $y_2 \in \mathcal{Y}$, we compute:
$$P(A, y_2 \mid \mathbf{c}) = P(A \mid \mathbf{c})P(y_2 \mid A, \mathbf{c}),$$
$$P(C, y_2 \mid \mathbf{c}) = P(C \mid \mathbf{c})P(y_2 \mid C, \mathbf{c}),$$

and choose the largest two of these ten values, say $P(A, B \mid \mathbf{c})$ and $P(C, E \mid \mathbf{c})$. Then at time step 3, for all $y_3 \in \mathcal{Y}$, we compute:

$$P(A, B, y_3 \mid \mathbf{c}) = P(A, B \mid \mathbf{c})P(y_3 \mid A, B, \mathbf{c}),$$
$$P(C, E, y_3 \mid \mathbf{c}) = P(C, E \mid \mathbf{c})P(y_3 \mid C, E, \mathbf{c}),$$

and choose the largest two of these ten values, say $P(A, B, D \mid \mathbf{c})$ and $P(C, E, D \mid \mathbf{c})$. We thus get six candidate output sequences: (1) $A$; (2) $C$; (3) $A, B$; (4) $C, E$; (5) $A, B, D$; (6) $C, E, D$.

Finally, based on these six sequences (for example, discarding the part including and after "<eos>"), we obtain the final set of candidate output sequences. We then choose the sequence with the highest of the following scores as the output sequence:

$$\frac{1}{L^\alpha} \log P(y_1, \ldots, y_{L}\mid \mathbf{c}) = \frac{1}{L^\alpha} \sum_{t'=1}^L \log P(y_{t'} \mid y_1, \ldots, y_{t'-1}, \mathbf{c}),$$

where $L$ is the length of the final candidate sequence and $\alpha$ is usually set to $0.75$. Because a longer sequence contributes more logarithmic terms to the summation above, the $L^\alpha$ in the denominator penalizes long sequences.
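Putting this together, here is a minimal beam search sketch, again assuming the hypothetical `decode_step(prefix, context)` interface from the greedy sketch above; it works in log-space for numerical stability and applies the $1/L^\alpha$ penalty when ranking candidates:

```python
import numpy as np

def beam_search(decode_step, context, eos_id, max_len, k=2, alpha=0.75):
    """Keep the k best partial sequences; rank finished ones by 1/L^alpha score.

    `decode_step(prefix, context)` is assumed to return a 1-D array of
    conditional probabilities over the output vocabulary.
    """
    beams = [([], 0.0)]                    # one empty prefix, log-probability 0
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, logp in beams:
            probs = decode_step(prefix, context)
            for y, p in enumerate(probs):  # extend each beam with every token
                candidates.append((prefix + [y], logp + np.log(p)))
        # keep the k candidates with the highest conditional probabilities
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, logp in candidates[:k]:
            if prefix[-1] == eos_id:       # finished sequences leave the beam
                finished.append((prefix, logp))
            else:
                beams.append((prefix, logp))
        if not beams:
            break
    finished.extend(beams)                 # sequences that never emitted "<eos>"
    # length-normalized score: (1 / L^alpha) * log P(y_1, ..., y_L | c)
    best = max(finished, key=lambda c: c[1] / (len(c[0]) ** alpha))
    return best[0]
```

As a simplification, this sketch stops extending a candidate once it emits "<eos>" rather than keeping every intermediate prefix as a candidate, as the six-sequence example above does; both variants rank the surviving candidates with the same length-normalized score.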

The computational cost of beam search is $\mathcal{O}(k\left|\mathcal{Y}\right|T')$, which lies between that of greedy search and that of exhaustive search. In fact, greedy search can be viewed as a special case of beam search with a beam size of 1. By flexibly choosing the beam size, beam search trades off accuracy against computational cost.

4 Summary

  • Sequence search strategies include greedy search, exhaustive search and beam search.

  • Greedy search requires the least computation to select a sequence, but its accuracy is relatively low.

  • Exhaustive search selects the most accurate sequence, but requires the most computation.

  • Beam search makes a trade-off between accuracy and computational cost by flexibly choosing the beam size.


Origin blog.csdn.net/qq_52358603/article/details/128485487