Artificial Intelligence - Generative Models - Seq2Seq: Seq2Seq model optimization techniques

1. Teacher forcing in seq2seq [used in the decoder during the training phase]

In the previous seq2seq example, we introduced what teacher forcing is. There, the input and output were very similar, so teacher forcing was applied at every time step. Now that our input and output are different, how should we use it?

We can make the teacher-forcing decision once per batch, outside the loop over the decoder's time steps.

The code is as follows:

use_teacher_forcing = random.random() > 0.5
if use_teacher_forcing:  # use teacher forcing
    for t in range(config.max_len):
        decoder_output_t, decoder_hidden, decoder_attn_t = self.forward_step(decoder_input, decoder_hidden,
                                                                             encoder_outputs)
        decoder_outputs[:, t, :] = decoder_output_t
        # use the ground-truth output as the input of the next time step
        decoder_input = target[:, t].unsqueeze(1)  # [batch_size, 1]

else:  # no teacher forcing: use the predicted output as the input of the next time step
    for t in range(config.max_len):
        decoder_output_t, decoder_hidden, decoder_attn_t = self.forward_step(decoder_input, decoder_hidden,
                                                                             encoder_outputs)
        decoder_outputs[:, t, :] = decoder_output_t
        value, index = torch.topk(decoder_output_t, 1)  # index: [batch_size, 1]
        decoder_input = index

The complete code of the decoder:

import torch
import torch.nn as nn
import config
import random
import torch.nn.functional as F
from word_sequence import word_sequence

class Decoder(nn.Module):
    def __init__(self):
        super(Decoder,self).__init__()
        self.max_seq_len = config.max_len
        self.vocab_size = len(word_sequence)
        self.embedding_dim = config.embedding_dim
        self.dropout = config.dropout

        self.embedding = nn.Embedding(num_embeddings=self.vocab_size,embedding_dim=self.embedding_dim,padding_idx=word_sequence.PAD)
        self.gru = nn.GRU(input_size=self.embedding_dim,
                          hidden_size=config.hidden_size,
                          num_layers=1,
                          batch_first=True,
                          dropout=self.dropout)  # note: with num_layers=1, the GRU's dropout argument has no effect

        self.fc = nn.Linear(config.hidden_size, self.vocab_size)

    def forward(self, encoder_hidden, target, target_length):
        # encoder_hidden: [batch_size, hidden_size*2]
        # target: [batch_size, seq_len]

        decoder_input = torch.LongTensor([[word_sequence.SOS]] * config.batch_size).to(config.device)  # initialize the decoder input
        decoder_hidden = encoder_hidden  # initialize the decoder hidden state, shape [1, batch_size, hidden_size*2] (the *2 is because the encoder is bidirectional, so its output dimension is hidden_size*2)
        decoder_outputs = torch.zeros(config.batch_size, config.max_len, self.vocab_size).to(config.device)  # initialize the decoder outputs, shape [batch_size, seq_len, vocab_size]
        if random.random() > 0.5:  # no teacher forcing: feed back the model's own predictions
            for t in range(config.max_len):
                decoder_output_t, decoder_hidden = self.forward_step(decoder_input, decoder_hidden)
                decoder_outputs[:, t, :] = decoder_output_t
                value, index = torch.topk(decoder_output_t, 1)  # prediction of the current time step, index: [batch_size, 1]
                decoder_input = index  # use the current prediction as the input of the next time step
        else:  # teacher forcing: feed the ground truth
            for t in range(config.max_len):
                decoder_output_t, decoder_hidden = self.forward_step(decoder_input, decoder_hidden)
                decoder_outputs[:, t, :] = decoder_output_t
                decoder_input = target[:, t].unsqueeze(-1)  # use the ground-truth token as the input of the next time step
        return decoder_outputs, decoder_hidden

    def forward_step(self, decoder_input, decoder_hidden):
        """
        :param decoder_input: [batch_size, 1]
        :param decoder_hidden: [1, batch_size, hidden_size*2]
        :return: out: [batch_size, vocab_size], decoder_hidden: [1, batch_size, hidden_size*2]
        """
        embeded = self.embedding(decoder_input)  # embeded: [batch_size, 1, embedding_dim]
        out, decoder_hidden = self.gru(embeded, decoder_hidden)  # out: [batch_size, 1, hidden_size] (batch_first=True)
        out = F.log_softmax(self.fc(out), dim=-1)  # [batch_size, 1, vocab_size]
        out = out.squeeze(1)  # [batch_size, vocab_size]
        return out, decoder_hidden

2. Use gradient clipping

Earlier, we introduced vanishing gradients (梯度消失: gradients that become too small, so that after computation across many layers their values are too small to be useful) and exploding gradients (梯度爆炸: gradients that become too large, so that after computation across many layers their values are too large to handle).

In common deep neural networks, especially RNNs, we often use gradient clipping (梯度裁剪) to suppress oversized gradients, which effectively prevents gradient explosion.

The implementation of gradient clipping is very simple: set a threshold, and when the gradient norm exceeds the threshold, scale the gradient down to the threshold.


Implementation code:

loss.backward()
# clip gradients: max_norm must be a single scalar threshold, e.g. 5
nn.utils.clip_grad_norm_(model.parameters(), max_norm=5)
optimizer.step()
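
For context, here is a minimal sketch of where clipping sits in a full training step. The names model, optimizer, criterion, and dataloader are placeholders assumed for illustration, not objects from the original code:

import torch.nn as nn

for input, target in dataloader:
    optimizer.zero_grad()              # clear gradients left over from the previous step
    output = model(input)
    loss = criterion(output, target)
    loss.backward()                    # compute gradients
    # rescale all gradients so that their combined L2 norm is at most 5.0
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
    optimizer.step()                   # update parameters using the clipped gradients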

3. Use Attention mechanism [used in Decoder]
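
For reference, here is a minimal sketch of one common attention variant, dot-product attention (an illustrative assumption, not necessarily the exact variant used elsewhere in this series), showing how a decoder forward_step can attend over encoder_outputs:

import torch
import torch.nn.functional as F

def dot_attention(decoder_hidden, encoder_outputs):
    """
    decoder_hidden: [1, batch_size, hidden_size]
    encoder_outputs: [batch_size, seq_len, hidden_size]
    returns: context [batch_size, 1, hidden_size], attn_weights [batch_size, 1, seq_len]
    """
    query = decoder_hidden[-1].unsqueeze(1)                     # [batch_size, 1, hidden_size]
    scores = torch.bmm(query, encoder_outputs.transpose(1, 2))  # similarity of the query with every encoder step
    attn_weights = F.softmax(scores, dim=-1)                    # normalize into a distribution over encoder steps
    context = torch.bmm(attn_weights, encoder_outputs)          # weighted sum of the encoder outputs
    return context, attn_weights

The returned context vector is typically concatenated with the decoder's GRU output before the final linear layer, and attn_weights corresponds to the decoder_attn_t value returned by forward_step in the teacher-forcing snippet above.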

4. Beam search prediction [an alternative to greedy prediction]

4.1 Introduction to Beam Search

During model evaluation, we choose the token id with the highest probability as the output at each step. But is the probability of the entire output sentence then the highest?

Beam search, also called 束集搜索, is an algorithm used in seq2seq to improve the output (it is not used during training, only during evaluation or prediction).

For example: in the traditional decoding process, only the most probable result is selected at each time step as the output. When decoding finishes, we may find that the whole sentence is not fluent: although the output at each time step is indeed the most probable one, the overall probability is not necessarily the largest. This is commonly called greedy search (贪心算法).

To solve this problem, one could compute the probability product over all possible outputs and choose the path with the largest probability, which yields the globally optimal solution (全局最优解). But in that case, if the sentence is long and there are many candidate words, the amount of data to store and compute becomes enormous: with vocabulary size V and sentence length T, exhaustive search must consider V^T candidate sequences.

Beam search sits between these two methods. Assuming beam width=2, only the 2 most probable candidates are kept at each step; at the next time step, again only 2 are kept, and so on. This constrains the size of the search space and thereby improves the efficiency of the algorithm. Note that beam search is also not globally optimal (也不是全局最优). [The Viterbi algorithm is globally optimal.]

  • Beam width is a hyperparameter.
    • When beam width = 1, beam search reduces to the greedy algorithm.
    • When beam width = the number of candidate words (the vocabulary size), it computes all probabilities (exhaustive search).

For example, in the figure below:

A tree diagram represents the possible outputs at each time step, where the numbers represent conditional probabilities.

The yellow arrows indicate greedy search; its overall probability is not the largest.

If the beam width is set to 2, the green path can be found, and its overall probability is the largest.

[Figure: tree diagram comparing greedy search (yellow) with beam search, beam width=2 (green)]

The figure below gives an example with beam width=3:

  1. First input the start token <s> and obtain the output distribution (suppose the vocabulary consists of four outputs: X, Y, W, </s>); keep the three with the highest probability: X, Y, W.
  2. Feed X, Y, W into the next time step as inputs, giving three sets of outputs (12 candidate sequences in total: XX, XY, XW, X</s>, YX, YY, YW, Y</s>, WX, WY, WW, W</s>); among these 12, keep the three with the highest probability (as shown in the figure: XX, XY, WY) in the beam.
  3. Feed XX, XY, WY into the next time step, giving three sets of outputs (12 candidates in total: XXX, XXY, XXW, XX</s>, XYX, XYY, XYW, XY</s>, WYX, WYY, WYW, WY</s>); among these 12, keep the three with the highest probability (as shown in the figure: XXX, XYX, WYX) in the beam.
  4. Feed XXX, XYX, WYX into the next time step, giving 12 candidates (XXXX, XXXY, XXXW, XXX</s>, XYXX, XYXY, XYXW, XYX</s>, WYXX, WYXY, WYXW, WYX</s>); keep the three with the highest probability (as shown in the figure: XYXW, XYXX, WYX</s>).
  5. Feed XYXW and XYXX into the next time step (WYX</s> already ends with the end token, so it is carried over unchanged), and among the resulting candidates keep the three with the highest probability (as shown in the figure: XYXW</s>, XYXWY, XYXX</s>).
  6. Repeat the above steps until the most probable sequence ends with the end token or the maximum sentence length max_len is reached, then end the loop and choose the path with the largest probability product.
  7. Concatenate the tokens along the best path, which here might be <s>, X, Y, X, W, </s>.

[Figure: beam search example with beam width=3]

4.2 Beam Search explanation

For a model trained with maximum likelihood (MLE), beam search is only needed at prediction time: during training the correct answer is known, so there is no need to search.

When predicting, suppose the vocabulary size is 3 with contents a, b, c, and the beam size is 2. When the decoder decodes:

  1. When generating the first word, select the 2 words with the highest probability, say a and c; the current 2 sequences are then a and c.
  2. When generating the second word, combine the current sequences a and c with every word in the vocabulary to obtain 6 new sequences: aa, ab, ac, ca, cb, cc. Compute the score of each sequence and keep the 2 with the highest scores as the new current sequences, say aa and cb.
  3. Repeat this process until the end token is encountered or the maximum length is reached, then output the 2 sequences with the highest scores. A runnable toy sketch of this loop follows below.
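
To make the process concrete, here is a minimal, self-contained toy sketch of this loop. The vocabulary and the probability table are made up purely for illustration; scores are accumulated as sums of log-probabilities, which is equivalent to a probability product but numerically more stable:

import math

# Made-up next-token distributions, for illustration only.
def next_token_log_probs(seq):
    table = {
        "a": {"a": 0.1, "b": 0.5, "c": 0.2, "</s>": 0.2},
        "c": {"a": 0.3, "b": 0.4, "c": 0.1, "</s>": 0.2},
    }
    probs = table.get(seq[-1], {"a": 0.4, "b": 0.1, "c": 0.4, "</s>": 0.1})
    return {tok: math.log(p) for tok, p in probs.items()}

def beam_search(beam_width=2, max_len=5):
    beams = [(0.0, ["<s>"])]  # (cumulative log-probability, sequence)
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq[-1] == "</s>":  # finished sequences are carried over unchanged
                candidates.append((score, seq))
                continue
            for tok, lp in next_token_log_probs(seq).items():
                candidates.append((score + lp, seq + [tok]))
        # keep only the beam_width highest-scoring sequences
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        if all(seq[-1] == "</s>" for _, seq in beams):
            break
    return beams

for score, seq in beam_search():
    print(round(math.exp(score), 4), " ".join(seq))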

4.3 Implementation of beam search

In the ideas described above, we need to pay attention to the following:

  1. How to save the data: each step keeps at most beam-width outputs, together with the earlier results on their paths.
  2. How to compare probabilities after saving, keeping only the most probable beam-width (here, three) candidates.
  3. We must save not just the current highest-probability token, but the whole previous output path for each of the current top three.

4.3.1 Data structure: understanding the heap

For the above, a limited number of items must be kept, ordered by size. A data structure with priorities can achieve this: here we can use a heap.

A heap is a priority queue, although not actually a queue:

  • A queue (队列) is first-in-first-out; a stack is first-in-last-out, with push and pop operations.
  • A heap fetches data only according to its priority level.

Among Python's built-in modules there is one called heapq, which provides all the heap methods. The following code shows how to use heapq:

import heapq

my_heap = []  # use a list to hold the heap data

# push items onto the heap; the priority is the pushed content itself,
# compared element by element, so larger values mean higher priority
heapq.heappush(my_heap, [29, True, "xiaohong"])
heapq.heappush(my_heap, [28, False, "xiaowang"])
heapq.heappush(my_heap, [29, False, "xiaogang"])

for i in range(3):
    ret = heapq.heappop(my_heap)  # pop the item with the lowest priority
    print(ret)

# the output is as follows:
[28, False, 'xiaowang']
[29, False, 'xiaogang']
[29, True, 'xiaohong']

Notice that the output order is not the insertion order; items are popped from smallest to largest according to their priority (element-wise list comparison, where False < True). Python's heapq is a min-heap: the smallest item is always popped first, which is exactly what lets us keep the top-k largest items by pushing and then popping the minimum.

4.3.2 Use heap to implement beam search

To save the data, we can keep the beam search candidates in a heap; each time we add an item, we check the count and keep only beam-width items.

import heapq
import config

class Beam:
    def __init__(self):
        self.heap = list()  # where the candidates are stored
        self.beam_width = config.beam_width  # maximum number of candidates to keep

    def add(self, probility, complete, seq, decoder_input, decoder_hidden):
        """
        Add a candidate, then check the total count and drop the worst if there are too many.
        :param probility: accumulated score of the sequence
        :param complete: whether the last token is EOS
        :param seq: list of all tokens so far
        :param decoder_input: input for the next decoding step, obtained from the previous step
        :param decoder_hidden: hidden state for the next decoding step, obtained from the previous step
        """
        heapq.heappush(self.heap, [probility, complete, seq, decoder_input, decoder_hidden])
        # if there are too many items, pop the smallest, so at most beam_width items are kept
        if len(self.heap) > self.beam_width:
            heapq.heappop(self.heap)

    def __iter__(self):  # make the beam iterable
        return iter(self.heap)
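
A quick usage sketch (assuming config.beam_width = 3; the placeholder arguments stand in for seq, decoder_input, and decoder_hidden):

beam = Beam()
for score in [0.5, 0.1, 0.9, 0.7]:
    beam.add(score, False, ["<s>"], None, None)
print(sorted(item[0] for item in beam))  # [0.5, 0.7, 0.9]: only the 3 best scores survive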

Implementation: complete the beam search in the model's eval process.

Ideas:

  1. Construct the first input (the start symbol <SOS>) and save it in the heap.
  2. Take the data out of the heap and perform the forward_step operation to obtain the output and hidden state of the current time step.
  3. Select the topk (k = beam width) outputs as the inputs of the next time step.
  4. Save the inputs and other data needed for the next time step in a new heap.
  5. Take the item with the highest priority (highest probability) in the new heap and check whether it ends with EOS or has reached the maximum length; if so, stop iterating.
  6. If not, traverse the data in the new heap again.

The code is as follows:

# new methods in the decoder
def evaluatoin_beamsearch_heapq(self, encoder_outputs, encoder_hidden):
    """Use a heap to perform beam search; the heap is a priority queue that stores and fetches data in priority order"""

    batch_size = encoder_hidden.size(1)  # evaluation assumes batch_size = 1
    # 1. construct the input data needed for the first step and save it in the heap
    decoder_input = torch.LongTensor([[word_sequence.SOS] * batch_size]).to(config.device)
    decoder_hidden = encoder_hidden  # the hidden state to feed in

    prev_beam = Beam()
    # forward_step returns log-probabilities, so scores start at 0 and are summed
    prev_beam.add(0, False, [decoder_input], decoder_input, decoder_hidden)
    while True:
        cur_beam = Beam()
        # 2. take the data out of the heap and perform forward_step to get the output and hidden of the current time step
        # leading underscores mark the previous step's values
        for _probility, _complete, _seq, _decoder_input, _decoder_hidden in prev_beam:
            # if the previous _complete is True, no forward pass is needed;
            # a sequence can be complete without having the highest probability
            if _complete == True:
                cur_beam.add(_probility, _complete, _seq, _decoder_input, _decoder_hidden)
            else:
                decoder_output_t, decoder_hidden, _ = self.forward_step(_decoder_input, _decoder_hidden, encoder_outputs)
                value, index = torch.topk(decoder_output_t, config.beam_width)  # [batch_size=1, beam_width]
                # 3. select the topk (k = beam width) outputs as the inputs of the next step
                for m, n in zip(value[0], index[0]):
                    decoder_input = torch.LongTensor([[n]]).to(config.device)
                    seq = _seq + [n]
                    # accumulate log-probabilities by addition (equivalent to multiplying probabilities)
                    probility = _probility + m.item()
                    if n.item() == word_sequence.EOS:
                        complete = True
                    else:
                        complete = False

                    # 4. save the inputs and other data needed for the next time step in a new heap
                    cur_beam.add(probility, complete, seq, decoder_input, decoder_hidden)
        # 5. take the most probable item in the new heap; if it ends with EOS or has reached the maximum length, stop iterating
        best_prob, best_complete, best_seq, _, _ = max(cur_beam)
        if best_complete == True or len(best_seq) - 1 == config.max_len:  # minus 1 for SOS
            return self._prepar_seq(best_seq)
        else:
            # 6. otherwise, traverse the data in the new heap again
            prev_beam = cur_beam

def _prepar_seq(self, seq):  # basic post-processing so the result can later be converted to text
    if seq[0].item() == word_sequence.SOS:
        seq = seq[1:]
    if seq[-1].item() == word_sequence.EOS:
        seq = seq[:-1]
    seq = [i.item() for i in seq]
    return seq

4.3.3 Modify the seq2seq model

Use evaluatoin_beamsearch_heapq in the seq2seq model to view the effect; you will find that using beam search works better than using attention alone.
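
A minimal sketch of how this might be wired into the seq2seq model (the class and attribute names below are assumptions for illustration, not the original code):

import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        super(Seq2Seq, self).__init__()
        self.encoder = encoder
        self.decoder = decoder

    def evaluation(self, input, input_length):
        # encode once, then let the decoder run beam search over the encoder states
        encoder_outputs, encoder_hidden = self.encoder(input, input_length)
        return self.decoder.evaluatoin_beamsearch_heapq(encoder_outputs, encoder_hidden)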

Using the 小黄鸡 (Xiaohuangji) corpus (500,000 question-answer pairs), with single characters as tokens, the results after 5 epochs of training are shown below; the left side is the question, the right side is the answer:

你在干什么 >>>>> 你想干啥?
你妹 >>>>> 不是我
你叫什么名字 >>>>> 你猜
你个垃圾 >>>>> 你才是,你
你是傻逼 >>>>> 是你是傻
笨蛋啊 >>>>> 我不是,你

5. Other ways to optimize the model

  1. Parameter initialization
  2. Optimize the existing data and corpus
    • Data cleaning
      • Handling of punctuation, emoticons, and foreign-language text
      • Replace entities such as times, person names, and locations with corresponding placeholder symbols
    • Prepare corpora from different angles and with different levels of complexity
      • Angle: weather, eating, gender...
      • Complexity: simple, general, complex
  3. Engineering optimizations
    • Use templates to match common questions and return preset answers
    • Use a classification model to classify the question and return a preset answer
    • Use a retrieval model to return answers to similar questions from the existing corpus
  4. For specific problems, use a classification model first, then train a separate model dedicated to that topic
    • For example, for questions asking for the bot's name, fasttext can be used for intent recognition first, and once the 询问名字 (asking-for-name) class is hit, the name is returned directly.
    • Or manually construct many name-related questions for training, so that the answers can be more personalized.
  5. Modify and clean the existing corpus directly, replacing answers for common intents such as asking for names or asking about the weather, so that more standardized answers can be given.
  6. Use a retrieval model instead of this kind of generative model.
