Seq2Seq and the Attention Mechanism in Detail

 

1. Sequence Generation

1.1 Introduction

In the article introducing recurrent neural networks (RNNs) we briefly touched on Seq2Seq; here we expand on it.

A sentence is composed of characters or words; a Chinese word may itself be made up of several characters.

If we want the machine to write sentences, we can train an RNN at the character or word level.

Taking the figure above as an example, at each time step the RNN takes as input the token (character or word) generated at the previous time step.

Suppose the character the machine produced at the previous time step is "I". The output vector $y$ is a distribution over characters: it might, for example, give a probability of 0.7 to one next character (continuing as "I am") and a probability of 0.3 to another.

1.2 Example: Writing Poetry

When generating the first character of a sentence there is nothing that comes before it, so we need to give the machine a special token: <BOS>.

BOS: Begin Of Sentence

The first output character $y^{1}$ can be represented by the following conditional probability.

We then output the most probable character, feed $y^{1}$ back in as the next input, and repeat this process until we output <EOS>.

EOS: End Of Sentence
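As a concrete illustration, the generation loop described above might look roughly like the following sketch. It assumes a trained character-level model exposing a `step(token, state)` method and a `vocab` list of characters; these names are placeholders of ours, not code from the article.

```python
# A minimal sketch of the generation loop: start from <BOS>, repeatedly take the most
# probable character and feed it back in, stop at <EOS>. `model` and `vocab` are assumed.
import numpy as np

def generate(model, vocab, bos_id, eos_id, max_len=50):
    token, state = bos_id, model.init_state()   # start from the <BOS> token
    output = []
    for _ in range(max_len):
        probs, state = model.step(token, state) # distribution over characters
        token = int(np.argmax(probs))           # output the most probable character
        if token == eos_id:                     # stop once <EOS> is produced
            break
        output.append(vocab[token])             # the character is also the next input
    return "".join(output)
```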

The dataset we use to train the RNN looks like the one above. As shown below, the input is each character of a poem, the target output is the next character, and the model is obtained by minimizing the cross-entropy loss.
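The training setup described above might be written roughly as in the PyTorch sketch below. The architecture, the placeholder `dataset`, and the hyper-parameters are illustrative assumptions, not the article's code.

```python
# Sketch: train a character-level RNN where the target at each position is the next character,
# minimizing cross-entropy (teacher forcing).
import torch
import torch.nn as nn

class CharRNN(nn.Module):
    def __init__(self, vocab_size, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, x):                        # x: (batch, seq_len) of token ids
        h, _ = self.rnn(self.embed(x))
        return self.out(h)                       # (batch, seq_len, vocab_size)

vocab_size = 5000                                # placeholder vocabulary size
model = CharRNN(vocab_size)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
dataset = [torch.randint(0, vocab_size, (32, 21)) for _ in range(10)]  # placeholder batches

for poem in dataset:
    inputs, targets = poem[:, :-1], poem[:, 1:]  # the next character is the training target
    logits = model(inputs)
    loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```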

 

1.3 Example: Drawing Images

An image is made up of pixels. We can treat the pixels of the image we want as "words" and let the RNN generate them one by one; the principle is the same.

However, the rightmost pixel of one row, $a_{i,j}$, and the leftmost pixel of the next row, $a_{i+1,j-2}$, are far apart in the image and may have little to do with each other; $a_{i+1,j-2}$ is probably more strongly related to the pixel directly above it, $a_{i,j-2}$.

For example, in the figure below, the gray pixel may have little to do with the yellow pixel but a stronger relationship with the blue pixel.

So when we generate the image pixel by pixel, the gray pixel should be generated from the blue pixel rather than from the yellow pixel.

2. Conditional Generation

But we do not want to generate sentences at random; we want the model to generate an appropriate sentence for the situation at hand. For example, given an image, output a description of that image; for a chatbot, given an input sentence, output a response.

 

2.1 Image Caption Generation

For example, suppose we want to train a model that generates captions for images.

We can pass the image through a CNN to obtain a vector, and then feed this vector into the RNN.

  • We can feed the vector only at the first time step and rely on the RNN to keep it in memory; no extra input is given at later time steps.
  • Alternatively, we can feed the vector at every time step, since the RNN may have forgotten it by the later time steps (both options are sketched below).
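Roughly, the two options might be implemented as follows. The CNN that produces `image_vec`, the dimensions, and the class name are assumptions of ours, not the article's code.

```python
# Sketch of a caption decoder: the image vector either initialises the RNN memory only
# (option 1) or is additionally concatenated to the input at every time step (option 2).
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    def __init__(self, vocab_size, feat_dim, hidden, feed_every_step=False):
        super().__init__()
        self.feed_every_step = feed_every_step
        self.embed = nn.Embedding(vocab_size, hidden)
        in_dim = hidden + feat_dim if feed_every_step else hidden
        self.cell = nn.GRUCell(in_dim, hidden)
        self.init = nn.Linear(feat_dim, hidden)     # image vector initialises the memory
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, image_vec, tokens):           # image_vec: (batch, feat_dim), tokens: (batch, seq_len)
        h = torch.tanh(self.init(image_vec))         # option 1: image enters only at the start
        logits = []
        for t in range(tokens.size(1)):
            x = self.embed(tokens[:, t])
            if self.feed_every_step:                 # option 2: image re-fed at every step
                x = torch.cat([x, image_vec], dim=-1)
            h = self.cell(x, h)
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)            # (batch, seq_len, vocab_size)
```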

 

2.2 Machine Translation / Chatbot

If we want to do machine translation or build a chatbot, the input is a sentence and the output is a translation or a response.

This model can be divided into two parts: the Encoder and the Decoder.

The Encoder reads the input sentence, and we take its output at the last time step.

We can take the output, or we can take $h_{t}$ or $c_{t}$.

This vector from the Encoder is then used as the Decoder's input at every time step. The Encoder and Decoder are trained together.

In this situation the input is a sequence and the output is also a sequence, so it is called a Sequence-to-Sequence model.
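A minimal sketch of such an Encoder-Decoder is shown below, with the encoder's last-time-step state handed to the decoder and fed as its input at every step, as described above. All names and sizes are illustrative assumptions.

```python
# Sketch: the encoder compresses the source sentence into its final hidden state,
# which the decoder receives as the same input vector at every time step.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, hidden=256):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, hidden)
        self.tgt_embed = nn.Embedding(tgt_vocab, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden * 2, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src, tgt):                      # src, tgt: (batch, seq_len) of token ids
        _, h = self.encoder(self.src_embed(src))      # h: (1, batch, hidden), last time step
        context = h.transpose(0, 1).expand(-1, tgt.size(1), -1)   # same vector at every step
        dec_in = torch.cat([self.tgt_embed(tgt), context], dim=-1)
        dec_out, _ = self.decoder(dec_in, h)          # encoder and decoder are trained jointly
        return self.out(dec_out)                      # (batch, tgt_len, tgt_vocab)
```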

 

3. Dynamic Conditional Generation

This model is also called the Attention-Based Model. The Encoder-Decoder architecture described above may not be able to compress a very long input into a single vector; such a vector cannot represent all the information in the sentence, so the model's performance suffers. Moreover, the Decoder receives the same input vector at every time step. With Dynamic Conditional Generation, we want the information the Decoder receives at each time step to be different.

Continuing the example above, we train a translation model. Here there is an additional vector $z^{0}$; $z^{0}$ is also a parameter vector that needs to be learned (it is referred to as the key).

We put the hidden-layer output at every time step into a database and use $z^{0}$ to search it: $z^{0}$ is matched against each hidden-layer output $h^{i}$, giving a matching score $\alpha^{i}_{0}$.

This is Attention.

The matching function match can be designed however you like, for example:

  • match can be the cosine similarity of $z$ and $h$
  • match can be a small network whose input is $z$ and $h$ and whose output is a scalar
  • match can also be designed as, for example, $\alpha = h^{T}Wz$

The network parameters of the second method and the parameter $W$ of the third method are learned by the machine itself.
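The three match functions listed above could be written roughly as follows; `z` is the query vector and `H` stacks the encoder hidden states $h^{i}$ row by row. This is an illustrative sketch, not a fixed API.

```python
# Three possible match functions producing one score per encoder state h^i.
import torch
import torch.nn as nn
import torch.nn.functional as F

def match_cosine(z, H):                     # cosine similarity between z and every h^i
    return F.cosine_similarity(H, z.unsqueeze(0), dim=-1)

class MatchNet(nn.Module):                  # a small network taking z and h, outputting a scalar
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim * 2, dim), nn.Tanh(), nn.Linear(dim, 1))

    def forward(self, z, H):
        zs = z.unsqueeze(0).expand(H.size(0), -1)
        return self.net(torch.cat([H, zs], dim=-1)).squeeze(-1)

class MatchBilinear(nn.Module):             # alpha = h^T W z, with W learned like any other weight
    def __init__(self, dim):
        super().__init__()
        self.W = nn.Parameter(torch.randn(dim, dim) * 0.01)

    def forward(self, z, H):
        return H @ (self.W @ z)
```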

After match produces the scores $\alpha^{i}_{0}$, they can be passed through a softmax (this is optional) to obtain $\hat{\alpha}^{i}_{0}$. Multiplying each $h^{i}$ by the corresponding $\hat{\alpha}^{i}_{0}$ and summing gives $c^{0} = \sum_{i} \hat{\alpha}^{i}_{0} h^{i}$.

Then $c^{0}$ is used as the Decoder's input. From $z^{0}$ and $c^{0}$ the Decoder obtains $z^{1}$ and outputs a word.

After obtaining $z^{1}$, we compute attention again. This operation is repeated until an <EOS> is produced.

Attention lets us stop caring about the entire sentence at once and instead focus on particular parts of it.
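Putting the pieces together, one decoding step with attention might look like the sketch below; `match`, `decoder_cell`, and `out_layer` are assumed to be defined and learned elsewhere.

```python
# Sketch of one attention step: scores from match, an (optional) softmax, the weighted
# sum c, then the decoder update from z and c.
import torch
import torch.nn.functional as F

def attention_step(z, H, match, decoder_cell, out_layer):
    scores = match(z, H)                         # alpha^i_0 for every encoder state h^i
    weights = F.softmax(scores, dim=-1)          # hat{alpha}^i_0 (the softmax is optional)
    c = weights @ H                              # c^0 = sum_i hat{alpha}^i_0 * h^i
    z_next = decoder_cell(c.unsqueeze(0), z.unsqueeze(0)).squeeze(0)  # z^1 from z^0 and c^0
    logits = out_layer(z_next)                   # distribution used to output a word
    return z_next, logits                        # repeat with z^1 until <EOS>
```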

 

4. Tips

4.1 Attention That Is Too Concentrated

In $\alpha^{i}_{t}$, the subscript $t$ denotes the time step and the superscript $i$ denotes which component the attention weight belongs to.

For example, consider watching a video and outputting a text description of its content.

If our attention looks like the bar chart in the figure above, it is concentrated on the second frame: the attention on this frame is high at both the second and the fourth time steps.

Then our output may be a strange sentence like the following:

A woman and woman is doing a woman

We want the attention to be spread out: the model should not look only at one particular frame, but should attend to every frame of the video roughly evenly.

We can add a regularization term on the attention weights, just as L1 and L2 regularization are used elsewhere in machine learning.

In the formula above, $\sum_{t} \alpha^{i}_{t}$ accumulates all the attention placed on the same frame $i$ over time, and we want it to be as close as possible to a constant $\tau$.

In this way the attention weights will be distributed evenly across the different frames rather than concentrated on one particular frame.
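Assuming the regularization term takes the common form $\sum_{i}\big(\tau - \sum_{t}\alpha^{i}_{t}\big)^{2}$ (the original formula appears in a figure not reproduced here), it could be computed as in the sketch below; the tensor layout is our assumption.

```python
# Sketch: penalise frames whose accumulated attention over time deviates from tau.
import torch

def attention_regularizer(attn: torch.Tensor, tau: float) -> torch.Tensor:
    # attn: (time_steps, frames) tensor of attention weights alpha^i_t
    per_frame_total = attn.sum(dim=0)            # sum_t alpha^i_t for each frame i
    return ((tau - per_frame_total) ** 2).sum()  # want each frame's total close to tau

# total_loss = cross_entropy + lambda_reg * attention_regularizer(attn, tau)
```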

4.2 Inconsistency Between Training and Testing

4.2.1 Introduction

As shown above, during training the desired output is A, B, B, and the model is obtained by minimizing the cross-entropy loss.

At test time, if the output at the first time step is B, it becomes part of the input at the second time step. But at test time there is no ground truth, so the model does not know that its input at the second time step is wrong.

That is, at test time part of the machine's input is generated by the machine itself and may be wrong. This situation is called exposure bias.

Exposure bias means that during training there are situations the model never gets to explore.

Let's compare the training and testing situations.

During training, our model has only ever seen the three branches shown; it has never encountered the other situations during training.

At test time there is no such restriction. If the model makes a mistake at the first step, it ends up on a branch it never saw during training, and one wrong step makes every following step wrong.

 

4.2.2 Scheduled Sampling

We can address this phenomenon with Scheduled Sampling.

The idea behind Scheduled Sampling is this: if the machine only ever sees what it produced itself, it is hard to train; if it only ever sees the correct answer, training and testing are inconsistent. So we take a compromise.

It is like flipping a coin: if it comes up one way, the machine sees its own generated output; if it comes up the other way, it sees the correct answer.

The coin's probability is set dynamically: at the beginning of training the correct answer appears with high probability, and as training progresses the machine's own output appears with increasingly high probability. This is because at test time the machine uses its own output with probability 100% and the correct answer with probability 0%.
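A rough sketch of scheduled sampling inside a decoding loop is given below; `model.step` and the decaying `teacher_prob` schedule are placeholders of ours.

```python
# Sketch: at each step a biased coin flip decides whether the next input is the
# ground-truth token or the model's own previous prediction.
import random
import torch

def decode_with_scheduled_sampling(model, state, targets, teacher_prob):
    # targets: list of token ids starting with <BOS>; teacher_prob decays from ~1.0 to 0.0
    token = targets[0]
    logits_all = []
    for t in range(1, len(targets)):
        logits, state = model.step(token, state)
        logits_all.append(logits)
        predicted = int(torch.argmax(logits))
        # ground truth early in training, the model's own output later on
        token = targets[t] if random.random() < teacher_prob else predicted
    return logits_all
```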

4.2.3 Beam Search

We can also use Beam Search. When the machine produces an output, it is a probability distribution, and we pick an option with high probability.

But might there be a better option hiding behind choosing B at the first step, with an overall probability greater than 0.6 × 0.6 × 0.6? As shown below.

Perhaps by sacrificing a little at the first time step we can choose a better overall result. We have no way to enumerate all the possibilities exhaustively; Beam Search exists to deal with this problem.

Every time we do sequence generation, we keep the N candidates with the highest scores. The number we keep is called the beam size. The figure below shows beam size = 2.
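A minimal beam-search sketch with the beam size described above is shown below; `step_probs` is a hypothetical function returning the next-token distribution for a partial sequence.

```python
# Sketch of beam search: expand each kept sequence, then retain only the
# beam_size highest-scoring candidates at every step.
import numpy as np

def beam_search(step_probs, bos_id, eos_id, beam_size=2, max_len=20):
    beams = [([bos_id], 0.0)]                           # (sequence, log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos_id:                       # finished sequences are kept as-is
                candidates.append((seq, score))
                continue
            probs = step_probs(seq)
            for tok in np.argsort(probs)[-beam_size:]:  # expand only the top choices
                candidates.append((seq + [int(tok)], score + float(np.log(probs[tok]))))
        # keep only the beam_size highest-scoring sequences
        beams = sorted(candidates, key=lambda x: x[1], reverse=True)[:beam_size]
    return beams[0][0]
```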

Beam Search is used only at test time; it has nothing to do with training.

Another example of Beam Search:

 
