[Attention Evolution History] The emergence, architecture, extensions, and problems of RNN (Part 1)

Column introduction

I have recently been building a judicial model and am currently at the data-processing stage. When discussing it with my seniors, I said that attention should be used; after all, the judicial field has fairly high requirements for interpretability. Although I knew something about attention before, I always felt I never quite grasped its essence, and I have partly forgotten it, so I am starting this column from the beginning, partly as a set of notes for myself.

RNN

Background

RNN (Recurrent Neural Network) is a type of deep learning model that can handle inputs of different lengths. Its main use is in NLP tasks such as translation. Take Chinese-English translation: the English input may be of any length, as short as a person's name or as long as the long, difficult sentences found in reading comprehension. At the time, many people simply truncated such data, but this obviously sacrifices part of the input and is not an ideal solution, so RNN emerged to meet this need.

Model architecture

[Figure: RNN architecture diagram: hidden block h, input x, output o]
This picture makes the setup easy to explain. First, all the parameters of the RNN live in the h block (on the left of the arrow), x is the input, and o is the output. Concretely, the green block on the left of the arrow can consist of just two matrices, $W$ and $U$. Let us now walk through how the data passes through this model.
Suppose our input is x = "Where have you been recently?", a sentence of five words. First, we will of course use some method to vectorize these words; we will not go into the details here. Assume by default that the five words of this sentence have already been converted into five fixed-length vectors
$v_1, v_2, v_3, v_4, v_5$,
each of which is the vector for the English word at the corresponding position; for example, $v_1$ corresponds to "where" and $v_3$ corresponds to "you". So our input has changed from five easy-to-read English words into five fixed-length vectors that are easy for a computer to process. Below we describe how the RNN handles these five vectors.
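For concreteness, here is a minimal sketch of that vectorization step in NumPy. The toy vocabulary, the embedding dimension d, and the random embedding table are all illustrative assumptions, since the post does not specify how the word vectors are actually produced.

```python
import numpy as np

# Minimal sketch of turning words into fixed-length vectors.
# The vocabulary, dimension d, and random embeddings are assumptions
# for illustration, not the post's actual method.
rng = np.random.default_rng(0)
d = 8  # assumed embedding dimension

sentence = ["where", "have", "you", "been", "recently"]
vocab = {word: i for i, word in enumerate(sentence)}
embedding = rng.standard_normal((len(vocab), d))  # one row per word

# v_1 ... v_5: one fixed-length vector per word
v = [embedding[vocab[word]] for word in sentence]
print(len(v), v[0].shape)  # -> 5 (8,)
```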

Model calculation

First step

Randomly initialize a hidden-layer vector $h_1$. Compute the vector $s_1 = W h_1$ and the vector $e_1 = U v_1$, giving two vectors $s_1$ and $e_1$. For simplicity, take $o_1 = s_1 + e_1$ as the output and $s_1$ as $h_2$. That completes the first step. Input: a random vector and the first word vector; output: the first output vector and the second hidden-layer vector.
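As a sketch, this first step might look as follows in NumPy, assuming (purely for illustration) that $W$ and $U$ are square $d \times d$ matrices and that the word vector is a random stand-in:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                              # assumed vector dimension
W = rng.standard_normal((d, d))    # hidden-to-hidden matrix (assumed square)
U = rng.standard_normal((d, d))    # input-to-hidden matrix (assumed square)

h1 = rng.standard_normal(d)        # randomly initialized hidden-layer vector
v1 = rng.standard_normal(d)        # stand-in for the first word vector

s1 = W @ h1        # transform the previous hidden state
e1 = U @ v1        # transform the current word vector
o1 = s1 + e1       # first output vector
h2 = s1            # hidden state carried into the second step
```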

Second step

Using the $h_2, v_2$ obtained above, repeat the previous step: compute $s_2 = W h_2$ and $e_2 = U v_2$. Let the output be $o_2 = s_2 + e_2$ and the new hidden-layer vector be $h_3 = s_2$.

Third step

Using $h_3, v_3$, continue repeating: compute $s_3 = W h_3$ and $e_3 = U v_3$. Let $o_3 = s_3 + e_3$ and $h_4 = s_3$.

Fourth step

Input: $h_4, v_4$. Output: $o_4, h_5$.

Fifth step

Input: $h_5, v_5$. Output: $o_5, h_6$. Done.

So far we have obtained all the $h$ and $o$ vectors. Generally, the final hidden-layer vector $h$ is used to represent the meaning of the sentence.
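Putting the five steps together, here is a minimal sketch of the whole recurrence under the same simplifications as the text (no nonlinearity, and $h_{t+1} = s_t$); the dimensions and random inputs are illustrative assumptions:

```python
import numpy as np

def simple_rnn(word_vectors, W, U, h1):
    """Run the post's simplified recurrence over a list of word vectors.

    Each step computes s_t = W @ h_t and e_t = U @ v_t, emits the output
    o_t = s_t + e_t, and carries h_{t+1} = s_t forward. There is no
    nonlinearity here, matching the simplification in the text; a standard
    RNN cell would instead update h_{t+1} = tanh(W @ h_t + U @ v_t), so
    the word vector would also enter the hidden state.
    """
    h = h1
    outputs, hiddens = [], [h1]
    for v in word_vectors:
        s = W @ h
        e = U @ v
        outputs.append(s + e)  # o_t
        h = s                  # h_{t+1}
        hiddens.append(h)
    return outputs, hiddens

rng = np.random.default_rng(0)
d = 8
vs = [rng.standard_normal(d) for _ in range(5)]   # v_1 ... v_5
W = rng.standard_normal((d, d))
U = rng.standard_normal((d, d))
outputs, hiddens = simple_rnn(vs, W, U, rng.standard_normal(d))
sentence_vector = hiddens[-1]  # h_6, used to represent the whole sentence
```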

Model evaluation

A brief assessment. First and most obviously, we have solved the problem of inputs of different lengths: by applying the same operation to each word in turn, we achieve the original goal. Second, notice that each step involves two input vectors. One is the hidden-layer vector $h$, which aggregates the information of all the preceding words; to some extent, the $h$ at each step represents the combined semantics of the words seen so far. The other is the current new word's vector, which can be understood as the new word's adjustment to the overall sentence meaning.
Overall, RNN not only solved the variable-length data problem that plagued researchers at the time, but also provided an intuitive model design. However, this bare-bones version of RNN has its own problems. For example, if the sentence is very long, the semantics of the first few words may be attenuated by the repeated computation, which hurts the representation of the whole sentence. A small demonstration of this follows.
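Here is a quick illustration of that attenuation, under the assumption (mine, for demonstration) that repeatedly applying $W$ shrinks vectors, as happens whenever $W$'s largest singular value is below 1:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W = rng.standard_normal((d, d))
W = 0.9 * W / np.linalg.norm(W, 2)     # force spectral norm 0.9 so W contracts

contribution = rng.standard_normal(d)  # the first word's share of the hidden state
for t in range(50):                    # 50 more steps of the recurrence
    contribution = W @ contribution
print(np.linalg.norm(contribution))    # tiny: the first word's contribution fades
```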

Model extensions

To address the shortcomings of the bare-bones RNN, researchers subsequently proposed models such as LSTM and GRU. These are still RNNs in essence; the difference lies in the per-step computation, where their gating designs allow information from earlier in the sentence to be partially retained, thereby alleviating this "inattention" problem. Even so, when a translation task encounters an extremely long input, earlier information may still be ignored, and the earlier and later parts of the translation may even come out reversed, so better methods were still needed. A sketch of the gating idea follows.
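As one concrete instance of that gating idea, here is a standard textbook GRU step; this is not the post's model, and the random parameters are purely illustrative. The update gate $z$ decides how much of the old hidden state to keep:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(v, h, params):
    """One standard GRU step (textbook formulation, not this post's model).

    The gates decide how much of the old hidden state to keep, which is
    how GRU-style cells partially retain information from early words.
    """
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(Wz @ v + Uz @ h)              # update gate
    r = sigmoid(Wr @ v + Ur @ h)              # reset gate
    h_cand = np.tanh(Wh @ v + Uh @ (r * h))   # candidate hidden state
    return (1.0 - z) * h + z * h_cand         # blend old and new states

rng = np.random.default_rng(0)
d = 8
params = [rng.standard_normal((d, d)) for _ in range(6)]
h = np.zeros(d)
for v in (rng.standard_normal(d) for _ in range(5)):
    h = gru_step(v, h, params)  # h now summarizes the sequence
```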

Please follow the column for the rest of the series.

Origin: blog.csdn.net/Petersburg/article/details/126045340