Deep learning | 20 things you must know about GRU

1. What is GRU?
GRU is the abbreviation of Gated Recurrent Unit. It is a type of recurrent neural network that controls the flow of information through a gating mechanism, allowing it to model long-range dependencies in a sequence.

2. What is the difference between GRU and LSTM?
The main difference is that GRU has no separate cell state: it merges LSTM's forget gate and input gate into a single update gate, and it has no output gate. This makes the GRU structure simpler, while it can still model long sequences well.

3. What are the gate mechanisms of GRU?
GRU includes two gates: an update gate and a reset gate. The update gate determines how the candidate hidden state is combined with the previous hidden state. The reset gate determines how much of the previous hidden state is used when computing the candidate.

4. How does a GRU work?
At each time step, a GRU combines the current input, the previous hidden state, and the outputs of the two gates to compute the current hidden state. That hidden state is then passed to the next time step, where it in turn influences the two gates, giving a recurrent update process.

5. What is the expression of GRU?
The GRU update equations are:
reset gate: r = sigmoid(X W_r + H U_r)
update gate: z = sigmoid(X W_z + H U_z)
candidate hidden state: h~ = tanh(X W_h + (r * H) U_h)
current hidden state: H_new = (1 - z) * h~ + z * H
where X is the current input, H is the previous hidden state, the W and U matrices are learnable weights, and * denotes element-wise multiplication.
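
As a sanity check, the equations above can be written out directly. The following is a minimal NumPy sketch; the sizes (input_size=4, hidden_size=3), random weights, and omission of bias terms are illustrative assumptions, not a reference implementation.

python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, W_r, U_r, W_z, U_z, W_h, U_h):
    # One GRU step following the equations above (biases omitted)
    r = sigmoid(x @ W_r + h_prev @ U_r)              # reset gate
    z = sigmoid(x @ W_z + h_prev @ U_z)              # update gate
    h_cand = np.tanh(x @ W_h + (r * h_prev) @ U_h)   # candidate hidden state
    return (1 - z) * h_cand + z * h_prev             # current hidden state

rng = np.random.default_rng(0)
W_r, W_z, W_h = (rng.normal(size=(4, 3)) for _ in range(3))   # input weights, input_size=4
U_r, U_z, U_h = (rng.normal(size=(3, 3)) for _ in range(3))   # hidden weights, hidden_size=3
h = gru_step(rng.normal(size=4), np.zeros(3), W_r, U_r, W_z, U_z, W_h, U_h)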

6. Why can the GRU unit capture long dependencies?
Through its gates, a GRU can selectively forget parts of the previous hidden state: it retains long-term memory while discarding, in time, information that would mislead the current output. This is what allows GRU to model long sequences well.

7. What are the advantages of GRU over RNN?
Compared with an ordinary RNN, GRU captures long-range dependencies in long sequences better and alleviates the vanishing-gradient problem. Compared with LSTM, GRU has fewer parameters and simpler computations, so training is faster.
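
To make the parameter-count difference concrete, here is a small sketch using PyTorch's built-in layers; the layer sizes are arbitrary and only the relative counts matter.

python
import torch.nn as nn

def num_params(module):
    return sum(p.numel() for p in module.parameters())

input_size, hidden_size = 128, 256
print("RNN :", num_params(nn.RNN(input_size, hidden_size)))    # 1 weight block
print("GRU :", num_params(nn.GRU(input_size, hidden_size)))    # 3 blocks (reset, update, candidate)
print("LSTM:", num_params(nn.LSTM(input_size, hidden_size)))   # 4 blocks (input, forget, output, cell)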

8. How does PyTorch implement GRU?
PyTorch provides nn.GRUCell for a single time step (alongside the full-sequence nn.GRU module). With nn.GRUCell we can implement a GRU loop like this:

python
import torch
import torch.nn as nn

input_size, hidden_size = 10, 20
gru = nn.GRUCell(input_size, hidden_size)
h = torch.zeros(1, hidden_size)               # initial hidden state (batch size 1)
outputs = []
for x in torch.randn(5, 1, input_size):       # iterate over the time steps
    h = gru(x, h)                             # one GRU step
    outputs.append(h)
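
For whole sequences, nn.GRU avoids the manual loop. A minimal sketch with the same illustrative sizes:

python
import torch
import torch.nn as nn

gru = nn.GRU(input_size=10, hidden_size=20)   # full-sequence module
inputs = torch.randn(5, 1, 10)                # (seq_len, batch, input_size)
outputs, h_n = gru(inputs)                    # outputs of every step, plus the final hidden state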

9. How does TensorFlow implement GRU?
In TensorFlow 1.x, a GRU is built from a GRU cell plus dynamic_rnn (note that tf.contrib was removed in TensorFlow 2.x):

python
import tensorflow as tf  # TensorFlow 1.x API
gru_cell = tf.contrib.rnn.GRUCell(hidden_size)
outputs, states = tf.nn.dynamic_rnn(gru_cell, inputs, dtype=tf.float32)
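
In TensorFlow 2.x the same thing is usually done with the Keras layer instead. A minimal sketch; the sizes are illustrative:

python
import tensorflow as tf

gru_layer = tf.keras.layers.GRU(20, return_sequences=True, return_state=True)
inputs = tf.random.normal([1, 5, 10])            # (batch, seq_len, input_size)
outputs, final_state = gru_layer(inputs)         # per-step outputs and the last hidden state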

10. Does OpenAI GPT also use GRU?
OpenAI GPT uses the Transformer model; it does not use recurrent structures such as GRU or LSTM. The Transformer relies on a self-attention mechanism to model sequence dependencies, unlike GRU, which captures them through a recurrent hidden state.

11. What is Bidirectional GRU?
A bidirectional GRU consists of a forward GRU and a backward GRU. The forward GRU processes the input sequence from beginning to end, while the backward GRU processes it from end to beginning. The final output concatenates the outputs of the two GRUs, so the model can exploit information from both directions of the input sequence.
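
In PyTorch this is a single flag. A minimal sketch with illustrative sizes:

python
import torch
import torch.nn as nn

bigru = nn.GRU(input_size=10, hidden_size=20, bidirectional=True)
inputs = torch.randn(5, 1, 10)        # (seq_len, batch, input_size)
outputs, h_n = bigru(inputs)          # outputs: (5, 1, 40), forward and backward halves concatenated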

12. What is Stacked GRU?
A stacked GRU stacks multiple GRU layers, with the output of one layer used as the input to the next. This increases the expressive power of the model and lets it capture more complex sequence features.
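
With nn.GRU, stacking is controlled by num_layers. A minimal sketch with illustrative sizes:

python
import torch
import torch.nn as nn

stacked_gru = nn.GRU(input_size=10, hidden_size=20, num_layers=3)   # three GRU layers
inputs = torch.randn(5, 1, 10)
outputs, h_n = stacked_gru(inputs)    # h_n has shape (num_layers, batch, hidden_size) = (3, 1, 20)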

13. What are the implementation details of GRU?
The main implementation details of GRU are:
1) the sigmoid activation function for the gating mechanism
2) the tanh activation function for the candidate hidden state
3) matrix multiplications for the linear transformations of the input and the previous hidden state
4) element-wise multiplication for applying the gates
5) learnable parameters including the reset gate matrix U_r, the update gate matrix U_z, the candidate hidden state matrix U_h, and the corresponding input weight matrices (see the sketch below)
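
In PyTorch these learnable parameters are stored stacked together, one block per gate; inspecting a small nn.GRU shows the layout. The sizes below are illustrative:

python
import torch.nn as nn

gru = nn.GRU(input_size=10, hidden_size=20)
# the reset, update and candidate weights are stacked along the first dimension (3 * hidden_size)
print(gru.weight_ih_l0.shape)   # torch.Size([60, 10])  input-to-hidden weights
print(gru.weight_hh_l0.shape)   # torch.Size([60, 20])  hidden-to-hidden weights (U_r, U_z, U_h)
print(gru.bias_ih_l0.shape)     # torch.Size([60])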

14. What is the training algorithm of GRU?
Like general recurrent neural networks, GRU is usually trained with BPTT (Backpropagation Through Time) or truncated BPTT.
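
A common way to get truncated BPTT in practice is to split a long sequence into chunks and detach the hidden state between them, so gradients stop flowing further back. This is a minimal PyTorch sketch; the data, loss, and sizes are placeholders.

python
import torch
import torch.nn as nn

gru = nn.GRU(input_size=10, hidden_size=20)
optimizer = torch.optim.SGD(gru.parameters(), lr=0.01)
h = torch.zeros(1, 1, 20)                   # (num_layers, batch, hidden_size)
long_sequence = torch.randn(100, 1, 10)     # one long sequence
for chunk in long_sequence.split(25):       # 25-step truncation window
    h = h.detach()                          # stop gradients from flowing into earlier chunks
    outputs, h = gru(chunk, h)
    loss = outputs.pow(2).mean()            # placeholder loss for the sketch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()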

15. What are the precautions for using GRU?
The main precautions for using GRU are:
1) Choose an appropriate hidden state size, which generally depends on the complexity of the input sequence
2) Use an appropriate training order; you can start with short sequences and gradually move to longer ones
3) Take measures against overfitting, such as adding dropout or input noise
4) Re-tune hyperparameters across different data sets and tasks
5) Set parameters such as batch_first correctly (see the sketch below)
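
A minimal sketch of points 3 and 5 with nn.GRU; the sizes and dropout rate are illustrative:

python
import torch
import torch.nn as nn

# dropout is applied between stacked layers, so it needs num_layers > 1;
# batch_first=True makes the expected input shape (batch, seq_len, input_size)
gru = nn.GRU(input_size=10, hidden_size=20, num_layers=2,
             dropout=0.2, batch_first=True)
inputs = torch.randn(4, 5, 10)        # (batch, seq_len, input_size)
outputs, h_n = gru(inputs)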

16. What is the difference between PyTorch and TensorFlow's GRU implementation?
PyTorch offers both nn.GRUCell, where we iterate over time steps and update the hidden state manually, and nn.GRU, which processes a whole sequence in one call. TensorFlow 1.x pairs a GRUCell with tf.nn.dynamic_rnn, while TensorFlow 2.x uses tf.keras.layers.GRU, which handles the iteration automatically. The higher-level APIs are more abstract; the cell-level APIs are more flexible.

17. What application scenarios can GRU be used for?
GRU is often used in sequence modeling tasks, such as:
machine translation, text summarization, speech recognition, image captioning, protein structure prediction, etc.

18. What are the advantages and disadvantages of GRU?
The main advantages of GRU are: it can capture long-range dependencies in sequences, its structure is simple, and it is easy to implement and tune.
The main disadvantages of GRU are: the vanishing-gradient problem is not fully eliminated, so its ability to model very long sequences is limited; and compared with LSTM it has fewer gates, so its expressive capacity is slightly weaker.

19. What are the new RNN variants?
Newer RNN variants include QRNN, IndRNN, SRU, etc. They introduce structural and implementation changes that alleviate some of the problems of standard recurrent networks, such as slow sequential computation and unstable gradients, and can achieve better results.

20. How is the Attention mechanism applied in GRU?
In an encoder-decoder setup, the hidden states produced by a GRU encoder can be scored against a query (for example, the decoder's current hidden state) to produce attention weights. The weighted sum of the encoder hidden states then forms a context vector that is used alongside the GRU output. This lets the model "focus" on the relevant parts of the input sequence and produce better outputs.
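
A minimal sketch of dot-product attention over the hidden states of a GRU encoder. The sizes, and the choice of the encoder's last hidden state as the query, are illustrative assumptions rather than a particular published attention variant.

python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.GRU(input_size=10, hidden_size=20, batch_first=True)
inputs = torch.randn(1, 7, 10)                        # (batch, seq_len, input_size)
enc_outputs, h_n = encoder(inputs)                    # enc_outputs: (1, 7, 20)

query = h_n[-1]                                       # (1, 20), last hidden state as the query
scores = torch.bmm(enc_outputs, query.unsqueeze(2))   # (1, 7, 1) dot-product scores
weights = F.softmax(scores, dim=1)                    # attention weights over the 7 time steps
context = (weights * enc_outputs).sum(dim=1)          # (1, 20) context vector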

Origin blog.csdn.net/weixin_47964305/article/details/131258276