[Chinese-English] [Andrew Ng Course Quiz] Course 5 - Sequence Models - Week 1 Quiz - Recurrent Neural Networks


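Suppose your training examples are sentences (sequences of words). Which of the following refers to the $j^{th}$ word in the $i^{th}$ training example?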
- 【★】 $x^{(i)<j>}$
- 【】 $x^{<i>(j)}$
- 【】 $x^{(j)<i>}$
- 【】 $x^{<j>(i)}$
We index into the $i^{th}$ row first to get the $i^{th}$ training example (denoted by parentheses), then the $j^{th}$ column to get the $j^{th}$ word (denoted by angle brackets).
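
As a small illustration of this notation, the sketch below stores a toy dataset as a list of sentences and looks up $x^{(i)<j>}$. The data and the helper name `word` are made up for this example; the 1-based indices follow the quiz's notation rather than Python's 0-based convention.

```python
# Toy dataset: each training example is one sentence, stored as a list of words.
sentences = [
    ["the", "cat", "sat"],             # training example i = 1
    ["dogs", "bark", "loudly", "."],   # training example i = 2
]

def word(i, j):
    """Return x^{(i)<j>}: the j-th word of the i-th example (both 1-indexed)."""
    return sentences[i - 1][j - 1]

print(word(2, 3))  # loudly
```

The example index is applied first and the word position second, mirroring the order of the superscripts.
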
Consider this RNN:

This specific type of architecture is appropriate when:
- 【★】 $T_x = T_y$
- 【】 $T_x < T_y$
- 【】 $T_x > T_y$
- 【】 $T_x = 1$
This architecture is appropriate when every input should be matched to an output, so the input and output sequences have the same length.
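To which of these tasks would you apply a many-to-one RNN architecture? (Check all that apply.)
- 【】 Speech recognition (input an audio clip and output a transcript).
- 【★】 Sentiment classification (input a piece of text and output 0 or 1 to denote positive or negative sentiment).
- 【】 Image classification (input an image and output a label).
- 【★】 Gender recognition from speech (input an audio clip and output a label indicating the speaker's gender).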
Suppose you are training this RNN language model:

At time step $t$, what is the RNN doing?
- 【】 Estimating $P(y^{<1>}, y^{<2>}, …, y^{<t-1>})$
- 【】 Estimating $P(y^{<t>})$
- 【★】 Estimating $P(y^{<t>} \mid y^{<1>}, y^{<2>}, …, y^{<t-1>})$
- 【】 Estimating $P(y^{<t>} \mid y^{<1>}, y^{<2>}, …, y^{<t>})$

Yes, in a language model we try to predict the next step based on the knowledge of all prior steps.
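
The answer above is one factor in the chain rule of probability. As a worked equation in the same notation, the probability a language model assigns to a whole sentence of length $T_y$ (the output-length symbol used elsewhere in this quiz) factorizes as:

$$P(y^{<1>}, \dots, y^{<T_y>}) = \prod_{t=1}^{T_y} P(y^{<t>} \mid y^{<1>}, \dots, y^{<t-1>})$$

At step $t$ the RNN's softmax output supplies the $t$-th factor; for $t = 1$ the conditioning set is empty.
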
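You have finished training a language model RNN and are using it to sample random sentences, as shown below:

What are you doing at each time step $t$?
- 【】 (i) Use the probabilities output by the RNN to pick the highest-probability word for that time step as $\hat{y}^{<t>}$. (ii) Then pass the ground-truth word from the training set to the next time step.
- 【】 (i) Use the probabilities output by the RNN to randomly sample a word for that time step as $\hat{y}^{<t>}$. (ii) Then pass the ground-truth word from the training set to the next time step.
- 【】 (i) Use the probabilities output by the RNN to pick the highest-probability word for that time step as $\hat{y}^{<t>}$. (ii) Then pass this selected word to the next time step.
- 【★】 (i) Use the probabilities output by the RNN to randomly sample a word for that time step as $\hat{y}^{<t>}$. (ii) Then pass this selected word to the next time step.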
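
The two-part procedure in the correct option can be sketched as a short loop. This is only an illustration: `toy_next_word_probs` is a hypothetical stand-in for a trained RNN (its distribution depends only on the prefix length), and the vocabulary and `<EOS>` token are assumptions made for the example.

```python
import random

VOCAB = ["the", "cat", "sat", "<EOS>"]

def toy_next_word_probs(prefix):
    """Hypothetical stand-in for a trained RNN: returns P(next word | prefix).

    A real model would compute this from its hidden state; here the
    distribution depends only on the prefix length, purely for illustration.
    """
    if len(prefix) >= 3:
        return [0.05, 0.05, 0.05, 0.85]
    return [0.4, 0.3, 0.2, 0.1]

def sample_sentence(next_word_probs, rng, max_len=20):
    words = []
    while len(words) < max_len:
        probs = next_word_probs(words)
        # (i) Randomly sample a word from the output distribution (not argmax).
        word = rng.choices(VOCAB, weights=probs, k=1)[0]
        if word == "<EOS>":
            break
        # (ii) Feed the sampled word, not a ground-truth word, to the next step.
        words.append(word)
    return words

print(sample_sentence(toy_next_word_probs, random.Random(0)))
```

Replacing the `rng.choices` call with an argmax would give the greedy decoding described in the third option, which returns the same sentence on every run.
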
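You are training an RNN and find that your weights and activations are all taking on the value NaN ("Not a Number"). Which of these is the most likely cause of this problem?
- 【】 Vanishing gradient problem.
- 【★】 Exploding gradient problem.
- 【】 ReLU activation function g(.) used to compute g(z), where z is too large.
- 【】 Sigmoid activation function g(.) used to compute g(z), where z is too large.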
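
The quiz does not say how to fix exploding gradients. One common remedy is gradient clipping, sketched below under that assumption. The function name `clip_by_global_norm`, the threshold, and the flat list of gradient values are illustrative choices, not taken from the source.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient values so their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads)          # already small enough: leave unchanged
    scale = max_norm / norm
    return [g * scale for g in grads]

exploded = [3e8, -4e8]              # L2 norm is about 5e8
clipped = clip_by_global_norm(exploded, max_norm=5.0)
print(clipped)
```

Clipping keeps the direction of the gradient and only shrinks its length, so a single huge update cannot push the weights to overflow.
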
Suppose you are training an LSTM. You have a 10,000-word vocabulary and are using an LSTM with 100-dimensional activations. What is the dimension of $\Gamma_{u}$ at each time step?
- 【】 1
- 【★】 100
- 【】300
- 【】 10000
Correct, $\Gamma_u$ is a vector of dimension equal to the number of hidden units in the LSTM.
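
A quick shape check makes this concrete. The sketch assumes the usual form $\Gamma_u = \sigma(W_u[a^{<t-1>}, x^{<t>}] + b_u)$ with a one-hot input $x^{<t>}$. The random weights and the chosen word index are arbitrary and serve only to exercise the dimensions.

```python
import math
import random

def update_gate(W_u, b_u, a_prev, x_t):
    """Gamma_u = sigmoid(W_u [a_prev, x_t] + b_u), one value per hidden unit."""
    concat = a_prev + x_t  # list concatenation: [a^{<t-1>}, x^{<t>}]
    return [
        1.0 / (1.0 + math.exp(-(sum(w * v for w, v in zip(row, concat)) + b)))
        for row, b in zip(W_u, b_u)
    ]

n_a, vocab_size = 100, 10000        # sizes from the question
rng = random.Random(0)
# W_u has one row per hidden unit and one column per entry of [a_prev, x_t].
W_u = [[rng.uniform(-0.01, 0.01) for _ in range(n_a + vocab_size)]
       for _ in range(n_a)]
b_u = [0.0] * n_a
a_prev = [0.0] * n_a
x_t = [0.0] * vocab_size
x_t[42] = 1.0                       # one-hot word; the index is arbitrary

gamma_u = update_gate(W_u, b_u, a_prev, x_t)
print(len(gamma_u))  # 100
```

The vocabulary size only affects the number of columns of `W_u`; the gate itself has one entry per hidden unit.
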
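Here are the update equations for the GRU:

Alice proposes to simplify the GRU by removing $\Gamma_u$, i.e., always setting $\Gamma_u = 1$. Betty proposes to simplify the GRU by removing $\Gamma_r$, i.e., always setting $\Gamma_r = 1$. Which of these models is more likely to work without vanishing gradient problems, even when trained on very long input sequences?
- 【】 Alice's model (removing $\Gamma_u$), because if $\Gamma_r \approx 0$ for a time step, the gradient can propagate back through that time step without much decay.
- 【】 Alice's model (removing $\Gamma_u$), because if $\Gamma_r \approx 1$ for a time step, the gradient can propagate back through that time step without much decay.
- 【★】 Betty's model (removing $\Gamma_r$), because if $\Gamma_u \approx 0$ for a time step, the gradient can propagate back through that time step without much decay.
- 【】 Betty's model (removing $\Gamma_r$), because if $\Gamma_u \approx 1$ for a time step, the gradient can propagate back through that time step without much decay.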
For the signal to backpropagate without vanishing, we need $c^{<t>}$ to be highly dependent on $c^{<t-1>}$, which is the case when $\Gamma_u \approx 0$.
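
The figure with the GRU equations is missing from this page. Assuming the standard memory update $c^{<t>} = \Gamma_u * \tilde{c}^{<t>} + (1 - \Gamma_u) * c^{<t-1>}$ from the course, the scalar sketch below shows why $\Gamma_u \approx 0$ preserves the memory across many steps. Holding the gate and the candidate constant is a simplification made only for this illustration.

```python
def run_memory_cell(gamma_u, c0, candidate, steps):
    """Apply c_t = gamma_u * candidate + (1 - gamma_u) * c_prev repeatedly.

    Scalar toy version: the gate and the candidate are held constant.
    """
    c = c0
    for _ in range(steps):
        c = gamma_u * candidate + (1.0 - gamma_u) * c
    return c

# Gate almost closed: the initial memory survives 1000 steps nearly intact.
print(run_memory_cell(gamma_u=1e-6, c0=1.0, candidate=0.0, steps=1000))
# Gate fully open (Alice's simplification): the memory is overwritten at once.
print(run_memory_cell(gamma_u=1.0, c0=1.0, candidate=0.0, steps=1000))
```

In this toy setting the direct derivative of $c^{<t>}$ with respect to $c^{<t-1>}$ is $1 - \Gamma_u$, so a gate near 0 keeps that factor near 1 at every step, while $\Gamma_u = 1$ removes the direct path entirely.
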
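Here are the equations for the GRU and the LSTM:

From these, we can see that the update gate and forget gate in the LSTM play a role similar to $\underline{\quad \quad}$ and $\underline{\quad \quad}$ in the GRU. What should go in the blanks?
- 【★】 $\Gamma_u$ and 1− $\Gamma_u$
- 【】 $\Gamma_u$ and $\Gamma_r$
- 【】 1− $\Gamma_u$ and $\Gamma_u$
- 【】 $\Gamma_r$ and $\Gamma_u$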
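
The figure with the equations is missing from this page. Assuming the formulation taught in the course (an assumption, since the original image is unavailable), the memory-cell updates being compared are:

$$\text{GRU:}\quad c^{<t>} = \Gamma_u * \tilde{c}^{<t>} + (1 - \Gamma_u) * c^{<t-1>}$$

$$\text{LSTM:}\quad c^{<t>} = \Gamma_u * \tilde{c}^{<t>} + \Gamma_f * c^{<t-1>}$$

Matching terms, the LSTM's update gate $\Gamma_u$ multiplies the candidate $\tilde{c}^{<t>}$ just as $\Gamma_u$ does in the GRU, and the forget gate $\Gamma_f$ sits where the GRU has $1 - \Gamma_u$.
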
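You have a pet dog whose mood is heavily dependent on the current and past few days' weather. You have collected data on the weather for the past 365 days, which you represent as a sequence $x^{<1>}, …, x^{<365>}$. You have also collected data on your dog's mood, which you represent as $y^{<1>}, …, y^{<365>}$. You would like to build a model to map from $x$ to $y$. Should you use a unidirectional RNN or a bidirectional RNN for this problem?
- 【】 Bidirectional RNN, because this allows the prediction of mood on day $t$ to take into account more information.
- 【】 Bidirectional RNN, because this allows backpropagation to compute more accurate gradients.
- 【★】 Unidirectional RNN, because the value of $y^{<t>}$ depends only on $x^{<1>}, …, x^{<t>}$, and not on $x^{<t+1>}, …, x^{<365>}$.
- 【】 Unidirectional RNN, because the value of $y^{<t>}$ depends only on $x^{<t>}$, and not on other days' weather.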

Recurrent Neural Networks

Suppose your training examples are sentences (sequences of words). Which of the following refers to the $j^{th}$ word in the $i^{th}$ training example?
- [x] $x^{(i)<j>}$
- [ ] $x^{<i>(j)}$
- [ ] $x^{(j)<i>}$
- [ ] $x^{<j>(i)}$
We index into the $i^{th}$ row first to get the $i^{th}$ training example (denoted by parentheses), then the $j^{th}$ column to get the $j^{th}$ word (denoted by angle brackets).
Consider this RNN:

This specific type of architecture is appropriate when:
- [x] $T_x = T_y$
- [ ] $T_x < T_y$
- [ ] $T_x > T_y$
- [ ] $T_x = 1$
It is appropriate when every input should be matched to an output.
To which of these tasks would you apply a many-to-one RNN architecture? (Check all that apply).
- [ ] Speech recognition (input an audio clip and output a transcript)
- [x] Sentiment classification (input a piece of text and output a 0/1 to denote positive or negative sentiment)
- [ ] Image classification (input an image and output a label)
- [x] Gender recognition from speech (input an audio clip and output a label indicating the speaker’s gender)
You are training this RNN language model.

At the $t^{th}$ time step, what is the RNN doing? Choose the best answer.
- [ ] Estimating $P(y^{<1>}, y^{<2>}, …, y^{<t-1>})$
- [ ] Estimating $P(y^{<t>})$
- [x] Estimating $P(y^{<t>} \mid y^{<1>}, y^{<2>}, …, y^{<t-1>})$
- [ ] Estimating $P(y^{<t>} \mid y^{<1>}, y^{<2>}, …, y^{<t>})$
Yes, in a language model we try to predict the next step based on the knowledge of all prior steps.
You have finished training a language model RNN and are using it to sample random sentences, as follows:

What are you doing at each time step t?
- [ ] (i) Use the probabilities output by the RNN to pick the highest probability word for that time-step as $\hat{y}^{<t>}$ . (ii) Then pass the ground-truth word from the training set to the next time-step.
- [ ] (i) Use the probabilities output by the RNN to randomly sample a chosen word for that time-step as $\hat{y}^{<t>}$ . (ii) Then pass the ground-truth word from the training set to the next time-step.
- [ ] (i) Use the probabilities output by the RNN to pick the highest probability word for that time-step as $\hat{y}^{<t>}$ . (ii) Then pass this selected word to the next time-step.
- [x] (i) Use the probabilities output by the RNN to randomly sample a chosen word for that time-step as $\hat{y}^{<t>}$ . (ii) Then pass this selected word to the next time-step.
You are training an RNN, and find that your weights and activations are all taking on the value of NaN (“Not a Number”). Which of these is the most likely cause of this problem?
- [ ] Vanishing gradient problem.
- [x] Exploding gradient problem.
- [ ] ReLU activation function g(.) used to compute g(z), where z is too large.
- [ ] Sigmoid activation function g(.) used to compute g(z), where z is too large.
Suppose you are training an LSTM. You have a 10,000-word vocabulary and are using an LSTM with 100-dimensional activations $a$. What is the dimension of $\Gamma_u$ at each time step?
- [ ] 1
- [x] 100
- [ ] 300
- [ ] 10000
Correct, $\Gamma_u$ is a vector of dimension equal to the number of hidden units in the LSTM.
Here are the update equations for the GRU.

Alice proposes to simplify the GRU by removing $\Gamma_u$, i.e., always setting $\Gamma_u = 1$. Betty proposes to simplify the GRU by removing $\Gamma_r$, i.e., always setting $\Gamma_r = 1$. Which of these models is more likely to work without vanishing gradient problems even when trained on very long input sequences?
- [ ] Alice’s model (removing $\Gamma_u$ ), because if $\Gamma_r \approx 0$ for a timestep, the gradient can propagate back through that timestep without much decay.
- [ ] Alice’s model (removing $\Gamma_u$ ), because if $\Gamma_r \approx 1$ for a timestep, the gradient can propagate back through that timestep without much decay.
- [x] Betty’s model (removing $\Gamma_r$ ), because if $\Gamma_u \approx 0$ for a timestep, the gradient can propagate back through that timestep without much decay.
- [ ] Betty’s model (removing $\Gamma_r$ ), because if $\Gamma_u \approx 1$ for a timestep, the gradient can propagate back through that timestep without much decay.
Yes, for the signal to backpropagate without vanishing, we need $c^{<t>}$ to be highly dependent on $c^{<t-1>}$.
Here are the equations for the GRU and the LSTM:

From these, we can see that the Update Gate and Forget Gate in the LSTM play a role similar to $\underline{\quad \quad}$ and $\underline{\quad \quad}$ in the GRU. What should go in the blanks?
- [x] $\Gamma_u$ and 1− $\Gamma_u$
- [ ] $\Gamma_u$ and $\Gamma_r$
- [ ] 1− $\Gamma_u$ and $\Gamma_u$
- [ ] $\Gamma_r$ and $\Gamma_u$
You have a pet dog whose mood is heavily dependent on the current and past few days’ weather. You’ve collected data for the past 365 days on the weather, which you represent as a sequence $x^{<1>}, …, x^{<365>}$ . You’ve also collected data on your dog’s mood, which you represent as $y^{<1>}, …, y^{<365>}$ . You’d like to build a model to map from $x$ to $y$. Should you use a Unidirectional RNN or Bidirectional RNN for this problem?
- [ ] Bidirectional RNN, because this allows the prediction of mood on day t to take into account more information.
- [ ] Bidirectional RNN, because this allows backpropagation to compute more accurate gradients.
- [x] Unidirectional RNN, because the value of $y^{<t>}$ depends only on $x^{<1>}, …, x^{<t>}$ , but not on $x^{<t+1>}, …, x^{<365>}$
- [ ] Unidirectional RNN, because the value of $y^{<t>}$ depends only on $x^{<t>}$ , and not other days’ weather.
