Lecture 4 Quiz



Question 1

The cross-entropy cost function with an n-way softmax unit (a softmax unit with n different outputs) is equivalent to:

Clarification:

Let's say that a network with n linear output units has some weights w. w is a matrix with n columns, and w_i indexes a particular column in this matrix and represents the weights from the inputs to the i-th output unit.

Suppose the target for a particular example is j (so that it belongs to class j, in other words).

The squared error cost function for n linear units is given by:

\frac{1}{2}\sum_{i=1}^n (t_i - w_i^T x)^2

where t is a vector of zeros except for a 1 at index j.

The cross-entropy cost function for an n-way softmax unit is given by:

-\log \left( \frac{\exp \left( w_j^T x \right)}{\sum_{i=1}^n \exp \left( w_i^T x \right)} \right) = -w_j^T x + \log \left( \sum_{i=1}^n \exp \left( w_i^T x \right) \right)

Finally, n logistic units would compute an output of \sigma(w_i^T x) = \frac{1}{1+\exp(-w_i^T x)} independently for each class i. Combined with the squared error, the cost would be:

\frac{1}{2}\sum_{i=1}^n (t_i - \sigma(w_i^T x))^2

where again, t is a vector of zeros with a 1 at index j (assuming the true class of the example is j).

Using this same definition for t, the cross-entropy error for n logistic units would be the sum of the individual cross-entropy errors:

-\sum_{i=1}^n \left[ t_i \log(\sigma(w_i^T x)) + (1-t_i)\log(1-\sigma(w_i^T x)) \right]

For any set of weights w, the network with a softmax output unit over n classes will have some cost due to the cross-entropy error (cost function). The question is now asking whether we can define a new network with a set of weights w^* using some (possibly different) cost function such that:

a) w^* = f(w) for some function f

b) For every input, the cost we get using w in the softmax output network with cross-entropy error is the same as the cost we would get using w^* in the new network with the possibly different cost function.
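
To make these definitions concrete, here is a minimal NumPy sketch of the four cost functions described above, evaluated on one example. The weights, input, and target class are arbitrary made-up numbers, and the function names are mine rather than anything from the quiz.

```python
import numpy as np

def softmax_cross_entropy(W, x, j):
    # -w_j^T x + log(sum_i exp(w_i^T x)); W has one column of weights per class
    z = W.T @ x
    return -z[j] + np.log(np.sum(np.exp(z)))

def linear_squared_error(W, x, j):
    # 0.5 * sum_i (t_i - w_i^T x)^2, with t one-hot at index j
    t = np.zeros(W.shape[1]); t[j] = 1.0
    return 0.5 * np.sum((t - W.T @ x) ** 2)

def logistic_squared_error(W, x, j):
    # 0.5 * sum_i (t_i - sigma(w_i^T x))^2
    t = np.zeros(W.shape[1]); t[j] = 1.0
    s = 1.0 / (1.0 + np.exp(-(W.T @ x)))
    return 0.5 * np.sum((t - s) ** 2)

def logistic_cross_entropy(W, x, j):
    # -sum_i [ t_i log(sigma_i) + (1 - t_i) log(1 - sigma_i) ]
    t = np.zeros(W.shape[1]); t[j] = 1.0
    s = 1.0 / (1.0 + np.exp(-(W.T @ x)))
    return -np.sum(t * np.log(s) + (1 - t) * np.log(1 - s))

# Arbitrary example: 3 inputs, n = 4 classes, true class index j = 2.
rng = np.random.default_rng(0)
W, x, j = rng.normal(size=(3, 4)), rng.normal(size=3), 2
for f in (softmax_cross_entropy, linear_squared_error,
          logistic_squared_error, logistic_cross_entropy):
    print(f.__name__, f(W, x, j))
```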

Question 2

A logistic unit with the cross-entropy cost function is equivalent to:

Clarification:

In a network with a logistic output, we will have a single vector of weights w. For a particular example with target t (which is 0 or 1), the cross-entropy error is given by:

-t \log \left( \sigma(w^T x) \right) - (1-t) \log \left( 1 - \sigma(w^T x) \right), where \sigma(w^T x) = \frac{1}{1+\exp(-w^T x)}.

The squared error if we use a single linear unit would be:

\frac{1}{2} (t - w^T x)^2

Now notice that another way we might define t is by using a vector with 2 elements: [1,0] to indicate the first class, and [0,1] to indicate the second class. Using this definition, we can develop a new type of classification network using a softmax unit over these two classes instead. In this case, we would use a weight matrix w with two columns, where w_i is the column for the i-th class and connects the inputs to the i-th output unit.

Suppose an example belonged to class j (where j is 1 or 2 to indicate [1,0] or [0,1]). Then the cross-entropy cost for this network would be:

-\log \left( \frac{\exp(w_j^T x)}{\exp(w_1^T x) + \exp(w_2^T x)} \right) = -w_j^T x + \log \left( \exp(w_1^T x) + \exp(w_2^T x) \right)

For any set of weights w, the network with a logistic output unit will have some error due to the cross-entropy cost function. The question is now asking whether we can define a new network with a set of weights w^* using some (possibly different) cost function such that:

a) w^* = f(w) for some function f

b) For every input, the cost we get using w in the network with a logistic output and cross-entropy error is the same cost that we would get using w^* in the new network with the possibly different cost function.
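
As in Question 1, a small sketch can make the two costs concrete. The code below implements the logistic cross-entropy with a single weight vector and the two-way softmax cross-entropy with a two-column weight matrix, as defined above; the particular weights are made up, and any relationship you want to test between w and the two softmax columns is left as an experiment.

```python
import numpy as np

def logistic_cross_entropy(w, x, t):
    # -t log(sigma(w^T x)) - (1 - t) log(1 - sigma(w^T x)), with t in {0, 1}
    p = 1.0 / (1.0 + np.exp(-(w @ x)))
    return -t * np.log(p) - (1 - t) * np.log(1 - p)

def two_way_softmax_cross_entropy(W, x, j):
    # -w_j^T x + log(exp(w_1^T x) + exp(w_2^T x)); W has two columns, j in {0, 1}
    z = W.T @ x
    return -z[j] + np.log(np.exp(z[0]) + np.exp(z[1]))

# Arbitrary example with made-up weights.
rng = np.random.default_rng(1)
x = rng.normal(size=3)
w = rng.normal(size=3)          # weights of the logistic unit
W = rng.normal(size=(3, 2))     # columns w_1, w_2 of the two-way softmax network
print(logistic_cross_entropy(w, x, t=1))
print(two_way_softmax_cross_entropy(W, x, j=1))
```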

Question 3

The output of a neuro-probabilistic language model is a large softmax unit and this creates problems if the vocabulary size is large. Andy claims that the following method solves this problem:

At every iteration of training, train the network to predict the current learned feature vector of the target word (instead of using a softmax). Since the embedding dimensionality is typically much smaller than the vocabulary size, we don't have the problem of having many output weights any more. Which of the following are correct? Check all that apply.

Question 4

We are given the following tree that we will use to classify a particular example x:

In this tree, each p value indicates the probability that x will be classified as belonging to a class in the right subtree of the node at which that p was computed. For example, the probability that x belongs to Class 2 is (1-p_1) \times p_2. Recall that at training time this is a very efficient representation because we only have to consider a single branch of the tree. However, at test time we need to look over all branches in order to determine the probabilities of each outcome.

Suppose we are not interested in obtaining the exact probability of every outcome, but instead we just want to find the class with the maximum probability. A simple heuristic is to search the tree greedily by starting at the root and choosing the branch with maximum probability at each node on our way from the root to the leaves. That is, at the root of this tree we would choose to go right if p_1 \geq 0.5 and left otherwise.

If p_1 = 0.45, p_2 = 0.6, and p_3 = 0.95, then which class will the following methods report for x?

a) Evaluate the probabilities of each of the four classes (the leaf nodes) and report the class of the leaf node with the highest probability. This is the standard approach but may take quite some time.

b) The proposed alternative approach: greedily traverse the tree by choosing the branch with the highest probability and report the class of the leaf node that this finds.
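
The tree itself appears as a figure in the original quiz and is not reproduced here, so the sketch below takes the tree layout as input data. It is a generic Python illustration of the two procedures; the example tree at the bottom follows the convention above (each p is the probability of taking the right branch) but uses made-up probabilities and is not necessarily the tree from the figure.

```python
# A node is (p, left, right), where p is the probability of taking the RIGHT
# branch at that node; a leaf is just a class label.

def leaf_probabilities(node, prob=1.0):
    """Method a): exact probability of every leaf, by multiplying the branch
    probabilities along each root-to-leaf path."""
    if not isinstance(node, tuple):
        return {node: prob}
    p, left, right = node
    out = leaf_probabilities(left, prob * (1.0 - p))
    out.update(leaf_probabilities(right, prob * p))
    return out

def greedy_class(node):
    """Method b): greedy descent, going right if p >= 0.5 and left otherwise."""
    while isinstance(node, tuple):
        p, left, right = node
        node = right if p >= 0.5 else left
    return node

# Hypothetical example tree (NOT the quiz figure): classes 1-4 at the leaves,
# with made-up branch probabilities.
tree = (0.3, (0.8, 1, 2), (0.7, 3, 4))
probs = leaf_probabilities(tree)
print(probs)                          # exact leaf probabilities
print(max(probs, key=probs.get))      # class reported by method a)
print(greedy_class(tree))             # class reported by method b)
```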

Method a) will report class 4

Method b) will report class 4

Method a) will report class 2

Method b) will report class 3

Method a) will report class 1

Method b) will report class 4

Method a) will report class 3

Method b) will report class 2

Method a) will report class 4

Method b) will report class 1

Question 5

Brian is trying to use a neural network to predict the next word given several previous words. He has the following idea to reduce the amount of computation needed to make this prediction.

Rather than having the output layer of the network be a 100,000-way softmax, Brian says that we can just encode each word as an integer: 1 will correspond to the first word, 2 to the second word and so on up to 100,000. For the output layer, we can then simply use a single linear neuron with a squared error cost function. Is this a good idea?

Partly. Brian's method should only be used at the input layer. The output layer must always report probabilities, and squared error loss is not appropriate for that.

Yes. With this method, there are fewer parameters, while the network can still learn equally well.
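
Regardless of whether the scheme would learn well, the parameter saving Brian has in mind is easy to quantify. Below is a quick back-of-the-envelope count; the hidden-layer size of 300 is a made-up assumption, not a number from the question.

```python
vocab_size = 100_000   # words in the vocabulary
hidden_size = 300      # hypothetical size of the last hidden layer (assumed)

# 100,000-way softmax output layer: one weight vector per word in the vocabulary.
softmax_output_weights = hidden_size * vocab_size   # 30,000,000

# Brian's proposal: a single linear output neuron predicting the word's integer code.
linear_output_weights = hidden_size * 1             # 300

print(softmax_output_weights, linear_output_weights)
```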

Question 6

In the Collobert and Weston model, the problem of learning a feature vector from a sequence of words is turned into a problem of:

Learning to predict the middle word in the sequence given the words that came before and the words that came after.

Learning to predict the next word in an arbitrary length sequence.

Learning to reconstruct the input vector.

Question 7

Andy is in trouble! He is in the middle of training a feed-forward neural network that is supposed to predict the fourth word in a sequence given the previous three words (shown below).

Brian was supposed to get the dataset ready, but accidentally swapped the first and third word in every sequence! Now the neural network sees sequences of the form word 3, word 2, word 1, word 4; so what is actually being trained is this network:

Andy has been training this network for a long time and doesn't want to start over, but he is worried that the network won't do very well because of Brian's mistake. Should Andy be worried?

Andy should not be worried. His feed-forward network is smart enough to automatically recognize that the input sequences were given in the incorrect order and will change the sequences back to the correct order as it's learning.

Andy should be worried. The network will not do a very good job because of Brian's mistake. We know this because changing the order will make things more difficult for a human, so for artificial neural networks it will be even more difficult.

Andy should not be worried. Even though the network knows that each word in the sequence is in a different place, it does not care about the ordering of the input words as long as input sequences are always shown to it in a consistent way.

Question 1

True or false: the neural network in the lectures that was used to predict relationships in family trees had "bottleneck" layers (layers with fewer dimensions than the input). The reason these were used was to prevent the network from memorizing the training data without learning any meaningful features for generalization.

True

