Lecture 4 Quiz



Question 1

The cross-entropy cost function with an n-way softmax unit (a softmax unit with n different outputs) is equivalent to:

Clarification:

Let's say that a network with n linear output units has some weights w. w is a matrix with n columns, and w_i indexes a particular column in this matrix and represents the weights from the inputs to the i-th output unit.

Suppose the target for a particular example is j (so that it belongs to class j, in other words).

The squared error cost function for n linear units is given by:

\frac{1}{2}\sum_{i=1}^n (t_i - w_i^T x)^2

where t is a vector of zeros except for a 1 at index j.

The cross-entropy cost function for an n-way softmax unit is given by:

-\log \left( \frac{\exp \left( w_j^T x \right)}{\sum_{i=1}^n \exp \left( w_i^T x \right)} \right) = -w_j^T x + \log \left( \sum_{i=1}^n \exp \left( w_i^T x \right) \right)

Finally, n logistic units would compute an output of \sigma(w_i^T x) = \frac{1}{1+\exp(-w_i^T x)} independently for each class i. Combined with the squared error, the cost would be:

\frac{1}{2}\sum_{i=1}^n (t_i - \sigma(w_i^T x))^2

where again, t is a vector of zeros with a 1 at index j (assuming the true class of the example is j).

Using this same definition for t, the cross-entropy error for n logistic units would be the sum of the individual cross-entropy errors:

-\sum_{i=1}^n \left[ t_i \log(\sigma(w_i^T x)) + (1-t_i)\log(1-\sigma(w_i^T x)) \right]

For any set of weights w, the network with a softmax output unit over n classes will have some cost due to the cross-entropy error (cost function). The question is now asking whether we can define a new network with a set of weights w^* using some (possibly different) cost function such that:

a) w^* = f(w) for some function f

b) For every input, the cost we get using w in the softmax output network with cross-entropy error is the same as the cost we would get using w^* in the new network with the possibly different cost function.
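
To make these definitions concrete, here is a minimal NumPy sketch of the four cost functions described above, evaluated on one example. The weights, input, and target class are arbitrary made-up numbers, and the function names are mine rather than anything from the quiz.

```python
import numpy as np

def softmax_cross_entropy(W, x, j):
    # -w_j^T x + log(sum_i exp(w_i^T x)); W has one column of weights per class
    z = W.T @ x
    return -z[j] + np.log(np.sum(np.exp(z)))

def linear_squared_error(W, x, j):
    # 0.5 * sum_i (t_i - w_i^T x)^2, with t one-hot at index j
    t = np.zeros(W.shape[1]); t[j] = 1.0
    return 0.5 * np.sum((t - W.T @ x) ** 2)

def logistic_squared_error(W, x, j):
    # 0.5 * sum_i (t_i - sigma(w_i^T x))^2
    t = np.zeros(W.shape[1]); t[j] = 1.0
    s = 1.0 / (1.0 + np.exp(-(W.T @ x)))
    return 0.5 * np.sum((t - s) ** 2)

def logistic_cross_entropy(W, x, j):
    # -sum_i [ t_i log(sigma_i) + (1 - t_i) log(1 - sigma_i) ]
    t = np.zeros(W.shape[1]); t[j] = 1.0
    s = 1.0 / (1.0 + np.exp(-(W.T @ x)))
    return -np.sum(t * np.log(s) + (1 - t) * np.log(1 - s))

# Arbitrary example: 3 inputs, n = 4 classes, true class index j = 2.
rng = np.random.default_rng(0)
W, x, j = rng.normal(size=(3, 4)), rng.normal(size=3), 2
for f in (softmax_cross_entropy, linear_squared_error,
          logistic_squared_error, logistic_cross_entropy):
    print(f.__name__, f(W, x, j))
```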

Question 2

A logistic unit with the cross-entropy cost function is equivalent to:

Clarification:

In a network with a logistic output, we will have a single vector of weights w. For a particular example with target t (which is 0 or 1), the cross-entropy error is given by:

-t \log \left( \sigma(w^T x) \right) - (1-t) \log \left( 1 - \sigma(w^T x) \right), where \sigma(w^T x) = \frac{1}{1+\exp(-w^T x)}.

The squared error if we use a single linear unit would be:

\frac{1}{2} (t - w^T x)^2

Now notice that another way we might define t is by using a vector with 2 elements: [1,0] to indicate the first class, and [0,1] to indicate the second class. Using this definition, we can develop a new type of classification network using a softmax unit over these two classes instead. In this case, we would use a weight matrix w with two columns, where w_i is the column for the i-th class and connects the inputs to the i-th output unit.

Suppose an example belonged to class j (where j is 1 or 2 to indicate [1,0] or [0,1]). Then the cross-entropy cost for this network would be:

-\log \left( \frac{\exp(w_j^T x)}{\exp(w_1^T x) + \exp(w_2^T x)} \right) = -w_j^T x + \log \left( \exp(w_1^T x) + \exp(w_2^T x) \right)

For any set of weights w, the network with a logistic output unit will have some error due to the cross-entropy cost function. The question is now asking whether we can define a new network with a set of weights w^* using some (possibly different) cost function such that:

a) w^* = f(w) for some function f

b) For every input, the cost we get using w in the network with a logistic output and cross-entropy error is the same cost that we would get using w^* in the new network with the possibly different cost function.
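
As in Question 1, a small sketch can make the two costs concrete. The code below implements the logistic cross-entropy with a single weight vector and the two-way softmax cross-entropy with a two-column weight matrix, as defined above; the particular weights are made up, and any relationship you want to test between w and the two softmax columns is left as an experiment.

```python
import numpy as np

def logistic_cross_entropy(w, x, t):
    # -t log(sigma(w^T x)) - (1 - t) log(1 - sigma(w^T x)), with t in {0, 1}
    p = 1.0 / (1.0 + np.exp(-(w @ x)))
    return -t * np.log(p) - (1 - t) * np.log(1 - p)

def two_way_softmax_cross_entropy(W, x, j):
    # -w_j^T x + log(exp(w_1^T x) + exp(w_2^T x)); W has two columns, j in {0, 1}
    z = W.T @ x
    return -z[j] + np.log(np.exp(z[0]) + np.exp(z[1]))

# Arbitrary example with made-up weights.
rng = np.random.default_rng(1)
x = rng.normal(size=3)
w = rng.normal(size=3)          # weights of the logistic unit
W = rng.normal(size=(3, 2))     # columns w_1, w_2 of the two-way softmax network
print(logistic_cross_entropy(w, x, t=1))
print(two_way_softmax_cross_entropy(W, x, j=1))
```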

Question 3

The output of a neuro-probabilistic language model is a large softmax unit and this creates problems if the vocabulary size is large. Andy claims that the following method solves this problem:

At every iteration of training, train the network to predict the current learned feature vector of the target word (instead of using a softmax). Since the embedding dimensionality is typically much smaller than the vocabulary size, we don't have the problem of having many output weights any more. Which of the following are correct? Check all that apply.

Question 4

We are given the following tree that we will use to classify a particular example x:

In this tree, each p value indicates the probability that x will be classified as belonging to a class in the right subtree of the node at which that p was computed. For example, the probability that x belongs to Class 2 is (1-p_1) \times p_2. Recall that at training time this is a very efficient representation because we only have to consider a single branch of the tree. However, at test time we need to look over all branches in order to determine the probabilities of each outcome.

Suppose we are not interested in obtaining the exact probability of every outcome, but instead we just want to find the class with the maximum probability. A simple heuristic is to search the tree greedily by starting at the root and choosing the branch with maximum probability at each node on our way from the root to the leaves. That is, at the root of this tree we would choose to go right if p_1 \geq 0.5 and left otherwise.

If p_1 = 0.45, p_2 = 0.6, and p_3 = 0.95, then which class will the following methods report for x?

a) Evaluate the probabilities of each of the four classes (the leaf nodes) and report the class of the leaf node with the highest probability. This is the standard approach but may take quite some time.

b) The proposed alternative approach: greedily traverse the tree by choosing the branch with the highest probability and report the class of the leaf node that this finds.
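
The tree itself appears as a figure in the original quiz and is not reproduced here, so the sketch below takes the tree layout as input data. It is a generic Python illustration of the two procedures; the example tree at the bottom follows the convention above (each p is the probability of taking the right branch) but uses made-up probabilities and is not necessarily the tree from the figure.

```python
# A node is (p, left, right), where p is the probability of taking the RIGHT
# branch at that node; a leaf is just a class label.

def leaf_probabilities(node, prob=1.0):
    """Method a): exact probability of every leaf, by multiplying the branch
    probabilities along each root-to-leaf path."""
    if not isinstance(node, tuple):
        return {node: prob}
    p, left, right = node
    out = leaf_probabilities(left, prob * (1.0 - p))
    out.update(leaf_probabilities(right, prob * p))
    return out

def greedy_class(node):
    """Method b): greedy descent, going right if p >= 0.5 and left otherwise."""
    while isinstance(node, tuple):
        p, left, right = node
        node = right if p >= 0.5 else left
    return node

# Hypothetical example tree (NOT the quiz figure): classes 1-4 at the leaves,
# with made-up branch probabilities.
tree = (0.3, (0.8, 1, 2), (0.7, 3, 4))
probs = leaf_probabilities(tree)
print(probs)                          # exact leaf probabilities
print(max(probs, key=probs.get))      # class reported by method a)
print(greedy_class(tree))             # class reported by method b)
```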

Method a) will report class 4

Method b) will report class 4

Method a) will report class 2

Method b) will report class 3

Method a) will report class 1

Method b) will report class 4

Method a) will report class 3

Method b) will report class 2

Method a) will report class 4

Method b) will report class 1

Question 5

Brian is trying to use a neural network to predict the next word given several previous words. He has the following idea to reduce the amount of computation needed to make this prediction.

Rather than having the output layer of the network be a 100,000-way softmax, Brian says that we can just encode each word as an integer: 1 will correspond to the first word, 2 to the second word and so on up to 100,000. For the output layer, we can then simply use a single linear neuron with a squared error cost function. Is this a good idea?

Partly. Brian's method should only be used at the input layer. The output layer must always report probabilities, and squared error loss is not appropriate for that.

Yes. With this method, there are fewer parameters, while the network can still learn equally well.
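
Regardless of whether the scheme would learn well, the parameter saving Brian has in mind is easy to quantify. Below is a quick back-of-the-envelope count; the hidden-layer size of 300 is a made-up assumption, not a number from the question.

```python
vocab_size = 100_000   # words in the vocabulary
hidden_size = 300      # hypothetical size of the last hidden layer (assumed)

# 100,000-way softmax output layer: one weight vector per word in the vocabulary.
softmax_output_weights = hidden_size * vocab_size   # 30,000,000

# Brian's proposal: a single linear output neuron predicting the word's integer code.
linear_output_weights = hidden_size * 1             # 300

print(softmax_output_weights, linear_output_weights)
```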

Question 6

In the Collobert and Weston model, the problem of learning a feature vector from a sequence of words is turned into a problem of:

Learning to predict the middle word in the sequence given the words that came before and the words that came after.

Learning to predict the next word in an arbitrary length sequence.

Learning to reconstruct the input vector.

Question 7

Andy is in trouble! He is in the middle of training a feed-forward neural network that is supposed to predict the fourth word in a sequence given the previous three words (shown below).

Brian was supposed to get the dataset ready, but accidentally swapped the first and third word in every sequence! Now the neural network sees sequences of the form word 3, word 2, word 1, word 4; so what is actually being trained is this network:

Andy has been training this network for a long time and doesn't want to start over, but he is worried that the network won't do very well because of Brian's mistake. Should Andy be worried?

Andy should not be worried. His feed-forward network is smart enough to automatically recognize that the input sequences were given in the incorrect order and will change the sequences back to the correct order as it's learning.

Andy should be worried. The network will not do a very good job because of Brian's mistake. We know this because changing the order will make things more difficult for a human, so for artificial neural networks it will be even more difficult.

Andy should not be worried. Even though the network knows that each word in the sequence is in a different place, it does not care about the ordering of the input words as long as input sequences are always shown to it in a consistent way.

Question 1

True or false: the neural network in the lectures that was used to predict relationships in family trees had "bottleneck" layers (layers with fewer dimensions than the input). The reason these were used was to prevent the network from memorizing the training data without learning any meaningful features for generalization.

True

