1. Question 1
Which of the following neural networks are examples of a feed-forward neural network?
A feed-forward network does not have cycles.
2. Question 2
Consider a neural network with only one training case with input $\mathbf{x} = (x_1, x_2, \ldots, x_n)^\top$ and correct output $t$. There is only one output neuron, which is linear, i.e. $y = \mathbf{w}^\top\mathbf{x}$ (notice that there are no biases). The loss function is squared error. The network has no hidden units, so the inputs are directly connected to the output neuron with weights $\mathbf{w} = (w_1, w_2, \ldots, w_n)^\top$. We're in the process of training the neural network with the backpropagation algorithm. What will the algorithm add to $w_i$ for the next iteration if we use a step size (also known as a learning rate) of $\epsilon$?
$\epsilon(\mathbf{w}^\top\mathbf{x} - t)x_i$
$x_i$ if $\mathbf{w}^\top\mathbf{x} > t$; $-x_i$ if $\mathbf{w}^\top\mathbf{x} \leq t$
$x_i$
$-\epsilon(\mathbf{w}^\top\mathbf{x} - t)x_i$
There are multiple components to this, all multiplied together: the learning rate, the derivative of the loss function w.r.t. the state of the output unit, and the derivative of the input to the output unit w.r.t. $w_i$.
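To make that chain of factors concrete, here is a minimal NumPy sketch (an illustration, not part of the quiz) that assumes the conventional squared error $E = \frac{1}{2}(\mathbf{w}^\top\mathbf{x} - t)^2$ and checks the analytic update $-\epsilon(\mathbf{w}^\top\mathbf{x} - t)x_i$ against a finite-difference gradient:

```python
import numpy as np

# Assumed setup: one training case, linear output, E = 0.5 * (w.x - t)**2,
# so the amount backprop adds to w_i is -eps * (w.x - t) * x_i.
rng = np.random.default_rng(0)
n = 5
x = rng.normal(size=n)   # the single training case
t = 1.3                  # its correct output
w = rng.normal(size=n)   # current weights (no biases)
eps = 0.1                # step size / learning rate

def loss(w):
    y = w @ x            # linear output neuron: y = w^T x
    return 0.5 * (y - t) ** 2

analytic_update = -eps * (w @ x - t) * x   # what backprop adds to w

# Central finite-difference gradient of the loss, one weight at a time.
h = 1e-6
numeric_grad = np.array([
    (loss(w + h * np.eye(n)[i]) - loss(w - h * np.eye(n)[i])) / (2 * h)
    for i in range(n)
])
assert np.allclose(analytic_update, -eps * numeric_grad)
```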
3. Question 3
Suppose we have a set of examples and Brian comes in and duplicates every example, then randomly reorders the examples. We now have twice as many examples, but no more information about the problem than we had before. If we do not remove the duplicate entries, which one of the following methods will not be affected by this change, in terms of the computer time (time in seconds, for example) it takes to come close to convergence?
Full-batch learning.
Mini-batch learning, where for every iteration we randomly pick 100 training cases.
After Brian's intervention, most mini-batches will contain duplicates and will therefore provide less information.
Online learning, where for every iteration we randomly pick a training case.
Full-batch learning needs to look at every example before taking a step, so each step will be twice as expensive. Online learning only looks at one example at a time, so each step has the same computational cost as before. In expectation, online learning makes the same progress after looking at half of the duplicated dataset as it would have made on the whole dataset had Brian not intervened.
Although this example is a bit contrived, it serves to illustrate how online learning can be advantageous when there is a lot of redundancy in the data.
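The effect is easy to check numerically. In the sketch below (a made-up linear-regression setup, illustration only), duplicating the data leaves the full-batch gradient unchanged while doubling the number of examples each full-batch step must touch, whereas an online step still touches exactly one example:

```python
import numpy as np

# Hypothetical setup: linear model, squared error.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))    # 50 examples, 3 features
t = rng.normal(size=50)
w = rng.normal(size=3)

# Brian duplicates every example and reorders; order is irrelevant below.
X2 = np.vstack([X, X])
t2 = np.concatenate([t, t])

def full_batch_grad(X, t, w):
    # Touches every example, so on (X2, t2) it does twice the work per step.
    return X.T @ (X @ w - t) / len(t)

# The averaged gradient is identical, so full-batch learning takes the same
# steps, but each step now costs twice as much computer time.
assert np.allclose(full_batch_grad(X, t, w), full_batch_grad(X2, t2, w))

# An online step looks at one randomly picked example: same cost as before,
# and its expected direction over the random pick is also unchanged.
i = rng.integers(len(t2))
online_grad = (X2[i] @ w - t2[i]) * X2[i]
```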
4. Question 4
Consider a linear output unit versus a logistic output unit for a feed-forward network with no hidden layer. The network has a set of inputs $x$ and an output neuron $y$ connected to the inputs by weights $w$ and a bias $b$.
We're using the squared error cost function even though the task that we care about, in the end, is binary classification. At training time, the target output values are $1$ (for one class) and $0$ (for the other class). At test time we will use the classifier to make decisions in the standard way: the class of an input $x$ according to our model after training is as follows:
$\text{class of } x = \begin{cases} 1 & \text{if } w^\top x + b \geq 0 \\ 0 & \text{otherwise} \end{cases}$
Note that we will be training the network using $y$, but that the decision rule shown above will be the same at test time, regardless of the type of output neuron we use for training.
Which of the following statements is true?
Unlike a linear unit, a logistic unit will not penalize us for getting things right too confidently.
If the target is 1 and the raw score $w^\top x + b$ is 100, the logistic unit will squash this down to a number very close to 1, so we will not incur a high cost. With a linear unit, the prediction itself is 100, the difference between prediction and target is very large, and we will incur a high cost as a result, despite the fact that we get the classification decision correct.
The error function (the error as a function of the weights) for both types of units will form a quadratic bowl.
At the solution that minimizes the error, the learned weights are always the same for both types of units; they only differ in how they get to this solution.
For a logistic unit, the derivatives of the error function with respect to the weights can have unbounded magnitude, while for a linear unit they will have bounded magnitude.
This cannot be true. The derivative of the squared error with respect to the weights when using a linear unit depends on the distance between the prediction and the target. In other words: the further the prediction from the target, the larger the magnitude of the gradient. The prediction can be arbitrarily bad.
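A quick sketch (illustration only, using the conventional $\frac{1}{2}$ factor on the squared error) of the point about confident correct answers: with target 1 and a very confident raw score $z = 100$, the logistic unit's cost is essentially zero while the linear unit's is enormous, even though both classify correctly:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

t = 1.0      # target class
z = 100.0    # very confident raw score w.x + b

linear_cost = 0.5 * (z - t) ** 2             # ~4900.5: heavily penalized
logistic_cost = 0.5 * (sigmoid(z) - t) ** 2  # ~0: squashed next to the target
print(linear_cost, logistic_cost)
```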
5. Question 5
Consider a neural network with one layer of logistic hidden units (intended to be fully connected to the input units) and a linear output unit. Suppose there are $n$ input units and $m$ hidden units. Which of the following statements are true? Check all that apply.
As long as $m \geq 1$, this network can learn to compute any function that can be learned by a network without any hidden layers (with the same inputs).
If the weights into the hidden layer are very small, and the weights out of it are large (to compensate), then each hidden unit operates in the nearly linear region of the logistic and behaves like a linear unit, so the network can emulate anything the no-hidden-layer network computes.
Any function that can be learned by such a network can also be learned by a network without any hidden layers (with the same inputs).
If $m > n$, this network can learn more functions than if $m$ is less than $n$ (with $n$ being the same).
A network with $m > n$ has more learnable parameters than a network without any hidden layers (with the same inputs).
The bulk of the learnable parameters is in the connections from the input units to the hidden units; there are $m \cdot n$ learnable parameters there.
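A small counting sketch (ignoring biases, as the rest of this quiz does) comparing the two architectures:

```python
# n input units, m logistic hidden units, one linear output unit.
def params_with_hidden(n, m):
    return n * m + m   # input->hidden weights plus hidden->output weights

def params_no_hidden(n):
    return n           # direct input->output weights

n, m = 10, 20
print(params_with_hidden(n, m), params_no_hidden(n))  # 220 vs 10
# params_with_hidden(n, m) > params_no_hidden(n) for any m >= 1,
# and the n*m term dominates as m grows.
```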
6. Question 6
Brian wants to make his feed-forward network (with no hidden units) using a linear output neuron more powerful. He decides to combine the predictions of two networks by averaging them. The first network has weights $w_1$ and the second network has weights $w_2$. The predictions of this combination for an example $x$ are therefore:
$y = \frac{1}{2}w_1^\top x + \frac{1}{2}w_2^\top x$
Can we get the exact same predictions as this combination of networks by using a single feed-forward network (again with no hidden units) using a linear output neuron and weights $w_3 = \frac{1}{2}(w_1 + w_2)$?
Yes
No
Question 6 (variant)
Brian wants to make his feed-forward network (with no hidden units) using a logistic output neuron more powerful. He decides to combine the predictions of two networks by averaging them. The first network has weights $w_1$ and the second network has weights $w_2$. The predictions of this combination for an example $x$ are therefore:
$y = \frac{1}{2}\frac{1}{1+e^{-z_1}} + \frac{1}{2}\frac{1}{1+e^{-z_2}}$ with $z_1 = w_1^\top x$ and $z_2 = w_2^\top x$.
Can we get the exact same predictions as this combination of networks by using a single feed-forward network (again with no hidden units) using a logistic output neuron and weights $w_3 = \frac{1}{2}(w_1 + w_2)$?
Yes
No
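Both variants can be checked numerically. In the sketch below (illustration only), averaging two linear networks equals a single linear network with averaged weights for every input, while for logistic networks the average of the two sigmoids generally differs from the sigmoid of the averaged score:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
x = rng.normal(size=4)
w1, w2 = rng.normal(size=4), rng.normal(size=4)
w3 = 0.5 * (w1 + w2)

# Linear case: exact equality for every x, by linearity of the dot product.
assert np.isclose(0.5 * (w1 @ x) + 0.5 * (w2 @ x), w3 @ x)

# Logistic case: the sigmoid is nonlinear, so averaging the two outputs is
# not the same as squashing the averaged score.
print(0.5 * sigmoid(w1 @ x) + 0.5 * sigmoid(w2 @ x))
print(sigmoid(w3 @ x))   # generally a different number
```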