Questions in lecture 3

What is the estimate of the price of the meal after running the iterative algorithm for one step?

Recall that

The correct price for portions (2, 5, 3) is 850.

The initial guess of the weights (50, 50, 50) gave an estimate of 500.

After running the iterative algorithm for one step, the revised guess of the weights is (70, 100, 80).

  880
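The arithmetic can be checked with a short Python sketch. The learning rate is not stated above; $1/35$ is simply the value that reproduces the revised weights (70, 100, 80).

```python
# Check of the meal-price example: one delta-rule step on a linear neuron.
# The learning rate is not given in the question; 1/35 is the value that
# reproduces the revised weights (70, 100, 80).
from fractions import Fraction

portions = [2, 5, 3]     # portions of each item in the meal
true_price = 850         # correct total price
weights = [50, 50, 50]   # initial guess of the per-portion prices

estimate = sum(w * x for w, x in zip(weights, portions))
print(estimate)          # 500

# Delta rule: w_i <- w_i + eps * x_i * (t - y)
eps = Fraction(1, 35)
residual = true_price - estimate   # 350
weights = [w + eps * x * residual for w, x in zip(weights, portions)]
print(weights)           # [Fraction(70, 1), Fraction(100, 1), Fraction(80, 1)]

print(sum(w * x for w, x in zip(weights, portions)))   # 880
```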
You have seen how the delta rule can be used to minimize the error by looking at one training case at a time. What is a good reason for using this kind of iterative method?


Suppose we have a dataset of two training points.

$x_1 = (1, 1), \quad t_1 = 0$

$x_2 = (0, 1), \quad t_2 = 1$

Consider a network with two input units connected to a linear neuron with weights $w = (w_1, w_2)$. What is the equation for the error surface when using a squared error loss function?

Hint: The squared error is defined as $\frac{1}{2}(w^T x_1 - t_1)^2 + \frac{1}{2}(w^T x_2 - t_2)^2$

$E = \frac{1}{2}\left((w_1 - w_2 - 1)^2 + w_2^2\right)$

$E = \frac{1}{2}\left(w_1^2 - 2w_2^2 + 2w_1w_2 + 1\right)$

$E = \frac{1}{2}(w_1 - w_2)^2$
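A minimal SymPy sketch that restates the hint's definition and expands it, so the candidate surfaces above can be checked:

```python
# Expand the squared error from the hint so the candidate surfaces above can
# be checked. This only restates the hint's definition symbolically.
import sympy as sp

w1, w2 = sp.symbols('w1 w2')
w = sp.Matrix([w1, w2])
x1, t1 = sp.Matrix([1, 1]), 0
x2, t2 = sp.Matrix([0, 1]), 1

E = sp.Rational(1, 2) * ((w.dot(x1) - t1)**2 + (w.dot(x2) - t2)**2)
print(sp.expand(E))   # error surface E(w1, w2) in expanded form
```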


If our initial weight vector is $(w_1, 0)$ for some real number $w_1$, then on which of the following error surfaces can we expect steepest descent to converge poorly? Check all that apply.
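The contour plots of the error surfaces from the original quiz are not reproduced here. As a stand-in illustration (a toy surface, not one of the quiz options), steepest descent started from a point of the form $(w_1, 0)$ on an elongated bowl whose long axis is not aligned with the weight axes makes slow, zig-zagging progress:

```python
# Toy illustration (not one of the quiz's plotted surfaces): steepest descent
# on an elongated quadratic bowl E(w) = 0.5 * w^T A w whose long axis is not
# aligned with the coordinate axes.
import numpy as np

A = np.array([[50.5, 49.5],
              [49.5, 50.5]])   # eigenvalues 100 (along w1 = w2) and 1 (along w1 = -w2)

w = np.array([1.0, 0.0])       # a starting point of the form (w1, 0)
eta = 0.018                    # just under the stability limit 2/100

for step in range(10):
    grad = A @ w               # gradient of the quadratic error
    w = w - eta * grad
    print(step, w)             # iterates jump back and forth across the narrow valley
```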


The range of the function $y = \frac{1}{1+e^{-z}}$ is between 0 and 1. Another way of interpreting the logistic unit is that it is modelling:

The probability of the inputs given the outputs.
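A quick numerical check of the stated range, using nothing beyond the formula for $y$:

```python
# Quick check that the logistic function maps any real z into (0, 1).
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

for z in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(z, logistic(z))   # every output lies strictly between 0 and 1
```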

Instead of randomly perturbing the weights and looking at the change in error, one can try the following approach:

  1. For each weight parameter $w_i$, perturb $w_i$ by adding a small (say, $10^{-5}$) constant $\epsilon$ and evaluate the error (call this $E_i^+$).
  2. Now reset $w_i$ back to its original value, perturb it again by subtracting the same small constant $\epsilon$, and evaluate the error again (call this $E_i^-$).
  3. Repeat this for each weight index $i$.
  4. Upon completing this, we update the weight vector by $w_i \leftarrow w_i - \eta \frac{(E_i^+ - E_i^-)}{2\epsilon}$

for some learning rate $\eta$.
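A minimal sketch of this perturbation procedure for a linear neuron with squared error, reusing the two training cases from the error-surface question above; the starting weights and learning rate are arbitrary choices:

```python
# Central-difference version of the procedure above, applied to a linear
# neuron with squared error. The training cases are the ones from the
# error-surface question; starting weights and learning rate are arbitrary.
import numpy as np

X = np.array([[1.0, 1.0],
              [0.0, 1.0]])      # inputs x_1, x_2
t = np.array([0.0, 1.0])        # targets t_1, t_2

def error(w):
    """Squared error E(w) = 0.5 * sum_n (w^T x_n - t_n)^2."""
    return 0.5 * np.sum((X @ w - t) ** 2)

w = np.array([0.5, -0.5])
eps, eta = 1e-5, 0.1

for _ in range(200):
    grad_est = np.zeros_like(w)
    for i in range(len(w)):                  # steps 1-3: perturb each weight in turn
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        grad_est[i] = (error(w_plus) - error(w_minus)) / (2 * eps)
    w = w - eta * grad_est                   # step 4: update all the weights
print(w)   # approaches the minimum of E at w = (-1, 1)
```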

True or false: for an appropriately chosen $\eta$, repeating this procedure will find the minimum of the error surface for a linear output neuron.

False


Backpropagation can be used for which kinds of neurons?

When we perform online learning (using steepest descent), we look at each data example, compute the gradient of the error on that case, and take a little step in the direction of the negative gradient. In offline (also known as batch) learning, we look at each example, compute the gradient, sum these gradients up, and then take a (possibly much bigger) step in the direction of the negative of this sum.
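A minimal sketch of one pass of each procedure on a linear neuron with squared error, again with arbitrary toy data, starting weights and learning rate:

```python
# One pass of online (per-case) updates versus one batch update, for a linear
# neuron with squared error. Data, starting weights and learning rate are
# arbitrary illustrative choices.
import numpy as np

X = np.array([[1.0, 1.0],
              [0.0, 1.0]])
t = np.array([0.0, 1.0])
eta = 0.1

def case_gradient(w, x, target):
    """Gradient of 0.5 * (w^T x - target)^2 with respect to w."""
    return (w @ x - target) * x

# Online: take a small step after every case, so each later gradient is
# evaluated at an already-updated weight vector.
w_online = np.array([0.5, 0.5])
for x, target in zip(X, t):
    w_online = w_online - eta * case_gradient(w_online, x, target)

# Offline (batch): sum the gradients, all evaluated at the same starting
# weights, then take a single step.
w_start = np.array([0.5, 0.5])
total_grad = sum(case_gradient(w_start, x, target) for x, target in zip(X, t))
w_batch = w_start - eta * total_grad

print(w_online, w_batch)   # compare the two resulting weight vectors
```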

True or false: for one pass over the dataset, these procedures are equivalent, in the sense that if we added up all of the gradients computed at each update of the online learning procedure, we would get the same total gradient as the offline method.

Follow-up question: which method do we expect to be more stable with respect to the choice of learning rate? Here we define stable to mean that the learning procedure will converge to a minimum.

Check one box from the left column and one box from the right column.



Reposted from blog.csdn.net/sophiecxt/article/details/80375253