CS231n Lecture 6: Training Neural Networks, Part I (study notes)

1. Commonly used activation functions

1. Sigmoid

Its use in practice has declined; the main drawbacks are:
  • Saturation kills gradients: when the sigmoid saturates near 0 or 1, its local gradient is almost zero, so very little signal flows back during backpropagation. If the initial weights are too large, most neurons saturate right away and the network barely learns.
  • The outputs are not zero-centered: they are always positive, so during backpropagation the gradients on the next layer's weights are all positive or all negative, which causes undesirable zig-zag dynamics in the parameter updates.

2. Tanh


  • Like the sigmoid, tanh still saturates and kills gradients, but its output is zero-centered, so the sigmoid's second problem no longer applies.

3. ReLU


  • In recent years, it has been a relatively popular activation function.

Advantages and disadvantages (a short code sketch of all three activations follows this list):
  • It greatly accelerates the convergence of stochastic gradient descent compared with sigmoid and tanh.
  • It is very cheap to compute (a simple threshold at zero, no exponentials).
  • ReLU units can "die": if a unit's gradient becomes 0, its weights may never be updated again.
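
A minimal NumPy sketch of the sigmoid, tanh, and ReLU activations (the function names and test values are my own):

```python
import numpy as np

def sigmoid(x):
    # squashes x into (0, 1); saturates for large |x|, which kills gradients
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # zero-centered output in (-1, 1); still saturates at the extremes
    return np.tanh(x)

def relu(x):
    # simple threshold at zero; cheap to compute and non-saturating for x > 0
    return np.maximum(0.0, x)

x = np.linspace(-3, 3, 7)
print(sigmoid(x))
print(tanh(x))
print(relu(x))
```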

2. Neural Network architectures

  1. By convention, the input layer is not counted when we talk about the number of layers in a neural network (see the sketch below).
  2. Do not choose a smaller network out of fear of overfitting; use a larger network and control overfitting with regularization instead.
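
To illustrate the layer-counting convention, here is a minimal forward-pass sketch of a "2-layer" network, i.e. one hidden layer plus an output layer; all sizes and variable names are arbitrary choices:

```python
import numpy as np

# A "2-layer" network: the input layer is not counted, so this is
# one hidden layer plus one output layer. Sizes below are arbitrary.
D, H, C = 4, 10, 3                          # input dim, hidden units, classes
W1, b1 = 0.01 * np.random.randn(D, H), np.zeros(H)
W2, b2 = 0.01 * np.random.randn(H, C), np.zeros(C)

x = np.random.randn(1, D)                   # a single input example
h = np.maximum(0, x.dot(W1) + b1)           # hidden layer (ReLU)
scores = h.dot(W2) + b2                     # output layer: class scores
print(scores.shape)                         # (1, 3)
```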

3. Data Preprocessing

  • Mean subtraction (zero-centering): subtract the mean of each dimension so that the data cloud is centered at the origin. Normalization: scale the data so that each dimension lies in a fixed range, for example 0 to 1 or -1 to 1. Standardization: transform the data so that each dimension has zero mean and unit variance.
  • What is SVD? (singular value decomposition; need to review)
  • Whitening: take the data in the eigenbasis and divide every dimension by the corresponding eigenvalue to normalize the scale.
  • A common pitfall: any preprocessing statistic, such as the data mean, must be computed on the training set only and then applied to the training, validation, and test sets; do not compute the mean over the entire dataset and then split it (see the sketch below).
  • (I do not fully understand this part yet; it needs more linear algebra.)
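
A minimal sketch of that pitfall; the arrays below are made-up placeholders for the real data splits:

```python
import numpy as np

# Hypothetical splits: rows are examples, columns are features.
X_train = np.random.randn(100, 5)
X_val   = np.random.randn(20, 5)
X_test  = np.random.randn(20, 5)

# Compute the statistics on the training set ONLY ...
mean = X_train.mean(axis=0)
std  = X_train.std(axis=0)

# ... and apply the same transform to every split (zero-center, then scale).
X_train = (X_train - mean) / std
X_val   = (X_val   - mean) / std
X_test  = (X_test  - mean) / std
```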

4. Weight Initialization

  1. Key points:
  • Initializing all weights to 0 is not feasible: every neuron then computes the same output and receives the same gradient, so all weights get identical updates during backpropagation and the symmetry is never broken. We can reasonably expect roughly half of the weights to end up positive and half negative, but they must start out random.
  • We want the initial random weights to be small and close to zero, but smaller is not always better: if the initial weights are too small, the backpropagated gradients are also very small, which slows learning in deep networks.
  • A recommended heuristic is to initialize each neuron's weight vector as w = np.random.randn(n) / sqrt(n), where n is the number of its inputs.
  • For ReLU units, the recommended initialization is w = np.random.randn(n) * sqrt(2.0/n) (see the sketch below).
  2. Batch Normalization: makes the network much more robust to bad initialization; it can be interpreted as doing preprocessing before every layer of the network, integrated into the network itself in a differentiable way.
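
The two heuristics above, written for a full weight matrix rather than a single neuron's weight vector (layer sizes are arbitrary):

```python
import numpy as np

n_in, n_out = 512, 256   # fan-in and layer width; arbitrary example sizes

# Calibrated initialization for tanh/sigmoid-like units: w = randn(n) / sqrt(n)
W = np.random.randn(n_in, n_out) / np.sqrt(n_in)

# He initialization recommended for ReLU units: w = randn(n) * sqrt(2 / n)
W_relu = np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)

b = np.zeros(n_out)      # biases are commonly initialized to zero
```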

5. Regularization

  • Regularization is used to control overfitting.
  1. L2 regularization: if you are not concerned with explicit feature selection, L2 regularization usually gives better results than L1. It heavily penalizes peaky weight vectors and prefers diffuse ones, encouraging the network to use all of its inputs a little rather than a few of them a lot.
  2. L1 regularization: makes the weights sparse, so a neuron ends up using only a sparse subset of its most important inputs and becomes nearly invariant to the noisy ones.
  3. Max norm constraints: enforce an absolute upper bound on the norm of each neuron's weight vector and use projected gradient descent to enforce the constraint. One appealing property is that the network cannot "explode" even when the learning rate is set too high, because the updates are always bounded.
  4. Dropout: an extremely effective and simple method. While training, keep each neuron active with some probability p (a hyperparameter) and set it to 0 otherwise (a minimal sketch follows this list).
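
A minimal sketch of inverted dropout applied to one layer of activations, assuming the common formulation that rescales by 1/p at training time (array names are placeholders):

```python
import numpy as np

p = 0.5                                    # probability of keeping a unit (hyperparameter)
h = np.random.randn(1, 100)                # hypothetical hidden-layer activations

# Training time ("inverted dropout"): drop units and rescale by 1/p now,
# so that nothing needs to change at test time.
mask = (np.random.rand(*h.shape) < p) / p
h_train = h * mask

# Test time: use all units as-is.
h_test = h
```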

6. Loss functions

  • SVM (hinge) loss

  • Softmax (cross-entropy) loss (a single-example sketch of both losses follows this list)

  • L2 loss: harder to optimize than the more stable Softmax loss, because it requires the network to output exactly the correct value for every input; it is also less robust, since outliers can introduce huge gradients.
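
A minimal single-example sketch of the SVM and Softmax losses mentioned above (the scores and label are made-up numbers):

```python
import numpy as np

scores = np.array([3.2, 5.1, -1.7])         # made-up class scores for one example
y = 0                                       # index of the correct class

# SVM (hinge) loss: sum over j != y of max(0, s_j - s_y + 1)
margins = np.maximum(0, scores - scores[y] + 1)
svm_loss = np.sum(margins) - margins[y]     # drop the j == y term

# Softmax (cross-entropy) loss: -log of the normalized probability of class y
shifted = scores - np.max(scores)           # shift scores for numerical stability
probs = np.exp(shifted) / np.sum(np.exp(shifted))
softmax_loss = -np.log(probs[y])

print(svm_loss, softmax_loss)
```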

7. Learning

1. Gradient Checks
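
A sketch of a numerical gradient check using the centered difference formula, compared against the analytic gradient of a toy function (the toy function, array sizes, and tolerance are my own choices):

```python
import numpy as np

def eval_numerical_gradient(f, x, h=1e-5):
    # Centered difference (f(x+h) - f(x-h)) / (2h), evaluated coordinate by coordinate.
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:
        i = it.multi_index
        old = x[i]
        x[i] = old + h; fxph = f(x)
        x[i] = old - h; fxmh = f(x)
        x[i] = old                           # restore the original value
        grad[i] = (fxph - fxmh) / (2 * h)
        it.iternext()
    return grad

# Compare against a known analytic gradient using the relative error.
f = lambda x: np.sum(x ** 2)                 # toy function; analytic gradient is 2x
x = np.random.randn(3, 3)
num = eval_numerical_gradient(f, x)
ana = 2 * x
rel_error = np.abs(num - ana) / np.maximum(1e-8, np.abs(num) + np.abs(ana))
print(rel_error.max())                       # should be tiny, e.g. below 1e-7
```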


Source: www.cnblogs.com/tsruixi/p/12728473.html