Machine learning, deep learning summary

Currently covers:
input normalization, parameter initialization, batch normalization, Dropout, fully connected layers, activation functions, convolution layers, pooling layers, ResNet, Inception networks, loss functions, regularization, multi-class classification, gradient descent, SVM, and so on.

Input normalization

To reduce the number of iterations needed for training, the input is normalized: each feature is rescaled to mean 0 and a consistent range (unit variance).


\(\mu = \frac{1}{m}\sum_i x^{(i)}\)
\(\sigma^2 = \frac{1}{m}\sum_i (x^{(i)} - \mu)^2\)
\(x^{(i)}_{norm} = \frac{x^{(i)}-\mu}{\sqrt{\sigma^2 + \varepsilon}}\)
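
As a minimal sketch of the three formulas above, assuming the data is stored as a NumPy array with one example per row (the function name `normalize_inputs` and the sample values are illustrative, not from the original notes):

```python
import numpy as np

def normalize_inputs(X, eps=1e-8):
    """Rescale each feature (column) of X to zero mean and unit variance."""
    mu = X.mean(axis=0)                      # per-feature mean over the m examples
    sigma2 = ((X - mu) ** 2).mean(axis=0)    # per-feature variance
    X_norm = (X - mu) / np.sqrt(sigma2 + eps)
    return X_norm, mu, sigma2

# Two features on very different scales (made-up values)
X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
X_norm, mu, sigma2 = normalize_inputs(X)
```

The same `mu` and `sigma2` computed on the training set would normally be reused to normalize validation and test data.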

Parameter initialization

  • Zero initialization:
    • Zero initialization sets all parameters to 0. Because every initial value is the same, it fails to break symmetry: the parameters of units in the same layer stay equal to each other throughout training.
    • If two hidden units with the same activation function are connected to the same inputs, they must have different initial parameters. If they start with the same parameters, a deterministic learning algorithm applied to a deterministic model and loss will always update the two units in the same way.
  • Random initialization:
    • Exploding gradients: \(W > I\) (weights larger than the identity make values grow with depth)
    • Vanishing gradients: \(W < I\) (weights smaller than the identity make values shrink with depth)
  • He initialization: builds on random initialization by scaling each layer's parameters \(W^{[l]}\) so that \(A^{[l]}\) stays on a scale similar to that of X. The randomly initialized values are multiplied by \(\sqrt{\frac{2}{n^{[l-1]}}}\)
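
He initialization can be sketched as follows, assuming weights of shape (units in layer l, units in layer l-1) and zero biases. The function name `init_he` and the layer sizes are hypothetical choices for illustration:

```python
import numpy as np

def init_he(layer_dims, seed=0):
    """Random normal weights scaled by sqrt(2 / n^{[l-1]}); biases start at zero."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        n_prev, n_curr = layer_dims[l - 1], layer_dims[l]
        scale = np.sqrt(2.0 / n_prev)        # the He scaling factor
        params[f"W{l}"] = rng.standard_normal((n_curr, n_prev)) * scale
        params[f"b{l}"] = np.zeros((n_curr, 1))
    return params

params = init_he([500, 400, 10])  # made-up layer sizes
```

Zero biases are harmless here because the random weights already break the symmetry between units.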

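Batch normalization (BN)

The same normalization is applied inside the hidden layers, to the values \(z\) of each mini-batch, followed by a learnable scale \(\gamma\) and shift \(\beta\):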


\(\mu = \frac{1}{m}\sum_i z^{(i)}\)
\(\sigma^2 = \frac{1}{m}\sum_i (z^{(i)} - \mu)^2\)
\(z^{(i)}_{norm} = \frac{z^{(i)}-\mu}{\sqrt{\sigma^2 + \varepsilon}}\)
\(\tilde{z}^{(i)} = \gamma z^{(i)}_{norm} + \beta\)
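
A possible forward pass for these four formulas, assuming `Z` holds one column per example of the mini-batch and that `gamma` and `beta` are column vectors (only the training-time computation is sketched; the running statistics used at test time are left out):

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-8):
    """Normalize each unit (row of Z) over the mini-batch, then scale and shift."""
    mu = Z.mean(axis=1, keepdims=True)
    sigma2 = ((Z - mu) ** 2).mean(axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(sigma2 + eps)
    return gamma * Z_norm + beta

# Two units, mini-batch of four examples (made-up values)
Z = np.array([[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]])
gamma = np.ones((2, 1))
beta = np.zeros((2, 1))
Z_tilde = batchnorm_forward(Z, gamma, beta)
```

With `gamma = 1` and `beta = 0` the two rows come out almost identical, since they differ only by a scale factor before normalization.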

Dropout

Dropout is a regularization method. As the name (drop out) suggests, it randomly removes neurons so that the network depends less on any single neuron. In practice we introduce a new variable \(D^{[l]} = [d^{[l](1)} \dots d^{[l](m)}]\) for the l-th hidden layer, where \(d^{[l](i)}\) is the mask for the i-th example and marks which neurons are kept; it is therefore a vector of 0's and 1's. Use \(d^{[l](i)} = np.random.rand(*a^{[l](i)}.shape) < keep\_prob\) to set d and D. So that the expected output of the layer stays unchanged, after \(A^{[l]}\) is computed in forward propagation it is multiplied by the mask and divided by \(keep\_prob\). The same is done in backpropagation for the neurons that were kept: \(dA = \frac{dA}{keep\_prob}\)
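
The forward and backward steps described above might look like this (inverted dropout). This sketch uses NumPy's `default_rng` generator rather than `np.random.rand`, and the function names are illustrative:

```python
import numpy as np

def dropout_forward(A, keep_prob, rng):
    """Zero out units at random and rescale so the expected activation is unchanged."""
    D = rng.random(A.shape) < keep_prob      # boolean mask, True = unit kept
    A_drop = A * D / keep_prob
    return A_drop, D

def dropout_backward(dA, D, keep_prob):
    """Apply the same mask and the same rescaling to the incoming gradient."""
    return dA * D / keep_prob

rng = np.random.default_rng(0)
A = np.ones((4, 10000))                      # made-up activations
A_drop, D = dropout_forward(A, keep_prob=0.8, rng=rng)
dA_drop = dropout_backward(np.ones_like(A), D, keep_prob=0.8)
```

Dropout is only applied during training; at prediction time the layer is used without the mask.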

Fully connected layer

  • Forward propagation

    \(Z^{[l]} = W^{[l]}A^{[l-1]} + b^{[l]}\)
    \(A^{[l]} = g^{[l]}(Z^{[l]})\)
  • Backpropagation

    \(dA^{[L]} = \frac{\partial loss}{\partial A^{[L]}}\)
    \(dZ^{[l]} = dA^{[l]} \odot g^{[l]\prime}(Z^{[l]})\)
    \(dA^{[l-1]} = W^{[l]T}dZ^{[l]}\)
    \(dW^{[l]} = \frac{\partial loss}{\partial W^{[l]}} = \frac{1}{m}dZ^{[l]}A^{[l-1]T}\)
    \(db^{[l]} = \frac{\partial loss}{\partial b^{[l]}} = \frac{1}{m} \sum_{i = 1}^{m} dZ^{[l] (i)}\)
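
The forward and backward formulas for one layer can be sketched like this, assuming ReLU as the activation \(g\) and one example per column (function names and sizes are illustrative):

```python
import numpy as np

def relu(Z):
    return np.maximum(0.0, Z)

def dense_forward(A_prev, W, b):
    """Z = W A_prev + b, A = g(Z) with g = ReLU."""
    Z = W @ A_prev + b
    return relu(Z), Z

def dense_backward(dA, Z, A_prev, W):
    """Return dA_prev, dW, db for one fully connected ReLU layer."""
    m = A_prev.shape[1]
    dZ = dA * (Z > 0)                        # elementwise product with g'(Z)
    dW = (dZ @ A_prev.T) / m
    db = dZ.sum(axis=1, keepdims=True) / m
    dA_prev = W.T @ dZ
    return dA_prev, dW, db

rng = np.random.default_rng(0)
A_prev = rng.standard_normal((3, 5))         # 3 input units, m = 5 examples
W = rng.standard_normal((2, 3))
b = np.zeros((2, 1))
A, Z = dense_forward(A_prev, W, b)
dA_prev, dW, db = dense_backward(np.ones_like(A), Z, A_prev, W)
```

Passing `dA` as all ones corresponds to a toy loss equal to the sum of the activations, averaged over the examples by the \(\frac{1}{m}\) factor.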

Convolution layer

Output size formula:


\(n^{[l]} = \lfloor \frac{n^{[l-1]} + 2p^{[l]}-f^{[l]}}{s^{[l]}} +1 \rfloor\)
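
A small helper makes the formula easy to check by hand. The argument names follow the symbols above (`n_prev` for \(n^{[l-1]}\), `f` for filter size, `p` for padding, `s` for stride); the example sizes are made up:

```python
def conv_output_size(n_prev, f, p=0, s=1):
    """floor((n_prev + 2p - f) / s) + 1, using integer floor division."""
    return (n_prev + 2 * p - f) // s + 1

size_valid = conv_output_size(6, 3)             # 6x6 input, 3x3 filter, no padding
size_same = conv_output_size(6, 3, p=1)         # padding 1 keeps the size at 6
size_strided = conv_output_size(7, 3, s=2)      # stride 2 on a 7x7 input
```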

1x1 convolution layer: can increase or reduce the channel dimension; it is generally used to reduce the number of parameters

Pooling layer

Pooling is divided into average pooling and max pooling. The output size follows the same formula as for the convolution layer:


\(n^{[l]} = \lfloor \frac{n^{[l-1]} + 2p^{[l]}-f^{[l]}}{s^{[l]}} +1 \rfloor\)
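
Both kinds of pooling can be sketched in plain Python on a single 2D feature map, assuming no padding (the function name `pool2d` and the input values are illustrative):

```python
def pool2d(x, f, s, mode="max"):
    """Slide an f x f window with stride s over x and take the max or the mean."""
    n_h = (len(x) - f) // s + 1              # output height from the size formula
    n_w = (len(x[0]) - f) // s + 1           # output width
    out = []
    for i in range(n_h):
        row = []
        for j in range(n_w):
            window = [x[i * s + di][j * s + dj] for di in range(f) for dj in range(f)]
            row.append(max(window) if mode == "max" else sum(window) / len(window))
        out.append(row)
    return out

x = [[1, 3, 2, 4],
     [5, 6, 1, 2],
     [7, 2, 9, 0],
     [1, 8, 3, 4]]
pooled_max = pool2d(x, f=2, s=2, mode="max")       # [[6, 4], [8, 9]]
pooled_avg = pool2d(x, f=2, s=2, mode="average")   # [[3.75, 2.25], [4.5, 4.0]]
```

Pooling has no learnable parameters; only `f` and `s` are chosen by hand.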

ResNet

Deep neural networks show a counterintuitive performance problem: as the number of layers increases, the training error goes up instead of down (the degradation problem). ResNet was introduced to address this. Suppose new layers are appended to a network A to form a new network B. If the added layers perform an identity mapping on the output of A, i.e. the output of A is unchanged after passing through the new layers, then network A and network B have the same error rate. This shows that the deeper network need not perform worse than the shallower one. A residual block adds the skip connection \(a^{[l]}\) before the activation, which makes the identity mapping easy to learn:


\(a^{[l+2]} = g(z^{[l+2]} + a^{[l]})\)
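
The identity-mapping argument can be checked on a toy residual block, assuming two fully connected layers with ReLU as \(g\) (a sketch; real ResNet blocks use convolutions, and the names here are illustrative):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def residual_block(a_l, W1, b1, W2, b2):
    """a^{[l+2]} = g(z^{[l+2]} + a^{[l]}) with a skip connection around two layers."""
    a1 = relu(W1 @ a_l + b1)                 # a^{[l+1]}
    z2 = W2 @ a1 + b2                        # z^{[l+2]}
    return relu(z2 + a_l)

a_l = np.array([[0.5], [2.0], [0.0]])        # non-negative, as after a ReLU
zeros_W = np.zeros((3, 3))
zeros_b = np.zeros((3, 1))
a_out = residual_block(a_l, zeros_W, zeros_b, zeros_W, zeros_b)
```

With all weights and biases at zero, `z2` is zero and the block returns `a_l` unchanged, which is the identity mapping described above.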

Inception network

An Inception network is built by stacking together several outputs that have the same height and width (their depths may differ).

Loss function

Machine learning is divided into supervised learning and unsupervised learning. The difference is that supervised learning has both the data and the true values, namely X and Y, while unsupervised learning has only the data and no true values, namely X. Supervised learning is further divided into linear regression and classification; unsupervised learning only groups the data (clustering).
Linear regression does prediction: it uses the dataset and the true values (i.e. the training set) to fit a line. The loss function mainly used is MSE (mean squared error), namely:


\(loss = (h_{\theta}(x) - y)^2\) or \((\hat{y} - y)^2\)

For classification problems with known labels, logistic regression is used to tell the categories apart. This covers both binary and multi-class classification, and cross-entropy is used as the loss function, namely:

\(Loss = -y\log \hat{y} - (1-y)\log(1-\hat{y})\)

Loss function question: why does linear regression use MSE while logistic regression uses cross-entropy as the loss function?

  1. In terms of purpose, MSE measures the distance between the true and predicted values, while cross-entropy measures the difference between the predicted probability distribution and the true probability distribution, which is more reasonable for classification. Regression outputs can also be negative, and log(-1.5) cannot be computed, so cross-entropy does not suit linear regression.
  2. In classification, the sigmoid function is used as the activation function. If the sigmoid is substituted into MSE, the result is a non-convex function, i.e. a function with several extrema, which gradient descent does not handle well.
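
The two losses and the log(-1.5) remark can be illustrated for a single example, using only the standard library (the function names and the sample predictions are assumptions made for illustration):

```python
import math

def mse(y_hat, y):
    """Squared error for one example."""
    return (y_hat - y) ** 2

def cross_entropy(y_hat, y):
    """Binary cross-entropy for one example; y_hat must lie strictly in (0, 1)."""
    return -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)

good = cross_entropy(0.9, 1)     # confident and correct: small loss
bad = cross_entropy(0.1, 1)      # confident and wrong: large loss

# A regression-style output such as -1.5 is not a probability, so the log fails
try:
    cross_entropy(-1.5, 1)
    log_of_negative_ok = True
except ValueError:
    log_of_negative_ok = False
```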

Activation function

Activation function question: what are the differences between Sigmoid, Tanh and ReLU?



Fig. 1 sigmoid, tanh, relu

Sigmoid function: \(\sigma(x) = \frac{1}{1 + e^{-x}}\)
Features: function values lie between 0 and 1, which makes binary classification easy (output 1 if > 0.5, otherwise 0)
Disadvantages:

  1. Soft saturation for large positive and negative inputs, i.e. the gradient becomes very small
  2. exp is expensive to compute
  3. The outputs all have the same sign (not zero-centered), which causes a zigzag problem during training
    • \(\frac{dL}{dw_i} = \frac{dL}{dy}\frac{dy}{dw_i}\)
      \(= \frac{dL}{dy}\frac{dy}{dz}\frac{dz}{dw_i}\)
      \(= \frac{dL}{dy}y(1-y)x_i\)
      Therefore the sign of \(dw_i\) follows that of \(x_i\): when every \(x_i\) is positive, as sigmoid outputs are, all \(dw_i\) share the same sign.

Tanh function: \(\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}\)
Features: compared with sigmoid, its advantage is that it is symmetric about the origin (zero-centered)
Disadvantages: still soft-saturating; exp is still expensive to compute
ReLU function: \(\max(0, x)\)
Features: does not saturate for positive inputs, hard-saturates for negative inputs; much cheaper to compute than sigmoid and tanh
Disadvantages: not symmetric about the origin (not zero-centered)
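
The saturation behaviour can be seen directly from the derivatives. A small sketch with the standard library (the `*_grad` helper names are illustrative):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)                     # y(1 - y), at most 0.25

def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2

def relu(x):
    return max(0.0, x)

def relu_grad(x):
    return 1.0 if x > 0 else 0.0             # hard saturation for negative inputs

# Gradients at a large positive input: sigmoid and tanh saturate, ReLU does not
grads_at_10 = (sigmoid_grad(10.0), tanh_grad(10.0), relu_grad(10.0))
```

At x = 10 the sigmoid and tanh gradients are close to zero while the ReLU gradient is still 1, which is the "soft saturation" versus "unsaturated positive" distinction above.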

Regularization

Regularization: when building the cost function from the loss function, an important step is adding a regularization term. Regularization generally comes in two kinds, L1 regularization and L2 regularization:

  • L1 regularization: \(J_{R_1} = J_{\theta} + \lambda \sum_j |\theta_j|\)
  • L2 regularization: \(J_{R_2} = J_{\theta} + \lambda \sum_j \theta_j^2\)
  • The purpose of L1 regularization is to make the parameters sparser, in plain terms to make more of them exactly 0; the purpose of L2 regularization is to shrink the weights
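
The two penalty terms can be written out as below, assuming the parameters are given as a flat list and `lam` stands for \(\lambda\) (the names and numbers are illustrative):

```python
def l1_penalty(theta, lam):
    """lambda * sum of absolute values."""
    return lam * sum(abs(t) for t in theta)

def l2_penalty(theta, lam):
    """lambda * sum of squares."""
    return lam * sum(t * t for t in theta)

def regularized_cost(base_cost, theta, lam, kind="l2"):
    """Add the chosen penalty to an already computed unregularized cost."""
    penalty = l1_penalty(theta, lam) if kind == "l1" else l2_penalty(theta, lam)
    return base_cost + penalty

theta = [0.5, -2.0, 0.0]                     # made-up parameters
cost_l1 = regularized_cost(1.0, theta, lam=0.1, kind="l1")
cost_l2 = regularized_cost(1.0, theta, lam=0.1, kind="l2")
```

In practice the bias terms are often left out of the penalty; this sketch does not make that distinction.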

Binary and multi-class classification

As mentioned earlier, logistic regression covers both binary and multi-class classification problems. Their differences and connections are:

  • Binary classification is generally handled with the sigmoid: \(\sigma(x) = \frac{1}{1 + e^{-x}}\)
  • Multi-class classification is handled with the softmax function: \(S(x_j) = \frac{e^{x_j}}{\sum_{k=1}^{K} e^{x_k}}\), \(j = 1, \dots, K\)
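
The connection between the two can be checked numerically: with K = 2 and one score fixed at 0, softmax reduces to the sigmoid. A minimal sketch (subtracting the maximum score is a common numerical-stability step that is not part of the formula above and does not change the result):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(xs):
    """S(x_j) = exp(x_j) / sum_k exp(x_k), computed in a numerically stable way."""
    shift = max(xs)
    exps = [math.exp(x - shift) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])             # made-up scores for K = 3 classes
two_class = softmax([1.5, 0.0])              # K = 2 with the second score fixed at 0
```

Here `two_class[0]` equals `sigmoid(1.5)` up to rounding.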

Gradient descent

Gradient descent: gradient descent is the basis for solving/learning in machine learning. Because batch gradient descent is slow in deep learning, mini-batch gradient descent was introduced to reduce the amount of computation per step. Mini-batch gradient descent oscillates as it converges, which leads to the following three methods for accelerating gradient descent.

  • Momentum: uses an exponentially weighted average of dw and db

    \(V_{dw} = \beta V_{dw} + (1-\beta)dw\)
    \(V_{db} = \beta V_{db} + (1-\beta)db\)
    \(w = w - \alpha V_{dw}\)
    \(b = b - \alpha V_{db}\)
  • RMSprop: Momentum takes an exponentially weighted average of \(dw\) and \(db\), while RMSprop takes an exponentially weighted average of \(dw^2\) and \(db^2\), and in each iteration uses \(\frac{dw}{\sqrt{\overline{dw^2}} + \varepsilon}\) as the step to speed up convergence.

    \(S_{dw} = \beta S_{dw} + (1-\beta)dw^2\)
    \(S_{db} = \beta S_{db} + (1-\beta)db^2\)
    \(w = w - \alpha \frac{dw}{\sqrt{S_{dw}}+\varepsilon}\)
    \(b = b - \alpha \frac{db}{\sqrt{S_{db}}+\varepsilon}\)
  • Adam (Momentum + RMSprop): Adam combines the two methods above. Because the exponentially weighted averages have a large error in the first iterations, a bias correction is applied:

    \(V_{dw}^{correct} = \frac{V_{dw}}{1-\beta_1^t}\), \(V_{db}^{correct} = \frac{V_{db}}{1-\beta_1^t}\)
    \(S_{dw}^{correct} = \frac{S_{dw}}{1-\beta_2^t}\), \(S_{db}^{correct} = \frac{S_{db}}{1-\beta_2^t}\)
    \(w = w - \alpha \frac{V_{dw}^{correct}}{\sqrt{S_{dw}^{correct}}+\varepsilon}\)
    \(b = b - \alpha \frac{V_{db}^{correct}}{\sqrt{S_{db}^{correct}}+\varepsilon}\)
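
The three update rules can be sketched as single-step functions on a scalar parameter. The toy objective \(f(w) = w^2\) (gradient \(2w\)) and all hyperparameter values are assumptions chosen for illustration:

```python
import math

def momentum_step(w, dw, v, alpha=0.1, beta=0.9):
    v = beta * v + (1 - beta) * dw                   # V_dw
    return w - alpha * v, v

def rmsprop_step(w, dw, s, alpha=0.01, beta=0.9, eps=1e-8):
    s = beta * s + (1 - beta) * dw ** 2              # S_dw
    return w - alpha * dw / (math.sqrt(s) + eps), s

def adam_step(w, dw, v, s, t, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    v = beta1 * v + (1 - beta1) * dw
    s = beta2 * s + (1 - beta2) * dw ** 2
    v_hat = v / (1 - beta1 ** t)                     # bias correction, t starts at 1
    s_hat = s / (1 - beta2 ** t)
    return w - alpha * v_hat / (math.sqrt(s_hat) + eps), v, s

# Minimize f(w) = w^2 with Adam, starting from w = 1
w, v, s = 1.0, 0.0, 0.0
for t in range(1, 501):
    w, v, s = adam_step(w, 2.0 * w, v, s, t)
w_adam = w
```

Thanks to the bias correction, the very first Adam step has size close to `alpha` regardless of the gradient's magnitude.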

SVM

SVM (Support Vector Machine): SVM is derived from logistic regression, whose loss function is:


\(Loss = -y\log \hat{y} - (1-y)\log(1-\hat{y})\)

In SVM, the log terms are replaced by cost functions, which reduces the amount of computation and makes kernels easy to use later; the activation function is no longer applied. The SVM loss function is therefore:

\(loss= ycost_1(\theta^Tx) + (1-y)cost_2(\theta^Tx)\)

Because of this loss function, the SVM decision boundary is a maximum-margin classifier: the input \(\theta^Tx\) can be seen as the dot product of two vectors, i.e. the projection \(p\) of the vector \(x\) onto the vector \(\theta\) multiplied by \(\|\theta\|\), namely \(p\|\theta\|\). The larger p is, the smaller \(\|\theta\|\) can be, which is why the classifier maximizes the margin.
In SVM, besides using \(\theta^Tx\) directly, \(\theta^Tf\) may be used, where f is called the kernel and can be linear or non-linear. The non-linear kernel is generally the Gaussian kernel, namely:

\(f = similarity(x,l) = \exp(-\frac{\|x - l\|^2}{2\sigma^2})\)
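
A direct sketch of the Gaussian kernel, assuming the usual squared Euclidean distance in the exponent and points given as plain lists (the landmark name `l` follows the formula; the sample points are made up):

```python
import math

def gaussian_kernel(x, l, sigma=1.0):
    """similarity(x, l) = exp(-||x - l||^2 / (2 sigma^2))."""
    sq_dist = sum((xi - li) ** 2 for xi, li in zip(x, l))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

same_point = gaussian_kernel([1.0, 2.0], [1.0, 2.0])          # identical points give 1.0
far_point = gaussian_kernel([0.0, 0.0], [3.0, 4.0], sigma=5)  # exp(-25 / 50)
```

A larger `sigma` makes the similarity fall off more slowly with distance.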

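Choosing a model (n is the number of features, m is the number of training examples):

  1. When n >> m, logistic regression or a linear SVM is typically used
  2. When n is small and m is medium, an SVM with a Gaussian kernel is used
  3. When n is small and m is large, logistic regression or a linear SVM is typically used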

Origin www.cnblogs.com/x1ao/p/12376137.html