Fully Connected Neural Network: Principles of the Single-Layer Model

Foreword

  Deep learning learns the internal laws and representation levels of sample data, and the information obtained during learning is of great help in interpreting data such as text, images, and sound. The fully connected neural network (MLP) is one of the basic network types, and it clearly embodies the characteristics that distinguish deep learning methods from traditional machine learning algorithms: big-data-driven training, formula-based derivation, self-iterative updating, black-box training, and so on.

Single-layer MLP

  Training a single-layer neural network amounts to finding a set of perceptron weights that minimizes the error between the perceptrons' output and the expected output. The implementation steps are as follows (a code sketch follows the list):
  ※ Step 1: Initialize a random weight matrix.
  ※ Step 2: Feed in the feature data and compute the perceptron outputs, i.e., forward propagation.
  ※ Step 3: Compute the error between the perceptron output vector and the sample's expected output, i.e., the loss function.
  ※ Step 4: Compute the update gradient of the weight matrix from that error, i.e., gradient descent.
  ※ Step 5: Use the gradient to update the weight matrix.
  ※ Step 6: Repeat from Step 2 until training ends (the number of training iterations can be chosen freely, based on experience).
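  Below is a minimal NumPy sketch of these six steps, assuming a single-layer network with a sigmoid output and a mean squared error loss; the data, shapes, and variable names are chosen just for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Toy data: 4 samples with 3 features and one target each (values assumed for illustration)
X = rng.normal(size=(4, 3))
y = np.array([[0.0], [1.0], [1.0], [0.0]])

# Step 1: initialize a random weight matrix and bias
W = rng.normal(size=(3, 1))
b = np.zeros((1, 1))
eta = 0.1                                        # learning rate (step size)

for epoch in range(1000):                        # Step 6: repeat until training ends
    z = X @ W + b                                # Step 2: forward propagation, linear part
    a = sigmoid(z)                               # Step 2: forward propagation, activation
    loss = np.sum((a - y) ** 2) / (2 * len(X))   # Step 3: mean squared error loss
    dz = (a - y) * a * (1 - a) / len(X)          # Step 4: gradient via the chain rule
    dW = X.T @ dz
    db = dz.sum(axis=0, keepdims=True)
    W -= eta * dW                                # Step 5: update weights and bias
    b -= eta * db
    if epoch % 200 == 0:
        print(epoch, loss)
```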

1. Forward propagation

  Weights are assigned to each input vector and a result vector is computed. Generally, in order to give the neural network nonlinear characteristics, an activation function is introduced to process the values obtained from the linear transformation.

  Linear transformation (weighting and bias): $z = w^T x + b$

  Nonlinear transformation (activation function, sigmoid): $\delta(z) = \frac{1}{1 + e^{-z}}$

  In the above formulas, $w$ is the weight, $b$ is the bias, $x$ is the input value, $z$ is the linear output value, and $\delta$ is the nonlinear output value.
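  For illustration, the forward propagation of a single neuron can be written out directly; the input values and weights below are assumed for this sketch only.

```python
import numpy as np

def sigmoid(z):
    """Nonlinear transformation: squashes z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])   # input vector (assumed values)
w = np.array([0.1, 0.4, -0.2])   # weights
b = 0.3                          # bias

z = w @ x + b        # linear transformation: z = w^T x + b
a = sigmoid(z)       # nonlinear transformation (activation)
print(z, a)
```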

2. Activation function

  So why introduce the activation function?

  Answer: Without activation functions, the neural network degenerates into a linear classifier (a composition of linear transformations is still linear). In addition, as the number of layers increases, the values produced by later neurons can become very large; if such a value is much larger than the values of earlier neurons, the earlier neurons contribute almost nothing to the expression of the whole network. Therefore, the output needs to be constrained at every layer.
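  A quick numerical check of the first point (an illustrative sketch, not a formal proof): stacking two linear layers with no activation in between collapses into a single equivalent linear layer.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=3)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

# Two linear layers with no activation in between...
out_two_layers = W2 @ (W1 @ x + b1) + b2
# ...collapse into one equivalent linear layer.
W, b = W2 @ W1, W2 @ b1 + b2
out_one_layer = W @ x + b

print(np.allclose(out_two_layers, out_one_layer))   # True
```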

  The following are several common activation functions.

2.1 Sigmoid function

$$S(x) = \frac{1}{1 + e^{-x}}$$
  When the input to the sigmoid function is greater than about 5, its output is close to 1; when the input is less than about -5, its output is close to 0. Input values are compressed into the interval (0, 1).
  Features: the output is always greater than 0, and the function is not symmetric about the origin.
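  The squashing behavior is easy to verify numerically; the sample inputs below are chosen only for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for v in (-5.0, 0.0, 5.0):
    print(v, sigmoid(v))   # ≈ 0.0067, 0.5, 0.9933 — squashed into (0, 1)
```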

2.2 tanh function

  tanh is one of the hyperbolic functions: the hyperbolic tangent. In mathematics, the hyperbolic tangent "tanh" is derived from the fundamental hyperbolic functions, the hyperbolic sine and hyperbolic cosine.
$$\tanh x = \frac{\sinh x}{\cosh x} = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$
  When the input to the tanh function is greater than about 2.5, its output is close to 1; when the input is less than about -2.5, its output is close to -1. Input values are compressed into the interval (-1, 1).
  Features: the output can be positive or negative, and the function is symmetric about the origin.
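  The same kind of check for tanh, using NumPy's built-in np.tanh:

```python
import numpy as np

for v in (-2.5, 0.0, 2.5):
    print(v, np.tanh(v))   # ≈ -0.9866, 0.0, 0.9866 — squashed into (-1, 1)
```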

2.3 ReLU function

  In the usual sense, the ReLU function refers to the ramp function in mathematics, that is,
$$f(x) = \max(0, x)$$
  When the input to the ReLU function is greater than 0, the input is passed through unchanged; when the input is less than 0, the output is 0.

  In a neural network, ReLU defines the nonlinear output of a neuron after its linear transformation. In other words, for the input vector coming from the previous layer, a neuron using the ReLU activation passes the following value to the next layer of neurons, or uses it as the output of the whole network (depending on where the neuron sits in the network structure):
$$\max(0, w^T x + b)$$
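  A tiny illustrative ReLU neuron in NumPy (the inputs and weights are assumed values):

```python
import numpy as np

def relu_neuron(x, w, b):
    """ReLU neuron: max(0, w^T x + b)."""
    return np.maximum(0.0, w @ x + b)

print(relu_neuron(np.array([1.0, -2.0]), np.array([0.5, 0.3]), 0.4))   # w^T x + b =  0.3 -> 0.3
print(relu_neuron(np.array([1.0, -2.0]), np.array([0.5, 0.3]), -0.4))  # w^T x + b = -0.5 -> 0.0
```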

2.4 Leaky ReLU function

  When the input value is negative, the gradient of Leaky ReLU is a small constant instead of 0; when the input value is positive, Leaky ReLU coincides with the ordinary ramp function. That is,
$$f(x) = \begin{cases} x & \text{if } x > 0 \\ \lambda x & \text{if } x \le 0 \end{cases}$$
  Compared with the ReLU function, the Leaky ReLU function does not set the output directly to 0 when the input is less than 0; instead it scales the input down by a factor of 10 (i.e., $\lambda = 0.1$).
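  An illustrative NumPy version, using $\lambda = 0.1$ as described above:

```python
import numpy as np

def leaky_relu(x, lam=0.1):
    """x if x > 0, otherwise lam * x."""
    return np.where(x > 0, x, lam * x)

print(leaky_relu(np.array([-3.0, -0.5, 0.0, 2.0])))   # [-0.3, -0.05, 0.0, 2.0]
```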

3. Loss function

  The loss function is used to evaluate how close the value computed by the model is to the true value: the smaller the loss, the better the robustness of the model. The core purpose, of course, is to optimize the model parameters. In addition, the choice of loss function should be analyzed case by case for the specific problem. The calculation formulas of several common loss functions are given below.

  ◎ L2 loss function: $L(y', y) = \sum\limits_{i = 1}^n (y'_i - y_i)^2$

  ◎ Mean squared error loss function:
$$L(y', y) = \begin{cases} \dfrac{1}{2n}\sum\limits_{i = 1}^n (y'_i - y_i)^2 & \text{multiple samples} \\[3mm] \dfrac{1}{2}(y'_i - y_i)^2 & \text{single sample} \end{cases}$$

  ◎ Cross-entropy loss function: $L(y', y) = -\left[\, y \log y' + (1 - y)\log(1 - y') \,\right]$

  In the above formulas, $y'$ is the value calculated by the model and $y$ is the true value.
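  For illustration, these three losses can be written in a few lines of NumPy (the sample predictions below are assumed values):

```python
import numpy as np

def l2_loss(y_pred, y_true):
    return np.sum((y_pred - y_true) ** 2)

def mse_loss(y_pred, y_true):
    n = len(y_true)
    return np.sum((y_pred - y_true) ** 2) / (2 * n)

def cross_entropy_loss(y_pred, y_true):
    # y_pred must lie in (0, 1), e.g. the output of a sigmoid
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.7])
print(l2_loss(y_pred, y_true), mse_loss(y_pred, y_true), cross_entropy_loss(y_pred, y_true))
```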

4. Gradient Descent

  Gradient descent is a feedback-style (backward) calculation; it embodies the idea of "using the error to correct the error" and is the core of the iterative update process of a neural network.
  ◎ Iterative update:
$$w_1 = w_0 - \eta \frac{dL(w)}{dw}, \qquad b_1 = b_0 - \eta \frac{dL(b)}{db}$$
  Here $w_0$ and $b_0$ are the current values, $\eta$ is the step size (a fixed value), and $w_1$ is the value obtained by gradient descent; the iteration moves $w$ toward the value at which $L$ reaches its extremum.
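  As a hedged illustration of the update rule, the sketch below minimizes a simple quadratic loss $L(w) = (w - 3)^2$, whose minimum at $w = 3$ makes the behavior easy to verify:

```python
# Gradient descent on L(w) = (w - 3)^2, whose minimum is at w = 3
w = 0.0          # initial value w_0
eta = 0.1        # step size (learning rate)

for step in range(100):
    grad = 2 * (w - 3)      # dL/dw
    w = w - eta * grad      # w_{k+1} = w_k - eta * dL/dw

print(w)   # converges close to 3
```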

  ◎ When computing the gradient of the loss function, the chain rule is required:
$$\frac{dL(a, y)}{dw} = \frac{dL(a, y)}{da} \cdot \frac{da}{dz} \cdot \frac{dz}{dw}$$
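  As an illustrative example (not part of the original derivation), applying this chain rule to a single sigmoid neuron with the cross-entropy loss, where $a = \delta(z)$ and $z = w^T x + b$, yields $\frac{dL}{dw} = (a - y)\,x$; the sketch below checks the analytic gradient against a finite-difference estimate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b, x, y):
    a = sigmoid(w @ x + b)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

x = np.array([0.5, -1.0, 2.0])   # assumed input
y = 1.0                          # assumed true label
w = np.array([0.1, 0.2, -0.3])
b = 0.05

# Chain rule: dL/dw = dL/da * da/dz * dz/dw = (a - y) * x
a = sigmoid(w @ x + b)
grad_analytic = (a - y) * x

# Numerical check by central finite differences
eps = 1e-6
grad_numeric = np.array([
    (loss(w + eps * np.eye(3)[i], b, x, y) - loss(w - eps * np.eye(3)[i], b, x, y)) / (2 * eps)
    for i in range(3)
])
print(np.allclose(grad_analytic, grad_numeric))   # True
```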
  The above is an introduction to the single-layer neural network, which will be continuously updated and improved~~~


Origin blog.csdn.net/m0_58807719/article/details/128136998