Deep Learning | 3 | The Right Way to Approach the Multi-Layer Perceptron

Single-Layer Perceptron

The essential difference between perceptron and linear regression: the output is different

  • The output of linear regression is a continuous real value

  • The output of the perceptron is a discrete class {0, 1} or {-1, 1}

Therefore, the perceptron can be regarded as a binary classifier: an output of 0 indicates the first category, and an output of 1 indicates the second category.

Perceptron Training Method

  1. Initialize $\pmb{w}$ and $b$
  2. Repeat:
  3.     if $y_i \times (\langle \pmb{w}, \pmb{x}_i \rangle + b) \leq 0$, then:
  4.         $\pmb{w} \leftarrow \pmb{w} + y_i\pmb{x}_i,\quad b \leftarrow b + y_i$
  5. Until all samples are correctly classified

In the above process, the loss function can be written as:

$$\ell=\max(0,\ -y\langle\pmb{w},\pmb{x}\rangle)$$
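
A minimal sketch of this training procedure in PyTorch (assuming labels $y_i \in \{-1, +1\}$ and linearly separable data; the function name train_perceptron and the max_epochs cap are illustrative, not part of the original algorithm):

import torch

def train_perceptron(X, y, max_epochs=100):
    # X: (n, d) float features, y: (n,) labels in {-1, +1}
    w = torch.zeros(X.shape[1])
    b = torch.tensor(0.)
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            # misclassified (or on the boundary) if y_i * (<w, x_i> + b) <= 0
            if yi * (torch.dot(w, xi) + b) <= 0:
                w += yi * xi        # w <- w + y_i x_i
                b += yi             # b <- b + y_i
                mistakes += 1
        if mistakes == 0:           # stop once every sample is correctly classified
            break
    return w, b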

AI's First Crisis: The XOR Problem

Perceptrons cannot solve the XOR problem: the two XOR classes cannot be separated by a single linear decision boundary.

This is where the multi-layer perceptron comes into play.

Multi-Layer Perceptron

One of the motivations for the multi-layer perceptron was precisely to solve the XOR problem.

The idea is to first perform one classification on the inputs, perform another classification separately, and then classify the two intermediate results; this composition can solve the XOR problem.
Intuitively, the defining feature of the multi-layer perceptron is the extra hidden layer.
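
As a concrete illustration with hand-picked (not learned) weights: one unit computes OR, another computes NAND, and a third unit combines their outputs with an AND, which yields XOR:

import torch

def step(z):
    # hard threshold: 1 if z > 0, else 0
    return (z > 0).float()

x = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])

# first stage: two linear classifiers applied to the raw inputs
h1 = step(x @ torch.tensor([1., 1.]) - 0.5)    # OR(x1, x2)
h2 = step(x @ torch.tensor([-1., -1.]) + 1.5)  # NAND(x1, x2)

# second stage: classify the two intermediate results
xor = step(h1 + h2 - 1.5)                      # AND(h1, h2)
print(xor)                                     # tensor([0., 1., 1., 0.])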

Given an input of dimension m and a required output vector (category) of dimension n, an intuitive construction with a hidden layer of dimension k is:

$$\left[\begin{matrix} h_1 \\ \vdots \\ h_k \end{matrix}\right]=\left[\begin{matrix} w_{11}^{(0)} & \cdots & w_{1m}^{(0)} \\ \vdots & \ddots & \vdots \\ w_{k1}^{(0)} & \cdots & w_{km}^{(0)} \end{matrix}\right]\cdot \left[\begin{matrix} x_1 \\ \vdots \\ x_m \end{matrix}\right] + \left[\begin{matrix} b_1^{(0)} \\ \vdots \\ b_k^{(0)} \end{matrix}\right]$$

$$\left[\begin{matrix} o_1 \\ \vdots \\ o_n \end{matrix}\right]=\left[\begin{matrix} w_{11}^{(1)} & \cdots & w_{1k}^{(1)} \\ \vdots & \ddots & \vdots \\ w_{n1}^{(1)} & \cdots & w_{nk}^{(1)} \end{matrix}\right]\cdot \left[\begin{matrix} h_1 \\ \vdots \\ h_k \end{matrix}\right] + \left[\begin{matrix} b_1^{(1)} \\ \vdots \\ b_n^{(1)} \end{matrix}\right]$$

Written in matrix form, the above is

$$\pmb{h}=\pmb{W}^{(0)}\pmb{x}+\pmb{b}^{(0)}$$

$$\pmb{o}=\pmb{W}^{(1)}\pmb{h}+\pmb{b}^{(1)}$$

But in fact, this is still a single-layer network.
Substituting the first equation into the second gives:

$$\pmb{o}=\pmb{W}^{(1)}(\pmb{W}^{(0)}\pmb{x}+\pmb{b}^{(0)})+\pmb{b}^{(1)}$$

Then, after expansion, it can be written as:

$$\pmb{o}=\pmb{W}^{(1)}\pmb{W}^{(0)}\pmb{x}+\pmb{W}^{(1)}\pmb{b}^{(0)}+\pmb{b}^{(1)}$$

where $\pmb{W}^{(0)}\in\mathbb{R}^{k\times m}$, $\pmb{W}^{(1)}\in\mathbb{R}^{n\times k}$, $\pmb{b}^{(0)}\in\mathbb{R}^{k\times 1}$, $\pmb{b}^{(1)}\in\mathbb{R}^{n\times 1}$.

Then $\pmb{W}=\pmb{W}^{(1)}\pmb{W}^{(0)}\in\mathbb{R}^{n\times m}$ and $\pmb{b}=\pmb{W}^{(1)}\pmb{b}^{(0)}+\pmb{b}^{(1)}\in\mathbb{R}^{n\times 1}$.

Essentially, the formula can be written as

$$\pmb{o}=\pmb{W}\pmb{x}+\pmb{b}$$

Therefore, it is still a single-layer neural network in essence.

The root cause of this problem is the lack of nonlinearity as signals are passed from one layer of neurons to the next.

Multiplying an input by several matrices in sequence is equivalent to multiplying it by a single matrix.

This is because matrix multiplication is a linear transformation, and the composition of linear transformations is itself a linear transformation.

The basic form of a linear transformation:

$$\pmb{y}=\pmb{W}\pmb{x}$$
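
A quick numerical check of this collapse (a sketch with arbitrarily chosen dimensions m, k, n):

import torch
from torch import nn

m, k, n = 4, 8, 3                 # hypothetical input, hidden, output dimensions
layer0 = nn.Linear(m, k)          # h = W0 x + b0
layer1 = nn.Linear(k, n)          # o = W1 h + b1

# collapse the two linear layers into a single one: o = W x + b
W = layer1.weight @ layer0.weight                # shape (n, m)
b = layer1.weight @ layer0.bias + layer1.bias    # shape (n,)

x = torch.randn(m)
print(torch.allclose(layer1(layer0(x)), W @ x + b, atol=1e-6))  # True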

The way to change this situation is to add non-linear transformations to the model.

Adding a nonlinear transformation: the activation function

Sigmoid function

$$\text{sigmoid}(x)=\frac{1}{1+\exp(-x)}$$

Plotting the function:

import os
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"  # work around duplicate OpenMP runtimes on some setups

import torch
import matplotlib.pyplot as plt

plt.figure(figsize=(2, 2))

# requires_grad=True so that we can also plot the gradient later via backward()
x = torch.arange(-5, 5, 0.1, requires_grad=True)
y = x.sigmoid()
plt.scatter(x.detach().numpy(), y.detach().numpy())
plt.show()


Under what circumstances is it appropriate to use the Sigmoid activation function?

  • The output of the sigmoid function ranges from 0 to 1, so it naturally bounds (normalizes) each neuron's output.
  • For models whose output is a predicted probability, sigmoid is a natural fit, since probabilities also lie between 0 and 1.
  • The gradient is smooth, which avoids "jumping" output values.
  • The function is differentiable everywhere, so the slope of the sigmoid curve can be computed at any point.
  • It gives clear predictions for inputs far from zero, i.e. outputs very close to 0 or 1.

The shortcomings of the Sigmoid activation function:

  • Vanishing gradient: the sigmoid curve flattens as its output approaches 0 or 1, so the gradient there approaches 0. During backpropagation, neurons whose output is close to 0 or 1 (so-called saturated neurons) receive almost no gradient, so their weights are barely updated, and the weights of neurons connected to them are updated very slowly as well. If a large network contains many saturated sigmoid neurons, backpropagation effectively stalls.
  • Not zero-centered: the sigmoid output is always greater than 0, so the inputs of the neurons in the next layer acquire a bias shift, which further slows down the convergence of gradient descent.
  • Computationally expensive: compared with other nonlinear activation functions, the exp() call is relatively expensive and slow to evaluate.

# backpropagate to obtain d(sigmoid)/dx at every point of x
y.sum().backward()
plt.figure(figsize=(3, 3))
plt.scatter(x.detach().numpy(), x.grad.detach().numpy())
plt.show()
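
A quick numeric check of the saturation described above (a sketch; the gradient of sigmoid is $\sigma(x)(1-\sigma(x))$, which peaks at 0.25 and collapses toward 0 for large $|x|$):

import torch

for v in [0.0, 2.0, 5.0, 10.0]:
    t = torch.tensor(v, requires_grad=True)
    torch.sigmoid(t).backward()
    print(v, t.grad.item())   # roughly 0.25, 0.105, 0.0066, 4.5e-05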


Tanh function

$$\text{tanh}(x)=\frac{1-\exp(-2x)}{1+\exp(-2x)}$$

y = x.tanh()
plt.figure(figsize=(3, 3))
plt.scatter(x.detach().numpy(), y.detach().numpy())
plt.show()


The shortcomings of tanh:

  • Like sigmoid, tanh also suffers from vanishing gradients: it saturates when x is large in magnitude, which likewise "kills" the gradient there.

Note: in typical binary classification problems, tanh is often used for the hidden layers and sigmoid for the output layer, but this is not fixed and should be tuned to the specific problem.

x.grad.zero_()  # clear the gradient left over from the sigmoid example
y.sum().backward()
plt.figure(figsize=(3, 3))
plt.scatter(x.detach().numpy(), x.grad.detach().numpy())
plt.show()


ReLU function

$$\text{relu}(x)=\begin{cases} 0, & x \leq 0 \\ x, & x > 0 \end{cases}$$

y = x.relu()
plt.figure(figsize=(3, 3))
plt.scatter(x.detach().numpy(), y.detach().numpy())
plt.show()


x.grad.zero_()  # clear the tanh gradients before backpropagating through relu
y.sum().backward()
plt.figure(figsize=(3, 3))
plt.scatter(x.detach().numpy(), x.grad.detach().numpy())
plt.show()

In practice, the ReLU function has a more concise expression:

$$\text{relu}(x)=\max(x, 0)$$

The ReLU function is a popular activation function in deep learning. Compared with the sigmoid function and tanh function, it has the following advantages:

  • When the input is positive, the derivative is 1, which alleviates the vanishing-gradient problem to some extent and speeds up the convergence of gradient descent.
  • It is much faster to compute: ReLU involves only a comparison, with no exponentials, so it is cheaper than sigmoid and tanh.
  • It is considered biologically plausible (Biological Plausibility), e.g. unilateral inhibition and a wide excitation boundary (the level of excitation can be arbitrarily high).

Disadvantages of the ReLU function:

  • Dead ReLU problem: ReLU outputs zero for every negative input. This is not an issue in the forward pass, but during backpropagation the gradient for negative inputs is exactly zero. If, after an unfortunate parameter update, a ReLU neuron is never activated on any training example, its parameters receive zero gradient and are never updated again, so the neuron can never recover. This can happen in the first hidden layer or in any other hidden layer.
  • Not zero-centered: like sigmoid, the output of ReLU is 0 or positive, which introduces a bias shift into the next layer and reduces the efficiency of gradient descent.

Each neuron is configured with an activation function.

Its physical meaning: when the neuron's input is large enough (exceeds a threshold), the neuron is activated and its signal is passed on to the next layer; otherwise nothing is passed on.

In fact, this physical meaning has a neuroscience explanation.

Therefore, the overall form of a multi-layer perceptron is:

$$\pmb{h}=\sigma(\pmb{W}^{(0)}\pmb{x}+\pmb{b}^{(0)})$$

$$\pmb{o}=\text{softmax}(\pmb{W}^{(1)}\pmb{h}+\pmb{b}^{(1)})$$
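
A minimal PyTorch sketch of this overall form (the dimensions m, k, n are hypothetical; $\sigma$ is taken to be ReLU here):

import torch
from torch import nn

m, k, n = 4, 8, 3                  # hypothetical input, hidden, output dimensions

mlp = nn.Sequential(
    nn.Linear(m, k),               # W0 x + b0
    nn.ReLU(),                     # the nonlinearity sigma(.)
    nn.Linear(k, n),               # W1 h + b1
)

x = torch.randn(m)
o = torch.softmax(mlp(x), dim=-1)  # class probabilities summing to 1
print(o)

In practice the explicit softmax is usually folded into the loss function (e.g. nn.CrossEntropyLoss expects the raw logits mlp(x)).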

Summary

A multi-layer perceptron is a multi-layer classification model based on neural networks.

Its output is passed through softmax, so the outputs can be interpreted as class probabilities.

Its multiple layers are only meaningful because of the nonlinearity of the activation functions between them.

The most commonly used activation function is ReLU, because it is the cheapest to compute.

Source: blog.csdn.net/weixin_51672245/article/details/131128833