Feedforward Neural Network (Multilayer Perceptron) Basics

1. Introduction to Neural Networks

Definition of a neural network: Artificial Neural Networks (ANNs), also referred to as neural networks (NNs) or connectionist models, are algorithmic mathematical models that imitate the behavioral characteristics of biological neural networks and perform distributed, parallel information processing. Depending on the complexity of the system, such a network achieves its information-processing goal by adjusting the interconnections among a large number of internal nodes.

1.1 Biological background of neural networks

  • The working mechanism of nerve cells
    Neuron doctrine (neuron theory): nerve cells are independent of one another and transmit signals to each other in some form.

  • The neuron doctrine in more detail:
    (1) The nervous system is composed of many independent nerve cells ("neurons"), connected through contact points between them. (2) Every neuron has an asymmetric, polarized structure: one side bears a single long fibrous protrusion, the "axon", while the other bears many "dendrites". Dendrites receive input from other neurons; the axon is the output structure through which a neuron transmits information over long distances. (3) Based on the structural changes seen in the development, degeneration, and regeneration of neural tissue, Cajal first proposed the concept of plasticity of neural connections. (4) Dendrites receive information, the trigger zone integrates the potentials and generates nerve impulses, and the synapses at the axon terminals form the output region that passes the signal on to the next neuron. The human brain contains nearly 86 billion neurons, and each neuron forms on the order of a thousand synapses.


  • Assumed characteristics of biological neural networks:
    (1) Each neuron is an information-processing unit with multiple inputs and a single output;
    (2) Neuron inputs are of two kinds: excitatory and inhibitory;
    (3) Neurons exhibit spatial integration and threshold behavior;
    (4) There is a fixed time delay between a neuron's input and its output, determined mainly by synaptic delay.

1.2 Artificial neurons and perceptrons

In 1943, the psychologist W.S. McCulloch and the mathematical logician W. Pitts proposed an abstract, simplified model built on the structure and working principle of biological neurons: the M-P model. Models of this kind formalize a neuron as an activation function applied to a weighted sum of input signals.


An M-P neuron receives input signals $x_i$ transmitted from $n$ other neurons; each signal arrives through a weighted connection $w_i$. The neuron compares its total input with its threshold $\theta$ and then passes the result through an activation function $f$ to produce its output:
$$y = f\Big(\sum_{i=1}^{n} w_i x_i - \theta\Big)$$
where $x_i$ is the signal from the $i$-th neuron, $w_i$ the corresponding connection weight, $\theta$ the neuron's threshold, and $f$ the activation function (also called a transfer function), which is usually continuous and differentiable.

  • Whether a neuron fires depends on the threshold $\theta$: only when the weighted sum of its inputs exceeds $\theta$ is the neuron activated and does it emit a pulse; otherwise it produces no output signal.
  • When a neuron is activated, it is said to be in an active or excited state; otherwise it is said to be in an inhibited state.
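To make the M-P model concrete, here is a minimal Python sketch (not from the original text; NumPy and the specific weights and threshold are illustrative assumptions):

```python
import numpy as np

def mp_neuron(x, w, theta):
    """M-P neuron: outputs 1 if the weighted input sum reaches the threshold."""
    return 1 if np.dot(w, x) - theta >= 0 else 0

# Illustrative choice: two unit-weight excitatory inputs and threshold 1.5,
# which makes the neuron behave like a logical AND gate.
w, theta = np.array([1.0, 1.0]), 1.5
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, "->", mp_neuron(np.array(x), w, theta))
# Prints 0, 0, 0, 1: the neuron fires only when both inputs are active.
```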

1.3 Commonly used activation functions

1.3.1 Linear Function

$$f(x) = kx + c$$

1.3.2 Ramp Function

$$f(x) = \begin{cases} T, & x > c \\ kx, & |x| \leqslant c \\ -T, & x < -c \end{cases}$$

1.3.3 Threshold Function

$$f(x) = \begin{cases} 1, & x \geqslant c \\ 0, & x < c \end{cases}$$
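A short Python sketch of these three piecewise activations; the parameter names k, c, T follow the formulas above, and the particular values are illustrative only:

```python
import numpy as np

def linear(x, k=1.0, c=0.0):
    return k * x + c

def ramp(x, k=1.0, c=1.0, T=1.0):
    # T above c, linear in between, -T below -c (T = k*c keeps it continuous)
    return np.where(x > c, T, np.where(x < -c, -T, k * x))

def threshold(x, c=0.0):
    # 1 when x >= c, otherwise 0
    return np.where(x >= c, 1.0, 0.0)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(ramp(x))       # [-1.  -0.5  0.   0.5  1. ]
print(threshold(x))  # [0. 0. 1. 1. 1.]
```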

1.3.4 Sigmoid Function

The sigmoid function is a common S-shaped function in biology, also known as the S-shaped growth curve. In information science, because it is monotonically increasing and its inverse is also monotonically increasing, the sigmoid function is often used as a neural-network activation function, mapping a variable into the interval (0, 1).

The sigmoid function is also called the logistic function. It is used for the output of hidden-layer neurons, and its range is (0, 1): 0 can be read as "inhibited" and 1 as "excited". Because it maps any real number into (0, 1), it can be used for binary classification, and it works well when the features are intricate or their differences are not especially large. As an activation function, sigmoid has the following advantages and disadvantages:

  • Advantages: smooth and easy to differentiate.
  • Disadvantages: computing the activation is relatively expensive, since finding the error gradient during backpropagation involves the exponential and division; and the gradient vanishes easily during backpropagation, which can prevent deep networks from finishing training.

Definition of the sigmoid function:
$$S(x) = \frac{1}{1 + e^{-x}}$$
Differentiating with respect to $x$:
$$S'(x) = \frac{e^{-x}}{(1 + e^{-x})^2} = S(x)\,(1 - S(x))$$
(Figure: graph of the sigmoid function.)
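A minimal sketch of the sigmoid and its derivative, with a finite-difference spot-check of the identity $S'(x) = S(x)(1 - S(x))$ (NumPy assumed):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # S'(x) = S(x)(1 - S(x))

# Compare the closed form against a central finite difference.
x, h = 0.7, 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
print(abs(numeric - sigmoid_grad(x)) < 1e-8)  # True
```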

1.3.5 Hyperbolic tangent function (tanh function)

$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$
(Figure: graph of the tanh function.)



The sigmoid and tanh functions were the two most widely used activation functions in the early stages of research. Both are S-shaped saturating functions: as the input tends to positive or negative infinity, the gradient approaches zero, producing the vanishing-gradient (gradient dispersion) phenomenon. The sigmoid's output is always positive rather than zero-centered, which biases weight updates toward a single direction and slows convergence. The tanh function is an improvement on sigmoid: it is symmetric about zero, converges faster, and is less prone to oscillating loss values, but it still does not solve the vanishing-gradient problem. Both functions involve exponentials and are therefore relatively expensive to compute. The softsign function is an improvement on tanh: it is also an S-shaped saturating function centered at zero with range (−1, 1).
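The saturation just described is easy to verify numerically. The sketch below (illustrative values; it uses the closed-form derivatives $S'(x) = S(x)(1 - S(x))$ and $\tanh'(x) = 1 - \tanh^2(x)$) shows both gradients collapsing toward zero as $|x|$ grows:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for x in [0.0, 2.0, 5.0, 10.0]:
    s_grad = sigmoid(x) * (1 - sigmoid(x))  # S'(x)
    t_grad = 1 - np.tanh(x) ** 2            # tanh'(x)
    print(f"x={x:5.1f}  sigmoid'={s_grad:.2e}  tanh'={t_grad:.2e}")
# By x=10 both derivatives are vanishingly small: inputs in the saturated
# regions contribute almost no gradient during backpropagation.
```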

A related question worth pondering: why does the logistic regression (LR) model use the sigmoid function, and what is the mathematical principle behind it?

1.3.6 ReLU (Rectified Linear Unit)

In modern neural networks, the default recommendation is the rectified linear unit (ReLU), defined by the activation function $g(z) = \max\{0, z\}$.

In general, the rectifier refers to the ramp function $f(x) = \max\{0, x\}$ in mathematics.
In a neural network, the rectifier serves as the neuron's activation function, defining the nonlinear output that follows the neuron's linear transformation $w^T x + b$. In other words, for an input vector $x$ arriving from the previous layer, a neuron using the ReLU activation outputs $\max(0, w^T x + b)$.
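A one-line ReLU sketch with an illustrative neuron (the values of w, b, and x here are made up for demonstration):

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

# A ReLU neuron computes max(0, w^T x + b).
w, b = np.array([0.5, -1.0]), 0.1
x = np.array([2.0, 0.3])
print(relu(w @ x + b))  # max(0, 1.0 - 0.3 + 0.1) = 0.8
```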



2. Single-layer perceptron (single-layer neural network, linear classifier)

2.1 Single layer perceptron model

In 1957, Frank Rosenblatt combined the M-P model with Hebb's learning rule to invent the perceptron: a two-layer network (an input layer plus an output layer) with a structure similar to the M-P model, generally regarded as the simplest artificial neural network.

Differences between the perceptron and the M-P model: the inputs are not restricted to discrete 0/1 values, and the activation function is not necessarily a threshold function.

The organizational structure of the perceptron model is as follows:

(Figure: structure of the perceptron model.)

The corresponding simplified diagram is:

(Figure: simplified perceptron diagram.)

After further development and refinement, it took on the classic form in common use today. Because it has only one layer, it is also called a single-layer perceptron, shown below:

(Figure: the classic single-layer perceptron.)


Compared with the M-P model, the perceptron introduces a bias $b$. In formula form:
$$f(x) = \mathrm{sign}(wx + b)$$
where $\mathrm{sign}(x)$ is the activation function
$$\mathrm{sign}(x) = \begin{cases} +1, & x \geqslant 0 \\ -1, & x < 0 \end{cases}$$
corresponding to the two states "activated" and "inhibited".
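A minimal sketch of this decision function (the weights and bias are illustrative, and sign(0) is taken as +1, matching the definition above):

```python
import numpy as np

def perceptron_predict(x, w, b):
    """f(x) = sign(w·x + b), with sign(0) taken as +1."""
    return 1 if np.dot(w, x) + b >= 0 else -1

w, b = np.array([2.0, -1.0]), -0.5
print(perceptron_predict(np.array([1.0, 0.0]), w, b))  # 2.0 - 0.5 = 1.5 >= 0 -> +1
print(perceptron_predict(np.array([0.0, 2.0]), w, b))  # -2.0 - 0.5 < 0      -> -1
```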


2.2 Geometric interpretation of the perceptron

The equation $wx + b = 0$ defines a hyperplane in $n$-dimensional space, where $w$ is the normal vector of the hyperplane, $b$ is its intercept, and $x$ is a point in the space.

  • When $x$ lies on the positive side of the hyperplane, $wx + b > 0$ and the perceptron is activated;
  • When $x$ lies on the negative side of the hyperplane, $wx + b < 0$ and the perceptron is inhibited.

So, from a geometric point of view, the perceptron is a hyperplane in $n$-dimensional space that divides the feature space into two parts.


2.3 Single layer perceptron and linear classification task

Because of this separating-hyperplane property, perceptrons are often used to classify data.

First, a set of training data is given; then the model parameters $w$ and $b$ are determined from the training data; finally, the learned model is used to predict the category of new data.

Suppose the given training data is $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, where $x_i \in X = \mathbb{R}^n$, $y_i \in \{+1, -1\}$, $i = 1, 2, \ldots, N$.

The goal of learning is to find a hyperplane that can separate positive and negative instances in the training data.

Solution method: parameter initialization followed by gradient-descent updates (a concrete sketch follows below).

The solution obtained is not an analytical solution in the mathematical sense but an optimal solution in the engineering sense: it is not unique, and any parameters that achieve a sufficiently good result are acceptable.
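As a concrete sketch, here is the classic perceptron learning rule, one simple instance of the "initialize, then update" procedure described above (the toy dataset is invented for illustration):

```python
import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=100):
    """Perceptron learning rule: nudge w and b toward each misclassified
    point. Converges when the data is linearly separable."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:  # misclassified (or on boundary)
                w += lr * yi * xi
                b += lr * yi
                errors += 1
        if errors == 0:  # every training point classified correctly
            break
    return w, b

# Toy linearly separable data: +1 above the line x1 + x2 = 1, -1 below it.
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.2, 0.1], [0.9, 0.8]])
y = np.array([-1, 1, -1, 1])
w, b = train_perceptron(X, y)
print(np.sign(X @ w + b))  # [-1.  1. -1.  1.], matching y
```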



2.4 Defects of single-layer perceptron

Minsky proved in the 1969 book "Perceptrons" (with Papert) that a single-layer perceptron cannot solve the XOR problem.




3. Multilayer perceptron (feedforward neural network)

Multilayer perceptron (Multi-Layer Perceptron, MLP): introduce one or more hidden layers on top of a single-layer network so that the network has multiple layers; such a network is called a multilayer perceptron, or a feedforward neural network.
In theory, a multilayer network can approximate any complex function.


(Figure: structure of a multilayer perceptron.)

As the figure above shows, adjacent layers of a multilayer perceptron are fully connected. The bottom layer is the input layer, the middle layers are hidden layers, and the last layer is the output layer.

The MLP does not limit the number of hidden layers or the number of neurons in the output layer, so we can choose an appropriate number of hidden layers according to our needs.

The Expressive Power of Multilayer Perceptrons: Solving the Exclusive OR Problem (XOR).
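To make this expressive power concrete, here is a sketch of a two-layer network that computes XOR with hand-chosen weights and threshold units: one hidden unit computes OR, the other computes AND, and the output fires exactly when OR is true but AND is false. (These weights are illustrative; many other choices work.)

```python
import numpy as np

def step(z):
    return (z >= 0).astype(float)  # threshold activation

def xor_mlp(x):
    # Hidden layer: h1 = OR(x1, x2), h2 = AND(x1, x2)
    W1 = np.array([[1.0, 1.0],
                   [1.0, 1.0]])
    b1 = np.array([-0.5, -1.5])
    h = step(W1 @ x + b1)
    # Output layer: fires when OR holds but AND does not -> XOR
    w2, b2 = np.array([1.0, -1.0]), -0.5
    return step(w2 @ h + b2)

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, "->", int(xor_mlp(np.array(x, dtype=float))))
# [0, 0] -> 0, [0, 1] -> 1, [1, 0] -> 1, [1, 1] -> 0
```

No single-layer perceptron can produce this truth table, but one hidden layer suffices.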




Why use an activation function?

  • Without an activation function, the output of each layer is a linear function of the previous layer's input, so no matter how many layers the network has, its output is merely a linear combination of the inputs (see the numerical sketch after this list).
  • With an activation function, nonlinearity is introduced into the neurons, so the network can approximate essentially any nonlinear function and can therefore be applied to many more nonlinear models.
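A quick numerical sketch of the first point: stacking two purely linear layers is exactly equivalent to one linear layer (the matrix shapes here are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
x = rng.normal(size=3)

two_linear_layers = W2 @ (W1 @ x)  # no activation between the layers
one_linear_layer = (W2 @ W1) @ x   # a single equivalent linear layer
print(np.allclose(two_linear_layers, one_linear_layer))  # True
```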

The activation function needs to have the following properties:

  • Nonlinear, continuous, and differentiable (non-differentiability at a small number of points is allowed). A differentiable activation function allows the network parameters to be learned directly with numerical optimization methods.
  • The activation function and its derivative should be as simple as possible, which helps the computational efficiency of the network.
  • The derivative of the activation function should take values in an appropriate range, neither too large nor too small; otherwise the efficiency and stability of training suffer.



4. Backpropagation

The feedforward neural network has been introduced above, so a question arises: how should such a network be optimized?
In 1986, Rumelhart and McClelland refined and popularized the backpropagation (BP) algorithm for optimizing neural networks, which is why neural networks are often called BP neural networks.

  • Forward propagation computes the output from the training data and the weight parameters;
  • Backpropagation uses the chain rule to compute the gradient of the loss function with respect to each parameter, and the parameters are updated according to those gradients.

Note: strictly speaking, backpropagation refers only to the process by which the gradient of the loss flows backward through the network, but the term is now commonly used for the entire training method, which alternates two loops: error propagation and parameter update.
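As a sketch of these two steps working together, here is a tiny one-hidden-layer network trained on XOR with manually derived backpropagation (sigmoid activations and mean squared error; the seed, learning rate, and layer sizes are illustrative, and XOR training with sigmoids can occasionally stall in a poor local minimum, in which case a different seed helps):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(42)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)  # one hidden layer, 4 units
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
lr = 1.0

for epoch in range(5000):
    # Forward propagation
    H = sigmoid(X @ W1 + b1)      # hidden activations
    Y_hat = sigmoid(H @ W2 + b2)  # predictions
    loss = np.mean((Y_hat - Y) ** 2)

    # Backpropagation: apply the chain rule layer by layer
    dY = 2 * (Y_hat - Y) / Y.size         # dL/dY_hat
    dZ2 = dY * Y_hat * (1 - Y_hat)        # through the output sigmoid
    dW2, db2 = H.T @ dZ2, dZ2.sum(axis=0)
    dH = dZ2 @ W2.T                       # gradient flowing back into the hidden layer
    dZ1 = dH * H * (1 - H)                # through the hidden sigmoid
    dW1, db1 = X.T @ dZ1, dZ1.sum(axis=0)

    # Parameter update (plain gradient descent)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(round(loss, 4), np.round(Y_hat.ravel(), 2))  # typically near [0, 1, 1, 0]
```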







