Analysis of a neural network (single layer)


Machine learning models can be implemented in many forms, and neural networks are one of them. Refer to my earlier article on a first acquaintance with machine learning: in its second figure, replace "model" with "neural network" and "machine learning" with "learning rule", because in the context of neural networks the process of determining the model is called a learning rule. Below I briefly introduce some basics of single-layer neural networks to lay the foundation for studying multi-layer neural networks. (The specific, detailed code is not given below; try programming it yourself, and if you find it really difficult you can look directly at the MATLAB neural network programming resource I posted.)

1. Node

As we all know, neural networks were developed by simulating the mechanisms of the brain. The brain relies on neurons and the connections between them to store and recall information; a neural network likewise transmits information through the connections between its nodes, using weight values to simulate the connections between neurons. Since this may not be clear in the abstract, here is a simple example; look at the following figure:
[Figure: a single node with inputs x1, x2, x3, weights w1, w2, w3, bias b, and output y]
The circles in the figure are nodes, and the arrows represent signal flow. $x_1, x_2, x_3$ are the input signals, $w_1, w_2, w_3$ are the weights of the corresponding signals, and $b$ is the bias. The information of a neural network is stored in the form of weights and biases. Each input signal from the outside is multiplied by its weight before reaching the node, and the node forms the weighted sum plus the bias, that is, $v = w_1 x_1 + w_2 x_2 + w_3 x_3 + b$. Finally, the node passes the weighted sum through an activation function and produces an output, that is, $y = \phi(v)$; activation functions come in many forms.
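As a concrete illustration, here is a minimal Python sketch of a single node; the input, weight, and bias values are made up:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def node_output(x, w, b, phi):
    v = np.dot(w, x) + b   # v = w1*x1 + w2*x2 + w3*x3 + b
    return phi(v)          # y = phi(v)

# Example with three made-up inputs and a sigmoid activation
x = np.array([1.0, 0.5, -1.0])
w = np.array([0.2, -0.4, 0.1])
b = 0.05
print(node_output(x, w, b, sigmoid))
```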

2. Layer

Different ways of connecting nodes produce a variety of neural networks. The most commonly used type of neural network uses a layered node structure. Neural networks can be classified by their layers; the usual classification is as follows:

Single-layer neural network: input layer - output layer
Multi-layer neural network:
  Shallow neural network: input layer - single hidden layer - output layer
  Deep neural network: input layer - multiple hidden layers - output layer

In a layered neural network, the signal enters at the input layer, passes through the hidden layers, and leaves the network through the output layer; in this process the signal advances layer by layer. This can be seen clearly in the figure below:
[Figure: a layered network showing signals flowing from the input layer through a hidden layer to the output layer]
Apart from the input layer and the output layer, each node in a hidden layer receives the information passed on by all nodes in the previous layer and passes its output to all nodes in the next layer. One thing must be made clear: except for the input layer, which requires no activation function, every layer needs an activation function to produce its output. There is also an important property here: if a hidden layer uses a linear activation function, that hidden layer becomes ineffective, whereas the output layer can use a linear activation function without any such problem.
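To see why a linear hidden layer is ineffective, note that composing two linear layers yields just another linear layer. A small numpy check, with arbitrary made-up shapes and values:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.random((4, 3))   # weights: input layer (3 nodes) -> hidden layer (4 nodes)
W2 = rng.random((2, 4))   # weights: hidden layer (4 nodes) -> output layer (2 nodes)
x = rng.random(3)

hidden = W1 @ x                 # linear activation: phi(v) = v
two_layer = W2 @ hidden         # output of the two-layer network

single_layer = (W2 @ W1) @ x    # a single layer with the combined weight matrix
print(np.allclose(two_layer, single_layer))  # True: the hidden layer adds nothing
```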

3. Supervised learning of neural networks

For this section I will simply list the steps of supervised learning (a skeleton sketch follows the list):

  1. Initialize the weights with appropriate values
  2. Take an input from the training data, feed it into the neural network, obtain the output, and compare it with the correct output
  3. Adjust the weights to reduce the error
  4. Repeat steps 2 and 3 for all the training data
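A minimal Python skeleton of these four steps; `net_output` and `update_weights` are hypothetical placeholders of my own, not functions from the original resource:

```python
def train(w, X, D, net_output, update_weights, epochs=100):
    # Step 1 is assumed done: w arrives already initialized.
    for _ in range(epochs):                 # step 4: repeat for all training data
        for x, d in zip(X, D):
            y = net_output(w, x)            # step 2: feed the input through the network
            e = d - y                       #         and compare with the correct output
            w = update_weights(w, x, e)     # step 3: adjust the weights to reduce the error
    return w
```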

4. Delta rule

As mentioned earlier, to train a neural network on new information, the weights must change accordingly. A systematic method for modifying the weights according to the given information is called a learning rule. The delta rule is the representative learning rule for single-layer neural networks.
[Figure: a single-layer network illustrating the delta rule]
If an input node contributes to an output node's error, the weight between the two nodes is adjusted in proportion to the input value and the output error. The formula is $w_{ij} = w_{ij} + \alpha \, e_i \, x_j$, where $\alpha$ is called the learning rate and takes values in $(0, 1]$. The learning rate determines how much the weights change at each update. If it is too high, the output lingers around the solution and fails to converge; if it is too low, convergence to the solution is too slow.
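A one-line numpy version of this update, with made-up error and input values; the outer product applies $\alpha \, e_i \, x_j$ to every weight at once:

```python
import numpy as np

def delta_rule_update(w, x, e, alpha=0.1):
    # w_ij <- w_ij + alpha * e_i * x_j for all i, j
    return w + alpha * np.outer(e, x)

# Example: 2 output nodes, 3 input nodes (values made up)
w = np.zeros((2, 3))
x = np.array([1.0, 0.0, 1.0])
e = np.array([0.5, -0.2])   # e_i = d_i - y_i, the output errors
w = delta_rule_update(w, x, e)
```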

5. Generalized delta rule

The delta rule introduced in the previous section is actually a special case, because the delta rule has a broader form. For an arbitrary activation function, the delta rule can be written as $w_{ij} = w_{ij} + \alpha \, \delta_i \, x_j$, where $\delta_i = \phi'(v_i) \, e_i$. If a linear activation function is used, then $\delta_i = e_i$, so the learning rule of the previous section applies only to linear activation functions. Although the weight update formula of the generalized delta rule is more complicated, the basic idea is unchanged: the update is still proportional to the output node's error term and the input node's value.
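For a sigmoid activation, $\phi'(v) = \phi(v)(1 - \phi(v))$, so the generalized update can be sketched as follows; the learning rate is an assumed value:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def generalized_delta_update(w, x, d, alpha=0.1):
    v = w @ x                       # weighted sums at the output nodes
    y = sigmoid(v)
    e = d - y                       # output errors
    delta = y * (1.0 - y) * e       # delta_i = phi'(v_i) * e_i for the sigmoid
    return w + alpha * np.outer(delta, x)

# Example: 1 output node, 3 input nodes (values made up)
w = np.zeros((1, 3))
w = generalized_delta_update(w, np.array([1.0, 0.0, 1.0]), np.array([1.0]))
```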

6. SGD, Batch, Mini-Batch

6.1 SGD

SGD (Stochastic Gradient Descent) computes the error from each individual training example and adjusts the weights immediately. Because SGD adjusts the weights for every data point, the performance of the neural network fluctuates up and down during training. The SGD weight update is $\Delta w_{ij} = \alpha \, \delta_i \, x_j$, which means the delta rule above is itself an SGD-style update.

6.2 Batch

This method computes a weight-update value from the error of every training example and then adjusts the weights using the average of those update values. It uses all the training data but performs only one update per pass. The Batch weight update is $\Delta w_{ij} = \frac{1}{N} \sum_{k=1}^{N} \Delta w_{ij}(k)$, where $\Delta w_{ij}(k)$ is the update computed from the $k$-th training example and $N$ is the number of training examples. Because it averages the weight updates, training with the Batch method takes a lot of time.

6.3 Mini-Batch

This method lies between the two above. Here is a simple example: if 50 arbitrary data points are taken from 200 training data points and the Batch method is applied to those 50 points, then 4 weight adjustments are needed to cover all the data points once. If the number of data points per batch is chosen reasonably, the method combines the advantages of both: the speed of the SGD method and the stability of the Batch method.
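A sketch of the mini-batch loop using the example's numbers (200 points, batches of 50, so 4 updates per pass); the dataset here is fabricated purely for illustration:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# Fabricated dataset: 200 points, 3 inputs each, made-up binary targets
rng = np.random.default_rng(0)
X = rng.random((200, 3))
d = (X[:, 0] > 0.5).astype(float)

w = 2 * rng.random(3) - 1
alpha, batch_size = 0.9, 50

for epoch in range(1000):
    order = rng.permutation(len(X))              # shuffle, then slice into mini-batches
    for start in range(0, len(X), batch_size):   # 4 mini-batches of 50 per pass
        idx = order[start:start + batch_size]
        dW = np.zeros(3)
        for x, target in zip(X[idx], d[idx]):
            y = sigmoid(w @ x)
            delta = y * (1.0 - y) * (target - y)
            dW += alpha * delta * x
        w += dW / len(idx)                       # Batch-style averaged update per mini-batch
```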

7. Implementing the SGD method

Consider a neural network with three input nodes and one output node, with no bias, using the sigmoid function as the activation function. There are four training data points: [0 0 1 0], [0 1 1 0], [1 0 1 1], [1 1 1 1], where the last element of each point is the correct output. Write code implementing the SGD method, train for 10,000 epochs, and compare the final result with the correct outputs.
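The post's reference implementation is in MATLAB; below is a minimal Python sketch of the same experiment. The learning rate and the random initialization are my own assumptions, so the digits will not match the author's exactly:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# Inputs (first three elements of each point) and correct outputs (last element)
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
d = np.array([0, 0, 1, 1], dtype=float)

rng = np.random.default_rng(0)
w = 2 * rng.random(3) - 1            # initialize weights in [-1, 1)
alpha = 0.9                          # assumed learning rate

for epoch in range(10000):
    for x, target in zip(X, d):      # SGD: update immediately for each data point
        y = sigmoid(w @ x)
        delta = y * (1.0 - y) * (target - y)
        w += alpha * delta * x

print(sigmoid(X @ w))
```

With settings like these, the trained output comes out close to the author's reported result: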

y=[0.0102 0.0083 0.9932 0.9917]'

This is nearly identical to the correct output:

y=[0 0 1 1]'

8. Implementing the Batch method

The model considered is the same as in the previous section; the only difference in the program is the method of computing the weight update (a sketch of this variant follows at the end of this section). Training again for 10,000 epochs, the final result is

y=[0.0209 0.0169 0.9863 0.9830]'

This error is clearly larger than with the SGD method. When we train for 40,000 epochs instead,

y=[0.0102 0.0083 0.9932 0.9917]'

The accuracy now matches SGD, but Batch takes more time; that is, the Batch method learns more slowly.
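For comparison, the Batch variant of the same sketch; the only change from the SGD sketch is that per-point updates are accumulated and averaged, then applied once per epoch:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
d = np.array([0, 0, 1, 1], dtype=float)

rng = np.random.default_rng(0)
w = 2 * rng.random(3) - 1
alpha = 0.9                          # assumed learning rate, as before

for epoch in range(40000):
    dW = np.zeros(3)
    for x, target in zip(X, d):
        y = sigmoid(w @ x)
        delta = y * (1.0 - y) * (target - y)
        dW += alpha * delta * x      # accumulate the per-point updates
    w += dW / len(X)                 # apply their average once per epoch

print(sigmoid(X @ w))
```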

9. Comparing the two methods

[Figure: training error curves for the SGD and Batch methods over the same number of epochs]
The figure above clearly shows that, for the same number of training epochs, the SGD method learns noticeably faster than the Batch method.

10. Limitations

Consider the same model as above, but change the correct outputs to [0 1 1 0]. The final result is

y=[0.5297 0.5000 0.4703 0.4409]'

The result is completely wrong, but why? Treat the first three elements of each input as x, y, z coordinates. Since the z coordinate is always 1, we only need to consider the x and y coordinates and plot them, marking the correct outputs in red:
[Figure: the four input points in the x-y plane, labeled with their correct outputs 0, 1, 1, 0]

It can be seen from this figure that dividing the 0 region from the 1 region requires a complicated curve. In contrast, plotting the correct outputs used in the earlier sections,

[Figure: the four input points in the x-y plane, labeled with their correct outputs 0, 0, 1, 1]

dividing the 0 region from the 1 region requires only a straight line. In other words, that was a linearly separable problem, and a single-layer neural network can only solve linearly separable problems. This is why more and more multi-layer neural networks have been developed: multi-layer neural networks do not have this limitation.
