[Ch05-02] Solving Multivariable Linear Regression with a Neural Network

This series of blogs is maintained by the original author on GitHub: https://aka.ms/beginnerAI .
Please don't hesitate to click star: the more stars, the harder the author will work.

5.2 Neural Network Solution

Similar to linear regression with a single feature value, linear regression with multiple variables (multiple feature values) can be viewed as linear fitting in a higher-dimensional space. In our example there are two features, so we no longer fit the points with a straight line, but fit them with a plane.

5.2.1 Neural Network Architecture

We define a single-layer neural network as shown in Figure 5-1. The input layer has two feature values or more; beyond two there is no essential difference. The characteristics of this neural network are:

  1. No hidden layer, only input items and an output layer (the input items do not count as a layer);
  2. The output layer has only one neuron;
  3. The output neuron is linear and is not passed through an activation function, i.e., in the figure below, the value \(z\) obtained from the \(\Sigma\) summation is output directly as \(z\).

Figure 5-1 Single-layer neural network with multiple inputs and one output

Compared with the neuron in the previous chapter, this one merely has one more input, but that is a qualitative change: a neuron can receive multiple inputs simultaneously, and this is the fundamental reason a neural network can handle complex logic.

Input layer

Taking the first sample as an example, a single sample looks like this:

\[ x_1 = \begin{pmatrix} x_{11} & x_{12} \end{pmatrix} = \begin{pmatrix} 10.06 & 60 \end{pmatrix} \]

\[ y_1 = \begin{pmatrix} 302.86 \end{pmatrix} \]

There are 1000 samples in total, each with two feature values, so X is a \(1000 \times 2\) matrix:

\[ X = \begin{pmatrix} x_1 \\ x_2 \\ \dots \\ x_{1000} \end{pmatrix} = \begin{pmatrix} x_{1,1} & x_{1,2} \\ x_{2,1} & x_{2,2} \\ \dots & \dots \\ x_{1000,1} & x_{1000,2} \end{pmatrix} \]

\[ Y = \begin{pmatrix} y_1 \\ y_2 \\ \dots \\ y_{1000} \end{pmatrix}= \begin{pmatrix} 302.86 \\ 393.04 \\ \dots \\ 450.59 \end{pmatrix} \]

\(x_1\) denotes the first sample, \(x_{1,1}\) denotes the first feature value of the first sample, and \(y_1\) is the label value of the first sample.

Weights W and B

Since the input layer has two features and the output layer has one variable, the shape of W is 2x1 and the shape of B is 1x1.

\[ W= \begin{pmatrix} w_1 \\ w_2 \end{pmatrix} \]

\[B=(b)\]

B is a single value because the output layer has only one neuron, so there is only one bias; each neuron has exactly one bias. If there were multiple neurons, each would have its own b value.

Output layer

Since we only want to complete a regression (fitting) task, the output layer has just one neuron. Because the problem is linear, no activation function is used.
\[ \begin{aligned} z &= \begin{pmatrix} x_{11} & x_{12} \end{pmatrix} \begin{pmatrix} w_1 \\ w_2 \end{pmatrix} + (b) \\ &= x_{11} w_1 + x_{12} w_2 + b \end{aligned} \]

Written in matrix form:

\[Z = X\cdot W + B\]
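As a quick sanity check of the shapes in this matrix form, here is a minimal numpy sketch with randomly generated placeholder data (not the real data set):

import numpy as np

# Z = X . W + B: 1000 samples, 2 features, one output per sample
X = np.random.rand(1000, 2)     # hypothetical feature matrix, shape (1000, 2)
W = np.array([[0.5], [2.0]])    # shape (2, 1), placeholder weights
B = np.array([[1.0]])           # shape (1, 1), broadcast over all 1000 rows
Z = np.dot(X, W) + B            # shape (1000, 1): one prediction per sample
print(Z.shape)                  # (1000, 1)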

Loss function

Because this is a linear regression problem, the loss function uses the mean squared error.

\[loss(w,b) = \frac{1}{2} (z_i-y_i)^2 \tag{1}\]

where \(z_i\) is the predicted value of the sample and \(y_i\) is the label value of the sample.

5.2.2 Back-Propagation

Single sample with multiple features

Unlike the previous chapter, the forward-computation formula in this chapter involves multiple feature values:

\[ \begin{aligned} z_i &= x_{i1} \cdot w_1 + x_{i2} \cdot w_2 + b \\ &= \begin{pmatrix} x_{i1} & x_{i2} \end{pmatrix} \begin{pmatrix} w_1 \\ w_2 \end{pmatrix}+b \end{aligned} \tag{2} \]

Since \(X\) has two feature values, the corresponding \(W\) has two weight values. \(x_{i1}\) denotes the first feature value of the \(i\)-th sample. So when \(X\) or \(W\) is a vector or a matrix, is our back-propagation gradient formula still valid? The answer is yes; let's go through a simple derivation together.

Since \(W\) is split into two parts, \(w_1\) and \(w_2\), according to equations 1 and 2 we take the partial derivative with respect to each of them separately:

\[ \frac{\partial loss}{\partial w_1}=\frac{\partial loss}{\partial z_i}\frac{\partial z_i}{\partial w_1}=(z_i-y_i) \cdot x_{i1} \tag{3} \]
\[ \frac{\partial loss}{\partial w_2}=\frac{\partial loss}{\partial z_i}\frac{\partial z_i}{\partial w_2}=(z_i-y_i) \cdot x_{i2} \tag{4} \]

We cannot take the partial derivative of the loss function with respect to the matrix \(W\) directly, so we instead take the partial derivative with respect to each component of \(W\). Since the shape of \(W\) is:

\[ W= \begin{pmatrix} w_1 \\ w_2 \end{pmatrix} \]

the partial derivative of \(loss\) with respect to \(W\), because \(W\) is a matrix, should be written as:

\[ \begin{aligned} \frac{\partial loss}{\partial W}&= \begin{pmatrix} {\partial loss}/{\partial w_1} \\ \\ {\partial loss}/{\partial w_2} \end{pmatrix} =\begin{pmatrix} (z_i-y_i)\cdot x_{i1} \\ (z_i-y_i) \cdot x_{i2} \end{pmatrix} \\ &=\begin{pmatrix} x_{i1} \\ x_{i2} \end{pmatrix} (z_i-y_i) =\begin{pmatrix} x_{i1} & x_{i2} \end{pmatrix}^T(z_i-y_i) \\ &=x_i^T(z_i-y_i) \end{aligned} \tag{5} \]

\[ {\partial loss \over \partial B}=z_i-y_i \tag{6} \]
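To make equations (5) and (6) concrete, here is a minimal numpy sketch for a single sample, reusing the first sample's values from above; the zero-initialized W and B are just placeholders:

import numpy as np

# single-sample gradient following equations (5) and (6)
x_i = np.array([[10.06, 60.0]])   # shape (1, 2): one sample, two features
y_i = np.array([[302.86]])        # shape (1, 1): label
W = np.zeros((2, 1))
B = np.zeros((1, 1))

z_i = np.dot(x_i, W) + B          # forward pass, equation (2)
dZ = z_i - y_i                    # dloss/dz = z_i - y_i
dW = np.dot(x_i.T, dZ)            # equation (5): x_i^T (z_i - y_i), shape (2, 1)
dB = dZ                           # equation (6)
print(dW, dB)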

Multiple samples with multiple features

For the multi-sample case, let's derive it using m = 3 samples as an example:

\[ z_1 = x_{11}w_1+x_{12}w_2+b \]

\[ z_2= x_{21}w_1+x_{22}w_2+b \]

\[ z_3 = x_{31}w_1+x_{32}w_2+b \]

\[ J(w,b) = \frac{1}{2 \times 3}[(z_1-y_1)^2 + (z_2-y_2)^2 + (z_3-y_3)^2] \]

\[ \begin{aligned} \frac{\partial J}{\partial W}&= \begin{pmatrix} \frac{\partial J}{\partial w_1} \\ \\ \frac{\partial J}{\partial w_2} \end{pmatrix} =\begin{pmatrix} \frac{\partial J}{\partial z_1}\frac{\partial z_1}{\partial w_1}+\frac{\partial J}{\partial z_2}\frac{\partial z_2}{\partial w_1}+\frac{\partial J}{\partial z_3}\frac{\partial z_3}{\partial w_1} \\ \\ \frac{\partial J}{\partial z_1}\frac{\partial z_1}{\partial w_2}+\frac{\partial J}{\partial z_2}\frac{\partial z_2}{\partial w_2}+\frac{\partial J}{\partial z_3}\frac{\partial z_3}{\partial w_2} \end{pmatrix} \\ &=\begin{pmatrix} \frac{1}{3}(z_1-y_1)x_{11}+\frac{1}{3}(z_2-y_2)x_{21}+\frac{1}{3}(z_3-y_3)x_{31} \\ \frac{1}{3}(z_1-y_1)x_{12}+\frac{1}{3}(z_2-y_2)x_{22}+\frac{1}{3}(z_3-y_3)x_{32} \end{pmatrix} \\ &=\frac{1}{3} \begin{pmatrix} x_{11} & x_{21} & x_{31} \\ x_{12} & x_{22} & x_{32} \end{pmatrix} \begin{pmatrix} z_1-y_1 \\ z_2-y_2 \\ z_3-y_3 \end{pmatrix} \\ &=\frac{1}{3} \begin{pmatrix} x_{11} & x_{12} \\ x_{21} & x_{22} \\ x_{31} & x_{32} \end{pmatrix}^T \begin{pmatrix} z_1-y_1 \\ z_2-y_2 \\ z_3-y_3 \end{pmatrix} \\ &=\frac{1}{m}X^T(Z-Y) \end{aligned} \tag{7} \]

\[ {\partial J \over \partial B}={1 \over m}(Z-Y) \tag{8} \]
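The following small sketch (with made-up numbers) verifies equation (7) numerically: the vectorized gradient \(X^T(Z-Y)/m\) matches the component-wise sums written out in the derivation:

import numpy as np

# numerical check of equation (7) with m = 3 hypothetical samples
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])   # shape (3, 2)
Y = np.array([[10.0], [20.0], [30.0]])               # shape (3, 1)
W = np.array([[0.1], [0.2]])
B = np.array([[0.3]])
m = X.shape[0]

Z = np.dot(X, W) + B
dW_vec = np.dot(X.T, Z - Y) / m                      # equation (7)
dw1 = sum((Z[i, 0] - Y[i, 0]) * X[i, 0] for i in range(m)) / m
dw2 = sum((Z[i, 0] - Y[i, 0]) * X[i, 1] for i in range(m)) / m
print(dW_vec.ravel(), (dw1, dw2))                    # both give the same two numbers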

5.2.3 Code Implementation

Equation 6 here is the same as Equation 5 in Section 4.4, so we can still use the classes already written in the HelperClass directory from Chapter 4 to express our neural network. Although the neuron now has multiple inputs, the code does not need to change to accommodate this: the forward computation uses matrix multiplication, which automatically adapts to x having multiple columns, as long as the corresponding W is given the right matrix shape.
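A minimal numpy sketch (with made-up values) of just this point: the same forward line handles one feature or two, as long as the shape of W matches the number of columns of x:

import numpy as np

B = np.zeros((1, 1))
W1 = np.zeros((1, 1))                  # previous chapter: one feature
W2 = np.zeros((2, 1))                  # this chapter: two features

x_one = np.array([[0.5]])              # shape (1, 1)
x_two = np.array([[0.5, 60.0]])        # shape (1, 2)

print((np.dot(x_one, W1) + B).shape)   # (1, 1)
print((np.dot(x_two, W2) + B).shape)   # (1, 1) -- no change to the forward code needed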

However, during initialization we have to manually specify the shapes of x and W, as shown in the following code:

import numpy as np
# NeuralNet, HyperParameters and the data reader come from the HelperClass
# directory written in Chapter 4 (import lines omitted here)

if __name__ == '__main__':
    # net
    params = HyperParameters(2, 1, eta=0.1, max_epoch=100, batch_size=1, eps=1e-5)
    net = NeuralNet(params)
    net.train(reader)
    # inference
    x1 = 15
    x2 = 93
    x = np.array([x1, x2]).reshape(1, 2)
    print(net.inference(x))

The hyperparameters specify a learning rate of 0.1, a maximum of 100 epochs, a batch size of 1 sample, and a stopping condition of a loss value below 1e-5.

When the neural network is initialized, input_size = 2 and output_size = 1 are specified, i.e., the neuron can receive two inputs and finally produces one output.

The last part is inference: the two conditions (15 km, 93 square meters) are substituted in to see the predicted output.

In the following initialization code of the neural network, W is initialized according to the values of input_size and output_size.

class NeuralNet(object):
    def __init__(self, params):
        self.params = params
        self.W = np.zeros((self.params.input_size, self.params.output_size))
        self.B = np.zeros((1, self.params.output_size))

Forward calculation code

class NeuralNet(object):
    def __forwardBatch(self, batch_x):
        Z = np.dot(batch_x, self.W) + self.B
        return Z

Error back-propagation code

class NeuralNet(object):
    def __backwardBatch(self, batch_x, batch_y, batch_z):
        m = batch_x.shape[0]
        dZ = batch_z - batch_y
        dB = dZ.sum(axis=0, keepdims=True)/m   # equation (8)
        dW = np.dot(batch_x.T, dZ)/m           # equation (7)
        return dW, dB
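For completeness, here is a sketch of the gradient-descent update step; the method name is an assumption, but the update line for self.W matches the one reported in the runtime warning shown in the next section:

class NeuralNet(object):
    def __update(self, dW, dB):
        # gradient-descent step with learning rate eta
        self.W = self.W - self.params.eta * dW
        self.B = self.B - self.params.eta * dB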

5.2.4 Running Results

In Visual Studio 2017, you can press Ctrl+F5 to run the Level2 code, but the printout runs into something frustrating:

epoch=0
NeuralNet.py:32: RuntimeWarning: invalid value encountered in subtract
  self.W = self.W - self.params.eta * dW
0 500 nan
epoch=1
1 500 nan
epoch=2
2 500 nan
epoch=3
3 500 nan
......

How can a subtraction go wrong? And what is nan?

nan denotes an abnormal value: the computation overflowed, producing meaningless values. Currently the loss is monitored once every 500 iterations; let's make the monitoring interval smaller and try again:

epoch=0
0 10 6.838664338516814e+66
0 20 2.665505502247752e+123
0 30 1.4244204612680962e+179
0 40 1.393993758296751e+237
0 50 2.997958629609441e+290
NeuralNet.py:76: RuntimeWarning: overflow encountered in square
  LOSS = (Z - Y)**2
0 60 inf
...
0 110 inf
NeuralNet.py:32: RuntimeWarning: invalid value encountered in subtract
  self.W = self.W - self.params.eta * dW
0 120 nan
0 130 nan

Within the first 10 iterations, the loss value has already reached 6.83e+66, and the longer it runs the larger it gets, until it finally overflows. The loss history below also shows this process.

Figure 5-2 Loss function value changing during training

5.2.5 Finding the Cause of Failure

We can open the NeuralNet.py file, set a breakpoint on the line of code shown in Figure 5-3, and trace the training process to find the problem.

Figure 5-3 Debugging in Visual Studio

Run in debug mode with F5 in VS2017 and look at the result at line 50:

batch_x
array([[ 4.96071728, 41.        ]])
batch_y
array([[244.07856544]])

The sample data returned is normal. Look at the next line:

batch_z
array([[0.]])

The first time the forward computation runs, since the initial values of W and B are both 0, z is also 0, which is normal. Look at the next line:

dW
array([[ -1210.80475712],
       [-10007.22118309]])
dB
array([[-244.07856544]])

The values of dW and dB are both very large, because of the line of code shown in Figure 5-4.

Figure 5-4 The problematic line of code

batch_z is 0 and batch_y is 244.078; subtracting them gives -244.078, so dB is -244.078. dW is even larger, because the matrix multiplication with batch_x scales it up.

Looking at the updated values of W and B, they are just as large:

self.W
array([[ 121.08047571],
       [1000.72211831]])
self.B
array([[24.40785654]])
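To make the arithmetic concrete, this small sketch reproduces the first update by hand using the values just seen in the debugger (eta = 0.1 from the hyperparameters, W and B initialized to zero):

import numpy as np

batch_x = np.array([[4.96071728, 41.0]])
batch_y = np.array([[244.07856544]])
W = np.zeros((2, 1)); B = np.zeros((1, 1)); eta = 0.1

batch_z = np.dot(batch_x, W) + B    # [[0.]]
dZ = batch_z - batch_y              # [[-244.07856544]]
dW = np.dot(batch_x.T, dZ)          # [[-1210.80...], [-10007.22...]]
dB = dZ
W = W - eta * dW                    # [[121.08...], [1000.72...]]
B = B - eta * dB                    # [[24.40...]]
print(W, B)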

With W and B this large, the forward computation in the next round gives an even worse result:

batch_z
array([[82459.53752331]])

Sure enough, this time the value of z has soared to over 80,000. If this continues, numerical overflow after a few more rounds is obviously inevitable.

So what exactly have we run into?

Code location

ch05, Level2


Original article: www.cnblogs.com/woodyh5/p/12034431.html