Implementation of Deep Neural Network Algorithms

For the current popular deep learning, it is necessary to maintain the spirit of learning - programmers, especially architects, must always pay attention to and be sensitive to core technologies and key algorithms. Use - Whether it is used or not is a political issue, whether it will be written or not is a technical issue, just like the military does not care about the issue of whether to fight, but about how to win.

How programmers learn machine learning
For programmers, machine learning has a certain threshold (this threshold is also its core competitiveness). I believe that many people will have a headache when learning machine learning for English papers full of mathematical formulas, even May retreat. But in fact, the machine learning algorithm landing program is not difficult to write. The following is the reverse multi-layer (BP) neural network algorithm implemented by 70 lines of code, that is, deep learning. In fact, not only neural networks, but most machine learning algorithms such as logistic regression, decision tree C45/ID3, random forest, Bayesian, collaborative filtering, graph computing, Kmeans, and PageRank can be implemented in a 100-line stand-alone program.

The real difficulty of machine learning lies in why it does this calculation, what is the mathematical principle behind it, and how to derive the formula. Most of the information on the Internet introduces this part of the theoretical knowledge, but seldom tells you the calculation process of the algorithm And what is the landing of the program? For programmers, all you need to do is engineering applications, and you don't need to prove a new mathematical calculation method. In fact, most machine learning engineers use open source packages or tool software written by others to input data and adjust calculation coefficients to train results, and even rarely implement the algorithm process themselves. But it is still very important to grasp the calculation process of each algorithm, so that you can understand what kind of changes the algorithm makes in the data, and what kind of effect the algorithm is designed to achieve.

This paper focuses on the single-computer implementation of the reverse neural network. Regarding the multi-computer parallelization of the neural network, Fourinone provides a very flexible and complete parallel computing framework. We only need to understand the single-computer program implementation to conceive and design a distributed parallelization scheme. , if you do not understand the algorithm calculation process, all ideas will not be able to unfold. In addition, there is also a convolutional neural network, which is mainly a dimensionality reduction idea for image processing, which is beyond the scope of this article.

Further reading:


The calculation process
of the neural network The neural network structure is shown in the figure below. The leftmost is the input layer, the rightmost is the output layer, and there are multiple hidden layers in the middle. Each neural node of the hidden layer and the output layer is composed of The upper layer nodes are multiplied by their weights, and the circle marked with "+1" is the intercept term b. For each node outside the input layer: Y=w0*x0+w1*x1+…+wn*xn+b, From this we can know that the neural network is equivalent to a multi-layer logistic regression structure.


 

(Image from UFLDL Tutorial )


Algorithm calculation process: start from the input layer, calculate from left to right, and move forward layer by layer until the output layer produces results. If there is a gap between the result value and the target value, then calculate from right to left, calculate the error of each node backwards layer by layer, and adjust all the weights of each node. After reaching the input layer in reverse, recalculate forward, repeating Iterate the above steps until all weight parameters converge to a reasonable value. Because computer programs solve equation parameters and mathematical methods are different, generally first randomly select parameters, and then continuously adjust parameters to reduce errors until they approach the correct value, so most machine learning is iterative training, let's look at the program in detail below It will be clear to see the implementation of the process.

The algorithm program implementation
of the neural network is divided into three processes: initialization, forward calculation results, and reverse weight modification.

1. Initialization process
Since it is an n-layer neural network, we use a two-dimensional array layer to record the node value. The first dimension is the number of layers, the second dimension is the node position of the layer, and the value of the array is the node value; similarly, the node error value layerErr It is also recorded in a similar way. Use the three-dimensional array layer_weight to record the weight of each node. The first dimension is the number of layers, the second dimension is the node position of the layer, and the third dimension is the position of the lower node. The value of the array is the weight value of a node reaching a node in the lower layer, and the initial value is 0 A random number between -1. In order to optimize the convergence speed, the momentum method weight adjustment is used here. It is necessary to record the last weight adjustment, and use the three-dimensional array layer_weight_delta to record. The intercept item is processed: the value of the intercept is set to 1 in the program, so that it is only necessary to calculate it.

2. Calculate the result forward
Use the S function 1/(1+Math.exp(-z)) to unify the value of each node between 0-1, and then calculate forward layer by layer until the output Layer, for the output layer, there is actually no need to use the S-function. Here, we regard the output result as a probability value between 0 and 1, so the S-function is also used, which is also conducive to the unity of program implementation.

3. Reverse weight modification How the
neural network calculates the error generally uses the squared error function E, as follows:



That is, the squares of the errors of the multiple output items and the corresponding target value are accumulated and divided by 2. In fact, the error function of logistic regression is also the same. As for why this function is used to calculate the error, what is its mathematical rationality, and how did you get it? I suggest that programmers who do not want to be mathematicians should not go In depth, what we need to do now is how to take the error of this function E to its minimum value, and we need to derive it. If there is some mathematical basis for derivation, we can try to deduce how to derive the weight from the function E. Get the following formula:



It doesn't matter if you can't derive it. We only need to use the result formula. In our program, we use layerErr to record the minimum error after E derivation of the weight, and then adjust the weight according to the minimum error.

Note that the momentum method is used to adjust here, and the experience of the previous adjustment is taken into account to avoid falling into a local minimum. The following k represents the number of iterations, mobp is the momentum term, and rate is the learning step size:

Δw(k+1) = mobp*Δw(k)+rate*Err*Layer


There are also many uses the following formula, the difference in effect is not too big:

Δw(k+1) = mobp*Δw(k)+(1-mobp)rate*Err*Layer


In order to improve performance, note that the program implementation is to calculate the error and adjust the weight at the same time in a while, first locate the position on the penultimate layer (that is, the last hidden layer), and then reversely adjust it layer by layer, according to L+ The calculated error of the 1 layer is used to adjust the weight of the L layer, and the error of the L layer is calculated at the same time, which is used to calculate the weight when the next cycle to the L-1 layer, and this cycle continues until the end of the penultimate layer (input layer).

Summary
During the entire calculation process, the value of the node changes every time it is calculated and does not need to be saved, while the weight parameters and error parameters need to be saved and need to provide support for the next iteration. Therefore, if we conceive a distributed The multi-computer parallel computing scheme, you can understand why there is a concept of Parameter Server in other frameworks.

The complete program implementation of the multi-layer neural network
The following implementation program BpDeep.java can be used directly, and it can also be easily modified to be implemented in any other language such as C, C#, Python, etc., because it is the basic statement used, and no other Java libraries are used. (except for the Random function).

import java.util.Random;
public class BpDeep{
    public double[][] layer;//nodes of each layer of the neural network
    public double[][] layerErr;//The error of each node of the neural network
    public double[][][] layer_weight;//The weight of each layer node
    public double[][][] layer_weight_delta;//The weight momentum of each layer node
    public double mobp;//Momentum coefficient
    public double rate;//Learning coefficient

    public BpDeep(int[] layernum, double rate, double mobp){
        this.mobp = mobp;
        this.rate = rate;
        layer = new double[layernum.length][];
        layerErr = new double[layernum.length][];
        layer_weight = new double[layernum.length][][];
        layer_weight_delta = new double[layernum.length][][];
        Random random = new Random();
        for(int l=0;l<layernum.length;l++){
            layer[l]=new double[layernum[l]];
            layerErr[l]=new double[layernum[l]];
            if(l+1<layernum.length){
                layer_weight[l]=new double[layernum[l]+1][layernum[l+1]];
                layer_weight_delta[l]=new double[layernum[l]+1][layernum[l+1]];
                for(int j=0;j<layernum[l]+1;j++)
                    for(int i=0;i<layernum[l+1];i++)
                        layer_weight[l][j][i]=random.nextDouble();//Random initialization weight
            }   
        }
    }
    //calculate the output layer by layer forward
    public double[] computeOut(double[] in){
        for(int l=1;l<layer.length;l++){
            for(int j=0;j<layer[l].length;j++){
                double z=layer_weight[l-1][layer[l-1].length][j];
                for(int i=0;i<layer[l-1].length;i++){
                    layer[l-1][i]=l==1?in[i]:layer[l-1][i];
                    z+=layer_weight[l-1][i][j]*layer[l-1][i];
                }
                layer[l][j]=1/(1+Math.exp(-z));
            }
        }
        return layer[layer.length-1];
    }
    //Reversely calculate the error layer by layer and modify the weight
    public void updateWeight(double[] tar){
        int l=layer.length-1;
        for(int j=0;j<layerErr[l].length;j++)
            layerErr[l][j]=layer[l][j]*(1-layer[l][j])*(tar[j]-layer[l][j]);

        while(l-->0){
            for(int j=0;j<layerErr[l].length;j++){
                double z = 0.0;
                for(int i=0;i<layerErr[l+1].length;i++){
                    z=z+l>0?layerErr[l+1][i]*layer_weight[l][j][i]:0;
                    layer_weight_delta[l][j][i]= mobp*layer_weight_delta[l][j][i]+rate*layerErr[l+1][i]*layer[l][j];//Hidden layer momentum Adjustment
                    layer_weight[l][j][i]+=layer_weight_delta[l][j][i];//Hidden layer weight adjustment
                    if(j==layerErr[l].length-1){
                        layer_weight_delta[l][j+1][i]= mobp*layer_weight_delta[l][j+1][i]+rate*layerErr[l+1][i];//Intercept momentum adjustment
                        layer_weight[l][j+1][i]+=layer_weight_delta[l][j+1][i];//Intercept weight adjustment
                    }
                }
                layerErr[l][j]=z*layer[l][j]*(1-layer[l][j]);//Record error
            }
        }
    }

    public void train(double[] in, double[] tar){
        double[] out = computeOut(in);
        updateWeight(tar);
    }
}

 
An example of using a neural network
Finally, let's find a simple example to see the magical effect of neural networks. In order to facilitate the observation of the data distribution, we use a two-dimensional coordinate data. There are 4 data below. The square represents the data type is 1, and the triangle represents the data type is 0. It can be seen that the data belonging to the square type are (1, 2 ) and (2, 1), the data belonging to the triangle type are (1, 1), (2, 2), now the problem is to divide the 4 data into 1 and 0 on the plane, and use this to predict new the type of data.



We can use the logistic regression algorithm to solve the above classification problem, but the logistic regression gets a linear straight line as the dividing line. It can be seen that no matter how the red line above is placed, there is always a sample that is wrongly divided into different types. , so for the above data, only one straight line cannot divide their classification very correctly. If we use the neural network algorithm, we can get the classification effect of the following figure, which is equivalent to dividing the space by the union of multiple straight lines, which is more accurate. high.



The following is the source code of the test program BpDeepTest.java:

import java.util.Arrays;
public class BpDeepTest{
    public static void main(String[] args){
        //Initialize the basic configuration of the neural network
        //The first parameter is an integer array, which represents the number of layers of the neural network and the number of nodes in each layer. For example, {3,10,10,10,10,2} means that the input layer is 3 nodes and the output layer is 2 There are 4 hidden layers in the middle, and each layer has 10 nodes
        //The second parameter is the learning step size, and the third parameter is the momentum coefficient
        BpDeep bp = new BpDeep(new int[]{2,10,2}, 0.15, 0.8);

        //Set the sample data, corresponding to the 4 two-dimensional coordinate data above
        double[][] data = new double[][]{{1,2},{2,2},{1,1},{2,1}};
        //Set the target data, corresponding to the classification of 4 coordinate data
        double[][] target = new double[][]{{1,0},{0,1},{0,1},{1,0}};

        //Iterative training 5000 times
        for(int n=0;n<5000;n++)
            for(int i=0;i<data.length;i++)
                bp.train(data[i], target[i]);

        //Check the sample data according to the training results
        for(int j=0;j<data.length;j++){
            double[] result = bp.computeOut(data[j]);
            System.out.println(Arrays.toString(data[j])+":"+Arrays.toString(result));
        }

        // Predict the classification of a new data based on the training results
        double[] x = new double[]{3,1};
        double[] result = bp.computeOut(x);
        System.out.println(Arrays.toString(x)+":"+Arrays.toString(result));
    }
}

 
Summary
The above test program shows that the neural network has a very magical classification effect. In fact, the neural network has certain advantages, but it is not a universal algorithm close to the human brain. It may disappoint us in many cases, and it needs to combine a large amount of data in various scenarios. Use it to see its effect. We can change 1 hidden layer to n layers, and adjust the number of nodes in each layer, the number of iterations, the learning step size and the momentum coefficient to obtain an optimal result. However, in many cases, the effect of the n-layer hidden layer is not significantly improved than that of the 1-layer, but the calculation is more complicated and time-consuming. Our understanding of the neural network needs more practice and experience.

Guess you like

Origin http://10.200.1.11:23101/article/api/json?id=326654618&siteId=291194637