Multilayer Perceptron (MLP) Algorithm Principle and Code Implementation

Getting Started with Multilayer Perceptrons

Neural networks have become a very popular topic in recent years. The convolutional neural network (CNN) and recurrent neural network (RNN) that we often hear about can be regarded as applications of neural networks to specific scenarios.

In this article, we start from the foundation of neural networks, the multilayer perceptron (MLP), in order to understand its basic algorithmic principles for solving prediction problems.

To get started with the MLP, I think the easiest way is to understand it as a generalized linear model.

For a linear model, the output is computed directly from the inputs:

$$y = w[1] \, x[1] + w[2] \, x[2] + w[3] \, x[3] + b$$

For an MLP, a set of hidden units $h$ is added between the inputs $x$ and the output $y$; this group of $h$ units is called the hidden layer.

The relations between $y$, $h$, and $x$ are

$$h[1] = f(w[1,1] \, x[1] + w[2,1] \, x[2] + w[3,1] \, x[3] + b[0])$$

$$h[2] = f(w[1,2] \, x[1] + w[2,2] \, x[2] + w[3,2] \, x[3] + b[1])$$

$$y = v[1] \, h[1] + v[2] \, h[2] + s$$

where $f(\cdot)$ is a nonlinear function called the activation function.
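To make this two-layer structure concrete, here is a minimal NumPy sketch of exactly this forward pass (3 inputs, 2 hidden units, 1 output); the weight values are random placeholders, and tanh is assumed as the activation:

```python
import numpy as np

def forward(x, w, b, v, s, f=np.tanh):
    """Forward pass of the small MLP above: x -> h -> y."""
    h = f(w.T @ x + b)   # h[j] = f(sum_i w[i,j] * x[i] + b[j])
    return v @ h + s     # y = v[1]*h[1] + v[2]*h[2] + s

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0, 3.0])   # 3 input features
w = rng.normal(size=(3, 2))     # w[i, j]: input i -> hidden unit j
b = rng.normal(size=2)          # hidden biases b[0], b[1]
v = rng.normal(size=2)          # hidden -> output weights v[1], v[2]
s = 0.5                         # output bias
print(forward(x, w, b, v, s))
```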

Mathematically, if $f(\cdot)$ is linear, then an MLP with hidden layers is still essentially a linear model. To give the MLP stronger learning ability, $f(\cdot)$ must be nonlinear. The common choices at present are the rectified linear unit (relu), the hyperbolic tangent (tanh), and the sigmoid function.

The three functions are defined as

$$\text{relu}: \; y = \max(0, x)$$

$$\text{tanh}: \; y = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

$$\text{sigmoid}: \; y = \frac{1}{1 + e^{-x}}$$

Their respective curves are shown below.
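As a quick sketch to recreate those curves, the three activations can be implemented and plotted with NumPy and matplotlib:

```python
import numpy as np
import matplotlib.pyplot as plt

def relu(x):
    return np.maximum(0, x)

def tanh(x):
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))  # identical to np.tanh

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.linspace(-4, 4, 200)
for f in (relu, tanh, sigmoid):
    plt.plot(x, f(x), label=f.__name__)
plt.legend()
plt.show()
```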

A natural question here is: why do these activation functions make the model's learning ability stronger?

One way to understand it: any nonlinear function can be approximated by a piecewise linear function. A combination of relu units clearly produces piecewise linear functions, so in theory it can fit any nonlinear function. tanh and sigmoid can be viewed as softened versions of relu, so they fit about as well, and their gradients are convenient to compute.
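A small numerical illustration of this claim (the breakpoints and the target function sin(x) are arbitrary choices for the demo): a weighted sum of shifted relu units is exactly a piecewise linear function, and fitting its weights already approximates a smooth curve well.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

knots = np.linspace(-np.pi, np.pi, 20)      # kink locations of the piecewise-linear fit
x = np.linspace(-np.pi, np.pi, 500)
basis = relu(x[:, None] - knots[None, :])   # one shifted relu "hinge" per knot
coef, *_ = np.linalg.lstsq(basis, np.sin(x), rcond=None)  # least-squares fit of the weights
print(np.max(np.abs(basis @ coef - np.sin(x))))           # max error is small
```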

In fact, there is another way to understand the design of the MLP: as an abstract model of biological neurons, which is also the historical origin of the neural network model. The MLP's structure is indeed very similar to the shape of a neuron.

Algorithm optimization principle

The example in the previous section has only 1 hidden layer, with only 2 hidden units in it, so the total number of variables is 11. In practice, both the number of hidden layers and the number of hidden units per layer can be much larger, and so can the number of variables. How, then, do we find the optimal values of these variables?
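Counting for the example above: 3 × 2 input-to-hidden weights, 2 hidden biases, 2 hidden-to-output weights, and 1 output bias give 11 variables. A small helper (my own sketch, not from the original article) generalizes the count:

```python
def count_params(layer_sizes):
    """Total weights + biases of a fully connected net, e.g. [3, 2, 1]."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

print(count_params([3, 2, 1]))       # 11, the example in the text
print(count_params([3, 10, 10, 1]))  # 161 -- wider/deeper means many more variables
```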

Returning to the original intent of the MLP design, we want to minimize the MLP's prediction error on the samples. For any single sample, the error can be defined as

$$Loss = (y - \hat{y})^2$$

For our earlier example, this becomes

$$Loss = \left[ y - \left( v_1 f(w_{11} x_1 + w_{21} x_2 + w_{31} x_3 + b_0) + v_2 f(w_{12} x_1 + w_{22} x_2 + w_{32} x_3 + b_1) + s \right) \right]^2$$

If there are $N$ samples, the per-sample errors are summed and the sum is minimized:

$$\min \; \sum_{i=1}^{N} Loss_i$$

The variables to be optimized are $\mathbf{v}$, $\mathbf{w}$, $\mathbf{b}$, and $s$; since the problem is unconstrained, it can be solved with a gradient algorithm.
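Continuing the NumPy sketch from earlier (tanh assumed as $f$), the summed loss over $N$ samples might look like:

```python
import numpy as np

def total_loss(X, y, w, b, v, s, f=np.tanh):
    """Sum of squared errors over N samples; X has shape (N, 3)."""
    H = f(X @ w + b)    # hidden activations for all samples, shape (N, 2)
    y_hat = H @ v + s   # predictions, shape (N,)
    return np.sum((y - y_hat) ** 2)
```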

For classic gradient algorithms, see the earlier article on the principles of gradient algorithms: the steepest descent method, Newton's method, and quasi-Newton methods. Because the number of optimization variables in an MLP is usually very large and optimization is therefore time-consuming, algorithms such as Newton's method, which must compute the Hessian matrix, are clearly unsuitable. Even quasi-Newton methods, which replace the Hessian with cheaper approximations, have gradually been abandoned. In the end, driven by the need for training efficiency, people generally choose algorithms that require only first-order gradients, such as steepest descent, as the basic algorithm.

Since first-order gradient information is needed, how is the gradient computed? Before computing it, we should be clear that the gradient here means the gradient of the loss function $Loss$ with respect to the optimization variables $\mathbf{v}$, $\mathbf{w}$, $\mathbf{b}$, and $s$.

Take the gradient of $Loss$ with respect to $w$ as an example. Using the previous MLP example, write the expressions in matrix form:

$$h = f(w^T x + b) \;\Rightarrow\; \frac{\partial h}{\partial w^T} = f' \cdot x$$

$$\hat{y} = v^T h + s \;\Rightarrow\; \frac{\partial \hat{y}}{\partial h} = v^T$$

$$Loss = (y - \hat{y})^2 \;\Rightarrow\; \frac{\partial Loss}{\partial \hat{y}} = -2(y - \hat{y})$$

By the chain rule of differentiation,

$$\frac{\partial Loss}{\partial w^T} = \frac{\partial Loss}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial h} \cdot \frac{\partial h}{\partial w^T} = -2(y - \hat{y}) \cdot v^T \cdot f' \cdot x$$

This is the "error backpropagation" we often hear about: the gradient computation is derived step by step from the back (the output) toward the front (the inputs).
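Here is a minimal single-sample sketch of this chain rule in NumPy, assuming $f = \tanh$ (so $f'(z) = 1 - \tanh^2(z)$), with a finite-difference check on one weight:

```python
import numpy as np

def grad_w(x, y, w, b, v, s):
    """Gradient of Loss = (y - y_hat)^2 with respect to w, via the chain rule."""
    h = np.tanh(w.T @ x + b)
    y_hat = v @ h + s
    dL_dyhat = -2.0 * (y - y_hat)   # dLoss/dy_hat
    dL_dh = dL_dyhat * v            # back through y_hat = v.h + s
    dL_dz = dL_dh * (1 - h ** 2)    # back through h = tanh(z)
    return np.outer(x, dL_dz)       # back through z = w.T x + b

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), 1.0
w, b, v, s = rng.normal(size=(3, 2)), rng.normal(size=2), rng.normal(size=2), 0.1
loss = lambda w_: (y - (v @ np.tanh(w_.T @ x + b) + s)) ** 2
eps = 1e-6
w2 = w.copy(); w2[0, 0] += eps
print(grad_w(x, y, w, b, v, s)[0, 0], (loss(w2) - loss(w)) / eps)  # the two should agree
```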

The above shows the gradient computation for one sample. To obtain the exact gradient, this process must be repeated for every sample, so when the training set is large, gradient computation becomes very slow.

If we change our thinking and compute the gradient on only one sample at a time, using that value as the final gradient, then the computation becomes fast but the value is no longer very accurate.

The popular compromise is therefore to select a subset of samples (a batch) each time and use its gradient as the overall gradient. By adjusting the batch size, we can balance computational efficiency against the accuracy of the gradient estimate.
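A sketch of how mini-batches are typically drawn (the batch size of 32 and the toy data are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N, batch_size = 1000, 32
X = rng.normal(size=(N, 3))   # toy feature matrix

for epoch in range(10):
    idx = rng.permutation(N)  # reshuffle the sample order every epoch
    for start in range(0, N, batch_size):
        batch = idx[start:start + batch_size]
        X_batch = X[batch]    # the gradient computed on this batch stands in
                              # for the full gradient in the update step
```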

With the gradient available, let's review the basic iterative formula of gradient algorithms:

$$\theta_{k+1} = \theta_k - \eta_k \cdot d_k$$

where $\theta$ denotes the optimization variables, and $\eta_k$ and $d_k$ are, respectively, the step size of the $k$-th iteration (more often called the "learning rate" in the MLP literature) and the iteration direction.

Let

$$d_k = \frac{\partial Loss}{\partial \theta} = \nabla(\theta)$$

The computation of $d_k$ is exactly the procedure just described. If the algorithm stops at this level of sophistication, it is generally called stochastic gradient descent (SGD).
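Putting the pieces together, one SGD step is just this iterative formula applied with the (mini-batch) gradient; a toy sketch with an assumed learning rate:

```python
def sgd_step(theta, grad, lr=0.1):
    """theta_{k+1} = theta_k - eta_k * d_k."""
    return theta - lr * grad

# toy example: minimize f(theta) = theta^2, whose gradient is 2 * theta
theta = 5.0
for k in range(100):
    theta = sgd_step(theta, 2 * theta)
print(theta)  # close to the minimizer 0
```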

However, people are rarely satisfied with this relatively basic algorithm, so many improved algorithms have appeared on top of it. From the iterative formula, the improvements fall into three main categories: the first targets the iteration direction, e.g., Momentum; the second targets the learning rate, e.g., Adagrad; the third improves both at once, e.g., Adam. For the specific strategies, refer to the survey: An overview of gradient descent optimization algorithms.
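To make the three categories concrete, here are sketches of the standard textbook update rules (the hyperparameter defaults are typical values, not prescriptions from the cited survey):

```python
import numpy as np

def momentum_step(theta, grad, state, lr=0.01, beta=0.9):
    """Improves the direction: use an exponentially weighted sum of past gradients."""
    state['m'] = beta * state.get('m', 0.0) + grad
    return theta - lr * state['m']

def adagrad_step(theta, grad, state, lr=0.01, eps=1e-8):
    """Improves the learning rate: shrink it where gradients have been large."""
    state['g2'] = state.get('g2', 0.0) + grad ** 2
    return theta - lr * grad / (np.sqrt(state['g2']) + eps)

def adam_step(theta, grad, state, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """Improves both: momentum-style direction plus per-parameter step sizes."""
    state['t'] = state.get('t', 0) + 1
    state['m'] = b1 * state.get('m', 0.0) + (1 - b1) * grad
    state['v'] = b2 * state.get('v', 0.0) + (1 - b2) * grad ** 2
    m_hat = state['m'] / (1 - b1 ** state['t'])  # bias-corrected first moment
    v_hat = state['v'] / (1 - b2 ** state['t'])  # bias-corrected second moment
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps)
```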

sklearn code implementation

sklearn already integrates an MLP toolkit; the example below uses the classification class MLPClassifier. For the meaning of its parameters, refer to the detailed explanation of MLPClassifier parameters. We use the two_moons dataset; for a detailed introduction to it, see the earlier article: Introduction to Decision Trees, sklearn Implementation, Principle Interpretation and Algorithm Analysis.

import mglearn.plots
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

from data.two_moons import two_moons
import matplotlib.pyplot as plt


def train_two_moons():
    # Load the two-moons dataset
    features, labels = two_moons()

    # Split the dataset into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(features, labels, stratify=labels, random_state=42)
    # Configure the MLP classifier: tanh activation, 2 hidden layers of 10 units each, at most 1000 iterations
    mlp = MLPClassifier(solver='lbfgs', activation='tanh', random_state=0, hidden_layer_sizes=[10, 10], max_iter=1000)

    # Train the MLP classifier on the training set
    mlp.fit(X_train, y_train)
    # Print the model's accuracy on the training set
    print('Accuracy on train set: {:.2f}'.format(mlp.score(X_train, y_train)))
    # Print the model's accuracy on the test set
    print('Accuracy on test set: {:.2f}'.format(mlp.score(X_test, y_test)))

    # Plot the 2D decision boundary
    mglearn.plots.plot_2d_separator(mlp, X_train, fill=True, alpha=.3)
    # Scatter plot of the training set
    mglearn.discrete_scatter(X_train[:, 0], X_train[:, 1], y_train)
    # Label the x axis
    plt.xlabel("Feature 0")
    # Label the y axis
    plt.ylabel("Feature 1")
    # Display the figure
    plt.show()


if __name__ == '__main__':

    train_two_moons()

Running the above code first prints the MLP's performance metrics:

Accuracy on train set: 1.00
Accuracy on test set: 0.84

The code also plots the model's decision boundary. The boundary is clearly nonlinear, yet it looks smooth.

Analysis of core strengths and weaknesses

The main advantage of the MLP is that, by adjusting the number of hidden layers and the number of hidden units per layer, we can construct models complex enough to learn the various kinds of information contained in the data. Given enough computation time and data, and with carefully tuned parameters, MLPs can often beat other machine learning algorithms, whether on classification or regression tasks.

There are two main shortcomings. First, complex models often take a long time to train; the algorithmic evolution SGD → Momentum, Adagrad → Adam described in the second section can be understood as people trying to reduce total training time by improving the optimizer. Second, poor interpretability: we cannot clearly explain why the model gives a particular prediction, which is why such models are often called black-box models.
