Fitting a Nonlinear Function with the BP Algorithm

1. Principle of BP algorithm

The BP algorithm consists of two processes: the forward propagation of the input signal and the backward propagation of the output error.
Forward propagation: the input sample enters the network at the input layer and is passed layer by layer through the hidden layer to the output layer. If the actual output of the output layer differs from the expected output, the algorithm turns to error backpropagation; if the actual output equals the expected output, learning ends.
Backpropagation: the output error is propagated back toward the input layer through the hidden layer. During backpropagation the error is distributed to every node of every layer, giving each node an error signal that serves as the basis for correcting its weights. This calculation is carried out with the gradient descent method: the weights and thresholds of the neurons in each layer are adjusted repeatedly so that the error signal is driven to a minimum.
This continual adjustment of the weights and thresholds is the learning and training process of the network. Forward propagation of the signal and backpropagation of the error alternate, and the weights and thresholds are adjusted repeatedly until the preset number of training iterations is reached or the output error falls to the allowed level.

2. Model

BP neural network

3. Algorithm steps

1. Traditional BP algorithm:

(1) Given the input/output sample pairs $(u, y)$.
(2) Randomly generate the initial weights and biases $w_i, w_o, b_1, b_2$.
(3) Calculate the output value:

hi_in  = wi*u + b1;
hi_out = f(hi_in);    % f: activation function of your choice, usually sigmoid
y_in   = wo*hi_out + b2;
y_out  = g(y_in);     % g: activation function, usually sigmoid; when fitting a nonlinear function it can also be the identity function

(4) Calculate the maximum error percentage: $maxError = \frac{y - y_{out}}{y} \times 100\%$.
(5) Check whether maxError meets the requirement. If it does, the weights are not updated in this iteration (the remaining steps are skipped); otherwise continue.
Note: some implementations instead check whether the objective function value meets the requirement when deciding how to proceed.
(6) Objective function: $E = \frac{1}{2}(y - y_{out})^2$.
(7) Update the weights by gradient descent.
Increments:
$$\delta w_o = \frac{\partial E}{\partial w_o} = \frac{\partial E}{\partial y_{out}}\frac{\partial y_{out}}{\partial y_{in}}\frac{\partial y_{in}}{\partial w_o}$$

$$\delta b_2 = \frac{\partial E}{\partial b_2} = \frac{\partial E}{\partial y_{out}}\frac{\partial y_{out}}{\partial y_{in}}\frac{\partial y_{in}}{\partial b_2}$$

$$\delta w_i = \frac{\partial E}{\partial w_i} = \frac{\partial E}{\partial y_{out}}\frac{\partial y_{out}}{\partial y_{in}}\frac{\partial y_{in}}{\partial h_{out}}\frac{\partial h_{out}}{\partial h_{in}}\frac{\partial h_{in}}{\partial w_i}$$

$$\delta b_1 = \frac{\partial E}{\partial b_1} = \frac{\partial E}{\partial y_{out}}\frac{\partial y_{out}}{\partial y_{in}}\frac{\partial y_{in}}{\partial h_{out}}\frac{\partial h_{out}}{\partial h_{in}}\frac{\partial h_{in}}{\partial b_1}$$

Note: because the network above has a single hidden layer and a single output, simple partial derivatives along one path suffice. For a network with multiple outputs or multiple hidden layers, the intermediate factors must be total derivatives (summed over all paths).
Update ($\eta$ is the learning rate):

$$w_o = w_o - \eta \times \delta w_o$$
$$b_2 = b_2 - \eta \times \delta b_2$$
$$w_i = w_i - \eta \times \delta w_i$$
$$b_1 = b_1 - \eta \times \delta b_1$$

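To make steps (3)-(7) concrete, here is a minimal MATLAB sketch of one batch iteration for the single-hidden-layer, single-output network above. The sizes, variable names (u, ytrue, wi, wo, b1, b2, eta) and the sigmoid/identity activation choices are assumptions for illustration, not the exact code of Section 5.

% Toy setup so the sketch runs on its own (sizes and values are assumptions)
u = rand(2, 50);                          % 2 inputs, 50 samples
ytrue = sin(u(1,:))/2 + sin(u(2,:))/2;    % targets
wi = 0.2*rand(5,2)-0.1;  b1 = 0.2*rand(5,1)-0.1;   % 5 hidden units
wo = 0.2*rand(5,1)-0.1;  b2 = 0.2*rand-0.1;
eta = 0.05;                               % learning rate

% One batch iteration of steps (3)-(7)
N = size(u,2);
f = @(x) 1./(1+exp(-x));                  % hidden activation (sigmoid)
dwi = zeros(size(wi)); db1 = zeros(size(b1));
dwo = zeros(size(wo)); db2 = 0;
for j = 1:N
    h_out = f(wi*u(:,j) + b1);            % step (3): hidden layer
    y_out = wo'*h_out + b2;               % identity output activation g
    e = ytrue(j) - y_out;                 % y - y_out, so E = 0.5*e^2
    dwo = dwo + e*h_out;                  % -dE/dwo (descent direction)
    db2 = db2 + e;                        % -dE/db2
    dwi = dwi + (e*wo.*h_out.*(1-h_out))*u(:,j)';  % -dE/dwi via the chain rule
    db1 = db1 + e*wo.*h_out.*(1-h_out);   % -dE/db1
end
% Step (7): gradient-descent update with the batch-averaged increments
wo = wo + eta*dwo/N;  b2 = b2 + eta*db2/N;
wi = wi + eta*dwi/N;  b1 = b1 + eta*db1/N;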

2. Limitations of traditional algorithms

  • It easily gets trapped in a local minimum and may fail to reach the global optimum
  • A large number of training iterations is required, so learning is inefficient and convergence is slow
  • The choice of the number of hidden-layer nodes lacks theoretical guidance
  • Learning new samples during training tends to make the network forget old samples

3. Improvements to the BP algorithm

  • Add momentum term
  • Adaptively adjust the learning rate
  • Introduce the steepness factor
(1) Adding a momentum term
The standard BP algorithm adjusts the weights only along the gradient-descent direction of the error at time t and ignores the gradient directions before time t, which often makes the training process oscillate and converge slowly.
The solution is to add a momentum term to the weight-adjustment formula: $\alpha \times \delta w(t-1)$, where $0 < \alpha < 1$.
The essence of the momentum term is to take a fraction of the previous weight adjustment and add it to the current one. It carries the previously accumulated adjustment experience and damps the adjustment at time t: when the error surface fluctuates suddenly, it reduces the tendency to oscillate and speeds up training.

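Below is a minimal runnable sketch of this update rule on a toy one-dimensional objective $0.5(w-3)^2$; the learning rate, momentum factor, and objective are assumptions chosen only to illustrate the formula.

% Momentum update on a toy objective 0.5*(w-3)^2 (values are illustrative)
eta = 0.1;  alpha = 0.5;              % learning rate and momentum factor, 0 < alpha < 1
w = 0;  dwLast = 0;                   % weight and previous increment delta_w(t-1)
for t = 1:100
    dw = -(w - 3);                    % descent direction of the toy objective at time t
    w  = w + eta*dw + alpha*dwLast;   % current adjustment plus alpha * delta_w(t-1)
    dwLast = dw;                      % remember this increment for the next iteration
end
disp(w)                               % w has converged to the minimum at 3

The full listing in Section 5 applies the same pattern to every weight matrix, e.g. wo = wo + (yita*dwo + alpha*dwoLast); dwoLast = dwo;
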
(2) Adaptively adjusting the learning rate
In the standard BP algorithm the learning rate $\eta$ is fixed, and it is hard to pick one value that is optimal from start to finish. In flat regions of the error surface a too-small $\eta$ increases the number of training iterations; in regions where the error changes drastically a too-large $\eta$ overshoots narrow "valleys", causing the training to oscillate and again increasing the number of iterations.
The solution is to adjust $\eta$ automatically. Choose an initial learning rate; if the total error increases after a batch of weight adjustments, the adjustment was ineffective, so set $\eta = \beta \times \eta$ with $\beta < 1$; if the total error decreases, the adjustment was effective, so set $\eta = \theta \times \eta$ with $\theta > 1$.

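This rule can be packaged as a small helper function; the name and signature below are assumptions, shown only as a sketch of the rule.

function eta = adjustLearningRate(eta, E, Elast, beta, theta)
% Adaptive learning-rate rule: shrink eta if the total error E grew after the
% last batch of weight adjustments, grow it if the error fell.
    if E > Elast
        eta = beta * eta;     % beta < 1: the last adjustment was ineffective
    else
        eta = theta * eta;    % theta > 1: the last adjustment was effective
    end
end

It would be called after every batch, for example yita = adjustLearningRate(yita, objfunc, objfuncLast, beta, theta); with the beta and theta values from the listing in Section 5.
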
(3) Introducing the steepness factor
The details depend on the activation function; for the sigmoid function:
The error surface contains flat regions, and weight adjustment enters a flat region when the net input of the output layer drives the activation function into its saturation region.
If, after entering a flat region, we compress the net input of the output layer so that its output leaves the saturation region, the shape of the error function changes and training can escape the flat region. To do this, introduce a steepness factor $\lambda$ into the activation function, $y = \frac{1}{1 + e^{-x/\lambda}}$, with $\lambda = 1$ initially.
When the objective function changes little but the error is still relatively large, training has entered a flat region; set $\lambda > 1$. After leaving the flat region, restore $\lambda = 1$.
$\lambda > 1$: the net-input axis is compressed by a factor of $\lambda$, so the sensitive section of the activation curve becomes longer and net inputs with large absolute value can exit the saturation region.
$\lambda = 1$: the activation function returns to its original shape, with high sensitivity to net inputs of small absolute value.

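The effect of $\lambda$ on the sigmoid can be seen with a short plotting sketch (the chosen $\lambda$ values are illustrative):

% Sigmoid with steepness factor lambda: lambda > 1 stretches the sensitive section
% so that net inputs of large absolute value leave the saturation region
sig = @(x, lambda) 1 ./ (1 + exp(-x ./ lambda));
x = -10:0.1:10;
plot(x, sig(x, 1), '-', x, sig(x, 3), '--');
legend('\lambda = 1', '\lambda = 3'); xlabel('net input'); ylabel('output');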

4. Matters needing attention

  • When programming, normalize the input and output data first, because the value range of the output-layer activation function may not match the range of the expected values (a min-max scaling sketch follows this list).
  • Each weight update uses the whole set of inputs together; do not update the weights once for every single input sample.
  • The surface produced on the training set may look different from the expected surface, but that does not mean the training failed; sometimes what you see is misleading, and the test-set output may still meet the requirements. Only after verification can you tell whether the training is correct.

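A common choice for the normalization is min-max scaling to [0, 1]; the sketch below uses toy data, and the variable names are illustrative.

% Min-max normalization sketch (toy data; implicit expansion needs MATLAB R2016b+)
u = [rand(1,100)*10; rand(1,100)*200];                    % two input rows with very different ranges
ytrue = sin(u(1,:))/2 + sin(u(2,:))/2;                    % targets
u_n = (u - min(u,[],2)) ./ (max(u,[],2) - min(u,[],2));   % each input row scaled to [0,1]
ymin = min(ytrue);  ymax = max(ytrue);
y_n = (ytrue - ymin) ./ (ymax - ymin);                    % targets scaled to [0,1]
% after training, map the network output back to the original range:
% y = y_n*(ymax - ymin) + ymin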

5. MATLAB code

% Training data set
row = 1;
for i=0:0.01:1
   for j=0:0.01:1
   u(1,row) = i;
   u(2,row) = j;
   ytrue(1,row) = sin(u(1,row))/2+sin(u(2,row))/2;
   row = row+1;
   end
end
row=row-1;

% Test data set
t_row=1;
for i=1:0.01:1.1
   for j=1:0.01:1.1
   test_u(1,t_row) = i;
   test_u(2,t_row) = j;
   testValue(1,t_row) = sin(test_u(1,t_row))/2+sin(test_u(2,t_row))/2;
   t_row = t_row+1;
   end
end
t_row=t_row-1;

% Network structure
inputnum = 2;     % number of input-layer nodes
hiddennum = 9;    % number of hidden-layer nodes
outputnum = 1;    % number of output-layer nodes

% Number of training iterations
count=1000;

% Parameters
alpha=0.95;    % momentum factor
beta=0.8;      % adaptive learning-rate decrease factor
theta=1.2;     % adaptive learning-rate increase factor
lambda=1;      % steepness factor
yita=0.0005;   % learning rate

% Training-set variables
hidein=zeros(hiddennum,row);
hideout=zeros(hiddennum,row);
yin=zeros(1,row);
yout=zeros(1,row);

% Test-set variables
testError=zeros(1,t_row);
testHidein=zeros(hiddennum,t_row);
testHideout=zeros(hiddennum,t_row);
testOut=zeros(1,t_row);

% Initialize weights and biases randomly in [-0.1, 0.1]
wi1=0.2*rand(hiddennum,1)-0.1;
wi2=0.2*rand(hiddennum,1)-0.1;
bi=0.2*rand(hiddennum,1)-0.1;
wo=0.2*rand(hiddennum,1)-0.1;
bo=0.2*rand-0.1;

% Previous weight increments (for the momentum term)
dwi1Last=zeros(hiddennum,1);
dwi2Last=zeros(hiddennum,1);
dbiLast=zeros(hiddennum,1);
dwoLast=zeros(hiddennum,1);
dboLast=zeros(1,1);

% Per-sample intermediate variables
perror=zeros(1,row);
pdwi1=zeros(hiddennum,row);
pdwi2=zeros(hiddennum,row);
pdbi=zeros(hiddennum,row);
pdwo=zeros(hiddennum,row);
pdbo=zeros(1,row);

% Objective function
pobjfunc=zeros(1,row);
objfuncLast=0;

for i=1:count    
    dwi1=zeros(hiddennum,1);
    dwi2=zeros(hiddennum,1);
    dbi=zeros(hiddennum,1);
    dwo=zeros(hiddennum,1);
    dbo=zeros(1,1);
    objfunc=0;
    sumAbsError=0;
    
    % Forward pass: outputs and the increment contributed by each sample
    for j=1:row
        hidein(:,j)=wi1*u(1,j)+wi2*u(2,j)+bi;
        hideout(:,j)=sigmoid(hidein(:,j),lambda);
        yin(j)=hideout(:,j)'*wo+bo;
        
        yout(j)=yin(j);
%         yout(j)=sigmoid(yin(j),lambda);
        
        perror(j)=ytrue(j)-yout(j);
        pobjfunc(j)=0.5*perror(j)*perror(j);
        
        pdwo(:,j)=perror(j)*hideout(:,j);
        pdbo(j)=perror(j);
%         pdwo(:,j)=perror(j)*yout(j)*(1-yout(j))*hideout(:,j);
%         pdbo(j)=perror(j)*yout(j)*(1-yout(j));
        
        pdwi1(:,j)=perror(j)*wo.*hideout(:,j).*(1-hideout(:,j))*u(1,j);%perror(j)*yout(j)*(1-yout(j))*wo.*hideout(:,j).*(1-hideout(:,j))*u(1,j);
        pdwi2(:,j)=perror(j)*wo.*hideout(:,j).*(1-hideout(:,j))*u(2,j);%perror(j)*yout(j)*(1-yout(j))*wo.*hideout(:,j).*(1-hideout(:,j))*u(2,j);
        pdbi(:,j)=perror(j)*wo.*hideout(:,j).*(1-hideout(:,j)); %perror(j)*yout(j)*(1-yout(j))*wo.*hideout(:,j).*(1-hideout(:,j)); 
%         pdwi1(:,j)=perror(j)*yout(j)*(1-yout(j))*wo.*hideout(:,j).*(1-hideout(:,j))*u(1,j);
%         pdwi2(:,j)=perror(j)*yout(j)*(1-yout(j))*wo.*hideout(:,j).*(1-hideout(:,j))*u(2,j);
%         pdbi(:,j)=perror(j)*yout(j)*(1-yout(j))*wo.*hideout(:,j).*(1-hideout(:,j)); 
    end
    
    % Sum the increments produced by all samples
    for j=1:row
       dwo=dwo+pdwo(:,j);
       dbo=dbo+pdbo(j);
       dwi1=dwi1+pdwi1(:,j);
       dwi2=dwi2+pdwi2(:,j);
       dbi=dbi+pdbi(:,j);
       objfunc=objfunc+pobjfunc(j);
       sumAbsError=sumAbsError+abs(perror(j));
    end
    
    % Average the increments
    dwo=dwo/row;
    dbo=dbo/row;
    dwi1=dwi1/row;
    dwi2=dwi2/row;
    dbi=dbi/row;
    
    % Adjust the steepness factor
    if(abs(objfunc-objfuncLast)<0.01 && sumAbsError/row>0.5)
        lambda=1.2;
    else
        lambda=1;
    end
    
    % Adjust the learning rate adaptively
    if(i>5)
        if(objfuncLast/row<objfunc/row)
            yita=beta*yita;
        else
            yita=theta*yita;
        end
    end
    objfuncLast=objfunc;

    % Update the weights (with momentum)
    wi1=wi1+(yita*dwi1+alpha*dwi1Last);
    dwi1Last=dwi1;
    wi2=wi2+(yita*dwi2+alpha*dwi2Last);
    dwi2Last=dwi2;
    bi=bi+(yita*dbi+alpha*dbiLast);
    dbiLast=dbi;
    wo=wo+(yita*dwo+alpha*dwoLast);
    dwoLast=dwo;
    bo=bo+(yita*dbo+alpha*dboLast);
    dboLast=dbo;
    
    % Evaluate on the test set
    maxTestError=0;
    for m=1:t_row
        testHidein(:,m)=wi1*test_u(1,m)+wi2*test_u(2,m)+bi;
        testHideout(:,m)=sigmoid(testHidein(:,m),1);
        testOut(m)=testHideout(:,m)'*wo+bo;
        testError(m)=(testValue(m)-testOut(m))/testValue(m);
        if(maxTestError<abs(testError(m)))
            maxTestError=abs(testError(m));
        end
    end
    if(maxTestError<0.05)
        break;
    end
    
end

% True function surface over the training range
[x,y] = meshgrid(0:0.1:1,0:0.1:1);
trainFunctionOutput =sin(x)/2+sin(y)/2;
figure(1)
mesh(x,y,trainFunctionOutput)
xlabel('x');
ylabel('y');
zlabel('trainFunctionOutput');

% Network output on the training set
t1=linspace(min(u(1,:)),max(u(1,:)),10);
t2=linspace(min(u(2,:)),max(u(2,:)),10);
[X,Y]=meshgrid(t1,t2);
trainOutput=griddata(u(1,:),u(2,:),yout,X,Y);
figure(2)
mesh(X,Y,trainOutput)
xlabel('Input1');
ylabel('Input2');
zlabel('trainOutput');

% True function surface over the test range
[tx,ty] = meshgrid(1:0.01:1.1,1:0.01:1.1);
testFunctionOutput = sin(tx)/2+sin(ty)/2;
figure(3)
mesh(tx,ty,testFunctionOutput)
xlabel('x');
ylabel('y');
zlabel('testFunctionOutput');

% Network output on the test set
t3=linspace(min(test_u(1,:)),max(test_u(1,:)));
t4=linspace(min(test_u(2,:)),max(test_u(2,:)));
[tX,tY]=meshgrid(t3,t4);
testOutput=griddata(test_u(1,:),test_u(2,:),testOut,tX,tY);
figure(4)
mesh(tX,tY,testOutput)
xlabel('Input1');
ylabel('Input2');
zlabel('testOutput');

% Plot the relative-error surface on the test set
t5=linspace(min(test_u(1,:)),max(test_u(1,:)));
t6=linspace(min(test_u(2,:)),max(test_u(2,:)));
[X1,X2]=meshgrid(t5,t6);
E=griddata(test_u(1,:),test_u(2,:),testError,X1,X2);
figure(5)
mesh(X1,X2,E)
xlabel('Input1');
ylabel('Input2');
zlabel('error');
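
Note: the listing calls a helper function sigmoid(x, lambda) whose definition is not shown above. A minimal version consistent with the steepness-factor formula in Section 3, saved as sigmoid.m, might look like this (an assumption about the missing helper, not the author's original file):

function y = sigmoid(x, lambda)
% Sigmoid activation with steepness factor lambda: y = 1 ./ (1 + exp(-x/lambda))
y = 1 ./ (1 + exp(-x ./ lambda));
end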

Source: blog.csdn.net/weixin_44941350/article/details/121961580