PyTorch fully connected network: the influence of the activation function on a one-dimensional fitting problem, and the problem of the loss not decreasing after deepening the network

    1. A brief discussion of the influence of the activation function on the fitting problem

       Recently I have been learning about fully connected networks, hoping to use them to solve fitting or interpolation problems: given some known scatter points, the network takes an x as input and outputs a corresponding y, and MSELoss is computed against the true values of the known points so that the prediction approaches them.

The structure of the network is very simple: two linear layers with an activation function in between (the code below uses ReLU):

import torch.nn as nn

class DNN(nn.Module):
    def __init__(self):
        super().__init__()
        layers = [1, 50, 1]                     # 1 input -> 50 hidden -> 1 output
        self.layer1 = nn.Linear(layers[0], layers[1])
        self.layer2 = nn.Linear(layers[1], layers[2])
        self.relu = nn.ReLU()                   # swap this for other activations
    def forward(self, d):
        d1 = self.layer1(d)
        d1 = self.relu(d1)                      # nonlinearity between the two layers
        d2 = self.layer2(d1)
        return d2
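For reference, a small variant that takes the activation module as a constructor argument makes it easy to repeat the experiment with LeakyReLU, ELU, Sigmoid, or Tanh. This is only my own sketch (the class name DNNAct and the act argument are not from the original post):

import torch.nn as nn

class DNNAct(nn.Module):
    # Same two-layer network, but the activation is injectable
    def __init__(self, act=None):
        super().__init__()
        layers = [1, 50, 1]
        self.layer1 = nn.Linear(layers[0], layers[1])
        self.layer2 = nn.Linear(layers[1], layers[2])
        self.act = act if act is not None else nn.ReLU()
    def forward(self, d):
        return self.layer2(self.act(self.layer1(d)))

# e.g. net = DNNAct(nn.Tanh()) or net = DNNAct(nn.ELU())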

The example I use is sin(x); from the 50 points sampled between -π and π, 10 are randomly selected as the training data [x, y]:

import numpy as np
import random
import matplotlib.pyplot as plt

x = np.linspace(-np.pi, np.pi).astype(np.float32)   # 50 points by default
y = np.sin(x)
# randomly pick ten points as training data
x_train = random.sample(x.tolist(), 10)
y_train = np.sin(x_train)
plt.scatter(x_train, y_train, c="r")
plt.plot(x, y)

The positions of the random points are as follows:

Some training parameters are set as follows: 10,000 iterations, a learning rate of 0.1, and only one sample from the training data is fed in at each step. Below are the training results for different activation functions:
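The post does not show the training loop itself, so the following is only a sketch of how such a loop could look under the stated settings (10,000 iterations, learning rate 0.1, one training point per step), using the DNN class and x_train/y_train defined above; the choice of plain SGD as the optimizer is my assumption:

import torch
import torch.nn as nn

net = DNN()                                    # the two-layer network defined above
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)   # optimizer choice is assumed

# nn.Linear(1, ...) expects inputs of shape (batch, 1)
xt = torch.tensor(x_train, dtype=torch.float32).reshape(-1, 1)
yt = torch.tensor(y_train, dtype=torch.float32).reshape(-1, 1)

for it in range(10000):
    i = it % len(xt)                           # feed one known point per iteration
    pred = net(xt[i:i+1])
    loss = criterion(pred, yt[i:i+1])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()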

From the above results we can draw a few rough observations (not entirely rigorous):

(1) Different activation functions have different effects on the output.

(2) With LeakyReLU, I am not sure why the prediction near the right boundary deviates so severely; this has a serious impact on the result.

(3) In terms of smoothness, the intuitive impression is that ELU and Sigmoid give smoother curves, while Tanh shows a kink whose location also depends on where the data points fall. ReLU's curve looks the most like straight-line segments spliced together.

(4) As for the quality of the fit, no activation function guarantees that the known points land exactly on the predicted curve; increasing the number of neurons might improve this. In this respect, Tanh performed slightly better.

2. The problem of network deepening

       The above discussed different activation functions on the simple two-layer network. Next I want to look at how deepening the network affects the results. For a fitting problem we want the predicted curve to be smooth and follow the overall trend, but for an interpolation problem we would rather see the known data points lie on the predicted curve. Since the loss will almost never reach exactly 0, we want it to converge as far as possible, even at the cost of overfitting.

So I set up a four-layer fully connected network and tried different activation functions for the nonlinear mappings in between (the code shows ELU):

import torch.nn as nn

class DNN(nn.Module):
    def __init__(self):
        super().__init__()
        layers = [1, 50, 25, 12, 1]             # four linear layers
        self.layer1 = nn.Linear(layers[0], layers[1])
        self.layer2 = nn.Linear(layers[1], layers[2])
        self.layer3 = nn.Linear(layers[2], layers[3])
        self.layer4 = nn.Linear(layers[3], layers[4])
        self.elu = nn.ELU()                     # activation between layers
    def forward(self, d):
        d1 = self.elu(self.layer1(d))
        d2 = self.elu(self.layer2(d1))
        d3 = self.elu(self.layer3(d2))
        d4 = self.layer4(d3)                    # no activation on the output layer
        return d4

The result is as follows:

       It can be seen that, regardless of which activation function is used, the predicted curve is completely wrong, and the loss stays essentially unchanged during training even though the weights are changing. This is clearly not a problem with the activation function itself; moreover, no matter what the input is on either side of 0, the output is basically the same value. This has always puzzled me. It may be caused by having too many neurons while the input is only a single scalar x.

       But when I thought the problem was simply that each layer had too many neurons, I reduced the layer widths and tried ReLU and Tanh as the activation functions; the result stayed the same, and even got worse.

Neurons in each layer:

layers = [1, 12, 6, 3, 1]
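For anyone wanting to reproduce the symptom, one quick sanity check is to look at whether the deep network really outputs an almost constant value across the input range, and whether any gradient still reaches the early layers. The snippet below is only an illustrative check I would run on the trained model, not code from the original post:

import numpy as np
import torch
import torch.nn as nn

net = DNN()                                    # replace with the trained model instance
xs = torch.linspace(-np.pi, np.pi, 50).reshape(-1, 1)

# 1) Is the output essentially constant over the whole input range?
with torch.no_grad():
    out = net(xs)
print("output min/max:", out.min().item(), out.max().item())

# 2) Do gradients still reach the first layer, or are they vanishingly small?
loss = nn.MSELoss()(net(xs), torch.sin(xs))
loss.backward()
for name, p in net.named_parameters():
    print(name, "grad norm:", p.grad.norm().item())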

I hope more experienced readers can suggest some possible reasons!!!

Origin blog.csdn.net/qq_43397591/article/details/126263933