Using sklearn to run experiments analyzing the impact of different optimizers and activation functions on the loss

Data preparation

The first step is to generate a data set with the make_moons function. Its first parameter is n_samples, which can be an integer or an array-like value similar to a list; if an integer is given, that many samples are generated in total. The default is 100, and the program also sets it to 100, so 100 data points are generated. Each generated sample has two features, so the points lie in a two-dimensional plane. The noise parameter adds Gaussian noise to the generated data; it is off by default and is set to 0.25 in the program. random_state is the random seed, used to make the experimental results reproducible. The function's first return value is the sample data and its second return value is the corresponding label for each sample.

from sklearn.datasets import make_moons

# two_moons dataset
x, y = make_moons(n_samples=100, noise=0.25, random_state=3)
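
To check the shape of the generated data, the points can be drawn with a quick scatter plot. This visualization step is not part of the original experiment; it is only a sketch:

import matplotlib.pyplot as plt
from sklearn.datasets import make_moons

x, y = make_moons(n_samples=100, noise=0.25, random_state=3)
plt.scatter(x[:, 0], x[:, 1], c=y, cmap="coolwarm", edgecolors="k")  # color points by label
plt.xlabel("feature 1")
plt.ylabel("feature 2")
plt.title("two_moons, n_samples=100, noise=0.25")
plt.show()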

Model training

After generating the data set required for the experiment, two lists are created: one stores the optimization methods (adam and sgd), the other stores the activation functions (logistic, tanh and relu). An MLP is then built with MLPRegressor, sklearn's multi-layer perceptron regressor (MLPClassifier accepts the same parameters). Its parameters include hidden_layer_sizes, which sets the size of each hidden layer; solver, the solver used for weight optimization, whose optional values are lbfgs, sgd and adam, with adam as the default; activation, the activation function, whose optional values are identity, logistic, tanh and relu, with relu as the default; and max_iter, the maximum number of iterations, with a default value of 200.
The network in the code has two hidden layers with 10 neurons each. The experiment then takes the Cartesian product of the two lists, that is, every pairwise combination of an optimizer and an activation function, trains a model for each combination on the same data set, and plots the resulting loss curves with matplotlib to show the influence of the two optimization methods and three activation functions on the loss.

mlp = MLPRegressor(max_iter=300,
                   solver=opt,          # options: adam, lbfgs, sgd; lbfgs does not expose loss_curve_
                   activation=act_fun,  # activation function: identity, logistic, tanh, relu
                   hidden_layer_sizes=[10, 10],  # two hidden layers with 10 hidden units each
                   random_state=0, verbose=False)
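
The snippet above only builds the model; opt and act_fun come from the surrounding loop. A minimal sketch of the full experiment loop described earlier might look like this (the loop structure, list names and plotting layout are assumptions, not code from the original post):

import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPRegressor

x, y = make_moons(n_samples=100, noise=0.25, random_state=3)

optimizers = ['adam', 'sgd']               # lbfgs is skipped because it has no loss_curve_
activations = ['logistic', 'tanh', 'relu']

for opt in optimizers:
    for act_fun in activations:
        mlp = MLPRegressor(max_iter=300,
                           solver=opt,
                           activation=act_fun,
                           hidden_layer_sizes=[10, 10],
                           random_state=0, verbose=False)
        mlp.fit(x, y)
        # loss_curve_ holds the training loss at every iteration
        plt.plot(mlp.loss_curve_, label=f"{opt} + {act_fun}")

plt.xlabel("iteration")
plt.ylabel("training loss")
plt.legend()
plt.show()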

Result analysis

The results are shown in the figure below
[Figure: training loss curves for each optimizer and activation-function combination]
(1) The loss curves of the models generally show a downward trend that is fast at first and then slows down; after falling to a certain value, training reaches a state of convergence. Convergence means that the value of the loss function has basically stabilized: it changes only within a small range and is almost constant, and the model's performance under that configuration has reached its best. In general, once the loss function reaches its minimum, the optimization process on the data set ends.
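
Inside the loop sketched above, convergence can also be checked numerically on each fitted model: n_iter_ gives the number of iterations actually run and loss_ gives the final training loss (both are standard attributes of a fitted MLPRegressor):

# after mlp.fit(x, y) inside the loop above:
print(f"{opt} + {act_fun}: stopped after {mlp.n_iter_} iterations, final loss = {mlp.loss_:.4f}")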

(2) From the experimental results, the combination of adam and tanh converges best: it converges fastest and reaches the smallest loss value at convergence. The loss function measures the degree of inconsistency between the actual values and the predicted values, but a smaller loss value is not automatically better; the speed of convergence must also be considered. If one algorithm reaches a slightly smaller loss than another but takes much longer to do so, the faster method with an essentially equal loss value is usually preferred. Another situation is overfitting, where the loss on the training set drops to 0, meaning every predicted value equals the actual value, yet the predictions on the test set are poor; such a result is not a good optimization either.

(3) The loss on the test set is also meaningful, although its role is more limited. Comparing the test-set loss with the training-set loss helps judge the state of the trained model: if the training loss keeps decreasing while the test loss has stopped decreasing, the model is likely overfitting; if the test loss keeps increasing, there may be a problem with the data set itself. So the loss value on the test set is still worth watching.
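
The original experiment only tracks the training loss; a rough way to compare training and test loss, assuming a simple train/test split, could look like this:

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.neural_network import MLPRegressor

x, y = make_moons(n_samples=100, noise=0.25, random_state=3)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

mlp = MLPRegressor(max_iter=300, solver='adam', activation='tanh',
                   hidden_layer_sizes=[10, 10], random_state=0)
mlp.fit(x_train, y_train)

# compare the fit on the data used for training with the fit on held-out data
print("train MSE:", mean_squared_error(y_train, mlp.predict(x_train)))
print("test  MSE:", mean_squared_error(y_test, mlp.predict(x_test)))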

(4) Model performance is usually evaluated with the loss curve on the training set and the accuracy on the test set. The loss function measures the inconsistency between the predicted values and the actual values; by optimizing it, the parameters of the model are adjusted continuously so that the loss keeps decreasing and the predictive performance of the model keeps improving. The loss function here is generally non-linear and can be optimized iteratively with gradient descent to approach its minimum, so it is most informative on the training set. Accuracy, on the other hand, is the rate of agreement between actual and predicted values, that is, the number of correct predictions divided by the total number of predictions, so it is more informative on the test set.
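
As an illustration of this train-loss / test-accuracy evaluation, a sketch using MLPClassifier (rather than the MLPRegressor used in the experiment above) together with sklearn's accuracy_score might look like this:

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neural_network import MLPClassifier

x, y = make_moons(n_samples=100, noise=0.25, random_state=3)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

clf = MLPClassifier(max_iter=300, solver='adam', activation='tanh',
                    hidden_layer_sizes=[10, 10], random_state=0)
clf.fit(x_train, y_train)

print("final training loss:", clf.loss_)                               # training-set loss
print("test accuracy:", accuracy_score(y_test, clf.predict(x_test)))   # correct / total predictions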


Source: blog.csdn.net/qq_48068259/article/details/127881868