Machine Learning Experiment 3: Parameter Tuning and Optimization Algorithms

Machine Learning Lab Report

Lab Title: Optimization Algorithms

1. Purpose of the experiment:

1. Master the basic framework of iterative optimization algorithms
2. Master the stochastic gradient descent and coordinate descent algorithms

2. Experimental steps:

1. Stochastic gradient descent:
① Ridge regression
 Momentum method
 Learning rate adaptation
② Logistic regression (L2 regularization)
 Two-class classification
 Multi-class classification
2. Coordinate descent:
 Lasso regression
[figure]

3. Experimental results:

1. Ridge regression:
[figure]

Experiment code:
[code screenshot]
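Since the experiment code is only available as a screenshot, here is a minimal sketch of what a mini-batch SGD loop for ridge regression plausibly looks like, assuming the squared loss plus an L2 penalty. The name ridge_sgd, the toy data, and w_true are illustrative assumptions; eta, lmbd, epoch, and batch mirror the hyperparameters discussed in this report. Note that the analysis below suggests the lab code updates w only once per epoch using the last batch's gradient, whereas this sketch shows the more conventional per-batch update.

```python
import numpy as np

def ridge_sgd(x, y, eta=0.01, lmbd=0.1, epoch=20, batch=32, seed=0):
    """Mini-batch SGD for ridge regression: MSE + (lmbd/2)*||w||^2 (a sketch)."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    w = rng.normal(size=d)                 # w initialized from a normal distribution
    for e in range(epoch):
        idx = rng.permutation(n)           # shuffle the samples each epoch
        for start in range(0, n, batch):
            b = idx[start:start + batch]
            # gradient of the regularized squared loss on this batch
            g = x[b].T @ (x[b] @ w - y[b]) / len(b) + lmbd * w
            w -= eta * g                   # gradient step
        print(f"epoch {e + 1}: g = {g}, w = {w}")
    return w

# toy data, assuming the lab generates something similar at random
rng = np.random.default_rng(1)
x = rng.normal(size=(200, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = x @ w_true + 0.1 * rng.normal(size=200)
ridge_sgd(x, y, eta=0.01, lmbd=0.1, epoch=20, batch=32)
```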

① Observe the initial value of w and how g and w change after each epoch:

The weights w are initialized by drawing from a normal distribution, so their initial values carry no particular meaning:
[output screenshot]

Change the value of epoch and observe the changes of g and w values:

Running result (epoch=20):
[output screenshots]

Change epoch=10:

Modify the code:
[code screenshot]

Running result (epoch=10):
[output screenshot]

Result analysis:
In each epoch, the code iterates over the entire dataset once. During this pass it computes the gradient g on each batch, and only after all batches have been processed does it update the weight w according to the gradient descent update rule.
Since the code only computes the gradient g on the last batch before updating the weight w, the changes in g and w after each epoch actually depend only on the data of that last batch. Because the dataset is randomly generated, the last batch is also random, so the changes in g and w from epoch to epoch look random as well. In this example, the learning rate eta used for the weight update is so small that it produces no visible change in w.
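To make this behavior concrete, here is a variant of the earlier sketch in which w is touched only once per epoch, so only the last batch's gradient matters. This is an assumed reconstruction for illustration, not the screenshot code.

```python
import numpy as np

def ridge_sgd_last_batch(x, y, w, eta=0.001, lmbd=0.1, epoch=20, batch=32, seed=0):
    """Variant where w is updated once per epoch, after the batch loop has finished."""
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    for e in range(epoch):
        idx = rng.permutation(n)
        for start in range(0, n, batch):
            b = idx[start:start + batch]
            # g is recomputed on every batch but not used until the loop ends...
            g = x[b].T @ (x[b] @ w - y[b]) / len(b) + lmbd * w
        # ...so only the last batch's gradient influences w, and with a
        # very small eta the change in w per epoch is barely visible
        w = w - eta * g
        print(f"epoch {e + 1}: g = {g}, w = {w}")
    return w
```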

② Adjust the hyperparameters and observe how different settings affect the changes of g and w:

Increase the learning rate to 0.1 and change the epoch to 10:

Modify the code:
[code screenshot]

Running result:
[output screenshot]

Change the L2 regularization parameter lambda to 0.01:
Modify the code:
[code screenshot]

Running result:
[output screenshot]

Change the L2 regularization parameter lambda to 0:

Modify the code:

[code screenshot]

Running result:
[output screenshot]

Result analysis:
Different hyperparameters affect the changes of g and w in different ways. Taking the learning rate eta as an example, a larger value makes each update of w larger, so the model approaches a suitable solution faster; but if the learning rate is set too large, training becomes unstable and may even suffer from exploding or vanishing gradients. In this example, raising the learning rate to 0.1 speeds up the updates of w, but it may also break the convergence of the model.
The L2 regularization parameter lambda affects model complexity and generalization. If lambda is too small, the model tends to overfit the data; if lambda is too large, the model is over-smoothed and fails to capture the structure of the dataset. In this example, setting lambda to 0 removes regularization entirely, which may lead to overfitting, while setting lambda too large may over-smooth the model and cause underfitting.
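As a quick illustration, reusing the hypothetical ridge_sgd function and toy data from the sketch above, one can sweep eta and lmbd over a small grid and compare the learned weights; larger eta makes the per-epoch prints jump around more, while larger lmbd shrinks the learned weights toward zero.

```python
# compare a few settings of eta and lmbd (assumes ridge_sgd, x, y from the sketch above)
for eta in (0.001, 0.01, 0.1):
    for lmbd in (0.0, 0.01, 0.1):
        w_hat = ridge_sgd(x, y, eta=eta, lmbd=lmbd, epoch=10, batch=32)
        print(f"eta={eta}, lmbd={lmbd}: w = {w_hat}")
```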

2. Ridge regression - momentum method

[figure]

Experiment code:
[code screenshot]
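Since this code is also only a screenshot, below is a minimal sketch of classic momentum applied to the same ridge objective. The name ridge_sgd_momentum and the momentum coefficient beta are assumed; the update v = beta*v + g, w -= eta*v is one common formulation of the momentum method.

```python
import numpy as np

def ridge_sgd_momentum(x, y, eta=0.01, lmbd=0.1, beta=0.9, epoch=20, batch=32, seed=0):
    """Mini-batch SGD with momentum for ridge regression (a sketch, not the screenshot code)."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    w = np.zeros(d)          # w: one entry per feature of x
    v = np.zeros(d)          # v: velocity (accumulated gradient), initialized to zeros
    for e in range(epoch):
        idx = rng.permutation(n)
        for start in range(0, n, batch):
            b = idx[start:start + batch]
            g = x[b].T @ (x[b] @ w - y[b]) / len(b) + lmbd * w
            v = beta * v + g        # accumulate past gradients into the velocity
            w -= eta * v            # step along the smoothed direction
        print(f"epoch {e + 1}: g = {g}, v = {v}, w = {w}")
    return w
```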

Running result:
[output screenshot]

① Observe the initial values of v and w, and how g, v, and w change after each epoch:

v is initialized as a numpy array of zeros, and w is initialized as a numpy array whose length equals the number of features of x.
[output screenshot]

After each epoch, the values of g, v, and w change. Here g is the gradient of the loss function at the current weights w, v is the accumulated velocity vector that the momentum method uses to help the optimizer escape local minima, and w is the parameter estimate at the current epoch.
After each epoch, g and v take different values. Specifically, g changes with the sampled data and the parameter updates, while v is accumulated from g and the previous momentum value at every iteration; it also changes gradually as the epochs progress, but its overall trend is either stable or a slowly decaying oscillation. The parameter w is updated after each epoch, so its value changes accordingly.

② The difference between how g and v change after each epoch:

Printing g, v, and w after each epoch shows that g gradually converges as the iterations proceed; v goes through a rapid "acceleration" phase at the beginning and then gradually stabilizes; and w, the parameter estimate, is updated after every round of iteration and therefore changes a little each round.
The decreasing value of g indicates that the parameters are moving in the right direction. The value of v grows quickly at first and then levels off, which shows that the momentum method can quickly adjust the step size when the gradient changes sharply and accumulates historical gradient information to avoid getting stuck in a local minimum. The value of w gradually approaches the global optimum.

3. Ridge regression - learning rate adaptation

[figure]

Experiment code:
[code screenshots]
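A minimal sketch of the Adagrad-style adaptive learning rate described below, again assuming the same ridge objective. The name ridge_sgd_adagrad is assumed; eps is a small constant added for numerical stability, and the per-coordinate step eta / (sqrt(v) + eps) is the standard Adagrad rule.

```python
import numpy as np

def ridge_sgd_adagrad(x, y, eta=0.1, lmbd=0.1, eps=1e-8, epoch=20, batch=32, seed=0):
    """Mini-batch Adagrad for ridge regression (a sketch of the adaptive-learning-rate variant)."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    w = np.zeros(d)          # parameter vector, one entry per feature of x
    v = np.zeros(d)          # running sum of squared gradients
    for e in range(epoch):
        idx = rng.permutation(n)
        for start in range(0, n, batch):
            b = idx[start:start + batch]
            g = x[b].T @ (x[b] @ w - y[b]) / len(b) + lmbd * w
            v += g ** 2                          # accumulate squared gradients
            w -= eta * g / (np.sqrt(v) + eps)    # per-coordinate scaled step
        print(f"epoch {e + 1}: g = {g}, v = {v}, w = {w}")
    return w
```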

Running result:
[output screenshots]

① Observe the initial values of v and w, and how g, v, and w change after each epoch:

v is initialized as a numpy array of zeros, and w is initialized as a numpy array whose length equals the number of features of x.
[output screenshot]

g is the gradient of the loss function averaged over a mini-batch of data. For every batch, g is computed and w is updated. Since each epoch contains x.shape[0] / batch batches of training data, at the end of each epoch the value of g is the average of the g values over all batches.
v is the accumulated sum of squared historical gradients that Adagrad uses to adapt the learning rate. On every batch, v is updated with the current gradient: the squared gradient of the current batch is first added to v, w is then updated, and the quantities needed for the next step are computed. Within each epoch, v therefore grows as the squared gradients of the batches accumulate.
w is the parameter vector of the model and is updated on every batch. At the end of each epoch, w represents the best parameter estimate obtained after the current epoch of training. In the code above, w is initialized as a vector of all zeros, and the gradient information from each batch is used to update it so that it gradually approaches the optimal solution.

② The difference between how g and v change after each epoch:

The experimental results show that g and v are recomputed after every epoch, and their values at the end of the current epoch differ from their values at the end of the previous one; the specific values of g and v are therefore different after each epoch.
Compared with g, the change in v is smoother, because v accumulates the sum of squared historical gradients and only a portion is added in each epoch; as the number of epochs grows, this accumulation becomes more pronounced. g, on the other hand, is recomputed every epoch, so its changes are more abrupt than those of v.

4. Lasso regression

[figure]

Experiment code:
[code screenshots]
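The Lasso code is only available as screenshots, so here is a minimal coordinate-descent sketch for the objective (1/2n)*||y - Xw||^2 + lmbd*||w||_1. The soft_threshold helper, the name lasso_cd, and the sparse toy data are assumptions; the variable c plays the role of the per-coordinate quantity discussed below, though its exact definition in the screenshot code may differ.

```python
import numpy as np

def soft_threshold(c, lmbd):
    """Soft-thresholding operator used by the Lasso coordinate update."""
    if c > lmbd:
        return c - lmbd
    if c < -lmbd:
        return c + lmbd
    return 0.0

def lasso_cd(x, y, lmbd=0.1, epoch=50):
    """Coordinate descent for Lasso: (1/2n)*||y - Xw||^2 + lmbd*||w||_1 (a sketch)."""
    n, d = x.shape
    w = np.zeros(d)                          # w starts as an all-zero array
    for e in range(epoch):
        for j in range(d):
            # partial residual that excludes feature j's current contribution
            r = y - x @ w + x[:, j] * w[j]
            c = x[:, j] @ r / n              # correlation of feature j with the residual
            a = x[:, j] @ x[:, j] / n
            w[j] = soft_threshold(c, lmbd) / a
        print(f"epoch {e + 1}: c = {c:.4f}, w = {w}")
    return w

# toy data with a sparse true weight vector (an assumption for illustration)
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 3))
w_true = np.array([2.0, 0.0, -1.0])
y = x @ w_true + 0.1 * rng.normal(size=200)
w_hat = lasso_cd(x, y, lmbd=0.1)
```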

Running result:
[output screenshots]

① Observe the initial value of w and how c and w change after each epoch:

w is initialized as an array of all zeros of length 3:
[output screenshot]

According to the experimental results, each epoch performs one full pass over the training set, and after each epoch the values of c and w change. Specifically, in the inner loop of an epoch, c represents the gradient of the loss function with respect to the j-th parameter w[j] at the current w. Based on this derivative, the code decides in which direction w[j] should be updated (increased or decreased) and stores the updated value back into w[j]. As a result, w gradually approaches the true weights w_true.

② Adjust the value of lmbd and observe how different lmbd values affect the optimized w:

The experimental results show that adjusting lmbd is effectively adjusting the regularization strength. When lmbd is larger, the model is forced to stay simple, so the optimized w ends up close to 0; when lmbd is smaller, more model complexity is allowed, so the optimized w is closer to the true weights. As the regularization strength increases, each component of w gradually shrinks toward 0. When lmbd takes a small value, w is closer to the true weights than in the other cases, probably because a smaller lmbd lets the model be more complex and therefore fit the data better. A small sweep over lmbd, sketched below, reproduces this trend.
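The sweep below reuses the hypothetical lasso_cd sketch and toy data from above: larger lmbd shrinks w toward zero, while smaller lmbd brings it closer to w_true.

```python
# compare the optimized w for a few regularization strengths
# (assumes lasso_cd, x, y, and w_true from the sketch above)
for lmbd in (0.01, 0.1, 1.0):
    w_hat = lasso_cd(x, y, lmbd=lmbd, epoch=50)
    print(f"lmbd={lmbd}: w = {w_hat}  (w_true = {w_true})")
```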

③ Challenge task: the given algorithm uses all samples in each iteration; modify it so that each update of w[j] uses only a small batch of samples.

Modified code:
[code screenshots]

Running result:
[output screenshots]

Idea analysis:
The main modification is to shuffle the samples at each iteration, split them into several mini-batches, and use one mini-batch at a time to update the parameters, as sketched below. Concretely, two additional nested loops are added: the j loop traverses the mini-batches, and the k loop traverses the parameters and updates each of them on the current mini-batch. Because each update uses only a small batch, every sample is used multiple times over the whole run, which can help reduce overfitting.
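Below is a minimal sketch of the modification described above, assuming the same soft-thresholding update as before but computed on one mini-batch at a time. The name lasso_cd_minibatch and its parameters are illustrative, not the screenshot code.

```python
import numpy as np

def lasso_cd_minibatch(x, y, lmbd=0.1, epoch=50, batch=32, seed=0):
    """Coordinate descent where each w[k] is updated from one mini-batch (a sketch)."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    w = np.zeros(d)
    for e in range(epoch):
        idx = rng.permutation(n)                  # shuffle the sample order each epoch
        for start in range(0, n, batch):          # j loop: iterate over mini-batches
            b = idx[start:start + batch]
            xb, yb = x[b], y[b]
            for k in range(d):                    # k loop: update every coordinate on this batch
                r = yb - xb @ w + xb[:, k] * w[k]
                c = xb[:, k] @ r / len(b)
                a = xb[:, k] @ xb[:, k] / len(b)
                if a > 0:                         # guard against a degenerate batch column
                    w[k] = (np.sign(c) * max(abs(c) - lmbd, 0.0)) / a
        print(f"epoch {e + 1}: w = {w}")
    return w
```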

5. Logistic regression - two-class classification

Experiment code:
[code screenshots]
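The logistic-regression code is also only shown as screenshots. The sketch below assumes mini-batch SGD on the L2-regularized cross-entropy loss with labels in {0, 1}, with mu denoting the predicted probabilities as in the report; the function and data names are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logreg_sgd(x, y, eta=0.1, lmbd=0.01, epoch=50, batch=32, seed=0):
    """Mini-batch SGD for L2-regularized logistic regression (a sketch, not the screenshot code)."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    w = rng.normal(size=d)                 # w drawn from a normal distribution, as in the report
    for e in range(epoch):
        idx = rng.permutation(n)
        for start in range(0, n, batch):
            b = idx[start:start + batch]
            mu = sigmoid(x[b] @ w)                          # predicted probabilities
            g = x[b].T @ (mu - y[b]) / len(b) + lmbd * w    # gradient of cross-entropy + L2
            w -= eta * g
        print(f"epoch {e + 1}: mu[:3] = {mu[:3]}, g = {g}, w = {w}")
    return w

# toy two-class data (an assumption for illustration)
rng = np.random.default_rng(2)
x = rng.normal(size=(300, 3))
w_true = np.array([1.5, -2.0, 0.5])
y = rng.binomial(1, sigmoid(x @ w_true)).astype(float)
logreg_sgd(x, y)
```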

Running result:
[output screenshot]

① Observe the initial value of w and how mu, g, and w change after each epoch:

w is drawn from a normal distribution, so its initial value is random. From the experimental results, it can be concluded that:
With a small epoch count, the model may not fully learn the characteristics of the dataset, resulting in underfitting; the changes in mu, g, and w after each iteration tend to be small.
With a large epoch count, the model may overfit the dataset, resulting in poor generalization; the changes in mu, g, and w after each iteration can be large.

② Adjust the value of lmbd and observe how different lmbd values affect the optimized w:

From the experimental results, it can be concluded that when lmbd = 0 there is no regularization term, so the model may overfit: the optimized w leans toward fitting the training data, and prediction performance on new data may be poor.
lmbd is usually given a small value. As lmbd increases, the regularization penalty gradually takes effect and the optimized w shrinks, but we do not want it to shrink so much that the model underfits. When training with different lmbd values, a balance point therefore has to be found.

4. Experimental experience

Optimization algorithms play a vital role in machine learning: they tune the model's parameters so that it performs better on both the training and test sets, improving its generalization ability.
In ridge regression, an L2 regularization term is added to limit the size of the parameters; it effectively prevents overfitting and improves generalization. Momentum and adaptive-learning-rate methods are common techniques for speeding up training. The momentum method introduces a momentum variable and uses gradient information from previous iterations to update the parameters, smoothing the update direction and preventing the parameters from oscillating back and forth; adaptive-learning-rate methods adjust the learning rate according to the current state of the parameters, which alleviates the need to set the learning rate by hand in plain gradient descent.
Lasso is another regularization method: it introduces an L1 regularization term to limit the size of the parameters and drives some of them exactly to 0, thereby performing feature selection. Compared with ridge regression, Lasso tends to produce sparse solutions in which only a few parameters are non-zero, which makes it very effective on high-dimensional data.
For binary classification, logistic regression is a commonly used method. It applies the logistic function to map predictions to probabilities between 0 and 1 so that samples can be classified. During optimization, we usually update the parameters with stochastic gradient descent and use cross-entropy as the loss function; cross-entropy measures the gap between the predicted and true values and thereby guides the optimization of the model.
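To make the loss mentioned above concrete, here is a small helper for the binary cross-entropy with an optional L2 penalty (an illustrative sketch, not taken from the lab code):

```python
import numpy as np

def cross_entropy(y, mu, w=None, lmbd=0.0, eps=1e-12):
    """Binary cross-entropy, optionally with an L2 penalty on the weights w."""
    mu = np.clip(mu, eps, 1.0 - eps)       # avoid log(0)
    loss = -np.mean(y * np.log(mu) + (1.0 - y) * np.log(1.0 - mu))
    if w is not None:
        loss += 0.5 * lmbd * np.sum(w ** 2)
    return loss
```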
By applying and comparing different optimization algorithms in this experiment, I gained a deeper understanding of the mathematical principles and implementation details behind each of them. I also found that the effectiveness of an optimization algorithm varies widely across problems and datasets, so when optimizing a model one needs to choose an appropriate algorithm for the situation and tune its parameters accordingly.
