Efficient Model Training - Parameter Count and Hyperparameter Tuning

Author: Zen and the Art of Computer Programming

1 Introduction

As deep learning has grown in popularity, more and more researchers and engineers have turned their attention to how to train neural networks effectively. Although deep learning models have achieved impressive results on many tasks, a well-trained model often requires an enormous number of parameters and a large amount of computation, which severely limits how widely it can be deployed. This article focuses on two key aspects of training performance, the number of parameters and the hyperparameters, examines parameter-optimization methods used during training, and looks for parameter counts and hyperparameter settings that maximize the performance of the trained model.

1.1 Parameters and Hyperparameters

First, we define what we mean by parameters and hyperparameters.

  • Parameters: The values optimized during training, typically the weights and biases of the model. A modern deep learning model may have billions or even tens of billions of parameters, and the parameter count directly affects the model's fitting ability, generalization ability, and convergence speed (a short code sketch follows this list).
  • Hyperparameters: Settings that are fixed before training and are not updated by the optimizer, such as the learning rate, regularization coefficient, batch size, number of iterations, and choice of activation function. Good hyperparameter values depend on the data set, model architecture, hardware, and other environmental factors, so different data sets and model structures generally call for different hyperparameter settings.
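The distinction can be made concrete with a minimal sketch. It assumes PyTorch is available; the small model and the hyperparameter values below are illustrative examples, not taken from the article.

```python
import torch
import torch.nn as nn

# Parameters: weights and biases that the optimizer updates during training.
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {num_params}")  # 784*256 + 256 + 256*10 + 10 = 203,530

# Hyperparameters: chosen before training and never updated by the optimizer.
hyperparams = {
    "learning_rate": 1e-3,
    "weight_decay": 1e-4,   # regularization coefficient
    "batch_size": 64,
    "num_epochs": 10,
}
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=hyperparams["learning_rate"],
    weight_decay=hyperparams["weight_decay"],
)
```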

1.2 Defects of Gradient Descent Algorithm

The traditional gradient descent algorithm has several significant disadvantages:

  • Local minima and saddle points: at a local minimum or saddle point the gradient is (nearly) zero, so the optimizer makes little or no progress and may stall there, leaving the model with poor performance (see the sketch after this list).
  • Difficulty with non-convex objectives: the classical convergence guarantees of gradient descent apply to convex functions. For the non-convex loss surfaces typical of deep networks, a fixed step size does not guarantee convergence to a good solution.
  • No guarantee of a global optimum: although some methods try to approach the global optimum by collecting and comparing multiple local optima, there is no guarantee that all relevant local optima can be found.
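The saddle-point problem can be seen in a toy example (this sketch is illustrative and not from the article): on f(x, y) = x^2 - y^2, the point (0, 0) is a saddle, not a minimum, yet plain gradient descent started at y = 0 converges straight to it.

```python
def grad(x, y):
    # f(x, y) = x**2 - y**2  ->  df/dx = 2x, df/dy = -2y
    return 2.0 * x, -2.0 * y

x, y = 1.0, 0.0   # start exactly on the unstable direction (y = 0)
lr = 0.1          # learning rate, a hyperparameter

for step in range(100):
    gx, gy = grad(x, y)
    x -= lr * gx
    y -= lr * gy

# Converges to ~(0, 0): the optimizer is stuck at the saddle point,
# even though f is unbounded below along the y direction.
print(x, y)
```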
