How to use configuration file parameters to customize pre-trained model training


Introduction

Pre-trained models have achieved remarkable results across many fields, but a standard pre-trained model may not fully meet individual needs. To achieve better performance, researchers and developers need to customize the training of pre-trained models. In the past, modifying the model architecture, adjusting the data processing pipeline, optimizing the training strategy, or changing the runtime settings usually required editing the source code, which is challenging for most users. By modifying configuration file parameters instead, we can flexibly adjust every aspect of a pre-trained model and thus enable personalized training.

This article discusses how to customize the training of a pre-trained model by modifying the parameters in its configuration file. We will introduce the basic structure of a configuration file and the meaning of its parameters, explain the method and steps for modifying them, and show the impact of parameter changes on a pre-trained model through a real example. With this approach, users can flexibly modify the model architecture, adjust data processing, optimize the training strategy, and change runtime settings according to their own needs and task characteristics to achieve better performance and results.

Why use configuration files to pre-train models

To manage the many settings of a deep learning experiment, we record them all in configuration files. Such a configuration system is typically modular and supports inheritance.

Using configuration files for model training brings important advantages in management, flexibility, maintainability, reproducibility, and shareability. A configuration file provides a convenient way to record and manage every setting of a deep learning experiment, making customized training and experiment management more efficient and reliable.

With a configuration file, we can avoid touching the source code: to modify the model, we only need to change its parameters in the configuration file and add the corresponding model blocks. Compared with the most old-fashioned approach to pre-training, which only influences the model through the dataset, editing the configuration file can change the essence of the model and make it fit your needs better.

Better still, the configured model can be visualized with the Netron tool; at that point, how is this any different from playing with building blocks?

Configuration file structure

The configuration file structure can be roughly divided into the following four parts:

  • Model (model) part:

Model architecture: including the hierarchical structure of the model, parameter configuration of each layer, selection of activation function, etc.
Model parameters: including learning rate, optimizer type, regularization parameters and other parameter settings related to model training.

  • Data (data) part:

Dataset path: Specify the path to the dataset used for training and validation.
Data preprocessing: Settings for preprocessing operations such as data cleaning, data standardization, and data augmentation.
Data division: Specify how the training, validation, and test sets are split and in what proportions.

  • Training strategy (schedule) part:

Learning rate strategy: Set the initial value of the learning rate, the decay method, the number of decay steps, etc.
Regularization Strategy: Set the regularization method and parameters.
Batch size: Specifies the number of samples for each training batch.
Training Iterations: Set the total number of iterations for model training.

  • Run settings (runtime) part:

Training device: Specify which hardware device to train on, such as CPU, GPU, etc.
Logging: Set the save path and format of the log file.
Model save: Specify the save path and save frequency of model parameters during training.
Visualization tool: Select a visualization tool to monitor how metrics change during model training.
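
As an abstract sketch, a configuration file organized around these four parts might look like the following (the keys and values are illustrative only, not any particular framework's schema):

model:
  name: ResNet50
  num_classes: 1000
data:
  dataset_path: ./dataset/
  train_split: 0.8
  augmentations: [random_crop, random_flip, normalize]
schedule:
  epochs: 120
  batch_size: 64
  lr: 0.1
  lr_decay: piecewise
  weight_decay: 0.0001
runtime:
  device: gpu
  log_dir: ./logs/
  save_interval: 1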

This approach of configuring model pre-training has one disadvantage: it depends on the configuration format specified by each framework. Although the frameworks are all built on torch-like foundations underneath, they combine the pieces differently.

Below I use a PaddlePaddle ResNet50 configuration file to illustrate the meaning of the different configuration sections.
Taking the PaddleClas suite as an example, the configuration file is as follows:

# global configs
Global:
  checkpoints: null
  pretrained_model: null
  output_dir: ./output/
  device: gpu
  save_interval: 1
  eval_during_train: True
  eval_interval: 1
  epochs: 120
  print_batch_step: 10
  use_visualdl: False
  # used for static mode and model export
  image_shape: [3, 224, 224]
  save_inference_dir: ./inference
  # training model under @to_static
  to_static: False

# model architecture
Arch:
  name: ResNet50
  class_num: 1000
 
# loss function config for training/eval process
Loss:
  Train:
    - CELoss:
        weight: 1.0
  Eval:
    - CELoss:
        weight: 1.0


Optimizer:
  name: Momentum
  momentum: 0.9
  lr:
    name: Piecewise
    learning_rate: 0.1
    decay_epochs: [30, 60, 90]
    values: [0.1, 0.01, 0.001, 0.0001]
  regularizer:
    name: 'L2'
    coeff: 0.0001


# data loader for train and eval
DataLoader:
  Train:
    dataset:
      name: ImageNetDataset
      image_root: ./dataset/ILSVRC2012/
      cls_label_path: ./dataset/ILSVRC2012/train_list.txt
      transform_ops:
        - DecodeImage:
            to_rgb: True
            channel_first: False
        - RandCropImage:
            size: 224
        - RandFlipImage:
            flip_code: 1
        - NormalizeImage:
            scale: 1.0/255.0
            mean: [0.485, 0.456, 0.406]
            std: [0.229, 0.224, 0.225]
            order: ''

    sampler:
      name: DistributedBatchSampler
      batch_size: 64
      drop_last: False
      shuffle: True
    loader:
      num_workers: 4
      use_shared_memory: True

  Eval:
    dataset: 
      name: ImageNetDataset
      image_root: ./dataset/ILSVRC2012/
      cls_label_path: ./dataset/ILSVRC2012/val_list.txt
      transform_ops:
        - DecodeImage:
            to_rgb: True
            channel_first: False
        - ResizeImage:
            resize_short: 256
        - CropImage:
            size: 224
        - NormalizeImage:
            scale: 1.0/255.0
            mean: [0.485, 0.456, 0.406]
            std: [0.229, 0.224, 0.225]
            order: ''
    sampler:
      name: DistributedBatchSampler
      batch_size: 64
      drop_last: False
      shuffle: False
    loader:
      num_workers: 4
      use_shared_memory: True

Infer:
  infer_imgs: docs/images/inference_deployment/whl_demo.jpg
  batch_size: 10
  transforms:
    - DecodeImage:
        to_rgb: True
        channel_first: False
    - ResizeImage:
        resize_short: 256
    - CropImage:
        size: 224
    - NormalizeImage:
        scale: 1.0/255.0
        mean: [0.485, 0.456, 0.406]
        std: [0.229, 0.224, 0.225]
        order: ''
    - ToCHWImage:
  PostProcess:
    name: Topk
    topk: 5
    class_id_map_file: ppcls/utils/imagenet1k_label_list.txt

Metric:
  Train:
    - TopkAcc:
        topk: [1, 5]
  Eval:
    - TopkAcc:
        topk: [1, 5]

For model configuration, here is a trick: any model can be divided into four parts for configuration, namely:

  1. Top Module: The top module is the highest-level module of the network, responsible for the overall output and task execution. It usually includes the output layer, the loss function, and the evaluation metrics, and it generates the final prediction or performs the specific task.
  2. Backbone Network: The backbone network is the core part of the network and is responsible for extracting high-level feature representations of input data. The backbone network usually consists of multiple convolutional layers, pooling layers, and fully connected layers to learn the features of the input data layer by layer.
  3. Neck: The neck is located between the backbone network and the top-level module, and plays the role of connection and conversion. The neck usually consists of some intermediate layers or modules, which are used to further process and compress the feature representation extracted by the backbone network, so as to better adapt to the needs of specific tasks.
  4. Head: The head is located behind the neck and is responsible for further processing and decoding the features output by the neck to generate the final prediction result. The head usually includes some fully connected layers, pooling layers, normalization layers, etc., to map features to the final output space.

The location of these parts in the network structure can vary according to the specific network architecture, since different network models may have different hierarchical structures and component configurations. However, generally speaking, the backbone network is usually located in the middle part of the network, the neck and head are located after the backbone network, and the top-level modules are located at the very top of the network.

Many model frameworks provide an overview of their model components; PaddleDetection is one example.
Once you understand how the different blocks dock together, it really is no different from playing with Lego: you can assemble a model yourself.
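
To make the Lego analogy concrete, here is a minimal PyTorch sketch that chains a backbone, a neck, and a head into one model (the layer sizes and the 10-class output are illustrative assumptions):

import torch
import torch.nn as nn
from torchvision.models import resnet50

# Backbone: a pre-trained ResNet50 with its classification layer removed
backbone = resnet50(pretrained=True)
backbone.fc = nn.Identity()  # now outputs (N, 2048) feature vectors

# Neck: a small projection layer that compresses the backbone features
neck = nn.Sequential(nn.Linear(2048, 512), nn.ReLU())

# Head: maps the neck features to the task's output space (10 classes here)
head = nn.Linear(512, 10)

# "Lego" assembly: chain the three blocks into a single model
model = nn.Sequential(backbone, neck, head)
print(model(torch.zeros(1, 3, 224, 224)).shape)  # torch.Size([1, 10])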

An example of training a model through a configuration file

To modify the model's configuration through a configuration file and train a pre-trained model, you can follow the example code below.
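
The script reads a config.yaml file; a matching file might look like the following (the keys mirror what the code below expects, and all values are illustrative):

model:
  type: resnet50
  pretrained: true
  num_classes: 10
top_module:
  type: linear_classifier
  dropout: 0.5
neck:
  type: global_avg_pooling
head:
  type: linear_layer
  hidden_size: 512
training:
  lr: 0.001
  epochs: 10
  log_interval: 10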

import torch
import torch.nn as nn
from torchvision.models import resnet50
import yaml

# Load the configuration file
with open('config.yaml', 'r') as file:
    config = yaml.safe_load(file)

# Build the backbone network
if config['model']['type'] == 'resnet50':
    backbone = resnet50(pretrained=config['model']['pretrained'])
    backbone.fc = nn.Identity()  # remove ResNet50's original fully connected layer

# Get the backbone's output feature size
dummy_input = torch.zeros(1, 3, 224, 224)  # assume a 3-channel, 224x224 input image
backbone_output_size = backbone(dummy_input).size(1)

# Build the top module
if config['top_module']['type'] == 'linear_classifier':
    top_module = nn.Sequential(
        nn.Dropout(config['top_module']['dropout']),
        nn.Linear(backbone_output_size, config['model']['num_classes'])
    )

# Build the neck (created for illustration; the simplified loop below uses only the backbone and top module)
if config['neck']['type'] == 'global_avg_pooling':
    neck = nn.AdaptiveAvgPool2d((1, 1))  # global average pooling

# Build the head (likewise created for illustration; not used in the loop below)
if config['head']['type'] == 'linear_layer':
    head = nn.Linear(backbone_output_size, config['head']['hidden_size'])

# Set up training: define the device, loss function, and optimizer
# (the original snippet used `device`, `train_loader`, `val_loader`, and
# `val_dataset` without defining them; the loaders and dataset are assumed
# to be built elsewhere from the data section of the configuration file)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
backbone.to(device)
top_module.to(device)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(list(backbone.parameters()) + list(top_module.parameters()), lr=config['training']['lr'])
epochs = config['training']['epochs']

# Training loop
for epoch in range(epochs):
    # each epoch runs training followed by validation
    backbone.train()
    top_module.train()
    
    for batch_idx, (images, labels) in enumerate(train_loader):
        # move the input images and labels to the device
        images = images.to(device)
        labels = labels.to(device)
        
        # forward pass
        features = backbone(images)
        outputs = top_module(features)
        
        # compute the loss
        loss = loss_fn(outputs, labels)
        
        # backward pass and optimization step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        # log training progress
        if batch_idx % config['training']['log_interval'] == 0:
            print(f'Train Epoch: {epoch+1}/{epochs} '
                  f'Batch: {batch_idx+1}/{len(train_loader)} '
                  f'Loss: {loss.item()}')
    
    # validation mode
    backbone.eval()
    top_module.eval()
    val_loss = 0
    correct = 0
    
    with torch.no_grad():
        for images, labels in val_loader:
            # move the input images and labels to the device
            images = images.to(device)
            labels = labels.to(device)
            
            # forward pass
            features = backbone(images)
            outputs = top_module(features)
            
            # accumulate the loss
            loss = loss_fn(outputs, labels)
            val_loss += loss.item()
            
            # count correct predictions for accuracy
            _, predicted = outputs.max(1)
            correct += predicted.eq(labels).sum().item()
    
    val_loss /= len(val_loader)
    accuracy = correct / len(val_dataset)
    
    # log validation results
    print(f'Validation Epoch: {epoch+1}/{epochs} '
          f'Loss: {val_loss:.4f} Accuracy: {accuracy:.4f}')

# Save the model components and optimizer state
torch.save({
    'backbone_state_dict': backbone.state_dict(),
    'top_module_state_dict': top_module.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'epoch': epochs
}, 'pretrained_model.pth')
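
To resume training or run inference later, the saved checkpoint can be loaded back with the standard PyTorch pattern:

# Load the checkpoint saved above and restore each component's state
checkpoint = torch.load('pretrained_model.pth', map_location=device)
backbone.load_state_dict(checkpoint['backbone_state_dict'])
top_module.load_state_dict(checkpoint['top_module_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
start_epoch = checkpoint['epoch']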

How to fine-tune the configuration file to train an excellent model

At the beginning, you should train with the default configuration parameters of the pre-trained model. If you then find that the model's performance is only average, consider fine-tuning the configuration file. To improve performance by modifying configuration parameters, you usually need to judge and adjust based on the following aspects:

Dataset characteristics

Observe the characteristics of your dataset, including image resolution, color space, the number of target classes, and the variance between classes. Depending on these characteristics, you can choose appropriate backbone, neck, and head structures and the corresponding hyperparameters.

  1. Image resolution:

Image resolution refers to the pixel size of the image, such as 224x224 or 384x384. High-resolution images capture more detail, but also require greater computing resources. Low-resolution images may lose some detail, but are computationally more efficient.
Example: If the dataset contains high-resolution images, choose a backbone and neck structure that can handle detail-rich images, for example a ResNet-152 backbone and a larger neck feature-map size.

  2. Color space:

Color space refers to the color representation of the image, such as RGB, grayscale, etc. Different color spaces may have different effects on different tasks.
Example: For tasks based on color information (e.g. image classification, object detection), the RGB color space is usually preferred. For black and white images, a grayscale color space can be used.

  3. Number of target categories and variance:

The number of target categories in a dataset and the variance between categories have important implications for model design and parameter tuning. The greater the disparity between classes, the more complex the structure the model may need in order to handle it.
Example: For datasets with a large number of categories and large variance, a deeper and wider backbone network can be selected in order to extract richer features.

  4. Sample balance of the dataset:

Whether the distribution of class samples in the data set is balanced is also an important factor affecting model design and parameter adjustment. If some classes have low sample sizes, the model may tend to predict common classes more often, resulting in skewed performance.
Example: For unbalanced datasets, weighted loss functions can be used to increase the weight of minority class samples to balance the training process. Data augmentation techniques can also be considered to increase the number of samples in the minority class.
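
As a concrete illustration of the weighted-loss idea, PyTorch's CrossEntropyLoss accepts per-class weights (the three-class setup and the weight values here are made-up examples):

import torch
import torch.nn as nn

# Suppose class 2 is a rare minority class: give it a larger weight so its
# errors contribute more to the loss (the weights are illustrative)
class_weights = torch.tensor([1.0, 1.0, 5.0])
loss_fn = nn.CrossEntropyLoss(weight=class_weights)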

Model architecture

Understand the principles and design ideas of model architecture. Different model architectures are suitable for different tasks and datasets. For example, for object detection tasks, YOLOv3 is a commonly used model, which adopts special feature extraction and prediction strategies. According to your task requirements, you can try to modify the number of layers, channels, attention mechanism, etc. of the model.

  1. Principles of model architecture:

Model architecture refers to the overall structure and components of the model, including the backbone network, neck and head, etc. Each part has a specific function and design philosophy to achieve optimal performance for a specific task.
The backbone network is responsible for extracting high-level features from raw images, usually in the form of a convolutional neural network (CNN). The neck and head modules are responsible for further processing and interpreting the features of the backbone network and producing the final prediction results.

  2. Applicability of different model architectures:

The choice of model architecture depends on the characteristics of the task and dataset. Different tasks and data sets require different feature representation and prediction strategies, so the applicable model architectures will also vary.
For example, for image classification tasks, the model needs good feature extraction capability, so a common choice is to use a classic convolutional neural network (such as ResNet, VGG, or Inception) as the backbone, and then use global average pooling and a linear classifier as the neck and head.
For the object detection task, the model needs to be able to detect both the location and the category of the object. YOLOv3 is a commonly used object detection model that divides an image into grid cells and predicts multiple bounding boxes and class probabilities at each cell. This design enables YOLOv3 to achieve high accuracy while maintaining a high detection speed.

  3. Modifying the key parameters of the model architecture:

When modifying the model architecture, the key parameters include the number of layers, the number of channels, and the attention mechanism.
Increasing the number of layers in a model can increase the representational power of the model, but it also increases computation and memory requirements. Appropriately increasing the number of layers can improve the performance of the model, but it is necessary to pay attention to the problem of overfitting.
Adjusting the number of channels can control the dimensionality and complexity of the feature map. Increasing the number of channels can improve the feature representation ability, but it will also increase the computational cost. Properly adjusting the number of channels can balance performance and efficiency.
Introducing an attention mechanism can increase the model's focus on important features, thereby improving performance. Attention can be realized through a self-attention mechanism (as in the Transformer), a channel attention mechanism (as in SENet), and so on.
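
For instance, a channel attention block in the SENet style can be sketched in a few lines of PyTorch (the reduction ratio of 16 follows the SENet paper; the rest of the wiring is a simplified illustration):

import torch
import torch.nn as nn

class SEBlock(nn.Module):
    # Squeeze-and-Excitation channel attention (simplified sketch)
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: global average pooling
        self.fc = nn.Sequential(             # excitation: bottleneck MLP
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        n, c, _, _ = x.shape
        weights = self.fc(self.pool(x).view(n, c)).view(n, c, 1, 1)
        return x * weights  # reweight each channel by its learned importance

# Example: reweight a (2, 64, 32, 32) feature map; the shape is unchanged
print(SEBlock(64)(torch.randn(2, 64, 32, 32)).shape)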

Prior Research and Experience

Refer to relevant research papers, blog posts, and lessons learned for models and configurations that have achieved good performance. You can learn from the experience of others and try to use similar configurations or parameter settings.

  1. Read related research papers:

By reading research papers in related fields, you can learn about the current models and configurations that achieve good performance on a specific task. Papers usually describe the architecture and parameter settings of the model in detail, as well as experimental results on different datasets.
Example: If you are dealing with image classification tasks, you can refer to well-known image classification papers in the field, such as ResNet, EfficientNet, ViT, etc. Learn about their architectural design, hyperparameter settings, and performance on publicly available datasets.

  2. Reference blog posts and lessons learned:

Many machine learning practitioners share their experiences and practices in blog posts or technical communities. These articles often provide valuable advice on model configuration, parameter tuning, and techniques.
Example: You can search for optimization techniques for specific tasks or models, such as "improve performance of object detection models" or "hyperparameter tuning for neural networks". Read these articles for practical guidance and advice.

  3. Example sharing and open-source projects:

Some open source projects and platforms provide sharing of examples of models and configurations. These examples can help you understand model architectures and parameter settings that achieve good performance on similar tasks.
Examples: Open-source projects on GitHub, winning solutions from Kaggle competitions, and forum posts are all good sources of shared examples. You can refer to them and make appropriate adjustments for your own task.

Hyperparameter Tuning

Hyperparameters include the learning rate, batch size, weight decay, and so on. Through experiments and validation-set performance, hyperparameters can be tuned and the best configuration selected. You can also try techniques such as learning rate schedulers and adaptive optimizers to improve the training process.

  1. Learning rate tuning:

The learning rate is an important hyperparameter that controls the magnitude of model parameter updates. An appropriate learning rate can speed up model convergence and improve performance, while an inappropriate learning rate may cause training to be unstable or stuck in a local minimum.
Common learning rate tuning strategies include learning rate decay, learning rate warm-up, and adaptive adjustment. You can try different learning rate schedulers, such as StepLR, CosineAnnealingLR, or ReduceLROnPlateau, to adjust the learning rate dynamically.

  2. Batch size:

Batch size refers to the number of samples used at each parameter update. Smaller batch sizes can improve the convergence rate of the model, but may lead to increased noise during training. Larger batch sizes reduce noise but consume more memory.
When choosing a batch size, available hardware resources and memory constraints need to be considered. Larger batch sizes are generally recommended, but make sure you do so within the limits of your hardware resources.

  3. Weight decay:

Weight decay is a regularization technique used to reduce the magnitude of model parameters to prevent overfitting. Larger weight values can be penalized by adding a weight decay term to the loss function.
The coefficient of weight decay is usually tuned as a hyperparameter. You can try different weight decay factors, such as 0.0001, 0.001, etc., and observe the performance of the model on the validation set.

  4. Adaptive optimizer:

Traditional optimization algorithms such as stochastic gradient descent (SGD) have fixed learning rate and momentum parameters. However, adaptive optimizers (such as Adam, RMSprop) can automatically adjust the learning rate and momentum parameters according to the change of the gradient to improve the training effect.
When using an adaptive optimizer, you can adjust parameters such as learning rate, momentum, weight decay, etc., and observe the performance of the model on the validation set.

Example:
Assuming you are training an object detection model, you can use the following hyperparameter configuration for tuning:

Learning rate: Set the initial learning rate to 0.001, then use a learning rate decay strategy, for example decaying the learning rate to half its current value every 10 epochs.
Batch size: Choose an appropriate batch size, such as 32 or 64, based on available hardware resources and memory constraints.
Weight decay: Try different weight decay coefficients, like 0.0001 or 0.001, and choose the value that performs best on the validation set.
Adaptive optimizer: Use the Adam optimizer with an appropriate initial learning rate, momentum, and weight decay coefficient, for example a learning rate of 0.001, momentum of 0.9, and weight decay of 0.0005; a sketch of this setup follows.
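
In PyTorch, this example configuration could be wired up roughly as follows (Adam expresses momentum through its first beta, so beta1=0.9 plays the role of the 0.9 momentum term; model and num_epochs are assumed to be defined elsewhere):

import torch

optimizer = torch.optim.Adam(model.parameters(),
                             lr=0.001,             # initial learning rate
                             betas=(0.9, 0.999),   # beta1 ~ momentum of 0.9
                             weight_decay=0.0005)  # weight decay coefficient

# Halve the learning rate every 10 epochs, as in the example above
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(num_epochs):
    # ... train for one epoch ...
    scheduler.step()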
Through experiments and validation set performance, you can try different hyperparameter configurations and choose the configuration that performs best on the validation set. Techniques such as cross-validation can be used to evaluate the performance of different hyperparameter configurations and choose the configuration with the best performance as the final choice.

It should be noted that hyperparameter tuning is an iterative process that requires multiple experiments and adjustments based on the actual situation. In addition, you can also consider using automated hyperparameter tuning tools such as Hyperopt, Optuna, etc. to speed up the hyperparameter search process.
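
As a sketch of what an automated search might look like with Optuna (the objective body is a placeholder, and train_and_validate is a hypothetical helper that trains with the sampled settings and returns a validation score):

import optuna

def objective(trial):
    # Sample candidate hyperparameters from reasonable ranges
    lr = trial.suggest_float('lr', 1e-5, 1e-1, log=True)
    weight_decay = trial.suggest_float('weight_decay', 1e-5, 1e-2, log=True)
    batch_size = trial.suggest_categorical('batch_size', [32, 64, 128])
    # train_and_validate is a hypothetical helper: train a model with these
    # settings and return its validation accuracy
    return train_and_validate(lr, weight_decay, batch_size)

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
print(study.best_params)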

Iterative experimentation and evaluation

Conduct a series of experiments and evaluations to observe how model performance changes under different configurations. Use appropriate evaluation metrics (such as accuracy, precision, recall, or mAP) to evaluate the model, then adjust and optimize according to the results.
Here are some detailed explanations and examples:

  1. Experimental design

Determine the objectives and evaluation metrics for the experiment. Based on the task requirements, select appropriate metrics such as accuracy, precision, recall, or mAP (mean average precision). Make sure the chosen metrics fully reflect the model's performance under different configurations.

  2. Parameter adjustment

Based on the results of previous research, experience, and hyperparameter tuning, an initial set of configuration parameters is chosen. These parameters include model architecture, hyperparameters, optimizer settings, and more. Train the model with this set of parameters and evaluate the performance on the validation set.

  3. Performance evaluation

Calculate the model's performance on the validation or test set according to the selected evaluation metrics. Observe the metric values, then analyze and compare them. Find the best configuration by comparing performance under different configurations.

  4. Tuning and optimization

Based on the evaluation results, analyze the performance differences of the model under different configurations. If the model performs poorly under certain configurations, try adjusting related parameters, such as increasing the number of network layers, adjusting the learning rate decay strategy, or changing the data augmentation method. Then retrain the model and re-evaluate its performance.

  5. Iterative experimentation

Conduct multiple rounds of experiments and evaluations as an iterative process. Based on the results of the previous round, adjust the configuration parameters and run the next round of training and evaluation. Through continuous iteration, the model's performance is gradually optimized.

Example: For object detection tasks, you can try different backbone networks (such as ResNet or EfficientNet), adjust the depth and number of channels of the network, try different detection heads (such as YOLO or SSD), and adjust related hyperparameters (such as the learning rate and batch size). Through training and evaluation, compare the mAP values under different configurations and choose the one that performs best; a simple loop for organizing such comparisons is sketched below.
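
A simple way to organize such comparisons is a loop over candidate configurations (everything here is illustrative, and evaluate_map is a hypothetical helper that trains with a given configuration and returns its validation mAP):

# Candidate configurations to compare (illustrative values)
configs = [
    {'backbone': 'resnet50',        'lr': 0.01,  'batch_size': 32},
    {'backbone': 'resnet101',       'lr': 0.01,  'batch_size': 32},
    {'backbone': 'efficientnet_b0', 'lr': 0.001, 'batch_size': 64},
]

results = {}
for i, cfg in enumerate(configs):
    # evaluate_map is a hypothetical helper: train with cfg, return val mAP
    results[i] = evaluate_map(cfg)

best = max(results, key=results.get)
print('Best configuration:', configs[best], 'mAP:', results[best])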

⚠️ Note that when modifying parameters, it is recommended to follow the steps below:

  1. Identify goals and metrics: Specify the goals and performance metrics you want to improve, such as improving classification accuracy or speeding up model inference.
  2. Single variable principle: When adjusting parameters, try to change only one variable and keep other parameters unchanged. This allows for a better understanding of the impact of each parameter and avoids the difficulty of analyzing results due to simultaneous variation of multiple parameters.
  3. Set reasonable ranges and step sizes: Set reasonable ranges and step sizes for parameters based on experience or previous research. Avoid setting step sizes that are too large or too small as this may miss the optimum or cause overtuning.
  4. Experiment and Evaluation: Conduct experiments and evaluate the performance of the model. Use the validation set or cross-validation to evaluate the performance of the model under different parameter configurations. Record the experimental results and perform statistics and analysis. Compare the performance indicators under different parameter configurations, and observe their changing trends and influence degrees. This allows quantitative and qualitative information on parameter tuning to be obtained.
  5. Iteration and optimization: According to the results of experiments and analysis, the parameters are further adjusted. Methods such as grid search, random search, Bayesian optimization, etc. can be employed to explore a wider parameter space and find better configurations.
  6. Pay attention to balance: When adjusting parameters, pay attention to balancing the performance and needs of various aspects. For example, increasing the complexity of a model may improve accuracy, but also increase computation and memory requirements. Therefore, it is necessary to make a trade-off between model performance and computing resources to find the optimal configuration that suits the task requirements.
