Problems encountered while customizing and improving the yolox framework, and how they were resolved

-------First part written; follow-up updates to come--------

-------2022.07.16 Update: analysis and solution of the KeyError raised during validation-set evaluation in yolox training

-------2022.07.28 Update: how to customize the channel numbers of the three feature-pyramid levels passed from the backbone to the neck in yolox

Foreword

Note:

The yolox code (referring to the team's open-source code on GitHub) is large and structurally complex. Although the developers' coding style and module organization are clearly standardized and reasonable, there are so many functional modules that training still has to call layer after layer of folders, Python files, classes, and functions, which makes the program harder to understand and harder to improve on later.

(1) Following the instructions in the official README, training is started by running the tools/train.py file from a terminal window; all training calls originate from this file.
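For reference, the command I use later in this article to start training (single GPU, batch size 8, mixed precision, with -f pointing at the Exp file) is:

python tools/train.py -f exps/example/yolox_voc/yolox_voc_s.py -d 1 -b 8 --fp16 -o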

(2) The definition and invocation of the Exp class is the core of yolox training. Yolox packages the specific training configuration and some of the functions used during training inside this Exp class. When running train.py, the path of the Python file containing the Exp class to be used is passed in as the -f argument.

For example, when I use a dataset in VOC format, I modify the yolox_voc_s.py file provided with the yolox source code and use the Exp class defined there for training. Note in particular that the Exp class in yolox_voc_s.py is not defined from scratch: it inherits from the Exp class defined in yolox/exp/yolox_base.py, which in turn inherits from the BaseExp class in yolox/exp/base_exp.py, although most of the attributes and class functions are implemented in the Exp class in yolox_base.py. So although on the surface we only use the Exp class from yolox_voc_s.py, to modify its contents carefully we often have to go back to the Exp class it inherits from in yolox/exp/yolox_base.py. Also, do not stare only at the Exp class in yolox_base.py while running and modifying the program, because the Exp class in yolox_voc_s.py may override individual attributes and class functions of the parent when it inherits, and reading only the attribute values in yolox_base.py for reference can lead to misjudgment. (For example, I once could not locate a bug because I only looked at the default values of the two class attributes self.depth and self.width defined in yolox_base.py; both are overridden after inheritance by the Exp class in yolox_voc_s.py, which is why my repeated checks missed the problem. This deserves special attention.)
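An illustrative sketch of this inheritance chain (a simplified paraphrase; the attribute values are examples, so check your local yolox_base.py and yolox_voc_s.py for the real code):

# yolox/exp/base_exp.py
class BaseExp:
    """Abstract base: declares get_model(), get_data_loader(), get_evaluator(), eval(), ..."""


# yolox/exp/yolox_base.py -- implements most of the attributes and class functions
class Exp(BaseExp):
    def __init__(self):
        super().__init__()
        self.depth = 1.00      # default depth factor lives here ...
        self.width = 1.00      # ... and the default width factor


# exps/example/yolox_voc/yolox_voc_s.py -- the class actually passed to train.py via -f
# (in the real file it is also named Exp and imports the parent as MyExp)
class VOCExp(Exp):
    def __init__(self):
        super().__init__()
        self.num_classes = 20
        self.depth = 0.33      # overrides the parent's default
        self.width = 0.50      # overrides the parent's default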

1. Introduction to the model definition files:

The modularization of yolox is fairly clear. The yolox model is generally viewed as a backbone, an FPN-based neck, and a head. When yolox builds the model, the main modules are placed in the yolox/models folder:

darknet.py defines the backbone;

yolo_pafpn.py defines the FPN structure with a PAN module as the neck of the model. The YOLOPAFPN class defined there also instantiates the backbone defined in darknet.py as the class attribute self.backbone, runs the backbone inside its forward function, and feeds the backbone outputs into the FPN it builds. So yolo_pafpn.py actually covers both the backbone and the FPN of the yolox framework, and adding the head afterwards yields the complete network.
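A paraphrased sketch of YOLOPAFPN.forward showing that flow (see yolox/models/yolo_pafpn.py for the real code):

def forward(self, input):
    out_features = self.backbone(input)                      # run the darknet backbone
    features = [out_features[f] for f in self.in_features]   # take "dark3", "dark4", "dark5"
    x2, x1, x0 = features
    # ... the top-down FPN and bottom-up PAN fusion happens here ...
    # return (pan_out2, pan_out1, pan_out0)   # three feature maps handed to the head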

yolo_head.py defines the YOLOXHead class, which packages the head part of YOLOX; in terms of code size this is also the longest and most complicated part of the model.

yolox.py mainly composes the YOLOPAFPN and YOLOXHead classes introduced above into the yolox model, and organizes the model's training outputs into a dictionary containing each loss term.
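A paraphrased sketch of that composition (my own simplification of yolox/models/yolox.py; the class is renamed here to avoid clashing with the real one):

import torch.nn as nn
from yolox.models import YOLOPAFPN, YOLOXHead


class YOLOXSketch(nn.Module):
    def __init__(self, backbone=None, head=None):
        super().__init__()
        self.backbone = backbone or YOLOPAFPN()          # neck (which contains the backbone)
        self.head = head or YOLOXHead(num_classes=80)    # detection head

    def forward(self, x, targets=None):
        fpn_outs = self.backbone(x)
        if self.training:
            # during training the head returns the individual loss terms
            loss, iou_loss, conf_loss, cls_loss, l1_loss, num_fg = self.head(fpn_outs, targets, x)
            return {
                "total_loss": loss,
                "iou_loss": iou_loss,
                "l1_loss": l1_loss,
                "conf_loss": conf_loss,
                "cls_loss": cls_loss,
                "num_fg": num_fg,
            }
        return self.head(fpn_outs)                       # during inference, decoded predictions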

2. Points to note when improving the model:

The backbone is divided into several sequential operation modules: dark2, dark3, dark4, and dark5. The outputs of dark3, dark4, and dark5 are saved separately and passed to the three corresponding levels of the FPN for further computation. Therefore, when modifying the model backbone, pay special attention to whether the number of channels produced by dark3, dark4, and dark5 matches the number of channels that the corresponding FPN modules they connect to expect to receive. (It is emphasized again here that the two class attributes self.depth and self.width in the Exp class affect the model depth (total number of layers) and the channel numbers the model uses when operating on the images.)
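A quick, hypothetical sanity check along these lines (assuming the stock YOLOPAFPN and the default stage names; adapt it to your own backbone if you have replaced it):

import torch
from yolox.models import YOLOPAFPN

width, in_channels = 0.50, [256, 512, 1024]
neck = YOLOPAFPN(depth=0.33, width=width, in_channels=in_channels)

x = torch.randn(1, 3, 640, 640)
features = neck.backbone(x)                        # dict keyed by "dark3", "dark4", "dark5"
for name, c in zip(("dark3", "dark4", "dark5"), in_channels):
    expected = int(c * width)
    actual = features[name].shape[1]
    print(f"{name}: backbone gives {actual} channels, neck expects {expected}")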

3. Analysis and solution of the KeyError raised during validation-set evaluation in yolox training

During yolox training, at certain specified epochs the model is evaluated on the validation set after the epoch of training finishes. After I replaced the dataset, the following error message appeared:

07-16 13:56:51 | INFO     | yolox.core.trainer:266 - epoch: 1/300, iter: 110/116, mem: 5523Mb, iter_time: 0.229s, data_time: 0.000s, total_loss: 10.5, iou_loss: 3.8, l1_loss: 0.0, conf_loss: 5.7, cls_loss: 0.9, lr: 1.124e-03, size: 640, ETA: 3:04:37
2022-07-16 13:56:52 | INFO     | yolox.core.trainer:370 - Save weights to ./YOLOX_outputs/yolox_voc_s
100%|##########| 29/29 [00:02<00:00, 10.64it/s]
2022-07-16 13:56:55 | INFO     | yolox.evaluators.voc_evaluator:160 - Evaluate in main process...
Writing ship VOC results file
Eval IoU : 0.50
2022-07-16 13:56:58 | INFO     | yolox.core.trainer:208 - Training of experiment is done and the best AP is 0.00, the best AP_50 is 0.00

2022-07-16 13:56:58 | ERROR    | yolox.core.launch:98 - An error has been caught in function 'launch', process 'MainProcess' (25844), thread 'MainThread' (140597191565888):
Traceback (most recent call last):

  File "tools/train.py", line 151, in <module>
    launch(
    └ <function launch at 0x7fde2b82af70>

> File "/home/dwt/MyCode/pycharm_projects/YOLOX_sample/yolox/core/launch.py", line 98, in launch
    main_func(*args)
    │          └ (╒═══════════════════╤═══════════════════════════════════════════════════════════════════════════════════════════════════════...
    └ <function main at 0x7fde1d3d4280>

  File "tools/train.py", line 134, in main
    trainer.train()  # this runs train(), a class function defined in class Trainer in yolox/core/trainer.py, on the Trainer instance trainer; class Trainer implements the whole training process through its various class functions
    │       └ <function Trainer.train at 0x7fde18897940>
    └ <yolox.core.trainer.Trainer object at 0x7fde188a60a0>

  File "/home/dwt/MyCode/pycharm_projects/YOLOX_sample/yolox/core/trainer.py", line 88, in train
    self.train_in_epoch()
    │    └ <function Trainer.train_in_epoch at 0x7fde1889f040>
    └ <yolox.core.trainer.Trainer object at 0x7fde188a60a0>

  File "/home/dwt/MyCode/pycharm_projects/YOLOX_sample/yolox/core/trainer.py", line 98, in train_in_epoch
    self.after_epoch()
    │    └ <function Trainer.after_epoch at 0x7fde1889f3a0>
    └ <yolox.core.trainer.Trainer object at 0x7fde188a60a0>

  File "/home/dwt/MyCode/pycharm_projects/YOLOX_sample/yolox/core/trainer.py", line 235, in after_epoch
    self.evaluate_and_save_model()
    │    └ <function Trainer.evaluate_and_save_model at 0x7fde1889f670>
    └ <yolox.core.trainer.Trainer object at 0x7fde188a60a0>

  File "/home/dwt/MyCode/pycharm_projects/YOLOX_sample/yolox/core/trainer.py", line 339, in evaluate_and_save_model
    ap50_95, ap50, summary = self.exp.eval(
                             │    │   └ <function Exp.eval at 0x7fde18897f70>
                             │    └ ╒═══════════════════╤════════════════════════════════════════════════════════════════════════════════════════════════════════...
                             └ <yolox.core.trainer.Trainer object at 0x7fde188a60a0>

  File "/home/dwt/MyCode/pycharm_projects/YOLOX_sample/yolox/exp/yolox_base.py", line 335, in eval
    return evaluator.evaluate(model, is_distributed, half)
           │         │        │      │               └ False
           │         │        │      └ False
           │         │        └ fbnet_YOLOX(
           │         │            (backbone): fbnet_YOLOPAFPN(
           │         │              (backbone): fbnet_backbone(
           │         │                (stem): Focus(
           │         │                  (conv): BaseConv(
           │         │            ...
           │         └ <function VOCEvaluator.evaluate at 0x7fde188914c0>
           └ <yolox.evaluators.voc_evaluator.VOCEvaluator object at 0x7fddfc162a00>

  File "/home/dwt/MyCode/pycharm_projects/YOLOX_sample/yolox/evaluators/voc_evaluator.py", line 128, in evaluate
    eval_results = self.evaluate_prediction(data_list, statistics)
                   │    │                   │          └ tensor([ 2.2012,  0.0739, 28.0000], device='cuda:0')
                   │    │                   └ {0: (tensor([[-7.5409e+01, -6.3225e+01,  5.1623e+02,  2.6991e+02],
                   │    │                             [-1.3955e+02, -6.8198e+00,  5.6981e+02,  2.1258e+0...
                   │    └ <function VOCEvaluator.evaluate_prediction at 0x7fde188915e0>
                   └ <yolox.evaluators.voc_evaluator.VOCEvaluator object at 0x7fddfc162a00>

  File "/home/dwt/MyCode/pycharm_projects/YOLOX_sample/yolox/evaluators/voc_evaluator.py", line 205, in evaluate_prediction
    mAP50, mAP70 = self.dataloader.dataset.evaluate_detections(
                   │    │          │       └ <function VOCDetection.evaluate_detections at 0x7fde18891e50>
                   │    │          └ <yolox.data.datasets.voc.VOCDetection object at 0x7fde188b6df0>
                   │    └ <torch.utils.data.dataloader.DataLoader object at 0x7fddfc162dc0>
                   └ <yolox.evaluators.voc_evaluator.VOCEvaluator object at 0x7fddfc162a00>

  File "/home/dwt/MyCode/pycharm_projects/YOLOX_sample/yolox/data/datasets/voc.py", line 271, in evaluate_detections
    mAP = self._do_python_eval(output_dir, iou)
          │    │               │           └ 0.5
          │    │               └ '/tmp/tmpiht1ubmu'
          │    └ <function VOCDetection._do_python_eval at 0x7fde18897040>
          └ <yolox.data.datasets.voc.VOCDetection object at 0x7fde188b6df0>

  File "/home/dwt/MyCode/pycharm_projects/YOLOX_sample/yolox/data/datasets/voc.py", line 337, in _do_python_eval
    rec, prec, ap = voc_eval(
                    └ <function voc_eval at 0x7fde188913a0>

  File "/home/dwt/MyCode/pycharm_projects/YOLOX_sample/yolox/evaluators/voc_eval.py", line 109, in voc_eval
    R = [obj for obj in recs[imagename] if obj["name"] == classname]
                        │    │                            └ 'ship'
                        │    └ '000033'
                        └ {}

KeyError: '000033'
(fbnet_yolox) dwt@dwt-Ubuntu:~/MyCode/pycharm_projects/YOLOX_sample$ python tools/train.py -f exps/example/yolox_voc/yolox_voc_s.py -d 1 -b 8 --fp16 -o
2022-07-16 13:59:59 | INFO     | yolox.core.trainer:142 - args: Namespace(batch_size=8, cache=False, ckpt=None, devices=1, dist_backend='nccl', dist_url=None, exp_file='exps/example/yolox_voc/yolox_voc_s.py', experiment_name='yolox_voc_s', fp16=True, logger='tensorboard', machine_rank=0, name=None, num_machines=1, occupy=True, opts=[], resume=False, s

My initial judgment was that the error occurred when reading the validation-set files. After searching, I found a blog post describing a similar problem: a KeyError raised during yolov5 v3.0 training.

From that blog post, the cause is:

This kind of error occurs when the data cache file has gone stale: yolov5 saves the parsed data to a cache file by default so that subsequent reads are very fast.

Similarly, I looked at the source code of the voc_eval.py file in yolox (the excerpt below shows the part that loads and caches the ground truth):

# excerpted from yolox/evaluators/voc_eval.py; os and pickle are imported at the top of that file
import os
import pickle


def voc_eval(
    detpath,
    annopath,
    imagesetfile,
    classname,
    cachedir,
    ovthresh=0.5,
    use_07_metric=False,
):
    # first load gt
    if not os.path.isdir(cachedir):
        os.mkdir(cachedir)
    cachefile = os.path.join(cachedir, "annots.pkl")
    # read list of images
    with open(imagesetfile, "r") as f:
        lines = f.readlines()
    imagenames = [x.strip() for x in lines]

    if not os.path.isfile(cachefile):
        # load annots
        recs = {}
        for i, imagename in enumerate(imagenames):
            recs[imagename] = parse_rec(annopath.format(imagename))
            if i % 100 == 0:
                print("Reading annotation for {:d}/{:d}".format(i + 1, len(imagenames)))
        # save
        print("Saving cached annotations to {:s}".format(cachefile))
        with open(cachefile, "wb") as f:
            pickle.dump(recs, f)
    else:
        # load
        with open(cachefile, "rb") as f:
            recs = pickle.load(f)

Judging from the name "cachedir" and similar identifiers, yolox evidently has a comparable data-caching mechanism to speed up subsequent reads of the annotations, so this error was probably caused by yolox still using the cache built from the dataset I had used before, which no longer corresponds to the dataset I had just swapped in. I therefore located the cache file annots.pkl under datasets/VOCdevkit/annotations_cache and simply deleted that folder (when yolox runs again and cannot find the folder, it recreates it and re-caches the data). After running the training command again, the problem did not recur.
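If you prefer to script the cleanup, a one-off snippet like this (my own helper, not part of yolox; the path is where the cache lived in my setup) does the same thing:

import shutil

# remove the stale annotation cache so yolox rebuilds it on the next run
shutil.rmtree("datasets/VOCdevkit/annotations_cache", ignore_errors=True)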

4. Customizing the channel numbers of the three feature-pyramid levels that the neck receives from the backbone

Reading the yolox source code, start from the part of yolox/models/yolo_pafpn.py that defines the neck module of the yolox framework.
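A condensed sketch of the YOLOPAFPN constructor (paraphrased from the source; check your local copy for the full definition):

import torch.nn as nn


class YOLOPAFPN(nn.Module):
    def __init__(
        self,
        depth=1.0,
        width=1.0,
        in_features=("dark3", "dark4", "dark5"),
        in_channels=[256, 512, 1024],
        depthwise=False,
        act="silu",
    ):
        super().__init__()
        # self.backbone = CSPDarknet(depth, width, depthwise=depthwise, act=act)
        self.in_features = in_features
        self.in_channels = in_channels
        # ... the FPN/PAN layers are built here, with channel numbers of the
        # form int(in_channels[i] * width) ...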

It can be seen from the in_channels = [256, 512, 1024] default that yolox lets the feature pyramid of the neck receive feature maps with 256, 512, and 1024 channels at its three levels. In this part of the code there is also a width parameter: later in the module definitions it is multiplied with the numbers in in_channels, which scales, to a certain extent, the channel numbers of the feature maps that the feature pyramid actually receives.
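A minimal illustration of that scaling (plain arithmetic, not yolox code):

width = 0.5
in_channels = [256, 512, 1024]
scaled = [int(c * width) for c in in_channels]
print(scaled)   # [128, 256, 512] -- the 1:2:4 ratio between the levels is preserved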

For example, with the default in_channels = [256, 512, 1024] and width = 0.5, the channel numbers the feature pyramid actually receives become [128, 256, 512] through width * in_channels. But adjusting width alone, while it does change the effective in_channels to some extent, keeps the ratio of channel numbers across the three pyramid levels fixed at 1:2:4. That kind of adjustment is still not flexible enough, so we have to modify in_channels directly.

Modifying the in_channels and width parameters directly in the YOLOPAFPN class in yolox/models/yolo_pafpn.py is not the way to go, because that file only defines the neck; the in_channels and width values there are merely defaults, and new values are passed in for them when this definition is called during actual training. So we have to find the place that calls the neck and modify in_channels and width there. Under yolox's default training setup, the function that assembles the yolox model by calling the individual module files and then hands it over for training is the class function get_model() defined in yolox/exp/yolox_base.py.
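A paraphrased sketch of that method (the real get_model() also applies BatchNorm tweaks and head bias initialization):

# method of class Exp in yolox/exp/yolox_base.py (paraphrased)
def get_model(self):
    from yolox.models import YOLOX, YOLOPAFPN, YOLOXHead

    if getattr(self, "model", None) is None:
        in_channels = [256, 512, 1024]    # <- the neck's input channels are fixed here
        backbone = YOLOPAFPN(self.depth, self.width, in_channels=in_channels)
        head = YOLOXHead(self.num_classes, self.width, in_channels=in_channels)
        self.model = YOLOX(backbone, head)

    self.model.train()
    return self.model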

It can be seen that YOLOPAFPN is called here with an in_channels argument passed in, so the in_channels value we modify at this point is what actually determines the number of input feature-map channels of the yolox feature pyramid during training. Where to modify the width parameter depends on which file defines the Exp class you pass to train.py: if you use the Exp class in exps/example/yolox_voc/yolox_voc_s.py, you need to modify width in that file. For the reason, refer back to the foreword of this article.
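One hypothetical way to put this into practice without touching yolox_base.py is to override get_model() in your own Exp class, so that the in_channels you choose are the ones passed to the neck and head (this assumes your backbone really does output those channel numbers at its three stages, otherwise the neck will fail at runtime):

from yolox.exp import Exp as BaseExp


class Exp(BaseExp):
    def __init__(self):
        super().__init__()
        self.num_classes = 20
        self.depth = 0.33
        self.width = 0.50                       # width is set here, in the Exp file passed with -f

    def get_model(self):
        from yolox.models import YOLOX, YOLOPAFPN, YOLOXHead

        if getattr(self, "model", None) is None:
            in_channels = [128, 256, 384]       # custom values, no longer tied to the 1:2:4 ratio
            backbone = YOLOPAFPN(self.depth, self.width, in_channels=in_channels)
            head = YOLOXHead(self.num_classes, self.width, in_channels=in_channels)
            self.model = YOLOX(backbone, head)

        self.model.train()
        return self.model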


Original post: blog.csdn.net/qq_40641713/article/details/125345836