YOLOv5: A Complete Alchemy Guide

Foreword

Recently I have been working on a YOLOv5 gesture-recognition project. I stepped into a lot of pits and eliminated many bugs along the way, so let me record them here. I drew on the experience of those who came before, and where I found a well-written article I recommend it. I will mainly talk about these bugs; if anything is lacking, please point it out in the comments.

Alchemy method

Collecting the dataset

1. Crawl data.
Here I mainly reused crawler code found online; I have a copy, but since I didn't write it I won't share it.
Advantages: a large amount of data can be obtained easily.
Disadvantages: the quality of data crawled from the Internet varies widely and is mostly poor.

2. Use PotPlayer to extract frames.
Dataset requirements: these are covered in detail in many write-ups; I would add one point: be flexible according to actual needs, and keep the data close to the real use environment.

Prepare the video, open PotPlayer, and press the shortcut Alt+G.
[Screenshot: PotPlayer consecutive image capture settings]

You can explore PotPlayer's other settings yourself; my recommended settings:
set the number of captures to 99999 (guarantees the whole video is covered) and the capture interval to 200 ms.
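If you prefer the command line, ffmpeg can do the same extraction (an alternative I mention here, not part of the original workflow; a 200 ms interval corresponds to fps=5, and the file names are placeholders; create the frames folder first):

ffmpeg -i gesture_video.mp4 -vf fps=5 frames/%05d.jpg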

Then manually go through the data to remove blurry, poor-quality pictures; a small script can help flag these, as sketched below.
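A minimal sketch for flagging blurry frames automatically, using OpenCV's Laplacian variance (my own helper, not part of the original workflow; the threshold is a starting guess to tune per dataset):

import os
import cv2

def find_blurry(folder, threshold=100.0):
    # Low variance of the Laplacian means few sharp edges, i.e. a blurry image
    blurry = []
    for name in os.listdir(folder):
        img = cv2.imread(os.path.join(folder, name))
        if img is None:  # skip non-image files
            continue
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        if cv2.Laplacian(gray, cv2.CV_64F).var() < threshold:
            blurry.append(name)
    return blurry

print(find_blurry(r"D:/Code/Data/centerlinedata/tem_voc/JPEGImages/"))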

Dividing the dataset

Most methods you find online split VOC-format datasets, but mine is a dataset in YOLO format.

1. First use the script below to split the dataset; change the source folder path and the destination path to your own.
Dataset-splitting code:

import os
import random
from shutil import copy2

# Source folder containing all images
file_path = r"D:/Code/Data/centerlinedata/tem_voc/JPEGImages/"
# Destination folder for the split dataset
new_file_path = r"D:/Code/Data/GREENTdata/"
# Split ratio train:val:test = 6:2:2
split_rate = [0.6, 0.2, 0.2]
split_names = ['train', 'val', 'test']

# Create train/val/test folders under the destination (if missing)
for split_name in split_names:
    os.makedirs(os.path.join(new_file_path, split_name), exist_ok=True)

# List and shuffle the images once
all_data = os.listdir(file_path)  # ['00000.jpg', '00001.jpg', '00002.jpg', ...]
random.shuffle(all_data)
data_length = len(all_data)

# Index boundaries of the three splits
train_stop = int(data_length * split_rate[0])
val_stop = int(data_length * (split_rate[0] + split_rate[1]))

# Copy each image into its split folder
train_num = val_num = test_num = 0
for idx, name in enumerate(all_data):
    src_img_path = os.path.join(file_path, name)
    if idx < train_stop:
        copy2(src_img_path, os.path.join(new_file_path, 'train'))
        train_num += 1
    elif idx < val_stop:
        copy2(src_img_path, os.path.join(new_file_path, 'val'))
        val_num += 1
    else:
        copy2(src_img_path, os.path.join(new_file_path, 'test'))
        test_num += 1

print("Done!", train_num, val_num, test_num)
 

Split complete:
[Screenshot: output of the split script]

2. Create the following file structure:
all_split        # the folder produced by the split above
  images
    train        # training images from the split
    val          # validation images from the split
  labels
    train        # training labels (the path labelimg writes to)
    val          # validation labels (labelimg output path)
  test           # test images from the split
  A.yaml         # dataset configuration file

A.yaml is configured as follows:
[Screenshot: A.yaml contents]
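In case the screenshot is unavailable, here is a minimal sketch of a YOLOv5 dataset yaml; nc and names are example values to replace with your own classes:

# A.yaml (sketch)
path: ./all_split        # dataset root
train: images/train      # training images, relative to path
val: images/val          # validation images
test: test               # optional test images

nc: 2                    # number of classes (example value)
names: ['fist', 'palm']  # class names (example values)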

3. Next, use labelimg to label the training set and validation set in YOLO format; a sample label line is shown below.
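For reference, labelimg's YOLO mode writes one .txt file per image (same base name, placed under labels/), with one line per object; the numbers below are made up:

0 0.512 0.431 0.210 0.340

The fields are class_id x_center y_center width height, with all coordinates normalized to [0, 1] by the image size.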

YOLOv5 model training

For the alchemy itself I mainly follow this blogger's method; I'll add a few points to watch out for.
Tutorial: Super detailed yolov5 model training from scratch

1. The batch size must be set conservatively.
If you don't know how large it can go, use the AutoBatch feature, i.e. set the parameter to -1:

python train.py --img 640 --batch -1 --data ./yolo_A/A.yaml --weights yolov5s.pt --cache     

Run this line and, as shown below, 15 is automatically selected as the batch size.
[Screenshot: AutoBatch selecting a batch size of 15]
This was on a 2080 Ti, so weigh how much your own card can handle. In fact, this parameter is tied to the complexity of the network: for the same dataset, the more complex the network, the smaller the batch size. Here I used yolov5l.pt; when I used yolov5s.pt earlier, AutoBatch reported 47. Deep learning, ha: what you're really learning is compute.

Suggestion: once you know how large the batch size can be, set it manually to a power of 2, which is friendlier for GPU operations.

Some errors caused by setting the batch size too large:
[Screenshots: cuDNN and CUDA out-of-memory errors]
In short: when you hit cuDNN errors, memory errors, and the like, suspect the batch size first.
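Before choosing a batch size, it can also help to check how much memory each GPU actually has; a small convenience sketch with PyTorch (not from the tutorial):

import torch

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        # total_memory is in bytes; convert to GiB
        print(f"GPU {i} ({props.name}): {props.total_memory / 1024**3:.1f} GiB total")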

2. The tutorial blogger did not split the dataset, but scientific training needs a proper split. That's understandable, since the tutorial is meant to get people started.
For splitting, you can refer to the method I introduced above.
Related reading: the meaning of dataset division.

Simple measures to improve training effectiveness

1. Select high-quality pictures and divide the data set for cross-validation

2. yolov5l.pt: a compromise choice.
As shown in the figure, yolov5l is slower, but it brings a large AP improvement.
[Figure: YOLOv5 model comparison, speed vs. COCO AP]
3. Multi-GPU training with a larger batch size.
I said earlier not to set the batch size too high, but that depends on your hardware. If your lab has GPUs available, consider multi-GPU training with a larger batch size; the effect can be significant.
Related reading: understanding batch size.
Validation results before multi-GPU training:
[Screenshot: validation metrics before multi-GPU training]
Validation results after multi-GPU training:
[Screenshot: validation metrics after multi-GPU training]
Of course, this is only one case, but the method is worth trying, and you don't need to dig too deep into the theory.

As shown below, this error comes from setting the multi-GPU batch size too high. By the way, multi-GPU training cannot use AutoBatch.
[Screenshot: multi-GPU out-of-memory error]
With 3 GPUs, a batch size of 64 raises an error; the total batch size must be divisible by the number of GPUs, so 48 works fine.
[Screenshot: error for a batch size not divisible by the GPU count]

Official reference: the YOLOv5 multi-GPU training guide.
torch.distributed.run is used by recent PyTorch versions; older versions use torch.distributed.launch. For example, the 3-GPU, batch-48 setup above would be launched like this (a sketch; adjust the data path and device ids to your setup):
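python -m torch.distributed.run --nproc_per_node 3 train.py --batch 48 --data ./yolo_A/A.yaml --weights yolov5l.pt --device 0,1,2

Note that --batch here is the total batch size across all GPUs, so each of the 3 GPUs gets 16.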

4. Regarding training strategy, the article I referenced is well written and worth reading.

Description of parameters
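For reference, here is parse_opt() from yolov5/train.py, which defines all of these options: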

def parse_opt(known=False):
    parser = argparse.ArgumentParser()
    parser.add_argument('--weights', type=str, default=ROOT / 'yolov5s.pt', help='initial weights path')
    parser.add_argument('--cfg', type=str, default='', help='model.yaml path')
    parser.add_argument('--data', type=str, default=ROOT / 'data/coco128.yaml', help='dataset.yaml path')
    parser.add_argument('--hyp', type=str, default=ROOT / 'data/hyps/hyp.scratch-low.yaml', help='hyperparameters path')
    parser.add_argument('--epochs', type=int, default=100, help='total training epochs')
    parser.add_argument('--batch-size', type=int, default=16, help='total batch size for all GPUs, -1 for autobatch')
    parser.add_argument('--imgsz', '--img', '--img-size', type=int, default=640, help='train, val image size (pixels)')
    parser.add_argument('--rect', action='store_true', help='rectangular training')
    parser.add_argument('--resume', nargs='?', const=True, default=False, help='resume most recent training')
    parser.add_argument('--nosave', action='store_true', help='only save final checkpoint')
    parser.add_argument('--noval', action='store_true', help='only validate final epoch')
    parser.add_argument('--noautoanchor', action='store_true', help='disable AutoAnchor')
    parser.add_argument('--noplots', action='store_true', help='save no plot files')
    parser.add_argument('--evolve', type=int, nargs='?', const=300, help='evolve hyperparameters for x generations')
    parser.add_argument('--bucket', type=str, default='', help='gsutil bucket')
    parser.add_argument('--cache', type=str, nargs='?', const='ram', help='image --cache ram/disk')
    parser.add_argument('--image-weights', action='store_true', help='use weighted image selection for training')
    parser.add_argument('--device', default='', help='cuda device, i.e. 0 or 0,1,2,3 or cpu')
    parser.add_argument('--multi-scale', action='store_true', help='vary img-size +/- 50%%')
    parser.add_argument('--single-cls', action='store_true', help='train multi-class data as single-class')
    parser.add_argument('--optimizer', type=str, choices=['SGD', 'Adam', 'AdamW'], default='SGD', help='optimizer')
    parser.add_argument('--sync-bn', action='store_true', help='use SyncBatchNorm, only available in DDP mode')
    parser.add_argument('--workers', type=int, default=8, help='max dataloader workers (per RANK in DDP mode)')
    parser.add_argument('--project', default=ROOT / 'runs/train', help='save to project/name')
    parser.add_argument('--name', default='exp', help='save to project/name')
    parser.add_argument('--exist-ok', action='store_true', help='existing project/name ok, do not increment')
    parser.add_argument('--quad', action='store_true', help='quad dataloader')
    parser.add_argument('--cos-lr', action='store_true', help='cosine LR scheduler')
    parser.add_argument('--label-smoothing', type=float, default=0.0, help='Label smoothing epsilon')
    parser.add_argument('--patience', type=int, default=100, help='EarlyStopping patience (epochs without improvement)')
    parser.add_argument('--freeze', nargs='+', type=int, default=[0], help='Freeze layers: backbone=10, first3=0 1 2')
    parser.add_argument('--save-period', type=int, default=-1, help='Save checkpoint every x epochs (disabled if < 1)')
    parser.add_argument('--seed', type=int, default=0, help='Global training seed')
    parser.add_argument('--local_rank', type=int, default=-1, help='Automatic DDP Multi-GPU argument, do not modify')

    # Logger arguments
    parser.add_argument('--entity', default=None, help='Entity')
    parser.add_argument('--upload_dataset', nargs='?', const=True, default=False, help='Upload data, "val" option')
    parser.add_argument('--bbox_interval', type=int, default=-1, help='Set bounding-box image logging interval')
    parser.add_argument('--artifact_alias', type=str, default='latest', help='Version of dataset artifact to use')

    return parser.parse_known_args()[0] if known else parser.parse_args()

--weights: pre-trained weights file, e.g. yolov5s.pt or yolov5l.pt
--cfg: model structure file, e.g. yolov5s.yaml (empty by default; if not specified, the model structure is taken from the weights file. If specified, the cfg file takes precedence.)
--data: dataset configuration file (location of the training data)
--hyp: hyperparameters file; generally left at the default
--epochs: number of training epochs, default 100
--batch-size: default 16; set to -1 for AutoBatch
--imgsz (also --img or --img-size): default 640; must be a multiple of 32; usually no need to change
--resume: resume the most recent training from its checkpoint
--nosave: only save the final checkpoint
--device: CUDA device, e.g. 0 or 0,1,2,3 or cpu
--cache: cache images (in RAM or on disk) to speed up training
--workers: number of dataloader worker processes; affects training speed and consumes CPU memory. More workers load data faster, but past a bottleneck speed drops again, and too many can overload the CPU and cause errors.
--evolve: hyperparameter evolution; worth a try
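Putting the common flags together, a typical run might look like this (a sketch using the paths from this article; adjust to your setup):

python train.py --img 640 --batch 16 --epochs 100 --data ./yolo_A/A.yaml --weights yolov5l.pt --device 0 --workers 8 --cache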

Note: as long as the training parameters (including the random seed, default 0) are consistent and the data is unchanged, training is reproducible: no matter how many times you train, the results will be the same, including the log output, results.png, and so on. There is no point in training multiple times to average the results.

Epilogue

Addicted to alchemy, unable to extricate myself!


Origin: blog.csdn.net/private_Jack/article/details/128345434