[yolov5] "The size of tensor a (3078) must match the size of tensor b (3) at non-singleton dimension" error

Description of the problem:
The error occurs when training YOLOv5 on a dataset I made myself.
Background:
The dataset is in YOLO format and was made by hand. The images are large, 4000×3000, and were never resized, so my guess was that the dataset itself is the problem.
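
If the oversized source images really are the culprit, one option is to shrink them once, offline, before training. Below is a minimal sketch, assuming PIL/Pillow is installed, that the images sit in a flat directory, and that the labels are normalized YOLO-format text files (which stay valid after resizing because they use relative coordinates); the directory names are placeholders:

from pathlib import Path
from PIL import Image

SRC = Path("images")          # original 4000x3000 images (placeholder path)
DST = Path("images_resized")  # where the smaller copies go (placeholder path)
DST.mkdir(exist_ok=True)

MAX_SIDE = 1280               # longest side after resizing

for img_path in SRC.glob("*.jpg"):
    img = Image.open(img_path).convert("RGB")
    # thumbnail() keeps the aspect ratio, so normalized YOLO labels remain correct
    img.thumbnail((MAX_SIDE, MAX_SIDE))
    img.save(DST / img_path.name, quality=95)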

What came up when searching the error message online:

Solution one:

Change
img = Image.open(image_path)
to
img = Image.open(image_path).convert('RGB')
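
For context: a grayscale or RGBA image produces a tensor with 1 or 4 channels instead of the 3 the network expects, which is exactly the kind of shape mismatch the error complains about. A minimal sketch of where the conversion belongs, assuming a custom PyTorch Dataset (the class and attribute names below are illustrative, not taken from YOLOv5):

from PIL import Image
from torch.utils.data import Dataset
import torchvision.transforms as T

class MyImageDataset(Dataset):              # illustrative name
    def __init__(self, image_paths):
        self.image_paths = image_paths
        self.to_tensor = T.ToTensor()

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        # .convert('RGB') forces every image to exactly 3 channels
        img = Image.open(self.image_paths[idx]).convert('RGB')
        return self.to_tensor(img)          # shape (3, H, W)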

Solution two:

Problem 1: RuntimeError: DataLoader worker (pid XXX) is killed by signal: Bus error

Cause:

This problem usually occurs inside Docker. A container's default shared memory is only 64 MB, so when num_workers is large the workers run out of shared memory and the loader crashes.

Solution:
1. Quick workaround (gives up multi-process loading)

  • Set num_workers to 0 (a plain-PyTorch sketch of this is shown at the end of this section)

2. Actually fix the problem

Configure a larger shared memory when creating the container: add the --shm-size="15g" parameter to give it 15 GB of shared memory (adjust to your machine):

nvidia-docker run -it --name [container_name] --shm-size="15g" ...
  • Verify with df -h:
# df -h
Filesystem                                          Size  Used Avail Use% Mounted on
overlay                                             3.6T  3.1T  317G  91% /
tmpfs                                                64M     0   64M   0% /dev
tmpfs                                                63G     0   63G   0% /sys/fs/cgroup
/dev/sdb1                                           3.6T  3.1T  317G  91% /workspace/tmp
shm                                                  15G  8.1G  7.0G  54% /dev/shm
tmpfs                                                63G   12K   63G   1% /proc/driver/nvidia
/dev/sda1                                           219G  170G   39G  82% /usr/bin/nvidia-smi
udev                                                 63G     0   63G   0% /dev/nvidia3
tmpfs                                                63G     0   63G   0% /proc/acpi
tmpfs                                                63G     0   63G   0% /proc/scsi
tmpfs                                                63G     0   63G   0% /sys/firmware
  • The shm line is the container's shared memory.
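
For reference, the num_workers workaround from step 1 above is just a DataLoader argument; a minimal sketch in plain PyTorch (the dataset here is a stand-in for the real training set). In YOLOv5 itself the corresponding knob is, if I remember correctly, the --workers option of train.py.

import torch
from torch.utils.data import DataLoader, TensorDataset

# stand-in dataset; in practice this would be the real training set
dataset = TensorDataset(torch.randn(100, 3, 64, 64))

# num_workers=0 keeps all loading in the main process, so /dev/shm is never
# used for inter-process transfer and the Bus error cannot occur
loader = DataLoader(dataset, batch_size=8, shuffle=True, num_workers=0)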

Problem 2: RuntimeError: DataLoader worker (pid(s) ****) exited unexpectedly

Cause:

The DataLoader spawns its own workers, so if other parts of the program also use multithreading (OpenCV, for example), threads end up nested inside workers and deadlocks become likely.

Solution:

1. Quick workaround (gives up multi-process loading)
Set num_workers to 0

2. Actually fix the problem

  • Disable OpenCV's own multithreading inside the dataset's __getitem__ method (an alternative using worker_init_fn is sketched after the snippet):
def __getitem__(self, idx):
    import cv2
    cv2.setNumThreads(0)  # turn off OpenCV's internal thread pool inside each worker
    ...                   # rest of the original loading / augmentation code
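
Calling setNumThreads on every __getitem__ works but repeats the call for every sample; a common alternative, assuming a standard PyTorch DataLoader is in use (dataset below is a stand-in for the real one), is to run it once per worker via worker_init_fn:

import cv2
from torch.utils.data import DataLoader

def disable_cv2_threads(worker_id):
    # runs once in each freshly spawned DataLoader worker process
    cv2.setNumThreads(0)

loader = DataLoader(dataset, batch_size=32, num_workers=4,
                    worker_init_fn=disable_cv2_threads)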

Solution three:

When batching the data yourself, the samples left over at the end may not fill a whole batch_size; PyTorch's built-in DataLoader shows the same behavior. Correction (an alternative using drop_last is sketched below the snippet):

batch_size_s = len(targets)  # stop training as soon as fewer samples than one batch_size remain
if batch_size_s < BATCH_SIZE:
    break
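
If the batches come straight from PyTorch's DataLoader, the same effect is available without touching the training loop via the standard drop_last flag; a minimal sketch (dataset again stands in for the real one):

from torch.utils.data import DataLoader

# drop_last=True silently discards the final batch when it has fewer than
# BATCH_SIZE samples, so the manual length check above is no longer needed
loader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True, drop_last=True)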

Solution four:

Tracing the error back up the log, the actual cause in my case was:

AutoAnchor: Running kmeans for 9 anchors on 3078 points...
AutoAnchor: Evolving anchors with Genetic Algorithm: fitness = 0.7565:  20%|██▉            | 198/1000 [01:42<06:55,  1.93it/s]
AutoAnchor: ERROR: DataLoader worker (pid 193674) is killed by signal: Killed. 

So I simply reduced the batch size to 32; the training parameters are as follows:

python train.py --img 640 \
                --batch 32 \
                --epochs 300 \
                --data /ai/AD4/code/yolov5/data/waterpipewire_yolo.yaml \
                --weights /ai/AD4/code/yolov5/models/model/yolov5s.pt

Training now starts successfully!

References:
1. PyTorch DataLoader error "DataLoader worker (pid xxx) is killed by signal" and its solution
2. RuntimeError: The size of tensor a (128) must match the size of tensor b (16) at non-singleton dimension

Source: blog.csdn.net/weixin_48936263/article/details/124674410