Description of the problem:
An error occurs when training YOLOv5 on a dataset I made myself.
Background:
The dataset is in YOLO format and was made by myself. The images are large, 4000*3000, and were not resized beforehand, so my first guess was that the problem lay in the dataset itself.
Searching for the error message online turned up the following solutions:
Solution one:
Change
img = Image.open(image_path)
to
img = Image.open(image_path).convert('RGB')
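Calling convert('RGB') forces every image to three channels, so grayscale or RGBA files no longer produce tensors whose shapes disagree with the rest of the batch. A minimal sketch of where the change usually lives, assuming a custom PyTorch dataset (the ImageDataset class and image_paths list are made up for illustration):

from PIL import Image
from torch.utils.data import Dataset

class ImageDataset(Dataset):
    def __init__(self, image_paths, transform=None):
        self.image_paths = image_paths
        self.transform = transform

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        # convert('RGB') guarantees 3 channels even for grayscale or RGBA inputs
        img = Image.open(self.image_paths[idx]).convert('RGB')
        if self.transform is not None:
            img = self.transform(img)
        return img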
Solution two:
Problem 1: RuntimeError: DataLoader worker (pid XXX) is killed by signal: Bus error
Cause:
This problem generally occurs inside Docker. Docker's default shared memory is only 64 MB, so when num_workers is large the space runs out and the worker is killed.
Solution:
1. Give up multi-process loading (the easy workaround): set num_workers to 0 (a minimal DataLoader sketch follows the df -h output below).
2. Fix the root cause: configure a larger shared memory when creating the container by adding --shm-size="15g" (choose a size that suits your machine):
nvidia-docker run -it --name [container_name] --shm-size="15g" ...
- You can check the container's shared memory with df -h:
# df -h
Filesystem Size Used Avail Use% Mounted on
overlay 3.6T 3.1T 317G 91% /
tmpfs 64M 0 64M 0% /dev
tmpfs 63G 0 63G 0% /sys/fs/cgroup
/dev/sdb1 3.6T 3.1T 317G 91% /workspace/tmp
shm 15G 8.1G 7.0G 54% /dev/shm
tmpfs 63G 12K 63G 1% /proc/driver/nvidia
/dev/sda1 219G 170G 39G 82% /usr/bin/nvidia-smi
udev 63G 0 63G 0% /dev/nvidia3
tmpfs 63G 0 63G 0% /proc/acpi
tmpfs 63G 0 63G 0% /proc/scsi
tmpfs 63G 0 63G 0% /sys/firmware
- The shm line is the container's shared memory space.
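For reference, this is where num_workers comes in. A minimal sketch with a placeholder TensorDataset standing in for the real training data: num_workers=0 corresponds to workaround 1 above, while any value greater than 0 spawns worker processes that pass batches through /dev/shm, which is why Docker's 64 MB default is so easy to exhaust.

import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset; in practice this is the real training dataset.
dataset = TensorDataset(torch.randn(100, 3, 64, 64), torch.zeros(100, dtype=torch.long))

# num_workers=0: data is loaded in the main process, no shared memory is used,
# so the Bus error cannot occur, at the cost of slower loading.
# num_workers>0: worker processes exchange batches through /dev/shm.
loader = DataLoader(dataset, batch_size=16, shuffle=True, num_workers=0)

for images, labels in loader:
    pass  # training step goes here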
Problem 2: RuntimeError: DataLoader worker (pid(s) ****) exited unexpectedly
Cause:
The DataLoader already runs its workers in parallel; if other code in the program (OpenCV, for example) also starts its own threads inside those workers, threads end up nested inside threads, which easily deadlocks.
Solution:
1. Give up multi-process loading (the easy workaround): set num_workers to 0.
2. Fix the root cause: disable OpenCV's multithreading inside the dataset's __getitem__ method:
def __getitem__(self, idx):
    import cv2
    cv2.setNumThreads(0)  # stop OpenCV from starting its own threads inside the worker
    # ... the rest of the item-loading code stays unchanged
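A slightly fuller sketch of the same pattern, assuming a hypothetical OpenCV-based dataset (CVImageDataset and its fields are made up for illustration):

import cv2
from torch.utils.data import Dataset

class CVImageDataset(Dataset):
    def __init__(self, image_paths):
        self.image_paths = image_paths

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        cv2.setNumThreads(0)  # keep OpenCV single-threaded inside each DataLoader worker
        img = cv2.imread(self.image_paths[idx])
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        return img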
Solution three:
When wrapping the dataloader yourself, the last batch of an epoch may contain fewer samples than one batch_size; the built-in DataLoader shows the same behaviour. Correction:
batch_size_s = len(targets)  # stop training once fewer samples than one batch_size remain
if batch_size_s < BATCH_SIZE:
    break
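With PyTorch's built-in DataLoader, the same effect is available through the drop_last argument, which simply discards the trailing incomplete batch. A minimal sketch using a placeholder TensorDataset:

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(101, 3, 32, 32), torch.zeros(101, dtype=torch.long))

# drop_last=True discards the trailing incomplete batch (101 % 16 = 5 samples),
# so every batch the model sees has exactly batch_size items.
loader = DataLoader(dataset, batch_size=16, shuffle=True, drop_last=True)
print(sum(1 for _ in loader))  # prints 6 full batches instead of 7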
Solution four:
Tracing the cause back up through the log, mine was:
AutoAnchor: Running kmeans for 9 anchors on 3078 points...
AutoAnchor: Evolving anchors with Genetic Algorithm: fitness = 0.7565: 20%|██▉ | 198/1000 [01:42<06:55, 1.93it/s]
AutoAnchor: ERROR: DataLoader worker (pid 193674) is killed by signal: Killed.
So I simply reduced the batch size to 32; the training command is as follows:
python train.py --img 640 \
--batch 32 \
--epochs 300 \
--data /ai/AD4/code/yolov5/data/waterpipewire_yolo.yaml \
--weights /ai/AD4/code/yolov5/models/model/yolov5s.pt
Start training!
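If lowering the batch size alone is not enough, my understanding is that recent versions of YOLOv5's train.py also accept --workers (fewer DataLoader workers means less shared-memory pressure) and --noautoanchor (skip the AutoAnchor step that was being killed above). Check your version's argument list before relying on them; the command below is a sketch, not a guaranteed recipe:

python train.py --img 640 \
    --batch 32 \
    --workers 2 \
    --noautoanchor \
    --data /ai/AD4/code/yolov5/data/waterpipewire_yolo.yaml \
    --weights /ai/AD4/code/yolov5/models/model/yolov5s.pt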
References:
1. PyTorch DataLoader error "DataLoader worker (pid xxx) is killed by signal" and its solution
2. RuntimeError: The size of tensor a (128) must match the size of tensor b (16) at non-singleton dimension