Can YOLOv5 be trained on grayscale images? YOLOv5 grayscale training and detection from scratch


1 Overview

Object detection is usually performed on RGB images. Running detection on several video streams at the same time consumes a lot of GPU compute and inference time. This post compares grayscale (GRAY) and RGB images experimentally, for both training and testing.

The experiments use yolov5_6.2 for grayscale training and detection.

As a first attempt, the color images were converted to grayscale with OpenCV and 'python train.py' was run without any other change.

gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

train.py runs without error, but the first two rows of the printed model structure are as follows:

[Table 1-1 Excerpt of the model structure]
     from  n  params  module              arguments
0      -1  1    3520  models.common.Conv  [3, 32, 6, 2, 2]
1      -1  1   18560  models.common.Conv  [32, 64, 3, 2]

The table shows that the model input is fixed at 3 channels. Even when the training set consists of grayscale images, they are therefore turned into 3-channel images when they are read.
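The parameter counts in Table 1-1 can be checked by hand, which also shows how much the first layer would shrink with a single input channel. The sketch below assumes that 'models.common.Conv' is a bias-free 2D convolution followed by BatchNorm (one scale and one shift per output channel). 'conv_bn_params' is a hypothetical helper written for this check, not a YOLOv5 function.

```python
def conv_bn_params(c_in: int, c_out: int, k: int) -> int:
    """Learnable parameters of a bias-free k x k convolution plus BatchNorm."""
    conv_weights = c_in * c_out * k * k   # one k x k kernel per (input, output) channel pair
    bn_params = 2 * c_out                 # BatchNorm scale and shift per output channel
    return conv_weights + bn_params


# Layer 0 in Table 1-1: arguments [3, 32, 6, 2, 2] -> 3 inputs, 32 outputs, 6 x 6 kernel
rgb_first_layer = conv_bn_params(3, 32, 6)
# Layer 1 in Table 1-1: arguments [32, 64, 3, 2] -> 32 inputs, 64 outputs, 3 x 3 kernel
second_layer = conv_bn_params(32, 64, 3)
# The same first layer if the model were built with a single input channel (assumption)
gray_first_layer = conv_bn_params(1, 32, 6)

print(rgb_first_layer, second_layer, gray_first_layer)
```

Under these assumptions the formula reproduces the 3520 and 18560 listed in the table, and a one-channel first layer comes out at 1216 parameters. Only the first layer depends on the number of input channels.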

Looking at the source code, line 248 of utils/dataloaders.py reads the image:

245        else:
246            # Read image
247            self.count += 1
248            img0 = cv2.imread(path)  # BGR
249            assert img0 is not None, f'Image Not Found {path}'
250            s = f'image {self.count}/{self.nf} {path}: '

The OpenCV imread method works as follows:

img = cv2.imread(path, flag)   # flag: how the image should be read, default cv2.IMREAD_COLOR
# cv2.IMREAD_COLOR (1):      3-channel BGR
# cv2.IMREAD_GRAYSCALE (0):  single-channel grayscale
# cv2.IMREAD_UNCHANGED (-1): image as stored, including any alpha channel

By default, 'cv2.imread(path)' returns a 3-channel (BGR) image. If the file is a grayscale image, its single channel is copied into all three channels, so the loaded image still has three channels.

2 Modify the source code to enable grayscale training

The location of the source code change is followed by the comment 'Hlj2308'.

2.1 Modify the image reading mode

In line 248 of the utils/dataloaders.py file, the original

img0 = cv2.imread(path)  # BGR

Change to

img0 = cv2.imread(path, 0)  # Hlj2308_GRAY

2.2 Change the number of channels passed to the model in the source code

2.2.1 Change the parameter ch=3 in lines 122 and 130 of the train.py file to ch=1.

model = Model(cfg or ckpt['model'].yaml, ch=1, nc=nc, anchors=hyp.get('anchors')).to(device)  # Hlj2308
model = Model(cfg, ch=1, nc=nc, anchors=hyp.get('anchors')).to(device)  # Hlj2308 create

2.2.2 Change ch=3 to ch=1 in line 151 of the models/yolo.py file.

149 class DetectionModel(BaseModel):
150     # YOLOv5 detection model
151     def __init__(self, cfg='yolov5s.yaml', ch=1, nc=None, anchors=None):  
152         super().__init__()
153         if isinstance(cfg, dict):
154             self.yaml = cfg  # model dict

2.3 Run train.py

Running train.py at this point produces a series of errors. They are fixed one by one below until training and testing run normally.

[Error report 1]

File "....packages\torch\nn\modules\conv.py", line 442, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Given groups=1, weight of size [32, 1, 6, 6], expected input[8, 3, 640, 640] to have 1 channels, but got 3 channels instead

The loaded data has the wrong number of channels: the model now expects 1 channel, but 3 channels are read.

This may be a 'mode' problem when the PIL call 'Image.open()' opens the image. Modify line 612 of models/common.py:

im, f = Image.open(requests.get(im, stream=True).raw if str(im).startswith('http') else im), im

Add the following two lines after it:

if im.mode != 'L':
    im = im.convert('L')

[Error 1] still occurs.

2.4 Modify the source code in utils/general.py

Lines 1031-1032 of utils/general.py are originally

def imread(path, flags=cv2.IMREAD_COLOR):
    return cv2.imdecode(np.fromfile(path, np.uint8), flags)

Change to

def imread(path, flags=cv2.IMREAD_GRAYSCALE):
    return cv2.imdecode(np.fromfile(path, np.uint8), cv2.IMREAD_GRAYSCALE)

[Error 1] still occurs.

2.5 Modify the flags of all cv2 reading images in dataloaders.py

In lines 677, 691, 881, 1042, 1119, 1122, and 1125 of utils/dataloaders.py, change the original

cv2.imread(f)

to the following (search and replace with Ctrl+F):

cv2.imread(f, 0)

After this change, train.py no longer raises [Error 1].

[Error 2] is as follows

File "..../yolov5_6.2_gray/utils/dataloaders.py", line 706, in load_mosaic
    img4 = np.full((s * 2, s * 2, img.shape[2]), 114, dtype=np.uint8)  # base image with 4 tiles
IndexError: tuple index out of range

Solution: img.shape[2] does not exist because the image is now single-channel, so it is a two-dimensional array rather than a three-dimensional one. Line 706 of utils/dataloaders.py is originally

img4 = np.full((s * 2, s * 2, img.shape[2]), 114, dtype=np.uint8)  

Change to

img4 = np.full((s * 2, s * 2, img.shape[2] if len(img.shape)==3 else 1), 114, dtype=np.uint8) 

[Error 3] is as follows. It is caused by the incorrect fix for [Error 2].

File "..../yolov5_6.2_gray/utils/dataloaders.py", line 721, in load_mosaic
    img4[y1a:y2a, x1a:x2a] = img[y1b:y2b, x1b:x2b]  # img4[ymin:ymax, xmin:xmax]
ValueError: could not broadcast input array from shape (360,640) into shape (360,640,1)

Solution: Printing the shapes gives 'img4.shape, img.shape = (1280, 1280, 1) (360, 640)'. The mosaic canvas has a trailing channel axis that the two-dimensional tile lacks, so the assignment cannot be broadcast. Change the fix for [Error 2] to

img4 = np.full((s * 2, s * 2), 114, dtype=np.uint8)  # base image with 4 tiles
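An alternative to hard-coding the two-dimensional shape is to build the canvas from the image's own trailing dimensions, so that the same line works for grayscale and color input. This is a sketch of that idea and not the change made in this post. 'make_mosaic_canvas' is a hypothetical helper, and the tile sizes are taken from the error message above.

```python
import numpy as np


def make_mosaic_canvas(img: np.ndarray, s: int, fill: int = 114) -> np.ndarray:
    """Return a (2s, 2s[, C]) canvas with the same trailing dimensions as img."""
    # img.shape[2:] is () for a 2-D grayscale image and (C,) for an H x W x C image.
    return np.full((s * 2, s * 2) + img.shape[2:], fill, dtype=np.uint8)


gray_tile = np.zeros((360, 640), dtype=np.uint8)
color_tile = np.zeros((360, 640, 3), dtype=np.uint8)

gray_canvas = make_mosaic_canvas(gray_tile, s=640)
color_canvas = make_mosaic_canvas(color_tile, s=640)

# Pasting a tile works in both cases because canvas and tile have the same rank.
gray_canvas[0:360, 0:640] = gray_tile
color_canvas[0:360, 0:640] = color_tile

print(gray_canvas.shape, color_canvas.shape)
```

With this form neither the IndexError of [Error 2] nor the broadcast error of [Error 3] occurs, since the canvas never gains an axis that the tile does not have.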

[Error 4] is as follows

File "D:\yolov5train\yolov5_6.2_grayTrain\utils\dataloaders.py", line 642, in __getitem__
    augment_hsv(img, hgain=hyp['hsv_h'], sgain=hyp['hsv_s'], vgain=hyp['hsv_v'])
  File "D:\yolov5train\yolov5_6.2_grayTrain\utils\augmentations.py", line 69, in augment_hsv
    hue, sat, val = cv2.split(cv2.cvtColor(im, cv2.COLOR_BGR2HSV))
cv2.error: OpenCV(4.2.0) c:\projects\opencv-python\opencv\modules\imgproc\src\color.simd_helpers.hpp:92: error: (-2:Unspecified error) in function '__cdecl cv::impl::`anonymous-namespace'::CvtHelper<struct cv::impl::`anonymous namespace'::Set<3,4,-1>,struct cv::impl::A0x3b52564f::Set<3,-1,-1>,struct cv::impl::A0x3b52564f::Set<0,5,-1>,2>::CvtHelper(const class cv::_InputArray &,const class cv::_OutputArray &,int)'
> Invalid number of channels in input image:
>     'VScn::contains(scn)'
> where
>     'scn' is 1

Solution: The error comes from the HSV conversion, which requires a 3-channel input. HSV augmentation is not needed for grayscale images here, so comment out line 642 of utils/dataloaders.py:

# augment_hsv(img, hgain=hyp['hsv_h'], sgain=hyp['hsv_s'], vgain=hyp['hsv_v'])
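Commenting out the call removes brightness augmentation along with hue and saturation. If brightness jitter is still wanted for grayscale data, one option is a lookup-table gain loosely modeled on the value (V) part of HSV augmentation. The sketch below is an assumption about how this could be done, not part of YOLOv5 and not what was used for the results in this post. 'augment_gray_value' and its default 'vgain' are illustrative.

```python
import numpy as np


def augment_gray_value(img: np.ndarray, vgain: float = 0.4, rng=None) -> np.ndarray:
    """Randomly scale the brightness of a uint8 grayscale image via a lookup table."""
    rng = np.random.default_rng() if rng is None else rng
    # Random gain factor in [1 - vgain, 1 + vgain].
    gain = 1.0 + rng.uniform(-1.0, 1.0) * vgain
    # Map every possible pixel value once, clipping to the valid uint8 range.
    lut = np.clip(np.arange(256) * gain, 0, 255).astype(np.uint8)
    # Index the table with the image to apply the mapping to all pixels.
    return lut[img]


img = np.random.default_rng(0).integers(0, 256, size=(64, 64), dtype=np.uint8)
jittered = augment_gray_value(img, vgain=0.4, rng=np.random.default_rng(1))
unchanged = augment_gray_value(img, vgain=0.0)

print(jittered.shape, jittered.dtype)
```

Unlike 'augment_hsv', this helper returns a new array instead of modifying the image in place, so the caller would need to assign the result.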

[Error 5] is as follows

File "D:\yolov5train\yolov5_6.2_grayTrain\utils\dataloaders.py", line 665, in __getitem__
    img = img.transpose((2, 0, 1))[::-1]  # HWC to CHW, BGR to RGB
ValueError: axes don't match array

Solution: At first this looked like a problem with the BGR to RGB conversion. Line 665 of utils/dataloaders.py is originally

img = img.transpose((2, 0, 1))[::-1]  # HWC to CHW, BGR to RGB

Changing it to the following is still incorrect:

img = img.transpose((2, 0, 1))  # HWC to CHW

OpenCV loads images as HWC and PyTorch expects CHW, which is why the original code calls 'transpose((2, 0, 1))'. The 'img' here, however, has shape (640, 640), so it has no third axis to transpose. [Error 5] can be avoided by adding the channel axis explicitly:

img = img.reshape(1, img.shape[0], img.shape[1])
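The reshape above only works for two-dimensional input. A small helper that accepts both layouts keeps the loader usable for color data as well. This is a sketch under that assumption, and 'to_chw' is a hypothetical name that does not exist in YOLOv5.

```python
import numpy as np


def to_chw(img: np.ndarray) -> np.ndarray:
    """Convert an OpenCV-style image to a contiguous channels-first array."""
    if img.ndim == 2:
        # Grayscale H x W: add a leading channel axis -> 1 x H x W.
        chw = img[np.newaxis, ...]
    else:
        # Color H x W x C: move channels first, then reverse them (BGR -> RGB).
        chw = img.transpose((2, 0, 1))[::-1]
    return np.ascontiguousarray(chw)


gray_chw = to_chw(np.zeros((640, 640), dtype=np.uint8))

bgr = np.zeros((640, 640, 3), dtype=np.uint8)
bgr[..., 0] = 255                      # fill only the blue channel
rgb_chw = to_chw(bgr)

print(gray_chw.shape, rgb_chw.shape)
```

For the blue-only test image, the blue plane ends up last in the channels-first result, which matches the BGR to RGB comment in the original line.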

2.6 train.py now runs its epochs normally

The model cfg file does not need to be modified. The relevant parameters in train.py are set as follows

'--weights', type=str, default='',
'--cfg', type=str, default='./models/yolov5s.yaml',
'--data', type=str, default= './data/my_yolo5.yaml',
'--epochs', type=int, default=50
'--batch-size', type=int, default=8,
'--imgsz', '--img', '--img-size', type=int, default= 640,

'./models/yolov5s.yaml' parameters include the following

# Parameters
nc: 11  # number of classes
depth_multiple: 0.33  # model depth multiple
width_multiple: 0.50  # layer channel multiple

'./data/my_yolo5.yaml' parameters include the following

train: ../datasTrain4_LQTest_gray/images/train/
val: ../datasTrain4_LQTest_gray/images/val/

# number of classes
nc: 11

# class names
names: [ 'pedes', 'car', 'bus', 'truck', 'bike', 'elec', 'tricycle','coni', 'warm', 'tralight', 'speVeh']

3 Model testing

Both the RGB and the GRAY model were trained with yolov5_6.2 for 50 epochs without pre-trained weights. The two data sets contain the same images, once in RGB and once in GRAY format, and the label files are identical. Because the data set is small (2491 training and 360 validation images), no pre-trained model is used, and training runs for only 50 epochs, P, R, and mAP are relatively low.

3.1 Model testing of RGB image training

Training time

One training epoch takes 4:24 to 4:34, and one validation pass takes 0:13 to 0:15.

Accuracy

At the end of training, all classes: P 0.869, R 0.653, mAP@.5 0.717, mAP@.5:.95 0.48

val, all classes: P 0.712, R 0.651, mAP@.5 0.71, mAP@.5:.95 0.512

Val inference time

Speed: 0.3ms pre-process, 7.1ms inference, 1.8ms NMS per image at shape (8, 3, 640, 640)

3.2 Model testing of GRAY image training

Training time

One training epoch takes 2:48 to 2:54, and one validation pass takes 0:08 to 0:09.

Accuracy

At the end of training, all classes: P 0.905, R 0.617, mAP@.5 0.705, mAP@.5:.95 0.464

val, all classes: P 0.728, R 0.619, mAP@.5 0.696, mAP@.5:.95 0.497

Val inference time

Speed: 0.1ms pre-process, 4ms inference, 2.3ms NMS per image at shape (8, 1, 640, 640)

Running val.py during the process above raised [Error 6]:

...
File "D:/yolov5train/yolov5_6.2_grayTrain/val.py", line 169, in run
    model.warmup(imgsz=(1 if pt else batch_size, 3, imgsz, imgsz))  # warmup
...
File "D:\AppData\anaconda3.8\envs\yolov5t\lib\site-packages\torch\nn\modules\conv.py", line 442, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Given groups=1, weight of size [32, 1, 6, 6], expected input[1, 3, 640, 640] to have 1 channels, but got 3 channels instead

Solution: Line 169 of val.py is originally

model.warmup(imgsz=(1 if pt else batch_size, 3, imgsz, imgsz))  # warmup

Change it to the following, which resolves [Error 6]:

model.warmup(imgsz=(1 if pt else batch_size, 1, imgsz, imgsz))  # warmup

3.3 Comparison of RGB and GRAY training and testing

As before, both the RGB and the GRAY model were trained with yolov5_6.2 for 50 epochs without pre-trained weights, on the same images in RGB and GRAY format with identical label files. The data set is split into 2491 training and 360 validation images. The comparison results are as follows:

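[Table 3-1 Comparison of RGB and GRAY model results, compiled from the values in Sections 3.1 and 3.2]

Metric                           RGB          GRAY
Train time per epoch             4:24-4:34    2:48-2:54
Val time per epoch               0:13-0:15    0:08-0:09
P (end of training)              0.869        0.905
R (end of training)              0.653        0.617
mAP@.5 (end of training)         0.717        0.705
mAP@.5:.95 (end of training)     0.48         0.464
P (val)                          0.712        0.728
R (val)                          0.651        0.619
mAP@.5 (val)                     0.71         0.696
mAP@.5:.95 (val)                 0.512        0.497
Pre-process per image (ms)       0.3          0.1
Inference per image (ms)         7.1          4
NMS per image (ms)               1.8          2.3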

4 Conclusion

Based on Table 3-1, the following conclusions can be drawn:

(1) Comparing RGB and GRAY images, training and inference are indeed faster with grayscale images. The inference time, which matters most here, dropped from 7.1 ms to about 4 ms per image (see the Speed lines in Sections 3.1 and 3.2), a reduction of about 3.1 / 7.1 ≈ 44%.

(2) The measured P, R, and mAP@.5 values differ between the two models, but after 50 epochs the differences are small. Both models were trained from scratch for 50 epochs because no pre-trained weights were used.
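The time saving can be recomputed from the logged figures. The sketch below uses the per-image values from the "Speed:" lines in Sections 3.1 and 3.2 (with 4 ms taken as logged for the GRAY model) and also looks at the sum of pre-processing, inference, and NMS, which the conclusion above does not cover. 'reduction_percent' is a helper written for this example.

```python
# Per-image timings in milliseconds, copied from the "Speed:" lines in Sections 3.1 and 3.2.
rgb = {"pre": 0.3, "inference": 7.1, "nms": 1.8}
gray = {"pre": 0.1, "inference": 4.0, "nms": 2.3}


def reduction_percent(before: float, after: float) -> float:
    """Relative reduction from before to after, in percent."""
    return (before - after) / before * 100.0


inference_cut = reduction_percent(rgb["inference"], gray["inference"])
total_cut = reduction_percent(sum(rgb.values()), sum(gray.values()))

print(f"inference only: {inference_cut:.1f}% less time")
print(f"pre-process + inference + NMS: {total_cut:.1f}% less time")
```

With these logged values the saving over all three stages is smaller than for inference alone, because the NMS time did not drop (1.8 ms for RGB versus 2.3 ms for GRAY).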

5 Test RGB to Gray conversion time

The test uses 'detect.py' with the parameter ch=3 passed into the model changed to 1. The '--source' parameter points to a folder that contains only RGB images. After detection, the results are saved in the 'runs/val' directory in gray format. The reason is that the '__next__(self)' method of 'class LoadImages' in 'utils/dataloaders.py' now reads 'img0 = cv2.imread(path, 0)'. I therefore measured the time taken by 'cv2.imread()':

Time to read the RGB image (s): 0.02003192901611328
Time to read the grayscale image (s): 0.011004924774169922

In practice, this comparison of 'cv2.imread()' timings is unnecessary. Video is usually obtained from an RTSP stream or a camera SDK, and the camera data stream is in YUV format, where the Y channel already is the grayscale data. Obtaining RGB data, by contrast, requires an extra YUV to RGB conversion.

The following points still need to be studied:
① How the video from the camera is read when object detection is deployed.
② Which data format the camera SDK delivers.
③ Whether detection pre-processing works on RGB images or on another format (or form of data) coming from the video stream, and whether the data has to be converted to RGB before other pre-processing steps. This includes the principles behind the camera's encoding and decoding.

[If you can recommend good resources on the three points above, thank you~]

The following formulas convert between the R, G, B components and the Y, U, V components of each pixel in the image.

$$
\begin{cases}
Y = 0.299R + 0.587G + 0.114B \\
U = -0.147R - 0.289G + 0.436B \\
V = 0.615R - 0.515G - 0.100B
\end{cases}
$$

$$
\begin{cases}
R = Y + 1.14V \\
G = Y - 0.39U - 0.58V \\
B = Y + 2.03U
\end{cases}
$$
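The two conversions can be written out as plain functions to check that they are consistent with each other and that a gray pixel maps to Y alone. This is a minimal sketch using the rounded coefficients given here, with 0.114 as the blue weight in Y. The function names are chosen for this example, and real camera pipelines may use different ranges or offsets.

```python
def rgb_to_yuv(r: float, g: float, b: float) -> tuple:
    """Convert one RGB pixel to YUV with the rounded coefficients used in this post."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    u = -0.147 * r - 0.289 * g + 0.436 * b
    v = 0.615 * r - 0.515 * g - 0.100 * b
    return y, u, v


def yuv_to_rgb(y: float, u: float, v: float) -> tuple:
    """Convert one YUV pixel back to RGB (approximate inverse of rgb_to_yuv)."""
    r = y + 1.14 * v
    g = y - 0.39 * u - 0.58 * v
    b = y + 2.03 * u
    return r, g, b


# A gray pixel (R = G = B) has all of its information in Y; U and V vanish.
gray_y, gray_u, gray_v = rgb_to_yuv(128, 128, 128)
print(gray_y, gray_u, gray_v)

# Round trip for a saturated color; small deviations come from the rounded coefficients.
red_back = yuv_to_rgb(*rgb_to_yuv(255, 0, 0))
print(red_back)
```

Because the weights in Y sum to 1 and the weights in U and V each sum to 0, a pixel with equal R, G, and B keeps its value in Y. This is why the Y plane of a YUV stream can be used directly as the grayscale image.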


Appendix 1: Training directly on grayscale images with unmodified yolov5_6.2 source code

The model parameters are shown in Figure 1.

[Figure 1: model parameters]

The configuration in Figure 1 also trains. As I understand it, training still runs on 3 channels, and all three channels hold the values of the GRAY channel. In YUV terms, the Y channel is the grayscale channel, so this is equivalent to all three channels carrying the gray value Y.
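Why the third-party channels add nothing in this setup can be seen from a single filter response. When the three input channels are identical, a 3-channel filter gives the same result as a 1-channel filter whose weights are the sum over the channel axis. The sketch below checks this numerically with random weights and is an illustration only, not a statement about the trained YOLOv5 weights.

```python
import numpy as np

rng = np.random.default_rng(0)

gray_patch = rng.random((6, 6))                      # one 6 x 6 grayscale patch
rgb_like_patch = np.stack([gray_patch] * 3, axis=0)  # same plane copied into 3 channels
kernel_3ch = rng.standard_normal((3, 6, 6))          # one output filter of a 3-channel conv

# Response of the 3-channel filter to the replicated patch.
out_3ch = np.sum(kernel_3ch * rgb_like_patch)

# Response of a single-channel filter whose weights are the per-channel sum.
kernel_1ch = kernel_3ch.sum(axis=0)
out_1ch = np.sum(kernel_1ch * gray_patch)

print(np.isclose(out_3ch, out_1ch))
```

As long as the three channels stay identical, the 3-channel first layer therefore spends three times the multiplications of a 1-channel layer on the same information.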

Appendix 2: Data set labels used for training

Table 1 Target label details

ID  Label     Description / Remark
0   pedes     Pedestrians (people riding balance bikes and flatbeds are also included in this category)
1   car       Cars (including SUV, MPV (pickup), VAN (van))
2   bus       Bus
3   truck     Truck, van
4   bike      Bike
5   elec      Motorcycle (electric motorcycle)
6   tricycle  Tricycle (electric tricycle, gas tricycle)
7   coni      Traffic cone (cone barrel)
8   warm      Warning post
9   tralight  Traffic light
10  speVeh    Emergency or special vehicles (ambulances, fire trucks, and engineering vehicles such as cranes, excavators, muck trucks, etc.)

Origin blog.csdn.net/qq_42835363/article/details/132693597