Two months after YOLO v4, introduced in the previous article [opencv dnn module example (16) target detection object_detection - yolov4], Ultralytics released the first official version of YOLOv5, whose performance is comparable to that of YOLO V4.
Article directory
- 1. Explanation of the differences between Yolo v5 and Yolo v4
- 1.1. Data Augmentation - Data enhancement
- 1.2. Auto Learning Bounding Box Anchors - adaptive anchor box
- 1.3. Backbone-Cross-stage Partial Network (CSP)
- 1.4. Neck-Path Aggregation Network (PANET)
- 1.5. Head-YOLO universal detection layer
- 1.5. Activation Function - activation function
- 1.6. Optimization Function - optimization function
- 1.7. Benchmarks - YOLO V5 VS YOLO V4
- 1.8. Comparison and summary
- 2. yolo v5 test
- 3. Custom data set training
Yolo v5 actually has no inheritance relationship with Yolo v4; both are improvements based on yolo v3. However, because no paper was published for it, and because of open source license and other issues, some have questioned whether it can be regarded as a new generation of YOLO. For learning and practical use, though, as long as it catches mice, a white cat or a black cat is a good cat.
1. Explanation of the differences between Yolo v5 and Yolo v4
This section compares YOLO V5 and V4 from the following aspects, briefly describing the new techniques of each and the differences and similarities between the two.
1.1. Data Augmentation - Data enhancement
YOLO V4 uses a combination of multiple data enhancement technologies for a single image. In addition to classic geometric distortion and lighting distortion, it also innovatively uses image occlusion (Random Erase, Cutout, Hide and Seek, Grid Mask, MixUp) technology. For multi-image combination, the author uses a mixture of CutMix and Mosaic technologies. In addition, the author also used Self-Adversarial Training (SAT) for data enhancement.
The author of YOLO V5 has not published a paper yet, so its data augmentation pipeline can only be understood from a code perspective.
YOLOV5 will pass each batch of training data through the data loader and enhance the training data at the same time.
The data loader performs three types of data enhancement: scaling, color space adjustment and mosaic enhancement.
Interestingly, there are media reports that Glenn Jocher, the author of YOLO V5, is the creator of Mosaic Augmentation. He believes that the huge performance improvement of YOLO V4 is largely due to mosaic data enhancement. Perhaps he was not convinced: YOLO V5 was launched just two months after the release of YOLO V4. Of course, whether it keeps the name YOLO V5 or adopts another name in the future depends first on whether the final research results of YOLO V5 can truly surpass YOLO V4.
But it is undeniable that mosaic data enhancement can indeed effectively solve the most troublesome "small object problem" in model training, that is, small objects are not detected as accurately as large objects.
1.2. Auto Learning Bounding Box Anchors - adaptive anchor box
In the previous YOLO V3, k-means and genetic learning algorithms were used to analyze the custom data set to obtain a preset anchor box suitable for object boundary box prediction in the custom data set.
In YOLO V5, the anchor box is automatically learned based on training data. YOLO V4 does not have an adaptive anchor box.
For the COCO data set, the size of the anchor box under the 640×640 image size has been preset in the configuration file *.yaml of YOLO V5:
anchors:
- [10,13, 16,30, 33,23] # P3/8
- [30,61, 62,45, 59,119] # P4/16
- [116,90, 156,198, 373,326] # P5/32
For custom data sets, since the target recognition framework often needs to scale the original image size, and the size of the target object in the data set may be different from the COCO data set, YOLO V5 will automatically learn the size of the anchor box again.
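The idea of re-estimating anchors from the label sizes of a custom data set can be illustrated with a plain k-means over (width, height) pairs. This is only a minimal sketch and not YOLO V5's own anchor routine; the function name `kmeans_anchors` and the sample boxes are hypothetical.

```python
import random

def kmeans_anchors(boxes, k, iters=50, seed=0):
    """Cluster (width, height) pairs into k anchors with plain k-means."""
    rng = random.Random(seed)
    centers = rng.sample(boxes, k)  # start from k distinct label boxes
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for w, h in boxes:
            # assign each box to the nearest center (squared Euclidean distance)
            idx = min(range(k),
                      key=lambda i: (w - centers[i][0]) ** 2 + (h - centers[i][1]) ** 2)
            clusters[idx].append((w, h))
        new_centers = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centers.append((sum(w for w, _ in cluster) / len(cluster),
                                    sum(h for _, h in cluster) / len(cluster)))
            else:
                new_centers.append(centers[i])  # keep the center of an empty cluster
        if new_centers == centers:
            break  # converged
        centers = new_centers
    # report anchors from small to large, like the yaml layout (P3 -> P5)
    return sorted(centers, key=lambda c: c[0] * c[1])

# hypothetical label sizes in pixels after resizing
boxes = [(10, 12), (12, 14), (11, 13), (60, 45), (62, 50),
         (58, 47), (300, 320), (310, 330), (290, 310)]
anchors = kmeans_anchors(boxes, k=3)
print(anchors)
```

Plain k-means can settle in a local optimum depending on the initial centers, which is one reason a refinement step on top of it is useful.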
In the picture above, YOLO V5 is learning the size of the automatic anchor box. For the BDD100K data set, after the image in the model is scaled to 512, the optimal anchor box is:
1.3. Backbone-Cross-stage Partial Network (CSP)
Both YOLO V5 and V4 use CSPDarknet as the Backbone. CSPNet stands for Cross Stage Partial Network. It addresses the duplicated gradient information that arises when optimizing the backbones of other large convolutional neural networks by integrating the gradient changes into the feature map from beginning to end. This reduces the number of model parameters and FLOPS, which preserves inference speed and accuracy while reducing model size.
1.4. Neck-Path Aggregation Network (PANET)
Neck is mainly used to generate feature pyramids. The feature pyramid will enhance the model's detection of objects at different scales, allowing it to recognize the same object at different sizes and scales.
Until PANET appeared, FPN had been the state of the art for the feature aggregation layer of object detection frameworks.
In the research of YOLO V4, PANET is considered to be the most suitable feature fusion network for YOLO, so both YOLO V5 and V4 use PANET as Neck to aggregate features.
1.5. Head-YOLO universal detection layer
The model Head is mainly used for the final detection part. It applies anchor boxes on the feature map and produces a final output vector with class probabilities, object scores and bounding boxes.
In the YOLO V5 model, the model Head is the same as the previous YOLO V3 and V4 versions.
These Heads with different scaling scales are used to detect objects of different sizes (input 608, final output downsampling 5 times), each Head has a total of (80 classes + 1 probability + 4 coordinates) * 3 anchor boxes, a total of 255 channels.
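The channel count can be checked with a few lines of arithmetic. The sketch below is illustrative only; the helper names are hypothetical, and the strides 8, 16 and 32 are taken from the P3/8, P4/16, P5/32 comments in the anchor configuration.

```python
def head_channels(num_classes, anchors_per_scale=3):
    """Output channels per head: (classes + 1 objectness + 4 box coords) per anchor."""
    return (num_classes + 1 + 4) * anchors_per_scale

def num_predictions(img_size, strides=(8, 16, 32), anchors_per_scale=3):
    """Total predicted boxes over all heads for a square input."""
    return sum((img_size // s) ** 2 * anchors_per_scale for s in strides)

print(head_channels(80))     # 255 channels for the 80 COCO classes
print(num_predictions(640))  # 25200 candidate boxes for a 640x640 input
```

For a 640×640 input the three heads produce 80×80, 40×40 and 20×20 grids, which is where a flattened output of 25200 rows with 85 values each comes from.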
1.5. Activation Function - activation function
The choice of activation function is crucial for deep learning networks. The author of YOLO V5 used Leaky ReLU and Sigmoid activation functions.
In YOLO V5, the middle/hidden layer uses the Leaky ReLU activation function, and the final detection layer uses the Sigmoid activation function. YOLO V4 uses the Mish activation function.
Mish beats Swish on 39 benchmarks and ReLU on 40 benchmarks, with some results showing 3–5% improvements in benchmark accuracy. But be aware that Mish activation is computationally more expensive compared to ReLU and Swish.
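For reference, the three activation functions mentioned here can be written out directly. This is a small sketch using only the standard library; the negative slope of 0.1 for Leaky ReLU is an assumption chosen for illustration, not a value taken from the article.

```python
import math

def leaky_relu(x, negative_slope=0.1):
    """Leaky ReLU: pass positives through, scale negatives by a small slope."""
    return x if x >= 0 else negative_slope * x

def sigmoid(x):
    """Logistic sigmoid, squashing any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def mish(x):
    """Mish: x * tanh(softplus(x)), with softplus(x) = ln(1 + e^x)."""
    return x * math.tanh(math.log1p(math.exp(x)))

for v in (-2.0, 0.0, 2.0):
    print(v, leaky_relu(v), sigmoid(v), mish(v))
```

Mish needs an exponential, a logarithm and a tanh per value, while Leaky ReLU needs only a comparison and a multiplication, which is consistent with the higher computational cost noted above.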
1.6. Optimization Function - optimization function
The author of YOLO V5 provides us with two optimization functions Adam and SGD, and both preset matching training hyperparameters. Default is SGD.
YOLO V4 uses SGD.
The author of YOLO V5 recommends that if you need to train smaller custom datasets, Adam is a more suitable choice, although Adam's learning rate is generally lower than SGD.
But if you train a large data set, SGD works better than Adam for YOLOV5.
In fact, there is no unified conclusion in the academic community as to which one is better, SGD or Adam, and it depends on the actual project situation.
Cost Function
The loss calculation of the YOLO series is based on the objectness score, class probability score, and bounding box regression score.
YOLO V5 uses GIOU Loss as the bounding box loss, and uses binary cross-entropy with logits to calculate the class probability and objectness losses. The fl_gamma parameter can also be used to activate Focal loss.
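To make the classification and objectness terms concrete, here is a minimal standard-library sketch of binary cross-entropy on logits and of the focal modulation that a gamma parameter controls. It is not the YOLO V5 implementation: the function names are hypothetical, and the sketch leaves out the class-balancing factor that full focal loss formulations also include.

```python
import math

def sigmoid(x):
    """Numerically stable logistic sigmoid."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def bce_with_logits(logit, target):
    """Binary cross-entropy on a raw logit, in the numerically stable form."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def focal_bce(logit, target, gamma=1.5):
    """Scale BCE by (1 - p_t)^gamma so easy examples contribute less."""
    p = sigmoid(logit)
    p_t = target * p + (1.0 - target) * (1.0 - p)  # probability of the true label
    return (1.0 - p_t) ** gamma * bce_with_logits(logit, target)

print(bce_with_logits(0.0, 1.0))        # ln(2) for an undecided prediction
print(focal_bce(3.0, 1.0, gamma=2.0))   # confident, correct prediction is down-weighted
```

With gamma set to 0 the modulating factor is 1 and the focal variant reduces to plain binary cross-entropy.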
YOLO V4 uses CIOU Loss as the loss of the bounding box. Compared with other mentioned methods, CIOU brings faster convergence and better performance.
The results in the above figure are based on Faster R-CNN. It can be seen that CIoU actually performs better than GIoU.
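As a reference for the box term, GIoU can be computed in a few lines. This is a minimal sketch assuming boxes given as (x1, y1, x2, y2) corners with positive area; the function name is hypothetical, and the corresponding loss would be 1 - GIoU. CIoU additionally penalizes center distance and aspect-ratio mismatch, which this sketch does not cover.

```python
def giou(a, b):
    """Generalized IoU of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # intersection area (zero when the boxes do not overlap)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union
    # area of the smallest box enclosing both inputs
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return iou - (enclose - union) / enclose

print(giou((0, 0, 2, 2), (0, 0, 2, 2)))  # identical boxes
print(giou((0, 0, 2, 2), (1, 1, 3, 3)))  # partial overlap
print(giou((0, 0, 1, 1), (2, 2, 3, 3)))  # disjoint boxes still get a useful signal
```

Unlike plain IoU, which is 0 for every pair of disjoint boxes, GIoU becomes more negative as the boxes move apart, so it still provides a gradient in that case.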
1.7. Benchmarks - YOLO V5 VS YOLO V4
Until a paper provides a detailed discussion, we can only compare the performance of the two by looking at the COCO metrics released by the author, combined with subsequent evaluations by experienced community members.
1.7.1. Official performance evaluation
In the two figures above, FPS and ms/img are reciprocal measures (FPS = 1000 / ms per image). After unit conversion, we can find that YOLO V5 can reach 250 FPS on the V100 GPU and has a high mAP.
Since the original training of YOLO V4 is on 1080TI, which is far lower than the performance of V100, and the benchmarks of AP_50 and AP_val are different, it is impossible to obtain the benchmarks of the two based on the above table alone.
Fortunately, WongKinYiu, the second author of YOLO V4, used the V100 GPU to provide comparable benchmarks.
As can be seen from the chart, the performance of the two is actually very close, but according to the data, YOLO V4 is still the best object detection framework. YOLO V4 is highly customizable. If you are not afraid of more custom configurations, then Darknet-based YOLO V4 is still the most accurate.
It is worth noting that YOLO V4 actually uses a large number of data enhancement technologies in the Ultralytics YOLOv3 code base. These technologies are also run in YOLO V5. How much impact the data enhancement technology has on the results has to wait for the author's paper analysis.
1.7.2. Training time
According to Roboflow research, YOLO V5 trains very quickly, far exceeding YOLO V4 in training speed. For Roboflow's custom dataset, YOLO V4 took 14 hours to reach the maximum validation evaluation, while YOLO V5 only took 3.5 hours.
1.7.3. Model size
The sizes of different models in the figure are: V5x: 367MB, V5l: 192MB, V5m: 84MB, V5s: 27MB, YOLOV4: 245 MB. The YOLO V5s model is very small, which reduces deployment costs and is conducive to rapid deployment of the model.
1.7.4. Inference time
On a single image (batch size 1), YOLOV4 infers in 22 ms and YOLOV5s infers in 20 ms.
The YOLOV5 implementation defaults to batch inference (batch size 36) and divides the batch processing time by the number of images in the batch. The inference time for a single image can then reach 7 ms, which is about 140 FPS. This is the current state of the art in the field of object detection.
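The unit conversions used in this comparison are simple enough to verify by hand. A small sketch follows; the helper names are hypothetical, and the 252 ms batch time is an illustrative value chosen so that 36 images work out to 7 ms each.

```python
def fps_from_ms(ms_per_image):
    """Frames per second for a given per-image latency in milliseconds."""
    return 1000.0 / ms_per_image

def per_image_ms(batch_ms, batch_size):
    """Amortized per-image time when a whole batch is timed at once."""
    return batch_ms / batch_size

print(round(fps_from_ms(4.0)))   # 250
print(round(fps_from_ms(7.0)))   # 143, usually quoted as "about 140 FPS"
print(per_image_ms(252.0, 36))   # 7.0
```

The amortized figure describes throughput, not the latency of one image processed alone, which is why the batch size 1 numbers are the fairer basis for comparing the two frameworks.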
I used the model I trained to perform real-time inference on 10,000 test images. The inference speed of YOLOV5s is very amazing. Each image requires only 7ms of inference time. Coupled with the model size of more than 20 megabytes, it is unrivaled in terms of flexibility.
But in fact, this is not fair to YOLO V4. Since YOLO V4 does not implement batch inference by default, it is at a disadvantage in this comparison. The two object detection frameworks should be tested more extensively under the same benchmark.
Secondly, YOLO V4 has recently launched a tiny version. The performance and speed comparison between YOLO V5s and V4 tiny requires more practical analysis.
1.8. Comparison and summary
In general, YOLO V4 is better than YOLO V5 in performance, but weaker than YOLO V5 in flexibility and speed.
Since YOLO V5 is still being updated rapidly, the final research results of YOLO V5 remain to be analyzed.
I personally think that for these object detection frameworks, the performance of the feature fusion layer is very important. Currently, both use PANET, but according to research by Google Brain, BiFPN is the best choice for the feature fusion layer. Whoever can integrate this technology is likely to achieve significant performance improvements.
Although YOLO V5 still trails YOLO V4 in accuracy, it has the following significant advantages:
- It uses the PyTorch framework, which is very user-friendly and makes it easy to train on your own data sets. Compared with the Darknet framework adopted by YOLO V4, PyTorch is easier to put into production.
- The code is easy to read and integrates a large number of computer vision techniques, which makes it well suited to learning and reference.
- The environment is easy to configure, model training is very fast, and batch inference produces real-time results.
- It can perform efficient inference directly on single images, batched images, videos, and even webcam inputs.
- PyTorch weight files can easily be converted to the ONNX format (usable, for example, on Android) and from there to the format used by OpenCV, or converted through CoreML for iOS and deployed directly in mobile applications.
- Finally, the object recognition speed of YOLO V5s, up to 140 FPS, is very impressive, and the user experience is great.
2. yolo v5 test
The current yolo v5 project address is https://github.com/ultralytics/yolov5 , and the version has been updated to v7.0.
2.1. python test
2.1.1. Installation
git clone https://github.com/ultralytics/yolov5 # clone
cd yolov5
pip install -r requirements.txt # install
2.1.2. Inference
- Using yolov5 hub inference, the latest model will be automatically downloaded from the YOLOv5 release.
import torch

# Model
model = torch.hub.load("ultralytics/yolov5", "yolov5s")  # or yolov5n - yolov5x6, custom

# Images
img = "https://ultralytics.com/images/zidane.jpg"  # or file, Path, PIL, OpenCV, numpy, list

# Inference
results = model(img)

# Results
results.print()  # or .show(), .save(), .crop(), .pandas(), etc.
- Inference using detect.py
detect.py runs inference on various sources. The model is automatically downloaded from the latest YOLOv5 release and the results are saved to runs/detect.
python detect.py --weights yolov5s.pt --source 0                               # webcam
                                               img.jpg                         # image
                                               vid.mp4                         # video
                                               screen                          # screenshot
                                               path/                           # directory
                                               list.txt                        # list of images
                                               list.streams                    # list of streams
                                               'path/*.jpg'                    # glob
                                               'https://youtu.be/LNwODJXcvt4'  # YouTube
                                               'rtsp://example.com/media.mp4'  # RTSP, RTMP, HTTP stream
2.1.3. Test output
Pay attention to the usage and operating efficiency comparison of the parameters --dnn and --half, focusing on the three indicator time data of pre-process, inference, and nms.
(yolo_pytorch) E:\DeepLearning\yolov5>python detect.py --weights yolov5n.pt --source data/images/bus.jpg
detect: weights=['yolov5n.pt'], source=data/images/bus.jpg, data=data\coco128.yaml, imgsz=[640, 640], conf_thres=0.25, iou_thres=0.45, max_det=1000, device=, view_img=False, save_txt=False, save_conf=False, save_crop=False, nosave=False, classes=None, agnostic_nms=False, augment=False, visualize=False, update=False, project=runs\detect, name=exp, exist_ok=False, line_thickness=3, hide_labels=False, hide_conf=False, half=False, dnn=False, vid_stride=1
YOLOv5 v7.0-167-g5deff14 Python-3.9.16 torch-1.13.1+cu117 CUDA:0 (NVIDIA GeForce GTX 1080 Ti, 11264MiB)
Fusing layers...
YOLOv5n summary: 213 layers, 1867405 parameters, 0 gradients
image 1/1 E:\DeepLearning\yolov5\data\images\bus.jpg: 640x480 4 persons, 1 bus, 121.0ms
Speed: 1.0ms pre-process, 121.0ms inference, 38.0ms NMS per image at shape (1, 3, 640, 640)
Results saved to runs\detect\exp2
(yolo_pytorch) E:\DeepLearning\yolov5>python detect.py --weights yolov5n.pt --source data/images/bus.jpg --device 0
detect: weights=['yolov5n.pt'], source=data/images/bus.jpg, data=data\coco128.yaml, imgsz=[640, 640], conf_thres=0.25, iou_thres=0.45, max_det=1000, device=0, view_img=False, save_txt=False, save_conf=False, save_crop=False, nosave=False, classes=None, agnostic_nms=False, augment=False, visualize=False, update=False, project=runs\detect, name=exp, exist_ok=False, line_thickness=3, hide_labels=False, hide_conf=False, half=False, dnn=True, vid_stride=1
YOLOv5 v7.0-167-g5deff14 Python-3.9.16 torch-1.13.1+cu117 CUDA:0 (NVIDIA GeForce GTX 1080 Ti, 11264MiB)
Fusing layers...
YOLOv5n summary: 213 layers, 1867405 parameters, 0 gradients
image 1/1 E:\DeepLearning\yolov5\data\images\bus.jpg: 640x480 4 persons, 1 bus, 11.0ms
Speed: 0.0ms pre-process, 11.0ms inference, 7.0ms NMS per image at shape (1, 3, 640, 640)
Results saved to runs\detect\exp2
(yolo_pytorch) E:\DeepLearning\yolov5>python detect.py --weights yolov5n.pt --source data/images/bus.jpg --dnn
detect: weights=['yolov5n.pt'], source=data/images/bus.jpg, data=data\coco128.yaml, imgsz=[640, 640], conf_thres=0.25, iou_thres=0.45, max_det=1000, device=, view_img=False, save_txt=False, save_conf=False, save_crop=False, nosave=False, classes=None, agnostic_nms=False, augment=False, visualize=False, update=False, project=runs\detect, name=exp, exist_ok=False, line_thickness=3, hide_labels=False, hide_conf=False, half=False, dnn=True, vid_stride=1
YOLOv5 v7.0-167-g5deff14 Python-3.9.16 torch-1.13.1+cu117 CUDA:0 (NVIDIA GeForce GTX 1080 Ti, 11264MiB)
Fusing layers...
YOLOv5n summary: 213 layers, 1867405 parameters, 0 gradients
image 1/1 E:\DeepLearning\yolov5\data\images\bus.jpg: 640x480 4 persons, 1 bus, 10.0ms
Speed: 0.0ms pre-process, 10.0ms inference, 4.0ms NMS per image at shape (1, 3, 640, 640)
Results saved to runs\detect\exp3
Other test comparisons (times in ms per image):
            pre-process   inference   NMS
cpu         1             121         38
gpu         0             11          7
dnn         0             10          4
gpu-half    0             10          4
dnn-half    1             11          4
2.2. c++ test
Here, the opencv dnn module is used to load the onnx format model exported by yolov5 for testing.
2.2.1. Model export
The official website actually provides onnx format export files for each version of the model, but they are all half-precision models and cannot be used directly in opencv dnn.
Here we take yolov5x as an example to export the onnx model. For first-time use, you can check the available parameters in the py file or through the command line. Note that when exporting, you should select an onnx opset version that is compatible with your opencv dnn version. The export is run as follows.
(yolo_pytorch) E:\DeepLearning\yolov5>python export.py --weights yolov5x.pt --include onnx --opset 12
export: data=E:\DeepLearning\yolov5\data\coco128.yaml, weights=['yolov5x.pt'], imgsz=[640, 640], batch_size=1, device=cpu, half=False, inplace=False, keras=False, optimize=False, int8=False, dynamic=False, simplify=False, opset=12, verbose=False, workspace=4, nms=False, agnostic_nms=False, topk_per_class=100, topk_all=100, iou_thres=0.45, conf_thres=0.25, include=['onnx']
YOLOv5 v7.0-167-g5deff14 Python-3.9.16 torch-1.13.1+cu117 CPU
Fusing layers...
YOLOv5x summary: 444 layers, 86705005 parameters, 0 gradients
PyTorch: starting from yolov5x.pt with output shape (1, 25200, 85) (166.0 MB)
ONNX: starting export with onnx 1.14.0...
ONNX: export success 10.0s, saved as yolov5x.onnx (331.2 MB)
Export complete (15.0s)
Results saved to E:\DeepLearning\yolov5
Detect: python detect.py --weights yolov5x.onnx
Validate: python val.py --weights yolov5x.onnx
PyTorch Hub: model = torch.hub.load('ultralytics/yolov5', 'custom', 'yolov5x.onnx')
Visualize: https://netron.app
2.2.2. opencv dnn c++ code test
The main code is the same as for yolov4; the main differences are:
- Preprocessing can optionally scale and pad the image so that its size is consistent with the network input; see the formatToSquare() function.
- There are some slight adjustments to the handling of the network output in the post-processing code.
The complete code is as follows
#pragma once
#include "opencv2/opencv.hpp"
#include <fstream>
#include <sstream>
#include <random>
using namespace cv;
using namespace dnn;
float inpWidth;
float inpHeight;
float confThreshold, scoreThreshold, nmsThreshold;
std::vector<std::string> classes;
std::vector<cv::Scalar> colors;
bool letterBoxForSquare = true;
cv::Mat formatToSquare(const cv::Mat &source);
void postprocess(Mat& frame, cv::Size inputSz, const std::vector<Mat>& out, Net& net);
void drawPred(int classId, float conf, int left, int top, int right, int bottom, Mat& frame);
std::random_device rd;
std::mt19937 gen(rd());
std::uniform_int_distribution<int> dis(100, 255);
int main()
{
    // Configure according to the selected detection model file
    confThreshold = 0.25;
    scoreThreshold = 0.45;
    nmsThreshold = 0.5;
    float scale = 1/255.0; //0.00392
    Scalar mean = {0,0,0};
    bool swapRB = true;
    inpWidth = 640;
    inpHeight = 640;
    String model_dir = R"(E:\DeepLearning\yolov5)";
    String modelPath = model_dir + R"(\yolov5n.onnx)";
    String configPath;
    String framework = "";
    int backendId = cv::dnn::DNN_BACKEND_CUDA;
    int targetId = cv::dnn::DNN_TARGET_CUDA;
    String classesFile = R"(model\object_detection_classes_yolov3.txt)";
    // Open file with classes names.
    if(!classesFile.empty()) {
        const std::string& file = classesFile;
        std::ifstream ifs(file.c_str());
        if(!ifs.is_open())
            CV_Error(Error::StsError, "File " + file + " not found");
        std::string line;
        while(std::getline(ifs, line)) {
            classes.push_back(line);
            colors.push_back(cv::Scalar(dis(gen), dis(gen), dis(gen)));
        }
    }
    // Load a model.
    Net net = readNet(modelPath, configPath, framework);
    net.setPreferableBackend(backendId);
    net.setPreferableTarget(targetId);
    std::vector<String> outNames = net.getUnconnectedOutLayersNames();
    {
        // Explicit casts: float -> int narrowing is not allowed in a braced initializer.
        int dims[] = {1, 3, (int)inpHeight, (int)inpWidth};
        cv::Mat tmp = cv::Mat::zeros(4, dims, CV_32F);
        std::vector<cv::Mat> outs;
        net.setInput(tmp);
        for(int i = 0; i<10; i++)
            net.forward(outs, outNames); // warmup
    }
    // Create a window
    static const std::string kWinName = "Deep learning object detection in OpenCV";
    cv::namedWindow(kWinName, 0);
    // Open a video file or an image file or a camera stream.
    VideoCapture cap;
    //cap.open(0);
    cap.open(R"(E:\DeepLearning\yolov5\data\images\bus.jpg)");
    cv::TickMeter tk;
    // Process frames.
    Mat frame, blob;
    while(waitKey(1) < 0) {
        //tk.reset();
        //tk.start();
        cap >> frame;
        if(frame.empty()) {
            waitKey();
            break;
        }
        // Create a 4D blob from a frame.
        cv::Mat modelInput = frame;
        if(letterBoxForSquare && inpWidth == inpHeight)
            modelInput = formatToSquare(modelInput);
        blobFromImage(modelInput, blob, scale, cv::Size2f(inpWidth, inpHeight), mean, swapRB, false);
        // Run a model.
        net.setInput(blob);
        std::vector<Mat> outs;
        //tk.reset();
        //tk.start();
        auto tt1 = cv::getTickCount();
        net.forward(outs, outNames);
        auto tt2 = cv::getTickCount();
        //tk.stop(); // the matching tk.start() is commented out, so keep this disabled too
        postprocess(frame, modelInput.size(), outs, net);
        //tk.stop();
        // Put efficiency information.
        std::vector<double> layersTimes;
        double freq = getTickFrequency() / 1000;
        double t = net.getPerfProfile(layersTimes) / freq;
        std::string label = format("Inference time: %.2f ms (%.2f ms)", t, /*tk.getTimeMilli()*/ (tt2 - tt1) / cv::getTickFrequency() * 1000);
        cv::putText(frame, label, Point(0, 15), FONT_HERSHEY_SIMPLEX, 0.5, Scalar(0, 255, 0));
        cv::imshow(kWinName, frame);
    }
    return 0;
}
// Pad the image to a square (top-left aligned, zero filled) so the aspect ratio is kept.
cv::Mat formatToSquare(const cv::Mat &source)
{
    int col = source.cols;
    int row = source.rows;
    int _max = MAX(col, row);
    cv::Mat result = cv::Mat::zeros(_max, _max, CV_8UC3);
    source.copyTo(result(cv::Rect(0, 0, col, row)));
    return result;
}
void postprocess(Mat& frame, cv::Size inputSz, const std::vector<Mat>& outs, Net& net)
{
    // yolov5 has an output of shape (batchSize, 25200, 85): box[x,y,w,h] + confidence[c] + class scores
    auto tt1 = cv::getTickCount();
    //float x_factor = frame.cols / inpWidth;
    //float y_factor = frame.rows / inpHeight;
    // Scale factors from network input coordinates back to the (padded) image.
    float x_factor = inputSz.width / inpWidth;
    float y_factor = inputSz.height / inpHeight;
    std::vector<int> class_ids;
    std::vector<float> confidences;
    std::vector<cv::Rect> boxes;
    int rows = outs[0].size[1];
    int dimensions = outs[0].size[2];
    float *data = (float *)outs[0].data;
    for(int i = 0; i < rows; ++i) {
        float confidence = data[4];
        if(confidence >= confThreshold) {
            float *classes_scores = data + 5;
            cv::Mat scores(1, classes.size(), CV_32FC1, classes_scores);
            cv::Point class_id;
            double max_class_score;
            minMaxLoc(scores, 0, &max_class_score, 0, &class_id);
            if(max_class_score > scoreThreshold) {
                confidences.push_back(confidence);
                class_ids.push_back(class_id.x);
                // Box is given as center x/y plus width/height.
                float x = data[0];
                float y = data[1];
                float w = data[2];
                float h = data[3];
                int left = int((x - 0.5 * w) * x_factor);
                int top = int((y - 0.5 * h) * y_factor);
                int width = int(w * x_factor);
                int height = int(h * y_factor);
                boxes.push_back(cv::Rect(left, top, width, height));
            }
        }
        data += dimensions;
    }
    std::vector<int> indices;
    NMSBoxes(boxes, confidences, scoreThreshold, nmsThreshold, indices);
    auto tt2 = cv::getTickCount();
    std::string label = format("NMS time: %.2f ms", (tt2 - tt1) / cv::getTickFrequency() * 1000);
    cv::putText(frame, label, Point(0, 30), FONT_HERSHEY_SIMPLEX, 0.5, Scalar(0, 255, 0));
    for(size_t i = 0; i < indices.size(); ++i) {
        int idx = indices[i];
        Rect box = boxes[idx];
        drawPred(class_ids[idx], confidences[idx], box.x, box.y,
                 box.x + box.width, box.y + box.height, frame);
    }
}
void drawPred(int classId, float conf, int left, int top, int right, int bottom, Mat& frame)
{
    rectangle(frame, Point(left, top), Point(right, bottom), Scalar(0, 255, 0));
    std::string label = format("%.2f", conf);
    Scalar color = Scalar::all(255);
    if(!classes.empty()) {
        CV_Assert(classId < (int)classes.size());
        label = classes[classId] + ": " + label;
        color = colors[classId];
    }
    int baseLine;
    Size labelSize = getTextSize(label, FONT_HERSHEY_SIMPLEX, 0.5, 1, &baseLine);
    top = max(top, labelSize.height);
    rectangle(frame, Point(left, top - labelSize.height),
              Point(left + labelSize.width, top + baseLine), color, FILLED);
    cv::putText(frame, label, Point(left, top), FONT_HERSHEY_SIMPLEX, 0.5, Scalar());
}
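The per-row decoding performed in postprocess() above can be restated compactly in Python, which may make the output layout easier to follow. This is only an illustrative sketch: `decode_row` is a hypothetical helper, the default thresholds mirror the values set in main(), and the sample row is made up.

```python
def decode_row(row, x_factor, y_factor, conf_threshold=0.25, score_threshold=0.45):
    """Decode one output row laid out as [cx, cy, w, h, objectness, class scores...].

    Returns (class_id, objectness, (left, top, width, height)) or None if filtered out.
    """
    cx, cy, w, h, objectness = row[:5]
    if objectness < conf_threshold:
        return None
    class_scores = row[5:]
    class_id = max(range(len(class_scores)), key=class_scores.__getitem__)
    if class_scores[class_id] <= score_threshold:
        return None
    # convert center/size in network coordinates to a top-left box in image coordinates
    left = int((cx - 0.5 * w) * x_factor)
    top = int((cy - 0.5 * h) * y_factor)
    return class_id, objectness, (left, top, int(w * x_factor), int(h * y_factor))

# made-up row with three classes, for a 1280x1280 padded image and a 640x640 network input
row = [320.0, 320.0, 100.0, 200.0, 0.9, 0.1, 0.8, 0.05]
print(decode_row(row, x_factor=2.0, y_factor=2.0))
```

Rows that survive both thresholds are then passed to non-maximum suppression, as the C++ code does with NMSBoxes.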
2.2.3. Test results
In the previous python test on the GPU, forward inference took 10ms and NMS took 4ms. Here, with opencv dnn running on the CUDA backend, forward inference took ~7ms and NMS took ~0.3ms.
3. Custom data set training
Here, yolov5s is used as a pre-training model to train a target detection model containing 4 types of vehicle types.
3.1. Data set preparation
First annotate the images yourself. Taking the VOC format as an example, use the labelImg tool for labeling. The default annotation file format of VOC is xml, which needs to be converted to txt through a script (alternatively, labelImg can save annotations directly in the txt format required by yolo).
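The xml-to-txt conversion amounts to normalizing each box by the image size and writing one line per object. Here is a minimal sketch assuming the usual VOC layout (size/width, size/height, object/name, object/bndbox); the function name and the sample annotation are hypothetical.

```python
import xml.etree.ElementTree as ET

def voc_to_yolo_lines(xml_text, class_names):
    """Convert one VOC annotation (XML text) into YOLO txt lines:
    'class_id x_center y_center width height', all normalized to [0, 1]."""
    root = ET.fromstring(xml_text)
    img_w = float(root.findtext("size/width"))
    img_h = float(root.findtext("size/height"))
    lines = []
    for obj in root.iter("object"):
        name = obj.findtext("name")
        if name not in class_names:
            continue  # skip classes that are not being trained
        xmin = float(obj.findtext("bndbox/xmin"))
        ymin = float(obj.findtext("bndbox/ymin"))
        xmax = float(obj.findtext("bndbox/xmax"))
        ymax = float(obj.findtext("bndbox/ymax"))
        xc = (xmin + xmax) / 2.0 / img_w
        yc = (ymin + ymax) / 2.0 / img_h
        w = (xmax - xmin) / img_w
        h = (ymax - ymin) / img_h
        lines.append(f"{class_names.index(name)} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}")
    return lines

SAMPLE_XML = """<annotation>
  <size><width>640</width><height>480</height><depth>3</depth></size>
  <object>
    <name>car</name>
    <bndbox><xmin>100</xmin><ymin>120</ymin><xmax>300</xmax><ymax>360</ymax></bndbox>
  </object>
</annotation>"""

yolo_lines = voc_to_yolo_lines(SAMPLE_XML, ["car", "huoche", "guache", "keche"])
print(yolo_lines)
```

Each image then gets a txt file with the same base name containing these lines.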
Here we only focus on the folders JPEGImages and labels. After the annotation is completed, place the images and the generated annotation files in any directory, for example E:\DeepLearning\yolov5\custom-data\vehicle, then place the images and annotation files into the images and labels folders respectively (these are the yolov5 default paths; otherwise you need to modify two parameters in the img2label_paths function in yolov5/utils/dataloaders.py).
vehicle
├── images
│ ├── 20151127_114556.jpg
│ ├── 20151127_114946.jpg
│ └── 20151127_115133.jpg
├── labels
│ ├── 20151127_114556.txt
│ ├── 20151127_114946.txt
│ └── 20151127_115133.txt
After that, prepare the list files train.txt, val.txt, and test.txt for the training set, validation set, and test set (optional). Each file stores the absolute paths of its images, and the images are split randomly, for example in a 7:2:1 ratio.
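A random 7:2:1 split of the image list can be sketched as follows. The helper names `split_dataset` and `write_list` are hypothetical, and the fixed seed is an assumption made so that the split is reproducible.

```python
import random

def split_dataset(image_paths, ratios=(0.7, 0.2, 0.1), seed=0):
    """Shuffle image paths and split them into train/val/test lists."""
    paths = sorted(image_paths)          # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(paths)
    n_train = round(len(paths) * ratios[0])
    n_val = round(len(paths) * ratios[1])
    train = paths[:n_train]
    val = paths[n_train:n_train + n_val]
    test = paths[n_train + n_val:]       # whatever remains goes to the test set
    return train, val, test

def write_list(list_file, paths):
    """Write one absolute image path per line, e.g. to train.txt."""
    with open(list_file, "w", encoding="utf-8") as f:
        f.write("\n".join(paths) + "\n")

# hypothetical image names; in practice these would be absolute paths from the images folder
images = [f"img_{i:03d}.jpg" for i in range(10)]
train, val, test = split_dataset(images)
print(len(train), len(val), len(test))   # 7 2 1
```

Calling write_list once for each of the three lists produces the train.txt, val.txt and test.txt files referenced by the dataset yaml.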
3.2. Configuration file
Copy the data/coco.yaml and models/yolov5s.yaml files to the data set directory and make modifications.
For example, the dataset description file myvoc.yaml
train: E:/DeepLearning/yolov5/custom-data/vehicle/train.txt
val: E:/DeepLearning/yolov5/custom-data/vehicle/val.txt
# number of classes
nc: 4
# class names
names: ["car", "huoche", "guache", "keche"]
Network model configuration file yolov5s.yaml: only modify the parameter nc to the actual number of target detection categories.
# Parameters
nc: 4 # number of classes
depth_multiple: 0.33 # model depth multiple
width_multiple: 0.50 # layer channel multiple
anchors:
- [10,13, 16,30, 33,23] # P3/8
- [30,61, 62,45, 59,119] # P4/16
- [116,90, 156,198, 373,326] # P5/32
3.3. Training
As mentioned earlier, after the preparation work is completed, the directory structure is as follows.
After that, we train for 20 epochs. The command for single-GPU training is as follows:
python train.py --weights yolov5s.pt --cfg custom-data\vehicle\yolov5s.yaml --data custom-data\vehicle\myvoc.yaml --epoch 20 --batch-size=32 --img 640 --device 0
The training output content is
E:\DeepLearning\yolov5>python train.py --weights yolov5s.pt --cfg custom-data\vehicle\yolov5s.yaml --data custom-data\vehicle\myvoc.yaml --epoch 20 --batch-size=32 --img 640 --device 0
train: weights=yolov5s.pt, cfg=custom-data\vehicle\yolov5s.yaml, data=custom-data\vehicle\myvoc.yaml, hyp=data\hyps\hyp.scratch-low.yaml, epochs=20, batch_size=32, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=None, image_weights=False, device=0, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=runs\train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, seed=0, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
fatal: unable to access 'http://github.com/ultralytics/yolov5.git/': Recv failure: Connection was reset
Command 'git fetch origin' timed out after 5 seconds
YOLOv5 v7.0-167-g5deff14 Python-3.9.16 torch-1.13.1+cu117 CUDA:0 (NVIDIA GeForce GTX 1080 Ti, 11264MiB)
hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
Comet: run 'pip install comet_ml' to automatically track and visualize YOLOv5 runs in Comet
TensorBoard: Start with 'tensorboard --logdir runs\train', view at http://localhost:6006/
from n params module arguments
0 -1 1 3520 models.common.Conv [3, 32, 6, 2, 2]
1 -1 1 18560 models.common.Conv [32, 64, 3, 2]
2 -1 1 18816 models.common.C3 [64, 64, 1]
3 -1 1 73984 models.common.Conv [64, 128, 3, 2]
4 -1 2 115712 models.common.C3 [128, 128, 2]
5 -1 1 295424 models.common.Conv [128, 256, 3, 2]
6 -1 3 625152 models.common.C3 [256, 256, 3]
7 -1 1 1180672 models.common.Conv [256, 512, 3, 2]
8 -1 1 1182720 models.common.C3 [512, 512, 1]
9 -1 1 656896 models.common.SPPF [512, 512, 5]
10 -1 1 131584 models.common.Conv [512, 256, 1, 1]
11 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
12 [-1, 6] 1 0 models.common.Concat [1]
13 -1 1 361984 models.common.C3 [512, 256, 1, False]
14 -1 1 33024 models.common.Conv [256, 128, 1, 1]
15 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
16 [-1, 4] 1 0 models.common.Concat [1]
17 -1 1 90880 models.common.C3 [256, 128, 1, False]
18 -1 1 147712 models.common.Conv [128, 128, 3, 2]
19 [-1, 14] 1 0 models.common.Concat [1]
20 -1 1 296448 models.common.C3 [256, 256, 1, False]
21 -1 1 590336 models.common.Conv [256, 256, 3, 2]
22 [-1, 10] 1 0 models.common.Concat [1]
23 -1 1 1182720 models.common.C3 [512, 512, 1, False]
24 [17, 20, 23] 1 24273 models.yolo.Detect [4, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
YOLOv5s summary: 214 layers, 7030417 parameters, 7030417 gradients, 16.0 GFLOPs
Transferred 342/349 items from yolov5s.pt
AMP: checks passed
optimizer: SGD(lr=0.01) with parameter groups 57 weight(decay=0.0), 60 weight(decay=0.0005), 60 bias
train: Scanning E:\DeepLearning\yolov5\custom-data\vehicle\train... 998 images, 0 backgrounds, 0 corrupt: 100%|██████████| 998/998 [00:07<00:00, 141.97it/s]
train: New cache created: E:\DeepLearning\yolov5\custom-data\vehicle\train.cache
val: Scanning E:\DeepLearning\yolov5\custom-data\vehicle\val... 998 images, 0 backgrounds, 0 corrupt: 100%|██████████| 998/998 [00:13<00:00, 72.66it/s]
val: New cache created: E:\DeepLearning\yolov5\custom-data\vehicle\val.cache
AutoAnchor: 4.36 anchors/target, 1.000 Best Possible Recall (BPR). Current anchors are a good fit to dataset
Plotting labels to runs\train\exp13\labels.jpg...
Image sizes 640 train, 640 val
Using 8 dataloader workers
Logging results to runs\train\exp13
Starting training for 20 epochs...
Epoch GPU_mem box_loss obj_loss cls_loss Instances Size
0/19 6.36G 0.09633 0.038 0.03865 34 640: 100%|██████████| 32/32 [00:19<00:00, 1.66it/s]
Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 16/16 [00:11<00:00, 1.45it/s]
all 998 2353 0.884 0.174 0.248 0.0749
Epoch GPU_mem box_loss obj_loss cls_loss Instances Size
1/19 9.9G 0.06125 0.03181 0.02363 26 640: 100%|██████████| 32/32 [00:14<00:00, 2.18it/s]
Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 16/16 [00:10<00:00, 1.50it/s]
all 998 2353 0.462 0.374 0.33 0.105
Epoch GPU_mem box_loss obj_loss cls_loss Instances Size
2/19 9.9G 0.06124 0.02353 0.02014 18 640: 100%|██████████| 32/32 [00:14<00:00, 2.22it/s]
Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 16/16 [00:10<00:00, 1.58it/s]
all 998 2353 0.469 0.472 0.277 0.129
Epoch GPU_mem box_loss obj_loss cls_loss Instances Size
3/19 9.9G 0.05214 0.02038 0.0175 27 640: 100%|██████████| 32/32 [00:14<00:00, 2.22it/s]
Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 16/16 [00:10<00:00, 1.56it/s]
all 998 2353 0.62 0.64 0.605 0.279
Epoch GPU_mem box_loss obj_loss cls_loss Instances Size
4/19 9.9G 0.04481 0.01777 0.01598 23 640: 100%|██████████| 32/32 [00:14<00:00, 2.17it/s]
Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 16/16 [00:10<00:00, 1.60it/s]
all 998 2353 0.803 0.706 0.848 0.403
Epoch GPU_mem box_loss obj_loss cls_loss Instances Size
5/19 9.9G 0.0381 0.01624 0.01335 19 640: 100%|██████████| 32/32 [00:14<00:00, 2.16it/s]
Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 16/16 [00:10<00:00, 1.55it/s]
all 998 2353 0.651 0.872 0.8 0.414
Epoch GPU_mem box_loss obj_loss cls_loss Instances Size
6/19 9.9G 0.03379 0.01534 0.01134 28 640: 100%|██████████| 32/32 [00:14<00:00, 2.18it/s]
Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 16/16 [00:10<00:00, 1.58it/s]
all 998 2353 0.94 0.932 0.978 0.608
Epoch GPU_mem box_loss obj_loss cls_loss Instances Size
7/19 9.9G 0.03228 0.01523 0.00837 10 640: 100%|██████████| 32/32 [00:14<00:00, 2.21it/s]
Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 16/16 [00:09<00:00, 1.67it/s]
all 998 2353 0.862 0.932 0.956 0.591
Epoch GPU_mem box_loss obj_loss cls_loss Instances Size
8/19 9.9G 0.0292 0.01458 0.007451 20 640: 100%|██████████| 32/32 [00:14<00:00, 2.21it/s]
Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 16/16 [00:10<00:00, 1.56it/s]
all 998 2353 0.97 0.954 0.986 0.658
Epoch GPU_mem box_loss obj_loss cls_loss Instances Size
9/19 9.9G 0.02739 0.01407 0.006553 29 640: 100%|██████████| 32/32 [00:15<00:00, 2.12it/s]
Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 16/16 [00:10<00:00, 1.58it/s]
all 998 2353 0.982 0.975 0.993 0.74
Epoch GPU_mem box_loss obj_loss cls_loss Instances Size
10/19 9.9G 0.0248 0.01362 0.005524 30 640: 100%|██████████| 32/32 [00:14<00:00, 2.14it/s]
Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 16/16 [00:10<00:00, 1.55it/s]
all 998 2353 0.985 0.973 0.993 0.757
Epoch GPU_mem box_loss obj_loss cls_loss Instances Size
11/19 9.9G 0.02377 0.01271 0.005606 27 640: 100%|██████████| 32/32 [00:15<00:00, 2.13it/s]
Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 16/16 [00:10<00:00, 1.52it/s]
all 998 2353 0.964 0.975 0.989 0.725
Epoch GPU_mem box_loss obj_loss cls_loss Instances Size
12/19 9.9G 0.02201 0.01247 0.005372 33 640: 100%|██████████| 32/32 [00:14<00:00, 2.19it/s]
Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 16/16 [00:10<00:00, 1.57it/s]
all 998 2353 0.988 0.988 0.994 0.83
Epoch GPU_mem box_loss obj_loss cls_loss Instances Size
13/19 9.9G 0.02103 0.01193 0.004843 22 640: 100%|██████████| 32/32 [00:14<00:00, 2.14it/s]
Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 16/16 [00:10<00:00, 1.57it/s]
all 998 2353 0.981 0.987 0.994 0.817
Epoch GPU_mem box_loss obj_loss cls_loss Instances Size
14/19 9.9G 0.02017 0.01167 0.00431 22 640: 100%|██████████| 32/32 [00:14<00:00, 2.20it/s]
Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 16/16 [00:09<00:00, 1.60it/s]
all 998 2353 0.96 0.952 0.987 0.782
Epoch GPU_mem box_loss obj_loss cls_loss Instances Size
15/19 9.9G 0.01847 0.01158 0.004043 32 640: 100%|██████████| 32/32 [00:14<00:00, 2.20it/s]
Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 16/16 [00:10<00:00, 1.56it/s]
all 998 2353 0.988 0.992 0.994 0.819
Epoch GPU_mem box_loss obj_loss cls_loss Instances Size
16/19 9.9G 0.01771 0.0114 0.003859 24 640: 100%|██████████| 32/32 [00:14<00:00, 2.20it/s]
Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 16/16 [00:10<00:00, 1.55it/s]
all 998 2353 0.967 0.96 0.99 0.832
Epoch GPU_mem box_loss obj_loss cls_loss Instances Size
17/19 9.9G 0.01665 0.01077 0.003739 32 640: 100%|██████████| 32/32 [00:14<00:00, 2.22it/s]
Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 16/16 [00:10<00:00, 1.59it/s]
all 998 2353 0.992 0.995 0.994 0.87
Epoch GPU_mem box_loss obj_loss cls_loss Instances Size
18/19 9.9G 0.01559 0.01067 0.003549 45 640: 100%|██████████| 32/32 [00:14<00:00, 2.21it/s]
Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 16/16 [00:10<00:00, 1.53it/s]
all 998 2353 0.991 0.995 0.995 0.867
Epoch GPU_mem box_loss obj_loss cls_loss Instances Size
19/19 9.9G 0.01459 0.01009 0.003031 31 640: 100%|██████████| 32/32 [00:14<00:00, 2.18it/s]
Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 16/16 [00:11<00:00, 1.42it/s]
all 998 2353 0.994 0.995 0.994 0.885
20 epochs completed in 0.143 hours.
Optimizer stripped from runs\train\exp13\weights\last.pt, 14.4MB
Optimizer stripped from runs\train\exp13\weights\best.pt, 14.4MB
Validating runs\train\exp13\weights\best.pt...
Fusing layers...
YOLOv5s summary: 157 layers, 7020913 parameters, 0 gradients, 15.8 GFLOPs
Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 16/16 [00:11<00:00, 1.37it/s]
all 998 2353 0.994 0.995 0.994 0.885
car 998 1309 0.995 0.999 0.995 0.902
huoche 998 507 0.993 0.988 0.994 0.895
guache 998 340 0.988 0.993 0.994 0.877
keche 998 197 0.999 1 0.995 0.866
Results saved to runs\train\exp13
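The AutoAnchor line in the log (4.36 anchors/target, 1.000 BPR) can be roughly understood with the sketch below. It assumes the check works the way it appears to in the YOLOv5 code: for each labelled box, the width and height ratios against each of the 9 anchors are computed, and an anchor counts as a match when the worst ratio stays within a threshold (assumed here to be the default anchor_t of 4.0). The anchors are the ones printed in the Detect layer above; the function name and the sample boxes are made up for illustration, and the real check also rescales the labels to the training image size.

```python
# The 9 anchors (width, height) printed for models.yolo.Detect in the log above.
ANCHORS = [(10, 13), (16, 30), (33, 23),
           (30, 61), (62, 45), (59, 119),
           (116, 90), (156, 198), (373, 326)]


def anchor_fit(boxes_wh, anchors=ANCHORS, thr=4.0):
    """Return (anchors per target, best possible recall) for a list of (w, h) boxes.

    Hypothetical helper: a simplified, pure-Python version of the ratio test,
    with thr assumed to be the default anchor_t hyperparameter (4.0).
    """
    total_matches = 0
    recalled = 0
    for w, h in boxes_wh:
        scores = []
        for aw, ah in anchors:
            rw, rh = w / aw, h / ah
            # Worst-case agreement between box and anchor, in (0, 1].
            scores.append(min(rw, 1 / rw, rh, 1 / rh))
        total_matches += sum(s > 1 / thr for s in scores)
        recalled += max(scores) > 1 / thr
    n = len(boxes_wh)
    return total_matches / n, recalled / n


if __name__ == "__main__":
    # Illustrative boxes in pixels, not taken from the vehicle dataset.
    aat, bpr = anchor_fit([(20, 20), (2, 2)])
    print(f"{aat:.2f} anchors/target, {bpr:.3f} BPR")
```

A BPR of 1.000, as in the log, means every labelled box has at least one anchor within the threshold, so the default anchors are kept and not recomputed.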
During training, you can use TensorBoard to view the training curves. Start it from the yolov5 directory with tensorboard --logdir runs\train
and then open http://localhost:6006/ in a browser to view the curves.
Training is fast: with 998 images, 20 epochs take about 8.6 minutes (0.143 hours in the log above). The trained models are saved in the runs\train\exp13 directory, as weights\last.pt and weights\best.pt.
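Why best.pt ends up being the last epoch here can be checked with a small calculation. The sketch below assumes that YOLOv5 ranks epochs by a weighted fitness of 0.1 × mAP50 + 0.9 × mAP50-95 (the default weights as we understand them from utils/metrics.py; verify against your version). The metric values are copied from the "all" rows of the last five epochs in the log; the function names are only illustrative.

```python
# (P, R, mAP50, mAP50-95) from the "all" row of epochs 15-19 in the training log.
RESULTS = {
    15: (0.988, 0.992, 0.994, 0.819),
    16: (0.967, 0.960, 0.990, 0.832),
    17: (0.992, 0.995, 0.994, 0.870),
    18: (0.991, 0.995, 0.995, 0.867),
    19: (0.994, 0.995, 0.994, 0.885),
}

# Assumed default weights for (P, R, mAP50, mAP50-95).
WEIGHTS = (0.0, 0.0, 0.1, 0.9)


def fitness(metrics, weights=WEIGHTS):
    """Weighted sum of the four validation metrics."""
    return sum(m * w for m, w in zip(metrics, weights))


def best_epoch(results):
    """Epoch index with the highest fitness."""
    return max(results, key=lambda epoch: fitness(results[epoch]))


if __name__ == "__main__":
    for epoch, metrics in RESULTS.items():
        print(epoch, round(fitness(metrics), 4))
    print("best epoch:", best_epoch(RESULTS))
```

Because mAP50-95 carries most of the weight, epoch 18 ranks below epoch 19 even though its mAP50 is slightly higher.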
Other relevant screenshots
To test the trained model, run the detection script:
python detect.py --weights runs\train\exp13\weights\best.pt --source custom-data\vehicle\images\11.jpg
Detection result image
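If detect.py is run with the --save-txt option, it also writes the detections as text files in the same normalized format used for the training labels (class x_center y_center width height, all relative to the image size). The following is a minimal sketch for converting such a line back to pixel coordinates. The class order is an assumption taken from the per-class rows of the validation output, and the sample line and image size are invented for illustration.

```python
# Assumed class order (index 0..3), following the per-class rows in the log.
CLASS_NAMES = ["car", "huoche", "guache", "keche"]


def yolo_line_to_box(line, img_w, img_h, names=CLASS_NAMES):
    """Convert 'cls xc yc w h' (normalized) to (name, (x1, y1, x2, y2)) in pixels."""
    parts = line.split()
    cls = int(parts[0])
    xc, yc, w, h = (float(v) for v in parts[1:5])
    x1 = (xc - w / 2) * img_w
    y1 = (yc - h / 2) * img_h
    x2 = (xc + w / 2) * img_w
    y2 = (yc + h / 2) * img_h
    return names[cls], (round(x1), round(y1), round(x2), round(y2))


if __name__ == "__main__":
    # Invented example: a box centred in a 640x480 image.
    print(yolo_line_to_box("0 0.5 0.5 0.25 0.5", 640, 480))
```

The pixel box can then be drawn with any image library to double-check the saved result image.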