Implementation of the yolov5 target detection network based on libtorch (3) - Kmeans clustering to obtain anchor box sizes

This is the third article in our series on the yolov5 target detection framework. The first two articles covered parsing the json label files of the COCO dataset, and the structure and implementation of the yolov5 network:

Implementation of the yolov5 target detection network based on libtorch - COCO dataset json label file parsing

Implementation of the yolov5 target detection network based on libtorch (2) - network structure implementation

In this article, we mainly discuss the concept of the anchor box in target detection, and the principle and implementation of using the kmeans clustering algorithm to obtain anchor box sizes.


01

Correction: the CSP1_n and CSP2_n structures given in the previous article were wrong

In the previous article, we said that the CSP1_n and CSP2_n structures both contain several ResUnit_n modules, and that each ResUnit_n module contains several CBL modules, as shown in the following figures:

ResUnit_n structure

Incorrect structure of CSP1_n and CSP2_n

The CSP1_n and CSP2_n structures in the figure above are actually wrong: each CSP1_n or CSP2_n contains only one ResUnit_n module, not several. The input value n determines the number of CBL modules inside that ResUnit_n module, so an input value of n gives CSP1_n and CSP2_n a total of 2n CBL modules, as shown below:

Correct structure of CSP1_n and CSP2_n


02

The principle of predicted box regression

When we first see the term "predicted box regression", we are often confused. In fact, it is not mysterious at all: it is simply the operation that converts the raw predictions of the network into actual target box information. The question, then, is why such a conversion is needed at all.

If the network were made to predict the position coordinates and the width and height of the target box directly, their value ranges would be far too wide. This would increase training time and learning difficulty, make the network unstable, and could even send it drifting off in the wrong direction during learning.

To solve this problem, the yolov5 network adds prior information about the target box to the prediction. Put simply, before training we tell the network the expected ranges of the box parameters, such as the range of the box center coordinates and the range of the width and height. The network then learns on top of these ranges to obtain more accurate box information. This constrains the learning direction of the network, which greatly improves its stability and convergence speed. As shown below:

The specific method is as follows:

  • Center coordinates of the box

As we said before, the yolov5 network divides the 640*640 image into N*N grid cells (usually 80*80, 40*40 or 20*20), and the network outputs prediction information for every cell. The prediction for each cell includes the classification probabilities and confidence of the target, plus the x center coordinate, y center coordinate, width and height of the target box (xpred, ypred, wpred, hpred). Each cell only predicts target boxes whose position lies near that cell, so the approximate ranges of the x and y center coordinates are known in advance. Using this property, the predicted center coordinates are constrained to the neighborhood of the cell as follows:

xc = 2*xpred - 0.5 + xgrid
yc = 2*ypred - 0.5 + ygrid

In the above formula, xgrid and ygrid are the grid coordinates of the cell (0 ≤ xgrid < N, 0 ≤ ygrid < N); they are known in advance, that is, they are the prior information. xpred and ypred are the center coordinates predicted by the network for that cell (0 < xpred < 1, 0 < ypred < 1), and xc and yc are the regressed center coordinates of the target box, that is, the center coordinates after conversion. From the formula, the value ranges of xc and yc are:

xgrid - 0.5 < xc < xgrid + 1.5
ygrid - 0.5 < yc < ygrid + 1.5

so the offset relative to the grid coordinates is -0.5~1.5, which matches the fact that the target box position lies near the cell.
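For example, for the cell at (xgrid, ygrid) = (3, 4) with predictions xpred = 0.8 and ypred = 0.3, the regressed center is xc = 2*0.8 - 0.5 + 3 = 4.1 and yc = 2*0.3 - 0.5 + 4 = 4.1, that is, the box center (4.1, 4.1) in grid units lies just beyond the right edge of the cell, but well within the allowed -0.5~1.5 offset.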

  • Box width and height

The predicted width and height of the target box likewise incorporate prior information:

w = wanchor * (2*wpred)^2
h = hanchor * (2*hpred)^2

In the above formula, wpred and hpred are the width and height predictions output by the network (0 < wpred < 1, 0 < hpred < 1), and wanchor and hanchor are the prior width and height of the target box, which tell the network in advance the ranges in which the width and height can fall:

0 < w < 4*wanchor
0 < h < 4*hanchor
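To make the regression concrete, here is a minimal C++ sketch of the decoding step described above. It assumes the network outputs have already passed through a sigmoid so they lie in 0~1; the names (Box, decode_box, stride) are illustrative, not taken from the yolov5 code:

struct Box { float xc, yc, w, h; };

// Convert the (sigmoid-activated) predictions of one grid cell into an
// actual box, following the regression formulas above. stride = 640 / N,
// e.g. 8 for the 80*80 grid, and maps grid units to image pixels.
Box decode_box(float xpred, float ypred, float wpred, float hpred,
               int xgrid, int ygrid, float wanchor, float hanchor,
               float stride)
{
  Box b;
  b.xc = (2.0f * xpred - 0.5f + xgrid) * stride;    // center x in the 640*640 image
  b.yc = (2.0f * ypred - 0.5f + ygrid) * stride;    // center y in the 640*640 image
  b.w = wanchor * (2.0f * wpred) * (2.0f * wpred);  // 0 < w < 4*wanchor
  b.h = hanchor * (2.0f * hpred) * (2.0f * hpred);  // 0 < h < 4*hanchor
  return b;
}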

So how are wanchor and hanchor obtained in advance? This requires a statistical analysis of the widths and heights of all the target boxes contained in the training set: we count the widths and heights of all target boxes and extract the several most representative width/height combinations. The boxes corresponding to these combinations are what we call anchor boxes. The yolov5 network uses the kmeans algorithm to cluster the widths and heights of all target boxes in the training set, obtains the 9 most representative width/height combinations, that is, 9 anchor boxes, and then divides these 9 anchors into 3 groups:

  • The 3 anchor boxes with the smallest width and height are assigned to each cell of the 80*80 grid

  • The 3 anchor boxes with medium width and height are assigned to each cell of the 40*40 grid

  • The 3 anchor boxes with the largest width and height are assigned to each cell of the 20*20 grid

This corresponds to the per-layer output information we mentioned above: the number 3 in the 3*80*80, 3*40*40 and 3*20*20 below corresponds to the three anchor boxes of each layer (the default yolov5 anchors are listed after this list for reference):

  • The 80*80 grid outputs 3*80*80 predictions for small targets

  • The 40*40 grid outputs 3*40*40 predictions for medium targets

  • The 20*20 grid outputs 3*20*20 predictions for large targets
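For reference, the default anchors shipped with the official yolov5 for the COCO dataset, obtained by exactly this kind of clustering, follow the same grouping (your own clustering results will differ with the training set):

  • (10,13), (16,30), (33,23) for the 80*80 grid

  • (30,61), (62,45), (59,119) for the 40*40 grid

  • (116,90), (156,198), (373,326) for the 20*20 grid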


03

Kmeans algorithm principle

Similar to the DBSCAN clustering algorithm and AGNES clustering algorithm we mentioned earlier, kmeans is also a commonly used unsupervised clustering algorithm.

Understanding and Application of DBSCAN Clustering Algorithm

Understanding and Application of AGNES Clustering Algorithm

Suppose the data set contains m samples X0, X1, X2, ..., Xm-1, where each sample Xi is a one-dimensional vector of length n:

Xi = (xi0, xi1, xi2, ..., xi(n-1))

Then the basic steps of the Kmeans algorithm are as follows:

  1. Randomly select k samples (C0, C1, C2, ..., Ck-1) from the m samples as the initial center points, where k is the preset number of categories. For example, yolov5 needs to divide all target boxes in the training set into 9 categories, so k = 9.

  2. Calculate the distance between each sample and each of the k center points, and assign the sample to the center point nearest to it. The distance is usually measured by the Euclidean distance. For example, for any sample Xi, compute its Euclidean distance to C0, C1, C2, ..., Ck-1:

d(Xi, Cj) = sqrt( (xi0-cj0)^2 + (xi1-cj1)^2 + ... + (xi(n-1)-cj(n-1))^2 )

then compare d(Xi,C0), d(Xi,C1), d(Xi,C2), ..., d(Xi,Ck-1) to find the center point with the smallest distance to Xi, and assign Xi to that center point.

  3. After step 2, every sample has been assigned to its nearest center point, and all samples assigned to the same center point form one category, or cluster, so all samples are divided into k clusters. Then calculate the centroid of each of the k clusters, and replace the original center points with these centroids as the new k center points. For example, if a cluster contains the three samples Xt0, Xt1 and Xt2, its centroid Ci is calculated as:

Ci = (Xt0 + Xt1 + Xt2) / 3

  4. Check whether the new center points obtained in step 3 have moved less than a set threshold relative to the original center points; if so, stop the calculation. Also check whether the number of repetitions (iterations) of steps 2 and 3 exceeds a set limit, and stop if it does. Otherwise, return to step 2 and repeat.
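The following is a minimal, self-contained C++ sketch of the four steps above for one-dimensional samples with the Euclidean distance; it is only meant to illustrate the algorithm, not to replace the Opencv interface described in the next section:

#include <algorithm>
#include <cmath>
#include <cstdlib>
#include <vector>

// Minimal 1-D kmeans illustrating steps 1~4 above.
std::vector<float> kmeans_1d(const std::vector<float>& samples, int k,
                             int max_iters = 300, float eps = 0.01f)
{
  // Step 1: pick k random samples as the initial center points
  // (duplicates are possible; a real implementation would avoid them)
  std::vector<float> centers(k);
  for (int j = 0; j < k; j++)
    centers[j] = samples[std::rand() % samples.size()];

  std::vector<int> labels(samples.size(), 0);
  for (int iter = 0; iter < max_iters; iter++)
  {
    // Step 2: assign each sample to its nearest center point
    for (size_t i = 0; i < samples.size(); i++)
    {
      float best = std::fabs(samples[i] - centers[0]);
      labels[i] = 0;
      for (int j = 1; j < k; j++)
      {
        float d = std::fabs(samples[i] - centers[j]);
        if (d < best) { best = d; labels[i] = j; }
      }
    }

    // Step 3: recompute each center point as the centroid of its cluster
    std::vector<float> sums(k, 0.0f);
    std::vector<int> counts(k, 0);
    for (size_t i = 0; i < samples.size(); i++)
    {
      sums[labels[i]] += samples[i];
      counts[labels[i]]++;
    }

    // Step 4: stop when no center point moves more than eps
    float max_shift = 0.0f;
    for (int j = 0; j < k; j++)
    {
      if (counts[j] == 0) continue;  // leave empty clusters unchanged
      float c = sums[j] / counts[j];
      max_shift = std::max(max_shift, std::fabs(c - centers[j]));
      centers[j] = c;
    }
    if (max_shift < eps) break;
  }
  return centers;
}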


04

Opencv3.4.1's Kmeans algorithm interface description

Opencv3.4.1 integrates the Kmeans algorithm:

double kmeans(
                 InputArray data, 
                 int K, 
                 InputOutputArray bestLabels,
                 TermCriteria criteria, 
                 int attempts,
                 int flags, 
                 OutputArray centers = noArray() 
              );

Parameter Description:

data: cv::Mat type; the input data samples. Each row is one sample, so the number of rows equals the total number of samples and the number of columns equals the data length of each sample. Only float data is supported.
K: the number of clusters.
bestLabels: cv::Mat type; the output sample labels. The number of rows equals the total number of samples, the number of columns is 1, and the value in each row indicates which cluster the corresponding sample was assigned to.
criteria: TermCriteria type; the termination condition of the algorithm. It can stop after a specified number of iterations, upon reaching a specified accuracy, or whichever of the two comes first.
attempts: the number of times the Kmeans algorithm is run. Different runs may give different clustering results, and the best result is output.
flags: the method for choosing the initial center points. KMEANS_RANDOM_CENTERS, KMEANS_PP_CENTERS and KMEANS_USE_INITIAL_LABELS are supported; the random selection KMEANS_RANDOM_CENTERS is the most commonly used.
centers: cv::Mat type; the output final center points. The number of rows is K and the number of columns is the data length of each sample.

Test code:

void kmeans_test(void)
{
  //10 samples to be clustered, each with a data length of 1
  float buffer[10] = {0.1, 5.2, 10.35, 0.08, 4.9, 5.234, 11.0, 0.12, 9.89, 0.05};

  Mat input(10, 1, CV_32FC1, buffer);

  const int K = 3;  //divide into 3 clusters
  const int attempts = 300;  //run the kmeans algorithm 300 times
  //iteration termination criteria
  const cv::TermCriteria term_criteria = cv::TermCriteria(cv::TermCriteria::EPS + cv::TermCriteria::COUNT, 300, 0.01);
  cv::Mat labels_, centers_;
  cv::kmeans(input, K, labels_, term_criteria, attempts, cv::KMEANS_RANDOM_CENTERS, centers_);

  cout << input << endl;
  cout << endl << labels_ << endl;
  cout << endl << centers_ << endl;
}

Running result:
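The label indices depend on the random initialization, so they vary from run to run, but with this input the three clusters found are {0.1, 0.08, 0.12, 0.05}, {5.2, 4.9, 5.234} and {10.35, 11.0, 9.89}, whose centers (the cluster means) are roughly 0.0875, 5.1113 and 10.4133 respectively.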


05

Kmeans clustering to obtain the anchor box

yolov5 uses the Kmeans algorithm to cluster the widths and heights of all target boxes in the training set, obtaining 9 width/height combinations, that is, 9 anchor boxes. However, the sample distance measure used here is not the Euclidean distance but the iou distance, because the authors of the yolo series found that when the widths and heights of different samples differ greatly, the Euclidean distance leads to large errors in the clustering results, so the iou distance is used instead. Let us first introduce the concept of iou.

Iou is a measure of the similarity of two boxes (in position, width and height). It is the ratio of the area of the intersection of the two boxes to the area of their union, which is why it is also called the intersection-over-union ratio.

As shown in the figure above, the widths and heights of the two boxes are (w1, h1) and (w2, h2) respectively; the red area is their intersection, with width and height (w, h), and the "blue + red + grey" area is their union. The area of the intersection is:

Sinter = w * h

The area of the union is:

Sunion = w1*h1 + w2*h2 - w*h

which gives the iou:

iou = Sinter / Sunion = w*h / (w1*h1 + w2*h2 - w*h)

The value range of iou is [0,1]: iou is 0 when the two boxes have no intersecting area and 1 when they completely overlap, so a larger iou means the two boxes are more similar in shape and position. To obtain a measure that is negatively correlated with similarity, that is, one whose value is smaller when the similarity is greater, we negate the iou and add 1:

d = 1 - iou

The formula above is the distance measure used when clustering the widths and heights of the target boxes with the Kmeans algorithm. Because we only care about the widths and heights of the boxes, not their positions, the center points of the two boxes are assumed to coincide, as shown in the following figure:
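Under the coincident-centers assumption, the intersection width and height reduce to min(w1, w2) and min(h1, h2), so the iou distance can be computed from the widths and heights alone. A minimal sketch (the function name iou_distance is ours, not taken from the yolov5 code):

#include <algorithm>

//iou distance between two boxes given only their widths and heights,
//assuming their center points coincide
float iou_distance(float w1, float h1, float w2, float h2)
{
  float inter = std::min(w1, w2) * std::min(h1, h2);  //intersection area
  float uni = w1 * h1 + w2 * h2 - inter;              //union area
  return 1.0f - inter / uni;                          //d = 1 - iou
}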

However, using Opencv's Kmeans interface raises a problem: it measures sample distance with the Euclidean distance by default and cannot be switched to the iou measure, so the large clustering errors described above, caused by large differences in sample widths and heights, can appear. To use the Opencv interface while avoiding this problem, we can first normalize the width and height of each target box so that all values fall in the range 0~1, which removes the large differences in magnitude:

w' = w / col
h' = h / row

where row and col are the height and width of the corresponding image, respectively.

The final clustering results are then also values between 0 and 1, so we multiply them by 640 to convert them to widths and heights in the 640*640 coordinate system. The code is as follows (the parsing of the json label file was covered in the first article of this series):

void kmeans_anchor(void)
{
  json j;
  //parse the json label file to get the width and height of every target box in the training set
  ifstream jfile("D:/数据/coco/annotations_trainval2017/annotations/instances_train2017.json");
  jfile >> j;
  ns::coco_label cr;
  ns::from_json(j, cr);
  //each sample holds a width and a height, so the data length of each sample is 2
  Mat input = Mat::zeros(cr.annotations_list.size(), 2, CV_32FC1);
  for (int i = 0; i < cr.annotations_list.size(); i++)
  {
    cout << "i: " << i << endl;
    //read the image the target box belongs to, to get the image width and height
    Mat image = read_img_grom_id("F:/train2017/%012d.jpg", cr.annotations_list[i].image_id);
    //normalize the width and height of the target box to 0~1
    float w = cr.annotations_list[i].bbox[2] / image.cols;
    float h = cr.annotations_list[i].bbox[3] / image.rows;
    //write the normalized width and height into the Opencv Mat
    input.ptr<float>(i)[0] = w;
    input.ptr<float>(i)[1] = h;
  }

  const int K = 9;  //number of clusters
  const int attempts = 300;  //number of runs of the kmeans algorithm
  //iteration termination criteria
  const cv::TermCriteria term_criteria = cv::TermCriteria(cv::TermCriteria::EPS + cv::TermCriteria::COUNT, 300, 0.01);
  cv::Mat labels_, centers_;
  cv::kmeans(input, K, labels_, term_criteria, attempts, cv::KMEANS_RANDOM_CENTERS, centers_);
  //multiply the 0~1 clustering results by 640 to get the box widths and heights in the 640*640 image
  cout << centers_ * 640 << endl;
}
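One detail worth noting: the rows of centers_ come back in no particular order, while the grouping described in section 02 needs the anchors sorted from smallest to largest. A small sketch of that post-processing, appended at the end of kmeans_anchor (our addition, requiring <vector>, <utility> and <algorithm>):

  //collect the 9 cluster centers, scale them to the 640*640 coordinate
  //system, and sort them by area so that the first 3 go to the 80*80 grid,
  //the middle 3 to the 40*40 grid and the last 3 to the 20*20 grid
  std::vector<std::pair<float, float>> anchors;
  for (int i = 0; i < centers_.rows; i++)
    anchors.push_back({ centers_.ptr<float>(i)[0] * 640, centers_.ptr<float>(i)[1] * 640 });
  std::sort(anchors.begin(), anchors.end(),
            [](const std::pair<float, float>& a, const std::pair<float, float>& b)
            { return a.first * a.second < b.first * b.second; });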

The running results are as follows; from top to bottom are the widths and heights of the 9 anchor boxes:



Origin blog.csdn.net/shandianfengfan/article/details/120245928