YOLOv5 autoanchor: this one article is enough

Short and blunt, with no padding. The goal is to answer the following three questions.

1. anchor_t defaults to 4. How should this parameter be adjusted, and does it need adjusting at all? (First, a correction: many posts online say this parameter is an aspect ratio. That is wrong; it is only a threshold that controls how loosely anchors are allowed to match labels.)

2. How does the code perform automatic anchor clustering, and why does it look a bit like magic?

3. After clustering, is the result necessarily better than anchors worked out by hand from the receptive field?

Portal: yolov5/autoanchor.py

https://github.com/ultralytics/yolov5/blob/master/utils/autoanchor.py

A quick look at the structure:

        Start with a rough overview of the function names to understand the overall flow. coco128 is used as the example dataset: a cut-down version of COCO with only 128 images and 929 labeled boxes. Open utils/autoanchor.py in the yolov5 project; the file contains these functions:

def check_anchor_order(m)

def check_anchors(dataset, model, thr=4.0, imgsz=640)
    def metric(k)

def kmean_anchors(dataset='./data/coco128.yaml', n=9, img_size=640, thr=4.0, gen=1000, verbose=True)
    def metric(k, wh)
    def anchor_fitness(k)
    def print_results(k, verbose=True)

        The function names are fairly self-explanatory. kmeans is a very common clustering method, so the anchors are clustered with kmeans. With that settled, look at the nested functions. metric is an evaluation measure used to judge whether clustering is needed and how good the clustered anchors are. anchor_fitness scores a candidate set of anchors and is used as the fitness during the evolution step that follows kmeans. print_results only prints the results, as the name says, and can be ignored.


A closer look at the details:

1) check_anchor_order (not important)

def check_anchor_order(m):
    # Check anchor order against stride order for YOLOv5 Detect() module m, and correct if necessary
1:  a = m.anchors.prod(-1).mean(-1).view(-1)  # mean anchor area per output layer
2:  da = a[-1] - a[0]  # delta a
3:  ds = m.stride[-1] - m.stride[0]  # delta s
4:  if da and (da.sign() != ds.sign()):  # same order
5:      LOGGER.info(f'{PREFIX}Reversing anchor order')
6:      m.anchors[:] = m.anchors.flip(0)

        A note on notation: in the code snippets, a prefix such as 1: is a marker that links the explanation to the corresponding line of code. In the explanations, a mark such as [400, 300] means the variable or tensor has shape [400, 300].

        1: m.anchors holds the anchors read from the configuration file, shape [3x3x2]: 3 feature layers x 3 anchors x [w, h]. The first line computes the area of each anchor and then the mean per layer, so a is a vector of 3 values, the mean anchor area of each of the three feature layers. Take models/yolov5s.yaml as an example.

        The default anchor configuration of the yolov5s model

a = [(10*13+16*30+33*23)/3, (30*61+62*45+59*119)/3, (116*90+156*198+373*326)/3] = tensor([ 456.33334, 3880.33325, 54308.66797])

2: da is the difference between the mean anchor area of the last layer and that of the first layer. These two layers normally differ the most, since the last feature layer has the largest receptive field.

3: ds is the difference between the last and the first downsampling stride. stride=[8, 16, 32], so the difference is 32-8=24.

4: Check whether da and ds have the same sign, that is, whether the last layer's anchors are larger than the first layer's. If not, the order is wrong and the anchors are flipped. Anchors must follow the feature layers with stride 8, 16, 32, from small to large.

        All the function does is check that stride and anchor size run in the same direction, which is a neat trick.
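The arithmetic above can be checked without PyTorch. The following is only a sketch in plain Python, not the library code: the helper names sign, mean_areas and needs_flip are made up for illustration, and the anchors are the yolov5s defaults written in pixels.

```python
# Default yolov5s anchors in pixels, one row per detection layer (stride 8, 16, 32).
anchors = [
    [(10, 13), (16, 30), (33, 23)],
    [(30, 61), (62, 45), (59, 119)],
    [(116, 90), (156, 198), (373, 326)],
]
strides = [8, 16, 32]


def sign(x):
    """Return -1, 0 or 1 depending on the sign of x (hypothetical helper)."""
    return (x > 0) - (x < 0)


def mean_areas(layers):
    """Mean anchor area (w * h) of each layer, like step 1."""
    return [sum(w * h for w, h in layer) / len(layer) for layer in layers]


def needs_flip(layers, strides):
    """True when anchor areas and strides run in opposite directions (steps 2-4)."""
    a = mean_areas(layers)
    da = a[-1] - a[0]              # delta of mean areas
    ds = strides[-1] - strides[0]  # delta of strides
    return da != 0 and sign(da) != sign(ds)


areas = mean_areas(anchors)
print([round(v, 2) for v in areas])
print(needs_flip(anchors, strides))        # anchors listed small to large
print(needs_flip(anchors[::-1], strides))  # anchors listed large to small
```

With the layers listed from large to small, the signs of da and ds disagree, which is exactly the case in which the real function reverses m.anchors.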

2) check_anchors (key)

        The code is long, but this is where the essence lies, so read it carefully. Line 9 is the essence of the entire matching and clustering logic.

1: m = model.module.model[-1] if hasattr(model, 'module') else model.model[-1]  # Detect()
2: shapes = imgsz * dataset.shapes / dataset.shapes.max(1, keepdims=True)
3: scale = np.random.uniform(0.9, 1.1, size=(shapes.shape[0], 1))  # augment scale
4: wh = torch.tensor(np.concatenate([l[:, 3:5] * s for s, l in zip(shapes * scale, dataset.labels)])).float()  

1: Get the detection head, which carries the related attributes such as anchors, stride, na (number of anchors), nc (number of classes) and nl (number of layers).

2: dataset.shapes records the size of every image in the training set; for coco128 it is a [128 x 2] numpy array. This line takes the longer side of each image (max over axis 1), divides the image's width and height by it so that the values fall between 0 and 1, and then multiplies by 640, so that every image is scaled proportionally to a longest side of 640.

3: Draw a random scale factor for each image from a uniform distribution between 0.9 and 1.1.

4: Each entry of dataset.labels has the format [category, x, y, w, h], with xywh already normalized, so l[:, 3:5] is the normalized width and height. Multiplying by the scaled image size gives the label width and height in pixels. The result wh is a tensor of shape [929, 2], since the coco128 training set has 929 labeled boxes in total: the widths and heights of all labeled boxes in the training set, each slightly perturbed.
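Steps 2 to 4 can be traced by hand on a tiny made-up dataset. This is a sketch in plain Python under stated assumptions: the two image sizes and three labels are invented, and rescale and label_wh are hypothetical helper names, not functions from autoanchor.py.

```python
import random

imgsz = 640
# Hypothetical dataset: (width, height) per image, and per-image labels
# in [class, x, y, w, h] format with xywh normalized to 0-1.
shapes = [(500, 375), (480, 640)]
labels = [
    [[0, 0.5, 0.5, 0.50, 0.25]],
    [[1, 0.3, 0.4, 0.10, 0.20], [2, 0.7, 0.6, 0.25, 0.50]],
]


def rescale(shape, imgsz):
    """Scale (w, h) proportionally so that the longer side equals imgsz (step 2)."""
    longest = max(shape)
    return tuple(imgsz * side / longest for side in shape)


def label_wh(shapes, labels, imgsz, jitter=0.0, rng=random):
    """Collect label (w, h) in pixels at the rescaled image size (steps 3-4)."""
    out = []
    for shape, rows in zip(shapes, labels):
        w_img, h_img = rescale(shape, imgsz)
        scale = rng.uniform(1 - jitter, 1 + jitter)  # one random scale per image
        for _, _, _, w, h in rows:
            out.append((w * w_img * scale, h * h_img * scale))
    return out


wh = label_wh(shapes, labels, imgsz)           # no perturbation
wh_aug = label_wh(shapes, labels, imgsz, 0.1)  # 0.9-1.1 perturbation, as in step 3
print(rescale(shapes[0], imgsz))
print(wh)
```

The 500x375 image becomes 640x480, so its label with normalized size (0.50, 0.25) ends up as 320x120 pixels.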

5: stride = m.stride.to(m.anchors.device).view(-1, 1, 1)  # model strides
6: anchors = m.anchors.clone() * stride  # current anchors
7: bpr, aat = metric(anchors.cpu().view(-1, 2))

5~6: stride=[8, 16, 32], reshaped to a [3, 1, 1] tensor. m.anchors is stored in units of the stride of its layer (the pixel values divided by the stride), so multiplying by stride restores the anchors of all three feature layers to pixel units at the [640, 640] image scale.

7: The input to metric is anchors [9x2]; wh [929x2] from step 4 is picked up from the enclosing scope.
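A small sketch of steps 5 to 7 in plain Python, assuming the yolov5s default anchors; the variable names stored, restored and flat are chosen here for illustration only.

```python
strides = [8, 16, 32]
# Default yolov5s anchors in pixels, as written in models/yolov5s.yaml.
anchors_px = [
    [(10, 13), (16, 30), (33, 23)],
    [(30, 61), (62, 45), (59, 119)],
    [(116, 90), (156, 198), (373, 326)],
]

# Stored form: each layer's anchors divided by that layer's stride.
stored = [[(w / s, h / s) for w, h in layer] for layer, s in zip(anchors_px, strides)]

# Step 6: multiply by the stride to get back to pixels at the 640 scale.
restored = [[(w * s, h * s) for w, h in layer] for layer, s in zip(stored, strides)]

# Step 7 flattens the three layers into one [9, 2] list before calling metric.
flat = [wh for layer in restored for wh in layer]
print(stored[0])
print(len(flat), flat[0], flat[-1])
```

The first layer is stored as (1.25, 1.625), (2.0, 3.75), (4.125, 2.875), and multiplying by 8 gives the familiar pixel values again.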

def metric(k):  # compute metric
8:     r = wh[:, None] / k[None]
9:     x = torch.min(r, 1 / r).min(2)[0]  # ratio metric
10:    best = x.max(1)[0]  # best_x
11:    aat = (x > 1 / thr).float().sum(1).mean()  # anchors above threshold
12:    bpr = (best > 1 / thr).float().mean()  # best possible recall
13:    return bpr, aat

8: wh[:, None] is [929x1x2] and k[None] is [1x9x2], so broadcasting gives r with shape [929x9x2]: the ratio of each of the 929 label boxes to each of the 9 anchors, width to width and height to height. At this point a label may be larger or smaller than an anchor, so the ratio can be anywhere from a few tenths to a few hundred, which makes it hard to set a threshold directly.

9: The point of the ratio is to measure how close the label and the anchor are. A ratio close to 1 means the anchor is well chosen and nearly coincides with the label; a much larger or much smaller value means it is not. Here is the trick: a ratio that is too large and one that is too small are equally bad, so take the reciprocal, which turns very large values into very small ones and leaves values near 1 almost unchanged. torch.min(r, 1 / r) therefore maps every ratio into the range 0 to 1, where closer to 1 is better and both extremes end up close to 0. The following .min(2) keeps the worse of the width ratio and the height ratio for each label-anchor pair. Clever, and elegant too!

10: x now has shape [929x9]. For each label, take the anchor with the highest matching degree, that is, the value closest to 1, hence max. best has shape [929]: for every label, the score of the best-matching of the 9 anchors.

11: One more point worth stressing. When deciding whether a label is matched, it is enough for any one of the 9 anchors to exceed the threshold; not all 9 need to match. That is why step 10 takes the maximum over anchors, while step 9 takes the minimum over width and height: an anchor only matches if the worse of its two ratios still passes.

Why x > 1/thr and not x > thr? Because x is already a ratio normalized to between 0 and 1. The 1 in 1/thr is the ideal ratio, and thr is the largest factor by which label and anchor may differ. With thr set to 4, neither the width nor the height of the gt box may differ from the anchor's by more than a factor of 4, which is actually a very loose requirement.

12: The meaning of the two indicators:

        AAT (anchors above threshold) is the average number of anchors per label that exceed the threshold, with all anchors taking part in the calculation. With coco128 and the yolov5s configuration file it comes out as 4.26695, meaning each label matches 4.26 anchors on average, which is a very good result.

        BPR (best possible recall) considers the best case: for each label, take the highest score among the 9 anchors, then compute the fraction of labels whose best score exceeds the threshold.
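To see the shapes and the two indicators on something small, here is a NumPy re-creation of steps 8 to 12. It is a sketch, not the library function: the four label sizes and three anchors are made up, and NumPy stands in for torch.

```python
import numpy as np

thr = 4.0
# Hypothetical label sizes (w, h) in pixels and three anchors.
wh = np.array([[12.0, 15.0], [100.0, 20.0], [300.0, 280.0], [2.0, 600.0]])
k = np.array([[10.0, 13.0], [62.0, 45.0], [373.0, 326.0]])

r = wh[:, None] / k[None]               # [4, 1, 2] / [1, 3, 2] -> [4, 3, 2]
x = np.minimum(r, 1 / r).min(axis=2)    # worse of the w and h ratios, in (0, 1]
best = x.max(axis=1)                    # best anchor score for each label
aat = (x > 1 / thr).sum(axis=1).mean()  # anchors above threshold per label
bpr = (best > 1 / thr).mean()           # best possible recall

print(r.shape, x.shape, best.shape)
print(bpr, aat)
```

The last label, 2x600 pixels, is so elongated that no anchor is within a factor of 4 in both width and height, so it is the one that keeps bpr below 1 here.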

This answers question 1 and also explains the two key indicators, bpr and aat. Does the threshold need adjusting, and how? The actual effect has to be determined by experiment, but in general a larger threshold makes the requirement on the anchors looser, and a smaller one makes it stricter. My personal understanding is that for competitions a smaller value is better, because it pushes the clustered anchors to fit the data distribution more closely, though this may hurt the generalization ability of the algorithm. anchor_t can even be treated as a variable: step it through the range 1-5 and plot bpr or aat against anchor_t as a basis for the decision.
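The sweep over anchor_t suggested above could look roughly like this. It is a sketch in plain Python on made-up label sizes and anchors; ratio_metric and bpr_aat are hypothetical helpers that mirror the metric function, and the printed table would be the data behind the plot.

```python
def ratio_metric(wh, anchors):
    """For each label, the worse of the w and h ratios against every anchor."""
    rows = []
    for w, h in wh:
        row = []
        for aw, ah in anchors:
            rw, rh = w / aw, h / ah
            row.append(min(rw, 1 / rw, rh, 1 / rh))
        rows.append(row)
    return rows


def bpr_aat(wh, anchors, thr):
    """Best possible recall and anchors above threshold for a given anchor_t."""
    x = ratio_metric(wh, anchors)
    aat = sum(sum(v > 1 / thr for v in row) for row in x) / len(x)
    bpr = sum(max(row) > 1 / thr for row in x) / len(x)
    return bpr, aat


# Hypothetical label sizes and anchors, in pixels.
wh = [(12.0, 15.0), (100.0, 20.0), (300.0, 280.0), (2.0, 600.0)]
anchors = [(10.0, 13.0), (62.0, 45.0), (373.0, 326.0)]

sweep = [(t, *bpr_aat(wh, anchors, t)) for t in (1.0, 2.0, 3.0, 4.0, 5.0)]
for t, bpr, aat in sweep:
    print(f"anchor_t={t:.1f}  bpr={bpr:.2f}  aat={aat:.2f}")
```

Both indicators can only stay the same or rise as anchor_t grows, which is the sense in which a larger threshold is a looser requirement.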

3) kmean_anchors (successful)

        Anchors are re-clustered when bpr <= 0.98. To reproduce this in an experiment, the preset anchors have to be changed first so that bpr drops below that value; otherwise this step is never reached. First, a review of kmeans.

        The specific implementation process of the kmeans clustering algorithm

def kmean_anchors(dataset='./data/coco128.yaml', n=9, img_size=640, thr=4.0, gen=1000, verbose=True):

    npr = np.random
1:  thr = 1 / thr

2:  def metric(k, wh):  # compute metrics
        r = wh[:, None] / k[None]
        x = torch.min(r, 1 / r).min(2)[0]  # ratio metric
        # x = wh_iou(wh, torch.tensor(k))  # iou metric
        return x, x.max(1)[0]  # x, best_x

3:  def anchor_fitness(k):  # mutation fitness
        _, best = metric(torch.tensor(k, dtype=torch.float32), wh)
        return (best * (best > thr).float()).mean()  # fitness

4:  def print_results(k, verbose=True):
        k = k[np.argsort(k.prod(1))]  # sort small to large
        x, best = metric(k, wh0)
        bpr, aat = (best > thr).float().mean(), (x > thr).float().mean() * n  # best possible recall, anch > thr
        s = f'{PREFIX}thr={thr:.2f}: {bpr:.4f} best possible recall, {aat:.2f} anchors past thr\n' \
            f'{PREFIX}n={n}, img_size={img_size}, metric_all={x.mean():.3f}/{best.mean():.3f}-mean/best, ' \
            f'past_thr={x[x > thr].mean():.3f}-mean: '
        for x in k:
            s += '%i,%i, ' % (round(x[0]), round(x[1]))
        if verbose:
            LOGGER.info(s[:-2])
        return k

    # Get label wh
5:  shapes = img_size * dataset.shapes / dataset.shapes.max(1, keepdims=True)
6:  wh0 = np.concatenate([l[:, 3:5] * s for s, l in zip(shapes, dataset.labels)])  # wh

    # Filter
7:  i = (wh0 < 3.0).any(1).sum()
    if i:
        LOGGER.info(f'{PREFIX}WARNING: Extremely small objects found: {i} of {len(wh0)} labels are < 3 pixels in size')
8:  wh = wh0[(wh0 >= 2.0).any(1)]  # filter > 2 pixels
    # wh = wh * (npr.rand(wh.shape[0], 1) * 0.9 + 0.1)  # multiply by random scale 0-1

    # Kmeans init
    try:
        LOGGER.info(f'{PREFIX}Running kmeans for {n} anchors on {len(wh)} points...')
        assert n <= len(wh)  # apply overdetermined constraint
9:      s = wh.std(0)  # sigmas for whitening
10:     k = kmeans(wh / s, n, iter=30)[0] * s  # points
        assert n == len(k)  # kmeans may return fewer points than requested if wh is insufficient or too similar
    except Exception:
        LOGGER.warning(f'{PREFIX}WARNING: switching strategies from kmeans to random init')
11:     k = np.sort(npr.rand(n * 2)).reshape(n, 2) * img_size  # random init
    wh, wh0 = (torch.tensor(x, dtype=torch.float32) for x in (wh, wh0))
    k = print_results(k, verbose=False)

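3: The fitness is the mean, over all labels, of each label's best ratio, where a label whose best ratio does not exceed the threshold contributes 0. It serves as a single score for how well a set of anchors matches the labels.

5~6: The same processing as before: scale the images to a longest side of img_size and get the label widths and heights in pixels, this time without the random perturbation.

7~8: any returns True if at least one element is truthy (not empty/0/None); all returns True only if every element is. Step 7 counts the labels whose width or height is under 3 pixels and only logs a warning. Step 8 does the filtering: with any, a label is kept if its width or its height is at least 2 pixels. If the intent is that both width and height must be at least 2, all feels more appropriate.

9: This is the whitening step. Whitening in general does two things: it reduces the dependence between features, as in principal component analysis where dimensions with a small share of the variance can be dropped, and it standardizes each feature to a variance of 1. The code here does only the standardization: width and height are each divided by their standard deviation.

10: The standardized label widths and heights are the input to kmeans (in the YOLOv5 source this is scipy.cluster.vq.kmeans). kmeans needs the number of cluster centers, here n=9. Note that iter=30 in scipy's kmeans means the algorithm is run 30 times and the codebook with the lowest distortion is kept. The returned cluster centers are multiplied by s to bring them back to pixels, and these are the clustered anchor boxes. If fewer than 9 are returned, the assert fails and 9 anchors are generated at random instead (step 11).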
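The difference between any and all in step 8, and the scale-then-restore round trip around kmeans in steps 9 and 10, can be shown on a handful of made-up label sizes. This is only a sketch in plain Python: statistics.pstdev stands in for NumPy's std (both are the population form by default), and the clustering itself is left out.

```python
from statistics import pstdev

# Hypothetical label sizes (w, h) in pixels.
wh0 = [(1.5, 40.0), (1.0, 1.0), (10.0, 20.0), (200.0, 90.0), (64.0, 300.0)]

# Step 8 keeps a label when width OR height is >= 2 (the any(1) form).
kept_any = [p for p in wh0 if any(v >= 2.0 for v in p)]
# The stricter alternative discussed above: width AND height >= 2.
kept_all = [p for p in wh0 if all(v >= 2.0 for v in p)]

# Step 9: per-column standard deviation.
s = (pstdev([w for w, _ in kept_any]), pstdev([h for _, h in kept_any]))
# Step 10 feeds wh / s to kmeans, then multiplies the returned centers by s.
whitened = [(w / s[0], h / s[1]) for w, h in kept_any]
restored = [(w * s[0], h * s[1]) for w, h in whitened]

print(len(kept_any), len(kept_all))
print(pstdev([w for w, _ in whitened]), pstdev([h for _, h in whitened]))
```

The 1.5x40 box survives the any filter but not the all filter, and after dividing by s both columns have a standard deviation of 1, so neither width nor height dominates the distance used in clustering.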

    # Evolve
    f, sh, mp, s = anchor_fitness(k), k.shape, 0.9, 0.1  # fitness, generations, mutation prob, sigma
    pbar = tqdm(range(gen), bar_format='{l_bar}{bar:10}{r_bar}{bar:-10b}')  # progress bar
    for _ in pbar:
1:      v = np.ones(sh)
        while (v == 1).all():  # mutate until a change occurs (prevent duplicates)
            v = ((npr.random(sh) < mp) * random.random() * npr.randn(*sh) * s + 1).clip(0.3, 3.0)
2:      kg = (k.copy() * v).clip(min=2.0)
        fg = anchor_fitness(kg)
3:      if fg > f:
4:          f, k = fg, kg.copy()
            pbar.desc = f'{PREFIX}Evolving anchors with Genetic Algorithm: fitness = {f:.4f}'
            if verbose:
                print_results(k, verbose)
    return print_results(k)

        This part exists because the fitness after kmeans clustering may not yet be good enough, and a better result may be found by applying 1000 rounds of relatively small perturbations. The multiplier v is drawn from a Gaussian distribution centered on 1, so most mutations stay close to the clustered anchors while occasionally reaching further out, and a mutated set is kept only when its fitness improves. Clustering is not black magic after all. With 9 anchors, the simplest choice is kmeans, but clustering has some randomness (the cluster centers are initialized randomly), so 1000 small perturbations are simulated, each is scored with the fitness indicator, and the best one is output as the final clustered anchors.
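The mutate-and-keep-if-better loop can be sketched in plain Python. This is an illustration under stated assumptions, not the library code: fitness and evolve are hypothetical names, the label sizes and starting anchors are made up, a fixed seed is used for repeatability, and the "mutate until a change occurs" retry is omitted because an unchanged candidate is simply never accepted.

```python
import random


def fitness(anchors, wh, thr=4.0):
    """Mean best ratio per label, with labels below 1/thr counted as 0."""
    total = 0.0
    for w, h in wh:
        best = max(min(w / aw, aw / w, h / ah, ah / h) for aw, ah in anchors)
        total += best if best > 1 / thr else 0.0
    return total / len(wh)


def evolve(anchors, wh, gen=200, mp=0.9, sigma=0.1, seed=0):
    """Keep a mutated copy only when its fitness is strictly higher."""
    rng = random.Random(seed)
    best, f = list(anchors), fitness(anchors, wh)
    for _ in range(gen):
        scale = rng.random()  # one shared random factor per generation
        cand = []
        for w, h in best:
            new = []
            for side in (w, h):
                v = 1.0
                if rng.random() < mp:  # mutate this value with probability mp
                    v = min(max(rng.gauss(0, 1) * scale * sigma + 1, 0.3), 3.0)
                new.append(max(side * v, 2.0))  # never below 2 pixels
            cand.append(tuple(new))
        fc = fitness(cand, wh)
        if fc > f:
            best, f = cand, fc
    return best, f


# Hypothetical label sizes and starting anchors, in pixels.
wh = [(12.0, 15.0), (100.0, 20.0), (300.0, 280.0), (40.0, 90.0)]
start = [(10.0, 13.0), (62.0, 45.0), (373.0, 326.0)]
f0 = fitness(start, wh)
evolved, f1 = evolve(start, wh)
print(f"{f0:.4f} -> {f1:.4f}")
```

Because a candidate replaces the current anchors only on a strict improvement, the fitness can never go down from one generation to the next.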

        The answer to the last question is now fairly clear. kmeans can indeed improve how well the anchors match a specific dataset, and with it the recall, but that does not mean manually set anchors are necessarily worse. Autoanchor is simply a productivity tool that simplifies the training process: there is less need to deliberate over how to set the anchors, and the regression can cope with anchors that are roughly right either way. The tool can also help uncover hidden problems, serving as a diagnostic for the data and the model, which is convenient and efficient.

Source: YOLOv5's autoanchor is enough to read this article - Zhihu (zhihu.com)

Reference: Training YOLO? Select Anchor Boxes Like This | by Olga Chernytska | Towards Data Science


Origin blog.csdn.net/dou3516/article/details/130555280