The clip operation can cause optimization to fail: after clipping, the learnable parameter is effectively replaced by a constant, so no gradient can flow through it.

While revisiting the motivation behind GIoU, I came across this sentence:

When the GT box and the predicted box do not overlap, the IoU stays 0 no matter which direction the box moves, so it cannot be optimized.

In other words, consider the two cases below: in the left picture the boxes overlap and the IoU is greater than 0; in the right picture they do not overlap and the IoU is always 0.

[Figure: left, overlapping boxes with IoU > 0; right, non-overlapping boxes with IoU = 0]

"Cannot be optimized" means the gradient is 0, so gradient-based optimization methods cannot make progress. But which part of the IoU computation causes this? Is it simply that, because the IoU is always 0, the gradient of the IoU is 0?
(This much is obvious: no matter which direction the box is nudged, the IoU stays at the constant value 0, that is, the function is constant in this neighborhood, so its gradient is 0.)
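
To make the "constant in a neighborhood" point concrete, here is a minimal numerical sketch (my own illustration, not code from the original post; iou_xyxy is a throwaway helper): nudge the non-overlapping box slightly in every direction and the IoU stays exactly 0.

def iou_xyxy(a, b):
    # intersection width/height, floored at 0 for disjoint boxes
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

gt   = [0.0, 0.0, 1.0, 1.0]
pred = [10.3, 10.3, 10.6, 10.6]        # no overlap with gt
eps  = 1e-3
for dx, dy in [(eps, 0), (-eps, 0), (0, eps), (0, -eps)]:
    shifted = [pred[0] + dx, pred[1] + dy, pred[2] + dx, pred[3] + dy]
    print(iou_xyxy(shifted, gt))       # prints 0.0 every time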

But at the code level, which operation is responsible for this unoptimizable behavior?

Let me state the conclusion first: it is caused by the clip operation. After clip, the learnable parameter is effectively replaced by a constant, so it can no longer be differentiated. Let's run a comparative experiment to watch the gradient become 0 when the IoU is 0.

1. Code Experiment 1

First, here is a function that computes IoU. If you are not working on detection, there is no need to study the implementation; just note that there is a clip step in the middle.

import torch

def box_area(boxes: torch.Tensor) -> torch.Tensor:
    """
    Computes the area of a set of bounding boxes, which are specified by their
    (x1, y1, x2, y2) coordinates.

    Args:
        boxes (Tensor[N, 4]): boxes for which the area will be computed. They
            are expected to be in (x1, y1, x2, y2) format with
            ``0 <= x1 < x2`` and ``0 <= y1 < y2``.

    Returns:
        Tensor[N]: the area for each box
    """
    # boxes = _upcast(boxes)
    return (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])

    

def box_iou(boxes1, boxes2):
    area1 = box_area(boxes1)
    area2 = box_area(boxes2)

    lt = torch.maximum(boxes1[:, None, :2], boxes2[:, :2])  # [N,M,2]
    rb = torch.minimum(boxes1[:, None, 2:], boxes2[:, 2:])  # [N,M,2]

    wh = (rb - lt).clip(min=0)  # [N,M,2]  clip negative width/height to 0 when the boxes do not overlap
    inter = wh[:, :, 0] * wh[:, :, 1]  # [N,M]

    union = area1[:, None] + area2 - inter

    iou = inter / union
    return iou, union
# the predicted box here is a learnable parameter
pred_box = torch.tensor([0.3, 0.3, 0.6, 0.6])[None]
pred_box.requires_grad = True

# this is the GT (ground-truth) box
GT_box   = torch.tensor([0.0, 0.0, 1.0, 1.0])[None]

# learning rate: each step moves a proportion of 0.001 along the negative gradient direction
lr = 0.001
for _ in range(10000):
    iou, _ = box_iou(pred_box, GT_box)
    loss = 1 - iou.sum()
    print(loss)
    loss.backward()
    with torch.no_grad():
        pred_box -= pred_box.grad * lr
        print(pred_box.grad)
        pred_box.grad = None # manually clear this tensor's gradient

Both pred_box and GT_box are coordinates in x1y1x2y2 format. From the values in the code, they clearly overlap, so the loss can decrease and pred_box gradually moves toward GT_box.

Now, if I change pred_box so that it no longer overlaps with GT_box:

# pred_box = torch.tensor([0.3, 0.3, 0.6, 0.6])[None]
pred_box = torch.tensor([10.3, 10.3, 10.6, 10.6])[None]         # this is the only change: the box no longer overlaps the GT
pred_box.requires_grad = True

GT_box   = torch.tensor([0.0, 0.0, 1.0, 1.0])[None]
lr = 0.001
for _ in range(10000):
    iou, _ = box_iou(pred_box, GT_box)
    loss = 1 - iou.sum()
    print(loss)
    loss.backward()
    with torch.no_grad():
        pred_box -= pred_box.grad * lr
        print(pred_box.grad)
        pred_box.grad = None # manually clear this tensor's gradient

Print the gradient and you can see that it is always 0:

tensor([[0., 0., 0., 0.]])
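
To see where the zero comes from, here is a small sketch of my own (assuming torch is already imported as above; b1 and b2 are just illustrative names for the two boxes) that prints the intermediate width/height inside box_iou for the non-overlapping case:

b1 = torch.tensor([[10.3, 10.3, 10.6, 10.6]])   # non-overlapping predicted box
b2 = torch.tensor([[0.0, 0.0, 1.0, 1.0]])       # GT box

lt = torch.maximum(b1[:, None, :2], b2[:, :2])
rb = torch.minimum(b1[:, None, 2:], b2[:, 2:])

print(rb - lt)                # negative width/height: tensor([[[-9.3000, -9.3000]]])
print((rb - lt).clip(min=0))  # clip replaces them with the constant 0, cutting off the gradient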

This experiment only verifies that the gradient is 0 when the two boxes do not overlap, so nothing can be optimized. The next small experiment verifies that it is the clip that drives the gradient to 0.

2. Code Experiment 2

pred_box_wo_clip = torch.tensor([0.3, 0.3, -3, -3])
pred_box_wo_clip.requires_grad = True


GT_box   = torch.tensor([1.0, 1.0, 1.0, 1.0])
lr = 0.0001
for _ in range(35000):
    pred_box = torch.clip(pred_box_wo_clip, min=-1)  # the last two coordinates (-3, -3) are clipped to -1
    loss = ( (GT_box - pred_box)**2 ).sum()
    print(loss)
    loss.backward()
    with torch.no_grad():
        pred_box_wo_clip -= pred_box_wo_clip.grad * lr
        print(pred_box_wo_clip.grad)
        pred_box_wo_clip.grad = None # manually clear this tensor's gradient

Running the code, you can see that the gradient of the last two dimensions is 0, while the first two dimensions are differentiated normally:

tensor([-1.4000, -1.4000,  0.0000,  0.0000])

In mathematics, such a clip operation is called hard truncation.
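
To isolate the effect, here is one more tiny check (my own sketch, not part of the original experiments; torch is assumed imported as above): the (sub)gradient of clip is 1 where the input passes through unchanged and 0 where it was clipped.

x = torch.tensor([-3.0, 0.3], requires_grad=True)
y = torch.clip(x, min=-1).sum()   # -3.0 is clipped to -1, 0.3 passes through
y.backward()
print(x.grad)                     # tensor([0., 1.]): 0 for the clipped element, 1 for the untouched one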

3. Memories

This reminds me of a competition I did with Brother Rui a year ago. There was a step that took an inverse cosine to recover an angle; it went roughly like this:

cos = xxxx   # the cosine of the angle is obtained here
theta = arccos(cos)

Since the arccos function only accepts inputs in [-1, 1] and returns NaN for anything outside that range, and since numerical precision errors sometimes made the cosine come out as 1.000001, I simply clipped it to 1 and then computed the angle.
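
For reference, a tiny sketch of that situation (the value 1.000001 is just illustrative):

import numpy as np

c = 1.000001                          # cosine slightly above 1 due to floating-point error
print(np.arccos(c))                   # nan: outside the valid domain of arccos
print(np.arccos(np.clip(c, -1, 1)))   # 0.0 after clipping back into [-1, 1]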

We were using DDPG at the time, and no matter how we tuned it, the agent just couldn't learn anything and kept turning in one direction. Could one possible reason be that the clip here made the back-propagated gradient always 0, so nothing could be optimized?

But that wasn't it... it really got to me. I dug through last year's blog post and found that this operation was done in NumPy, so it had nothing to do with gradients at all. I can't even remember whether the clip operation was still in the program at that time [facepalm, laughing and crying]

Origin blog.csdn.net/HaoZiHuang/article/details/129560775