A PyTorch pitfall: why won't the L1Loss decrease during model training?

Recently I was training a regression model with L1Loss. The loss was extremely unstable during training and the results were poor. I finally found the reason!

The original code is as follows:

criterion = nn.L1Loss()

def train(epoch):  # epoch passed in so the print below works
    print('Epoch {}:'.format(epoch + 1))
    model.train()  # switch to train mode
    for i, sample_batched in enumerate(train_dataloader):
        input, target = sample_batched['geno'], sample_batched['pheno']
        # compute output
        output = model(input.float().cuda())
        loss = criterion(output, target.float().cuda())
        # compute gradient and do SGD step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

The problem with the above code is this line:

loss = criterion(output, target.float().cuda())

My batch size is 4, so output has size [4, 1], a two-dimensional tensor, while target has size [4]. Despite the mismatch, loss still returns a plausible-looking value, which is why I didn't notice the problem!
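Here is a minimal standalone sketch of the silent broadcast, with random tensors standing in for my model's output:

import torch
import torch.nn as nn

criterion = nn.L1Loss()
output = torch.randn(4, 1)   # model output, shape [4, 1]
target = torch.randn(4)      # labels, shape [4]

# The sizes differ, but L1Loss only warns: both tensors are silently
# broadcast to [4, 4], and a scalar is still returned.
loss = criterion(output, target)
print(loss)   # a plausible-looking scalar, nothing obviously wrong

Let's take a look at the code of l1_loss in the PyTorch library: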

def l1_loss(input, target, size_average=None, reduce=None, reduction='mean'):
    # type: (Tensor, Tensor, Optional[bool], Optional[bool], str) -> Tensor
    r"""l1_loss(input, target, size_average=None, reduce=None, reduction='mean') -> Tensor

    Function that takes the mean element-wise absolute value difference.

    See :class:`~torch.nn.L1Loss` for details.
    """
    if not torch.jit.is_scripting():
        tens_ops = (input, target)
        if any([type(t) is not Tensor for t in tens_ops]) and has_torch_function(tens_ops):
            return handle_torch_function(
                l1_loss, tens_ops, input, target, size_average=size_average, reduce=reduce,
                reduction=reduction)
    if not (target.size() == input.size()):
        warnings.warn("Using a target size ({}) that is different to the input size ({}). "
                      "This will likely lead to incorrect results due to broadcasting. "
                      "Please ensure they have the same size.".format(target.size(), input.size()),
                      stacklevel=2)
    if size_average is not None or reduce is not None:
        reduction = _Reduction.legacy_get_string(size_average, reduce)
    if target.requires_grad:
        ret = torch.abs(input - target)
        if reduction != 'none':
            ret = torch.mean(ret) if reduction == 'mean' else torch.sum(ret)
    else:
        expanded_input, expanded_target = torch.broadcast_tensors(input, target)
        ret = torch._C._nn.l1_loss(expanded_input, expanded_target, _Reduction.get_enum(reduction))
    return ret

The warning in the code says that input and target must have the same size, otherwise broadcasting will likely produce incorrect results. I had suppressed warnings in my own code, so I never saw this one! Let this be a reminder: never silence warnings casually, and read them as carefully as you read errors.
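One way to keep such warnings from slipping by during debugging is to promote them to errors (a sketch; scope the filter to whatever fits your own logging setup):

import warnings

# Turn UserWarnings into exceptions while debugging, so a size
# mismatch like this one halts training instead of scrolling past.
warnings.filterwarnings("error", category=UserWarning)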

I changed the code to the following, and the problem disappeared:

loss = criterion(output.squeeze(), target.float().cuda())
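Two equivalent variants of the fix, sketched here as suggestions rather than taken from the original code: squeeze(1) is slightly safer than a bare squeeze(), which would also collapse the batch dimension if the batch size ever happened to be 1; reshaping the target works just as well.

loss = criterion(output.squeeze(1), target.float().cuda())   # [4, 1] -> [4]
# or give the target the trailing dimension instead:
# loss = criterion(output, target.float().cuda().unsqueeze(1))  # [4] -> [4, 1]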

The problem is solved, but it's still worth understanding why the size mismatch breaks the model; otherwise all the time spent hunting this bug was wasted.

First, the wrong input: input has size [4, 1] and target has size [4]:

input = tensor([[-0.3704, -0.2918, -0.6895, -0.6023]], device='cuda:0',
       grad_fn=<PermuteBackward>)
target = tensor([ 63.6000, 127.0000, 102.2000, 115.4000], device='cuda:0')

expanded_input, expanded_target = torch.broadcast_tensors(input, target)

ret = torch._C._nn.l1_loss(expanded_input, expanded_target, _Reduction.get_enum(reduction))

Return expanded_input:

tensor([[-0.3704, -0.2918, -0.6895, -0.6023],
        [-0.3704, -0.2918, -0.6895, -0.6023],
        [-0.3704, -0.2918, -0.6895, -0.6023],
        [-0.3704, -0.2918, -0.6895, -0.6023]], device='cuda:0',
       grad_fn=<PermuteBackward>)

Return expanded_target:

tensor([[ 63.6000,  63.6000,  63.6000,  63.6000],
        [127.0000, 127.0000, 127.0000, 127.0000],
        [102.2000, 102.2000, 102.2000, 102.2000],
        [115.4000, 115.4000, 115.4000, 115.4000]], device='cuda:0') 

Return ret:

tensor(102.5385, device='cuda:0', grad_fn=<PermuteBackward>)

Next, the correct input: both input and target have size [4]:

input = tensor([-0.3704, -0.2918, -0.6895, -0.6023], device='cuda:0',
       grad_fn=<PermuteBackward>)
target = tensor([ 63.6000, 127.0000, 102.2000, 115.4000], device='cuda:0')

expanded_input, expanded_target = torch.broadcast_tensors(input, target)

ret = torch._C._nn.l1_loss(expanded_input, expanded_target, _Reduction.get_enum(reduction))

Return expanded_input (unchanged, since the sizes already match):

tensor([-0.3704, -0.2918, -0.6895, -0.6023], device='cuda:0',
       grad_fn=<PermuteBackward>)

Return ret:

tensor(102.5385, device='cuda:0', grad_fn=<PermuteBackward>)

After the mean reduction, the returned ret is identical in both cases; the only difference is the intermediate expanded tensors. (The equal loss values are themselves a coincidence of this data: every prediction is negative and every target positive, so both the 4-element mean and the 16-element mean work out to mean(target) + mean(|prediction|).)
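A quick standalone check of that coincidence, using the values printed above:

import torch

x = torch.tensor([-0.3704, -0.2918, -0.6895, -0.6023])
t = torch.tensor([63.6000, 127.0000, 102.2000, 115.4000])

paired = (x - t).abs().mean()                           # mean over the 4 matched pairs
all_pairs = (x.view(4, 1) - t.view(1, 4)).abs().mean()  # mean over all 16 broadcast pairs
print(paired.item(), all_pairs.item())  # both ~102.5385 for this data

So the loss value looks fine, but the intermediate tensors differ. Could that change the gradients? To verify, we print the gradients of the model's parameters: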

# Call this right after loss.backward() to inspect the gradients.
for name, params in model.named_parameters():
    print('name:', name)
    print('requires_grad:', params.requires_grad)
    print('grad_value:', params.grad)

The following gradients come from the wrong input (input size [4, 1], target size [4]):

===
name: module.linear1.bias
requires_grad: True
grad_value: tensor([-0.1339,  0.0000,  0.0505,  0.0219, -0.1498,  0.0265, -0.0604, -0.0385,
         0.0471,  0.0000,  0.0304,  0.0000,  0.0000,  0.0406,  0.0066,  0.0000,
        -0.0259, -0.1544,  0.0000, -0.0208,  0.0050,  0.0000,  0.0625, -0.0474,
         0.0000,  0.0858, -0.0116,  0.0777,  0.0000, -0.0828,  0.0000, -0.1265],
       device='cuda:0')
===
name: module.linear2.weight
requires_grad: True
grad_value: tensor([[-0.9879, -0.0000, -1.0088, -0.1680, -0.7312, -0.0066, -0.3093, -0.7478,
         -0.3104, -0.0000, -0.1615, -0.0000, -0.0000, -0.3162, -0.1047, -0.0000,
         -0.4030, -0.3385, -0.0000, -0.1738, -0.0831, -0.0000, -0.3490, -0.1129,
         -0.0000, -0.8220, -0.0279, -0.3754, -0.0000, -0.3566, -0.0000, -0.5950]],
       device='cuda:0')
===
name: module.linear2.bias
requires_grad: True
grad_value: tensor([-1.], device='cuda:0')
===

The following gradients come from the correct input (input size [4], target size [4]):

===
name: module.linear1.bias
requires_grad: True
grad_value: tensor([-0.1351,  0.0000,  0.0000,  0.0000, -0.0377,  0.0000, -0.0809, -0.0394,
         0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0202,  0.0098, -0.0365,
        -0.0263, -0.2063, -0.1533, -0.0626,  0.0050,  0.0000,  0.0000, -0.0950,
         0.0000,  0.0000, -0.0348,  0.0000,  0.0000, -0.1108, -0.0402, -0.1693],
       device='cuda:0')
===
name: module.linear2.weight
requires_grad: True
grad_value: tensor([[-7.4419,  0.0000,  0.0000,  0.0000, -1.9245,  0.0000, -2.7927, -2.4551,
          0.0000,  0.0000,  0.0000,  0.0000,  0.0000, -0.0309, -0.4843, -0.0211,
         -1.7046, -7.7090, -0.1696, -0.9997, -0.0862,  0.0000,  0.0000, -2.0397,
          0.0000,  0.0000, -0.3125,  0.0000,  0.0000, -3.9532, -0.0643, -6.5799]],
       device='cuda:0')
===
name: module.linear2.bias
requires_grad: True
grad_value: tensor([-1.], device='cuda:0')
===

Sure enough, the gradient values are different! Lesson learned: understand exactly what every line of code does; don't take anything for granted!
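To reproduce the whole effect end to end, here is a minimal sketch with a hypothetical one-layer model (not the model from this post):

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Linear(3, 1)
x = torch.randn(4, 3)
t = torch.randn(4)

# Wrong: output [4, 1] vs target [4] silently broadcasts to [4, 4].
F.l1_loss(model(x), t).backward()
grad_bad = model.weight.grad.clone()

# Right: the shapes match after squeezing the trailing dimension.
model.zero_grad()
F.l1_loss(model(x).squeeze(1), t).backward()
print(torch.allclose(grad_bad, model.weight.grad))  # False: the gradients differ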

Origin: blog.csdn.net/u013685264/article/details/125328578