RepOptimizer paper understanding and code analysis

The previous article introduced RepVGG. RepVGG is difficult to quantize; RepOpt addresses this quantization-unfriendliness by moving the structural prior into the optimizer and unifying the training-time and inference-time models.

Link to the paper: Re-parameterizing Your Optimizers rather than Architectures

Introduction

Repopt proposes to inject the prior information of the model structure directly into the gradients by modifying their values, which is called Gradient Re-parameterization, and the corresponding optimizer is called a RepOptimizer. Repopt focuses on VGG-style plain models: the trained RepOpt-VGG has exactly the same plain structure as VGG, so training is efficient, the structure is simple and direct, and inference is extremely fast.

Differences from RepVGG:
1) RepVGG adds structural priors (an identity shortcut and a 1x1 branch) during training and fuses the branches into a single 3x3 convolution for inference. RepOpt-VGG instead moves the structural prior into the gradients, which is realized by the custom RepOpt optimizer.
2) Structurally, RepOpt-VGG is a truly plain model that is identical during training and inference, whereas the multi-branch RepVGG requires more GPU memory and training time.
3) RepOpt-VGG realizes the equivalence between structural re-parameterization and gradient re-parameterization by customizing the optimizer.

Idea

[Figure: a CSLA block (parallel linear branches with constant scales) and the equivalent single operator trained with modified gradients (Grad Mults)]

Repopt observes an interesting phenomenon about structural priors: when each branch contains only one linear trainable operator, the performance of the model improves as long as the constant scale values are set properly. Such a linear block is called Constant Scale Linear Addition (CSLA). A CSLA block can be replaced by a single operator with equivalent training dynamics, provided the optimizer multiplies the gradients by properly constructed constants. Repopt calls these multipliers the Grad Mults, as shown in the figure above.

Proof: Training a CSLA block with regular SGD is equivalent to training a simple convolution with modified gradients

Each branch in a CSLA block contains only one trainable linear operator, and there is no training-time nonlinearity such as BN or dropout in the block. Repopt shows that training a CSLA block with regular SGD is equivalent to training a single convolution with modified gradients. The following simple case proves this conclusion.

Suppose the CSLA block consists of two convolutions of the same shape, each multiplied by a constant scale. As shown in the following formula, $\alpha_A$ and $\alpha_B$ are the constant scales of the two branches, $W_A$ and $W_B$ are the trainable convolution kernels, $X$ is the input, $Y_{CSLA}$ is the output of the CSLA block, and $*$ denotes the convolution operation.

$$Y_{CSLA} = \alpha_A (X * W_A) + \alpha_B (X * W_B)$$

The corresponding Gradient Re-parameterization (GR) formula is $Y_{GR} = X * W'$, where $W'$ is the single convolution kernel trained with re-parameterized gradients. Let $L$ be the loss function, let $i$ index the training iterations, write the gradient of a kernel $W$ as $\frac{\partial L}{\partial W}$, and let $F(\frac{\partial L}{\partial W'})$ denote the (yet to be determined) transformation applied to the gradient of $W'$. We want the output of CSLA and the output of GR to remain identical after any number of training iterations, namely

$$Y_{CSLA}^{(i)} = Y_{GR}^{(i)}, \quad \forall\, i \ge 0$$

By the linearity (additivity) of convolution, this is guaranteed as long as Equation 6 holds at every iteration:

$$W'^{(i)} = \alpha_A W_A^{(i)} + \alpha_B W_B^{(i)} \tag{6}$$
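
As a quick sanity check of this linearity (a minimal sketch of my own, not code from the paper or its repository; the tensor shapes and scales are arbitrary), the two scaled branches can be collapsed into a single kernel:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 4, 8, 8)            # input X
w_a = torch.randn(4, 4, 3, 3)          # branch kernel W_A
w_b = torch.randn(4, 4, 3, 3)          # branch kernel W_B
alpha_a, alpha_b = 0.5, 2.0            # constant scales

# Two-branch CSLA output: alpha_A * (X * W_A) + alpha_B * (X * W_B)
y_csla = alpha_a * F.conv2d(x, w_a, padding=1) + alpha_b * F.conv2d(x, w_b, padding=1)

# Single-kernel output with W' = alpha_A * W_A + alpha_B * W_B (Equation 6)
y_merged = F.conv2d(x, alpha_a * w_a + alpha_b * w_b, padding=1)

print(torch.allclose(y_csla, y_merged, atol=1e-5))   # True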

Before training starts (i.e., before iteration $i=0$), Equation 6 is satisfied by a proper initialization, shown in Equation 7:

$$W'^{(0)} = \alpha_A W_A^{(0)} + \alpha_B W_B^{(0)} \tag{7}$$

Next, we use mathematical induction to prove that the equivalence in Equation 6 always holds for $W'$. With plain SGD and learning rate $\lambda$, the update rule for each CSLA kernel is

$$W_A^{(i+1)} = W_A^{(i)} - \lambda \frac{\partial L}{\partial W_A^{(i)}}, \qquad W_B^{(i+1)} = W_B^{(i)} - \lambda \frac{\partial L}{\partial W_B^{(i)}}$$

Combining the updates of the two branches of the CSLA block, we obtain Equation 10:

$$\alpha_A W_A^{(i+1)} + \alpha_B W_B^{(i+1)} = \alpha_A W_A^{(i)} + \alpha_B W_B^{(i)} - \lambda\left(\alpha_A \frac{\partial L}{\partial W_A^{(i)}} + \alpha_B \frac{\partial L}{\partial W_B^{(i)}}\right) \tag{10}$$

On the GR side, we update $W'$ with the transformed gradient $F(\frac{\partial L}{\partial W'})$, which means

$$W'^{(i+1)} = W'^{(i)} - \lambda\, F\!\left(\frac{\partial L}{\partial W'^{(i)}}\right) \tag{11}$$

Assume that Equation 6 holds at iteration $i$, with the updates given by Equations 10 and 11. For Equation 6 to also hold at iteration $i+1$, i.e. $W'^{(i+1)} = \alpha_A W_A^{(i+1)} + \alpha_B W_B^{(i+1)}$, comparing Equations 10 and 11 shows that the gradient transformation must satisfy Equation 12:

$$F\!\left(\frac{\partial L}{\partial W'^{(i)}}\right) = \alpha_A \frac{\partial L}{\partial W_A^{(i)}} + \alpha_B \frac{\partial L}{\partial W_B^{(i)}} \tag{12}$$

Because Equation 6 holds at iteration $i$, the CSLA output and the GR output are identical, so differentiating through Equation 6 with the chain rule relates the branch gradients to the gradient of $W'$, giving Equation 13:

$$\frac{\partial L}{\partial W_A^{(i)}} = \alpha_A \frac{\partial L}{\partial W'^{(i)}}, \qquad \frac{\partial L}{\partial W_B^{(i)}} = \alpha_B \frac{\partial L}{\partial W'^{(i)}} \tag{13}$$

Substituting Equation 13 into Equation 12, we obtain Equation 14, i.e., the exact form of $F(\frac{\partial L}{\partial W'})$:

$$F\!\left(\frac{\partial L}{\partial W'^{(i)}}\right) = (\alpha_A^2 + \alpha_B^2)\, \frac{\partial L}{\partial W'^{(i)}} \tag{14}$$

From Equations 11 and 14, the GR weight at iteration $i+1$ is

$$W'^{(i+1)} = W'^{(i)} - \lambda\,(\alpha_A^2 + \alpha_B^2)\, \frac{\partial L}{\partial W'^{(i)}}$$

Since Equation 6 is assumed to hold at iteration $i$, substituting it together with Equations 13 and 10 gives

$$W'^{(i+1)} = \alpha_A W_A^{(i)} + \alpha_B W_B^{(i)} - \lambda\left(\alpha_A \frac{\partial L}{\partial W_A^{(i)}} + \alpha_B \frac{\partial L}{\partial W_B^{(i)}}\right) = \alpha_A W_A^{(i+1)} + \alpha_B W_B^{(i+1)}$$

that is, Equation 6 also holds at iteration $i+1$.

With the initial conditions (Equations 7 and 8) and mathematical induction, we have proved that Equation 6 holds for all $i \ge 0$, and at the same time we have obtained the exact form of $F(\frac{\partial L}{\partial W'})$, as given in Equation 14.
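
A minimal numerical check of this conclusion (my own sketch, not code from the paper or its repository; the shapes, scales, and learning rate are arbitrary): train a two-branch CSLA block with plain SGD, train a single convolution whose gradient is multiplied by $\alpha_A^2 + \alpha_B^2$, and verify that Equation 6 keeps holding.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
alpha_a, alpha_b, lr = 0.5, 2.0, 0.01

# CSLA side: two trainable 3x3 kernels with constant scales
w_a = torch.randn(4, 4, 3, 3, requires_grad=True)
w_b = torch.randn(4, 4, 3, 3, requires_grad=True)
# GR side: one kernel initialized as alpha_A*W_A + alpha_B*W_B (Equation 7)
w_prime = (alpha_a * w_a + alpha_b * w_b).detach().requires_grad_(True)

for _ in range(5):
    x = torch.randn(2, 4, 8, 8)
    target = torch.randn(2, 4, 8, 8)

    # CSLA block trained with ordinary SGD
    y_csla = alpha_a * F.conv2d(x, w_a, padding=1) + alpha_b * F.conv2d(x, w_b, padding=1)
    loss_csla = ((y_csla - target) ** 2).mean()
    g_a, g_b = torch.autograd.grad(loss_csla, [w_a, w_b])
    with torch.no_grad():
        w_a -= lr * g_a
        w_b -= lr * g_b

    # Single conv trained with the re-parameterized gradient (Equation 14)
    y_gr = F.conv2d(x, w_prime, padding=1)
    loss_gr = ((y_gr - target) ** 2).mean()
    (g_prime,) = torch.autograd.grad(loss_gr, [w_prime])
    with torch.no_grad():
        w_prime -= lr * (alpha_a ** 2 + alpha_b ** 2) * g_prime

# Equation 6 should still hold after training
print(torch.allclose(w_prime, alpha_a * w_a + alpha_b * w_b, atol=1e-5))   # True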

Method

Above, we have introduced how Repopt identifies a suitable structural prior, the CSLA block, and proved by mathematical induction that a CSLA block is equivalent to a single convolution trained with gradient re-parameterization. Below, we take RepOpt-VGG as a concrete example to describe how Repopt designs the gradient re-parameterization behavior.

In RepOpt-VGG, the CSLA counterpart of a RepVGG block replaces the (3x3 conv + BN), (1x1 conv + BN) and (identity + BN) branches with a 3x3 convolution and a 1x1 convolution, each followed by a channel-wise scale (constant in the CSLA view, trainable during HyperSearch), plus an identity branch when the input and output channels are equal. Extending the two-branch derivation above to this multi-branch case, and letting $s$ and $t$ be the scale vectors of the 3x3 and 1x1 branches respectively, the Grad Mult $M$ applied element-wise to the gradient of the single 3x3 kernel $W'$ is given by Equation 3:

$$
M_{c,d,p,q} =
\begin{cases}
1 + s_c^2 + t_c^2, & \text{if } C_{in} = C_{out},\ c = d,\ p = q = \text{center} \\
s_c^2 + t_c^2, & \text{else if } p = q = \text{center} \\
s_c^2, & \text{otherwise}
\end{cases}
\tag{3}
$$

i.e., the update rule is $F\!\left(\frac{\partial L}{\partial W'}\right) = M \odot \frac{\partial L}{\partial W'}$, where $\odot$ denotes element-wise multiplication.

Equation 3 is best understood together with RepVGG. When the input and output channel counts differ, there are only two branches, conv3x3 and conv1x1. The conv1x1 is equivalent to a special conv3x3 whose weights live only at the central position, so at the central position the gradient is multiplied by $s_c^2 + t_c^2$ (and by $s_c^2$ elsewhere), exactly as derived above. When the input and output channel counts are equal, there are three branches: identity, conv3x3, and conv1x1. The identity can also be viewed as a special conv3x3 whose kernel consists of 0s and 1s, so at the corresponding central positions the Grad Mult becomes $1 + s_c^2 + t_c^2$.
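
The two equivalences used above can be checked directly (a small illustrative sketch with arbitrary shapes, not taken from the repository): a 1x1 kernel zero-padded to 3x3 yields the same convolution, and the identity mapping equals a 3x3 convolution whose kernel is an identity matrix placed at the center.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
c = 4
x = torch.randn(1, c, 8, 8)

# A 1x1 convolution equals a 3x3 convolution whose kernel is the 1x1 kernel padded with zeros
w_1x1 = torch.randn(c, c, 1, 1)
y_1x1 = F.conv2d(x, w_1x1)
y_as_3x3 = F.conv2d(x, F.pad(w_1x1, [1, 1, 1, 1]), padding=1)
print(torch.allclose(y_1x1, y_as_3x3, atol=1e-6))   # True

# The identity branch equals a 3x3 convolution whose center is an identity matrix over channels
w_id = torch.zeros(c, c, 3, 3)
w_id[torch.arange(c), torch.arange(c), 1, 1] = 1.0
print(torch.allclose(x, F.conv2d(x, w_id, padding=1), atol=1e-6))   # True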

It should be noted that a CSLA block contains no training-time nonlinearity such as BN, nor sequentially stacked trainable operators (e.g., one trainable conv directly following another). CSLA is only a hypothetical tool used here to describe the behavior of the RepOptimizer.

The remaining question is how to determine these scaling factors.

HyperSearch

Inspired by DARTS, the constant scales of CSLA are replaced with trainable parameters and the resulting model is trained on a small dataset (such as CIFAR-100). After this search training finishes, the learned scale values are fixed and used as the constants.

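A minimal HyperSearch sketch (my own illustration, not the repository's training script; the block, data, objective, and hyperparameters are stand-ins): the branch scales are trainable parameters, the model is trained briefly, and the learned scales are then frozen into constants for the RepOptimizer.

import torch
import torch.nn as nn

torch.manual_seed(0)
c = 8
conv3 = nn.Conv2d(c, c, 3, padding=1, bias=False)
conv1 = nn.Conv2d(c, c, 1, bias=False)
s = nn.Parameter(torch.ones(c))   # trainable scale of the 3x3 branch
t = nn.Parameter(torch.ones(c))   # trainable scale of the 1x1 branch

opt = torch.optim.SGD(list(conv3.parameters()) + list(conv1.parameters()) + [s, t], lr=0.1)
for _ in range(20):                                   # stand-in for training on CIFAR-100
    x = torch.randn(4, c, 16, 16)
    y = s.view(1, -1, 1, 1) * conv3(x) + t.view(1, -1, 1, 1) * conv1(x)
    loss = (y - x).pow(2).mean()                      # dummy objective
    opt.zero_grad()
    loss.backward()
    opt.step()

# After HyperSearch, the learned scales become the constants handed to the RepOptimizer
s_const, t_const = s.detach().clone(), t.detach().clone()
print(s_const.shape, t_const.shape)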

Code

LinearAddBlock defines the CSLA-style block; it is only used in the HyperSearch training stage (with trainable scales, or constant scales when is_csla=True).

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn import Parameter, init

class LinearAddBlock(nn.Module):

    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1, padding=1,
                 dilation=1, groups=1, padding_mode='zeros', use_se=False, is_csla=False, conv_scale_init=1.0):
        super(LinearAddBlock, self).__init__()
        self.in_channels = in_channels
        self.relu = nn.ReLU()
        self.conv = nn.Conv2d(in_channels=in_channels, out_channels=out_channels, kernel_size=kernel_size, stride=stride, padding=padding, bias=False)
        self.scale_conv = ScaleLayer(num_features=out_channels, use_bias=False, scale_init=conv_scale_init)
        self.conv_1x1 = nn.Conv2d(in_channels=in_channels, out_channels=out_channels, kernel_size=1, stride=stride, padding=0, bias=False)
        self.scale_1x1 = ScaleLayer(num_features=out_channels, use_bias=False, scale_init=conv_scale_init)
        if in_channels == out_channels and stride == 1:
            self.scale_identity = ScaleLayer(num_features=out_channels, use_bias=False, scale_init=1.0)
        self.bn = nn.BatchNorm2d(out_channels)
        if is_csla:     # Make them constant
            self.scale_1x1.requires_grad_(False)
            self.scale_conv.requires_grad_(False)
        if use_se:
            raise NotImplementedError("se block not supported yet")
        else:
            self.se = nn.Identity()

    def forward(self, inputs):
        out = self.scale_conv(self.conv(inputs)) + self.scale_1x1(self.conv_1x1(inputs))
        if hasattr(self, 'scale_identity'):
            out += self.scale_identity(inputs)
        out = self.relu(self.se(self.bn(out)))
        return out

class ScaleLayer(torch.nn.Module):

    def __init__(self, num_features, use_bias=True, scale_init=1.0):
        super(ScaleLayer, self).__init__()
        self.weight = Parameter(torch.Tensor(num_features))
        init.constant_(self.weight, scale_init)
        self.num_features = num_features
        if use_bias:
            self.bias = Parameter(torch.Tensor(num_features))
            init.zeros_(self.bias)
        else:
            self.bias = None

    def forward(self, inputs):
        if self.bias is None:
            return inputs * self.weight.view(1, self.num_features, 1, 1)
        else:
            return inputs * self.weight.view(1, self.num_features, 1, 1) + self.bias.view(1, self.num_features, 1, 1)
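
A small usage sketch (my own example, not from the repository), relying on the classes defined above: build a LinearAddBlock as used during HyperSearch and list its scale parameters, which are the values that later become the constants of the RepOptimizer. The channel sizes and conv_scale_init below are arbitrary.

block = LinearAddBlock(in_channels=16, out_channels=16, stride=1, conv_scale_init=0.5)
x = torch.randn(1, 16, 32, 32)
print(block(x).shape)                  # torch.Size([1, 16, 32, 32])

for name, p in block.named_parameters():
    if 'scale' in name:
        print(name, tuple(p.shape))    # scale_conv.weight, scale_1x1.weight, scale_identity.weight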

RealVGGBlock is the real building block of RepOptVGG; its structure is a plain conv-BN-ReLU, as shown below.

class RealVGGBlock(nn.Module):

    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1, padding=1,
                 dilation=1, groups=1, padding_mode='zeros', use_se=False,
    ):
        super(RealVGGBlock, self).__init__()
        self.relu = nn.ReLU()
        self.conv = nn.Conv2d(in_channels=in_channels, out_channels=out_channels, kernel_size=kernel_size, stride=stride, padding=padding, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)

        if use_se:
            raise NotImplementedError("se block not supported yet")
        else:
            self.se = nn.Identity()

    def forward(self, inputs):
        out = self.relu(self.se(self.bn(self.conv(inputs))))
        return out

Assuming that HyperSearch on the small dataset has given us the required scales, then when training RepOptVGG, the RepVGGOptimizer needs to transfer the scales of the CSLA block onto the RealVGGBlock weights at initialization. This assignment is done in reinitialize, which is the multi-branch counterpart of the initialization rule (Equation 7), while the Grad Mults of Equation 3 are handled by generate_gradient_masks further below.

def reinitialize(self, scales_by_idx, conv3x3_by_idx, use_identity_scales):
        for scales, conv3x3 in zip(scales_by_idx, conv3x3_by_idx):
            in_channels = conv3x3.in_channels
            out_channels = conv3x3.out_channels
            kernel_1x1 = nn.Conv2d(in_channels, out_channels, 1, device=conv3x3.weight.device)
            if len(scales) == 2:
                conv3x3.weight.data = conv3x3.weight * scales[1].view(-1, 1, 1, 1) \
                                      + F.pad(kernel_1x1.weight, [1, 1, 1, 1]) * scales[0].view(-1, 1, 1, 1)
            else:
                assert len(scales) == 3
                assert in_channels == out_channels
                identity = torch.from_numpy(np.eye(out_channels, dtype=np.float32).reshape(out_channels, out_channels, 1, 1)).to(conv3x3.weight.device)
                conv3x3.weight.data = conv3x3.weight * scales[2].view(-1, 1, 1, 1) + F.pad(kernel_1x1.weight, [1, 1, 1, 1]) * scales[1].view(-1, 1, 1, 1)
                if use_identity_scales:     # You may initialize the imaginary CSLA block with the trained identity_scale values. Makes almost no difference.
                    identity_scale_weight = scales[0]
                    conv3x3.weight.data += F.pad(identity * identity_scale_weight.view(-1, 1, 1, 1), [1, 1, 1, 1])
                else:
                    conv3x3.weight.data += F.pad(identity, [1, 1, 1, 1])

We also need the gradient masks (the Grad Mults) used during gradient re-parameterization. Their construction mirrors reinitialize and implements Equation 3: the squared 3x3 scale covers the whole kernel, the squared 1x1 scale is added at the central position, and an extra 1 is added at the central position of the identity channels. The specific implementation is as follows.

def generate_gradient_masks(self, scales_by_idx, conv3x3_by_idx, cpu_mode=False):
        self.grad_mask_map = {}
        for scales, conv3x3 in zip(scales_by_idx, conv3x3_by_idx):
            para = conv3x3.weight
            if len(scales) == 2:
                mask = torch.ones_like(para, device=scales[0].device) * (scales[1] ** 2).view(-1, 1, 1, 1)
                mask[:, :, 1:2, 1:2] += torch.ones(para.shape[0], para.shape[1], 1, 1, device=scales[0].device) * (scales[0] ** 2).view(-1, 1, 1, 1)
            else:
                mask = torch.ones_like(para, device=scales[0].device) * (scales[2] ** 2).view(-1, 1, 1, 1)
                mask[:, :, 1:2, 1:2] += torch.ones(para.shape[0], para.shape[1], 1, 1, device=scales[0].device) * (scales[1] ** 2).view(-1, 1, 1, 1)
                ids = np.arange(para.shape[1])
                assert para.shape[1] == para.shape[0]
                mask[ids, ids, 1:2, 1:2] += 1.0
            if cpu_mode:
                self.grad_mask_map[para] = mask
            else:
                self.grad_mask_map[para] = mask.cuda()
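
The RepVGGOptimizer then applies these masks when it updates the weights. A minimal sketch of the idea (illustrative only, not the repository's exact implementation; it assumes grad_mask_map is the {parameter: mask} dictionary built above): before the usual SGD update, the gradient of each re-parameterized 3x3 kernel is multiplied element-wise by its Grad Mult mask.

import torch

class MaskedSGD(torch.optim.SGD):
    def __init__(self, params, grad_mask_map, **kwargs):
        super().__init__(params, **kwargs)
        self.grad_mask_map = grad_mask_map          # {parameter: mask of the same shape}

    @torch.no_grad()
    def step(self, closure=None):
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is not None and p in self.grad_mask_map:
                    p.grad.mul_(self.grad_mask_map[p])   # gradient re-parameterization
        return super().step(closure)

# Hypothetical usage: opt = MaskedSGD(model.parameters(), grad_mask_map, lr=0.1, momentum=0.9)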

By means of gradient re-parameterization, Repopt turns the structural prior into a gradient prior, which unifies the training-time and inference-time model structure and effectively solves RepVGG's quantization-unfriendliness. The same design is adopted in YOLOv6 and shows excellent performance.


Origin blog.csdn.net/litt1e/article/details/128129239