The previous article introduced RepVGG, which is difficult to quantize. RepOpt addresses this quantization unfriendliness by moving the structural prior into the optimizer, so that the training and testing models are identical.
Link to the paper: Re-parameterizing Your Optimizers rather than Architectures
Introduction
RepOpt proposes using the prior information of the model structure directly to modify gradient values. This is called gradient re-parameterization, and the corresponding optimizer is called RepOptimizer. RepOpt focuses on plain VGG-style models: the trained RepOptVGG model has the same structure as VGG, trains efficiently, is simple and direct, and runs inference very fast.
Differences from RepVGG:
1) RepVGG adds structural priors (shortcut and 1x1 branches) during training and fuses the branches into a single 3x3 convolution for inference. RepOptVGG moves the structural prior into the gradients, which is implemented by the RepOpt optimizer.
2) Structurally, RepOptVGG is a truly plain model that is identical during training and testing. RepVGG needs more GPU memory and training time because of its multiple training-time branches.
3) RepOptVGG replaces structural re-parameterization with an equivalent gradient re-parameterization, implemented through a custom optimizer.
Idea
RepOpt observes an interesting property of structural priors: when each branch contains only one linear trainable operator, optionally followed by a constant scale, the performance of the model improves if the constant scale values are set properly. Such a linear block is called Constant-Scale Linear Addition (CSLA). A CSLA block can be replaced by a single operator with equivalent training dynamics if the optimizer multiplies the gradients by constants derived from the scales. RepOpt calls these multipliers Grad Mult, as shown in the image above.
Proof: Training a CSLA block with regular SGD is equivalent to training a simple convolution with modified gradients
Each branch in the CSLA block contains only one trainable linear operator, and there are no nonlinear operations such as BN or dropout in the structure. Repopt found that training a CSLA block with regular SGD is equivalent to training a simple convolution with modified gradients. The following is a simple example to prove this conclusion.
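Suppose a CSLA block consists of two convolutions of the same shape, each followed by a constant scale. Let $\alpha_A$ and $\alpha_B$ be the two constant scalars, $W^{(A)}$ and $W^{(B)}$ the two convolution kernels, $X$ the input, $Y_{CSLA}$ the output of the block, and $*$ the convolution operation:

$$Y_{CSLA} = \alpha_A (X * W^{(A)}) + \alpha_B (X * W^{(B)})$$

The gradient re-parameterized counterpart is $Y_{GR} = X * W'$, where $W'$ is the kernel trained with modified gradients. Let $L$ be the loss function, $i$ the training iteration index, $\lambda$ the learning rate, $\frac{\partial L}{\partial W}$ the gradient of a kernel $W$, and $F(\frac{\partial L}{\partial W'})$ an arbitrary transformation of the gradient of $W'$. We want the output of the CSLA block to equal the output of the re-parameterized convolution after any number of training iterations, namely

$$Y_{CSLA}^{(i)} = Y_{GR}^{(i)}, \quad \forall i \ge 0 \tag{5}$$

By the additivity and homogeneity of convolution, it suffices to guarantee Equation 6:

$$\alpha_A W^{(A)(i)} + \alpha_B W^{(B)(i)} = W'^{(i)}, \quad \forall i \ge 0 \tag{6}$$

Before iteration $i = 0$, a correct initialization ensures that Equation 6 holds. The initialization is shown in Equation 7, which gives Equation 8:

$$W'^{(0)} = \alpha_A W^{(A)(0)} + \alpha_B W^{(B)(0)} \tag{7}$$

$$Y_{CSLA}^{(0)} = Y_{GR}^{(0)} \tag{8}$$

Next, we use mathematical induction to show that, with a suitable transformation of the gradient of $W'$, Equation 6 always holds. The SGD update of each CSLA kernel is

$$W^{(A)(i+1)} = W^{(A)(i)} - \lambda \frac{\partial L}{\partial W^{(A)(i)}}, \qquad W^{(B)(i+1)} = W^{(B)(i)} - \lambda \frac{\partial L}{\partial W^{(B)(i)}} \tag{9}$$

After updating the CSLA block, we obtain Equation 10:

$$\alpha_A W^{(A)(i+1)} + \alpha_B W^{(B)(i+1)} = \alpha_A W^{(A)(i)} + \alpha_B W^{(B)(i)} - \lambda \left( \alpha_A \frac{\partial L}{\partial W^{(A)(i)}} + \alpha_B \frac{\partial L}{\partial W^{(B)(i)}} \right) \tag{10}$$

We use $F(\frac{\partial L}{\partial W'})$ to update $W'$, which means

$$W'^{(i+1)} = W'^{(i)} - \lambda F\left( \frac{\partial L}{\partial W'^{(i)}} \right) \tag{11}$$

Assume that Equation 6 holds at iteration $i$. Comparing Equations 6, 10 and 11, the equivalence is preserved at iteration $i+1$ if Equation 12 holds:

$$F\left( \frac{\partial L}{\partial W'^{(i)}} \right) = \alpha_A \frac{\partial L}{\partial W^{(A)(i)}} + \alpha_B \frac{\partial L}{\partial W^{(B)(i)}} \tag{12}$$

Taking the partial derivatives of Equation 6, we have Equation 13:

$$\frac{\partial W'^{(i)}}{\partial W^{(A)(i)}} = \alpha_A, \qquad \frac{\partial W'^{(i)}}{\partial W^{(B)(i)}} = \alpha_B \tag{13}$$

Applying the chain rule to Equation 12 with Equation 13, we obtain Equation 14, which is the exact form of $F(\frac{\partial L}{\partial W'})$:

$$F\left( \frac{\partial L}{\partial W'^{(i)}} \right) = \alpha_A \frac{\partial L}{\partial W'^{(i)}} \frac{\partial W'^{(i)}}{\partial W^{(A)(i)}} + \alpha_B \frac{\partial L}{\partial W'^{(i)}} \frac{\partial W'^{(i)}}{\partial W^{(B)(i)}} = (\alpha_A^2 + \alpha_B^2) \frac{\partial L}{\partial W'^{(i)}} \tag{14}$$

From Equations 11 and 14, at iteration $i+1$ the following holds:

$$W'^{(i+1)} = W'^{(i)} - \lambda (\alpha_A^2 + \alpha_B^2) \frac{\partial L}{\partial W'^{(i)}}$$

Since Equation 6 is assumed to hold at iteration $i$, combining this with Equations 10 and 12 gives

$$\alpha_A W^{(A)(i+1)} + \alpha_B W^{(B)(i+1)} = W'^{(i+1)}$$

With the initial conditions (Equations 7 and 8) and mathematical induction, Equation 6 therefore holds for all $i \ge 0$, and the exact form of $F(\frac{\partial L}{\partial W'})$ is given by Equation 14.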
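The induction above can be checked numerically on a scalar toy problem. The sketch below is not taken from the paper's code. It assumes a squared-error loss, scalar "kernels" `w_a` and `w_b` (so convolution reduces to multiplication), and hypothetical function names `train_csla` and `train_gr`. It compares plain SGD on the two-branch CSLA form with SGD on a single merged weight whose gradient is multiplied by $\alpha_A^2 + \alpha_B^2$.

```python
def train_csla(alpha_a, alpha_b, w_a, w_b, data, lr, steps):
    """Plain SGD on y = alpha_a * (x * w_a) + alpha_b * (x * w_b)."""
    for i in range(steps):
        x, target = data[i % len(data)]
        y = alpha_a * x * w_a + alpha_b * x * w_b
        dl_dy = y - target                    # loss = 0.5 * (y - target) ** 2
        w_a -= lr * dl_dy * alpha_a * x       # gradient w.r.t. w_a
        w_b -= lr * dl_dy * alpha_b * x       # gradient w.r.t. w_b
    return alpha_a * w_a + alpha_b * w_b      # merged weight of the CSLA block


def train_gr(alpha_a, alpha_b, w_a, w_b, data, lr, steps):
    """SGD on a single weight with the gradient scaled by the multiplier."""
    w = alpha_a * w_a + alpha_b * w_b         # initialization (Equation 7)
    grad_mult = alpha_a ** 2 + alpha_b ** 2   # multiplier from Equation 14
    for i in range(steps):
        x, target = data[i % len(data)]
        y = x * w
        grad = (y - target) * x               # gradient w.r.t. the merged weight
        w -= lr * grad_mult * grad
    return w


# Toy data and initial values chosen arbitrarily for illustration.
data = [(1.0, 2.0), (-0.5, 0.3), (2.0, -1.0)]
merged_csla = train_csla(0.7, 1.3, 0.2, -0.4, data, lr=0.05, steps=50)
merged_gr = train_gr(0.7, 1.3, 0.2, -0.4, data, lr=0.05, steps=50)
print(abs(merged_csla - merged_gr))           # difference is floating-point noise
```

Both runs see the same samples in the same order, so the two trajectories differ only by rounding error.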
Method
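Above, we introduced the CSLA block as a suitable structural prior and proved by mathematical induction that a CSLA block is equivalent to a single convolution trained with re-parameterized gradients. Below, we use RepOpt-VGG as an example to show how RepOpt designs and describes the gradient re-parameterization.
In RepOptVGG, the CSLA counterpart of a RepVGG block replaces the 3x3 convolution, 1x1 convolution and BN layers of the RepVGG block with a 3x3 convolution and a 1x1 convolution, each followed by a channel-wise scaling layer. Extending the two-branch result to this multi-branch case, let $s$ and $t$ be the scaling coefficients of the 3x3 convolution and the 1x1 convolution respectively. The update rule multiplies the gradient of the single 3x3 kernel element-wise by the Grad Mult $M$ (formula 3):

$$M_{c,d,p,q} = \begin{cases} 1 + s_c^2 + t_c^2 & \text{if the block has an identity branch, } c = d,\ p = 2,\ q = 2 \\ s_c^2 + t_c^2 & \text{if (the block has no identity branch or } c \ne d\text{),}\ p = 2,\ q = 2 \\ s_c^2 & \text{otherwise} \end{cases} \tag{3}$$

Here $c$ and $d$ index the output and input channels of the kernel, and $(p, q)$ is the position inside the 3x3 kernel, with $(2, 2)$ being the center.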
Formula 3 is easiest to understand alongside RepVGG. When the numbers of input and output channels differ, there are only two branches, conv3x3 and conv1x1. A conv1x1 is equivalent to a special conv3x3 whose only nonzero entry is at the center, so, as proved above, the gradient multiplier is $s_c^2 + t_c^2$ at the center position and $s_c^2$ elsewhere. When the numbers of input and output channels are equal, there are three branches: identity, conv3x3 and conv1x1. The identity is also equivalent to a special conv3x3 whose kernel consists of 0s and 1s (a 1 at the center wherever the input and output channel indices match), so the gradient multiplier at those positions becomes $1 + s_c^2 + t_c^2$.
Note that CSLA has no training-time nonlinearity such as BN and no non-sequential trainable parameters. CSLA serves here only as an intermediate tool for describing RepOptimizer.
The remaining question is how to determine these scaling factors.
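The case analysis above can be written out as a small stand-alone function. This is only an illustrative sketch in pure Python with nested lists. The name `grad_mult` and its arguments are hypothetical, and the indexing `mask[c][d][p][q]` (output channel, input channel, kernel row, kernel column, zero-based) is an assumption chosen to match the description above.

```python
def grad_mult(s, t, in_channels, has_identity):
    """Build the gradient multiplier for a 3x3 kernel as nested lists.

    s[c], t[c]: constant scales of the 3x3 and 1x1 branches for output channel c.
    Returns mask[c][d][p][q] with zero-based indices, so the center is [1][1].
    """
    out_channels = len(s)
    if has_identity and in_channels != out_channels:
        raise ValueError("an identity branch needs in_channels == out_channels")
    mask = []
    for c in range(out_channels):
        per_output = []
        for d in range(in_channels):
            # Every position receives the 3x3 branch contribution s_c^2.
            kernel = [[s[c] ** 2] * 3 for _ in range(3)]
            # The 1x1 branch only touches the center position.
            kernel[1][1] += t[c] ** 2
            # The identity branch only touches the center of the diagonal (c == d).
            if has_identity and c == d:
                kernel[1][1] += 1.0
            per_output.append(kernel)
        mask.append(per_output)
    return mask


# Example with two channels and an identity branch.
m = grad_mult(s=[2.0, 3.0], t=[1.0, 0.5], in_channels=2, has_identity=True)
print(m[0][0][1][1], m[0][1][1][1], m[0][0][0][0])
```

The three printed entries correspond to the three cases: center on the diagonal, center off the diagonal, and a non-center position.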
HyperSearch
Inspired by DARTS, we replace the constant scaling factors in CSLA with trainable parameters and train this model on a small dataset (such as CIFAR-100). After training on the small dataset, the trained values are fixed as constants.
Code
LinearAddBlock defines the CSLA block. It is trained only during HyperSearch, to determine the scales.

import numpy as np                      # used by the optimizer methods further below
import torch
import torch.nn as nn
import torch.nn.functional as F         # used by the optimizer methods further below
import torch.nn.init as init
from torch.nn.parameter import Parameter


class LinearAddBlock(nn.Module):
    # CSLA block: scaled 3x3 conv + scaled 1x1 conv (+ scaled identity), then BN and ReLU.
    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1, padding=1,
                 dilation=1, groups=1, padding_mode='zeros', use_se=False, is_csla=False, conv_scale_init=1.0):
        super(LinearAddBlock, self).__init__()
        self.in_channels = in_channels
        self.relu = nn.ReLU()
        self.conv = nn.Conv2d(in_channels=in_channels, out_channels=out_channels, kernel_size=kernel_size, stride=stride, padding=padding, bias=False)
        self.scale_conv = ScaleLayer(num_features=out_channels, use_bias=False, scale_init=conv_scale_init)
        self.conv_1x1 = nn.Conv2d(in_channels=in_channels, out_channels=out_channels, kernel_size=1, stride=stride, padding=0, bias=False)
        self.scale_1x1 = ScaleLayer(num_features=out_channels, use_bias=False, scale_init=conv_scale_init)
        # The identity branch exists only when input and output shapes match.
        if in_channels == out_channels and stride == 1:
            self.scale_identity = ScaleLayer(num_features=out_channels, use_bias=False, scale_init=1.0)
        self.bn = nn.BatchNorm2d(out_channels)
        if is_csla:     # Make them constant
            self.scale_1x1.requires_grad_(False)
            self.scale_conv.requires_grad_(False)
        if use_se:
            raise NotImplementedError("se block not supported yet")
        else:
            self.se = nn.Identity()

    def forward(self, inputs):
        # Sum of the scaled branches, followed by a single BN and ReLU.
        out = self.scale_conv(self.conv(inputs)) + self.scale_1x1(self.conv_1x1(inputs))
        if hasattr(self, 'scale_identity'):
            out += self.scale_identity(inputs)
        out = self.relu(self.se(self.bn(out)))
        return out


class ScaleLayer(torch.nn.Module):
    # Channel-wise scaling (and optional bias) applied to an NCHW tensor.
    def __init__(self, num_features, use_bias=True, scale_init=1.0):
        super(ScaleLayer, self).__init__()
        self.weight = Parameter(torch.Tensor(num_features))
        init.constant_(self.weight, scale_init)
        self.num_features = num_features
        if use_bias:
            self.bias = Parameter(torch.Tensor(num_features))
            init.zeros_(self.bias)
        else:
            self.bias = None

    def forward(self, inputs):
        if self.bias is None:
            return inputs * self.weight.view(1, self.num_features, 1, 1)
        else:
            return inputs * self.weight.view(1, self.num_features, 1, 1) + self.bias.view(1, self.num_features, 1, 1)

RealVGGBlock is the block actually used in RepOptVGG. Its structure is a plain conv-BN-ReLU, as shown below.

class RealVGGBlock(nn.Module):
    # Plain block: a single 3x3 conv followed by BN and ReLU, identical in training and testing.
    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1, padding=1,
                 dilation=1, groups=1, padding_mode='zeros', use_se=False,
                 ):
        super(RealVGGBlock, self).__init__()
        self.relu = nn.ReLU()
        self.conv = nn.Conv2d(in_channels=in_channels, out_channels=out_channels, kernel_size=kernel_size, stride=stride, padding=padding, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        if use_se:
            raise NotImplementedError("se block not supported yet")
        else:
            self.se = nn.Identity()

    def forward(self, inputs):
        out = self.relu(self.se(self.bn(self.conv(inputs))))
        return out

Assuming that HyperSearch on the small dataset has produced the scales, RepVGGOptimizer must use the scales of the CSLA blocks to re-initialize each RealVGGBlock when training of RepOptVGG starts. This is done in reinitialize, which builds the equivalent 3x3 kernel from the same branch structure as formula 3 in Method.

# Method of RepVGGOptimizer. Requires torch, torch.nn as nn, torch.nn.functional as F and numpy as np.
# As the indexing below shows, scales is [scale_1x1, scale_conv] for two branches
# and [scale_identity, scale_1x1, scale_conv] for three branches.
def reinitialize(self, scales_by_idx, conv3x3_by_idx, use_identity_scales):
    for scales, conv3x3 in zip(scales_by_idx, conv3x3_by_idx):
        in_channels = conv3x3.in_channels
        out_channels = conv3x3.out_channels
        # Randomly initialized kernel of the imaginary 1x1 branch.
        kernel_1x1 = nn.Conv2d(in_channels, out_channels, 1, device=conv3x3.weight.device)
        if len(scales) == 2:
            # Two branches: scaled 3x3 kernel + scaled 1x1 kernel zero-padded to 3x3.
            conv3x3.weight.data = conv3x3.weight * scales[1].view(-1, 1, 1, 1) \
                                  + F.pad(kernel_1x1.weight, [1, 1, 1, 1]) * scales[0].view(-1, 1, 1, 1)
        else:
            assert len(scales) == 3
            assert in_channels == out_channels
            # Identity branch expressed as a 1x1 kernel with ones on the channel diagonal.
            identity = torch.from_numpy(np.eye(out_channels, dtype=np.float32).reshape(out_channels, out_channels, 1, 1)).to(conv3x3.weight.device)
            conv3x3.weight.data = conv3x3.weight * scales[2].view(-1, 1, 1, 1) + F.pad(kernel_1x1.weight, [1, 1, 1, 1]) * scales[1].view(-1, 1, 1, 1)
            if use_identity_scales:     # You may initialize the imaginary CSLA block with the trained identity_scale values. Makes almost no difference.
                identity_scale_weight = scales[0]
                conv3x3.weight.data += F.pad(identity * identity_scale_weight.view(-1, 1, 1, 1), [1, 1, 1, 1])
            else:
                conv3x3.weight.data += F.pad(identity, [1, 1, 1, 1])

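To see why the merged kernel built in reinitialize reproduces the multi-branch output, the following sketch checks the three-branch case for a single channel in pure Python. It is a simplified illustration, not the repository's code: the helpers `conv2d_same` and `merge_kernels` are hypothetical, the scales are plain scalars, and the identity branch is left unscaled (the `use_identity_scales=False` path).

```python
def conv2d_same(image, kernel):
    """Single-channel cross-correlation, stride 1, zero padding, odd k x k kernel."""
    k = len(kernel)
    r = k // 2
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for p in range(k):
                for q in range(k):
                    y, x = i + p - r, j + q - r
                    if 0 <= y < h and 0 <= x < w:   # outside the image counts as zero
                        acc += image[y][x] * kernel[p][q]
            out[i][j] = acc
    return out


def merge_kernels(k3, k1, s, t, has_identity):
    """Equivalent 3x3 kernel: s * k3 + t * pad(k1) + pad(identity)."""
    merged = [[s * k3[p][q] for q in range(3)] for p in range(3)]
    merged[1][1] += t * k1          # 1x1 kernel zero-padded to 3x3
    if has_identity:
        merged[1][1] += 1.0         # identity as a 3x3 kernel with a 1 at the center
    return merged


# Arbitrary illustrative values.
image = [[1.0, 2.0, 0.0], [0.5, -1.0, 3.0], [2.0, 0.0, 1.0]]
k3 = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1], [0.2, 0.1, -0.3]]
k1, s, t = 0.7, 1.5, 0.4

# Three-branch output: scaled 3x3 conv + scaled 1x1 conv + identity.
out3 = conv2d_same(image, k3)
out1 = conv2d_same(image, [[k1]])
branches = [[s * out3[i][j] + t * out1[i][j] + image[i][j] for j in range(3)]
            for i in range(3)]

# Output of one convolution with the merged kernel.
single = conv2d_same(image, merge_kernels(k3, k1, s, t, has_identity=True))
max_diff = max(abs(branches[i][j] - single[i][j]) for i in range(3) for j in range(3))
print(max_diff)                     # floating-point noise only
```

In the multi-channel case the same merge is applied per output channel, which is what the `view(-1, 1, 1, 1)` broadcasts in reinitialize do.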
Gradient re-parameterization also needs the gradient mask, that is, the Grad Mult of formula 3. It is built for the same cases as in reinitialize, and the implementation is as follows.

# Method of RepVGGOptimizer. Requires torch and numpy as np.
def generate_gradient_masks(self, scales_by_idx, conv3x3_by_idx, cpu_mode=False):
    self.grad_mask_map = {}
    for scales, conv3x3 in zip(scales_by_idx, conv3x3_by_idx):
        para = conv3x3.weight
        if len(scales) == 2:
            # Two branches: s^2 everywhere, plus t^2 at the center position.
            mask = torch.ones_like(para, device=scales[0].device) * (scales[1] ** 2).view(-1, 1, 1, 1)
            mask[:, :, 1:2, 1:2] += torch.ones(para.shape[0], para.shape[1], 1, 1, device=scales[0].device) * (scales[0] ** 2).view(-1, 1, 1, 1)
        else:
            # Three branches: additionally add 1 at the center of the channel diagonal.
            mask = torch.ones_like(para, device=scales[0].device) * (scales[2] ** 2).view(-1, 1, 1, 1)
            mask[:, :, 1:2, 1:2] += torch.ones(para.shape[0], para.shape[1], 1, 1, device=scales[0].device) * (scales[1] ** 2).view(-1, 1, 1, 1)
            ids = np.arange(para.shape[1])
            assert para.shape[1] == para.shape[0]
            mask[ids, ids, 1:2, 1:2] += 1.0
        if cpu_mode:
            self.grad_mask_map[para] = mask
        else:
            self.grad_mask_map[para] = mask.cuda()

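Once the masks exist, the optimizer only has to multiply each gradient by its mask before the usual update. The sketch below shows one such step on flat lists of floats. It is a hedged illustration rather than the actual RepVGGOptimizer: the function name `masked_sgd_step` is hypothetical, and applying the mask to the raw gradient before the momentum buffer is an assumption of this sketch.

```python
def masked_sgd_step(weights, grads, masks, buffers, lr, momentum=0.9):
    """One in-place SGD-with-momentum step on flat lists of floats.

    The mask scales the raw gradient first (assumption of this sketch);
    the scaled gradient then goes through the ordinary momentum update.
    """
    for k in range(len(weights)):
        scaled_grad = grads[k] * masks[k]
        buffers[k] = momentum * buffers[k] + scaled_grad
        weights[k] -= lr * buffers[k]


# Two weights with the same raw gradient but different multipliers.
weights = [1.0, -2.0]
buffers = [0.0, 0.0]
masked_sgd_step(weights, grads=[0.5, 0.5], masks=[4.0, 6.0],
                buffers=buffers, lr=0.1, momentum=0.0)
print(weights, buffers)
```

The weight with the larger multiplier moves further for the same raw gradient, which is how the structural prior enters training without any extra branches in the model.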
RepOpt's gradient re-parameterization turns the structural prior into a gradient prior, so the training and test models share the same structure, which effectively addresses the quantization unfriendliness of RepVGG. This design is used in YOLOv6, where it shows excellent performance.