Deep Learning Knowledge 6 (Model Quantization and Compression): PyTorch custom Modules, and understanding the DoReFaNet network definition through them.

This is based on the official Chinese material; for details, refer to: How to customize a Module in PyTorch.

1. Custom Module

Module is the basic way PyTorch organizes neural networks: a Module holds the model parameters and the computation logic, while a Function carries the actual computation and defines the forward and backward logic.
The following uses the simplest MLP network structure as an example to show how to implement a custom network structure. The complete code can be found in the repo.

1.1 Function

Function is the core class of PyTorch's automatic differentiation mechanism. A Function is parameterless, i.e. stateless: in the forward pass it only receives inputs and returns the corresponding outputs; in the backward pass it receives the gradients of its outputs and returns the gradients of its inputs.

Here we only focus on how to customize a Function. See the source code for the full definition of Function. Below is a simplified code snippet:

class Function(object):
    def forward(self, *input):
        raise NotImplementedError
 
    def backward(self, *grad_output):
        raise NotImplementedError

The inputs and outputs of forward and backward are Tensor objects.

A Function object is callable, i.e. it can be invoked with (). The inputs and outputs of that call are Variable objects. The following example shows how to implement a ReLU activation function and call it:


import torch
from torch.autograd import Function
 
class ReLUF(Function):
    def forward(self, input):
        self.save_for_backward(input)
 
        output = input.clamp(min=0)
        return output
 
    def backward(self, output_grad):
        input = self.saved_tensors[0]
 
        input_grad = output_grad.clone()
        input_grad[input < 0] = 0
        return input_grad
 
## Test
if __name__ == "__main__":
      from torch.autograd import Variable
 
      torch.manual_seed(1111)  
      a = torch.randn(2, 3)
 
      va = Variable(a, requires_grad=True)
      vb = ReLUF()(va)
      print(va.data, vb.data)
 
      vb.backward(torch.ones(va.size()))
      print(va.grad.data)  # vb is not a leaf Variable, so its .grad is not retained by default

If the forward inputs are needed in the backward pass, they must be saved explicitly in forward. In the code above, forward uses self.save_for_backward to stash the input, and backward retrieves it from saved_tensors (a Python tuple).

Naturally, the forward inputs must correspond to the backward outputs (the returned gradients), and the forward outputs must correspond to the backward inputs (the incoming gradients).
Note:
Since a Function may need to temporarily store input tensors, it is recommended not to reuse a Function object, to avoid tensors being freed too early.
As the sample code shows, a new ReLUF object is created on every forward call; do not create one during initialization and then call it repeatedly in forward. (In other words, the sample code is correct: each call creates a fresh object, because each object manages its own saved inputs.)
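
Note: the code above uses the legacy member-method Function API from early PyTorch. In current PyTorch the same ReLU would be written with @staticmethod and a ctx object; a minimal equivalent sketch (not part of the original post):

import torch
from torch.autograd import Function

class ReLUF2(Function):
    @staticmethod
    def forward(ctx, input):
        # stash the input for the backward pass
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input

# static-method Functions are applied via .apply, not by instantiating them
x = torch.randn(2, 3, requires_grad=True)
y = ReLUF2.apply(x)
y.backward(torch.ones_like(y))
print(x.grad)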

2 Module

Similar to Function, a Module object is also callable, and its inputs and outputs are also Variables. The difference is that a Module can hold parameters. A Module consists of two main parts: parameters and computation logic (Function calls). Since the ReLU activation function has no parameters, here we use the most basic fully connected layer as an example of how to customize a Module.
The computation logic of the fully connected layer is defined by the following Function:

import torch
from torch.autograd import Function
 
class LinearF(Function):

    def forward(self, input, weight, bias=None):
        self.save_for_backward(input, weight, bias)

        output = torch.mm(input, weight.t())
        if bias is not None:
            output += bias.unsqueeze(0).expand_as(output)

        return output

    def backward(self, grad_output):
        input, weight, bias = self.saved_tensors

        grad_input = grad_weight = grad_bias = None
        if self.needs_input_grad[0]:
            grad_input = torch.mm(grad_output, weight)
        if self.needs_input_grad[1]:
            grad_weight = torch.mm(grad_output.t(), input)
        if bias is not None and self.needs_input_grad[2]:
            grad_bias = grad_output.sum(0).squeeze(0)

        if bias is not None:
            return grad_input, grad_weight, grad_bias
        else:
            return grad_input, grad_weight

needs_input_grad is a tuple of booleans whose length equals the number of forward inputs; it indicates whether each input requires a gradient, so unnecessary computation can be skipped for inputs that do not.

The Function (LinearF here) defines the basic computation logic. The Module only needs to allocate memory for the parameters during initialization and pass them to the corresponding Function object during computation. The code is shown below:

import torch
import torch.nn as nn
 
class Linear(nn.Module):

    def __init__(self, in_features, out_features, bias=True):
        super(Linear, self).__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.weight = nn.Parameter(torch.Tensor(out_features, in_features))
        if bias:
            self.bias = nn.Parameter(torch.Tensor(out_features))
        else:
            self.register_parameter('bias', None)

    def forward(self, input):
        return LinearF()(input, self.weight, self.bias)

It should be noted that a parameter is essentially memory held by a tensor, but the tensor must be wrapped as a Parameter object. Parameter is a special subclass of Variable; the only difference is that requires_grad defaults to True for a Parameter. Variable is the core class of the automatic differentiation mechanism and is not covered here; please refer to the tutorial.
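
Putting the pieces together, the simple MLP mentioned at the beginning can be sketched from the custom Linear and ReLUF above (a sketch based on these definitions, not necessarily the repo's exact code):

import torch
import torch.nn as nn
from torch.autograd import Variable

class MLP(nn.Module):
    def __init__(self, in_features=784, hidden=256, num_classes=10):
        super(MLP, self).__init__()
        # two of the custom fully connected layers defined above
        self.fc1 = Linear(in_features, hidden)
        self.fc2 = Linear(hidden, num_classes)
        # the raw torch.Tensor parameters are uninitialized, so give them values
        for p in self.parameters():
            p.data.normal_(0, 0.01)

    def forward(self, x):
        # a fresh ReLUF object is created on every call, as noted above
        x = ReLUF()(self.fc1(x))
        return self.fc2(x)

if __name__ == "__main__":
    net = MLP()
    x = Variable(torch.randn(4, 784))
    print(net(x).size())   # (4, 10)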

2.1 Custom Recurrent Neural Network (RNN)


Runnable code reference: RNN
Since Parameter is a subclass of Variable and is handled by the automatic differentiation mechanism, when defining a network we basically do not need to write any backward logic; defining the forward pass is enough.
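
To illustrate, a minimal Elman-style RNN cell can be written with nn.Parameter and only a forward method; autograd derives the backward pass automatically (a sketch with an assumed tanh nonlinearity, not the referenced repo code):

import torch
import torch.nn as nn

class SimpleRNNCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(SimpleRNNCell, self).__init__()
        # parameters only; no backward is written anywhere
        self.w_ih = nn.Parameter(torch.randn(hidden_size, input_size) * 0.01)
        self.w_hh = nn.Parameter(torch.randn(hidden_size, hidden_size) * 0.01)
        self.bias = nn.Parameter(torch.zeros(hidden_size))

    def forward(self, x, h):
        # h_t = tanh(W_ih * x_t + W_hh * h_{t-1} + b)
        return torch.tanh(x.mm(self.w_ih.t()) + h.mm(self.w_hh.t()) + self.bias)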

3 Define DoReFaNet

Since DoReFaNet compresses the weights (and activations), we only need to define a quantized convolution class and an activation quantization class, fetch the weights inside them, and quantize them. The network is then defined as follows:

class AlexNet_Q(nn.Module):
  def __init__(self, wbit, abit, num_classes=1000):
    super(AlexNet_Q, self).__init__()
    Conv2d = conv2d_Q_fn(w_bit=wbit)
    Linear = linear_Q_fn(w_bit=wbit)

    self.features = nn.Sequential(
      nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2),
      nn.BatchNorm2d(96),
      nn.ReLU(inplace=True),
      nn.MaxPool2d(kernel_size=3, stride=2),

      Conv2d(96, 256, kernel_size=5, padding=2),
      nn.BatchNorm2d(256),
      nn.ReLU(inplace=True),
      activation_quantize_fn(a_bit=abit),
      nn.MaxPool2d(kernel_size=3, stride=2),

      Conv2d(256, 384, kernel_size=3, padding=1),
      nn.ReLU(inplace=True),
      activation_quantize_fn(a_bit=abit),

      Conv2d(384, 384, kernel_size=3, padding=1),
      nn.ReLU(inplace=True),
      activation_quantize_fn(a_bit=abit),

      Conv2d(384, 256, kernel_size=3, padding=1),
      nn.ReLU(inplace=True),
      activation_quantize_fn(a_bit=abit),
      nn.MaxPool2d(kernel_size=3, stride=2),
    )
    self.classifier = nn.Sequential(
      Linear(256 * 6 * 6, 4096),
      nn.ReLU(inplace=True),
      activation_quantize_fn(a_bit=abit),

      Linear(4096, 4096),
      nn.ReLU(inplace=True),
      activation_quantize_fn(a_bit=abit),
      nn.Linear(4096, num_classes),
    )

    for m in self.modules():
      if isinstance(m, Conv2d) or isinstance(m, Linear):
        init.xavier_normal_(m.weight.data)

  def forward(self, x):
    x = self.features(x)
    x = x.view(x.size(0), 256 * 6 * 6)
    x = self.classifier(x)
    return x
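
A quick usage sketch (assumed, not from the original post): build a model with 1-bit weights and 2-bit activations and push a dummy 224x224 batch through it.

import torch

model = AlexNet_Q(wbit=1, abit=2, num_classes=1000)
x = torch.randn(1, 3, 224, 224)
print(model(x).shape)  # torch.Size([1, 1000])
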
  • The custom weight quantization is defined as follows:

class weight_quantize_fn(nn.Module):
  def __init__(self, w_bit):
    super(weight_quantize_fn, self).__init__()
    assert w_bit <= 8 or w_bit == 32
    self.w_bit = w_bit
    self.uniform_q = uniform_quantize(k=w_bit)

  def forward(self, x):
    if self.w_bit == 32:
      weight_q = x
    elif self.w_bit == 1:
      E = torch.mean(torch.abs(x)).detach()
      weight_q = self.uniform_q(x / E) * E
    else:
      weight = torch.tanh(x)
      weight = weight / 2 / torch.max(torch.abs(weight)) + 0.5
      weight_q = 2 * self.uniform_q(weight) - 1
    return weight_q

def conv2d_Q_fn(w_bit):
  class Conv2d_Q(nn.Conv2d):
    def __init__(self, in_channels, out_channels, kernel_size, stride=1,
                 padding=0, dilation=1, groups=1, bias=True):
      super(Conv2d_Q, self).__init__(in_channels, out_channels, kernel_size, stride,
                                     padding, dilation, groups, bias)
      self.w_bit = w_bit
      self.quantize_fn = weight_quantize_fn(w_bit=w_bit)

    def forward(self, input, order=None):
      weight_q = self.quantize_fn(self.weight)
      # print(np.unique(weight_q.detach().numpy()))
      return F.conv2d(input, weight_q, self.bias, self.stride,
                      self.padding, self.dilation, self.groups)

  return Conv2d_Q
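
The k-bit branch of weight_quantize_fn above is exactly the DoReFa-Net weight quantization formula, written out here for reference:

w' = tanh(w) / (2 * max|tanh(w)|) + 1/2
w_q = 2 * quantize_k(w') - 1,   where   quantize_k(x) = round((2^k - 1) * x) / (2^k - 1)

For the 1-bit case the weight is binarized with sign() and rescaled by E = mean(|w|).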

The uniform_quantize k-bit quantization function used above is:

def uniform_quantize(k):
  class qfn(torch.autograd.Function):

    @staticmethod
    def forward(ctx, input):
      if k == 32:
        out = input
      elif k == 1:
        out = torch.sign(input)
      else:
        n = float(2 ** k - 1)
        out = torch.round(input * n) / n
      return out

    @staticmethod
    def backward(ctx, grad_output):
      grad_input = grad_output.clone()
      return grad_input

  return qfn().apply
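
A quick sanity check of uniform_quantize (a sketch using the definition above): the forward rounds to 2^k uniform levels in [0, 1], while the backward passes the gradient straight through (a straight-through estimator).

import torch

quant2 = uniform_quantize(k=2)   # 2-bit quantizer: levels 0, 1/3, 2/3, 1
x = torch.tensor([0.10, 0.45, 0.80], requires_grad=True)
y = quant2(x)
print(y)         # tensor([0.0000, 0.3333, 0.6667], ...)
y.sum().backward()
print(x.grad)    # tensor([1., 1., 1.]) -- gradient passed through unchanged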

  • After the activation layer, the outputs are quantized to the given bit width as follows:
class activation_quantize_fn(nn.Module):
  def __init__(self, a_bit):
    super(activation_quantize_fn, self).__init__()
    assert a_bit <= 8 or a_bit == 32
    self.a_bit = a_bit
    self.uniform_q = uniform_quantize(k=a_bit)

  def forward(self, x):
    if self.a_bit == 32:
      activation_q = x
    else:
      activation_q = self.uniform_q(torch.clamp(x, 0, 1))
      # print(np.unique(activation_q.detach().numpy()))
    return activation_q
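
For completeness, linear_Q_fn (used in AlexNet_Q above but not shown here) can be sketched by analogy with conv2d_Q_fn, using F.linear instead of F.conv2d (an assumed reconstruction, not the verbatim repo code):

import torch.nn as nn
import torch.nn.functional as F

def linear_Q_fn(w_bit):
  class Linear_Q(nn.Linear):
    def __init__(self, in_features, out_features, bias=True):
      super(Linear_Q, self).__init__(in_features, out_features, bias)
      self.w_bit = w_bit
      self.quantize_fn = weight_quantize_fn(w_bit=w_bit)

    def forward(self, input):
      # quantize the weight on the fly, then apply an ordinary linear layer
      weight_q = self.quantize_fn(self.weight)
      return F.linear(input, weight_q, self.bias)

  return Linear_Q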

4. Model channel compression, using the YOLOv3 code as an example:

Channel compression relies mainly on the distribution of the BN scale weight a in the BN transform a*x + b = output. After sparsity training, the weights a are sorted, a threshold is obtained from the chosen pruning ratio, and channels whose weight falls below the threshold are removed while those above it are kept.

  • Get the indices of the convolutional layers that contain a BN layer:
def parse_module_defs(module_defs):
    CBL_idx = []
    Conv_idx = []
    for i, module_def in enumerate(module_defs):
        # Only layers that have a BN layer get their index saved, to be used for sparsity/compression training.
        if module_def['type'] == 'convolutional':
            if module_def['batch_normalize'] == '1':
                CBL_idx.append(i)
            else:
                Conv_idx.append(i)
    # Exclude some layers from pruning in order to preserve accuracy.
    ignore_idx = set()
    for i, module_def in enumerate(module_defs):
        if module_def['type'] == 'shortcut':
            ignore_idx.add(i-1)
            identity_idx = (i + int(module_def['from']))
            if module_defs[identity_idx]['type'] == 'convolutional':
                ignore_idx.add(identity_idx)
            elif module_defs[identity_idx]['type'] == 'shortcut':
                ignore_idx.add(identity_idx - 1)

    ignore_idx.add(84)
    ignore_idx.add(96)
    # Get the indices of the layers that qualify for pruning.
    prune_idx = [idx for idx in CBL_idx if idx not in ignore_idx]

    return CBL_idx, Conv_idx, prune_idx

  • Code for training:
        for batch_i, (_, imgs, targets) in enumerate(dataloader):
            batches_done = len(dataloader) * epoch + batch_i

            imgs = imgs.to(device)
            targets = targets.to(device)

            loss, outputs = model(imgs, targets)

            optimizer.zero_grad()
            loss.backward()
            # Sparsity training of the model weights: opt.s controls the sparsity strength,
            # and prune_idx holds the indices of the convolutional layers to sparsify.
            BNOptimizer.updateBN(sr_flag, model.module_list, opt.s, prune_idx)

            optimizer.step()
  • The main function for the sparsity training. Principle: an extra term opt.s * sign(weight) is added to the gradient of the BN scale weight; this is the gradient of an L1 penalty, which pushes the less important channel weights toward zero so they can be pruned later. The code is as follows:
class BNOptimizer():

    @staticmethod
    def updateBN(sr_flag, module_list, s, prune_idx):
        if sr_flag:
            for idx in prune_idx:
                # Squential(Conv, BN, Lrelu)
                bn_module = module_list[idx][1]
                # torch.sign() is a step function whose values are only -1, 0 and 1.
                # s scales the term added to the weight gradient; it acts as the gradient of
                # an L1 penalty s*|gamma|, driving unimportant BN scale weights toward zero.
                bn_module.weight.grad.data.add_(s * torch.sign(bn_module.weight.data))  # L1

  • After the whole model has been trained, unnecessary channels can be deleted according to a prune ratio you set yourself. The function is as follows:
percent = 0.85
def prune_and_eval(model, sorted_bn, percent=.0):
    model_copy = deepcopy(model)
    # Get the cut-off channel index from the prune ratio
    thre_index = int(len(sorted_bn) * percent)
    # Then take the channel weight value at that cut-off index
    # and use it as the filtering threshold.
    thre = sorted_bn[thre_index]
    print(f'Channels with Gamma value less than {thre:.4f} are pruned!')
    remain_num = 0
    for idx in prune_idx:
        bn_module = model_copy.module_list[idx][1]
        # Build the channel mask from the threshold
        mask = obtain_bn_mask(bn_module, thre)
        remain_num += int(mask.sum())
        # Use the 0/1 mask to zero out pruned channels and keep the rest
        bn_module.weight.data.mul_(mask)
    mAP = eval_model(model_copy)[2].mean()

    print(f'Number of channels has been reduced from {len(sorted_bn)} to {remain_num}')
    print(f'Prune ratio: {1-remain_num/len(sorted_bn):.3f}')
    print(f'mAP of the pruned model is {mAP:.4f}')

    return thre
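
The sorted_bn argument above is not constructed in this excerpt; it can be obtained by concatenating the BN gamma (weight) values of all prunable layers and sorting them, roughly as follows (a sketch assuming the same module_list layout as above):

import torch

def gather_sorted_bn(model, prune_idx):
    # collect the absolute BN scale factors (gamma) of every prunable layer
    bn_weights = [model.module_list[idx][1].weight.data.abs().clone()
                  for idx in prune_idx]
    # flatten and sort ascending, so a percentile index gives the threshold
    sorted_bn, _ = torch.sort(torch.cat(bn_weights))
    return sorted_bn

# e.g. thre = prune_and_eval(model, gather_sorted_bn(model, prune_idx), percent=0.85)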

Original article: blog.csdn.net/yangdashi888/article/details/104986121