ECO: Efficient Convolutional Network for Online Video Understanding

Recently I needed to measure the FLOPs and parameter count of this network. Below is a brief introduction to part of the network structure and code, and how to run the test.

In the introduction, the authors point out that video understanding has made great progress with the support of deep learning, but many algorithms focus only on actions over short time spans. Such short-term content is insufficient if we want to understand the intent behind actions on longer time scales. At the same time, some 3D convolution methods are so computationally expensive that they can only process part of the video. These methods usually share one characteristic: they integrate information from different points in time only "after the fact".

To address these problems, the authors designed an end-to-end architecture. They observe that since a single frame can often already identify the action, large numbers of adjacent frames are redundant, so it is enough to process only one frame per time span. In addition, the authors argue that instead of aggregating the scores of the individual time spans into a final score, it is better to aggregate the feature information across the spans. We can therefore sample one frame at a regular spacing and feed these frames into a 3D convolution. This not only handles the temporal information better, but also covers a longer portion of the video.

Without further ado, let’s talk about the network structure:

The ECO Lite network structure (see the figure in the paper) works as follows: the video is divided into N segments of equal length, and one frame is sampled at random from each segment. These frames are first processed by a conventional 2D convolutional network to obtain feature maps; the feature maps are then stacked and used as input to a 3D convolutional network, which classifies the action based on the temporal information.

A key design choice here is random sampling. Sampling randomly within each segment increases the diversity of the sampled frames, so more variations of an action are seen by the network. Although the division into time segments could be studied further, the authors did not adopt a more complex dynamic partitioning, for the sake of simple inference.
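
To make the sampling concrete, here is a minimal sketch (not the repo's dataset code) that splits a clip of num_frames frames into num_segments equal segments and draws one random frame index from each:

import numpy as np

def sample_indices(num_frames, num_segments):
    # length of each equal segment (trailing frames are ignored if not divisible)
    seg_len = num_frames // num_segments
    # one random offset inside every segment gives num_segments frame indices
    offsets = np.random.randint(seg_len, size=num_segments)
    return np.arange(num_segments) * seg_len + offsets

# e.g. a 120-frame clip split into N = 4 segments -> one frame index per segment
print(sample_indices(120, 4))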

Considering the needs of different users, the authors call the design above the lightweight version (Lite; class ECO in the code) and extend it to a full-size version (Full; ECOfully in the code). The two versions are largely the same. The difference is that the Lite version relies mainly on the 3D network to extract long-term information, so short-term per-frame features may be under-used; the Full version adds a parallel 2D network to make further use of the 2D features. The extended design is described below:

The features produced by the newly added 2D network and by the original 3D network are concatenated before entering the classifier.
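
As an illustration of this fusion, the sketch below concatenates a pooled 2D feature with the 3D feature before a linear classifier. The dimensions (512 for the 3D branch, 1024 per frame for the 2D branch) are assumptions based on the paper's description, not values read from the repo's ECO.yaml:

import torch
from torch import nn

class EcoFullHead(nn.Module):
    # hedged sketch of the ECO Full fusion head; dim3d/dim2d are assumed values
    def __init__(self, num_classes, dim3d=512, dim2d=1024):
        super().__init__()
        self.fc = nn.Linear(dim3d + dim2d, num_classes)

    def forward(self, feat3d, feat2d_per_frame):
        # feat3d: (B, dim3d) from the 3D network
        # feat2d_per_frame: (B, N, dim2d) from the added 2D network
        feat2d = feat2d_per_frame.mean(dim=1)        # pool over the N frames
        fused = torch.cat([feat3d, feat2d], dim=1)   # concatenate before the classifier
        return self.fc(fused)

head = EcoFullHead(num_classes=400)
print(head(torch.randn(2, 512), torch.randn(2, 4, 1024)).shape)  # (2, 400)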

2D network:  BN-Inception is adopted, truncated after the inception-3c block; it was chosen for efficiency. The output is a feature map with 96 channels.

3D network:  a modified 3D-ResNet-18 is used; this structure is also widely used.

2D network (Full only):  this is the 2D network added in the full-size version. It also uses BN-Inception, taking the layers from inception-4a up to the last pooling layer. The output is N feature vectors of length 1024.
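
The handover from the 2D network to the 3D network is just a reshape plus transpose; the forward() code shown later does exactly this at the res3a_2 layer. A small sketch (the 28x28 spatial size is an assumption for 224x224 inputs):

import torch

B, N = 1, 4
# stacked per-frame feature maps from inception-3c: (B*N, 96, H, W)
feat2d = torch.randn(B * N, 96, 28, 28)
# reshape to (B, N, 96, H, W), then swap axes so time becomes the 3D conv's depth axis
feat5d = feat2d.view(B, N, 96, 28, 28).transpose(1, 2)
print(feat5d.shape)  # (B, 96, N, 28, 28), the input expected by the 3D network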

Please see the paper for the training details. Next, we look at part of the ECO network code:

import torch
from torch import nn
from .layer_factory import get_basic_layer, parse_expr
import torch.utils.model_zoo as model_zoo
import yaml


class ECO(nn.Module):
    def __init__(self, model_path='tf_model_zoo/ECO/ECO.yaml', num_classes=101,
                       num_segments=4, pretrained_parts='both'):

        super(ECO, self).__init__()

        self.num_segments = num_segments

        self.pretrained_parts = pretrained_parts

        manifest = yaml.safe_load(open(model_path))

        layers = manifest['layers']

        self._channel_dict = dict()

        self._op_list = list()
        for l in layers:
            # parse the layer expression into output name(s), op type, and input name(s)
            out_var, op, in_var = parse_expr(l['expr'])
            if op != 'Concat' and op != 'Eltwise':
                # build the layer module; input channels come from the previous layer
                # (3 for the very first layer, i.e. the RGB input)
                id, out_name, module, out_channel, in_name = get_basic_layer(
                    l,
                    3 if len(self._channel_dict) == 0 else self._channel_dict[in_var[0]],
                    conv_bias=True if op == 'Conv3d' else True,  # note: evaluates to True either way
                    num_segments=num_segments)

                self._channel_dict[out_name] = out_channel
                setattr(self, id, module)
                self._op_list.append((id, op, out_name, in_name))
            elif op == 'Concat':
                # Concat has no parameters; its channel count is the sum of its inputs'
                self._op_list.append((id, op, out_var[0], in_var))
                channel = sum([self._channel_dict[x] for x in in_var])
                self._channel_dict[out_var[0]] = channel
            else:
                # Eltwise (element-wise addition); the channel count is unchanged
                self._op_list.append((id, op, out_var[0], in_var))
                channel = self._channel_dict[in_var[0]]
                self._channel_dict[out_var[0]] = channel


    def forward(self, input):
        data_dict = dict()
        # feed the network input to the first op's input name (e.g. the data blob)
        data_dict[self._op_list[0][-1]] = input

        def get_hook(name):
            # debugging helper: prints the mean absolute gradient flowing back through a
            # layer; only used if the register_backward_hook line below is uncommented
            def hook(m, grad_in, grad_out):
                print(name, grad_out[0].data.abs().mean())

            return hook
        for op in self._op_list:
            if op[1] != 'Concat' and op[1] != 'InnerProduct' and op[1] != 'Eltwise':
                # at the first 3D conv layer (res3a_2), the preceding 2D layer's output must
                # be reshaped from 4D (batch*segments, C, H, W) to 5D (batch, C, segments, H, W)
                if op[0] == 'res3a_2':
                    inception_3c_output = data_dict['inception_3c_double_3x3_1_bn']
                    inception_3c_transpose_output = torch.transpose(
                        inception_3c_output.view((-1, self.num_segments) + inception_3c_output.size()[1:]), 1, 2)
                    data_dict[op[2]] = getattr(self, op[0])(inception_3c_transpose_output)
                else:
                    data_dict[op[2]] = getattr(self, op[0])(data_dict[op[-1]])
                    # getattr(self, op[0]).register_backward_hook(get_hook(op[0]))
            elif op[1] == 'InnerProduct':
                # fully connected layer: flatten the input before applying it
                x = data_dict[op[-1]]
                data_dict[op[2]] = getattr(self, op[0])(x.view(x.size(0), -1))
            elif op[1] == 'Eltwise':
                try:
                    # element-wise addition of the two inputs (residual connection);
                    # torch.add(a, b) replaces the deprecated torch.add(a, 1, b) overload
                    data_dict[op[2]] = torch.add(data_dict[op[-1][0]], data_dict[op[-1][1]])
                except:
                    for x in op[-1]:
                        print(x,data_dict[x].size())
                    raise
                # x = data_dict[op[-1]]
                # data_dict[op[2]] = getattr(self, op[0])(x.view(x.size(0), -1))
            else:
                try:
                    # Concat: concatenate all inputs along the channel dimension
                    data_dict[op[2]] = torch.cat(tuple(data_dict[x] for x in op[-1]), 1)
                except:
                    for x in op[-1]:
                        print(x,data_dict[x].size())
                    raise
        # print output data size in each layers
        # for k in data_dict.keys():
        #     print(k,": ",data_dict[k].size())
        # exit()

        # "self._op_list[-1][2]" represents: last layer's name(e.g. fc_action)
        return data_dict[self._op_list[-1][2]]

The code above uses functions from layer_factory.py to read the model definition file ECO.yaml, build the layers, run forward propagation in the order given by the manifest, and return the output of the last layer.
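
As a quick sanity check, the class can be instantiated directly and run on random data, assuming ECO.yaml and layer_factory.py from the repo are available; the import path below is only a placeholder:

import torch
from models_ECO import ECO   # placeholder import path; adjust to where the ECO class above lives

model = ECO(num_segments=4)
# input shape: (batch_size * num_segments, 3, 224, 224); here batch_size = 1
out = model(torch.randn(4, 3, 224, 224))
print(out.shape)  # output of the last layer (fc_action)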

The model is defined and initialized by passing ECO as the base_model of a TSN wrapper. For TSN, please refer to the TSN network code or the test code at the end of this article.

Now we test the FLOPs and params of the network. Note that ECO_en in the paper denotes an ensemble over input frame counts of 16, 20, 24, and 32. First, adjust the configuration file (opts.py):

import argparse
parser = argparse.ArgumentParser(description="PyTorch implementation of ECO")
parser.add_argument('--dataset', type=str, default="kinetics", choices=['ucf101', 'hmdb51', 'kinetics', 'something','jhmdb'])
parser.add_argument('--modality', type=str, default="RGB", choices=['RGB', 'Flow', 'RGBDiff'])
parser.add_argument('--train_list', default="", type=str)
parser.add_argument('--val_list', default="", type=str)
parser.add_argument('--net_model', type=str, default=None)
parser.add_argument('--net_model2D', type=str, default=None)
parser.add_argument('--net_modelECO', type=str, default=None)
parser.add_argument('--net_model3D', type=str, default=None)
# ========================= Model Configs ==========================
parser.add_argument('--arch', type=str, default="ECO")
parser.add_argument('--num_segments', type=int, default=16)
parser.add_argument('--consensus_type', type=str, default='avg',
                    choices=['avg', 'max', 'topk', 'identity', 'rnn', 'cnn'])
parser.add_argument('--pretrained_parts', type=str, default='both',
                    choices=['scratch', '2D', '3D', 'both','finetune'])
parser.add_argument('--k', type=int, default=3)

parser.add_argument('--dropout', '--do', default=0.5, type=float,
                    metavar='DO', help='dropout ratio (default: 0.5)')
parser.add_argument('--loss_type', type=str, default="nll",
                    choices=['nll'])

# ========================= Learning Configs ==========================
parser.add_argument('--epochs', default=45, type=int, metavar='N',
                    help='number of total epochs to run')
parser.add_argument('-b', '--batch-size', default=1, type=int,
                    metavar='N', help='mini-batch size (default: 1)')
parser.add_argument('-i', '--iter-size', default=1, type=int,
                    metavar='N', help='number of iterations before one update')
parser.add_argument('--lr', '--learning-rate', default=0.001, type=float,
                    metavar='LR', help='initial learning rate')
parser.add_argument('--lr_steps', default=[20, 40], type=float, nargs="+",
                    metavar='LRSteps', help='epochs to decay learning rate by 10')
parser.add_argument('--momentum', default=0.9, type=float, metavar='M',
                    help='momentum')
parser.add_argument('--weight-decay', '--wd', default=5e-4, type=float,
                    metavar='W', help='weight decay (default: 5e-4)')
parser.add_argument('--clip-gradient', '--gd', default=None, type=float,
                    metavar='W', help='gradient norm clipping (default: disabled)')
parser.add_argument('--no_partialbn', '--npb', default=False, action="store_true")
parser.add_argument('--nesterov',  default=False)
parser.add_argument('--num_saturate', type=int, default=5,
                    help='if number of epochs that validation Prec@1 saturates, then decrease lr by 10 (default: 5)')

# ========================= Monitor Configs ==========================
parser.add_argument('--print-freq', '-p', default=20, type=int,
                    metavar='N', help='print frequency (default: 20)')
parser.add_argument('--eval-freq', '-ef', default=5, type=int,
                    metavar='N', help='evaluation frequency (default: 5)')


# ========================= Runtime Configs ==========================
parser.add_argument('-j', '--workers', default=4, type=int, metavar='N',
                    help='number of data loading workers (default: 4)')
parser.add_argument('--resume', default='', type=str, metavar='PATH',
                    help='path to latest checkpoint (default: none)')
parser.add_argument('-e', '--evaluate', dest='evaluate', action='store_true',
                    help='evaluate model on validation set')
parser.add_argument('--snapshot_pref', type=str, default="")
parser.add_argument('--start-epoch', default=0, type=int, metavar='N',
                    help='manual epoch number (useful on restarts)')
parser.add_argument('--gpus', nargs='+', type=int, default=None)
parser.add_argument('--flow_prefix', default="", type=str)
parser.add_argument('--rgb_prefix', default="", type=str)

Here batch_size is set to 1, and one frame is extracted from each segment during input. The test code is as follows:


import torch
from models import TSN

from opts import parser


args = parser.parse_args()

model = TSN(400, args.num_segments, args.pretrained_parts, args.modality,
                base_model="ECO",
                consensus_type=args.consensus_type, dropout=args.dropout, partial_bn=not args.no_partialbn)

from thop import profile
# input: (t, c, h, w); note: here t = batch_size * num_segments, following TSN's input layout
input = torch.randn(16,3,224,224)
flops, params = profile(model, inputs=(input,))
print("FLOPs = %f G"%(flops/1e9))
print("params = %f M"%(params/1e6))
# print(model)
#FLOPs = 46.681547 G
#params = 37.503856 M

The model is initialized as above. When testing other frame counts, num_segments (and the first dimension of the input tensor) needs to be changed accordingly. Parameter statistics can also be obtained with summary from torchsummary.
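
For example, the four frame counts used by the ECO_en ensemble can be profiled in one loop. This is a sketch under the same assumptions as above (400 classes, the repo's TSN wrapper and ECO base model, thop installed):

import torch
from thop import profile
from models import TSN
from opts import parser

args = parser.parse_args()

for n_seg in [16, 20, 24, 32]:
    model = TSN(400, n_seg, args.pretrained_parts, args.modality,
                base_model="ECO", consensus_type=args.consensus_type,
                dropout=args.dropout, partial_bn=not args.no_partialbn)
    dummy = torch.randn(n_seg, 3, 224, 224)   # batch_size = 1, one frame per segment
    flops, params = profile(model, inputs=(dummy,))
    print("N=%d: FLOPs = %.3f G, params = %.3f M" % (n_seg, flops / 1e9, params / 1e6))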

I did not read the paper in full detail; I simply initialized the network and measured its FLOPs and params. For more details, please refer to the paper and the source code.

The paper and source code are as follows:

Paper: ECO: Efficient Convolutional Network for Online Video Understanding (ECCV 2018), https://arxiv.org/pdf/1804.09066.pdf
Code: mzolfaghari/ECO-pytorch (PyTorch implementation), https://github.com/mzolfaghari/ECO-pytorch


Origin: https://blog.csdn.net/Mr___WQ/article/details/127302371