Applications of Multimodal Algorithms in Video Understanding

1 Overview

At present, video classification algorithms mainly focus on understanding a video as a whole and assigning video-level labels, which is a coarse granularity. Few works focus on fine-grained understanding of temporal segments while also analyzing videos from a multimodal perspective. This article shares a solution that improves the accuracy of video understanding with a multimodal network and achieves a large improvement on the YouTube-8M dataset.

2 Related work

In video classification, NeXtVLAD [1] has proven to be an efficient and fast method. Inspired by ResNeXt, the authors decompose the high-dimensional video feature vectors into a set of low-dimensional vectors. This network significantly reduces the number of parameters compared with the previous NetVLAD network, yet still achieves remarkable performance in feature aggregation and large-scale video classification.
RNNs [2] have been shown to perform well in modeling sequence data, and researchers typically use them to model the temporal information in videos that CNNs have difficulty capturing. The GRU [3] is an important member of the RNN family that avoids the vanishing gradient problem. Attention-GRU [4] adds an attention mechanism, which helps distinguish the impact of different features on the current prediction.
To combine the spatial and temporal features of video tasks, two-stream CNNs [5], 3D-CNNs [6], SlowFast [7], and ViViT [8] were proposed later. Although these models also achieve good performance on video understanding tasks, there is still room for improvement: many methods only target a single modality, or only process the entire video without outputting fine-grained labels.

3 Technical solutions

3.1 Overall network structure

This technical solution is designed to fully learn the semantic features of the video's multiple modalities (text, audio, image), while overcoming the extreme class imbalance and partial (semi-supervised) labeling of the YouTube-8M dataset.
As shown in Figure 1, the entire network consists of a mixed multimodal network (Mix-Multimodal Network) at the front and a graph convolutional network (GCN [9]) at the back. The Mix-Multimodal Network is composed of three differentiated multimodal classification networks, whose differentiation parameters are listed in Table 1.
Figure 1. Overall network structure
Table 1. Parameters of three differentiated Multimodal Nets

3.2 Multimodal Networks

As shown in Figure 2, the multimodal network handles three modalities (text, video, audio), and each modality goes through three stages: basic semantic understanding, temporal feature understanding, and modality fusion. The video and audio semantic understanding models are EfficientNet [10] and VGGish respectively, with NeXtVLAD as their temporal feature understanding model; for text, the temporal feature understanding model is BERT [11].
For multimodal feature fusion we use SENet [12]. The pre-processing of the SENet fusion forcibly compresses and aligns the feature lengths of each modality, which leads to information loss. To overcome this problem, we adopt a multi-group SENet structure (a sketch is given after Figure 2). Experiments show that the multi-group SENet has a stronger learning ability than a single SENet.
Figure 2. Multimodal network structure
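
To make the multi-group SE fusion concrete, below is a minimal PyTorch sketch assuming the per-modality features have already been aggregated into fixed-length vectors (e.g., by NeXtVLAD and BERT). The module name, projection width, group count, and reduction ratio are illustrative assumptions, not the exact configuration of our network.

```python
import torch
import torch.nn as nn

class MultiGroupSEFusion(nn.Module):
    """Fuse modality features with several parallel SE (squeeze-and-excitation) gates.

    Each modality vector (text / video / audio) is projected to a common width,
    every group applies its own channel gating over the concatenated features,
    and the group outputs are concatenated for the classifier.
    """

    def __init__(self, modality_dims, proj_dim=512, num_groups=4, reduction=8):
        super().__init__()
        self.projs = nn.ModuleList(nn.Linear(d, proj_dim) for d in modality_dims)
        fused_dim = proj_dim * len(modality_dims)
        self.groups = nn.ModuleList(
            nn.Sequential(
                nn.Linear(fused_dim, fused_dim // reduction),
                nn.ReLU(inplace=True),
                nn.Linear(fused_dim // reduction, fused_dim),
                nn.Sigmoid(),
            )
            for _ in range(num_groups)
        )

    def forward(self, features):
        # features: list of [batch, dim_i] tensors, one per modality
        projected = [proj(f) for proj, f in zip(self.projs, features)]
        fused = torch.cat(projected, dim=1)                     # [batch, fused_dim]
        gated = [fused * gate(fused) for gate in self.groups]   # SE-style re-weighting per group
        return torch.cat(gated, dim=1)                          # [batch, fused_dim * num_groups]


# Example with assumed feature widths: text (BERT), video, audio
fusion = MultiGroupSEFusion(modality_dims=[768, 2048, 1024])
out = fusion([torch.randn(2, 768), torch.randn(2, 2048), torch.randn(2, 1024)])
print(out.shape)  # torch.Size([2, 6144])
```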

3.3 Graph Convolution

In YouTube-8M the coarse-grained labels are fully annotated, while the fine-grained labels are annotated on only part of the data, so we introduce a GCN for the semi-supervised classification task. The basic idea is to update node representations by propagating information among nodes. For multi-label video classification, label dependencies are an important source of information.
We use conditional probabilities between labels to introduce label-correlation information into our modeling. Based on the coarse-grained labels of the YouTube-8M dataset, we first count the number of occurrences of each label to obtain a vector N ∈ Rᶜ, where C is the number of categories. Then we count the co-occurrences of all label pairs to obtain the co-occurrence matrix M ∈ Rᶜˣᶜ. Finally, the conditional probability matrix P ∈ Rᶜˣᶜ is obtained:
Pᵢⱼ = Mᵢⱼ / Nᵢ
Pᵢⱼ represents the occurrence probability of label j when label i appears.
However, the co-occurrence statistics of the training and test datasets may not be fully consistent, nor are the dataset labels 100% correct. Moreover, some rare co-occurrences may not reflect real relationships and may just be noise. Therefore, we filter Pᵢⱼ into Pᵢⱼ' using a threshold λ:
Pᵢⱼ' = 1 if Pᵢⱼ ≥ λ, and Pᵢⱼ' = 0 if Pᵢⱼ < λ
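
The construction of the correlation matrix can be sketched directly from the definitions above: count label occurrences N and pairwise co-occurrences M over the training annotations, divide to obtain P, and filter with the threshold λ. The hard 0/1 binarization follows the style of [9] and is an assumption about the exact form of the filtering; function and variable names are illustrative.

```python
import numpy as np

def build_correlation_matrix(labels, lam=0.4):
    """labels: binary matrix of shape [num_videos, C] with the coarse-grained tags.
    Returns the thresholded conditional-probability matrix P' of shape [C, C]."""
    labels = labels.astype(np.float32)
    N = labels.sum(axis=0)                   # N[i]: number of videos carrying label i
    M = labels.T @ labels                    # M[i, j]: co-occurrences of labels i and j
    P = M / np.maximum(N[:, None], 1.0)      # P[i, j] = M[i, j] / N[i]
    P_prime = (P >= lam).astype(np.float32)  # filter rare / noisy co-occurrences
    return P_prime
```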
In our task, each label is a node of the graph, and an edge between two nodes represents their relationship [13][14]. We can therefore train a matrix to represent the relationships of all nodes.
Take the simplified label correlation graph in Figure 3, extracted from our dataset, as an example: the edge Label BMW --> Label Car means that when the BMW label appears, the Car label is likely to appear as well, but not necessarily vice versa. The label Car is highly correlated with all other labels, while the absence of an arrow between two labels indicates that they have no relationship with each other.
Figure 3. Schematic diagram of label correlation

The GCN implementation is shown in Figure 4. The GCN module consists of two stacked GCN layers (GCN(1) and GCN(2)), which learn from the label correlation graph to map the label representations into a set of inter-dependent classifiers. A is the input correlation matrix and is initialized with the values of the matrix P'.
W₁ and W₂ are weight matrices trained in the network, and W is the classifier weight learned by the GCN (a minimal sketch is given after Figure 4).
Figure 4. GCN network structure
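
Below is a minimal sketch of the two stacked GCN layers in the spirit of ML-GCN [9]: the label nodes are propagated through A and mapped by W₁ and W₂ into classifier weights W, which are then applied to the fused video feature. Using label word embeddings as node inputs, the symmetric normalization, and all dimensions are assumptions for illustration, not confirmed implementation details.

```python
import torch
import torch.nn as nn

class LabelGCNClassifier(nn.Module):
    """Two stacked GCN layers (GCN(1), GCN(2)) that turn label nodes into
    a set of inter-dependent classifiers W, applied to the fused video feature."""

    def __init__(self, num_classes, label_dim=300, hidden_dim=1024, feature_dim=2048):
        super().__init__()
        self.W1 = nn.Linear(label_dim, hidden_dim, bias=False)    # GCN(1) weight
        self.W2 = nn.Linear(hidden_dim, feature_dim, bias=False)  # GCN(2) weight
        self.act = nn.LeakyReLU(0.2)

    @staticmethod
    def normalize(A):
        # Symmetric normalization D^{-1/2} A D^{-1/2} of the correlation matrix
        d = A.sum(dim=1).clamp(min=1e-6).pow(-0.5)
        return d.unsqueeze(1) * A * d.unsqueeze(0)

    def forward(self, video_feature, label_embeddings, A):
        # video_feature: [batch, feature_dim]; label_embeddings: [C, label_dim]; A: [C, C]
        A_hat = self.normalize(A)
        H = self.act(A_hat @ self.W1(label_embeddings))  # GCN(1): propagate + transform
        W = A_hat @ self.W2(H)                           # GCN(2): classifier weights [C, feature_dim]
        return video_feature @ W.t()                     # logits [batch, C]
```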

3.4 Label Reweighting

The YouTube-8M video classification task is a multi-label task, but the annotation only marks one of the applicable labels as 1, while all remaining labels are set to 0. In other words, a video segment may also belong to classes whose labels are set to 0, which makes this a weakly supervised problem.
For this situation we propose a workaround: when computing the loss, annotated classes are given larger weights and unannotated classes are given smaller weights [15]. This weighted cross-entropy helps the model learn better from the incomplete annotations.
The original binary cross-entropy loss function is:
L = −(1/B) Σᵢ Σⱼ [ yᵢⱼ log pᵢⱼ + (1 − yᵢⱼ) log(1 − pᵢⱼ) ]
where y ∈ Rᴮˣᶜ contains the labels of the video segments, p ∈ Rᴮˣᶜ is the model's prediction, B is the batch size, and C is the number of classes.
The new label reweighting loss function is as follows:
L = −(1/B) Σᵢ Σⱼ wᵢⱼ [ yᵢⱼ log pᵢⱼ + (1 − yᵢⱼ) log(1 − pᵢⱼ) ]
where the weights w ∈ Rᴮˣᶜ are defined as:
wᵢⱼ = m if yᵢⱼ = 1, and wᵢⱼ = n if yᵢⱼ = 0
where n is the small weight for unannotated classes and m is the large weight for annotated classes.
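
A minimal sketch of the reweighted binary cross-entropy under this definition: entries with yᵢⱼ = 1 receive the large weight m and all other entries receive the small weight n. The default values m = 2.5 and n = 0.1 follow the experiments in Section 4.2.4; the function name and mean reduction are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def reweighted_bce(logits, targets, m=2.5, n=0.1):
    """logits, targets: [batch, C]; targets are 0/1 with only the annotated class set to 1.
    Annotated entries are weighted by m, all other (possibly missing) labels by n."""
    weights = torch.where(targets > 0.5,
                          torch.full_like(targets, m),
                          torch.full_like(targets, n))
    return F.binary_cross_entropy_with_logits(logits, targets, weight=weights)
```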

3.5 Feature Enhancement

To avoid overfitting when training the model, we generate Gaussian noise and randomly inject it into the elements of the input feature vector.
As shown in Figure 6, noise is added to the input feature vector through a mask vector that randomly selects 50% of the dimensions and sets their values to 1. The Gaussian noise samples are independent but identically distributed across different input vectors.
Figure 6. Adding Gaussian noise
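A minimal sketch of this noise injection, assuming it is applied only during training: independent Gaussian noise is drawn for each input vector and added to a randomly selected 50% of its dimensions via a 0/1 mask. The noise standard deviation is an assumed hyperparameter.

```python
import torch

def add_gaussian_noise(x, ratio=0.5, std=0.1, training=True):
    """x: [batch, dim] input feature vectors. Adds Gaussian noise to a random
    `ratio` of the dimensions of each vector (mask value 1 = noise is applied)."""
    if not training:
        return x
    noise = torch.randn_like(x) * std            # i.i.d. noise, same distribution for all inputs
    mask = (torch.rand_like(x) < ratio).float()  # roughly 50% of dimensions set to 1
    return x + noise * mask
```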
At the same time, to prevent the multimodal model from learning the characteristics of only one modality, i.e., overfitting to that modality, we also mask entire modality features while ensuring that at least one modality remains in the input, as shown in Figure 7. This allows each modality to be fully learned.
Figure 7. Modal Mask
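A minimal sketch of the modality mask, assuming each modality is dropped independently with some probability while at least one modality is always kept. The drop probability and the choice of zeroing out dropped modalities are assumptions for illustration.

```python
import random
import torch

def mask_modalities(features, drop_prob=0.3, training=True):
    """features: list of [batch, dim_i] tensors (text, video, audio).
    Randomly zeroes whole modalities, but always keeps at least one."""
    if not training:
        return features
    keep = [random.random() > drop_prob for _ in features]
    if not any(keep):                              # guarantee at least one modality survives
        keep[random.randrange(len(features))] = True
    return [f if k else torch.zeros_like(f) for f, k in zip(features, keep)]
```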

4 Experiments

4.1 Evaluation metric

Model performance is evaluated according to MAP@K, where K=100,000.
MAP@K = (1/C) Σ_{c=1}^{C} (1/N) Σ_{k=1}^{min(n,K)} P(k) · rel(k)
where C is the number of classes, n is the number of predicted segments for each class (up to K), N is the number of positively labeled segments for each class, P(k) is the precision of the top k predicted segments, and rel(k) indicates whether the segment at rank k is correctly predicted (1 if correct, 0 otherwise).
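
Following the definitions above, MAP@K can be computed from the per-class ranked predictions as sketched below; the input format (per-class 0/1 relevance arrays) and the handling of classes without positives are assumptions for illustration.

```python
import numpy as np

def map_at_k(relevance_per_class, num_positives_per_class, k=100_000):
    """relevance_per_class: list (length C) of 0/1 arrays giving rel(k) for the
    ranked predictions of each class. num_positives_per_class: N for each class."""
    aps = []
    for rel, num_pos in zip(relevance_per_class, num_positives_per_class):
        if num_pos == 0:
            continue  # assumption: classes with no positive segments are skipped
        rel = np.asarray(rel[:k], dtype=np.float64)
        precision_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)  # P(k)
        aps.append(np.sum(precision_at_k * rel) / num_pos)           # AP for this class
    return float(np.mean(aps)) if aps else 0.0
```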

4.2 Experimental results

4.2.1 Multimodal

To verify the benefit of each modality, we conducted an ablation experiment; the results are shown in Table 2. With a single modality, Video gives the highest accuracy, Audio the lowest, and Text is close to Video. With two modalities, Video + Text brings a significant improvement, while additionally adding Audio yields only a limited further gain.
Table 2. Multimodal ablation experiments

4.2.2 Graph Convolution

To verify the benefit of the GCN, we also ran a comparative experiment with two thresholds λ, 0.2 and 0.4. As shown in Table 3, the classifier generated by the GCN improves performance over the original model (org), especially when λ = 0.4.
Table 3. Graph Convolution Experiments

4.2.3 Differentiated Multimodal Networks

To verify the effects of parallel multimodal networks and differentiation, we designed five groups of experiments. The first group is a single multimodal network; the second, third, and fourth groups are 2, 3, and 4 parallel multimodal networks; and the fifth group is a differentiated network of 3 parallel multimodal networks.
The results show that parallel networks improve accuracy, but the gain diminishes after 4 parallel networks, so blindly increasing the number of parallel networks does not bring further benefit. The results also show that the differentiated network structure fits the data more effectively.
Table 4. Differentiated Multimodal Network Experiments

4.2.4 Label Reweighting

Label reweighting has two hyperparameters, n and m. Experiments show that the accuracy is higher when n = 0.1 and m = 2.5.
Table 5. Label reweighting experiments

4.2.5 Feature Enhancement

Feature enhancement is a kind of data augmentation. Experiments show that adding Gaussian noise and masking certain modalities improves the generalization ability of the model. Moreover, adding Gaussian noise is simple to implement, transfers easily, and can readily be applied to other networks.
Table 6. Feature enhancement experiments

5 Summary

Experiments show that the above methods bring improvements of varying degrees, especially the multimodal fusion and the graph convolution. We hope to explore more label dependencies in the future. GCN networks have also been shown to be useful in this task, and we think it is worthwhile to run more experiments combining GCNs with other state-of-the-art video classification networks.

References
[1]. Rongcheng Lin, Jing Xiao, Jianping Fan: NeXtVLAD: An Efficient Neural Network to Aggregate Frame-level Features for Large-scale Video Classification. In: ECCV Workshop (2018)
[2]. Jeffrey L Elman. Finding structure in time. Cognitive science, 14(2):179–211, 1990
[3]. Kyunghyun Cho, Bart van Merriënboer, Çağlar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv, 2014.
[4]. Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. Attention-based models for speech recognition. In NIPS, pages 577–585, 2015.
[5]. Karen Simonyan, Andrew Zisserman, Two-Stream Convolutional Networks for Action Recognition in Videos. In: NIPS (2014)
[6]. Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, Manohar Paluri, Learning Spatiotemporal Features With 3D Convolutional Networks. In: ICCV (2015)
[7]. Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, Kaiming He, SlowFast Networks for Video Recognition. In: CVPR (2019)
[8]. Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid, ViViT: A Video Vision Transformer. In: CVPR (2021)
[9]. Zhao-Min Chen, Xiu-Shen Wei, Peng Wang, Yanwen Guo: Multi-Label Image Recognition with Graph Convolutional Networks. In: CVPR (2019)
[10]. Mingxing Tan, Quoc V. Le, EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks, PMLR 97:6105-6114, 2019
[11]. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Association for Computational Linguistics (NAACL), 2019
[12]. Jie Hu, Li Shen, Gang Sun, Squeeze-and-Excitation Networks. In: CVPR (2018)
[13]. Zhang Z, Sabuncu M. Generalized cross entropy loss for training deep neural networks with noisy labels[C]//Advances in neural information processing systems. 2018: 8778-8788.
[14]. Pereira R B, Plastino A, Zadrozny B, et al. Correlation analysis of performance measures for multi-label classification [J]. Information Processing & Management, 2018, 54(3): 359-369.
[15]. Panchapagesan S, Sun M, Khare A, et al. Multi-Task Learning and Weighted Cross-Entropy for DNN-Based Keyword Spotting[C]. 2016: 760-764.
