How to eliminate "bad videos"? She shows you how to take the first step

A not-so-serious opening

Video-based social networking has become the most popular way to socialize these days. Compared with traditional text and voice chat, communicating with friends through personally recorded short videos, humorous pictures, and emoticons is not only more fun but also more personal.


With the popularity of video social networking, the video generated every day can reach tens of millions of hours. The quality of this data is uneven, and it contains a large amount of objectionable content involving violence, pornography, and politically sensitive material. Faced with such massive volumes, manual review alone cannot keep up, which has given rise to intelligent content moderation: using artificial intelligence to automatically classify massive numbers of videos, identify those containing sensitive content, and block them.


The first step in intelligent content moderation is video classification. Today, we are going to talk about the algorithms behind video classification.

A serious opening

The story begins with deep learning. (Starting from deep learning also makes this article look like a properly impressive algorithm survey.) Deep learning has become a buzzword across many fields in recent years; in speech recognition, image classification, video understanding, and elsewhere, deep learning algorithms have reached or even surpassed human performance on certain tasks. This article surveys deep learning algorithms from the perspective of video classification.

Video classification means assigning a category to a given video clip based on its content. Categories are usually actions (such as making a cake), scenes (such as a beach), or objects (such as a table). Among these, action classification is the most popular: actions are inherently "dynamic" and cannot be captured by "static" images alone, so it is the most representative video classification task.

Datasets

Readers familiar with deep learning will know that it is a data-driven technology, so datasets play a very important role in algorithm research. Although huge amounts of user-uploaded video exist on the Internet, most of it lacks category labels, and using it directly for training leads to poor results. In academia, there are public datasets that have been fully annotated, and these are great helpers for training algorithms. In video classification specifically, datasets come in two flavors: trimmed and untrimmed. A trimmed video has been edited so that it contains only the content of the category to be recognized; an untrimmed video has not been edited and contains plenty of material other than the target actions/scenes/objects. Untrimmed datasets usually require temporal action detection algorithms on top of video classification; that is outside today's topic, and we can discuss those algorithms another time.

The most common trimmed-video datasets are UCF101, HMDB51, Kinetics, and Moments in Time; the most common untrimmed-video datasets are ActivityNet, Charades, and SLAC. A comparison of some of these datasets is shown in the table below:

[Table: comparison of common video classification datasets]

It should be pointed out that, as the table shows, video classification datasets are much smaller in scale than image classification datasets. This is because annotating a video is far more time-consuming and labor-intensive than annotating an image. Trimmed videos are the easier case: annotation time is roughly equal to the video duration. For untrimmed videos, annotators must also mark the start and end times of each action, which by some measurements takes about four times the video's length.

So, ladies and gentlemen, use these datasets well and cherish them.

Research progress

In video classification, two kinds of features matter most: appearance features and temporal (dynamic) features. The performance of a video classification system largely depends on whether it extracts and makes good use of both. But extracting them is not easy; deformation, viewpoint changes, motion blur, and other factors get in the way. It is therefore crucial to design features that are robust to such noise while retaining the information that distinguishes video categories.

Given the success of ConvNets (deep convolutional neural networks) in image classification, it is natural to try ConvNets for video classification. However, ConvNets by themselves model the appearance of 2D images; for videos, temporal features matter as well. So how can temporal features be modeled? There are usually three ideas: LSTM, 3D-ConvNet, and Two-Stream.


1. LSTM series

LRCNs [1] combine an LSTM with a ConvNet for video classification. The combination is natural: a ConvNet classifier already trained on image classification extracts the appearance features of each video frame well, while temporal features can be captured simply by adding LSTM layers on top, since an LSTM takes the states of previous time steps as input at the current step and thus preserves information along the time dimension.
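As a rough illustration of this architecture, here is a minimal PyTorch sketch (my own, not the LRCN authors' code) that runs an image CNN over every frame and feeds the per-frame features to an LSTM; the ResNet-18 backbone, hidden size, and clip shape are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

class LRCN(nn.Module):
    """Sketch of a CNN+LSTM video classifier in the spirit of LRCN."""
    def __init__(self, num_classes, hidden_size=512):
        super().__init__()
        backbone = models.resnet18(weights=None)            # any image CNN works here
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop the fc layer
        self.feat_dim = backbone.fc.in_features             # 512 for resnet18
        self.lstm = nn.LSTM(self.feat_dim, hidden_size, batch_first=True)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, clip):                                # clip: (B, T, C, H, W)
        b, t, c, h, w = clip.shape
        feats = self.cnn(clip.reshape(b * t, c, h, w)).reshape(b, t, -1)
        out, _ = self.lstm(feats)                           # (B, T, hidden)
        return self.classifier(out[:, -1])                  # classify the last time step

logits = LRCN(num_classes=101)(torch.randn(2, 16, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 101])
```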


Video classification is a variable-length-input, fixed-length-output task. The paper also presents LRCN variants for image description (fixed-length input, variable-length output) and video description (variable-length input, variable-length output); interested readers can look them up.

2. 3D-ConvNet and its derivatives

C3D [2] is work from Facebook that extends 2D convolution to 3D. The principle is illustrated below. Recall that a 2D convolution slides a kernel over the input image or feature map to produce the next layer's feature map. Figure (a) shows convolution on a single-channel image, and Figure (b) convolution on a multi-channel input (the channels may be the three color channels of one image, or a stack of frames, i.e., a short video); in both cases the output is a two-dimensional feature map, so the multi-channel information is completely compressed. A 3D convolution preserves temporal information by extending the kernel with a temporal depth; as Figure (c) shows, its output is still a three-dimensional feature map. Through 3D convolution, C3D can therefore process video directly while exploiting both appearance and temporal features.

[Figure: (a) 2D convolution on a single-channel input; (b) 2D convolution on a multi-channel input; (c) 3D convolution, whose output keeps the temporal dimension]
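To make the difference concrete, here is a small PyTorch sketch (my own illustration, not the C3D code) contrasting a 2D convolution over stacked frames, which collapses the temporal axis, with a 3D convolution, which keeps it; the tensor shapes are arbitrary.

```python
import torch
import torch.nn as nn

clip = torch.randn(1, 3, 16, 112, 112)       # (batch, channels, frames, H, W)

# 2D conv over 16 stacked frames: time is folded into channels and disappears.
conv2d = nn.Conv2d(in_channels=3 * 16, out_channels=64, kernel_size=3, padding=1)
out2d = conv2d(clip.reshape(1, 3 * 16, 112, 112))
print(out2d.shape)                            # torch.Size([1, 64, 112, 112])

# 3D conv with a 3x3x3 kernel: the output still has a temporal axis.
conv3d = nn.Conv3d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
out3d = conv3d(clip)
print(out3d.shape)                            # torch.Size([1, 64, 16, 112, 112])
```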

As for experimental results, C3D reaches 82.3% accuracy on UCF101, which is not high. The reason is that C3D's network structure is a simple, self-designed one (only 11 layers) that neither borrows from nor is pre-trained on other mature ConvNet architectures.

In response, many researchers have proposed improvements.

  • I3D [3] is DeepMind's improvement on C3D. It is worth mentioning that the I3D paper is also the one that released the Kinetics dataset. The innovation lies in weight initialization: how to transfer the weights of a pre-trained 2D ConvNet to a 3D ConvNet. Specifically, an image repeated T times along the time dimension can be viewed as a (very boring) T-frame video. To make the 3D network's output on that video equal the 2D network's output on the single image, each 2D kernel is repeated T times along the temporal axis and scaled down by a factor of T (a minimal sketch of this trick is given after this list). Pre-trained on Kinetics and then applied to UCF101, I3D reaches 98.0% accuracy.

  • P3D [4] is MSRA's improvement on C3D. The basic idea is to extend ResNet with "pseudo" 3D convolutions: a 1×3×3 2D spatial convolution plus a 3×1×1 1D temporal convolution approximate the commonly used 3×3×3 3D convolution, as shown in the figure below. P3D improves on C3D in both parameter count and running speed.

[Figure: P3D blocks decomposing a 3×3×3 convolution into a 1×3×3 spatial convolution and a 3×1×1 temporal convolution]
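Below is a minimal sketch of the I3D inflation trick described above (my own illustration of the paper's description, not DeepMind's code): a pretrained 2D kernel is repeated T times along the temporal axis and divided by T, and the sanity check confirms that a "boring" video of identical frames yields the same activation as the single image.

```python
import torch
import torch.nn as nn

def inflate_conv2d(conv2d: nn.Conv2d, time_dim: int) -> nn.Conv3d:
    """Inflate a 2D conv into a 3D conv by repeating its kernel over time."""
    conv3d = nn.Conv3d(
        conv2d.in_channels, conv2d.out_channels,
        kernel_size=(time_dim, *conv2d.kernel_size),
        padding=(time_dim // 2, *conv2d.padding),
        bias=conv2d.bias is not None,
    )
    # (out, in, kH, kW) -> (out, in, T, kH, kW), scaled by 1/T
    w = conv2d.weight.data.unsqueeze(2).repeat(1, 1, time_dim, 1, 1) / time_dim
    conv3d.weight.data.copy_(w)
    if conv2d.bias is not None:
        conv3d.bias.data.copy_(conv2d.bias.data)
    return conv3d

# Sanity check: a video of 16 identical frames gives the same output as the image.
img = torch.randn(1, 3, 32, 32)
c2 = nn.Conv2d(3, 8, kernel_size=3, padding=1)
c3 = inflate_conv2d(c2, time_dim=3)
video = img.unsqueeze(2).repeat(1, 1, 16, 1, 1)
print(torch.allclose(c2(img), c3(video)[:, :, 8], atol=1e-5))  # True (interior frame)
```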

3. Two-Stream Network and its derivatives

Two-Stream [5] is work from the VGG group (not UGG). The basic idea is to train two ConvNets that model video frame images (the spatial stream) and dense optical flow (the temporal stream) respectively. The two networks share the same structure and are both 2D ConvNets; see the figure below. Each stream predicts the video's category to obtain class scores, and the scores are then fused into the final classification result.

As you can see, Two-Stream takes a different route from C3D: its ConvNets are all 2D, and temporal modeling is handled by one of the two branch networks. Two-Stream achieves 88.0% accuracy on UCF101.

[Figure: two-stream architecture with a spatial ConvNet on RGB frames and a temporal ConvNet on stacked optical flow, fused at the score level]
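Here is a minimal late-fusion sketch of the two-stream idea (an illustration under assumed backbones and flow stacking, not the original code): one 2D ConvNet scores an RGB frame, another scores a stack of optical-flow fields, and the softmax scores are averaged.

```python
import torch
import torch.nn as nn
from torchvision import models

num_classes = 101

def make_stream(in_channels):
    net = models.resnet18(weights=None)
    # Replace the first conv so the temporal stream accepts 2L flow channels.
    net.conv1 = nn.Conv2d(in_channels, 64, kernel_size=7, stride=2, padding=3, bias=False)
    net.fc = nn.Linear(net.fc.in_features, num_classes)
    return net

spatial = make_stream(in_channels=3)        # a single RGB frame
temporal = make_stream(in_channels=2 * 10)  # x/y flow for 10 consecutive frames

rgb = torch.randn(4, 3, 224, 224)
flow = torch.randn(4, 20, 224, 224)

# Late fusion: average the per-stream softmax scores.
scores = (spatial(rgb).softmax(-1) + temporal(flow).softmax(-1)) / 2
print(scores.shape)  # torch.Size([4, 101])
```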

Many researchers have since improved on how the spatial and temporal streams are fused.

  • [6] builds on the two-stream network by using 3D convolution and 3D pooling to fuse the spatial and temporal streams, effectively combining Two-Stream with C3D. The paper also replaces both branch networks with VGG-16. It achieves 92.5% accuracy on UCF101.

  • TSN [7], work from CUHK, gives an exhaustive discussion of how to further improve two-stream networks. Here the two streams are applied to short video snippets, and the snippet-level scores are aggregated into a video-level prediction (a sketch of this idea follows this list). On the input side, besides raw video frames and dense optical flow, the paper finds that adding warped optical flow also helps. For the branch networks, GoogLeNet, VGG-16, and BN-Inception are compared, with BN-Inception performing best. Training uses cross-modal pre-training, regularization, data augmentation, and other techniques. TSN reaches 94.2% accuracy on UCF101.
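The segment-sampling and consensus idea behind TSN can be sketched as follows (a simplified illustration, not the authors' code; `snippet_model`, the segment count, and middle-frame sampling are assumptions): the video is split into K segments, one snippet per segment is scored, and the scores are averaged into a video-level prediction.

```python
import torch
import torch.nn as nn

def tsn_predict(video_frames, snippet_model, num_segments=3):
    """video_frames: (T, C, H, W) tensor of decoded frames."""
    t = video_frames.shape[0]
    bounds = torch.linspace(0, t, num_segments + 1).long()
    snippet_scores = []
    for k in range(num_segments):
        # Sample the middle frame of segment k as its snippet (random at train time).
        idx = (bounds[k] + bounds[k + 1]) // 2
        snippet_scores.append(snippet_model(video_frames[idx].unsqueeze(0)))
    # Segmental consensus: average the snippet class scores.
    return torch.stack(snippet_scores).mean(dim=0)

toy_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 101))  # stand-in branch
video = torch.randn(90, 3, 32, 32)           # 90 frames
print(tsn_predict(video, toy_model).shape)   # torch.Size([1, 101])
```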


4. Other

Besides the two common approaches above, some researchers have taken different routes and tried other methods.

  • TDD [8] improves on the traditional iDT [9] algorithm (the best action recognition method before deep learning). It combines trajectory features with a two-stream network: the two-stream network serves as the feature extractor, trajectories are used to pool those features into trajectory-pooled deep-convolutional descriptors, and a linear SVM performs the final video classification. TDD is a relatively successful example of combining a traditional method with deep learning, achieving 90.3% accuracy on UCF101.

  • ActionVLAD [10] is a feature-fusion method that can fuse two-stream features, C3D features, or features from other network structures. The idea is to compute residuals between the original features and learned cluster centers, and to aggregate frames from different times into a new representation. ActionVLAD fuses features across both the spatial and temporal dimensions of a video, making the representation more comprehensive.

  • Non-local Network [11] is recent work from Facebook by Kaiming He and Ross Girshick (RBG). Non-local operations open a new direction for modeling long-range dependencies across space and time in video. Convolutional structures capture only local information and are not flexible enough at propagating non-local information; a non-local block instead updates each position based on information from all positions in all frames (a minimal sketch follows this list). The paper adds this block to I3D and improves accuracy on Charades by 2%.
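For intuition, here is a minimal embedded-Gaussian non-local block (a sketch following the paper's formulation; the channel sizes and placement are illustrative assumptions): every position attends to every other position across all frames, and the result is added back through a residual connection.

```python
import torch
import torch.nn as nn

class NonLocalBlock3D(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.inter = channels // 2
        self.theta = nn.Conv3d(channels, self.inter, 1)
        self.phi = nn.Conv3d(channels, self.inter, 1)
        self.g = nn.Conv3d(channels, self.inter, 1)
        self.out = nn.Conv3d(self.inter, channels, 1)

    def forward(self, x):                          # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        n = t * h * w
        q = self.theta(x).reshape(b, self.inter, n).transpose(1, 2)  # (B, N, C')
        k = self.phi(x).reshape(b, self.inter, n)                    # (B, C', N)
        v = self.g(x).reshape(b, self.inter, n).transpose(1, 2)      # (B, N, C')
        attn = torch.softmax(q @ k, dim=-1)        # affinity over all space-time positions
        y = (attn @ v).transpose(1, 2).reshape(b, self.inter, t, h, w)
        return x + self.out(y)                     # residual connection

x = torch.randn(2, 64, 8, 14, 14)
print(NonLocalBlock3D(64)(x).shape)                # torch.Size([2, 64, 8, 14, 14])
```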

Summary

All of the video classification algorithms above were proposed in just the last few years, which shows how quickly this field is developing. From an academic point of view, video classification is a golden key to video understanding: it lays the foundation for related research such as video action detection and video structural analysis, all of which build on video classification techniques. From the perspective of everyday life, video classification is already quietly doing a great deal of work, from the intelligent content moderation mentioned at the beginning of this article to video retrieval, video surveillance, video advertising, autonomous driving, sports analytics, and more. In the near future, I believe video classification and other AI algorithms will bring us even more pleasant surprises. AI makes life better.


References

[1] J. Donahue, et al. Long-term recurrent convolutional networks for visual recognition and description. CVPR, 2015.

[2] D. Tran, et al. Learning Spatiotemporal Features with 3D Convolutional Networks. ICCV, 2015.

[3] J. Carreira, et al. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. CVPR, 2017.

[4] Z. Qiu, et al. Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks. ICCV, 2017.

[5] K. Simonyan, et al. Two-Stream Convolutional Networks for Action Recognition in Videos. NIPS, 2014.

[6] C. Feichtenhofer, et al. Convolutional Two-Stream Network Fusion for Video Action Recognition. CVPR, 2016.

[7] L. Wang, et al. Temporal Segment Networks: Towards Good Practices for Deep Action Recognition. ECCV, 2016.

[8] L. Wang, et al. Action Recognition with Trajectory-Pooled Deep-Convolutional Descriptors. CVPR, 2015.

[9] H. Wang, et al. Action Recognition with Improved Trajectories. ICCV, 2013.

[10] R. Girdhar, et al. ActionVLAD: Learning spatio-temporal aggregation for action classification. CVPR, 2017.

[11] X. Wang, et al. Non-local Neural Networks. arXiv preprint, 2017.
