Video duplicate content detection based on deep learning

1. Background
KilaKila is an interactive entertainment content community focused on young users. KilaKila offers interactive voice livestreaming, short video, and dialogue-style fiction features to meet young users' demand for personalized, fragmented entertainment. The short-video feature generates a massive amount of video material every day, which causes serious information overload for users: it is hard for them to pick out content of interest on their own. At the same time, every consumer is also a producer of video content, hoping that their work will be seen by more like-minded people and get maximum exposure. However, the short-video scene contains a large amount of duplicate UGC video content, and repeated exposure to and repeated viewing of the same video lead to a poor user experience and even user churn. This article focuses on building a video duplicate content detection service, and presents an architecture and engineering solution based on deep learning CNN technology. After the service went online, duplicate detection accuracy reached 80%, and video content distribution efficiency improved by 20%.

2. Image feature description methods
The first step in understanding video content is to process a video into video frames, i.e., to sample frames. Frame extraction means selecting key frames that can represent the meaning of the entire video. Videos differ in encoding format, frame rate, bit rate, and resolution, so different frame-extraction strategies apply to different kinds of video. Frame extraction can be roughly divided into extraction at fixed time intervals and extraction based on the actual image content of the frames; content-based extraction is further subdivided into clustering-based, motion-based, and shot-based extraction. This article uses deep learning CNN models to extract features from the image frames, compares the feature-extraction capabilities of the current mainstream models, and trains on our own data to obtain better model parameters.
Traditional feature-descriptor methods can clearly capture the movement of feature points, which is useful for feature-point tracking, but they are powerless for features such as edges and regions (patches). Deep learning methods (CNN) preserve the neighborhood links and local spatial characteristics of the image, handle high-dimensional images more easily, and do not require knowing in advance which features to extract. Extensive practice shows that deep learning has clear advantages in image feature extraction.

Figure 1: An early network architecture for similar-image detection

3. Deep learning CNN model selection
1) Image feature extraction model (2D-CNN)
FFmpeg extracts video key frames at fixed time intervals; the extraction interval can be chosen flexibly. An AlexNet version of the CNN model processes the raw images into feature vectors of dimension 1000. The original image data is fed into the deep learning model for image feature extraction, and the resulting high-dimensional data is stored together with the corresponding picture name for convenient access in subsequent operations.
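To illustrate the fixed-interval frame extraction step, the sketch below builds an ffmpeg command that keeps one frame every few seconds using the standard `fps` filter. The interval, JPEG quality, and output naming pattern are illustrative assumptions, not the exact settings used by the service.

```python
import shlex

def ffmpeg_frame_cmd(video_path: str, out_dir: str, interval_sec: float = 1.0) -> list:
    """Build an ffmpeg command that keeps one frame every `interval_sec` seconds."""
    fps = 1.0 / interval_sec        # the fps filter takes frames-per-second to keep
    return [
        "ffmpeg", "-i", video_path,
        "-vf", f"fps={fps}",        # fixed-interval frame sampling
        "-q:v", "2",                # high JPEG quality for the extracted frames
        f"{out_dir}/frame_%05d.jpg",
    ]

cmd = ffmpeg_frame_cmd("input.mp4", "frames", interval_sec=2.0)
print(shlex.join(cmd))
```

The command list can then be run with `subprocess.run(cmd, check=True)`, and the extracted frames fed to the CNN model.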

Figure 2: High-dimensional feature data


2) Video feature extraction model (3D-CNN)
FFmpeg selects a reasonable number of frames, according to the video parameters, to form a clip, which serves as a single input sample. A C3D version of the CNN model produces the high-dimensional vector representation of the video features. For video analysis problems, 2D convolution cannot effectively capture temporal information. A 3D model uses three-dimensional convolutions that extract not only image features but also temporal features across the clip, represented as a high-dimensional vector. A clip is a fixed-length segment of frames from the video.
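A minimal sketch of the clip-building step, assuming non-overlapping clips of 16 frames (C3D's usual input length) at 112x112 resolution, with trailing frames that do not fill a whole clip dropped; the service's actual clip policy may differ.

```python
import numpy as np

def frames_to_clips(frames: np.ndarray, clip_len: int = 16) -> np.ndarray:
    """Group a (T, H, W, C) frame array into non-overlapping
    (N, clip_len, H, W, C) clips; leftover trailing frames are dropped."""
    n = frames.shape[0] // clip_len
    return frames[: n * clip_len].reshape(n, clip_len, *frames.shape[1:])

frames = np.zeros((40, 112, 112, 3), dtype=np.uint8)  # 40 dummy frames
clips = frames_to_clips(frames, clip_len=16)
print(clips.shape)  # (2, 16, 112, 112, 3)
```

Each clip is then a single input sample for the 3D-CNN.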

Figure 3: 2D convolution


Figure 4: 3D convolution


Figure 5: C3D network architecture

3) Video feature extraction model (R2Plus1D)
FFmpeg selects a reasonable number of frames to form a clip according to the video metadata. An R2Plus1D version of the CNN model produces the high-dimensional vector representation of the video features. It decomposes each 3D convolution into a spatial convolution followed by a temporal convolution, built on ResNet basic blocks. Compared with C3D, it improves the expressive power of the model without increasing the parameter count.
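The "same parameter count" claim can be illustrated with a small calculation: for a t x d x d 3D convolution, the (2+1)D factorization inserts an intermediate channel count M, chosen (as in the standard R(2+1)D construction) so the factorized pair matches the full 3D convolution's parameter budget. The channel counts below are arbitrary examples.

```python
import math

def params_3d(n_in: int, n_out: int, t: int = 3, d: int = 3) -> int:
    """Parameters of a full t x d x d 3D convolution (bias ignored)."""
    return n_in * t * d * d * n_out

def params_2plus1d(n_in: int, n_out: int, t: int = 3, d: int = 3) -> int:
    """Parameters of the factorized (2+1)D pair: a 1 x d x d spatial conv
    into M channels, then a t x 1 x 1 temporal conv. M is chosen so the
    total stays within the 3D convolution's parameter budget."""
    m = math.floor(t * d * d * n_in * n_out / (d * d * n_in + t * n_out))
    return n_in * d * d * m + m * t * n_out

print(params_3d(64, 64))       # 110592
print(params_2plus1d(64, 64))  # 110592 -- same budget, extra nonlinearity
```

The factorization thus adds an extra nonlinearity between the spatial and temporal convolutions without spending more parameters, which is where the expressiveness gain comes from.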

Figure 6: a) R3D model convolution kernel; b) R2Plus1D model convolution kernel; network architectures of the R3D and R2Plus1D models

4. Retrieval methods
1) Hash-based retrieval
The 1000-dimensional feature vectors extracted by the CNN model are persisted in a Redis database, and incremental data is dynamically added to the Redis store. To match feature vectors at query time, one approach is locality-sensitive hashing (LSH) over the high-dimensional feature space: the feature vector computed for each frame of the query video is normalized and hashed, and the most similar feature vectors in the database, and thus the corresponding videos, are retrieved.
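A minimal random-hyperplane LSH sketch, assuming sign-of-projection hashing into integer bucket keys; the hyperplane count, random seed, and bucket layout are illustrative choices, not the production configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def lsh_signature(vec: np.ndarray, planes: np.ndarray) -> int:
    """Sign-of-projection LSH: one bit per random hyperplane, packed into an int."""
    bits = (planes @ vec) >= 0
    return sum(1 << i for i, b in enumerate(bits) if b)

dim, n_bits = 1000, 16                       # 1000-dim CNN features, 16-bit keys
planes = rng.standard_normal((n_bits, dim))  # fixed random hyperplanes

v = rng.standard_normal(dim)
near = v + 0.01 * rng.standard_normal(dim)   # a slightly perturbed copy
# identical vectors always share a bucket; near-duplicates usually do
print(lsh_signature(v, planes), lsh_signature(near, planes))
```

In practice the signature serves as a Redis key, so a query only compares against vectors that landed in the same bucket.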
2) Clustering-based retrieval
Clustering methods avoid searching the whole space: the space is partitioned into several smaller subspaces. At query time, the subspace that the query vector falls into is located, and a traversal search is performed only within that subspace. Retrieval accuracy can be improved by increasing the number of indexed subspaces.
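The partition-then-search idea can be sketched as follows, assuming the cluster centroids have already been learned (e.g., by k-means) and each database vector is stored in an inverted list under its nearest centroid; the tiny 2-D vectors are purely illustrative.

```python
import numpy as np

def build_index(db: np.ndarray, centroids: np.ndarray) -> dict:
    """Assign each database vector to its nearest centroid (inverted lists)."""
    assign = np.argmin(np.linalg.norm(db[:, None] - centroids[None], axis=2), axis=1)
    index = {}
    for i, c in enumerate(assign):
        index.setdefault(int(c), []).append(i)
    return index

def search(q: np.ndarray, db: np.ndarray, centroids: np.ndarray, index: dict) -> int:
    """Traverse only the subspace (cluster) the query vector falls into."""
    c = int(np.argmin(np.linalg.norm(centroids - q, axis=1)))
    cand = index[c]
    return cand[int(np.argmin(np.linalg.norm(db[cand] - q, axis=1)))]

centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
db = np.array([[0.1, 0.2], [9.8, 10.1], [0.3, -0.1], [10.2, 9.9]])
index = build_index(db, centroids)
print(search(np.array([0.25, 0.0]), db, centroids, index))  # 2
```

Searching several of the nearest clusters instead of one trades speed for recall, which is the knob the text refers to when it says accuracy improves with more index subspaces.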

Figure 7: Feature vector clustering

3) Vector quantization
Vector quantization is the process of encoding the points of a vector space with a finite subset of representative points. Typical methods are product quantization (PQ) and inverted-file product quantization (IVFPQ). Product quantization is, in essence, a clustering method.
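A minimal product-quantization encoding sketch: the vector is split into M sub-vectors and each is replaced by the id of its nearest centroid in that subspace's codebook, which is why PQ is essentially per-subspace clustering. The tiny hand-made codebooks are illustrative; in practice they are learned with k-means.

```python
import numpy as np

def pq_encode(vec: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    """Product quantization: encode `vec` as M centroid ids.
    codebooks has shape (M, K, D/M): M subspaces, K centroids each."""
    m, k, sub = codebooks.shape
    subs = vec.reshape(m, sub)
    return np.array([
        int(np.argmin(np.linalg.norm(codebooks[i] - subs[i], axis=1)))
        for i in range(m)
    ])

# toy example: 4-dim vectors, M=2 subspaces, K=2 centroids per subspace
codebooks = np.array([
    [[0.0, 0.0], [1.0, 1.0]],   # codebook for dims 0-1
    [[0.0, 1.0], [1.0, 0.0]],   # codebook for dims 2-3
])
code = pq_encode(np.array([0.9, 1.1, 0.1, 0.9]), codebooks)
print(code)  # [1 0]
```

The compressed code (M small integers instead of D floats) is what makes exhaustive or inverted-file distance computation cheap at query time.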

5. Engineering architecture of the video duplicate content detection service
1) Video duplicate detection system architecture (2D-CNN + LSH algorithm)

Figure 8: Flowchart of the 2D-CNN + LSH algorithm

2) Video duplicate detection system architecture (3D-CNN + clustering algorithm)

Figure 9: Flowchart of the 3D-CNN + clustering algorithm

Origin yq.aliyun.com/articles/739779