An Overview of Video Summarization with Deep Learning

The pace of urban life keeps accelerating. When users browse videos, they often do not want to spend a lot of time watching the full video; more often, they only want the most essential information it contains. It is exactly this demand that has made film commentators such as Gu Amo so popular. This is where video summarization shows its value.

 

What is a video summary?

 

Video summarization automatically or semi-automatically extracts meaningful segments or frames from the original video by analyzing the video's structure and the spatiotemporal redundancy in its content. In terms of the technical process, video summaries can generally be divided into two types: static video summaries and dynamic video summaries. At this stage, our company mainly focuses on static video summarization, so let's talk about that next.

 

What is a static video summary?

 

Static video summarization is a technique that represents video content with a series of static semantic units extracted from the original video stream. Simply put, it extracts a set of key frames from a video and combines them into a summary, so that users can quickly browse the original content through a small number of frames. With further development, it can also provide users with fast content retrieval services.

 

For example, in the video of an online lecture, we can extract the frames that contain a complete PPT slide. By giving viewers all the frames that carry key information, they can grasp the main content of a long video in a much shorter time. As another example, extracting the key parts of a 2-hour movie and combining them into a 2-minute trailer is also a form of video summarization. The extraction process is roughly as follows:

 

 

 

Introduction to Static Video Summarization Technology

 

Static video summarization extracts the key frames of the original video by describing the features of each frame and comparing the feature differences between frames. Therefore, the first step of static video summarization is to obtain per-frame feature information. A minimal sketch of this idea is given below.
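To make "comparing inter-frame feature differences" concrete, here is a minimal Python sketch. It is only an illustration under our own assumptions: frames are assumed to already be represented as feature vectors (color histograms, CNN features, etc.), and a frame is kept as a key frame whenever its feature differs from the last kept frame by more than a threshold. This is not a specific published algorithm.

```python
import numpy as np

def select_keyframes_by_difference(features, threshold=0.5):
    """Pick key frames by comparing per-frame feature vectors.

    features : (T, D) array, one feature vector per frame
    threshold: minimum L2 distance from the last kept frame
    Returns the indices of the selected key frames.
    """
    keyframes = [0]                      # always keep the first frame
    last_kept = features[0]
    for t in range(1, len(features)):
        if np.linalg.norm(features[t] - last_kept) > threshold:
            keyframes.append(t)
            last_kept = features[t]
    return keyframes

# Toy run with random vectors standing in for real frame descriptors.
feats = np.random.rand(100, 128)
print(select_keyframes_by_difference(feats, threshold=3.0))
```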

 

Regarding image feature extraction: from AlexNet in 2012 to VGGNet and GoogLeNet in 2014, and through several years of the ILSVRC (ImageNet Large Scale Visual Recognition Challenge), image classification and feature extraction have reached a fairly mature state. The image-feature part of static video summarization therefore needs little extra work: an existing image classification network can be reused to extract the feature information of each frame of the video.
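For instance, a pretrained classification network can be turned into a frame feature extractor by dropping its classification layer. The sketch below is our own assumption (torchvision's GoogLeNet with ImageNet weights, not necessarily the exact network or preprocessing used in any particular paper; it assumes a recent torchvision that accepts the weights argument):

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Pretrained GoogLeNet with its classifier removed -> 1024-dim features.
model = models.googlenet(weights="DEFAULT")
model.fc = torch.nn.Identity()
model.eval()

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_frame_features(frames):
    """frames: list of HxWx3 uint8 RGB arrays -> (T, 1024) tensor."""
    batch = torch.stack([preprocess(f) for f in frames])
    return model(batch)
```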

 

(Image source: http://www.jianshu.com/p/58168fec534d)

 

 

(VGG network structure diagram, image source: http://x-algo.cn/index.php/2017/01/08/1471/)

 

(GoogLeNet model, figure from Google's official paper)

However, when people read an article or watch a video, they usually do not understand it from a single frame or word in isolation; they combine it with what they have seen before to understand the overall content. Traditional feed-forward neural networks cannot do this. Therefore, video and text summarization often requires a special type of neural network: the Recurrent Neural Network (RNN). An RNN is a network with a cyclic structure that can continuously retain previous information. Its general network structure is as follows:

Such a network can retain part of the previous information during summarization, thereby connecting the context. This is why it is widely used in text and video summarization experiments. A toy sketch of the recurrence is shown below.
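To show what "retaining previous information" means in practice, here is a bare-bones, hand-written recurrence (purely illustrative, with made-up dimensions): each step's hidden state is computed from the current input and the previous hidden state, so information from earlier frames can influence later ones.

```python
import numpy as np

def simple_rnn(inputs, W_x, W_h, b):
    """A bare-bones RNN: h_t = tanh(W_x @ x_t + W_h @ h_{t-1} + b)."""
    hidden_size = W_h.shape[0]
    h = np.zeros(hidden_size)
    states = []
    for x_t in inputs:                   # one feature vector per frame/word
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return np.stack(states)              # (T, hidden_size)

# Toy run: 10 steps of 16-dim inputs, 8-dim hidden state.
T_, D, H = 10, 16, 8
rng = np.random.default_rng(0)
outs = simple_rnn(rng.normal(size=(T_, D)),
                  rng.normal(size=(H, D)) * 0.1,
                  rng.normal(size=(H, H)) * 0.1,
                  np.zeros(H))
print(outs.shape)  # (10, 8)
```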

 

However, the traditional RNN still has a drawback: it cannot connect to information that lies far in the past. For example, to predict the last word in "I grew up in France... I speak fluent French", we need to refer back to "France", which is far from the current position; when the gap between the two words is very large, the RNN loses the ability to learn that connection. This is known as the "long-term dependency problem".

 

To solve this problem, a new network was proposed: the Long Short-Term Memory network, or LSTM for short. It is a special kind of recurrent neural network proposed by Hochreiter & Schmidhuber, and it is widely regarded as solving the long-term dependency problem that a plain RNN cannot. Unlike a plain RNN, it uses sigmoid gate layers (a forget gate and an input gate) to decide which values in the cell state should be discarded or updated, so that at each step the cell state keeps the relevant information up to date. Such networks are widely used in models that require contextual relevance.
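As a small illustration of how an LSTM carries a gated cell state across a sequence, here is a sketch using PyTorch's built-in nn.LSTM (the feature and hidden sizes are our own made-up example values): it consumes a sequence of frame features and returns one hidden state per time step plus the final cell state.

```python
import torch
import torch.nn as nn

# A single-layer LSTM that reads 1024-dim frame features.
lstm = nn.LSTM(input_size=1024, hidden_size=256, batch_first=True)

frame_feats = torch.randn(1, 120, 1024)   # (batch, time, feature)
outputs, (h_n, c_n) = lstm(frame_feats)

print(outputs.shape)  # (1, 120, 256): one hidden state per frame
print(h_n.shape)      # (1, 1, 256): final hidden state
print(c_n.shape)      # (1, 1, 256): final gated cell state
```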

 

 

 

The process of static video summarization:

 

Below we use an example to briefly describe the process of static video summarization. The ECCV 2016 paper "Video Summarization with Long Short-term Memory" uses LSTMs to perform video summarization. Its main model is as follows:

 

First, a GoogLeNet network is used to obtain the feature information of each frame of the video, corresponding to x1...xT in the figure above. These features are fed into the network; after two LSTM layers, y1...yT are the importance scores of the frames and φ1...φT are the inter-frame similarities. With this model, the obtained inter-frame similarity is used to temporally segment the whole video and avoid duplicate key frames. Once each frame's importance score is obtained, the key frames are selected according to the scores and the required number of key frames.
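The sketch below is our own simplified reading of this kind of model, not the authors' code: a bidirectional LSTM runs over the frame features, a small MLP maps each hidden state to an importance score, and the top-scoring frames are taken as key frames (the inter-frame similarity and temporal segmentation parts are omitted).

```python
import torch
import torch.nn as nn

class FrameScorer(nn.Module):
    """Bidirectional LSTM + MLP that scores each frame's importance."""
    def __init__(self, feat_dim=1024, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),                 # importance score in (0, 1)
        )

    def forward(self, feats):             # feats: (batch, T, feat_dim)
        h, _ = self.lstm(feats)
        return self.mlp(h).squeeze(-1)    # (batch, T) frame scores

def pick_keyframes(scores, num_keyframes=5):
    """Select the indices of the highest-scoring frames, in time order."""
    idx = torch.topk(scores, num_keyframes).indices
    return torch.sort(idx).values

scorer = FrameScorer()
feats = torch.randn(1, 120, 1024)          # e.g. per-frame CNN features
scores = scorer(feats)[0]
print(pick_keyframes(scores, num_keyframes=5))
```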

 

Finally, according to customer needs or the video content, the obtained key frames can be assembled into a key-frame atlas, or clustered and recombined into a short video that summarizes the content.
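If clustering is needed, one simple option (our own assumption, not a prescribed method) is to run k-means on the key-frame features and keep one representative frame per cluster, which can then be exported as an atlas or cut into a short recap video.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_representatives(keyframe_feats, n_clusters=5, seed=0):
    """Group key-frame features and return one representative index per group."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = km.fit_predict(keyframe_feats)
    reps = []
    for c in range(n_clusters):
        members = np.where(labels == c)[0]
        # Keep the member closest to the cluster centre as the representative.
        dists = np.linalg.norm(
            keyframe_feats[members] - km.cluster_centers_[c], axis=1)
        reps.append(int(members[np.argmin(dists)]))
    return sorted(reps)

print(cluster_representatives(np.random.rand(40, 1024), n_clusters=5))
```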

 

Summary:

 

Video summarization has very broad applications, and its technology has been a hot spot in computer vision over the past two years. Our company currently focuses on video summarization for conference scenes, combining video summaries with text summaries to present a complete meeting to users in a simpler form.

 
