The second work progress report

Work progress report for the fifth week

Video coding and decoding technology based on content perception

        The purpose of content-aware video coding and decoding technology is to reduce video transmission cost and save bandwidth. Video content characteristics refer to the texture characteristics and the motion characteristics of the objects or regions in each frame. As shown in Figure 1, the texture characteristics of an image relate to its content details, such as color richness, sharpness of object boundaries, and object shape and size. As shown in Figure 2, the motion characteristics of objects relate to how strongly the video content changes, such as the speed and direction of object movement, the difference between consecutive frames, and the severity of scene switching. A minimal sketch of how these two characteristics can be measured is given after Figure 2.

 

                Figure 1 The texture characteristics of the image

 

        Figure 2 Movement characteristics of objects
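
        As a rough illustration of the two characteristics above, the following sketch (my own, not part of the cited work, assuming 8-bit grayscale NumPy frames) measures texture as the average gradient magnitude and motion as the mean absolute difference between consecutive frames.

```python
import numpy as np

def texture_energy(frame: np.ndarray) -> float:
    """Average gradient magnitude; larger values mean richer detail and sharper edges."""
    f = frame.astype(np.float32)
    gx = np.abs(np.diff(f, axis=1)).mean()  # horizontal gradients
    gy = np.abs(np.diff(f, axis=0)).mean()  # vertical gradients
    return float(gx + gy)

def motion_intensity(prev_frame: np.ndarray, cur_frame: np.ndarray) -> float:
    """Mean absolute difference between consecutive frames; larger values mean faster change."""
    diff = np.abs(cur_frame.astype(np.float32) - prev_frame.astype(np.float32))
    return float(diff.mean())
```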

       At present, there are two classic approaches to content perception. One is saliency map prediction, as shown in Figure 3. It exploits the fact that when the human visual system takes in the full information of a scene, it concentrates on a few key targets within that scene. A saliency map locates the "eye-catching" content in an image. We can use the saliency map to extract the visual focus, giving the video encoder a basis for distributing coding weights; a small sketch of this step follows Figure 3.

 

                                                Figure 3 Saliency map
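
        To make the "basis for coding weight distribution" concrete, here is a hypothetical sketch (my own, not from the report) that pools a predicted saliency map onto a block grid and turns it into per-block QP offsets, so salient blocks receive finer quantization. The 16x16 block size, the ±4 offset range, and the linear mapping are all illustrative assumptions.

```python
import numpy as np

def saliency_to_qp_offsets(saliency: np.ndarray, block: int = 16,
                           max_offset: int = 4) -> np.ndarray:
    """saliency: HxW map with values in [0, 1]; returns an integer QP offset per block."""
    h, w = saliency.shape
    gh, gw = h // block, w // block
    pooled = saliency[:gh * block, :gw * block] \
        .reshape(gh, block, gw, block).mean(axis=(1, 3))
    # Salient blocks (pooled near 1) get negative offsets (finer quantization),
    # background blocks get positive offsets (coarser quantization).
    return np.round((0.5 - pooled) * 2 * max_offset).astype(int)
```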

        In the article "Image Saliency Detection Based on Deep Neural Networks", the author proposed a saliency map prediction network. The data set used in the experiment is the open source saliency detection data set SALICOM and iSUN. As shown in Figure 4, the network is divided into Local modules and global modules. The input image of the global module is a 96x96x3 RGB image, and the global module uses 3 convolutional layers for feature extraction. The input image of the local module is a 96x96 grayscale image. The first 5 convolutional layers of the pre-trained VGG-16 model are embedded in the CNN network. The CNN network contains an STN module and two convolutional layers. The last module of the network is to obtain the best combination value between two complementary information, including a splicing layer (concat), an activation function layer (maxout) and a fully connected layer (FC), and finally output a saliency map.

 

                                         Figure 4 Saliency map prediction network model
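
        The following PyTorch sketch mirrors the two-branch structure described above (global branch with 3 convolutional layers, local branch built on early pre-trained VGG-16 layers, fusion by concat, maxout, and FC). The layer widths, the coarse 12x12 output grid, and the omission of the STN module are my own simplifying assumptions rather than details from the cited paper.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16, VGG16_Weights

class SaliencyNet(nn.Module):
    def __init__(self, out_size: int = 12):
        super().__init__()
        # Global branch: 3 convolutional layers on the 96x96x3 RGB input.
        self.global_branch = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )  # 96x96 input -> 64 x 12 x 12
        # Local branch: first 5 pre-trained VGG-16 conv layers (STN module omitted
        # in this sketch), followed by two extra convolutional layers.
        vgg_layers = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features[:12]
        self.local_branch = nn.Sequential(
            vgg_layers,                                      # -> 256 x 24 x 24
            nn.Conv2d(256, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
        )  # -> 64 x 12 x 12
        # Fusion: concat, maxout over two projections, FC to a coarse saliency grid.
        self.proj_a = nn.Conv2d(128, 64, 1)
        self.proj_b = nn.Conv2d(128, 64, 1)
        self.fc = nn.Linear(64 * 12 * 12, out_size * out_size)
        self.out_size = out_size

    def forward(self, rgb: torch.Tensor, gray: torch.Tensor) -> torch.Tensor:
        # The local branch takes a grayscale crop; VGG-16 expects 3 channels,
        # so the single channel is simply replicated here.
        g = self.global_branch(rgb)
        l = self.local_branch(gray.repeat(1, 3, 1, 1))
        fused = torch.cat([g, l], dim=1)                           # concat
        fused = torch.max(self.proj_a(fused), self.proj_b(fused))  # maxout
        sal = torch.sigmoid(self.fc(fused.flatten(1)))             # FC -> saliency values
        return sal.view(-1, 1, self.out_size, self.out_size)
```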

        Another way to approach content perception is semantic segmentation. Semantic segmentation is classification at the pixel level: pixels belonging to the same category are grouped into one class. As shown in Figure 5, the pixels belonging to the person form one class and the pixels belonging to the motorcycle form another. We can likewise use semantic segmentation to extract the visual focus, giving the video encoder a basis for distributing coding weights; a minimal sketch of the pixel-level classification step follows Figure 5.

               Figure 5 Semantic segmentation
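
        A minimal sketch of what "classification at the pixel level" means in practice, assuming per-pixel class scores from any segmentation network: an argmax produces a label map, and the pixels of a chosen class form a mask that can act as the visual focus region. The class index in the example is hypothetical.

```python
import numpy as np

def label_map_from_scores(scores: np.ndarray) -> np.ndarray:
    """scores: C x H x W per-pixel class scores -> H x W label map (pixel-level classification)."""
    return scores.argmax(axis=0)

def class_mask(label_map: np.ndarray, class_id: int) -> np.ndarray:
    """Binary mask of the pixels assigned to one semantic class."""
    return (label_map == class_id).astype(np.uint8)

# Example: if class 1 were "person", its mask would mark the visual-focus region.
# roi = class_mask(label_map_from_scores(scores), class_id=1)
```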

        In the article "Context-aware encoding for clothing parsing", a semantic segmentation model based on color fashion analysis is proposed. The backbone network uses the FCN model and the MobilelNet model. After prediction, upsampling, and a softmax activation function, the network The edge branch establishes a lightweight COE architecture to improve feature extraction capabilities and reduce over-fitting. The final output of the network is a semantic segmentation map of people and color fashion.

        For the video codec part, the article "Research on Rate Control Algorithm for Dynamic Adaptive Streaming Video Transmission" introduces three x264 rate control modes. CBR (constant bit rate): the output bit rate is held at a fixed value. VBR (variable bit rate): simple parts are encoded at a low bit rate and complex parts at a high bit rate, but the output bit rate is uncontrollable. ABR (average bit rate): the bit rate can also be adjusted dynamically, but over a period of time the average bit rate stays close to the target bit rate, so the output file size can be controlled. We can combine the attention regions obtained by the methods above with the ABR rate control algorithm to save bandwidth.
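
        A hypothetical sketch of combining the attention region with ABR-style allocation: a group of frames shares the average bit budget implied by the target bit rate, and each frame's share is scaled by its importance weight, so the average bit rate still matches the target. The weighting scheme is illustrative, not the x264 implementation.

```python
from typing import List

def abr_allocate(weights: List[float], target_bitrate: float, fps: float) -> List[float]:
    """weights: one importance weight per frame (e.g. derived from the saliency or
    segmentation maps above); returns a bit budget per frame whose average matches
    the ABR target bitrate."""
    avg_bits_per_frame = target_bitrate / fps
    total = sum(weights) or 1.0
    # Each frame's share of the group budget is proportional to its weight, so the
    # sum over the group still equals len(weights) * avg_bits_per_frame.
    return [avg_bits_per_frame * len(weights) * w / total for w in weights]

# Example: 30 fps, 2 Mbps target; frames containing the attention region get larger weights.
# bits = abr_allocate([0.4, 0.9, 1.0, 0.5], target_bitrate=2_000_000, fps=30)
```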

        Summary of current shortcomings: ① lack of video quality evaluation standards; ② detection of fast-moving targets (motion blur appears when the camera or object moves at high speed, so the bit rate can be reduced appropriately there); ③ the methods found so far are image-based rather than video-based, and a video-oriented method should also take frame rate and other factors into account.

