Professor Zhai Guangtao of Shanghai Jiao Tong University: seeing is not believing, so how do we evaluate media experience quality?

Improving the quality of user experience is a key issue for audio and video media platforms. At the "Xiaohongshu REDtech Youth Technology Salon" on October 15, 2022, we were fortunate to have Professor Zhai Guangtao of the Department of Electronics, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, share a talk on "Media Experience Quality Evaluation". Starting from the human visual system, Professor Zhai explained the significance of media experience quality evaluation and the specific technical ideas behind it.

Zhai Guangtao: His research field is multimedia intelligence. He has published more than 400 papers in international journals and conferences, cited more than 10,000 times, and has been selected as an Elsevier Highly Cited Chinese Researcher. He has received honors including the National Excellent Doctoral Dissertation award and the national Excellent Young Scientist, Young Top Talent, and Distinguished Young Scholar titles, and has led NSFC key projects and national key R&D projects. His awards include the first prize in Natural Science of the Chinese Institute of Electronics; the PCS 2015 and IEEE ICME 2016 Best Student Paper Awards; the IEEE TMM 2018 Best Paper Award and 2021 Best Paper Nomination; and the IEEE MMC Workshop 2019, CVPR DynaVis Workshop 2020, and IEEE BMSB 2022 Best Paper Awards. He serves as editor-in-chief of Displays (Elsevier), editorial board member of Science China: Information Sciences, member of the IEEE CAS MSA and SPS IMVSP technical committees, vice chairman of the Young Scientists Club of the Chinese Institute of Electronics, director of the China Society of Image and Graphics, and vice chairman of the Shanghai Society of Image and Graphics.

The following content is compiled from Professor Zhai's on-site talk.

The human visual system

The human eye is the entry point for visual information, but after information enters the eye it undergoes very complex processing before we perceive it. Generally, we think of information as being captured by the retina, passing through the optic nerve to the lateral geniculate nucleus, and then on to the visual cortex for further processing; the early visual areas can be divided into regions such as V1, V2, V3, V4, and MT.

According to statistics, more than 50% of the neurons in the human cerebral cortex are involved in visual perception. In other words, visual perception is a very complex process, and what we see often differs from the visual stimulation received on the retina.

For example, as shown in Figure 1, if we move closer to the screen and stare at the small red dot in the middle, we will find that after staring for a while, the surrounding blue circle disappears. This phenomenon is called Troxler fading. When the information that peripheral vision provides is very limited and no longer changing, the human brain automatically ignores it, effectively making our eyes "blind" to it. The blue circle is imaged on the retina the whole time, but the brain decides it does not exist and renders it invisible. This shows that the brain's visual processing is far more complicated than the retinal signal.

Martinez-Conde, S., Macknik, S. L., & Hubel, D. H. (2004). The role of fixational eye movements in visual perception. Nature Reviews Neuroscience, 5, 229-240.

Another example: the picture below is a still image, but it appears to rotate slightly. Because of nystagmus and similar eye movements, peripheral vision registers a difference between where the image falls on the retina at one moment and the next, and this positional difference creates the illusion that the image is rotating. The image does not move, yet we feel that it does, which again illustrates the gap between our perception of the external world and the facts.


Here is another interesting image (below). First stare at the black dot in the middle of the left picture for about ten seconds, then look at the right picture: we will see a very magical phenomenon, for a brief moment the right picture appears to be in color.

This is because, while viewing the first image, the eye and the brain adapt to the color of each region. What we experience here is the phenomenon of color adaptation: the brain actively subtracts the adapted color from the subsequent visual stimulus. Yellow and blue are complementary colors, so after adapting to yellow, gray minus yellow looks blue and we see a blue sky; after adapting to blue, gray minus blue looks yellow and we see yellowish-green grass. That is why we see color in a grayscale image.

The picture below shows the visual contrast sensitivity function (CSF) and the JPEG quantization table. Frequency increases from left to right, and contrast decreases from bottom to top. You can imagine an oscillating signal that oscillates faster and faster from left to right while its amplitude shrinks from bottom to top. On the screen you can make out an envelope curve that is higher in the middle and lower on both sides. In other words, the smallest signal change we can detect is lowest at intermediate frequencies: we are most sensitive to mid-range spatial frequencies.

In fact, this property is exploited in virtually every digital image and video we see. Most images and videos are still compressed with DCT-based codecs, and the DCT quantization tables, whether in JPEG or MPEG, take our sensitivity to different frequency components into account: low and high frequencies can be quantized more coarsely, while intermediate frequencies are quantized more finely to better protect the components we are most sensitive to.
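This frequency-weighted quantization can be sketched with the standard JPEG luminance quantization table (from Annex K of the JPEG specification); the 8x8 block processing below is a minimal illustration, not any particular encoder's implementation:

```python
import numpy as np
from scipy.fft import dctn, idctn

# Standard JPEG luminance quantization table. Entries are smallest in the
# low/mid-frequency corner, where the CSF says the eye is most sensitive,
# and grow toward high frequencies, which are quantized more coarsely.
Q_LUMA = np.array([
    [16, 11, 10, 16,  24,  40,  51,  61],
    [12, 12, 14, 19,  26,  58,  60,  55],
    [14, 13, 16, 24,  40,  57,  69,  56],
    [14, 17, 22, 29,  51,  87,  80,  62],
    [18, 22, 37, 56,  68, 109, 103,  77],
    [24, 35, 55, 64,  81, 104, 113,  92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103,  99],
], dtype=float)

def jpeg_quantize_block(block):
    """Quantize one 8x8 pixel block the way a JPEG encoder does."""
    coeffs = dctn(block - 128.0, norm="ortho")   # level shift + 2-D DCT
    return np.round(coeffs / Q_LUMA)             # coarse where the eye is insensitive

def jpeg_dequantize_block(quantized):
    """Invert quantization and the DCT to recover an approximate block."""
    return idctn(quantized * Q_LUMA, norm="ortho") + 128.0
```

Each DCT coefficient is stored with an error of at most half its table entry, so the perceptually unimportant frequencies absorb most of the loss.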

Three words are hidden in the picture below. If you cannot see them clearly at your current distance, step further away, or take off your glasses, and you will see "HIDE AND SEEK" hidden in the image.


Why can't you see the words clearly up close? Because at that distance, the spatial frequency produced by the small dots falls in an insensitive range. As you move away, the spatial frequency increases, since more dots fall within a unit of visual angle, and the hidden words become visible. But as the distance increases further, beyond two or three meters, the frequency becomes too high and the words are again illegible. We can only read the information at a suitable viewing distance, which shows that there is a spatial frequency band to which we are most sensitive.
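The distance dependence is just geometry: the same physical pattern subtends a smaller visual angle from farther away, so its spatial frequency in cycles per degree rises. A small worked example (the 2 mm period is an arbitrary illustrative value):

```python
import math

def cycles_per_degree(period_mm, distance_mm):
    """Spatial frequency, in cycles per degree of visual angle, of a
    pattern with the given physical period seen from the given distance."""
    # visual angle subtended by one period of the pattern
    angle_deg = math.degrees(2 * math.atan(period_mm / (2 * distance_mm)))
    return 1.0 / angle_deg

# A 2 mm dot pattern: stepping back raises its spatial frequency,
# moving it into and then past the CSF's most sensitive band.
for d_mm in (300, 1000, 3000):   # 0.3 m, 1 m, 3 m
    print(d_mm, "mm ->", round(cycles_per_degree(2.0, d_mm), 1), "cpd")
```

Human contrast sensitivity peaks at a few cycles per degree, which is why the hidden text is readable only in a middle range of distances.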

There are many examples like this, and they all illustrate the same point: what we see and the signal itself can differ greatly.

Media experience quality evaluation

Because there are differences between the signal and what we see, between what falls on the retina and what we ultimately experience, evaluating the quality of a media experience poses many challenges; it is not an easy task.

Why is image quality imperfect in the first place? Between the external world and our eyes, a video communication system goes through many steps: a camera captures the signal, which is then processed and encoded; various distortions occur during transmission; and after decoding, the display presents the signal, which passes through the human visual system into our brain before we see it. Distortion can creep in at every step. For example, capture may introduce noise due to poor lighting or distance, hand shake may cause blur, compression may drop frames, transmission may cause frame loss, packet loss, and stuttering, and on the display side the screen may be too dim or reflective, or the viewer's eyesight may be poor. The quality of what we see is often imperfect, and that is why we need quality evaluation.
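A few of these distortion stages are easy to simulate; the sketch below uses a random array as a stand-in for a captured frame, with illustrative parameter values:

```python
import numpy as np

rng = np.random.default_rng(42)
image = rng.uniform(0, 255, (64, 64))   # stand-in for a captured frame

def add_noise(img, sigma=15.0):
    """Sensor noise from poor lighting: additive Gaussian noise."""
    return np.clip(img + rng.normal(0, sigma, img.shape), 0, 255)

def motion_blur(img, k=5):
    """Camera shake: horizontal box blur of width k."""
    kernel = np.ones(k) / k
    return np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode="same"), 1, img)

def darken(img, gain=0.4):
    """Under-exposure or a dim display: a global gain below 1."""
    return img * gain
```

Synthetic degradations like these are also how training data for many quality models is produced.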

According to statistics, humans will take nearly 1.5 trillion photos in 2022, which means more than 5,000 pictures are born every second. More than 80% of Internet traffic today is video, and leading video sites receive more than 300 hours of uploaded video every minute. In other words, there are so many images and videos on the Internet that no one could ever watch them all.

But another statistic tells us that more than 90%, or even 99%, of videos are rarely watched: 1% of videos account for 99% of viewing time. Popular videos are watched by everyone, while unpopular videos may never be seen at all. Quality here can mean both the content and the signal itself; today we focus mainly on signal quality. Many images and videos are never watched because their quality is not good enough, yet they occupy large amounts of storage and bandwidth on the Internet, causing enormous waste.

Some Statistics

Let's talk about the quality evaluation process. Here are a few images and videos: the first image is very clear, the second is too dark, the third has ghosting, followed by blur, and the last one stutters. Judging image quality this way relies on human subjective evaluation and scoring. But this process cannot be carried out entirely by humans: with more than 300 hours of video uploaded to YouTube every minute, it is impossible to find people to watch it all. We therefore hope that computers can perform objective quality evaluation, which is a necessary condition for large-scale automated processing of massive volumes of video and images.

Quality evaluation problems can be subdivided into several types. To compare two videos, we can use all or part of the information of the original video alongside the distorted video; these two settings are called full reference and reduced reference. We can also judge quality from the distorted video alone, which is called no reference.
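The full-reference setting is the simplest to illustrate. PSNR below is a classical full-reference metric (chosen here for brevity, not as the talk's method); the toy ramp image and shift distortions are illustrative:

```python
import numpy as np

def psnr(reference, distorted, peak=255.0):
    """Full-reference quality: peak signal-to-noise ratio in dB.
    Requires the pristine original, unlike no-reference methods."""
    mse = np.mean((reference.astype(float) - distorted.astype(float)) ** 2)
    if mse == 0:
        return float("inf")          # identical images
    return 10 * np.log10(peak ** 2 / mse)

ref = np.tile(np.arange(256, dtype=float), (16, 1))   # toy reference image
mild = np.clip(ref + 5, 0, 255)      # small uniform shift
severe = np.clip(ref + 40, 0, 255)   # larger distortion
```

A heavier distortion yields a lower PSNR; the no-reference setting is much harder precisely because `reference` is unavailable.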

In another setting, besides the first distorted video we also have a second distorted video, and the task is to judge, without any reference, which of the two versions of the same content, after different levels and types of distortion, has better relative quality. This is a common quality evaluation task. The most widely applicable setting is no-reference quality evaluation, because the videos we see on the Internet come without references; reference-based methods are only feasible at the encoding end.

On a larger scale, the perceptual quality evaluation of images, videos, and media is a branch of perceptual signal processing. Perceptual signal processing can be traced back to D. Marr's pioneering work on computational vision and computational neuroscience in the 1970s. Later, D. Hubel and T. Wiesel won the Nobel Prize in 1981 for their research on the information processing mechanisms of the visual system, and many experts have since made great contributions to the field.

Overall, visual perception signal processing includes three parts:

1. Build a visual model to simulate the perception process.

2. Develop an evaluation algorithm to measure the perceived quality of media experience.

3. Use evaluation results to further improve perceived quality.

This process is not simple. We recognized the problem twenty years ago and pointed out several challenges in image and video quality evaluation.

It is very difficult for machines to understand human perception, because we still know very little about how the human brain works. Moreover, in most cases the task is no-reference quality evaluation; without reference information the problem becomes much harder, since quality cannot be judged by simply comparing the distance between the visual signal and an original signal.

And once an evaluation criterion exists, integrating it into existing information processing systems to improve the perceived quality of images and videos is itself not a simple process.

We have several contributions in this direction:

Structured visual perception model

Regarding visual perception models, we found that current research falls into two categories. The first is physiologically inspired methods, which rely on physiological models and have very high complexity but low performance. The second is data-driven fitting methods, which ignore mechanism entirely and whose generalization ability is relatively poor. Our idea is therefore to take physiological psychology and information theory as the basis for modeling, and we propose a structured modeling approach spanning pixels, primitives, and the whole image. Covering low-level, mid-level, and high-level vision, we propose a retinal filter model, a local structure description model, and a free energy perception model.

Take the free energy perception model as an example. Our idea is to introduce the free energy principle from brain science into vision, provide a formal computational scheme, and propose efficient acceleration methods, so that the model can be widely used in quality evaluation.

No-reference quality assessment algorithm

We used energy inversion and distortion simulation to solve the problem of missing original information .

In energy inversion, we start from a distorted image. To use the free energy model for quality evaluation, we need to estimate the free energy information of the original image. Here we propose the concept of multi-scale self-similarity: we estimate the original image's free energy information from the multi-scale self-similarity of the distorted image, achieving relatively high-precision no-reference quality evaluation.

There is also pseudo-reference no-reference quality evaluation. The traditional idea is to estimate the original clean image from the quality-distorted one, but this inversion is difficult. Our idea is to go forward instead of backward: add further distortion to the distorted image to create a pseudo reference. If the image is very similar to its pseudo reference, its quality is poor; the less similar it is, the better the quality. Because this forward process is stable, the method is fast and suitable for large-scale application scenarios.
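The pseudo-reference idea can be sketched in a few lines of numpy. Note this is a minimal illustration of the principle only: the 3x3 box blur and the MSE distance are illustrative choices, not the actual design from the work described above.

```python
import numpy as np

def box_blur(img):
    """3x3 box blur, used as the 'further distortion' applied to the input."""
    acc = np.zeros_like(img, dtype=float)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            acc += np.roll(np.roll(img, dy, axis=0), dx, axis=1)
    return acc / 9.0

def pseudo_reference_sharpness(img):
    """Higher score = sharper input. A sharp image changes a lot when
    blurred further; an already-blurry image barely changes, i.e. it is
    similar to its pseudo reference, indicating lower quality."""
    pseudo_ref = box_blur(img)
    return np.mean((img - pseudo_ref) ** 2)

rng = np.random.default_rng(0)
sharp = rng.uniform(0, 255, (64, 64))   # high-frequency content
blurry = box_blur(sharp)                # simulated blur distortion
```

No clean original is needed anywhere in this pipeline, which is what makes the approach attractive at scale.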

For example, UGC is a defining characteristic of Xiaohongshu. UGC videos come from a wide range of sources, so real-world content is shot in uncontrolled environments and its quality cannot be guaranteed. How do we handle quality evaluation in this case? We propose very effective full-reference and no-reference feature extraction methods; once we have the features, we can always obtain a final score through regression or pooling.

UGC-VQA video quality assessment

We have made some modest contributions to full-reference and no-reference feature extraction; the specific models will not be covered in detail here. The method is relatively efficient and has been deployed on the live streaming and UGC upload platforms of several Internet companies.

The quality evaluation of streaming video is also an important part of our research. We considered stuttering, temporal characteristics, and image quality, using models such as CNNs, 3D CNNs, and GRUs. This algorithm has also been deployed and is in use at several Internet companies.

Research on audio and video quality evaluation and quality distribution

The media discussed so far are mainly images and videos, without considering audio. In addition, the earlier discussion of quality evaluation ignored some important issues.

Subjective quality score MOS: Mean Opinion Score

Quality evaluation in academia mainly uses the mean score, but does the mean represent quality well? As shown in the figure below, two score distributions can have almost the same mean but very different variances. If the satisfaction threshold is 48 points, the blue image satisfies essentially everyone, while a considerable number of people are dissatisfied with the image on the left. So thinking only in terms of the traditional mean is not enough.
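The mean-versus-distribution point can be made concrete with two made-up score sets (the scores and the 48-point threshold below are illustrative, not data from the study):

```python
import numpy as np

# Two images with identical MOS but very different score distributions
# (0-100 scale; a viewer scoring below 48 counts as "dissatisfied").
scores_a = np.array([50, 51, 49, 50, 50, 51, 49, 50])   # everyone roughly agrees
scores_b = np.array([80, 20, 85, 15, 78, 22, 80, 20])   # opinions split in two

for scores in (scores_a, scores_b):
    mos = scores.mean()
    dissatisfied = np.mean(scores < 48)
    print(f"MOS={mos:.1f}  std={scores.std():.1f}  dissatisfied={dissatisfied:.0%}")
```

Both sets have MOS 50, yet one leaves half the viewers dissatisfied, which is exactly the information the mean discards.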

Quality Score Distribution OSD: Opinion Score Distribution

When the cost of bit rate or resolution is considered, a higher bit rate or resolution does not always improve the user experience: experience exhibits a saturation effect. These two problems prompted us to ask whether a simple mean can represent quality at all. Our large-scale experiments show that the distribution of subjective scores for a video or image is not simple: it may have a long tail to the left or right, or even be bimodal. Our further work therefore uses an alpha-stable model to fit the distribution of subjective scores, and we propose an algorithm to estimate the parameters of this model, enabling more accurate prediction of image and video quality.
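One reason the alpha-stable family is a natural choice here (this is an illustrative sketch with scipy's `levy_stable`, not the paper's estimation algorithm): it contains the Gaussian as the special case alpha = 2, while alpha < 2 gives heavy tails and beta controls skew, matching the long-tailed and skewed score histograms described above.

```python
import numpy as np
from scipy.stats import levy_stable, norm

x = np.linspace(-4, 4, 9)

# alpha=2, beta=0 recovers a Gaussian (variance 2*scale**2 for scale=1),
# so the model subsumes the classical "scores are normal" assumption.
gaussian_case = levy_stable.pdf(x, 2.0, 0.0)

# alpha<2 gives heavier tails; a nonzero beta skews the distribution.
skewed = levy_stable.pdf(x, 1.5, 0.8)
```

Fitting all four parameters (alpha, beta, location, scale) to observed opinion scores then yields a full parametric description of the score distribution rather than just its mean.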

Approximation with Alpha-stable model

Audio-Visual Perception

In addition, the interaction between audio and video is an important consideration in media work. Our work here has two aspects. When building a visual saliency model, we can fuse audio saliency through correlation analysis between the audio and video; alternatively, we can use deep learning to build an end-to-end audio-visual saliency model directly.

Audio-Visual Attention Model 

In audio-visual quality evaluation, a drop in audio quality lowers the overall experience. We built a large-scale audio-visual quality evaluation database early on and proposed corresponding algorithms.

"Q&A" session

Q : Which types of image or video quality problems can be solved better in UGC scenarios, and which types of problems still have more room for optimization?

Zhai Guangtao: If each frame of a UGC video is considered as an image, then common image distortions such as blur, noise, and low light can be handled very well.

But if you consider UGC as video, the quality of the video itself varies over time: a UGC video may be very good in one frame and very poor in the next. That problem is genuinely challenging. So my short answer is that per-frame image quality is easy, but continuous quality variation across a whole video is harder.

Q: If you connect the eye and visual cortex to an EEG device and use a deep model to learn the mapping between the model's output and the brain's response, you could learn real visual perception end to end, then use that model as a perceptual loss for training other visual tasks; this loss function could serve as a quality evaluation metric. What does the professor think of this idea?

Zhai Guangtao: The problem with EEG itself is significant. EEG signals are very noisy, and the number of channels is small: the most common configurations are 64 or 128 channels, and more than 256 is very difficult, so the sampling is extremely sparse, while the number of cortical neurons is in the tens of billions. It is unrealistic to represent the degree of cortical neural activity with only a few electrodes. So although I would very much like this approach to work, the sampling is fundamentally too sparse, and it is not realistic at the moment.


Source: blog.csdn.net/REDtech_1024/article/details/130081188