Basic principles of video compression and codec


Liu Sining

Camera Technical Expert


Video Information and Compression Coding

1. Interaction between people and the world


Since the dawn of civilization, human beings have worked continuously to adapt to and transform their environment. The most basic prerequisite for doing so is using the senses to acquire information about the outside world. Through the different senses, humans interact with their environment in a variety of ways, for example:

Smell: identifying odors, detecting changes in the environment, and judging the quality of food and drinking water;
Hearing: recognizing communication from others and danger signals such as the approach of predators;
Taste: choosing the most suitable food;
Touch: essential when making and using tools.

Beyond these, the most important sense is vision. According to statistics, vision accounts for more than 70% of the information humans acquire through their senses, and it allows people to react most directly to changes in the environment.

As civilization developed, people were no longer satisfied with merely describing in words the images they saw; they hoped to record them in a more intuitive form. After years of development, video has become the most efficient way to record and reproduce information, capable of conveying a large amount of information in a relatively short time.

Video expresses information through the image of each frame;
the audio contained in a video can carry a great deal of information;
video also conveys information through the motion of the image and changes of scene.

In summary, video provides the representation of information that is closest to people's direct experience.

2. Video signal representation methods: RGB and YUV


Real-world images, as well as early video processing and transmission systems, deal with analog signals. To work with modern computers, network transmission, and digital video processing systems, however, the analog video signal must be converted into digital form.

RGB color space

Most current display devices use the RGB color space. RGB is designed around the principle of color mixing: like red, green, and blue lights shining on the same spot, when the colors are mixed their brightnesses add together. This is additive color mixing.

Different colors on the screen are formed by mixing the three basic colors of red, green, and blue in different proportions (weights). A group of red, green, and blue sub-pixels is the smallest display unit, and any color on the screen can be recorded and represented by a set of RGB values. Red, green, and blue are therefore called the three primary colors: R (red), G (green), and B (blue). The "amount" of each of R, G, and B refers to its brightness and is represented as an integer. With 8-bit representation, each of R, G, and B has 256 brightness levels, quantized as 0, 1, 2, ... up to 255. Note that although the highest value is 255, 0 is also a valid value, so there are 256 levels in total.


Representing color images in this way defines the RGB color space, which is commonly used in display systems. In an image represented in this form, each color component of each pixel is stored in 1 byte, so 256×256×256 different colors can be represented. Common image formats such as bitmap (BMP) store data in RGB form.

YUV color space

YUV is a family of color spaces used to encode true-color images. Terms such as Y'UV, YUV, YCbCr, and YPbPr are all loosely referred to as YUV, and their meanings overlap.

"Y" means brightness (Luminance, Luma), "U" and "V" are chroma (Chrominance, Chroma), including hue and saturation . Similar to the well-known RGB, YUV is also a color coding method, mainly used in television systems and analog video fields. It separates brightness information (Y) from color information (UV), and can display a complete image without UV information. It's just black and white. This design solves the compatibility problem between color TV and black and white TV very well. Moreover, YUV does not require three independent video signals to be transmitted simultaneously like RGB, so transmission in YUV mode takes up very little bandwidth.

In practical video processing such as encoding and decoding, the YUV format is used far more often than RGB. In YUV, a pixel is represented by a luminance component Y and two chrominance components U/V. The chrominance components may correspond to the luminance component one-to-one, or they may be subsampled so that there are fewer chrominance samples than luminance samples.

YCbCr color space

YUV can be subdivided into Y'UV, YUV, YCbCr, YPbPr, and other variants, of which YCbCr is the one used mainly for digital signals.

YCbCr is a color space defined by the ITU in ITU-R BT.601 (standard definition, SDTV), ITU-R BT.709 (high definition, HDTV), and ITU-R BT.2020 (ultra high definition).

YCbCr (SDTV) was defined as part of the ITU-R BT.601 recommendation during the development of worldwide digital video standards. It is essentially a scaled and offset, gamma-corrected version of YUV. Y has the same meaning as the Y in YUV, and Cb and Cr likewise describe color, just represented differently. Within the YUV family, YCbCr is the member most widely used in computer systems, and its range of application is very broad: JPEG, MPEG, and H.264 all use this format.

Cr reflects the difference between the red component of the RGB input signal and the luminance of the RGB signal, while Cb reflects the difference between the blue component of the RGB input signal and the luminance of the RGB signal; these are the so-called color-difference signals.

The "YUV" image in video communication system (especially video codec) is YCbCr. In ordinary work communication, the so-called YUV is also YCbCr. The following uses YUV to refer to YCbCr

Convert RGB format to YUV format

Y'= 0.299*R' + 0.587*G' + 0.114*B'
U'= -0.147*R' - 0.289*G' + 0.436*B' = 0.492*(B'- Y')
V'= 0.615*R' - 0.515*G' - 0.100*B' = 0.877*(R'- Y')

Convert YUV format to RGB format

R' = Y' + 1.140*V'
G' = Y' - 0.394*U' - 0.581*V'
B' = Y' + 2.032*U'
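As an illustration, here is a minimal NumPy sketch of the two conversions above (values assumed to be normalized to the 0..1 range; the function names are ours, not from any particular library):

```python
import numpy as np

# Coefficients copied from the formulas above.
RGB_TO_YUV = np.array([
    [ 0.299,  0.587,  0.114],   # Y'
    [-0.147, -0.289,  0.436],   # U'
    [ 0.615, -0.515, -0.100],   # V'
])

YUV_TO_RGB = np.array([
    [1.0,  0.000,  1.140],      # R'
    [1.0, -0.394, -0.581],      # G'
    [1.0,  2.032,  0.000],      # B'
])

def rgb_to_yuv(rgb):
    """rgb: (..., 3) array holding R', G', B' in [0, 1]."""
    return rgb @ RGB_TO_YUV.T

def yuv_to_rgb(yuv):
    """yuv: (..., 3) array holding Y', U', V'."""
    return yuv @ YUV_TO_RGB.T

if __name__ == "__main__":
    red = np.array([1.0, 0.0, 0.0])
    print(rgb_to_yuv(red))                 # ~[0.299, -0.147, 0.615]
    print(yuv_to_rgb(rgb_to_yuv(red)))     # round-trips back to ~[1, 0, 0]
```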

Convert RGB format to YCbCr format

According to ITU-R BT.601 standard. (This formula is most widely used)
Y = ( ( 66 * R + 129 * G + 25 * B + 128) >> 8) + 16
Cb = ( ( -38 * R - 74 * G + 112 * B + 128) >> 8) + 128
Cr = ( ( 112 * R - 94 * G - 18 * B + 128) >> 8) + 128

According to ITU-R BT.709 standard.

Y = 0.183R + 0.614G + 0.062B + 16
Cb = -0.101R - 0.339G + 0.439B + 128
Cr = 0.439R - 0.399G - 0.040B + 128

According to the JPEG full-range format:
Y = 0.299R + 0.587G + 0.114B + 0
Cb = -0.169R - 0.331G + 0.500B + 128
Cr = 0.500R - 0.419G - 0.081B + 128

Convert YCbCr format to RGB format

R' = 1.164*(Y'-16) + 1.596*(Cr'-128)
G' = 1.164*(Y'-16) - 0.813*(Cr'-128) - 0.392*(Cb'-128)
B' = 1.164*(Y'-16) + 2.017*(Cb'-128)
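A small sketch of the BT.601 integer approximation and the floating-point inverse given above, for single 8-bit pixels (function names are ours; real code would vectorize over whole frames):

```python
def rgb_to_ycbcr_bt601(r, g, b):
    """8-bit R, G, B -> video-range Y, Cb, Cr using the BT.601 integer formula above.
    Python's >> is an arithmetic shift, so negative intermediate sums behave as intended."""
    y  = (( 66 * r + 129 * g +  25 * b + 128) >> 8) + 16
    cb = ((-38 * r -  74 * g + 112 * b + 128) >> 8) + 128
    cr = ((112 * r -  94 * g -  18 * b + 128) >> 8) + 128
    return y, cb, cr

def ycbcr_to_rgb_bt601(y, cb, cr):
    """Video-range Y, Cb, Cr -> 8-bit R, G, B using the inverse formula above."""
    r = 1.164 * (y - 16) + 1.596 * (cr - 128)
    g = 1.164 * (y - 16) - 0.813 * (cr - 128) - 0.392 * (cb - 128)
    b = 1.164 * (y - 16) + 2.017 * (cb - 128)
    clip = lambda x: max(0, min(255, int(round(x))))
    return clip(r), clip(g), clip(b)

if __name__ == "__main__":
    y, cb, cr = rgb_to_ycbcr_bt601(255, 0, 0)    # pure red
    print(y, cb, cr)                             # roughly (82, 90, 240)
    print(ycbcr_to_rgb_bt601(y, cb, cr))         # approximately (255, 0, 0) again
```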

Chroma subsampling in YUV


YUV adopts this approach mainly because the human visual system is much more sensitive to luminance than to chrominance. Compared with other pixel formats, the biggest advantage of YUV is therefore that the sampling rate of the chrominance components can be reduced appropriately without noticeably degrading the image. This approach is also compatible with both black-and-white and color display devices: a black-and-white device simply discards the chrominance components and displays the luminance component alone.

Common chroma subsampling schemes in YUV are 4:4:4, 4:2:2, and 4:2:0, as shown in the following figure:
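One simple way to produce 4:2:0 chroma from a full-resolution chroma plane is to average each 2×2 neighbourhood, as in the sketch below; real encoders may use different filters and chroma siting, so treat this as an illustration only:

```python
import numpy as np

def subsample_420(chroma):
    """Average each 2x2 block of a full-resolution chroma plane (4:4:4 -> 4:2:0)."""
    h, w = chroma.shape
    c = chroma[:h - h % 2, :w - w % 2].astype(np.float32)
    blocks = c.reshape(h // 2, 2, w // 2, 2)      # group samples into 2x2 neighbourhoods
    return blocks.mean(axis=(1, 3)).round().astype(np.uint8)

if __name__ == "__main__":
    u_full = np.random.randint(0, 256, (720, 1280), dtype=np.uint8)
    u_420 = subsample_420(u_full)
    print(u_full.shape, "->", u_420.shape)        # (720, 1280) -> (360, 640)
```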

YUV storage format

There are three categories of YUV formats: planar, packed, semi-planar.

For the YUV format of planar, the Y of all pixels is stored continuously, followed by the U of all pixels, and then the V of all pixels.

For the packed YUV format, Y, U, and V of each pixel are continuously interleaved.

In the semi-planar YUV format, the Y samples of all pixels are stored contiguously first, followed by the U and V samples of all pixels stored contiguously and interleaved with each other.
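As a sketch of these layouts, the snippet below packs a 4:2:0 frame in the planar I420 arrangement (all Y, then all U, then all V) and in the semi-planar NV12 arrangement (all Y, then U/V interleaved); the helper names are ours:

```python
import numpy as np

def pack_i420(y, u, v):
    """Planar: all Y samples, then all U samples, then all V samples."""
    return np.concatenate([y.ravel(), u.ravel(), v.ravel()])

def pack_nv12(y, u, v):
    """Semi-planar: all Y samples, then U and V interleaved (U0 V0 U1 V1 ...)."""
    uv = np.empty(u.size + v.size, dtype=y.dtype)
    uv[0::2] = u.ravel()
    uv[1::2] = v.ravel()
    return np.concatenate([y.ravel(), uv])

if __name__ == "__main__":
    w, h = 1280, 720
    y = np.zeros((h, w), dtype=np.uint8)
    u = np.zeros((h // 2, w // 2), dtype=np.uint8)    # 4:2:0 chroma planes
    v = np.zeros((h // 2, w // 2), dtype=np.uint8)
    print(pack_i420(y, u, v).size)    # 1280*720*3/2 = 1382400 bytes
    print(pack_nv12(y, u, v).size)    # same total size, different byte order
```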

ITU-R Recommendation BT.2020

In the second half of 2012, the ITU Radiocommunication Sector (ITU-R) issued the BT.2020 recommendation for the new generation of ultra-high-definition (UHD) video production and display systems, redefining the parameters and specifications for ultra-high-definition video in television broadcasting and consumer electronics and pushing forward the standardization of 4K ultra-high-definition home display equipment. Most importantly, BT.2020 specifies that the UHD display system covers two stages, 4K and 8K: the physical resolution of 4K is 3840×2160, while that of 8K is 7680×4320. The reason there are two stages lies in the differences in how ultra-high-definition systems have developed in different regions of the world and in the possible technical obstacles in the transition from 4K to 8K; in most parts of the world, 4K is still used as the next-generation television broadcasting standard.

The importance of the BT.2020 standard is unquestionable. Just as BT.709 guided the manufacture of high-definition transmission and display equipment, BT.2020 has profoundly influenced the design and manufacture of ultra-high-definition display equipment in the consumer field, especially 4K flat-panel TVs. For example, the physical resolution of most current 4K flat-panel TVs follows the 3840×2160 of BT.2020 rather than the 4096×2160 of the DCI digital cinema standard. BT.2020 is not only about resolution, however; it also lays down requirements for color, refresh rate, signal format, and related parameters.

In terms of color, the 4K display standard BT.2020 improves considerably on BT.709: the performance specifications of the video signal are raised substantially. For example, the color depth is increased from the 8 bits of BT.709 to 10 or 12 bits, where 10 bits targets 4K systems and 12 bits targets 8K systems. This improvement plays a key role in enriching the color gradations and transitions of the image. The color gamut is also much larger than that of BT.709 and can display richer colors, but a wider gamut correspondingly places higher demands on the display device; with current 4K ultra-high-definition projectors, for instance, a new generation of laser or LED solid-state light-source models is often required to achieve it.

3. Video compression coding


The concept of encoding is widely used in the field of communication and information processing. Its basic principle is to express and transmit information in a certain form of code stream according to certain rules. Commonly used information that needs to be encoded mainly includes: text, voice, video, and control information.

1. Why video coding is needed

For video data, the main purpose of video coding is data compression. Dynamic images in raw pixel form represent a huge amount of data that storage and transmission systems simply cannot accommodate. For example, if each of the three RGB color components of a pixel is represented by one byte, each pixel needs at least 3 bytes, and a single image with a resolution of 1280×720 occupies about 2.76 MB.

For video at the same resolution with a frame rate of 25 frames per second, the bit rate required for transmission reaches about 553 Mb/s. For higher-definition video such as 1080p, 4K, and 8K, the required bit rate is even more staggering. Such data volumes are unaffordable for both storage and transmission, so compressing video data is the inevitable choice.
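The arithmetic behind those figures, as a quick sanity check (uncompressed 8-bit RGB):

```python
width, height, bytes_per_pixel = 1280, 720, 3    # 8-bit R, G and B per pixel
frame_bytes = width * height * bytes_per_pixel
print(frame_bytes / 1e6)                         # ~2.76 MB per frame

fps = 25
bits_per_second = frame_bytes * 8 * fps
print(bits_per_second / 1e6)                     # ~553 Mb/s for uncompressed 720p at 25 fps
```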

2. Why video information can be compressed

The reason video information leaves so much room for compression is that it contains a great deal of data redundancy. The main types are:

Temporal redundancy: the content of two adjacent frames is similar and related by motion.
Spatial redundancy: adjacent pixels within a frame are similar to each other.
Coding redundancy: different symbols in the video occur with different probabilities.
Visual redundancy: the viewer's visual system is more sensitive to some parts of the video than to others.

For these different types of redundancy, the various video coding standards employ different techniques that attack them from different angles in order to increase the compression ratio.

3. Video coding standardization organizations

Many important video codec standards have been produced by international organizations working independently and jointly. The major bodies and companies involved include ISO/IEC MPEG, ITU-T, Google, Microsoft, the AVS Working Group, and the AOMedia alliance.

ITU-T is the International Telecommunication Union - Telecommunication Standardization Sector. Its VCEG (Video Coding Experts Group) is mainly responsible for standards for the real-time communication field and has produced standards such as H.261, H.263, H.263+, and H.263++.

ISO is the International Organization for Standardization. Its MPEG (Moving Picture Experts Group) is mainly responsible for video standards for storage, broadcast television, and network transmission, and has produced MPEG-1, MPEG-4, and so on.

In fact, the standards with the strongest influence in industry have been produced by the two organizations working together, such as MPEG-2, H.264/AVC, and H.265/HEVC.

The main standards include: JPEG, MJPEG, JPEG 2000, H.261, MPEG-1, H.262/MPEG-2, H.263, MPEG-4 (Part 2/ASP), H.264/MPEG-4 (Part 10/AVC), H.265/MPEG-H (Part 2/HEVC), H.266/VVC, VP8/VP9, AV1, AVS1/AVS2, SVAC1/SVAC2, etc.

The development of video coding standards formulated by different standard organizations is shown in the following figure:

4. Introduction to the main video coding standards

4.1. JPEG

JPEG, short for Joint Photographic Experts Group, is the first international image compression standard. The JPEG image compression algorithm provides good compression performance together with relatively good reconstruction quality, and it is widely used in image and video processing.


4.2. MJPEG

M-JPEG (Motion JPEG, Motion Joint Photographic Experts Group) is a frame-by-frame still-image compression technique widely used in non-linear editing, where it allows frame-accurate editing and multi-layer image processing. It treats a moving video sequence as a series of independent still images and compresses each frame individually and completely. During editing, any frame can be accessed at random, so frame-accurate editing is possible. In addition, M-JPEG compression and decompression are symmetric and can be implemented with the same hardware and software. However, M-JPEG compresses only the spatial redundancy within a frame and does not compress the temporal redundancy between frames, so its compression efficiency is not high. With the M-JPEG digital compression format at a compression ratio of 7:1, image quality roughly equivalent to Betacam SP can be provided.

The JPEG standard is based on the DCT (Discrete Cosine Transform) and variable-length coding. Its key technologies include transform coding, quantization, differential coding of the DC coefficients, Huffman coding, and run-length coding.

The advantages of M-JPEG are that it can be edited with frame accuracy and that the equipment is relatively mature; the disadvantage is that its compression efficiency is low.

In addition, M-JPEG is not a single, fully unified compression standard: the codecs and storage formats of different manufacturers are not mutually compatible. In other words, each model of video server or encoder board has its own variant of M-JPEG, which makes exchanging data between servers, or moving data from a non-linear production network to a server, practically impossible.

4.3. JPEG 2000

    JPEG 2000 (JP2) is an image compression standard and coding system. It was created by the Joint Photographic Experts Group committee in 2000 with the intention of superseding their original discrete cosine transform-based JPEG standard (created in 1992) with a newly designed, wavelet-based method. The standardized filename extension is .jp2 for ISO/IEC 15444-1 conforming files and .jpx for the extended part-2 specifications, published as ISO/IEC 15444-2. The registered MIME types are defined in RFC 3745. For ISO/IEC 15444-1 it is image/jp2.

4.4. H.261

The H.261 video coding standard was born in 1988 and can be regarded as the first milestone of video compression coding: starting from H.261, video coding has used the waveform-based hybrid coding approach that is still in use today. The main goal of H.261 was highly real-time, low-bit-rate video transmission scenarios such as video conferencing and videophone.

When H.261 was created, the television standards of different countries were inconsistent and could not interoperate directly. To solve this source-format incompatibility, H.261 defined a Common Intermediate Format (CIF): the source is first converted to CIF for encoding and transmission, and the receiver decodes it and converts it back to its own format. H.261 specifies a luminance resolution of 352×288 for CIF and 176×144 for QCIF.

Source coding techniques adopted by H.261:

Intra-frame/inter-frame coding decision: made according to the correlation between frames - high correlation favours inter-frame coding, low correlation favours intra-frame coding.
Intra-frame coding: intra-coded frames are encoded by applying the DCT directly to 8×8 pixel blocks.
Inter-frame coding/motion estimation: macroblock-based motion-compensated predictive coding is used; for each macroblock, the best matching macroblock is found in the reference frame and its relative offset (Vx, Vy) is taken as the motion vector; the residual between the current macroblock and the predicted macroblock is then coded with the DCT and quantization.
Loop filter: in effect a digital low-pass filter that removes unnecessary high-frequency information in order to suppress blocking artifacts.

4.5. MPEG-1

The MPEG-1 standard was published in August 1993 for coding moving pictures and associated audio on digital storage media at data rates of about 1.5 Mbit/s. The standard consists of five parts:

Part 1 describes how video coded according to Part 2 (Video) and audio coded according to Part 3 (Audio) are combined into a system stream. Part 4 describes the procedures for verifying that the bitstreams produced by an encoder, or handled by a decoder, conform to the first three parts. Part 5 is a complete reference encoder and decoder implemented in C.

From the moment it was promulgated, MPEG-1 achieved a string of successes: the widespread use of VCD and MP3, the MPEG-1 software decoder shipped with Windows 95 and later versions, portable MPEG-1 cameras, and so on.

4.6. MPEG-2/H.262

The MPEG organization introduced the MPEG-2 compression standard in 1994 to enable interoperability between video/audio services and applications. MPEG-2 is a detailed specification of the compression scheme and system layer for standard-definition and high-definition digital television in various applications, with coding rates from about 3 Mbit/s to 100 Mbit/s; the formal specification is ISO/IEC 13818. MPEG-2 is not simply an upgrade of MPEG-1: it makes more detailed provisions and further improvements in the system and transport layers. MPEG-2 is particularly suited to the coding and transmission of broadcast-grade digital television and is the recognized coding standard for SDTV and HDTV.

The principle of MPEG-2 image compression is to exploit two properties of images: spatial correlation and temporal correlation. These correlations mean that images carry a large amount of redundant information. If this redundancy is removed and only the essentially uncorrelated information is transmitted, the transmission bandwidth can be reduced greatly, and the receiver can use that information, together with the decoding algorithm, to restore the original image while guaranteeing a certain image quality. A good compression coding scheme removes the redundant information in the image to the greatest possible extent.

MPEG-2 coded pictures fall into three categories: I-frames, P-frames, and B-frames. An I-frame is intra-coded, using only the spatial correlation within the single frame and not the temporal correlation. P-frames and B-frames are inter-coded, using both spatial and temporal correlation. A P-frame uses only forward temporal prediction, which improves compression efficiency and image quality; individual macroblocks in a P-frame may be forward-predicted or intra-coded. A B-frame uses bidirectional temporal prediction, which greatly increases the compression factor.

The MPEG-2 bitstream is organized into six hierarchical layers. To represent the coded data cleanly, MPEG-2 defines a syntax hierarchy whose levels, from top to bottom, are: video sequence, group of pictures (GOP), picture, slice, macroblock, and block.

4.7. H.263

H.263 is an ITU-T standard designed for low-bit-rate communication. In practice, however, it can be used over a wide range of bit rates, not only in low-bit-rate applications, and it can replace H.261 in many of them. The coding algorithm of H.263 is similar to that of H.261, with improvements and changes that raise performance and error resilience. H.263 provides better picture quality than H.261 at low bit rates. The differences between the two are: (1) H.263 uses half-pixel precision for motion compensation, while H.261 uses full-pixel precision together with a loop filter; (2) some parts of the data stream hierarchy are optional in H.263, so the codec can be configured for lower data rates or better error resilience; (3) H.263 contains four negotiable options to improve performance; (4) H.263 uses unrestricted motion vectors and syntax-based arithmetic coding; (5) it supports advanced prediction and a PB-frame prediction mode similar to that in MPEG; (6) H.263 supports five resolutions: in addition to the QCIF and CIF supported by H.261, it also supports SQCIF, 4CIF, and 16CIF, where SQCIF is roughly half the resolution of QCIF, and 4CIF and 16CIF are 4 and 16 times CIF, respectively.

H.263+, which ITU-T released in 1998, is the second version of the H.263 recommendation. It provides 12 new negotiable modes and other features that further improve compression performance. For example, whereas H.263 has only five source picture formats, H.263+ allows many more, and offers several choices of picture clock frequency, broadening the range of applications. Another important improvement is scalability: multiple display rates, bit rates, and resolutions are allowed, which improves the delivery of video over heterogeneous networks prone to bit errors and packet loss. H.263+ also improves the unrestricted motion vector mode of H.263. Together with the 12 new optional modes, this both raises coding performance and increases application flexibility. H.263 has essentially replaced H.261.

4.8. MPEG-4 (Part2/ASP)

The Moving Picture Experts Group (MPEG) officially announced the first version of the MPEG-4 standard (ISO/IEC 14496) in February 1999. The second edition was finalized at the end of the same year and became an international standard in early 2000.

MPEG-4 is very different from MPEG-1 and MPEG-2. It is not merely a specific compression algorithm; it is an international standard created to serve the integration and compression needs of digital television, interactive graphics applications (synthetic video and audio content), and interactive multimedia (the WWW, data capture and distribution). MPEG-4 integrates many multimedia applications into one framework, aiming to provide standard algorithms and tools for multimedia communication and application environments and thus establish a unified data format that can be used widely in multimedia transmission, storage, retrieval, and other applications.

4.9. H.264/MPEG-4 (Part 10/AVC)

H.264 is the new-generation video compression coding standard developed by the Joint Video Team (JVT) formed by ISO/IEC and ITU-T. In ISO/IEC it is named AVC (Advanced Video Coding) and constitutes Part 10 of the MPEG-4 standard; in ITU-T it is officially designated H.264.

4.10. H.265/HEVC

H.265 is the video coding standard developed after H.264 by ITU-T VCEG together with ISO/IEC MPEG. It builds on the existing H.264 standard, retaining some of the original techniques while improving others, and uses new techniques to improve the trade-offs between bit rate, coding quality, delay, and algorithmic complexity. The specific work includes raising compression efficiency, improving robustness and error resilience, reducing real-time latency, reducing channel acquisition time and random-access delay, and reducing complexity. Thanks to algorithmic optimization, H.264 can deliver standard-definition digital video at below 1 Mbit/s, while H.265 can deliver ordinary 720p (1280×720) high-definition audio and video at a transmission rate of 1~2 Mbit/s.

4.11. VP8/VP9

https://en.wikipedia.org/wiki/VP8

4.12. AV1

AV1 (AOMedia Video Codec 1.0):
Relevant websites:
http://audiovisualone.com : AV1 company website
http://audio-video1.com : a provider of video and audio applications aimed mainly at families
http://www.ctolib.com/learndromoreira-digital video introduction.html : related learning materials
https://xiph.org/daala : Daala, the video compression technology from Xiph.Org and Mozilla
http://aomedia.org : official website of the Alliance for Open Media, founded mainly by hardware manufacturers and the main body releasing AV1
http://aomanalyzer.org/ : website of the bitstream analysis tool for the AV1 codec

   AV1 background:
In early 2015, Google was working on VP10, Xiph was continuing research on the Daala video compression technology, and Cisco had open-sourced a royalty-free video codec named Thor. MPEG-LA then announced licensing fees for H.265 for the first time (more than eight times those of H.264). Because of the high cost of H.265, a group of companies (Amazon, Cisco, Google, Intel, Microsoft, Mozilla, Netflix) jointly founded the Alliance for Open Media in September 2015, with content provided by Google, Netflix, and Amazon and the website maintained by Google and Mozilla. On April 5, 2016, AMD, ARM, and NVIDIA also joined the alliance. All of these companies shared a common goal: to develop a royalty-free video codec. AV1 emerged from this effort.

   Introduction to AV1:
AV1 is a royalty-free, open-source video coding standard released by the Alliance for Open Media (AOMedia) that integrates some of the best coding ideas from Daala, Thor, and VPx. The first version, 0.1.0, was released on April 7, 2016. The main goal of AV1 at this stage is to achieve coding performance roughly 50% better than VP9/HEVC while keeping the increase in encoding and decoding complexity reasonable.
    AOMedia stated that it expected to finalize the AV1 bitstream between the end of 2016 and March 2017, with the first hardware support in March 2018. Note that one of the biggest differences between AV1 and earlier video coding standards is its stronger bias toward hardware codecs, which reflects the commercial interests of the AV1 initiators (mostly hardware manufacturers).

   AV1 resource download:
AV1 source code: https://aomedia.googlesource.com/aom
AV1 test video sequences: http://media.xipha.org/video/derf/
Source code of the AV1 stream analysis tool: https://github.com/mbebenita/aomanalyzer
AV1 stream analysis tool: https://people.xiph.org/~mbebenita/analyzer/

4.13. AVS1/AVS2

The AVS1 standard is an audio and video codec standard developed independently in China and is used mainly for compressed transmission of domestic satellite television. Its compression efficiency is lower than that of H.264.

4.14. SVAC1/SVAC2

SVAC (Surveillance Video and Audio Coding) is an audio and video codec standard for video surveillance applications proposed by Vimicro and the Public Security Institute. The standard was released on December 23, 2010 and came into force on May 1, 2011.

The main technical features of the SVAC standard are as follows:

1) High precision, supporting 8-bit to 10-bit sample depth;
2) Intra-frame 4x4 prediction and transform/quantization, plus technologies such as adaptive frame/field coding (AFF) and CABAC;
3) ROI-based variable-quality coding and SVC scalable coding;
4) Surveillance-specific information, data security protection, and encryption/authentication.

4.15. H.266/VVC

    JVET was founded as the Joint Video Exploration Team (on Future Video coding) of ITU-T VCEG and ISO/IEC MPEG in October 2015. After a successful call for proposals, it transitioned into the Joint Video Experts Team (also abbreviated to JVET) in April 2018 with the task to develop a new video coding standard.
The new video coding standard was named Versatile Video Coding (VVC).


4. Basic technologies of video compression coding

Redundant information of video signal

Take the YUV component format used for recording digital video as an example: Y is the brightness and U/V are the two color-difference signals. In the existing PAL television system, for instance, the luminance signal is sampled at 13.5 MHz, while the bandwidth of the chrominance signals is usually half or less of the luminance signal, sampled at 6.75 MHz or 3.375 MHz. Taking 4:2:2 sampling as an example, the Y signal is sampled at 13.5 MHz and the chrominance signals U and V at 6.75 MHz, and each sample is quantized with 8 bits; the bit rate of the digital video is then:

13.5×8 + 6.75×8 + 6.75×8 = 216 Mbit/s

If such a large amount of data is directly stored or transmitted, it will encounter great difficulties, so compression technology must be used to reduce the bit rate. The digitized video signal can be compressed mainly based on two basic conditions:

  • Data redundancy, such as spatial redundancy, temporal redundancy, structural redundancy, and information-entropy redundancy, i.e. the strong correlation between image pixels. Removing this redundancy loses no information; this is lossless compression.
  • Visual redundancy. Properties of the human eye, such as the luminance discrimination threshold, the visual threshold, and its different sensitivity to luminance and chrominance, mean that a suitable amount of error can be introduced during coding without being noticed. These visual characteristics can be exploited to trade a certain amount of objective distortion for data compression; this is lossy compression.

Compression of the digital video signal rests on these two conditions, allowing the data volume to be reduced greatly, which benefits transmission and storage. The general approach to digital video compression is hybrid coding, in which transform coding, motion estimation and motion compensation, and entropy coding are combined. Transform coding is typically used to remove intra-frame redundancy, motion estimation and compensation to remove inter-frame redundancy, and entropy coding to further raise compression efficiency. These three methods are introduced briefly below.


To deal with the various kinds of redundancy in video information, video compression coding uses several techniques to raise the compression ratio; the common ones are predictive coding, transform coding, and entropy coding.

1. Transform coding

Today's mainstream video coding algorithms are all lossy: by accepting a limited, tolerable loss in the video, they achieve much higher coding efficiency. The information loss occurs in the transform and quantization stages. Before quantization, transform coding maps the image information from the spatial domain into the frequency domain, and the transform coefficients are computed for subsequent coding. Video coding algorithms normally use orthogonal transforms for this step; common choices are the discrete cosine transform (DCT), the discrete sine transform (DST), and the Karhunen-Loève (KL) transform.

The function of transform coding is to transform the image signal described in the space domain into the frequency domain, and then encode the transformed coefficients. In general, images have a strong correlation in space, and transforming them to the frequency domain can achieve decorrelation and energy concentration. Commonly used orthogonal transforms include discrete Fourier transform, discrete cosine transform and so on. The discrete cosine transform is widely used in the process of digital video compression.

The discrete cosine transform (DCT) transforms an L×L image block from the spatial domain into the frequency domain. In DCT-based image compression, the image is therefore first divided into non-overlapping blocks. For a 1280×720 frame, for example, the image is first partitioned, in grid form, into 160×90 non-overlapping blocks of size 8×8, and the DCT is then applied to each block.

After the partitioning, each 8×8 block is fed to the DCT encoder, which transforms it from the spatial domain to the frequency domain. The figure below shows an example of an actual 8×8 block, where the numbers are the luminance values of the pixels. The luminance values within the block are fairly uniform, and in particular neighbouring pixels differ little, showing that the image signal has strong spatial correlation.

The next figure shows the result of applying the DCT to the block above: after the transform, the low-frequency coefficients in the upper-left corner concentrate most of the energy, while the high-frequency coefficients toward the lower-right corner carry very little energy.
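Since the pixel values in the figures are not reproduced here, the sketch below applies an orthonormal 2-D DCT to a made-up smooth 8×8 block; the point is only to show the energy compacting into the top-left (low-frequency) corner:

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis matrix."""
    k = np.arange(n)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c

def dct2(block):
    """2-D DCT of an 8x8 block: C · X · C^T."""
    c = dct_matrix()
    return c @ block @ c.T

if __name__ == "__main__":
    # A smooth, nearly uniform luminance block (hypothetical values).
    block = 120 + np.arange(8)[:, None] + np.arange(8)[None, :]
    coeff = dct2(block.astype(np.float64) - 128)   # level-shift by 128, as JPEG does
    np.set_printoptions(precision=1, suppress=True)
    print(coeff)   # significant coefficients only in the top-left corner
```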

After the DCT, the signal must be quantized. The human eye is sensitive to the low-frequency characteristics of an image, such as the overall brightness of objects, but not to the high-frequency details, so during transmission less (or no) high-frequency information can be sent and only the low-frequency part retained.

The example in the figure below shows that after the small high-frequency components are removed, the bit rate of the image is compressed greatly while the subjective quality of the image does not degrade in the same proportion.

In the quantization process, the coefficients in the low-frequency region are quantized finely and those in the high-frequency region coarsely, removing high-frequency information to which the eye is insensitive and so reducing the amount of information to transmit. Quantization is therefore a lossy step and the main source of quality loss in video compression coding.

The quantization process can be expressed by the following formula:

FQ(u, v) = round( F(u, v) / (Q(u, v) × q) )

where FQ(u, v) is the quantized DCT coefficient, F(u, v) the DCT coefficient before quantization, Q(u, v) the quantization weighting matrix, q the quantization step size, and round() rounding to the nearest integer.

With a reasonable choice of quantization parameters, the result of quantizing the transformed image block is as shown in the figure.


Most of the DCT coefficients become 0 after quantization and only a small number remain non-zero; only these non-zero values need to be compressed and encoded.
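A sketch of the quantization formula above, using the standard JPEG luminance table as a stand-in for the weighting matrix Q(u, v) (the actual matrix used in the figures is not available, so this is illustrative):

```python
import numpy as np

# Standard JPEG luminance quantization table, used here as the weighting matrix Q(u, v).
Q = np.array([
    [16, 11, 10, 16,  24,  40,  51,  61],
    [12, 12, 14, 19,  26,  58,  60,  55],
    [14, 13, 16, 24,  40,  57,  69,  56],
    [14, 17, 22, 29,  51,  87,  80,  62],
    [18, 22, 37, 56,  68, 109, 103,  77],
    [24, 35, 55, 64,  81, 104, 113,  92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103,  99],
], dtype=np.float64)

def quantize(coeff, q=1.0):
    """FQ(u, v) = round(F(u, v) / (Q(u, v) * q))."""
    return np.round(coeff / (Q * q)).astype(np.int32)

def dequantize(fq, q=1.0):
    """Approximate reconstruction used on the decoder side."""
    return fq * Q * q

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    coeff = rng.normal(0, 30, (8, 8))    # stand-in for DCT coefficients
    coeff[0, 0] = 900.0                  # large DC term, as is typical
    fq = quantize(coeff, q=2.0)
    print(np.count_nonzero(fq), "non-zero coefficients out of 64")
```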

2. Entropy coding

Entropy coding in video coding is mainly used to remove statistical redundancy. Because the symbols emitted by the source occur with different probabilities, representing all symbols with codewords of the same length is wasteful. Entropy coding assigns codewords of different lengths to different syntax elements, which effectively removes the redundancy caused by the symbol probabilities. The entropy coding methods commonly used in video coding are variable-length coding and arithmetic coding; concretely, the main schemes are context-adaptive variable-length coding (CAVLC) and context-adaptive binary arithmetic coding (CABAC).
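As a small taste of variable-length coding, the sketch below implements the unsigned Exp-Golomb code that H.264 uses for many syntax elements: frequent small values receive short codewords and rare large values receive long ones. This is only a toy illustration of the idea, not CAVLC or CABAC themselves.

```python
def exp_golomb(value: int) -> str:
    """Unsigned Exp-Golomb code: write (value + 1) in binary and prefix it
    with (number of bits - 1) zeros."""
    code = bin(value + 1)[2:]                 # binary of value+1 without the '0b'
    return "0" * (len(code) - 1) + code

if __name__ == "__main__":
    for v in range(6):
        print(v, "->", exp_golomb(v))
    # 0 -> 1, 1 -> 010, 2 -> 011, 3 -> 00100, 4 -> 00101, 5 -> 00110
```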

3. Predictive coding

In video compression, "predicting" a block means finding (or constructing by some method) the pixel block that is most "similar" to the current block among the blocks surrounding it.

Predictive coding can be used to deal with redundancy in the temporal and spatial domains in video. Predictive coding in video processing is mainly divided into two categories: intra prediction and inter prediction.

Intra-frame prediction: the predicted value and the actual value lie in the same frame; it removes the spatial redundancy of the image. Its compression ratio is relatively low, but the frame can be decoded independently, without relying on data from other frames. The key frames in a video are usually intra-predicted.
Inter-frame prediction: the actual value lies in the current frame while the predicted value lies in a reference frame; it removes the temporal redundancy of the image. Its compression ratio is higher than that of intra prediction, but the frame cannot be decoded independently; the current frame can only be reconstructed after the reference frame data has been obtained.

Usually in the video code stream, all I frames use intra-frame coding, and data in P-frames/B-frames may use intra-frame or inter-frame coding.

Motion estimation and motion compensation are effective means of removing the temporal correlation of an image sequence. The DCT, quantization, and entropy coding methods introduced above operate on a single frame and remove the spatial correlation between pixels within the image. Besides spatial correlation, however, image signals also exhibit temporal correlation. For digital video such as a news broadcast, where the background is static and the main subject moves only slightly, consecutive pictures differ very little and are highly correlated. In this case there is no need to encode every frame independently; instead, only the parts that change between adjacent frames need be encoded, which further reduces the amount of data. This is accomplished by motion estimation and motion compensation.

Motion estimation divides the current input image into small, non-overlapping sub-blocks. For a 1280×720 frame, for example, the image is first partitioned, in grid form, into 80×45 non-overlapping 16×16 blocks. For each block, the most similar block is then sought within a certain search window in the previous (or next) image; this search is called motion estimation. From the relative positions of the block and its best match, a motion vector is obtained. During encoding, the block in the current image is subtracted from the most similar block in the reference image pointed to by the motion vector, producing a residual block. Because the values of the residual pixels are very small, a higher compression ratio can be obtained; this subtraction is called motion compensation.
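A minimal full-search block-matching sketch of the motion estimation and compensation just described (SAD cost, a small search window; all names are ours):

```python
import numpy as np

def motion_search(cur_block, ref, top, left, search=8):
    """Full search: find the offset (dy, dx) in `ref` minimizing the SAD to `cur_block`."""
    n = cur_block.shape[0]
    h, w = ref.shape
    best, best_sad = (0, 0), np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + n > h or x + n > w:
                continue                      # candidate block falls outside the frame
            cand = ref[y:y + n, x:x + n]
            sad = np.abs(cur_block.astype(np.int32) - cand.astype(np.int32)).sum()
            if sad < best_sad:
                best_sad, best = sad, (dy, dx)
    return best, best_sad

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    ref = rng.integers(0, 256, (64, 64), dtype=np.uint8)
    # The current 16x16 block is the reference content shifted by (3, -2).
    cur = ref[19:35, 14:30]
    (dy, dx), sad = motion_search(cur, ref, top=16, left=16)
    print(dy, dx, sad)   # expected: 3 -2 0
    # Motion compensation: the prediction is ref[16+dy:16+dy+16, 16+dx:16+dx+16];
    # the encoder transmits the motion vector plus the (here all-zero) residual.
```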

Because motion estimation and motion compensation require a reference image during encoding, the choice of reference is important. In general, the encoder classifies each input frame into one of three types according to its reference image: I (Intra) frames, P (Prediction) frames, and B (Bidirectional prediction) frames, as shown in the figure.

As the figure shows, an I-frame is coded using only the data within that frame and needs no motion estimation or compensation; since it does not remove the correlation in the time direction, its compression ratio is relatively low. A P-frame is coded using a previous I- or P-frame as the reference for motion compensation, so what is actually coded is the difference between the current image and the reference. A B-frame is coded like a P-frame, except that it uses both a previous I- or P-frame and a following I- or P-frame for prediction. Thus each P-frame needs one reference picture while a B-frame needs two; correspondingly, B-frames achieve a higher compression ratio than P-frames.

4. Hybrid coding

The preceding sections introduced the main techniques used in video compression coding. In practice these methods are not used in isolation; they are combined to obtain the best compression. The figure below shows the hybrid coding model (transform coding + motion estimation/compensation + entropy coding), which is widely used in MPEG-1, MPEG-2, H.264, and other standards.

Hybrid coding model

As the figure shows, the current input image is first divided into blocks. Each block is subtracted from the motion-compensated prediction to obtain the difference (residual) block X, which is then DCT-transformed and quantized. The quantized output goes to two destinations: one copy is sent to the entropy coder, and the coded bitstream is written to a buffer to await transmission; the other copy is dequantized and inverse-transformed to give X', which is added to the motion-compensated prediction block to form a new reconstructed block that is stored in the frame memory, serving as a reference for later frames.
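Putting the pieces together, the per-block data flow of the hybrid model can be sketched as follows, assuming the motion-compensated prediction has already been produced and using a simple uniform quantizer (an illustration of the loop, not a real encoder):

```python
import numpy as np

def dct_mat(n=8):
    """Orthonormal DCT-II basis matrix."""
    k = np.arange(n)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c

C = dct_mat()

def hybrid_encode_block(cur, pred, q=8.0):
    """One block through the hybrid loop, given a motion-compensated prediction."""
    residual = cur - pred                          # X  = current block - prediction
    fq = np.round((C @ residual @ C.T) / q)        # forward DCT + uniform quantization
    # fq is what goes to the entropy coder / output buffer; below is the local decoder.
    recon_residual = C.T @ (fq * q) @ C            # X' = dequantize + inverse DCT
    recon = pred + recon_residual                  # stored in frame memory as reference
    return fq, recon

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    pred = rng.normal(128, 20, (8, 8))             # motion-compensated prediction
    cur = pred + rng.normal(0, 3, (8, 8))          # current block = prediction + small residual
    fq, recon = hybrid_encode_block(cur, pred)
    print(int(np.count_nonzero(fq)), "coefficients to entropy-code;",
          "max reconstruction error", round(float(np.abs(recon - cur).max()), 2))
```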

For more information about video codecs, see the Zhihu column article: auxten, "[Start from scratch] Understanding video codec technology".

