iOS Audio and Video Fundamentals (3) - H.264 Coding Principles

1. Video-related basic content

1.1 Pixel

Simply put, pixels and resolution describe the fineness, or clarity, of a displayed image. Points form lines and lines form surfaces, and images (pictures and video frames) work the same way: an image is composed of many, many tiny dots, which we call pixels. Within the same physical area, the more pixels there are, the sharper the displayed image; the fewer pixels, the blurrier it looks. On screen, pixels are usually rendered as dots or small squares. For example, in a 20-megapixel portrait photo even the pores may be clearly visible, while at 1 megapixel the nose and eyes may be blurred; this is what it means to use the pixel count to describe picture clarity. When we say 10 million, 20 million, or 100 million pixels, we are referring to the total number of pixels in the image. See Figure 1 and Figure 2 below; Figure 2 shows Figure 1 magnified 11 times, and the many small squares you can see there are individual pixels. Normally they are too small to notice.

Figure 1

Figure 2

1.2 Resolution

Video resolution refers to the size, in pixels, of the image produced by a video imaging product, expressed as horizontal pixels × vertical pixels. Common image resolutions are QCIF (176×144), CIF (352×288), D1 (704×576), 720P (1280×720), and 1080P (1920×1080). The maximum resolution a camera can capture is determined by its CCD or CMOS sensor. Most cameras now also allow the resolution to be changed; lower resolutions are generated by the camera's own software cropping the original image.

1.3 Frame Rate

One frame is a single still picture, and a sequence of frames forms moving images such as a movie. The frame rate is the number of frames transmitted per second, usually expressed in fps (Frames Per Second). Because each frame is a still image, displaying frames in quick succession creates the illusion of motion and reproduces the state of the scene at each moment. Higher frame rates give smoother, more realistic motion: the more frames per second, the smoother the displayed movement. Generally speaking, 25 fps or 30 fps is sufficient.

1.4 Bit Rate (Data Rate)

The bit stream is the amount of data per unit time in the encoded, compressed video; it is also called the bit rate and is the most important factor in controlling picture quality in video coding. At the same resolution, the smaller the compression ratio, the higher the video's bit rate and the better the picture quality.

The relationship between resolution, frame rate, and code stream:

  • The frame rate is directly proportional to the fluency;

  • Resolution is proportional to image size;

  • When the resolution is constant, the bit rate is directly proportional to picture clarity;

  • When the bit rate is constant, the resolution is inversely proportional to picture clarity (a numeric sketch follows this list).
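To make these relationships concrete, here is a minimal Swift sketch that computes the raw, uncompressed data rate implied by a resolution and frame rate, and the compression ratio needed to hit a target bit rate. The specific numbers (1080P, 30 fps, 3 bytes per pixel, a 4 Mbit/s target) are illustrative assumptions, not figures from this article.

// Raw (uncompressed) data rate: resolution × bytes per pixel × frame rate.
// 3 bytes/pixel assumes 24-bit RGB (see section 3.3.1). Numbers are illustrative.
let width = 1920, height = 1080        // 1080P resolution
let bytesPerPixel = 3                  // 24-bit color
let frameRate = 30                     // frames per second

let rawBytesPerSecond = width * height * bytesPerPixel * frameRate
let rawMbps = Double(rawBytesPerSecond) * 8 / 1_000_000
print("Uncompressed: \(rawMbps) Mbit/s")           // ≈ 1493 Mbit/s

// A plausible encoded bit rate for 1080P30 might be a few Mbit/s (assumption).
let targetMbps = 4.0
let compressionRatio = rawMbps / targetMbps
print("Compression ratio ≈ \(Int(compressionRatio)):1")  // ≈ 373:1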

2. Video-related concepts

2.1 Video file format

We are all familiar with the concept of a file format. For example, a Word document uses .doc, and a picture uses .jpg or .png. For video, common file formats include .mov, .avi, .mpg, .vob, .mkv, .rm, .rmvb, and so on. The file format is usually indicated by the file's extension when it is stored on the operating system, and the OS uses it to associate the file with the program that opens it: a .doc file opens in Word by default, while an .avi or .mkv file opens in a video player, and so on. They are all video, so why are there multiple file formats such as .mov, .avi, and .mpg? Because they implement "video" in different ways.

2.2 Video Encapsulation Format

Video encapsulation format, or video format for short, is essentially a container for storing video information. It holds the video data, the audio data, and the related configuration information needed to package a video file (how the video and audio are associated, how to decode them, and so on). The most direct reflection of a video encapsulation format is the corresponding video file format. The main video encapsulation formats are:

  • AVI format: The corresponding file format is .avi; the full English name is Audio Video Interleave, i.e. audio and video interleaved. It is a multimedia container format launched by Microsoft in November 1992 as part of its Windows video software. Its advantage is good image quality, and lossless AVI can preserve the alpha channel. Its disadvantages are a large file size, non-uniform compression standards, and many compatibility issues between older and newer versions.

  • DV-AVI format: The corresponding file format is .avi; the full English name is Digital Video Format. It is a home digital video format jointly proposed by Sony, Panasonic, JVC, and other manufacturers, and is widely used by digital camcorders to record video data. Video data can be transferred to a computer through an IEEE 1394 port, and edited video on the computer can also be recorded back to the camcorder.

  • WMV format: The corresponding file formats are .wmv and .asf; the full English name is Windows Media Video. It is a file compression format launched by Microsoft that uses its own encoding method and allows video programs to be watched in real time directly over the Internet. At the same video quality, WMV files can be played while downloading, so the format is well suited to playback and transmission on the Internet.

  • MPEG format: The corresponding file formats are .mpg, .mpeg, .mpe, .dat, .vob, .asf, .3gp, .mp4, etc. The full English name is Moving Picture Experts Group, the group that specifies these video formats. Established in 1988, the expert group is responsible for formulating video and audio standards, and its members are technical experts in the video, audio, and systems fields. The MPEG family currently has three main compression standards: MPEG-1, MPEG-2, and MPEG-4. MPEG-4 is the encapsulation most widely used now; it is designed for streaming high-quality video, aiming for the best image quality with the least amount of data.

  • Matroska format: The corresponding file format is .mkv. Matroska is a newer video encapsulation format that can package video in a variety of different encodings, 16 or more audio streams in different formats, and subtitle streams in different languages into a single Matroska media file.

  • Real Video format: The corresponding file formats are .rm and .rmvb, audio and video compression specifications formulated by RealNetworks and called Real Media. Users can set different compression ratios for different network transmission rates, enabling real-time transmission and playback of video over low-speed networks; the player is RealPlayer.

  • QuickTime File Format: The corresponding file format is .mov, a video format developed by Apple whose default player is Apple QuickTime. This encapsulation format features a high compression ratio and excellent video clarity, and can preserve the alpha channel.

  • Flash Video format: The corresponding file format is .flv, a network video encapsulation format that grew out of Adobe Flash and has been adopted by many video websites.

2.3 Container (video encapsulation format file)

Encapsulation format: the encoded, compressed video data and audio data are placed into one file according to a defined format; this file is the container. You can think of it as a "shell". A container usually stores not only audio and video data but also metadata needed for synchronization, such as subtitles. In general the different kinds of data are processed by different programs, but when the file is transmitted and stored they are bundled together.

Common video container formats:

  • AVI: launched to compete with the QuickTime format (.mov) of the time; it only supports audio tracks encoded at a constant bit rate (CBR);

  • MOV: the QuickTime container;

  • WMV: launched by Microsoft as a competing format;

  • mkv: a universal container with good compatibility, cross-platform support, error correction, and support for external subtitles;

  • flv: this encapsulation hides the original media address well, making the video hard to download directly; many video-sharing websites currently use it;

  • MP4: mainly used to encapsulate MPEG-4 encoded video, and widely used on mobile phones.

2.4 Audio and video codec method

The process of video codec refers to the process of compressing or decompressing digital video.

When encoding and decoding video, the following factors have to be balanced: video quality, the amount of data needed to represent the video (the bit rate), the complexity of the encoding and decoding algorithms, robustness against data loss and errors, ease of editing, random access, the maturity of the coding algorithm design, end-to-end latency, and other factors.

2.4.1 Common Video Coding Methods

  • H.26X series: led by ITU-T, including H.261, H.262, H.263, H.264, H.265.

    • H.261: Mainly used in older video conferencing and video telephony systems. It was the first digital video compression standard to see practical use, and essentially all subsequent standard video codecs are based on its design;

    • H.262: Equivalent to MPEG-2 Part 2, used in DVD, SVCD, and most digital video broadcasting and cable distribution systems.

    • H.263: Mainly used in video conferencing, video telephony, and network video products. For progressive-scan video compression (progressive scan is one of the two common ways of "drawing" a video image on an electronic display, painting the rows of pixels one after another; the other is interlaced scan), H.263 delivered a significant performance improvement over earlier video coding standards. Especially at low bit rates, it can save a great deal of bit rate while maintaining a given quality.

    • H.264: Equivalent to MPEG-4 Part 10, also known as Advanced Video Coding (AVC). It is a video compression standard and a widely used format for high-precision video recording, compression, and distribution. The standard introduced a series of new technologies that greatly improve compression performance, surpassing previous standards at both the high and the low bit-rate end.

    • H.265: Known as High Efficiency Video Coding (HEVC), a video compression standard and the successor of H.264. HEVC is considered to not only improve image quality but also achieve roughly twice the compression of H.264 (equivalent to a 50% reduction in bit rate at the same picture quality). It can support 4K and even ultra-high-definition TV, with resolutions up to 8192×4320 (8K), and represents the current direction of development.

  • MPEG series: developed by the Moving Picture Experts Group (MPEG) under the International Organization for Standardization (ISO).

    • MPEG-1 Part 2: Mainly used on VCD; some online videos also use this format. The codec quality is roughly equivalent to that of original VHS tapes;

    • MPEG-2 Part 2: Equivalent to H.262, used in DVD, SVCD and most digital video broadcasting systems and cable distribution systems;

    • MPEG-4 Part 2: Can be used for network transmission, broadcasting, and media storage. Its compression performance is improved over MPEG-2 Part 2 and the first version of H.263;

    • MPEG-4 Part 10: Equivalent to H.264; it is the standard born of the cooperation between these two standards organizations.

  • Others: AMV, AVS, Bink, CineForm, etc.

2.4.2 Common audio coding methods

In addition to the picture, there is usually sound in the video, so it involves audio codec. Commonly used audio encoding methods in video are as follows:

  • AAC: The full English name is Advanced Audio Coding. It was jointly developed by Fraunhofer IIS, Dolby Laboratories, AT&T, Sony, and other companies and launched in 1997, based on MPEG-2 audio coding technology. After the MPEG-4 standard appeared in 2000, AAC was reworked to incorporate its features, adding SBR and PS technology; to distinguish it from traditional MPEG-2 AAC, it is also called MPEG-4 AAC.

  • MP3: The English full name is MPEG-1 or MPEG-2 Audio Layer III, which was once a very popular digital audio coding and lossy compression format. It is designed to greatly reduce the amount of audio data. Invented and standardized in 1991 by a group of engineers at the research organization Fraunhofer-Gesellschaft in Erlangen, Germany. The popularity of MP3 has had a great impact and influence on the music industry.

  • WMA: The English full name is Windows Media Audio, a digital audio compression format developed by Microsoft Corporation, which itself includes lossy and lossless compression formats.

2.5 Relationship between video encoding and decoding methods and video encapsulation formats

A video encapsulation format can be regarded as a container holding the video, the audio, and information such as the codecs used. One encapsulation format can support multiple video codecs: for example, QuickTime File Format (.mov) supports almost every video codec, and MPEG (.mp4) also supports a wide range of codecs. When we see a video file named test.mov, we know its video file format is .mov and its encapsulation format is QuickTime File Format, but we cannot tell its video codec. Speaking precisely, a video can be described as A/B, where A is the video codec and B is the encapsulation format. For example, an H.264/MOV video file is encapsulated as QuickTime File Format and encoded with H.264.

2.6 Encoding format in live broadcast/short video

  • Video encoding format:

    • Advantages of H.264 encoding: low bit rate, high-quality images, strong fault tolerance, strong network adaptability, and a high data compression ratio. At the same image quality, the compression ratio of H.264 is more than twice that of MPEG-2 and 1.5-2 times that of MPEG-4;

    • Example: if the original file is 88 GB, it becomes 3.5 GB after compression with the MPEG-2 standard (a compression ratio of 25:1) and 879 MB after compression with the H.264 standard (a compression ratio of 102:1).

  • Audio encoding format

    • AAC is currently a relatively popular lossy compression coding technology, and has derived three main coding formats: LC-AAC, HE-AAC, and HE-AAC v2;

    • LC-AAC is the relatively traditional AAC, mainly used for encoding medium and high bit-rate (>= 80 kbit/s) scenarios;

    • HE-AAC is mainly used for encoding low bit-rate (<= 48 kbit/s) scenarios;

    • AAC performs well at bit rates below 128 kbit/s and is mostly used for the audio tracks in video.

3. Basic concepts of H.264

H.264 is one of the most widely used coding methods today. The concepts related to H.264, ordered from large to small, are: sequence, image, slice group, slice, NALU, macroblock, sub-macroblock, block, and pixel.


3.1 Image

In H.264, "image" is a collective concept: a frame, a top field, or a bottom field can all be called an image. A frame is usually a complete image. When a video signal is captured with progressive scanning, each scan produces one image, i.e. one frame. With interlaced scanning (odd lines, then even lines), each scanned image is split into two parts; each part is called a field, and by scan order they are divided into the top field and the bottom field. The concepts of frame and field lead to different coding methods: frame coding and field coding. Progressive scanning suits moving images, for which frame coding is preferable; interlaced scanning suits relatively static images, for which field coding is preferable.

3.2 Slice, NALU, macroblock

  • Each frame image can be divided into multiple slices

  • NALU: full name Network Abstraction Layer Unit. It packages the encoded data. A slice can be encoded into one NALU, but a NALU can also carry data other than the slice-encoded bitstream, such as the sequence parameter set (SPS). For the client, the main task is to receive the data packets, parse the NALUs out of them, and then decode and play (see the parsing sketch after this list).

  • Macroblock: a slice is composed of macroblocks.
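As a concrete illustration of how a receiver locates NALUs, here is a minimal Swift sketch (my own code, not from the article) that scans an Annex-B byte stream for 0x000001 / 0x00000001 start codes and reads the nal_unit_type from the low five bits of the first header byte. The type values 7 (SPS), 8 (PPS), and 5 (IDR slice) come from the H.264 specification; the sample buffer is made up.

// Minimal Annex-B NALU scanner: split on start codes, read nal_unit_type.
// nal_unit_type is the low 5 bits of the first byte after the start code (H.264).
func nalUnitTypes(in stream: [UInt8]) -> [UInt8] {
    var types: [UInt8] = []
    var i = 0
    while i + 3 < stream.count {
        let isStartCode3 = stream[i] == 0 && stream[i+1] == 0 && stream[i+2] == 1
        let isStartCode4 = i + 4 < stream.count &&
            stream[i] == 0 && stream[i+1] == 0 && stream[i+2] == 0 && stream[i+3] == 1
        if isStartCode3 || isStartCode4 {
            let headerIndex = i + (isStartCode4 ? 4 : 3)
            types.append(stream[headerIndex] & 0x1F)   // 5 = IDR slice, 7 = SPS, 8 = PPS
            i = headerIndex + 1
        } else {
            i += 1
        }
    }
    return types
}

// Usage with a made-up buffer: [start code][SPS 0x67]...[PPS 0x68]...[IDR 0x65]...
let sample: [UInt8] = [0, 0, 0, 1, 0x67, 0x42, 0, 0, 0, 1, 0x68, 0xCE, 0, 0, 1, 0x65, 0x88]
print(nalUnitTypes(in: sample))   // [7, 8, 5] → SPS, PPS, IDR slice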

3.3 Color Model

3.3.1 RGB color model

In day-to-day development, the most commonly used model is the RGB model.

In the RGB model, each color requires 3 numbers, representing red (R), green (G), and blue (B). Usually each number occupies 1 byte, so one color requires 24 bits.

3.3.2 YCbCr color model (a member of the YUV family)

Suppose we define a concept of luminance to represent the brightness of a color; it can be expressed as a weighted combination of R, G, and B:

Y = kr*R + kg*G + kb*B
* Y is the luminance; kr, kg, and kb are the weights of R, G, and B.

Next, define a concept of chrominance to represent the color difference:

Cr = R - Y
Cg = G - Y
Cb = B - Y
* Cr, Cg, and Cb are the chrominance components for R, G, and B respectively.

The above model is the basic principle of the YCbCr color model. YCbCr is a member of the YUV family and is the most widely used color model in computer systems. In YCbCr, Y refers to the luminance component, Cb refers to the blue chroma component, and Cr refers to the red chroma component.

In YUV, Y represents luminance, i.e. the grayscale value, while U and V represent chrominance.
The key point of YUV is that the luminance signal Y is separated from the chrominance signals U and V. That means an image can still be displayed with only the Y component and no U or V components; it will simply be a black-and-white grayscale image.

Taking the recommended coefficients from the ITU-R BT.601-7 standard, we get the formulas for converting between YCbCr and RGB:

Y = 0.299R + 0.587G + 0.114B
Cb = 0.564(B - Y)
Cr = 0.713(R - Y)
R = Y + 1.402Cr
G = Y - 0.344Cb - 0.714Cr
B = Y + 1.772Cb
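The BT.601 formulas above translate directly into code. Here is a minimal Swift sketch that implements them verbatim in floating point; the sample pixel values and the round-trip check are illustrative assumptions, not part of the article.

// Direct implementation of the BT.601 formulas quoted above (floating point).
struct YCbCrColor { var y: Double; var cb: Double; var cr: Double }

func rgbToYCbCr(r: Double, g: Double, b: Double) -> YCbCrColor {
    let y  = 0.299 * r + 0.587 * g + 0.114 * b
    let cb = 0.564 * (b - y)
    let cr = 0.713 * (r - y)
    return YCbCrColor(y: y, cb: cb, cr: cr)
}

func yCbCrToRGB(_ c: YCbCrColor) -> (r: Double, g: Double, b: Double) {
    let r = c.y + 1.402 * c.cr
    let g = c.y - 0.344 * c.cb - 0.714 * c.cr
    let b = c.y + 1.772 * c.cb
    return (r, g, b)
}

// Round-trip check on a reddish pixel (values in 0...255):
let c = rgbToYCbCr(r: 200, g: 30, b: 30)
print(c)                 // Y ≈ 80.8, Cb ≈ -28.7, Cr ≈ 85.0
print(yCbCrToRGB(c))     // ≈ (200, 30, 30); small rounding error is expected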

At this point YCbCr still uses 3 numbers to represent a color, so where are the bit savings? This is illustrated below using an image from a video and the pixels that make it up.

  • Suppose the picture consists of the following pixels:

  • An image is an array of pixels. When every pixel keeps the complete information of all 3 components, this is YCbCr 4:4:4.

  • For each pixel the luminance value is kept, but the chrominance values of the even-numbered pixels in each row are omitted, saving bits. This is YCbCr 4:2:2.

  • Even more chroma samples can be omitted (as in 4:2:0) without affecting picture quality too much; see the byte-count sketch below.
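A quick way to see the saving is to count bytes per frame under each subsampling scheme. This Swift sketch assumes 8 bits per sample and a 1920×1080 frame; the numbers are illustrative, not from the article.

// Bytes per frame for common YCbCr subsampling schemes, 8 bits per sample.
// 4:4:4 keeps all chroma samples, 4:2:2 halves them horizontally,
// 4:2:0 halves them both horizontally and vertically.
let w = 1920, h = 1080
let lumaBytes = w * h

let bytes444 = lumaBytes * 3               // Y + full Cb + full Cr
let bytes422 = lumaBytes * 2               // Y + half Cb + half Cr
let bytes420 = lumaBytes * 3 / 2           // Y + quarter Cb + quarter Cr

print(bytes444, bytes422, bytes420)        // 6220800, 4147200, 3110400
print(Double(bytes420) / Double(bytes444)) // 0.5 → 4:2:0 halves the raw frame size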

3.4 Basic Concepts of H.264

3.4.1 I frame, P frame, B frame

I frame: key frame, using intra-frame compression technology

  • For example, if a camera is pointed at you, very little actually changes within one second. Cameras generally capture dozens of frames per second: animation is typically 25 frames/s, and ordinary video is around 30 frames/s; for demanding uses that need fine, complete motion capture, high-end cameras shoot 60 frames/s. To make the data easier to compress, the first frame is saved in full. Without this key frame, subsequent frames cannot be decoded, so the I frame is particularly critical.

P frame: forward-predicted frame; it references only the previous frame during compression and belongs to inter-frame compression.

  • The first frame of the video is saved in full as a key frame, and subsequent frames are forward-dependent: the second frame depends on the first, and so on. Each later frame stores only the difference from the previous frame, which greatly reduces the data and achieves a high compression rate.

B frame: bidirectionally predicted frame; it references both the previous frame and the next frame during compression and belongs to inter-frame compression.

  • Because a B frame references both the previous and the next frame, its compression ratio is higher and less data has to be stored; the more B frames there are, the higher the compression ratio, which is their advantage. Their biggest disadvantage is that in real-time interactive live streaming, a B frame cannot be decoded until the following frame has arrived over the network. If the network is good, decoding is fast; if not, decoding is delayed, and lost packets must be retransmitted. For that reason B frames are generally not used in real-time interactive live streaming;

  • In entertainment-oriented live streaming, where some delay is acceptable, B frames can be used if a relatively high compression ratio is needed;

  • In real-time interactive live streaming, where timeliness must be maximized, B frames cannot be used.

3.4.2 GOF (Group of Frames) and GOP (Group of Pictures)

If the video runs at 30 frames per second and the camera or scene does not change significantly within one minute, all the frames in that minute can be treated as one group.

  • What is a group of frames? The data from one I frame to the next I frame, including the B and P frames in between, is called a GOF.

  • GOP (picture group): a group of consecutive pictures within the sequence, grouped to make random access easier. The first picture of a GOP must be an I frame, which guarantees that the GOP can be decoded independently without referring to other pictures.

3.4.3 SPS/PPS

SPS/PPS actually store the parameters of a GOP.

  • SPS: Sequence Parameter Set. It stores parameters shared by a group of frames, such as the number of frames, the number of reference frames, the decoded image size, and the frame/field coding mode selection flag.

  • PPS: Picture Parameter Set. It stores picture-related information, such as the entropy coding mode selection flag, the number of slice groups, the initial quantization parameter, and the deblocking filter coefficient adjustment flag.

The SPS/PPS data must be received before a group of frames; without these parameters the frames cannot be decoded. If an error occurs during decoding, first check whether SPS/PPS are present; if not, determine whether the other end never sent them or they were lost in transit. SPS/PPS data is treated like I-frame data, and these two sets of parameters must never be lost.
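To illustrate the "no SPS/PPS, no decoding" rule, here is a minimal, hypothetical receiver-side check in Swift. The class and function names are my own invention; the type values 7, 8, and 5 are the standard H.264 nal_unit_type values for SPS, PPS, and IDR slices.

// Hypothetical decode gate: hold frames until SPS, PPS, and a key frame have arrived.
final class DecodeGate {
    private var hasSPS = false
    private var hasPPS = false
    private var hasKeyFrame = false

    // nalType is the 5-bit nal_unit_type from the NALU header.
    func canDecode(afterReceiving nalType: UInt8) -> Bool {
        switch nalType {
        case 7: hasSPS = true                      // Sequence Parameter Set
        case 8: hasPPS = true                      // Picture Parameter Set
        case 5: hasKeyFrame = hasSPS && hasPPS     // IDR slice only counts once SPS/PPS exist
        default: break
        }
        return hasSPS && hasPPS && hasKeyFrame
    }
}

let gate = DecodeGate()
print(gate.canDecode(afterReceiving: 1))   // false: ordinary slice before parameters
print(gate.canDecode(afterReceiving: 7))   // false: SPS only
print(gate.canDecode(afterReceiving: 8))   // false: SPS + PPS, still waiting for an IDR
print(gate.canDecode(afterReceiving: 5))   // true: safe to start decoding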

3.5 Reasons for video blurring and freezing

When watching a video, you may encounter a corrupted (blurred) picture or a freeze; both are related to the GOP.

  • If a P frame in a GOP is lost, the decoder produces image errors (a blurred, corrupted picture);

  • To avoid the blurred-picture problem, when a P frame or I frame is found to be missing, none of the remaining frames in this GOP are displayed; the image refreshes only when the next I frame arrives;

  • Because all the frames after the loss are discarded, the screen is not refreshed and the image appears stuck, i.e. it freezes.

  • To sum up: a blurred picture is caused by missing P or I frames leading to decoding errors; a freeze is caused by discarding the whole damaged GOP (to avoid the blur) and waiting for the next correct GOP to arrive. The sketch after this list illustrates this decision.
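The blur/freeze handling described above can be sketched as a small player-side state machine: once a loss is detected, drop every frame until the next I frame refreshes the picture. The Swift types and names below are hypothetical, for illustration only.

// Hypothetical per-frame decision: render, or hold the last good picture (freeze)
// until the next I frame, so a broken GOP never shows a corrupted (blurry) image.
enum FrameKind { case i, p, b }

final class GopGuard {
    private var gopIsBroken = false

    func shouldRender(kind: FrameKind, lostReference: Bool) -> Bool {
        if kind == .i {
            gopIsBroken = false      // a new GOP starts; the picture can refresh
        }
        if lostReference {
            gopIsBroken = true       // a frame lost its reference → rest of the GOP is unusable
        }
        return !gopIsBroken
    }
}

let gopGuard = GopGuard()
print(gopGuard.shouldRender(kind: .i, lostReference: false)) // true
print(gopGuard.shouldRender(kind: .p, lostReference: true))  // false: start freezing
print(gopGuard.shouldRender(kind: .b, lostReference: false)) // false: still frozen
print(gopGuard.shouldRender(kind: .i, lostReference: false)) // true: next I frame refreshes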

4. H.264 coding principle

4.1 H.264 Compression Technology

  • Intra-frame prediction compression. It addresses spatial data redundancy. What is spatial data? Within a frame's width and height, the picture contains a great deal of color and brightness information, and the portions the human eye can hardly perceive can be treated as redundant and compressed away.

  • Inter-frame prediction compression. It addresses temporal data redundancy. Data captured by the camera over a period of time often changes very little; compressing the data that stays the same over that period is temporal compression.

  • Integer Discrete Cosine Transform (DCT). It converts spatial correlation into largely uncorrelated data in the frequency domain, which is then quantized. Just as a Fourier transform decomposes a complex waveform into many sine waves of different frequencies and amplitudes, separating the signal by frequency makes it compressible (see the DCT sketch after this list).

  • CABAC Compression: Lossless compression.
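To see how the transform concentrates energy, here is a minimal 1-D DCT-II sketch in Swift. H.264 actually uses a small integer approximation of the DCT on 4×4 blocks; the floating-point version and the sample pixel row below are only assumptions made to illustrate how correlated samples collapse into a few significant coefficients.

import Foundation

// Orthonormal 1-D DCT-II: correlated (slowly varying) samples end up with most of
// their energy in the first few coefficients, which is what makes quantization effective.
func dct(_ x: [Double]) -> [Double] {
    let n = Double(x.count)
    return (0..<x.count).map { k in
        let scale = k == 0 ? sqrt(1 / n) : sqrt(2 / n)
        let sum = x.enumerated().reduce(0.0) { acc, e in
            acc + e.element * cos(Double.pi * (Double(e.offset) + 0.5) * Double(k) / n)
        }
        return scale * sum
    }
}

// A smooth row of pixel values (high spatial correlation):
let row: [Double] = [100, 102, 104, 106, 108, 110, 112, 114]
print(dct(row))   // ≈ [302.6, -12.9, ~0, -1.3, ~0, -0.4, ~0, -0.1]
// Almost all the energy sits in the first coefficient; the tiny ones quantize to zero.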

4.2 Macroblock division and grouping

For H.264 macroblock division, refer to the figure below.

Take the upper-left corner of a picture and describe it with a macroblock: here a macroblock is an 8×8 block of pixels. Its color information is extracted, as shown on the right, and in this way the basic macroblock division of the picture is completed. Is every macroblock always 8×8? Of course not; there is also sub-block division. Sub-block division:

Within a large macroblock, the division can be refined further. If all of the sub-blocks in the middle are blue, a single color block can describe them more simply.

  • Comparing MPEG-2 and H.264, MPEG-2 stores the data relatively completely and occupies more space, while H.264 saves a lot of space because repeated colors are described with simple color blocks.

Frame grouping: for example, if a billiard ball moves from one position to another while the table background stays the same and only the ball's position changes, this set of frames can be treated as one group.

4.3 Intra-group macroblock search

As shown in the figure, the billiard ball rolls from one corner to the other, and macroblock search within the group is performed on the two adjacent pictures: scan the picture row by row, find the billiard ball on the third row, then search around it for similar blocks.

Motion estimation: place the two positions in the same picture, position 1 where the billiard ball started and position 2 where it moved to; between them there is a motion vector, with a direction and a distance. Comparing all the pictures produces the image on the right, where each red arrow marks a motion vector; many frames together form a continuous motion estimate.

Motion vector and compensation compression: the continuous motion estimate is converted into data to be compressed. The background of all the frames is the same; what changes is the motion vector and the data of the billiard ball. After the calculation, only the motion vector data and the residual data remain, and only these need to be stored to achieve compression. This process is the principle of inter-frame compression.
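The macroblock search described above is usually implemented as block matching: take a block from the current frame and find the most similar block in the reference frame by minimizing the sum of absolute differences (SAD). The Swift sketch below is a minimal, hypothetical full search on tiny grayscale frames, not the encoder's real (much faster) search; all names are my own.

// Minimal block-matching motion estimation on grayscale frames stored row-major.
// Returns the motion vector (dx, dy) whose reference block best matches the current block.
struct GrayFrame {
    let width: Int, height: Int
    let pixels: [UInt8]
    func pixel(_ x: Int, _ y: Int) -> Int { Int(pixels[y * width + x]) }
}

func bestMotionVector(current: GrayFrame, reference: GrayFrame,
                      blockX: Int, blockY: Int, blockSize: Int,
                      searchRange: Int) -> (dx: Int, dy: Int) {
    var best = (dx: 0, dy: 0)
    var bestSAD = Int.max
    for dy in -searchRange...searchRange {
        for dx in -searchRange...searchRange {
            let rx = blockX + dx, ry = blockY + dy
            guard rx >= 0, ry >= 0,
                  rx + blockSize <= reference.width,
                  ry + blockSize <= reference.height else { continue }
            var sad = 0
            for y in 0..<blockSize {
                for x in 0..<blockSize {
                    sad += abs(current.pixel(blockX + x, blockY + y) - reference.pixel(rx + x, ry + y))
                }
            }
            if sad < bestSAD { bestSAD = sad; best = (dx, dy) }
        }
    }
    return best   // the encoder stores this vector plus the (small) residual
}

// Example call: an 8×8 block at (16, 8), searched ±7 pixels in the reference frame.
// let mv = bestMotionVector(current: cur, reference: ref,
//                           blockX: 16, blockY: 8, blockSize: 8, searchRange: 7)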

4.4 Intra Prediction

Intra-frame compression applies to I frames and addresses spatial data redundancy. Inter-frame compression addresses temporal data redundancy: it compresses away the large amount of identical data along the time axis, leaving only the motion estimate and the residual values.

For intra-frame compression, the encoder must first decide which prediction mode to use, and a different mode can be chosen for each macroblock. Once a mode has been selected for every macroblock, the effect shown in the figure below is produced: each macroblock has its own intra prediction mode. There are 9 intra prediction modes; for details see the H.264 series article (3) on intra prediction. After the mode is chosen for each macroblock, prediction is performed block by block, producing a prediction image. The predicted image is on the left and the original image is on the right.

There is a difference between the computed prediction image and the original image: the original is relatively smooth, while the prediction is relatively rough. The difference between the two pictures can then be calculated.

Calculating the intra prediction residual: the figure below shows the original image; taking the difference between the prediction and the original gives a result, the gray image, which is the residual.

Prediction mode and residual compression: after obtaining the residual, compress it. During compression, the residual data and the mode information chosen for each macroblock are saved. When decoding, the prediction image is first computed from the macroblock's mode information, and then the prediction image and the residual are added together to restore the original image. This process is the principle of intra-frame compression.
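The residual/reconstruction round trip described above can be shown in a few lines of Swift: subtract the predicted block from the original to get the residual, store (mode, residual), then add the residual back to the prediction when decoding. The 4×4 values and the flat prediction below are made-up illustrations; H.264's actual prediction modes are not implemented here.

// Encoder side: residual = original - prediction (per pixel).
// Decoder side: reconstruction = prediction + residual, which restores the original exactly
// (before quantization; real encoders quantize the transformed residual, which is lossy).
let original: [Int]  = [ 98, 100, 101, 103,
                         99, 101, 102, 104,
                        100, 102, 103, 105,
                        101, 103, 104, 106]
let predicted: [Int] = [100, 100, 100, 100,   // e.g. a flat, DC-style prediction (illustrative)
                        100, 100, 100, 100,
                        100, 100, 100, 100,
                        100, 100, 100, 100]

let residual = zip(original, predicted).map { $0 - $1 }
print(residual)                     // small numbers clustered around 0 → cheap to compress

let reconstructed = zip(predicted, residual).map { $0 + $1 }
print(reconstructed == original)    // true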
