Audio and Video Hard Decoding Series: Basics of Audio and Video

Short video apps are booming today. With their rise, audio and video development has received more and more attention, but because it covers a wide range of knowledge and has a relatively high entry barrier, many developers are daunted by it.

1. What is video?

You may have played with a flip book as a child: flip the pages quickly and the drawings turn into an animation, much like today's GIF images.

A static little booklet becomes an amusing animation once you flip through it. If there are enough pictures and the flipping is fast enough, it is effectively a short video.

Video works on exactly the same principle. Because of persistence of vision, when pictures are switched quickly enough the human eye perceives them as continuous motion. A video, therefore, is just a series of pictures.

Video frame

A frame is a basic concept of video: it represents a single picture. One page of the flip book above is a frame, and a video is made up of many frames.

Frame rate

Frame rate is the number of frames per unit of time, measured in frames per second (fps). In the flip book example, it is how many pages are shown in one second: the more pages, the smoother the motion and the more natural the transitions.

Typical frame rate values are:

24/25 fps: 24 or 25 frames per second, the typical frame rate for movies.
30/60 fps: 30 or 60 frames per second, common for games; 30 fps is acceptable, while 60 fps feels noticeably smoother and more realistic.
Above roughly 85 fps, the human eye can barely perceive any difference, so higher frame rates add little value for video.
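These numbers translate directly into a per-frame time budget for decoding and rendering. A minimal sketch (my own illustration, not from the original article):

```cpp
#include <cstdio>

int main() {
    // How much time each frame gets at a few common frame rates.
    const double fpsValues[] = {24.0, 30.0, 60.0};
    for (double fps : fpsValues) {
        printf("%5.1f fps -> %6.2f ms per frame\n", fps, 1000.0 / fps);
    }
    return 0;
}
```

At 60 fps, decoding plus rendering must finish in roughly 16.7 ms per frame, which is one reason higher frame rates demand more from the whole pipeline.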

Color space

Here we only talk about two commonly used color spaces.

RGB

RGB is the color model we are most familiar with, and it is widely used in electronic devices. By mixing the three primary colors red, green, and blue, any color can be produced.

YUV

Let's focus on YUV, a color space we are less familiar with. It separates luminance (brightness) from chrominance (color).
Early TVs were black and white and carried only the luminance value, Y. When color TV arrived, two chrominance components, U and V, were added, forming today's YUV, also called YCbCr.
Y: luminance, i.e. the grayscale value. In the luminance signal, the green channel carries the largest weight.
U: the difference between the blue channel and the luminance.
V: the difference between the red channel and the luminance.

What are the advantages of using YUV?

The human eye is sensitive to brightness but insensitive to chrominance. The amount of UV data can therefore be reduced without the eye noticing: by subsampling the chrominance, the video becomes smaller without visibly affecting how it looks.

Conversion between RGB and YUV

Y =  0.299R + 0.587G + 0.114B
U = -0.147R - 0.289G + 0.436B
V =  0.615R - 0.515G - 0.100B
——————————————————
R = Y + 1.14V
G = Y - 0.39U - 0.58V
B = Y + 2.03U
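As a minimal sketch, here are the formulas above expressed in code (using exactly these coefficients; real converters usually operate on whole frames and clamp or offset values depending on the YUV range in use):

```cpp
#include <cstdio>

struct YUV { double y, u, v; };

// Convert one RGB pixel (0..255 per channel) to YUV using the coefficients above.
YUV rgbToYuv(double r, double g, double b) {
    YUV out;
    out.y =  0.299 * r + 0.587 * g + 0.114 * b;
    out.u = -0.147 * r - 0.289 * g + 0.436 * b;
    out.v =  0.615 * r - 0.515 * g - 0.100 * b;
    return out;
}

// Inverse conversion, again following the formulas above.
void yuvToRgb(const YUV& in, double& r, double& g, double& b) {
    r = in.y + 1.14 * in.v;
    g = in.y - 0.39 * in.u - 0.58 * in.v;
    b = in.y + 2.03 * in.u;
}

int main() {
    YUV yuv = rgbToYuv(200, 100, 50);   // an arbitrary test pixel
    double r, g, b;
    yuvToRgb(yuv, r, g, b);
    printf("Y=%.1f U=%.1f V=%.1f -> R=%.1f G=%.1f B=%.1f\n",
           yuv.y, yuv.u, yuv.v, r, g, b);
    return 0;
}
```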

2. What is audio?

The most commonly used way to carry audio data is pulse code modulation, or PCM.

In nature, sound is a continuous analog signal. How can it be preserved? By digitizing it, that is, converting it into a digital signal.

We know that sound is a wave with its own amplitude and frequency. To save a sound, we need to record its amplitude at many points in time.

A digital signal cannot store the amplitude at every instant continuously, and in fact it does not need to: sampling at discrete points is enough to reconstruct a sound acceptable to the human ear.

According to the Nyquist sampling theorem, to reconstruct the analog signal without distortion, the sampling frequency must be at least twice the highest frequency in the signal's spectrum.

Based on the analysis above, PCM acquisition consists of the following steps:

Analog signal -> sampling -> quantization -> encoding -> digital signal


Sampling rate and number of sampling bits

Sampling rate, that is, the frequency of sampling.

As mentioned above, the sampling rate must be more than twice the highest frequency of the original sound wave. The highest frequency the human ear can hear is about 20 kHz, so to cover the full range of human hearing the sampling rate must be at least 40 kHz. In practice 44.1 kHz is the most common value, and 48 kHz is often used for higher quality.

The number of sampling bits relates to the quantization step above. On the analog signal the amplitude varies continuously, whereas a digital signal can only hold discrete values, so after quantization each amplitude becomes an approximate integer. To record these amplitude values, the sampler uses a fixed number of bits per sample, usually 8, 16, or 32 bits.

Number of bits    Minimum         Maximum
8                 0               255
16                -32768          32767
32                -2147483648     2147483647

The more bits, the more precisely the amplitude is recorded and the more faithful the reconstruction.

The last step is encoding. Since a digital signal consists of 0s and 1s, the quantized amplitude values must be converted into a sequence of 0s and 1s for storage; that is encoding, and the final result, the digital signal, is a stream of 0s and 1s.
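To make the sampling, quantization, and encoding steps concrete, here is a small sketch (my own illustration): it samples a 440 Hz sine wave at 44.1 kHz and quantizes each sample into a 16-bit integer, which is essentially what raw PCM data is.

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    const int    sampleRate = 44100;   // samples per second (44.1 kHz)
    const double freqHz     = 440.0;   // tone to generate
    const double durationS  = 0.01;    // 10 ms of audio
    const double kPi        = 3.14159265358979323846;

    std::vector<int16_t> pcm;          // the resulting 16-bit PCM samples
    const int totalSamples = static_cast<int>(sampleRate * durationS);

    for (int n = 0; n < totalSamples; ++n) {
        // Sampling: evaluate the continuous wave at discrete time points.
        double t = static_cast<double>(n) / sampleRate;
        double amplitude = std::sin(2.0 * kPi * freqHz * t);   // -1.0 .. 1.0

        // Quantization + encoding: map the amplitude to a 16-bit integer.
        pcm.push_back(static_cast<int16_t>(amplitude * 32767.0));
    }

    printf("Generated %zu PCM samples; first few: %d %d %d\n",
           pcm.size(), pcm[0], pcm[1], pcm[2]);
    return 0;
}
```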

Number of channels

The number of channels refers to the number of speakers that can each play a different sound (note: different sounds).

Mono: 1 channel

Dual channel: 2 channels

Stereo: 2 channels by default

Stereo (4 channels): 4 channels

Bit rate

Bit rate is the amount of data transferred per second in a stream, measured in bps (bits per second).

Bit rate = sampling rate × number of sampling bits × number of channels
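Plugging in CD-quality parameters (44.1 kHz, 16 bits, 2 channels) as an example:

```cpp
#include <cstdio>

int main() {
    const long long sampleRate = 44100;  // Hz
    const long long bitDepth   = 16;     // bits per sample
    const long long channels   = 2;      // stereo

    long long bitRate = sampleRate * bitDepth * channels;
    printf("PCM bit rate: %lld bps (about %.2f Mbps)\n", bitRate, bitRate / 1e6);
    return 0;
}
// 44100 * 16 * 2 = 1,411,200 bps, i.e. roughly 1.4 Mbps for raw stereo CD audio.
```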

3. Why encode?

The encoding discussed here is not the same concept as the encoding step mentioned in the audio section above; it refers to compression encoding.

We know that in the computer world everything is made of 0s and 1s, and audio and video data are no exception. Raw audio and video streams are huge; storing them as-is would require enormous space and make transmission impractical. Fortunately, the raw data contains a great deal of redundancy, which compression algorithms can exploit.

This is especially true for video: because adjacent pictures transition gradually, a video contains a large amount of repeated picture and pixel data, which leaves plenty of room for compression.

Therefore, encoding can greatly reduce the size of audio and video data, making it easier to store and transmit audio and video.
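A rough back-of-the-envelope example (my own numbers, assuming 1080p at 30 fps stored as raw YUV420 with 1.5 bytes per pixel) shows just how necessary this is:

```cpp
#include <cstdio>

int main() {
    const double width  = 1920;
    const double height = 1080;
    const double fps    = 30;
    const double bytesPerPixel = 1.5;   // YUV420: 12 bits per pixel

    double bytesPerSecond = width * height * bytesPerPixel * fps;
    printf("Raw 1080p30 YUV420: %.1f MB/s, about %.1f GB per minute\n",
           bytesPerSecond / 1e6, bytesPerSecond * 60 / 1e9);
    return 0;
}
// Around 93 MB/s, i.e. roughly 5.6 GB for one minute of raw video -- hence the need for encoding.
```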

4. Video encoding

Video encoding format

There are many video coding formats, such as the H.26x series and the MPEG series. These formats emerged to keep up with the demands of the times.

Among them, the H.26x (1/2/3/4/5) series is led by the ITU (International Telecommunication Union), while the MPEG series (MPEG-1, MPEG-2, MPEG-4, etc.) is led by MPEG (the Moving Picture Experts Group, a working group under ISO).

They have also produced joint coding standards: H.264, the current mainstream format, and its more advanced next-generation successor, H.265.

Introduction to H.264 encoding

H.264 is currently the most mainstream video coding standard, so the later articles in this series use it as the baseline.

H.264 was jointly defined by the ITU and MPEG, and corresponds to Part 10 of the MPEG-4 standard.

The H.264 encoding algorithm is very complex and cannot be explained in a few paragraphs (nor is it within the scope of my current ability), so here I only briefly introduce the concepts needed in daily development. In practice, video encoding and decoding are usually handled by a framework (such as Android hardware decoding or FFmpeg), and the average developer rarely touches the algorithm itself.

Video frame

We already know that a video consists of individual frames, but in the encoded data the frames are not stored one by one as raw pictures (if they were, compression coding would be meaningless).

Based on how the picture changes over a period of time, H.264 encodes one frame completely, and for the following frames it records only the differences from previously encoded frames. This is a dynamic compression process.

H.264 defines three types of frames:

I frame: intra-coded frame. A complete, self-contained frame.

P frame: forward-predicted frame. An incomplete frame, generated by referencing a preceding I frame or P frame.

B frame: bidirectionally predicted (interpolated) frame. It is generated by referencing frames both before and after it: a B frame depends on the nearest preceding I or P frame and the nearest following P frame.

Group of pictures (GOP) and key frame (IDR)

GOP stands for Group of Pictures: a group of consecutive frames within which the picture changes relatively little.

The first frame of a GOP is the key frame, called the IDR frame.

An IDR frame is always an I frame. Its purpose is to prevent a decoding error in one frame from propagating to all subsequent frames: when the decoder reaches an IDR frame, it clears its reference frames and starts a new sequence, so even if a serious error occurred while decoding earlier frames, it cannot spread into the data that follows.

Note: key frames (IDR frames) are always I frames, but an I frame is not necessarily a key frame.

DTS and PTS

DTS stands for Decoding Time Stamp. It indicates when the data read into memory should be sent to the decoder for decoding, i.e. it is the timestamp of the decoding order.

PTS stands for Presentation Time Stamp. It marks when a decoded video frame should be displayed.

Without B frames, DTS and PTS follow the same order. Once B frames are present, the decoding order and display order differ, so PTS and DTS diverge.
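A tiny illustration with a hypothetical frame sequence (not from the article): with B frames, the future reference frame must reach the decoder before the B frames that depend on it, so the stream/decoding order (DTS) differs from the display order (PTS).

```cpp
#include <cstdio>

struct Frame { char type; int pts; int dts; };

int main() {
    // Display order: I B B P  (PTS 0, 1, 2, 3)
    // Decode  order: I P B B  (the P frame must be decoded before the Bs
    // that reference it, so it arrives earlier than its PTS suggests).
    Frame streamOrder[] = {
        {'I', 0, 0},
        {'P', 3, 1},
        {'B', 1, 2},
        {'B', 2, 3},
    };
    for (const Frame& f : streamOrder)
        printf("%c frame  DTS=%d  PTS=%d\n", f.type, f.dts, f.pts);
    return 0;
}
```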

Color space of the frame

Earlier we introduced two image color spaces, RGB and YUV. H.264 uses YUV.
YUV storage layouts fall into two categories: planar and packed.

Planar: store all the Y samples first, then all the U samples, and finally all the V samples.

Packed: the Y, U, and V values of each pixel are stored interleaved, pixel by pixel.

However, the packed layout is very rarely used nowadays; most video is stored in the planar layout.

As mentioned above, the human eye is not very sensitive to chrominance, so some chrominance information can be omitted, with several luminance samples sharing one set of chrominance samples, to save storage space. Planar YUV therefore comes in several sampling formats: YUV444, YUV422, and YUV420.

YUV 4:4:4 sampling: each Y has its own set of UV components.
YUV 4:2:2 sampling: every two Y samples share one set of UV components.
YUV 4:2:0 sampling: every four Y samples share one set of UV components.

Among them, the most commonly used is YUV420.
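The sampling ratio translates directly into frame buffer size. A small sketch (my own, assuming 8 bits per sample and no row padding):

```cpp
#include <cstdio>

// Bytes needed for one frame at the given resolution, 8 bits per sample.
long long frameBytes(long long w, long long h, double chromaPerLuma) {
    // One Y sample per pixel, plus U and V scaled by the subsampling ratio.
    return static_cast<long long>(w * h * (1.0 + 2.0 * chromaPerLuma));
}

int main() {
    const long long w = 1920, h = 1080;
    printf("YUV444: %lld bytes\n", frameBytes(w, h, 1.0));    // each Y has its own U,V
    printf("YUV422: %lld bytes\n", frameBytes(w, h, 0.5));    // 2 Y share one U,V pair
    printf("YUV420: %lld bytes\n", frameBytes(w, h, 0.25));   // 4 Y share one U,V pair
    return 0;
}
// For 1920x1080: 6,220,800 / 4,147,200 / 3,110,400 bytes respectively.
```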

YUV420 format storage method

YUV420 belongs to the planar family, but it comes in two variants:

YUV420P: three-plane storage. The data layout is YYYYYYYYUUVV (e.g. I420) or YYYYYYYYVVUU (e.g. YV12).

YUV420SP: two-plane storage, with the two chroma components interleaved: YYYYYYYYUVUV (e.g. NV12) or YYYYYYYYVUVU (e.g. NV21).
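To make the two layouts concrete, here is a simplified sketch (my own, assuming tightly packed planes with no stride or padding) that rearranges an NV21 buffer (Y plane followed by interleaved VU) into an I420 buffer (separate Y, U, and V planes):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Rearrange NV21 (YYYY... VUVU...) into I420 (YYYY... UU... VV...).
std::vector<uint8_t> nv21ToI420(const std::vector<uint8_t>& nv21, int width, int height) {
    const int ySize  = width * height;
    const int uvSize = ySize / 4;                 // size of each chroma plane in 4:2:0

    std::vector<uint8_t> i420(ySize + 2 * uvSize);

    // The Y plane is identical in both layouts.
    std::copy(nv21.begin(), nv21.begin() + ySize, i420.begin());

    uint8_t* uPlane = i420.data() + ySize;
    uint8_t* vPlane = i420.data() + ySize + uvSize;
    const uint8_t* vu = nv21.data() + ySize;      // interleaved V,U pairs

    for (int i = 0; i < uvSize; ++i) {
        vPlane[i] = vu[2 * i];                    // NV21 stores V first...
        uPlane[i] = vu[2 * i + 1];                // ...then U
    }
    return i420;
}

int main() {
    const int w = 4, h = 2;                       // tiny 4x2 test frame
    std::vector<uint8_t> nv21(w * h * 3 / 2, 0);  // all-zero dummy data
    std::vector<uint8_t> i420 = nv21ToI420(nv21, w, h);
    return i420.size() == nv21.size() ? 0 : 1;
}
```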

There is much more to the H.264 encoding algorithm and its data structures (such as the network abstraction layer NAL, SPS, and PPS) than this article can cover. There are also many tutorials online; if you are interested, you can study them on your own.

5. Audio encoding

Audio encoding format

Raw PCM audio data is also very large, so it too needs compression encoding.

Like video, audio has many encoding formats, such as WAV, MP3, WMA, APE, and FLAC. Music enthusiasts will be very familiar with these, especially the last two lossless compression formats.

However, our protagonist today is none of these, but another compression format: AAC.

AAC is a newer-generation lossy audio compression technology with a high compression ratio. Most of the time, the audio in an MP4 video is compressed with AAC.

Introduction to AAC coding

There are two main AAC formats: ADIF and ADTS.

ADIF: Audio Data Interchange Format. Its defining feature is that the start of the audio data can be located unambiguously, and decoding cannot begin in the middle of the stream; it must start at the clearly defined beginning. This format is commonly used in disk files.

ADTS: Audio Data Transport Stream. Its defining feature is that it is a bitstream with sync words, so decoding can begin at any position in the stream. Its characteristics are similar to the MP3 stream format.

ADTS can be decoded starting from any frame because every frame carries its own header. ADIF has only a single header at the start, so the entire stream must be available before decoding can begin. The two header formats also differ. Today, audio is generally encoded as an ADTS stream.

ADIF data format:

header | raw_data

ADTS one-frame data format (the stream is a sequence of such frames, one after another): an ADTS header followed by the AAC raw data.
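As a small illustration of why ADTS can be decoded from any position, here is a sketch based on the commonly documented 7-byte ADTS header layout (treat the exact bit positions as an assumption to verify against the spec): it checks for the sync word and extracts the frame length, which is enough to walk from one frame to the next.

```cpp
#include <cstdint>
#include <cstdio>

// Returns the frame length in bytes if buf points at a plausible ADTS header,
// or -1 otherwise. Follows the commonly documented 7-byte ADTS header layout.
int parseAdtsFrameLength(const uint8_t* buf, int size) {
    if (size < 7) return -1;
    // 12-bit sync word: 0xFFF
    if (buf[0] != 0xFF || (buf[1] & 0xF0) != 0xF0) return -1;
    // aac_frame_length: 13 bits spread over bytes 3..5 (header + payload, in bytes)
    int frameLength = ((buf[3] & 0x03) << 11) | (buf[4] << 3) | (buf[5] >> 5);
    return frameLength;
}

int main() {
    // Hypothetical header bytes, just to exercise the parser.
    uint8_t fakeHeader[7] = {0xFF, 0xF1, 0x50, 0x80, 0x2E, 0x7F, 0xFC};
    printf("frame length = %d bytes\n", parseAdtsFrameLength(fakeHeader, 7));
    return 0;
}
```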

6. Audio and video containers

Careful readers may have noticed that none of the encoding formats introduced above is a video format we actually use day to day, such as mp4, rmvb, avi, mkv, or mov.

That's right: these familiar video formats are actually containers that wrap encoded audio and video data. They multiplex a video stream and an audio stream, each encoded with a particular standard, into a single file.

For example, mp4 supports video codecs such as H.264 and H.265, and audio codecs such as AAC and MP3.

mp4 is currently the most popular video format; on mobile devices, video is generally packaged as mp4.

7. Hard decoding and soft decoding

The difference between hard decoding and soft decoding

Some players let us choose between two playback modes, hard decoding and soft decoding, but most of the time we cannot tell the difference between them; for ordinary users, it only matters that the video plays.

So what is the difference between them?

A mobile phone or PC contains hardware such as the CPU, the GPU, and dedicated decoder chips. Normally our computation runs on the CPU, the chip that executes our software, while the GPU is mainly responsible for rendering the picture (a form of hardware acceleration).

So-called soft decoding means using the CPU's computing power to decode. If the CPU is not very powerful, decoding is slow and the phone may heat up; however, because a single uniform algorithm is used, compatibility is very good.

Hard decoding means using a dedicated decoding chip on the device to accelerate decoding. It is generally much faster, but because the hardware is implemented differently by each manufacturer, quality varies and compatibility problems are far more likely.


Any inaccuracies above are welcome to be pointed out and discussed. If you found this helpful, your likes and support are appreciated.
