Audio and video development II: summary of audio and video knowledge

Introduction

Audio and video technology covers a wide range of topics, including speech signal processing, digital image processing, information theory, encapsulation (container) formats, codecs, streaming media protocols, network transmission, rendering, and algorithms. In daily life, audio and video also play an increasingly important role, for example in video conferencing, live streaming, short video, players, and voice chat. The following sections introduce the field from several angles: a simple understanding of audio and video principles, audio and video theoretical foundations, a learning route, media protocols, and audio and video development directions.

A simple understanding of audio and video principles

Many people played a little game as children: draw an animal in slightly different positions on several consecutive pages of a notebook, then flip the pages quickly to get the effect shown in Figure 0. This is the principle behind (video) animation, and the principle of audio and video is based on the same idea. For example, add a sound while flipping the pages and you can imitate a horse neighing.

Figure 0

**The basic principle of audio and video is to play a sequence of pictures while playing the corresponding sequence of audio samples in sync.** As shown in Figures 0 to 3, there are three key elements.

  • 1. YUV images. The sequence of pictures decoded from binary data encoded with H.264, HEVC, etc. is called YUV pictures; they are essentially the same kind of image data as common JPG and PNG files. The main job of the graphics card is to process this image data.

  • 2. PCM audio. The sequence of audio sample points decoded from binary data encoded with MP3, AAC, etc. is called PCM audio data. Each segment of the PCM audio waveform can be understood as one unit of sound.

  • 3. Audio and video synchronization. The sequence of picture frames is continuously rendered to the display, while the audio frames are played in sync according to the synchronization parameters DTS and PTS. This is the video playback process.

Figures 1, 2, and 3

Theoretical Basis of Audio and Video

Audio

Audio refers to sound that people can hear, including voice, music, and other sounds such as ambient sound, sound effects, and natural sounds.

Introduction to sound

Sound is a physical phenomenon. When objects vibrate, sound waves travel through the air to the eardrum and are perceived as sound after being processed by the brain.

Sound is characterized by frequency and amplitude, with frequency corresponding to the time axis and amplitude corresponding to the level axis.

Why does digital audio exist?

It is known from physics that complex sound waves are composed of many sine waves with different amplitudes and frequencies.

The analog information representing sound is a continuous quantity that cannot be processed directly by a computer; it must be digitized first. Once digitized, sound information can be stored, retrieved, edited, and processed just like text and graphics.

What is digital audio?

We know that sound can be expressed as a waveform that develops over time:


But there is no direct way to describe such a curve and store it in a computer.
If the description could only be phrased as "the curve goes down, goes up, goes down again, and goes up again", that would obviously be far too imprecise.
So people came up with a method:

At every small, fixed time interval, use a ruler to measure where the curve is at that point.

Then as long as the interval is certain, we can describe this curve as: {9,11,12,13,14,14,15,15,15,14,14,13,12,10,9,7… }

Is this description more accurate than the one just now? If we make the time interval smaller and use a more precise "ruler", the numbers used to describe this curve also become more accurate.

Then we convert these level values into binary data and save them. During playback, the data is converted back into analog level signals and sent to the loudspeaker. That is all there is to it.

In general: digital audio refers to audio information recorded with digital encoding, that is, with 0s and 1s; the term is used in contrast to analog audio.

The process from "analog signal" to "digital signal":

Audio in nature is an analog signal. In order to be recognized by a computer, it needs to be converted from an analog signal to a digital signal.

Converting an analog signal into a digital one requires three steps:

1. Sampling

Sampling (also called sample extraction) means taking the value of the signal at a given point in time. Obviously, the more points extracted per second, the richer the frequency information that can be captured.

The basic sampling theorem: to reconstruct a waveform, there must be at least two sampling points per vibration cycle. The highest frequency the human ear can perceive is 20 kHz, so to satisfy the hearing requirements of the human ear, at least 40k samples per second are needed.

2. Quantization

In digital audio technology, the analog voltage representing the strength of the sound is expressed as a number: for example, a voltage of 0.5 V is represented by the number 20, and 2 V by 80. Within any level range there are still infinitely many possible analog amplitudes, such as 1.2 V, 1.21 V, 1.215 V, and so on. When numbers are used to represent the audio amplitude range, only a finite set of numbers is available to cover an infinite range of voltages. So all voltages within a certain amplitude range are represented by one number; this is called quantization.

3. Coding

The basic number system in computers is binary, so the sound data must also be written in a data format the computer can store; this step is called encoding.
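
To make the three steps concrete, here is a minimal Python sketch that samples a 440 Hz sine wave at 44.1 kHz, quantizes it to 16 bits, and encodes it as PCM in a WAV file (the tone, file name, and parameters are arbitrary choices for the example):

```python
import math
import struct
import wave

SAMPLE_RATE = 44100      # samples per second (sampling)
BIT_DEPTH = 16           # bits per sample (quantization)
FREQ = 440.0             # a 440 Hz test tone
DURATION = 1.0           # seconds

frames = bytearray()
max_amp = 2 ** (BIT_DEPTH - 1) - 1           # 32767 for 16-bit samples
for n in range(int(SAMPLE_RATE * DURATION)):
    t = n / SAMPLE_RATE                      # sampling: pick a point in time
    value = math.sin(2 * math.pi * FREQ * t) # continuous amplitude in [-1, 1]
    sample = int(value * max_amp)            # quantization: map to an integer
    frames += struct.pack('<h', sample)      # encoding: little-endian 16-bit PCM

# Wrap the raw PCM in a WAV container so any player can open it.
with wave.open('tone.wav', 'wb') as wav:
    wav.setnchannels(1)
    wav.setsampwidth(BIT_DEPTH // 8)
    wav.setframerate(SAMPLE_RATE)
    wav.writeframes(bytes(frames))
```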

Audio storage

The calculation formula is: (sampling rate × bits per sample × number of channels) / 8 × time (seconds). Assume the sampling rate is 44.1 kHz, the number of channels is 2, and the sample size is 16 bits. Then the storage space occupied per second = 44100 × 2 × 16 / 8 = 176,400 bytes ≈ 176.4 KB, and one minute takes about 10.09 MB.
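
A quick check of those figures, using the values from the example above:

```python
sample_rate = 44100      # Hz
channels = 2
bits_per_sample = 16

bytes_per_second = sample_rate * channels * bits_per_sample // 8
print(bytes_per_second)                     # 176400 bytes ≈ 176.4 KB per second
print(bytes_per_second * 60 / (1024 ** 2))  # ≈ 10.09 MB per minute
```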

Audio encoding

The function of audio encoding is to compress audio sample data (PCM, etc.) into an audio bit stream, thereby reducing the amount of audio data.

Audio compression, what is mainly compressed:

Audio compression technology compresses the audio data signal as much as possible while ensuring that the compression does not cause audible distortion.

The main method of compression is to remove redundant information from the captured audio. Redundant information includes audio signals outside the hearing range of the human ear and audio signals that are masked.

Signal masking can be divided into frequency-domain masking and time-domain masking.


The frequency-domain masking effect: when frequencies are close, within a certain frequency range a louder sound masks a quieter one. In the figure, the masking source (solid black line) covers the masked sound (dotted line): their frequencies are similar, but the masked sound is quieter than the masking source, so it is masked.

The time-domain masking effect: sounds that fall within the masking time window of a louder sound are masked.

After hearing a strong sound, the auditory system may temporarily block out weaker sounds for a certain period of time, so those sounds cannot be processed and identified by the auditory system.

Commonly used audio encoding methods include MP3, AAC, OPUS, FLAC, etc.

Audio decoding

Audio decoding is the process of decoding a compressed, encoded digital audio file back into the original PCM audio signal. Decoding is the reverse of encoding. Since different audio files use different encoding formats, corresponding decoders are required for decoding.
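
In practice, decoding is usually delegated to a decoder library or tool. A minimal sketch that drives the FFmpeg command-line tool from Python to decode an audio file into raw PCM (the file names are placeholders for the example):

```python
import subprocess

# Decode a compressed audio file to raw 16-bit little-endian PCM,
# resampled to 44.1 kHz stereo ("input.aac" and "out.pcm" are example names).
subprocess.run([
    "ffmpeg", "-i", "input.aac",
    "-f", "s16le",            # raw PCM output, no container
    "-acodec", "pcm_s16le",   # signed 16-bit little-endian samples
    "-ar", "44100",           # sample rate
    "-ac", "2",               # channel count
    "out.pcm",
], check=True)
```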

Video

When images change continuously at more than 24 frames per second, then according to the principle of persistence of vision the human eye cannot distinguish the individual still images, and the result appears as a smooth, continuous visual effect. Such a continuous sequence of images is called a video.


The basic unit of video is the image.

Color model

Light and color

Light is the visible spectrum of electromagnetic waves that can be seen (received) by the naked eye.

Visible light, which the human eye can see, is only part of the entire electromagnetic spectrum. The visible range of electromagnetic waves is roughly 390~760 nm (1 nm = 10^-9 m = 0.000000001 m).

Color is the result of the perception of visible light by the visual system. Studies have shown that the human retina has three types of cone cells that are sensitive to red, green, and blue colors. The red, green and blue cone cells perceive different frequencies of light to different degrees, as well as different levels of brightness.

Any color in nature can be determined by combining the three color values R, G, and B, and an RGB color space is formed with these three colors as primaries.
Color = R (percentage of red) + G (percentage of green) + B (percentage of blue). As long as none of the three primaries can be produced from the other two, different sets of primaries can be chosen to construct different color spaces.

YUV (YCbCr) color coding

Experiments have shown that the human eye is sensitive to luminance but not to chrominance. Luminance and chrominance information can therefore be separated, and this way of "deceiving" the human eye saves space; it suits the field of image processing and improves compression efficiency.

From the figure above, we can see that an image with the color removed is still easy to recognize, while an image stripped down to only its color is hard to recognize.

Hence YUV. YUV, also known as YCbCr, is another way of representing color. YUV color encoding uses luminance Y and chrominance UV to specify the color of a pixel: "Y" represents the brightness (Luminance or Luma), that is, the grayscale value, while "U" and "V" represent the chrominance (Chrominance or Chroma), describing the hue and saturation of the image.

YUV is a color space representation that separates luminance from chrominance. Compared with RGB, YUV preserves the luminance information accurately while discarding chrominance detail that human perception is insensitive to, which makes it more suitable for video transmission and storage and gives it higher compression performance.
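
For reference, a commonly used (BT.601-style, full-range) conversion between RGB and YUV looks roughly like the sketch below; the exact coefficients vary between standards:

```python
def rgb_to_yuv(r, g, b):
    """Approximate BT.601 full-range RGB -> YUV (all values in 0..255)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    u = -0.169 * r - 0.331 * g + 0.5 * b + 128
    v = 0.5 * r - 0.419 * g - 0.081 * b + 128
    return y, u, v

def yuv_to_rgb(y, u, v):
    """Inverse of the transform above."""
    r = y + 1.402 * (v - 128)
    g = y - 0.344 * (u - 128) - 0.714 * (v - 128)
    b = y + 1.772 * (u - 128)
    return r, g, b
```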

YUV sampling format

In RGB, each pixel has R, G, and B components, each occupying 8 or 16 bits, so each pixel takes 24 or 48 bits. To save bandwidth, most YUV formats use fewer than 24 bits per pixel on average.

The main sampling formats are YUV 4:2:0 (the most widely used), YUV 4:2:2, and YUV 4:4:4. 4:2:0 means every 4 pixels share 4 luma samples and 2 chroma samples (YYYYCbCr); 4:2:2 means every 4 pixels have 4 luma samples and 4 chroma samples (YYYYCbCrCbCr); 4:4:4 means a full sample for every pixel (YYYYCbCrCbCrCbCrCbCr).

4:4:4 means full sampling; the same size as RGB.

4:2:2 means 2:1 horizontal sampling and full vertical sampling; one-third smaller than RGB.

4:2:0 means 2:1 horizontal sampling and 2:1 vertical sampling: for every two rows of Y, one half-row of U and one half-row of V are stored. Half the size of RGB.
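
A quick comparison of the storage needed for one 1920x1080 frame under these sampling formats, assuming 8 bits per sample:

```python
width, height = 1920, 1080

rgb24  = width * height * 3        # 8 bits each for R, G, B
yuv444 = width * height * 3        # full chroma sampling, same size as RGB24
yuv422 = width * height * 2        # chroma halved horizontally
yuv420 = width * height * 3 // 2   # chroma halved horizontally and vertically

print(rgb24, yuv444, yuv422, yuv420)
# 6220800 6220800 4147200 3110400 -> 4:2:2 is 1/3 smaller, 4:2:0 is 1/2 smaller
```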

Storage

Storage size = video bit rate × time.

The video bit rate is calculated as: frame rate × image resolution × sampling accuracy (bits per pixel), per second.

Then the data volume of a 1-hour movie in RGB24 format is: 3600 × 25 × 1920 × 1080 × 24 / (8 × 1024 × 1024 × 1024) ≈ 521.42 GB (PS: here the frame rate is 25 Hz, RGB24 uses 24 bits per pixel, and the resolution is 1920×1080).
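
The same calculation as a small sketch:

```python
fps = 25
width, height = 1920, 1080
bits_per_pixel = 24          # RGB24
seconds = 3600               # one hour

total_bits = seconds * fps * width * height * bits_per_pixel
print(total_bits / 8 / 1024 ** 3)   # ≈ 521.42 GiB of uncompressed video
```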

Video encoding

**The raw audio and video signals that are captured are very large, and contain a great deal of identical content that the eye cannot see and the ear cannot hear.** For example, if video were not compressed and encoded, its volume would usually be enormous: a single movie could require hundreds of gigabytes of space.

Professionally speaking, video encoding refers to the compression algorithm used for the video in a file. Its main function is to compress video pixel data (RGB, YUV, etc.) into a video bit stream, thereby reducing the amount of video data.

Video compression, what is mainly compressed:

Spatial redundancy: adjacent pixels in an image are strongly correlated.
Temporal redundancy: adjacent images in a video sequence have similar content.
Coding redundancy: different pixel values occur with different probabilities.
Visual redundancy: the human visual system is insensitive to some details.
Knowledge redundancy: regular structure can be inferred from prior knowledge and background knowledge.

Common video codec standards include H.26x (H.264, H.265, etc.) and MPEG. Note that different video packaging formats may use the same codec; the packaging formats are simply different vendors' containers. It is like several ice-cream manufacturers producing the same flavor of ice cream in different outer packaging.

Video decoding

With encoding there must, of course, also be decoding.
Encoded content cannot be used directly; it must be decoded when it is used (watched) and restored to the original signal (for example, the color of a given point in the video). That is decoding.

Video decoding is the process of decoding binary data encoded with a given method (e.g. H.264) back into YUV pictures, that is, "H.264 -> YUV". The most widely used tool is FFmpeg, an open-source codec suite that covers the common codecs and packaging formats (video formats).
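
A minimal sketch that drives the FFmpeg command-line tool from Python to decode a video file into raw YUV 4:2:0 frames (the file names are placeholders):

```python
import subprocess

# Decode a video file to raw planar YUV 4:2:0 frames
# ("input.mp4" and "out.yuv" are example names).
subprocess.run([
    "ffmpeg", "-i", "input.mp4",
    "-f", "rawvideo",        # no container, just raw frames
    "-pix_fmt", "yuv420p",   # planar YUV 4:2:0
    "out.yuv",
], check=True)
```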

Package format

The video formats we often see, such as mp4, avi, and mkv, are technically called "video encapsulation formats", or video formats for short. They contain the video information, audio information, and related configuration information needed by the file (for example, how the video and audio relate to each other and how to decode them). An encapsulation format is a shell, equivalent to a container. Common encapsulation formats are: mp4, mkv, webm, avi, 3gp, mov, wmv, flv, mpeg, asf, rmvb, etc.


(1) An encapsulation format (also called a container) packs the encoded, compressed video track and audio track into one file according to a certain layout; in other words, it is just a shell, and can be regarded as a folder holding the video track and audio track.
(2) In layman's terms, the video track is the rice and the audio track is the dishes; the packaging format is the bowl or pot, the container that holds the meal.
(3) Packaging formats are tied to patents, and to the profits of the companies that launch them.
(4) With a packaging format, subtitles, dubbing, audio, and video can be combined.
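
A container's streams and metadata can be inspected without decoding anything. A minimal sketch using FFmpeg's ffprobe from Python (the file name is a placeholder):

```python
import json
import subprocess

# Ask ffprobe for the container ("format") and stream information as JSON
# ("movie.mkv" is just an example file name).
result = subprocess.run(
    ["ffprobe", "-v", "quiet", "-print_format", "json",
     "-show_format", "-show_streams", "movie.mkv"],
    capture_output=True, text=True, check=True,
)
info = json.loads(result.stdout)
print(info["format"]["format_name"])                   # e.g. "matroska,webm"
for stream in info["streams"]:
    print(stream["codec_type"], stream["codec_name"])  # e.g. "video h264", "audio aac"
```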

Example of packaging in MKV format:

Video player principle

Having covered these concepts, you can now understand how a video player works.

The following is the flow chart when playing a video file.
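
The demuxing and decoding stages in that chart are usually handled by a library such as FFmpeg; the final audio and video synchronization step can be sketched as below. This is a simplified audio-master strategy, and `video_frames`, `audio_clock`, and `render` are placeholders supplied by the player, not a real API:

```python
import time

SYNC_THRESHOLD = 0.01  # seconds of tolerated audio/video drift

def sync_video(video_frames, audio_clock, render):
    """Show, delay, or drop decoded video frames against the audio clock.

    video_frames: iterable of decoded frames, each with a .pts in seconds
    audio_clock:  callable returning the PTS of the audio currently playing
    render:       callable that displays one YUV frame
    """
    for frame in video_frames:
        diff = frame.pts - audio_clock()   # how far ahead of the audio we are
        if diff > SYNC_THRESHOLD:
            time.sleep(diff)               # frame is early: wait for the audio
        elif diff < -SYNC_THRESHOLD:
            continue                       # frame is late: drop it
        render(frame)                      # frame is on time: display it
```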

Media protocols

1. Streaming media transmission protocol

In addition to playing local videos, we often play videos online (on-demand and live streaming). Online playback requires the support of streaming media protocols. Common streaming media transmission protocols include: RTP, SRTP, RTMP, RTSP, RTCP, etc. Among them, RTP (Real-time Transport Protocol) is a real-time transport protocol, and SRTP is the Secure Real-time Transport Protocol, i.e. encrypted transmission on top of RTP to prevent audio and video data from being stolen. RTMP (Real Time Messaging Protocol) is Adobe's real-time messaging protocol; it is based on TCP, and its variants include RTMPE, RTMPS, and RTMPT. RTSP (Real Time Streaming Protocol) is a real-time streaming protocol whose methods include OPTIONS, DESCRIBE, SETUP, PLAY, PAUSE, TEARDOWN, etc. RTCP (RTP Control Protocol) is the control protocol for RTP transmission, used to report statistics such as packet loss and transmission delay.
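
To give a concrete taste of these protocols, the fixed 12-byte RTP header (RFC 3550) can be parsed as in the sketch below; it ignores CSRC lists and header extensions for simplicity:

```python
import struct

def parse_rtp_header(packet: bytes) -> dict:
    """Parse the fixed 12-byte RTP header (RFC 3550), ignoring CSRCs/extensions."""
    if len(packet) < 12:
        raise ValueError("packet too short for an RTP header")
    b0, b1, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version":      b0 >> 6,          # should be 2
        "padding":      (b0 >> 5) & 0x1,
        "extension":    (b0 >> 4) & 0x1,
        "csrc_count":   b0 & 0x0F,
        "marker":       b1 >> 7,
        "payload_type": b1 & 0x7F,        # identifies the codec of the payload
        "sequence":     seq,              # used to detect loss and reordering
        "timestamp":    timestamp,        # media clock for synchronization
        "ssrc":         ssrc,             # identifies the sending source
    }
```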

2. Streaming media application protocol

Streaming media application protocols include HLS and DASH. HLS is Apple's HTTP live streaming application protocol, which involves the M3U8 playlist format and TS segments. DASH is a streaming protocol widely used by Google; it uses fMP4 segments and supports adaptive bit rate and seamless switching between multiple bit rates.

3. WebRTC signaling protocol

WebRTC (Web Real-Time Communications) is a real-time communication technology that allows web applications or sites to establish peer-to-peer connections between browsers without an intermediary, in order to transmit video streams, audio streams, or other arbitrary data. The protocols involved in WebRTC signaling and connection setup include SDP, ICE, STUN, and TURN (for NAT traversal). For network transport, WebRTC also uses the streaming media transmission protocols mentioned above.

4. Audio and video coding protocol

Commonly used audio encoding protocols are: MP3, AAC, OPUS, FLAC, AC3, EAC3, AMR_NB, PCM_S16LE. Video encoding protocols include: H264, HEVC, VP9, MPEG4, AV1, etc. For the related audio and video codec protocols, please refer to: Entering the world of audio and video - audio and video encoding and Entering the world of audio and video - audio and video decoding.

5. Audio and video package formats

Commonly used video packaging formats are: mp4, mov, mkv, webm, flv, avi, ts, mpg, wmv, etc. Commonly used audio packaging formats are: mp3, m4a, flac, ogg, wav, wma, amr, etc. A packaging format is a multimedia container that holds the multimedia information and the audio and video streams. The multimedia information includes duration, resolution, frame rate, bit rate, sampling rate, number of channels, and so on, which are the basic audio and video concepts covered above. An audio or video stream consists of frames produced by encoding and compressing the raw data, and a subtitle stream usually consists of text or bitmaps in a specific format. For more on packaging formats, see the earlier articles: Entering the world of audio and video - audio packaging format and Entering the world of audio and video - video packaging format.

The protocols involved above are summarized as follows:

Learning path

1. Audio and video basics

Audio basics

Audio basics include: sampling rate, number of channels and channel layout, sampling format, PCM and waveforms, sound quality, audio encoding formats, and audio encapsulation formats. For details, see the basic audio and video concepts above.

General basics

General includes: coding principles, C/C++ basics, video analysis tools, FFmpeg common commands, and platform-related multimedia APIs.

Video basics

Video basics include: frame rate, bit rate, resolution, pixel format, color space, I frames, P frames, B frames, DTS and PTS, YUV and RGB, bit depth and color gamut, video encoding formats, and video packaging formats. For details, see the basic audio and video concepts above.

2. Advanced audio and video

1. Advanced audio
Advanced audio and video work is likewise divided into audio, general, and video. The audio part includes: recording, microphone capture, audio codecs, audio playback, audio analysis, and sound effects.

2. General Advanced
General includes: familiarity with streaming media protocols, audio and video transmission, audio and video synchronous playback, platform-related multimedia applications, FFmpeg-related API applications, OpenGL rendering, audio and video editing.

3. Video advanced
Video includes: video recording, camera capture, video codecs, video playback, filter effects, and video transcoding. Build on the basics with in-depth study, as shown in the following figure:

3. Audio and video related open source libraries

1. Multimedia processing
Multimedia processing includes: FFmpeg, libav, Gstreamer. Among them, FFmpeg is currently the most commonly used audio and video processing library, including modules such as encapsulation format, codec, filter, image scaling, and audio resampling.

2. Streaming media transmission
Streaming media transmission includes WebRTC and live555. Among them, WebRTC is currently the most commonly used RTC library. The more famous modules include JitterBuffer, NetEQ, pacer, and network bandwidth estimation.

3. Player
Players include: ijkplayer, ExoPlayer, VLC. Among them, ijkplayer is Bilibili's open-source cross-platform player, ExoPlayer is Google's open-source player for the Android platform, and VLC is open-sourced by the VideoLAN non-profit organization.

4. Codec
Commonly used codecs include: aac, mp3, opus, vp9, x264, av1. Among them, AAC is generally used for VOD and short video, and Opus is used for RTC live streaming. VP9 is Google's open-source codec, x264 is the H.264 encoder provided by VideoLAN, and AV1 is the new-generation video codec open-sourced by AOMedia (Alliance for Open Media).

5. Audio processing
Open source libraries for audio processing include: sox, soundtouch, speex. Among them, sox is known as the Swiss Army Knife of audio processing; it can produce various sound effects and provides various filters. SoundTouch is used for time stretching and pitch shifting (changing speed without changing pitch). Speex is strictly speaking a codec, but it contains a wealth of audio processing modules: PLC (packet loss concealment), VAD (voice activity detection), DTX (discontinuous transmission), AEC (acoustic echo cancellation), and NS (noise suppression).

6. Streaming media server
The mainstream streaming media servers are SRS and Janus. SRS is a simple and efficient video server that supports RTMP, WebRTC, HLS, HTTP-FLV, and SRT. Janus is MeetEcho's open-source WebRTC streaming media server; strictly speaking, it is a gateway.

7. Audio and video analysis
Analysis tools are indispensable for audio and video development, and mastering them is very important. Commonly used audio and video analysis tools include, but are not limited to: Mp4Parser, VideoEye, Audacity. Among them, Mp4Parser is used to analyze the mp4 format and its structure. VideoEye is an open-source Windows tool for analyzing video streams by Lei Xiaohua ("Leishen"), in tribute to his open-source spirit. Audacity is an open-source audio editor that can be used to add various sound effects and analyze audio waveforms.

8. Video rendering
Open source libraries related to video rendering include: GPUImage, Grafika, LearnOpenGL. Among them, GPUImage can be used to add various filter effects, Grafika is an open-source rendering sample library for the Android platform by a Google engineer, and LearnOpenGL is mainly an OpenGL tutorial hosted on its website.

The relevant open source websites and addresses are as follows:

| Project | Address |
| --- | --- |
| FFmpeg | https://ffmpeg.org/ |
| WebRTC | https://webrtc.org.cn/ |
| RTC community | https://rtcdeveloper.agora.io/ |
| RFC protocols | https://www.rfc-editor.org/rfc/ |
| OpenGL | https://learnopengl-cn.github.io/ |
| GPUImage | https://github.com/BradLarson/GPUImage |
| VideoLAN | https://www.videolan.org/projects/ |
| AOMedia | https://aomedia.org/ |
| xiph.org | https://gitlab.xiph.org/ |
| VP9 | https://www.encoding.com/vp9/ |
| soundtouch | http://soundtouch.surina.net/ |
| sox | http://sox.sourceforge.net/ |

Direction of development


Origin blog.csdn.net/qq_38056514/article/details/129848687