Audio and video development 1: Basic concepts of audio and video

Basic concepts

Audio and video share some basic concepts, such as bit rate and duration, and different audio and video files use different encapsulation formats and encoding protocols. The key video parameters include resolution, frame rate, picture quality, rotation angle, and pixel format, while the key audio parameters include sampling rate, number of channels, channel layout, sound quality, samples per frame, bits per sample, and frame duration. These are discussed in detail below.

Audio

Introduction to sound

Sound is a physical phenomenon. When an object vibrates, sound waves travel through the air to the eardrum, and the brain interprets the resulting signals as sound.

Sound is characterized by frequency and amplitude: frequency corresponds to the time axis and amplitude to the level axis.

Sound propagates as a wave produced by vibration. The human ear can perceive sound with frequencies between roughly 20 Hz and 20 kHz.

Types of sound waves

Classified by frequency: sound waves below 20 Hz are called infrasound; waves between 20 Hz and 20 kHz are audible sound; waves between 20 kHz and 1 GHz are ultrasound; and waves above 1 GHz are called hypersound or microwave ultrasound.

Sound quality

Sound quality is the fidelity of an audio signal after encoding and compression; it is determined by volume, pitch, and timbre.

Volume: the intensity of the audio, determined by the amplitude of the vibrating object. At the same frequency, the larger the amplitude, the louder the sound.

Pitch: the highness or lowness of a sound, determined by its frequency, i.e. the number of vibrations per second.

Timbre: the character of a sound, determined by its overtones; different sources produce different waveforms and therefore different timbres.

The development of sound storage

Any discussion of recording has to start with the ancestor of modern recording equipment, invented by Edison: the phonograph.

Edison invented the phonograph in 1877. While debugging a telephone transmitter, Edison, whose hearing was poor, used a needle to feel the vibration of the diaphragm; speech made the needle vibrate in a regular way, and this phenomenon became the inspiration for his invention.

Recording and playback are two corresponding processes: speech makes the needle vibrate, and that vibration can in turn reproduce the original sound. The sound wave is transformed into the vibration of a metal needle, and the waveform is engraved onto the foil of a cylinder (later a wax cylinder). When the needle travels along the recorded track again, the recorded sound is reproduced. Using this principle, Edison built his first phonograph.

Over time, recording passed through mechanical recording (gramophones and mechanical records), optical recording (film soundtracks), and magnetic recording (magnetic tape). These analog methods gradually gave way to digital recording (digital audio) in the 1970s and 1980s.

Sampling frequency

Sampling frequency, also known as sampling rate or sampling speed, is the number of samples taken from a continuous signal per unit time to form a discrete signal, expressed in hertz (Hz). The higher the sampling rate, the more faithful the reproduced sound. The reciprocal of the sampling frequency is the sampling period (sampling interval), the time between consecutive samples. In plain terms, the sampling frequency is how many samples of the signal the computer captures per second. According to the Nyquist theorem, when the sampling frequency is greater than twice the highest frequency in the signal, the sampled digital signal fully retains the information of the original signal. Since the highest frequency humans can hear is about 20 kHz, the sampling frequency should be greater than 40 kHz, which is why 44.1 kHz is a common choice.

Signal frequency

This is the rate at which the signal repeats. For y = sin(2 * pi * f0 * t), f0 is the physical frequency of the signal, as in the sketch below.
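
As a minimal illustration in plain C (the values are hypothetical: a 1 kHz tone sampled at 44.1 kHz, which satisfies the Nyquist condition fs > 2·f0), the sketch below evaluates y = sin(2·pi·f0·t) at the discrete sample instants t = n / fs:

#include <math.h>
#include <stdio.h>

#define PI          3.14159265358979
#define SAMPLE_RATE 44100     /* fs, samples per second: > 2 * f0 (Nyquist) */
#define TONE_HZ     1000.0    /* f0, the physical frequency of the signal  */

int main(void) {
    /* Sample y = sin(2 * pi * f0 * t) at t = n / fs for 10 ms. */
    int total = SAMPLE_RATE / 100;
    for (int n = 0; n < total; n++) {
        double t = (double)n / SAMPLE_RATE;      /* sampling period is 1 / fs */
        double y = sin(2.0 * PI * TONE_HZ * t);  /* amplitude in [-1, 1]      */
        printf("%d\t%f\n", n, y);
    }
    return 0;
}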

Channels

A channel is an independent audio signal that is captured or played back at a distinct spatial position during recording or playback. The number of channels is the number of sound sources during recording, or the number of speakers during playback. Different channel counts correspond to different channel layouts; common layouts include mono, stereo, surround, and 5.1.

The spatial positions at which the signals are captured are shown in the figure below:

Front Left, Front Right, Center, LFE (Low Frequency Effects), Side Left, Side Right, Back Left, Back Right.

Channel layout

  • Mono: a single channel. Its advantage is a small amount of data (amr_nb and amr_wb default to mono); its disadvantage is the lack of positional information.

  • Stereo: generally two channels, a left channel and a right channel, which improves the sense of sound position.

  • Quad surround: composed of front left, front right, back left, and back right, forming a surround field. 4.1 adds a subwoofer (low-frequency) channel to quad surround.

  • 5.1: adds a center channel on top of 4.1. Dolby AC-3 uses 5.1 channels, which is the Dolby sound promoted in theaters (see the FFmpeg sketch after this list).
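
A minimal sketch using FFmpeg's libavutil (assuming the classic pre-AVChannelLayout helpers such as av_get_channel_layout_nb_channels(), present in FFmpeg 5.x and earlier; newer releases use the AVChannelLayout API instead), mapping a few common layouts to their channel counts:

#include <stdio.h>
#include <stdint.h>
#include <libavutil/channel_layout.h>

int main(void) {
    /* Each layout constant is a bitmask of speaker positions;
     * the helper returns how many channels (set bits) it contains. */
    uint64_t layouts[] = { AV_CH_LAYOUT_MONO, AV_CH_LAYOUT_STEREO,
                           AV_CH_LAYOUT_QUAD, AV_CH_LAYOUT_5POINT1 };
    const char *names[] = { "mono", "stereo", "quad (4.0 surround)", "5.1" };

    for (int i = 0; i < 4; i++)
        printf("%-20s -> %d channels\n", names[i],
               av_get_channel_layout_nb_channels(layouts[i]));
    return 0;
}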

Audio frames

The concept of an audio frame is less clear-cut than that of a video frame: for almost all video encoding formats, one frame is simply one encoded image, whereas an audio frame depends on the encoding format. Different audio codecs pack a different number of samples into each frame.

Frame duration

The duration of one audio frame: the number of samples per frame divided by the sampling rate. For example, an AAC frame of 1024 samples at 44.1 kHz lasts 1024 / 44100 ≈ 23.2 ms.

Number of samples per frame

Also called the frame size: the number of samples contained in one frame. For a given codec its value is fixed (e.g. 1024 for AAC, 1152 for MP3). In FFmpeg's AVFrame it is the nb_samples field.

Number of sample bits

The number of bits each sample occupies. In the RIFF (Resource Interchange File Format) specification the field bits_per_sample carries this value, and FFmpeg uses the same name. The more bits per sample, the larger the range of quantization values and the wider the volume (dynamic) range of the sound.

Storage

The calculation formula is: sampling frequency × bits per sample × number of channels / 8 × duration (seconds). For a sampling rate of 44.1 kHz, 2 channels, and 16 bits per sample, the storage needed per second is 44100 × 2 × 16 / 8 = 176400 bytes = 176.4 kB.
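
A small sketch of that calculation (CD-style parameters: 44.1 kHz, 16-bit, stereo; the one-minute duration and variable names are illustrative):

#include <stdio.h>

int main(void) {
    long sample_rate = 44100;  /* samples per second per channel */
    int  bits        = 16;     /* bits per sample                */
    int  channels    = 2;
    int  seconds     = 60;

    long bytes_per_second = sample_rate * channels * bits / 8;  /* 176400 B = 176.4 kB */
    long total_bytes      = bytes_per_second * seconds;

    printf("PCM data rate : %ld bytes/s\n", bytes_per_second);
    printf("%d s of audio : %ld bytes (~%.1f MB)\n",
           seconds, total_bytes, total_bytes / 1e6);
    return 0;
}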

Bit rate (code rate)

Bit rate, also called code rate, is the amount of data transmitted per unit time. For raw audio, one second of data amounts to (sampling rate) × (bit depth) × (number of channels) bits; the unit is usually kbps (kilobits per second). Note that the b here stands for bit, not byte.

Calculation: average bit rate (kbps) = file size (kB) × 8 / duration (s); instantaneous bit rate (kbps) = data transmitted per second (kB) × 8. For CD-quality PCM, 44100 × 16 × 2 = 1,411,200 b/s ≈ 1411 kbps.

The higher the bit rate, the more image or sound data is transmitted per second and the clearer and more faithful the picture or sound, but the more bandwidth and storage it occupies. The bit rate should therefore be chosen according to actual needs and resource constraints, rather than blindly maximized or minimized.

Sampling format

Audio sample formats differ along several axes. By byte order: big-endian and little-endian. By sign: signed and unsigned. By type: integer and floating point. By width: 8, 16, 32, or 64 bits, all multiples of 8. FFmpeg's AVSampleFormat enumeration is as follows:

enum AVSampleFormat {
    AV_SAMPLE_FMT_NONE = -1,
    AV_SAMPLE_FMT_U8,          // unsigned 8 bits
    AV_SAMPLE_FMT_S16,         // signed 16 bits
    AV_SAMPLE_FMT_S32,         // signed 32 bits
    AV_SAMPLE_FMT_FLT,         // float
    AV_SAMPLE_FMT_DBL,         // double
 
    AV_SAMPLE_FMT_U8P,         // unsigned 8 bits, planar
    AV_SAMPLE_FMT_S16P,        // signed 16 bits, planar
    AV_SAMPLE_FMT_S32P,        // signed 32 bits, planar
    AV_SAMPLE_FMT_FLTP,        // float, planar
    AV_SAMPLE_FMT_DBLP,        // double, planar
    AV_SAMPLE_FMT_S64,         // signed 64 bits
    AV_SAMPLE_FMT_S64P,        // signed 64 bits, planar
 
    AV_SAMPLE_FMT_NB           // Number of sample formats
};
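
Packed formats interleave the channels (LRLR…), while the planar variants (the *P names) store each channel in its own plane (LLLL… RRRR…). For reference, a minimal sketch using the existing libavutil helpers av_get_bytes_per_sample() and av_sample_fmt_is_planar() to print the size and layout of a few formats:

#include <stdio.h>
#include <libavutil/samplefmt.h>

int main(void) {
    enum AVSampleFormat fmts[] = { AV_SAMPLE_FMT_U8, AV_SAMPLE_FMT_S16,
                                   AV_SAMPLE_FMT_FLTP, AV_SAMPLE_FMT_DBLP };
    for (int i = 0; i < 4; i++) {
        printf("%-5s: %d bytes/sample, planar=%d\n",
               av_get_sample_fmt_name(fmts[i]),
               av_get_bytes_per_sample(fmts[i]),
               av_sample_fmt_is_planar(fmts[i]));
    }
    return 0;
}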

Audio encoding (audio compression)

PCM

The digital signal produced by sampling, quantizing, and encoding a continuously varying analog signal is PCM (Pulse-Code Modulation) data; such a signal is also called a digital baseband signal. On Android, the data captured by AudioRecord (and, before encoding, by MediaRecorder) is PCM data: pure audio samples without any container format. PCM is considered lossless in the sense that the analog signal is only converted, not compressed; it is the uncompressed stream obtained directly from microphone recording. For audio, CDs use PCM encoding.

PCM (Pulse-Code Modulation) is a method of digitizing analog signals, and PCM encoding is the digital audio form produced by that method. PCM is the most basic audio encoding; all other audio encodings are produced by re-encoding and compressing PCM data.

Compression and encoding are not the same concept.

Compression reduces the storage space or transmission bandwidth occupied by data by removing or restructuring redundant and unnecessary information.

Encoding converts data into a representation suitable for storage, transmission, or processing. PCM encoding, for example, is a way of representing an analog audio signal in digital form.

A PCM file stores audio as raw PCM samples. It is an uncompressed digital audio file, usually called a PCM raw stream / raw audio data. Common file extensions are .pcm and .raw, and such files usually cannot be played directly. A raw PCM stream can be played normally after it is encoded and/or encapsulated (see the next section), for example into the .wav format, as sketched below.
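
As an illustration of that encapsulation step, the sketch below wraps raw 16-bit little-endian PCM in a minimal 44-byte RIFF/WAV header. It is a simplified sketch: the input file name "in.pcm" and its parameters (44.1 kHz, stereo) are assumptions, a little-endian host is assumed when writing the header fields, and error handling is trimmed.

#include <stdio.h>
#include <stdint.h>

/* Write a minimal 44-byte WAV header for 16-bit PCM
 * (assumes a little-endian host, as WAV fields are little-endian). */
static void write_wav_header(FILE *f, uint32_t data_bytes,
                             uint32_t sample_rate, uint16_t channels) {
    uint16_t bits = 16;
    uint32_t byte_rate   = sample_rate * channels * bits / 8;
    uint16_t block_align = channels * bits / 8;
    uint32_t riff_size   = 36 + data_bytes;
    uint32_t fmt_size = 16; uint16_t pcm_tag = 1;   /* 1 = uncompressed PCM */

    fwrite("RIFF", 1, 4, f); fwrite(&riff_size, 4, 1, f); fwrite("WAVE", 1, 4, f);
    fwrite("fmt ", 1, 4, f); fwrite(&fmt_size, 4, 1, f); fwrite(&pcm_tag, 2, 1, f);
    fwrite(&channels, 2, 1, f); fwrite(&sample_rate, 4, 1, f);
    fwrite(&byte_rate, 4, 1, f); fwrite(&block_align, 2, 1, f); fwrite(&bits, 2, 1, f);
    fwrite("data", 1, 4, f); fwrite(&data_bytes, 4, 1, f);
}

int main(void) {
    /* Hypothetical input: "in.pcm" holds raw 16-bit stereo samples at 44.1 kHz. */
    FILE *in = fopen("in.pcm", "rb"), *out = fopen("out.wav", "wb");
    if (!in || !out) return 1;

    fseek(in, 0, SEEK_END);
    uint32_t data_bytes = (uint32_t)ftell(in);
    fseek(in, 0, SEEK_SET);

    write_wav_header(out, data_bytes, 44100, 2);

    char buf[4096]; size_t n;
    while ((n = fread(buf, 1, sizeof buf, in)) > 0)   /* copy PCM payload verbatim */
        fwrite(buf, 1, n, out);

    fclose(in); fclose(out);
    return 0;
}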

The purpose of audio encoding is to compress audio sample data (PCM, etc.) into an audio bit stream, thereby reducing the amount of audio data. Commonly used audio encoding formats include the following:

(1)MP3

MP3, short for MPEG-1 or MPEG-2 Audio Layer III, is a once very popular lossy digital audio coding and compression format designed to drastically reduce the amount of audio data. It was developed and standardized in 1991 by a group of engineers at the Fraunhofer-Gesellschaft research organization in Erlangen, Germany. The popularity of MP3 had a great impact on the music industry.

(2)AAC

AAC, short for Advanced Audio Coding, was jointly developed by Fraunhofer IIS, Dolby Laboratories, AT&T, Sony and other companies and launched in 1997, based on MPEG-2 audio coding technology. In 2000, after the MPEG-4 standard appeared, AAC was extended with SBR and PS technology; to distinguish this version from the traditional MPEG-2 AAC, it is also called MPEG-4 AAC. AAC achieves a higher compression ratio than MP3: for the same file size, AAC sound quality is higher.

(3)WMA

WMA, short for Windows Media Audio, is a digital audio compression format developed by Microsoft that includes both lossy and lossless variants.

What does audio compression mainly compress?

Audio compression technology compresses the audio data as much as possible while ensuring that no audible distortion is introduced.

The main approach is to remove redundant information from the captured audio. Redundant information includes audio signals outside the range of human hearing and audio signals that are masked by other signals.

Signal masking can be divided into frequency domain masking and time domain masking .

The frequency-domain masking effect: when two sounds are close in frequency, within a certain range the louder sound masks the quieter one. In the figure, the black curve of the masking source covers the masked sound indicated by the dotted line: their frequencies are similar, but the masked sound is quieter than the masking source, so it is masked.

The time-domain masking effect: after hearing a strong sound, the auditory system temporarily suppresses weaker sounds for a short period, so those sounds cannot be processed and identified; only sounds outside the masking interval are heard normally.

Audio decoding

Audio frames and video frames go through similar processing pipelines; the difference is that audio is encoded as AAC, MP3, etc., and the basic decoded unit is PCM (for video, the basic unit is YUV). A minimal FFmpeg decoding sketch follows.
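
The following is a minimal sketch of that pipeline using FFmpeg's libavformat and libavcodec (assuming the send/receive decoding API available since FFmpeg 3.x; error handling is trimmed for brevity, and the input is whatever file path is passed on the command line):

#include <stdio.h>
#include <libavformat/avformat.h>
#include <libavcodec/avcodec.h>

int main(int argc, char **argv) {
    if (argc < 2) return 1;

    AVFormatContext *fmt = NULL;
    if (avformat_open_input(&fmt, argv[1], NULL, NULL) < 0) return 1;  /* open container */
    avformat_find_stream_info(fmt, NULL);

    /* Pick the audio stream and open its decoder (AAC, MP3, ... chosen automatically). */
    int idx = av_find_best_stream(fmt, AVMEDIA_TYPE_AUDIO, -1, -1, NULL, 0);
    if (idx < 0) return 1;
    const AVCodec *codec = avcodec_find_decoder(fmt->streams[idx]->codecpar->codec_id);
    AVCodecContext *ctx = avcodec_alloc_context3(codec);
    avcodec_parameters_to_context(ctx, fmt->streams[idx]->codecpar);
    avcodec_open2(ctx, codec, NULL);

    AVPacket *pkt = av_packet_alloc();
    AVFrame  *frm = av_frame_alloc();
    while (av_read_frame(fmt, pkt) >= 0) {                 /* compressed packets in...  */
        if (pkt->stream_index == idx && avcodec_send_packet(ctx, pkt) == 0)
            while (avcodec_receive_frame(ctx, frm) == 0)   /* ...decoded PCM frames out */
                printf("PCM frame: %d samples, format %d\n", frm->nb_samples, frm->format);
        av_packet_unref(pkt);
    }

    av_frame_free(&frm);
    av_packet_free(&pkt);
    avcodec_free_context(&ctx);
    avformat_close_input(&fmt);
    return 0;
}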

Encapsulation format

Audio encapsulation formats, like video encapsulation formats, consist of a format-specific header + media information + audio track data. Common formats include mp3, m4a, ogg, amr, wma, wav, flac, aac, and ape.

Video

When images change continuously at more than 24 frames per second, the human eye, because of persistence of vision, cannot distinguish the individual still images and perceives a smooth, continuous visual effect. Such a continuous sequence of images is called a video.

The basic unit of video is the image. The colors in an image are RGB, so how does a monitor display them? In a CRT monitor, an electron gun excites the red, green, and blue phosphor dots of the screen to produce color; on a flat panel, each pixel is made up of three light-emitting elements.

If you look at the screen with a magnifying glass, you can see that there are three light-emitting elements behind each pixel, as shown below.

All colors on a computer screen are formed by mixing red, green, and blue in different proportions. One group of red, green, and blue sub-pixels is the smallest display unit, and any color on the screen can be recorded and represented by a set of RGB values.

In a picture tube, the three electron beams emitted by the electron gun strike the red, green, and blue phosphor dots on the screen. By controlling the intensity of each beam, the brightness of the three colored dots can be changed. Because these dots are tiny and close together, the human eye cannot resolve them and sees only their composite, i.e. the mixed color.

In other words, not only do images consist of pixel data; what appears on the monitor is also driven by pixel data. The relationship between image and screen is: the image is data, the screen is a display device, and the image data, passed through the display driver, makes the screen show the image.

Video pixel data

Role of video pixel data: to store the value of every pixel of each frame on the screen.

Format: common pixel formats include RGB24, RGB32, YUV420P, YUV422P, and YUV444P. Compression encoding generally works on pixel data in YUV format, and the most common format is YUV420P.

Characteristics: the volume of raw video pixel data is very large. One hour of RGB24 video is 3600 × 25 × 1920 × 1080 × 3 = 559,872,000,000 bytes ≈ 559.9 GB (frame rate 25 Hz, 8 bits per channel, 3 channels, resolution 1920×1080).

Bit rate (code rate)

Bit rate, also called code rate, is the amount of data transmitted per unit time. For raw video it is (frame rate) × (pixels per frame) × (number of channels) × (bits per channel) per second.

Storage

Raw video storage = bit rate × duration. A small sketch of these numbers follows.
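
A sketch of the raw, uncompressed figures used above (1920×1080, 25 fps, RGB24 at 3 bytes per pixel; the one-hour duration is the example from the previous section):

#include <stdio.h>

int main(void) {
    long long width = 1920, height = 1080, bytes_per_pixel = 3; /* RGB24 */
    long long fps = 25, seconds = 3600;

    long long frame_bytes = width * height * bytes_per_pixel;   /* 6 220 800 bytes */
    long long rate_bps    = frame_bytes * fps * 8;              /* raw bit rate    */
    long long total_bytes = frame_bytes * fps * seconds;        /* 559.872 GB      */

    printf("one frame    : %lld bytes\n", frame_bytes);
    printf("raw bit rate : %.1f Mbps\n", rate_bps / 1e6);
    printf("one hour     : %.3f GB\n", total_bytes / 1e9);
    return 0;
}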

Resolution

The higher the resolution, the more pixels and the sharper the image. Resolution can refer to image resolution or display resolution.

  • Image resolution: the width and height of the video, written as width × height. Common video resolutions are 480P, 720P, 1080P, 2K (2048×1080 / 2560×1440), and 4K (4096×2160 / 3840×2160). For example, a resolution of 1280×720 means the video has 1280 pixels horizontally and 720 pixels vertically. 720P refers to 1280×720 and is generally called "HD" by the industry; 1080P refers to 1920×1080, commonly called "Full HD". 2K is a looser term: a horizontal pixel count above roughly 2000 can be called 2K.

  • Display resolution: screen pixel density is described in ppi (pixels per inch). The calculation is simple: compute the number of pixels along the diagonal from the width and height (Pythagorean theorem), then divide the diagonal pixel count by the screen size in inches (the physical diagonal) to get ppi (see the sketch below).
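
A small sketch of that ppi calculation (the device is hypothetical: a 1920×1080 panel with a 5.5-inch diagonal):

#include <math.h>
#include <stdio.h>

int main(void) {
    double w_px = 1920.0, h_px = 1080.0;
    double diagonal_inch = 5.5;             /* physical screen diagonal */

    double diagonal_px = sqrt(w_px * w_px + h_px * h_px);  /* Pythagorean theorem */
    double ppi = diagonal_px / diagonal_inch;

    printf("diagonal: %.1f px, density: %.0f ppi\n", diagonal_px, ppi);
    return 0;
}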

Common resolutions and display modes are as follows:

Bit depth

For audio, bit depth indicates how precisely the computer measures the amplitude (volume) of the sound waveform; it is commonly referred to as the resolution of the sound card. Typical bit widths during audio sampling are 8 or 16 bits.

For images, bit depth (BitDepth) is the number of bits of information stored per pixel. The more bits, the finer the gradations that can be represented (and, for audio, the better the sound quality), at the cost of more data. Common Android Bitmap configurations are ALPHA_8, RGB_565, ARGB_4444, and ARGB_8888.

Frame

The smallest unit of a moving image, equivalent to a single frame of film in a movie. One frame is a still picture; consecutive frames form a video.

Frame rate

  • Video frame rate: the number of frames displayed per second, measured in FPS (Frames Per Second). Film is typically 24 fps; PAL (proposed in Germany and used in China, India, Pakistan and other countries) is 25 fps, i.e. each frame is shown for 40 ms; NTSC (proposed by the US standards committee and used in the United States, Japan, South Korea and other countries) is 30 fps. Some high-frame-rate videos reach 60 fps.
  • Display frame rate: the frequency at which frames appear on the display, also known as the refresh rate. Android devices generally refresh at 60 Hz, i.e. 60 fps with about 16 ms per frame; exceeding 16 ms causes visible delays and stutter. For performance optimization this means the whole sequence of measuring, laying out, drawing, issuing commands, and swapping buffers with the GPU must finish within 16 ms. Android 11 supports higher refresh rates such as 120 Hz, used in scenarios with extremely high frame-rate demands such as interactive games (see the sketch below).
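
A tiny sketch of the per-frame time budget mentioned above, for a few common frame rates:

#include <stdio.h>

int main(void) {
    int rates[] = { 24, 25, 30, 60, 120 };   /* frames per second */
    for (int i = 0; i < 5; i++)
        printf("%3d fps -> %.2f ms per frame\n", rates[i], 1000.0 / rates[i]);
    return 0;
}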

Experience at different frame rates: https://frames-per-second.appspot.com

Field frequency

Field frequency, also called refresh frequency, is the vertical scanning frequency of a display: the number of times the screen image is redrawn per second, measured in hertz (Hz).

It is generally around 60-100 Hz; the field frequency is also called the screen refresh rate.

The persistence of vision of the human eye is roughly 16-24 images per second, so updating the picture about 30 times per second or more is enough to make the eye perceive a continuous image.

In practice, our eyes can still perceive flicker at a refresh rate of 30 times per second, which causes fatigue. The higher the field frequency, the more stable the picture and the more comfortable it is to watch.

In addition, a CRT screen is coated with medium- or short-persistence phosphors. If the electron gun did not repeatedly "light" and "extinguish" the phosphor dots, the afterimage of the previous picture would remain on the screen when the image changes.

At refresh rates above about 75 times per second the human eye can no longer perceive flicker, so setting the field frequency between 75 Hz and 85 Hz is enough for ordinary users.

The higher the field frequency, the more often the image is refreshed, the less visible flicker there is, and the better the picture quality. Note that this "number of refreshes" is a completely different concept from the "frame count" commonly quoted for game speed: the former is the scanning frequency of the display itself, while the latter is the number of frames of the dynamic image the computer produces per second.

Color model

Light and color

Light is the visible part of the electromagnetic spectrum that can be seen by the naked eye. In the scientific definition, light sometimes refers to electromagnetic waves in general.

Visible light, the part the human eye can see, is only a small portion of the electromagnetic spectrum, roughly 390-760 nm (1 nm = 10^-9 m = 0.000000001 m). We could not survive in this world without light.

Color is the result of the visual system's perception of visible light. Studies show that the human retina has three types of cone cells, sensitive to red, green, and blue respectively. These cones respond to different frequencies of light, and to different brightness levels, to different degrees.

Any color in nature can be expressed as a combination of the three values R, G, and B, and these three colors form the basis of an RGB color space: Color = R (percentage of red) + G (percentage of green) + B (percentage of blue). As long as none of the three primaries can be produced from the other two, different sets of primaries can be chosen to construct different color spaces.

RGB color coding


In everyday development the model we use most is RGB. R, G, and B stand for red, green, and blue, the three primary colors; adding them in different proportions can produce any color.

RGB formats:

What is the difference between RGB16, RGB24, RGB32, and so on?

Generally, the difference is the number of bits used per pixel, and therefore the richness of the colors that can be displayed: the more bits, the richer the colors.

Computers work in binary, so all quantities (storage space, computing speed, file size, and so on) are based on binary. To represent a color, each color needs a binary code.

With 8 bits, 2^8 = 256 colors can be represented; with 16 bits, 2^16 = 65,536 colors; with 24 bits, 2^24 = 16,777,216 colors.

Color depths of 24 bits and above are generally called true color; 30-bit, 36-bit, and 42-bit formats also exist. The longer the color code, the larger the file for the same pixel dimensions. Files with more than 16 bits of color show little visible difference on ordinary monitors, especially LCDs, because the panel itself cannot display that many colors; but higher bit depths matter for color printing, where the ink dots are very fine and, as the print size grows, larger files reveal more delicate gradations and detail.

YUV (YCbCr) color coding

Experiments have shown that the human eye is much more sensitive to luminance than to chrominance. Luminance and chrominance information can therefore be separated, and the chrominance reduced without the eye noticing; this "deception" of the eye saves space, which suits image processing and improves compression efficiency.

Hence YUV. YUV is another way of representing color and is also known as YCbCr. YUV color encoding specifies a pixel's color with a luminance component Y and chrominance components U and V: "Y" is the luminance (Luminance or Luma), i.e. the grayscale value, while "U" and "V" are the chrominance (Chrominance or Chroma), describing hue and saturation.

YUV is a color-space representation that separates luminance from chrominance. Compared with RGB, it preserves the luminance information the eye is most sensitive to while discarding chrominance detail the eye cannot perceive, which makes it better suited to video transmission and storage and gives it higher compression performance.

The YCbCr color space is the internationally standardized variant of YUV used in digital television and image compression (such as JPEG). YCbCr is essentially a scaled and offset version of YUV: Y has the same meaning as in YUV, and Cb and Cr also carry color information, just represented differently. Within the YUV family, YCbCr is the member most widely used in computer systems; both JPEG and MPEG use it, and in practice "YUV" usually refers to YCbCr. Cb (roughly U, blue): the difference between the blue component of the RGB input and the luminance of the RGB signal. Cr (roughly V, red): the difference between the red component of the RGB input and the luminance of the RGB signal.

// The formulas are not fixed; different RGB formats use different conversion coefficients.
// RGB to YCbCr:

Y  =  0.257*R + 0.504*G + 0.098*B + 16
Cb = -0.148*R - 0.291*G + 0.439*B + 128
Cr =  0.439*R - 0.368*G - 0.071*B + 128

// YCbCr to RGB:

R = 1.164*(Y-16) + 1.596*(Cr-128)
G = 1.164*(Y-16) - 0.392*(Cb-128) - 0.813*(Cr-128)
B = 1.164*(Y-16) + 2.017*(Cb-128)
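
A direct C transcription of these formulas (a sketch: values are clamped to [0, 255] so that rounding cannot overflow 8-bit components, and a round trip for pure red is used as a quick check):

#include <stdio.h>

static int clamp8(double v) { return v < 0 ? 0 : (v > 255 ? 255 : (int)(v + 0.5)); }

/* RGB -> YCbCr, using the coefficients above. */
static void rgb_to_ycbcr(int r, int g, int b, int *y, int *cb, int *cr) {
    *y  = clamp8( 0.257 * r + 0.504 * g + 0.098 * b + 16);
    *cb = clamp8(-0.148 * r - 0.291 * g + 0.439 * b + 128);
    *cr = clamp8( 0.439 * r - 0.368 * g - 0.071 * b + 128);
}

/* YCbCr -> RGB, the inverse formulas above. */
static void ycbcr_to_rgb(int y, int cb, int cr, int *r, int *g, int *b) {
    *r = clamp8(1.164 * (y - 16) + 1.596 * (cr - 128));
    *g = clamp8(1.164 * (y - 16) - 0.392 * (cb - 128) - 0.813 * (cr - 128));
    *b = clamp8(1.164 * (y - 16) + 2.017 * (cb - 128));
}

int main(void) {
    int y, cb, cr, r, g, b;
    rgb_to_ycbcr(255, 0, 0, &y, &cb, &cr);   /* pure red */
    ycbcr_to_rgb(y, cb, cr, &r, &g, &b);
    printf("red -> Y=%d Cb=%d Cr=%d -> back to R=%d G=%d B=%d\n", y, cb, cr, r, g, b);
    return 0;
}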



There are two advantages to YUV. First, a color YUV image can easily become a black-and-white one: if only the Y component is kept and U and V are discarded, the result is a grayscale image, which made YUV compatible with older black-and-white televisions.

Second, the total size of YUV data is smaller than RGB, because YUV can keep the luminance signal at full resolution while reducing the chrominance signal to shrink the volume.

YUV sampling formats

To save bandwidth, most YUV formats use fewer than 24 bits per pixel on average.

The main sampling formats are YUV 4:2:0 (the most widely used), YUV 4:2:1, YUV 4:2:2, and YUV 4:4:4. 4:2:0 means that every 4 pixels share 4 luma samples and 2 chroma samples (YYYYCbCr); 4:2:2 means 4 luma samples and 4 chroma samples per 4 pixels (YYYYCbCrCbCr); 4:4:4 means a full sample grid (YYYYCbCrCbCrCbCrCbCr).

4:4:4: full chroma sampling, the same size as RGB.

4:2:2: 2:1 horizontal subsampling with full vertical sampling; one third smaller than RGB.

4:2:0: 2:1 horizontal subsampling and 2:1 vertical subsampling, so for every two rows of Y only half a row of U and half a row of V are stored; half the size of RGB.

YUV storage format

In planar storage, the Y information for the whole image is stored first, then all the U information, then all the V information, but in different proportions: for every two rows of Y, only half a row of U and half a row of V are stored, so the memory used is half that of RGB. Storing Y first and UV afterwards also helps compatibility: for black-and-white display, only the Y plane needs to be read. The sketch below shows the resulting plane layout.
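
A small sketch of the planar YUV420P layout described above (the Y plane is width × height bytes; U and V are each (width/2) × (height/2), so one frame is width × height × 3/2 bytes; even dimensions are assumed):

#include <stdio.h>

int main(void) {
    int width = 1920, height = 1080;                       /* assumed even dimensions */

    size_t y_size = (size_t)width * height;                /* full-resolution luma    */
    size_t c_size = (size_t)(width / 2) * (height / 2);    /* quarter-size chroma     */
    size_t frame  = y_size + 2 * c_size;                   /* = width * height * 3/2  */

    /* In a contiguous YUV420P buffer the planes are stored back to back: */
    /* [ Y plane ][ U plane ][ V plane ]                                  */
    size_t u_offset = y_size;
    size_t v_offset = y_size + c_size;

    printf("frame size : %zu bytes (RGB24 would be %zu)\n", frame, y_size * 3);
    printf("U plane at offset %zu, V plane at offset %zu\n", u_offset, v_offset);
    return 0;
}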

Also note the different packing layouts (how the Y, U, and V samples are arranged in memory):

Picture quality

Picture quality is made up of indicators such as clarity, sharpness, resolution, color purity, and color balance.

Clarity: how clearly the detail, texture, and edges of the image are rendered.

Sharpness: how crisp the image plane and the edges within the image appear.

Resolution: the number of pixels; the higher the resolution, the finer the detail that can be resolved.

Color purity: the vividness of the colors. All colors are mixed from the three primary colors; the primaries themselves have the highest purity, and color purity refers to the proportion of a primary color within a given color.

Color balance: controls the distribution of colors in the image so that the image as a whole is balanced in color.

Color gamut and HDR

Color gamut: the range of colors that a particular color display mode can express. The larger the gamut, the more colors can be reproduced.

HDR: High Dynamic Range. Compared with ordinary images it provides a larger dynamic range and more image detail, better reflecting the visual experience of the real environment. After normalization, color values usually lie in [0, 1]; HDR can express values beyond 1 and covers a wider color range.

Rotation angle

Rotation angle: the orientation in which the video's YUV data is stored. The usual rotation angle is 0°, corresponding to landscape display; video shot in portrait with the rear camera has a rotation angle of 90°, corresponding to portrait display. On Android, the rotation angle can be read with MediaMetadataRetriever.

Duration

The time needed to play back all the frames of a video is its duration. Formula: duration (s) = number of frames × duration of each frame = number of frames × (1 / frame rate). For a video with 1000 frames at 25 fps, the duration is 40 s.

Video composition

A complete video file consists of two parts, audio and video, which are organized by an encapsulation format and encoding formats. The AVI, RMVB, MKV, WMV, MP4, 3GP, FLV and other "formats" we see on the surface are really just packaging standards, i.e. shells.

Inside the shell is the core layer: the encoded streams. Once the encoded streams are encapsulated, they become the .mp4, .avi and other videos we see. H.264, MPEG-4 and so on are video encoding formats, while MP3, AAC and so on are audio encoding formats.

For example, encapsulating an H.264 video stream and an MP3 audio stream according to the AVI packaging standard produces a file with the .avi extension, the familiar AVI video file.

Some more advanced containers can hold multiple video and audio streams and even subtitles at the same time. MKV, for example, can include audio tracks and subtitles in several languages in a single file, to suit different audiences.

Encapsulation format

The video formats we commonly see, such as mp4, avi, and mkv, are technically "video encapsulation formats", or video formats for short. They contain the video data, audio data, and related configuration information needed for a video file (for example, how the audio and video relate and how to decode them). The container is just a shell. Common encapsulation formats include mp4, mkv, webm, avi, 3gp, mov, wmv, flv, mpeg, asf, and rmvb.

(1) An encapsulation format (also called a container) places the encoded and compressed video track and audio track into one file according to a defined layout; it is only a shell, and can be thought of as a folder holding the video and audio tracks. (2) In everyday terms, the video track is the rice and the audio track is the dish; the encapsulation format is the bowl or pot that holds the meal. (3) Encapsulation formats are tied to patents and to the profits of the companies that promote them. (4) With an encapsulation format, subtitles, dubbing, audio, and video can be combined in one file.

Example of packaging in MKV format:

Several common video encapsulation formats are described below:

  • 1. AVI format, file extension .avi, full name Audio Video Interleaved, launched by Microsoft in 1992. Its advantages are good image quality; lossless AVI can preserve the alpha channel.

    Features: good compatibility, cross-platform support, constant frame rate, large file size, poor fault tolerance, not a streaming format, somewhat outdated.

  • 2. DV-AVI format, file extension .avi, full name Digital Video Format, a consumer digital video format jointly proposed by Sony, Panasonic, JVC and other manufacturers. Common digital camcorders use this format to record video. Video can be transferred to a computer through the IEEE 1394 port, and edited video on the computer can be recorded back to the camcorder.

  • 3. WMV format, file extensions .wmv and .asf, full name Windows Media Video, a compression format launched by Microsoft with its own codec that allows video to be watched over the Internet in real time. At the same video quality, WMV files can be played while downloading, which makes the format well suited to online playback and transmission.

  • 4. MPEG format, file extensions .mpg, .mpeg, .mpe, .dat, .vob, .asf, .3gp, .mp4, etc., named after the Moving Picture Experts Group, formed in 1988, which is responsible for video and audio standards; its members are technical experts in video, audio and systems. The MPEG family currently includes three widely used compression standards: MPEG-1, MPEG-2, and MPEG-4. MPEG-4 is the most widely used today; it is designed for streaming high-quality video, aiming for the best image quality with the least data.

  • 5. Matroska format, file extension .mkv. Matroska is a newer encapsulation format that can pack video in a variety of encodings, more than 16 audio tracks in different formats, and subtitle streams in different languages into a single Matroska media file.

    Features: supports multiple audio tracks, soft subtitles, and streaming; very compatible; can hold an unlimited number of video, audio, picture, or subtitle tracks in one file; video with almost any codec can be put into MKV.

  • 6. Real Video format, file extensions .rm and .rmvb, an audio/video compression specification formulated by RealNetworks, called RealMedia. Users can use RealPlayer with different compression ratios for different network speeds, enabling real-time transmission and playback of video over low-bandwidth networks.

  • 7. QuickTime File Format, file extension .mov, a video format developed by Apple whose default player is Apple's QuickTime. This format offers a high compression ratio and excellent video clarity, and can preserve the alpha channel.

  • 8. Flash Video format, file extension .flv, a web video format that grew out of Adobe Flash. This format has been adopted by many video sites.

Video encoding

The raw audio and video signals that are captured are extremely large, and contain a great deal of information that the eye cannot see or the ear cannot hear. For example, if video is not compressed, its volume is usually enormous: a movie could require hundreds of gigabytes of space.

More precisely, video encoding is the compression algorithm applied to the video in a file. Its main function is to compress video pixel data (RGB, YUV, etc.) into a video bit stream, thereby reducing the amount of video data.

Encoding format

The encoding format describes how the video stream inside the container is compressed: it compresses video pixel data (RGB, YUV, etc.) into a video bit stream to reduce the amount of data. Without compression the video would be very large.

What does video compression mainly compress?

  • Spatial redundancy: adjacent pixels within an image are strongly correlated.
  • Temporal redundancy: adjacent images in a video sequence have similar content.
  • Coding redundancy: different pixel values occur with different probabilities.
  • Visual redundancy: the human visual system is insensitive to certain details.
  • Knowledge redundancy: regular structures can be inferred from prior and background knowledge.

Common video codecs include the H.26x family (H.264, H.265, etc.) and the MPEG family. Note that different encapsulation formats may use the same codec; encapsulation formats are just different vendors' packaging. It is as if several ice-cream makers produced the same flavor of ice cream with different wrappers.

(1) The H.26x series

The H.26x series is led by the ITU Telecommunication Standardization Sector (ITU-T) and includes H.261, H.262, H.263, H.264, and H.265.

  • H.261, mainly used in older video conferencing and video telephony systems. It was the first practical digital video compression standard; essentially all subsequent standard video codecs are based on its design.
  • H.262, equivalent to MPEG-2 Part 2, is used on DVD, SVCD and in most digital video broadcasting and cable distribution systems.
  • H.263, mainly used in video conferencing, video telephony, and web video products. For progressive video sources, H.263 is a large improvement over earlier coding standards; at low bit rates in particular it can save a great deal of bandwidth while maintaining quality.
  • H.264, equivalent to MPEG-4 Part 10 and also known as Advanced Video Coding (AVC), is a widely used high-precision video recording, compression, and distribution format. The standard introduced a series of new techniques that greatly improve compression performance, clearly surpassing previous standards at both high and low bit rates.
  • H.265, known as High Efficiency Video Coding (HEVC), is a video compression standard and the successor of H.264. HEVC is designed not only to improve image quality but also to double the compression ratio of H.264 (equivalently, to halve the bit rate at the same picture quality). It supports 4K and even ultra-high-definition television, with resolutions up to 8192×4320 (8K), and is the current direction of development.

(2) The MPEG series

The MPEG series was developed by the Moving Picture Experts Group (MPEG) under the International Organization for Standardization (ISO).

  • MPEG-1 Part 2 is mainly used on VCD; some online videos also use this format. Its quality is roughly equivalent to that of VHS tape.
  • MPEG-2 Part 2, equivalent to H.262, is used on DVD, SVCD and in most digital video broadcasting and cable distribution systems.
  • MPEG-4 Part 2 can be used for network transmission, broadcasting, and media storage. Its compression performance improves on MPEG-2 Part 2 and the first version of H.263.
  • MPEG-4 Part 10, equivalent to H.264, is a standard produced jointly by the two standards organizations.

Video decoding

With encoding there must also be decoding. Compressed (encoded) content cannot be used directly; to watch it, it must be decompressed and restored to the original signal (for example, the color of each point in the picture). This is "decoding", or decompression.

Video decoding turns a binary stream encoded with some codec (e.g. H.264) back into YUV pictures, i.e. "H.264 -> YUV". The most widely used tool is FFmpeg, an open-source codec suite that covers the common codecs and encapsulation (container) formats.


Origin blog.csdn.net/qq_38056514/article/details/129848139