Multimedia knowledge

1 Introduction to a standard multimedia system

Multimedia technology covers a wide range of areas, involves many platforms, and has many commercial products, but the core technology is roughly the same. The basic block diagram is as follows:

Data input system: This part is relatively simple. It reads data from a storage or network source (USB drive, SD card, flash, network, and so on) and passes it through the system's file-handling layer to the Demux system. On a platform with an operating system, such as Linux, WinCE, or Android, this is generally done with the usual file functions such as fopen and fread. On a non-OS system, or for network playback, you need to implement the corresponding data access functions yourself.
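
As a rough illustration of the data input stage, the sketch below reads a source in fixed-size chunks and hands each chunk to the next stage. The function name read_chunks and the chunk size are assumptions made for this example, and an in-memory buffer stands in for a real file. A non-OS or network source would supply its own object with a compatible read method.

```python
import io
from typing import BinaryIO, Iterator


def read_chunks(source: BinaryIO, chunk_size: int = 4096) -> Iterator[bytes]:
    """Yield successive chunks from a file-like source until it is exhausted."""
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # empty bytes means end of stream
            break
        yield chunk


# An in-memory buffer stands in for a file opened with open(path, "rb").
demo_source = io.BytesIO(b"\x47" * 10000)
chunk_sizes = [len(chunk) for chunk in read_chunks(demo_source, chunk_size=4096)]
print(chunk_sizes)  # [4096, 4096, 1808]
```

Because the reader depends only on a read method, the Demux stage does not need to know whether the bytes come from a file, flash, or a socket.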

Demux system: This part parses the input data into audio, video, and subtitle elementary streams (ES), then sends the video ES and audio ES to the video decoder and audio decoder respectively. The data delivered by the data input system is generally wrapped in a container such as TS, FLV, or MP4. A container holds not only compressed data but also pts values, subtitles, and other information, whereas decoders generally accept only pure compressed data, so demultiplexing (Demux) is required. The Demux system is one of the core technologies of a multimedia system and is part of the core source code of many player vendors. Because it has to support many file formats, such as TS, MPEG, FLV, ASF, and WMV, this part of the code is relatively large and needs to be read alongside the specification of each format.
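
To make the demultiplexing idea concrete, here is a minimal sketch for the TS container only. It relies on the standard TS packet layout (188-byte packets, sync byte 0x47, 13-bit PID, optional adaptation field) and simply groups packet payloads by PID. The names demux_by_pid and make_packet are hypothetical, the PID values are invented, and the test packets are padded in a simplified way. A real demuxer would also parse the PAT and PMT tables and the PES headers, which this sketch leaves out.

```python
from collections import defaultdict
from typing import Dict

TS_PACKET_SIZE = 188
SYNC_BYTE = 0x47


def demux_by_pid(data: bytes) -> Dict[int, bytes]:
    """Split a buffer of TS packets into one payload byte string per PID."""
    streams = defaultdict(bytearray)
    for offset in range(0, len(data) - TS_PACKET_SIZE + 1, TS_PACKET_SIZE):
        packet = data[offset:offset + TS_PACKET_SIZE]
        if packet[0] != SYNC_BYTE:
            continue  # a real demuxer would try to resynchronize here
        pid = ((packet[1] & 0x1F) << 8) | packet[2]
        adaptation_field_control = (packet[3] >> 4) & 0x03
        payload_start = 4
        if adaptation_field_control in (2, 3):
            # Skip the length byte and the adaptation field itself.
            payload_start += 1 + packet[4]
        if adaptation_field_control in (1, 3):
            streams[pid] += packet[payload_start:]
    return {pid: bytes(buf) for pid, buf in streams.items()}


def make_packet(pid: int, payload: bytes) -> bytes:
    """Build a payload-only test packet (hypothetical helper, pads with 0xFF)."""
    header = bytes([SYNC_BYTE, (pid >> 8) & 0x1F, pid & 0xFF, 0x10])
    return header + payload.ljust(TS_PACKET_SIZE - 4, b"\xff")


# Invented PIDs: 0x100 for video and 0x101 for audio.
ts_data = (
    make_packet(0x100, b"V1")
    + make_packet(0x101, b"A1")
    + make_packet(0x100, b"V2")
)
streams = demux_by_pid(ts_data)
print({hex(pid): len(buf) for pid, buf in streams.items()})
```

Each resulting byte string would then go to the matching decoder once the PES packet headers have been stripped.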

Elementary Stream (ES): the basic bitstream output by the encoder (compressor), carrying a single video or audio signal. It contains only the information the decoder needs to reconstruct something close to the original image or audio. MPEG strictly defines the syntax of the compressed signal so that any conforming decoder can decode it. MPEG does not define the encoder itself; it only requires that the encoder produce a syntactically correct bitstream.

Decode system: This part is divided into a video decoder and an audio decoder. The video decoder decodes the video ES sent by the Demux module and outputs YUV or RGB data in a format supported by the video output system. The audio decoder decodes the audio ES sent by the Demux module and outputs PCM data supported by the audio output system. Although there are many kinds of applications at the upper layer and many stream formats, what finally reaches this layer is an ES in a standard compression format (such as H264, H263, MPEG4, or WMV1) together with its DTS. The DTS mainly determines when each unit is decoded. Because decoding is computationally heavy and involves complex mathematical transforms, it is generally handled by dedicated hardware or a DSP on embedded platforms. PC CPUs are powerful enough that PC playback mostly uses software decoding.
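
The role of the DTS is easiest to see in a stream where decode order differs from display order, as happens when some frames depend on a later reference frame. The sketch below is only an illustration: the Frame type, the frame labels, and the timestamp values are all invented for this example.

```python
from typing import NamedTuple


class Frame(NamedTuple):
    name: str
    dts: int  # decode timestamp: when the decoder should process the frame
    pts: int  # presentation timestamp: when the frame should be shown


# Illustrative values: P3 must be decoded before B1 and B2, which reference it.
frames_in_stream = [
    Frame("I0", dts=0, pts=1),
    Frame("P3", dts=1, pts=4),
    Frame("B1", dts=2, pts=2),
    Frame("B2", dts=3, pts=3),
]

decode_order = [f.name for f in sorted(frames_in_stream, key=lambda f: f.dts)]
display_order = [f.name for f in sorted(frames_in_stream, key=lambda f: f.pts)]

print(decode_order)   # ['I0', 'P3', 'B1', 'B2']
print(display_order)  # ['I0', 'B1', 'B2', 'P3']
```

The decoder consumes frames in DTS order, while the synchronization stage later releases them in PTS order.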

AV synchronization system: After decoding, the audio and video data are synchronized and then output to the screen and speakers. The reference for synchronization is the pts of each stream. People are much more sensitive to flaws in sound than in images: the ear notices even slight loss or corruption of audio data, while a few dropped video frames may go unseen. For example, non-professionals find it hard to tell 20 FPS video from 25 FPS video. The basic principle is therefore to treat audio as the master and decide how to handle video according to the audio pts. If the video pts is ahead of the audio, the video frame is held (or played slowly) until the audio catches up. Conversely, if the video pts is behind the audio, video frames are dropped (or played fast) to catch up with the audio. If there is no audio data, the video is decoded according to the frame rate and sent directly to the display system.

This part is also one of the core technologies of a multimedia system, and it is the place where errors occur most easily. Different players may differ in the details of the mechanism, but the basic strategy is the same.
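
The basic strategy can be sketched as a per-frame decision with audio as the master clock: a video frame whose pts is later than the audio clock is held, and one whose pts is earlier is dropped. The function name sync_action, the millisecond units, and the 40 ms tolerance (one frame period at 25 fps) are assumptions for illustration and are not taken from any particular player.

```python
def sync_action(video_pts_ms: int, audio_clock_ms: int, threshold_ms: int = 40) -> str:
    """Decide what to do with a decoded video frame, using audio as the master clock."""
    diff = video_pts_ms - audio_clock_ms
    if diff > threshold_ms:
        return "wait"    # video is ahead: hold the frame until audio catches up
    if diff < -threshold_ms:
        return "drop"    # video is late: discard the frame to catch up
    return "render"      # close enough: show the frame now


# The audio clock is at 1000 ms; check three candidate video frames.
actions = [sync_action(pts, audio_clock_ms=1000) for pts in (880, 1010, 1200)]
print(actions)  # ['drop', 'render', 'wait']
```

Real players refine this with slow-play and fast-play, smoothing of the clock difference, and a fallback to frame-rate pacing when no audio is present.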

Output system: This part is divided into a video output system and an audio output system. The video output system sends YUV, RGB, or other raw data to the chip's display system; some chips provide hardware acceleration at the bottom layer. Its job is relatively simple, namely to get data onto the screen, but how it is implemented depends heavily on the specific chip solution. The audio output system sends PCM data to the audio HAL layer for processing, and the result is finally output to the speaker.

2 Common multimedia concepts and terminology

ES (Elementary Stream): a continuous stream of video, audio, or data.
PES (Packetized Elementary Stream): formed by splitting an ES into packets of varying length as needed and adding a header to each packet.
TS (Transport Stream): composed of fixed-length 188-byte packets. It contains one or more programs, each with an independent time base, and a program can contain multiple ES carrying video, audio, and text information. Each ES is tagged with a different PID. So that these ES can be located, TS periodically sends program and stream information tables: the PAT, carried on the fixed PID 0, and the PMT, carried on the PIDs that the PAT lists. (In the MPEG-2 system, the standard stream for actual transmission, produced by multiplexing video ES, audio ES, and auxiliary data, is called the MPEG-2 transport stream.)
Container (packaging format): bundles video, audio, and subtitle streams together and establishes ordering and indexing according to defined rules, which makes it convenient for players to seek and play. Examples include AVI, TS, MKV, and MP4.
DTS (Decoding Timestamp) and PTS (Presentation Timestamp): the times, relative to the SCR (System Clock Reference), at which the decoder decodes a frame and displays it, respectively. The SCR can be understood as the time at which the decoder should start reading data from the disk.
Bit rate: the amount of data a video or audio file uses per unit of time, usually expressed in kbps (kilobits per second). As a rough guide, 2000 to 3000 kbps is often enough for good picture quality, although the rate actually needed depends on the resolution, codec, and content. The bit rate directly determines the final size of the video file.
Constant bit rate (CBR): the output bit rate of the encoder (or the input bit rate of the decoder) is fixed. CBR is poorly suited to high-definition video, because complex, fast-changing content gets too few bits (which degrades quality) while simple content wastes bits.
Variable bit rate (VBR): the output bit rate of the encoder (or the input bit rate of the decoder) adapts to the complexity of the encoder's input signal. The aim is to keep output quality constant rather than the bit rate. VBR encoding takes more computing time but makes better use of limited storage: more bits go to high-complexity segments and fewer to low-complexity segments. In short, VBR is a sensible choice when high-quality video at a small file size is required.
Average bit rate (ABR): the average bit rate of the audio or video, which can simply be taken as the file size divided by the playback time. In audio encoding it behaves much like CBR, encoding at the configured target bit rate, but the encoder may exceed the target where it judges this appropriate in order to preserve quality.
Frame rate: the number of frames displayed per second, measured in frames per second (FPS). Film is generally shot at 24 fps, while PAL and NTSC video run at 25 fps and 29.97 fps respectively. Content that demands very smooth motion, such as first-person shooter games, needs more than 30 fps, and rates above 60 fps bring little visible benefit on a typical 60 Hz display.
Resolution: the width and height of the video in pixels (px). The width-to-height ratio of the resolution should normally match the display aspect ratio; otherwise the video is shown with black borders. Standard 1080p has a resolution of 1920×1080 with progressive scan. The term does not fix a frame rate, though 1080p at 60 fps is common. The most common 1080p high-definition video distributed on the Internet usually has a frame rate of 23.976 fps.
Sampling rate: the number of samples taken per second from a continuous signal to form a discrete signal, expressed in hertz (Hz). A standard music CD uses 44100 Hz, so that rate is sufficient for audio in video encoding, and video converters usually use it as the default.
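
The relationships among bit rate, playback time, file size, and sampling rate in the terms above come down to simple arithmetic. The sketch below works through them; the function names are made up for this example, 1 kbps is taken as 1000 bits per second, and the 16-bit stereo figures are the standard audio CD parameters.

```python
def file_size_bytes(bitrate_kbps: float, duration_s: float) -> float:
    """Approximate stream size: bits per second times seconds, divided by 8."""
    return bitrate_kbps * 1000 * duration_s / 8


def average_bitrate_kbps(size_bytes: float, duration_s: float) -> float:
    """Average bit rate: total bits divided by playback time."""
    return size_bytes * 8 / duration_s / 1000


def pcm_bitrate_kbps(sample_rate_hz: int, bits_per_sample: int, channels: int) -> float:
    """Bit rate of uncompressed PCM audio."""
    return sample_rate_hz * bits_per_sample * channels / 1000


one_minute_video = file_size_bytes(bitrate_kbps=2500, duration_s=60)
print(one_minute_video)                            # 18750000.0 bytes
print(average_bitrate_kbps(one_minute_video, 60))  # 2500.0 kbps
print(pcm_bitrate_kbps(44100, 16, 2))              # 1411.2 kbps
```

Container overhead is ignored here, so a real file is slightly larger than the estimate.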