Android Audio and Video Development (1): Audio and Video Basics

Foreword

Recently I have had some free time and wanted to learn something new. Given how popular audio and video apps are right now, I decided to study audio and video development on the Android platform. I am a novice in this area, so I want to record my step-by-step learning process in the form of a blog, and I hope it can also serve as a reference for other beginners.

Study Plan

1. Understand the basics of audio and video

2. Understand the implementation and use of SurfaceView and TextureView

3. Audio capture and playback on the Android platform (AudioRecord and related APIs)

4. Video capture and playback on the Android platform (Camera and related APIs)

5. Learn the Android MediaExtractor and MediaMuxer APIs

6. Learn the MediaCodec API

7. Understand OpenGL ES and learn to draw graphics with OpenGL

8. Study OpenGL further to understand how to implement video cropping, rotation, watermarks, filters, etc.

9. Learn the use of GLSurfaceView

10. Learn to use the third-party library ffmpeg

11. Understand RTMP and RTSP, and learn to use the third-party library librtmp

That is roughly the whole study plan; if I dig deeper into other topics along the way, I will add them. Finally, I will use the technologies above to build a simple audio and video app. This will be a long process, and I hope I can stick with it. Let's encourage each other!

Now, let's officially get started:

Video Basics

1. What is video?

Simply put, a video can be regarded as a rapid succession of pictures that the human eye perceives as continuous motion. Early film is an obvious example: the images recorded on each frame of film are switched quickly, producing the effect of video.

2. Frame

A frame is a single image, the smallest unit of a video or animation, equivalent to a single frame on a strip of film. Each image is one frame, and a video is made up of many frames.

3. Frame rate

Frame rate is the frequency (rate) at which frames appear in succession on a display. Anyone who plays games will be familiar with it. It is usually measured in FPS, the number of frames updated per second (frames/second). A higher frame rate produces smoother, more realistic motion. Generally speaking, 30 fps is acceptable, and raising it to 60 fps noticeably improves the sense of interactivity and realism, but beyond roughly 75 fps the human eye can hardly notice any further improvement in smoothness.

4. Color space

RGB: a color standard that produces a wide range of colors by varying the three color channels red (R), green (G), and blue (B) and superimposing them on one another. Every color on a screen is formed by mixing red, green, and blue in different proportions; these three are called the three primary colors.

YUV: YUV is the color encoding method adopted by European television systems. In a modern color TV system, a camera captures the image, and the resulting color signal is separated, amplified, and corrected to obtain RGB. A matrix conversion circuit then derives the luminance signal Y and two color-difference signals B-Y (i.e. U) and R-Y (i.e. V). Finally, the transmitter encodes the luminance and the two color-difference signals separately and sends them over the same channel; this is how a TV signal is transmitted. This way of representing color is the so-called YUV color space. The key property of the YUV color space is that the luminance signal Y is separated from the chrominance signals U and V. "Y" represents luminance, i.e. the grayscale value, while "U" and "V" represent chrominance, describing the color and saturation of the image and specifying the color of each pixel.

Advantages of using YUV:

1. Converting a color YUV image to a black-and-white image is very simple (the Y plane alone is a grayscale image), a property that is exploited in TV signals.

2. YUV data can be stored more compactly than RGB (the chrominance components can be subsampled), which helps reduce video size.

Conversion method between RGB and YUV:

Y = 0.299R + 0.587G + 0.114B
U = -0.147R - 0.289G + 0.436B
V = 0.615R - 0.515G - 0.100B

R = Y + 1.14V
G = Y - 0.39U - 0.58V
B = Y + 2.03U
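
As a quick illustration of the formulas above, here is a small Kotlin sketch (purely illustrative, not an optimized conversion routine) that converts a single pixel from RGB to YUV and back using these coefficients:

```kotlin
// Convert one RGB pixel (each component in 0..255) to analog-style YUV
// using the coefficients listed above. Real pipelines work on whole planes
// and usually use integer approximations and chroma subsampling.
fun rgbToYuv(r: Int, g: Int, b: Int): Triple<Double, Double, Double> {
    val y = 0.299 * r + 0.587 * g + 0.114 * b
    val u = -0.147 * r - 0.289 * g + 0.436 * b
    val v = 0.615 * r - 0.515 * g - 0.100 * b
    return Triple(y, u, v)
}

fun yuvToRgb(y: Double, u: Double, v: Double): Triple<Int, Int, Int> {
    val r = (y + 1.14 * v).toInt().coerceIn(0, 255)
    val g = (y - 0.39 * u - 0.58 * v).toInt().coerceIn(0, 255)
    val b = (y + 2.03 * u).toInt().coerceIn(0, 255)
    return Triple(r, g, b)
}

fun main() {
    val (y, u, v) = rgbToYuv(255, 128, 0)    // an orange pixel
    println("Y=$y U=$u V=$v")
    println(yuvToRgb(y, u, v))               // round-trips to roughly (255, 128, 0)
}
```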

Audio Basics

1. What is audio?

Audio here refers to the medium that stores sound content. Sound picked up through an audio cable or a microphone starts out as a series of analog signals. In the era of tape recording, sound was captured and recorded onto the tape medium by physical means; that whole process was analog, and the sound suffered from distortion. In the digital age, sound is converted into a digital signal and stored on a storage medium. Analog signals are what we actually hear, while a digital signal records the sound as a stream of binary digits (1s and 0s), which makes lossless preservation of the sound possible.

The most critical step in digital recording is converting the analog signal into a digital signal. This is where Pulse Code Modulation (PCM) comes in, a mechanism for representing a sampled analog signal as digital data; see Wikipedia for details.

The working process of PCM is as follows:

Analog signal -> Sampling -> Quantization -> Encoding -> Digital signal

2. Sampling rate and bit depth

Sampling is the process of converting an analog audio signal into a digital one by measuring the signal at regular intervals. Each sample is assigned a number representing the amplitude of the audio signal at the instant of sampling.

The sampling frequency (sample rate) is the number of times per second that the recording device samples the sound signal. According to the Nyquist sampling theorem, to reconstruct the analog signal without distortion, the sampling frequency must be at least twice the highest frequency present in the signal's spectrum. In other words, the sample rate we choose determines the highest frequency we can faithfully capture.

The highest frequency the human ear can hear is about 20 kHz, so to cover the full range of human hearing the sample rate needs to be at least 40 kHz; 44.1 kHz is the usual choice, with 48 kHz commonly used for higher quality.

The bit depth (also called the number of sampling bits or sample size) measures the resolution of each sample's amplitude; it is the number of binary digits the sound card uses per sample when recording and playing back sound. Together, the bit depth and the sample rate determine the quality of the captured sound.

A digital signal is discrete, so after quantization the analog amplitude can only be approximated by an integer value. To record these amplitude values, the sampler uses a fixed number of bits per sample, usually 8, 16, or 32. 8 bits can represent 2^8 = 256 levels, 16 bits 2^16 = 65,536 levels, and 32 bits 2^32 ≈ 4.29 billion levels; the more bits, the finer the amplitude resolution and the better the potential sound quality.
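
To make sampling and quantization concrete, here is a small Kotlin sketch that generates one second of a 440 Hz sine tone, sampled at 44.1 kHz and quantized to signed 16-bit PCM. The function name and the default values are purely illustrative:

```kotlin
import kotlin.math.PI
import kotlin.math.sin

// Sampling: take the value of the analog waveform at regular time steps.
// Quantization: map each value (-1.0..1.0) onto a signed 16-bit integer.
fun generateSinePcm(
    frequencyHz: Double = 440.0,
    sampleRate: Int = 44_100,
    durationSeconds: Double = 1.0
): ShortArray {
    val totalSamples = (sampleRate * durationSeconds).toInt()
    val samples = ShortArray(totalSamples)
    for (i in 0 until totalSamples) {
        val t = i.toDouble() / sampleRate                 // time of the i-th sample
        val amplitude = sin(2 * PI * frequencyHz * t)     // ideal analog value
        samples[i] = (amplitude * Short.MAX_VALUE).toInt().toShort()  // 16-bit quantization
    }
    return samples
}

fun main() {
    val pcm = generateSinePcm()
    println("Generated ${pcm.size} samples; first few: ${pcm.take(5)}")
}
```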

3. Channel

Channels are the mutually independent audio signals captured or played back at different spatial positions during recording or playback, so the number of channels is the number of sound sources during recording, or the corresponding number of speakers during playback. What we usually call stereo has two channels, and some more advanced setups have four or more.
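
To see how sample rate, bit depth, and channel count come together in practice on Android, here is a minimal sketch of configuring AudioRecord (from item 3 of the study plan) for 44.1 kHz, 16-bit, mono capture. It assumes the RECORD_AUDIO permission has been granted, and error handling is omitted:

```kotlin
import android.media.AudioFormat
import android.media.AudioRecord
import android.media.MediaRecorder

// Illustrative constants: 44.1 kHz sample rate, one channel, 16-bit samples.
const val SAMPLE_RATE = 44_100
val CHANNEL_CONFIG = AudioFormat.CHANNEL_IN_MONO
val AUDIO_FORMAT = AudioFormat.ENCODING_PCM_16BIT

fun createRecorder(): AudioRecord {
    // Minimum internal buffer the platform requires for this configuration.
    val minBufferSize = AudioRecord.getMinBufferSize(SAMPLE_RATE, CHANNEL_CONFIG, AUDIO_FORMAT)
    return AudioRecord(
        MediaRecorder.AudioSource.MIC,   // capture from the microphone
        SAMPLE_RATE,
        CHANNEL_CONFIG,
        AUDIO_FORMAT,
        minBufferSize * 2                // a little headroom over the minimum
    )
}
```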

4. Bit rate

Bit rate refers to the number of bits transmitted per second, measured in bps (bits per second) and commonly expressed in kbps (1000 bits per second). For audio, it is the amount of binary data per unit of time after the analog sound signal has been converted into a digital one, and it is an indirect indicator of audio quality. A higher bit rate means a larger file that takes up more storage. The most common bit rate for music files is 128 kbps, and MP3 files generally range from 8 to 320 kbps.

For uncompressed PCM audio:

Bit rate (kbps) = sample rate (kHz) × bit depth (bits per sample) × number of channels (usually 2)
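
For example, uncompressed CD-quality PCM works out to 44.1 kHz × 16 bits × 2 channels = 1411.2 kbps, or roughly 176 KB of audio data per second, which is why compressed formats such as 128 kbps MP3 or AAC are so much smaller.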

Video Encoding

1. What is video encoding?

Video encoding is the process of converting a video from its original format into another format through compression. From the point of view of information theory, data = information + redundancy. Video signals contain a great deal of redundant data, and the essence of video coding is to reduce that redundancy. We know that a video is composed of frames, but in practice the frames are not stored as raw data one by one; they are compressed and encoded first. Video coding effectively reduces the size of a video, making it easier to transmit and store. When compressed video and audio are combined, we get the familiar formats such as AVI, MP4, RMVB, and MOV; these are called container (packaging) formats.

2. Video encoding format

The most important codec standards in video streaming include the ITU's H.261, H.263, and H.264, Motion JPEG (M-JPEG), and the MPEG series of standards from the ISO Moving Picture Experts Group. The most mainstream of these is H.264. H.265 has since been released; it is a more efficient coding method with a higher compression ratio than the previous generation.

3. H.264 encoding

Because the H.264 standard is large and complex, in actual development the encoding is generally handled by the platform codec or a third-party framework, and developers rarely need to implement it themselves, so I won't introduce it in detail here. For more details, see Baidu Encyclopedia.

You can also refer to Getting started to understand H264 encoding
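
In practice on Android, H.264 encoding is usually delegated to MediaCodec, which in turn typically uses the device's hardware codec. The sketch below shows roughly what configuring an AVC encoder looks like; the resolution, bit rate, and frame rate are placeholder values, and real code still has to create an input surface (or queue input buffers) and drain the encoded output:

```kotlin
import android.media.MediaCodec
import android.media.MediaCodecInfo
import android.media.MediaFormat

// Minimal sketch: configure an H.264 (AVC) encoder. With COLOR_FormatSurface,
// frames are fed through an input Surface created after configure().
fun createAvcEncoder(width: Int = 1280, height: Int = 720): MediaCodec {
    val format = MediaFormat.createVideoFormat(MediaFormat.MIMETYPE_VIDEO_AVC, width, height).apply {
        setInteger(
            MediaFormat.KEY_COLOR_FORMAT,
            MediaCodecInfo.CodecCapabilities.COLOR_FormatSurface
        )
        setInteger(MediaFormat.KEY_BIT_RATE, 2_000_000)   // 2 Mbps
        setInteger(MediaFormat.KEY_FRAME_RATE, 30)        // 30 fps
        setInteger(MediaFormat.KEY_I_FRAME_INTERVAL, 1)   // one key frame per second
    }
    return MediaCodec.createEncoderByType(MediaFormat.MIMETYPE_VIDEO_AVC).apply {
        configure(format, null, null, MediaCodec.CONFIGURE_FLAG_ENCODE)
    }
}
```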

Audio Encoding

1. What is audio encoding?

By analogy with video encoding, audio encoding is naturally the process of compressing audio data.

2. Audio encoding format

PCM encoding: As we have already introduced, the biggest advantage of this encoding is the good sound quality, but the biggest disadvantage is the large size.

Like video encoding, audio also has many encoding formats, such as: WAV, MP3, WMA, APE, FLAC and so on.

I will focus on introducing AAC here. AAC is a new generation of audio lossy compression technology, an audio compression algorithm with a high compression ratio. The audio data in MP4 video is mostly in AAC compression format.

3. AAC encoding

AAC is a compression format designed specifically for audio data. Unlike MP3, it uses newer encoding algorithms that are more efficient and offer better "value for money": with AAC, the perceived sound quality is not noticeably reduced while the file takes up less storage space.

The AAC format is mainly divided into two types: ADIF and ADTS.

ADIF: Audio Data Interchange Format. Its defining characteristic is that the start of the audio data can be located unambiguously, and decoding cannot begin in the middle of the stream; it must start from the clearly defined beginning. This format is commonly used for files on disk.

ADTS: Audio Data Transport Stream. Its defining characteristic is that it is a bit stream with sync words, so decoding can start at any point in the stream. In this respect it is similar to the MP3 stream format.

Refer to AAC file parsing and decoding process
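
To make the ADTS format more concrete: in an ADTS stream, each AAC frame is preceded by a small header (7 bytes when there is no CRC) that begins with the 0xFFF sync word and records the profile, sample-rate index, channel configuration, and frame length. Below is a sketch of building such a header, assuming AAC-LC, 44.1 kHz, and 2 channels; frameLength is the total frame size including the header:

```kotlin
// Build the 7-byte ADTS header (no CRC) that precedes one AAC frame.
// Assumed parameters: AAC-LC (profile 2), 44.1 kHz (frequency index 4), 2 channels.
fun buildAdtsHeader(frameLength: Int): ByteArray {
    val profile = 2       // AAC-LC
    val freqIndex = 4     // 44.1 kHz
    val channels = 2
    val header = ByteArray(7)
    header[0] = 0xFF.toByte()                                                   // sync word, high bits
    header[1] = 0xF1.toByte()                                                   // sync word, MPEG-4, no CRC
    header[2] = (((profile - 1) shl 6) or (freqIndex shl 2) or (channels shr 2)).toByte()
    header[3] = (((channels and 3) shl 6) or (frameLength shr 11)).toByte()
    header[4] = ((frameLength and 0x7FF) shr 3).toByte()
    header[5] = (((frameLength and 7) shl 5) or 0x1F).toByte()                  // + buffer fullness bits
    header[6] = 0xFC.toByte()                                                   // buffer fullness + 1 raw block
    return header
}
```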

Hardware Acceleration

Hardware acceleration means using dedicated hardware modules in place of software algorithms so as to take full advantage of the inherent speed of the hardware; it is usually more efficient than the equivalent software algorithm.

Handing 2D and 3D graphics computation over to the GPU to relieve pressure on the CPU is one form of hardware acceleration.

Hard Decoding and Soft Decoding

1. Difference

Hard decoding corresponds to the hardware acceleration described above: dedicated hardware modules parse and decode the video and audio data. Soft decoding, in contrast, does the computation and parsing on the CPU.

Hard decoding hands work that would otherwise be processed by the CPU over to the GPU or a dedicated decoder. The GPU's parallel computing capability is far higher than the CPU's, so the CPU load and CPU usage drop significantly, leaving room to run other programs at the same time.

Soft decoding is more adaptable. It relies mainly on the CPU, so it works regardless of what hardware decoding support the device has, but it consumes more CPU time, which means worse performance and higher power consumption. Soft decoding is the better choice when the device has plenty of processing power to spare, or when its hardware decoding support is insufficient.

Most video players support both hardware and software decoding, and combining the two gives the best results.

2. Decoding on the Android platform

For hardware decoding on Android you can use MediaCodec directly (introduced in API 16). MediaPlayer also uses hardware decoding under the hood, but it is too tightly encapsulated and supports relatively few protocols.
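
As a minimal sketch of what hardware decoding with MediaCodec looks like (not a complete player; in real code the MediaFormat usually comes from MediaExtractor, and you then loop over the input and output buffers):

```kotlin
import android.media.MediaCodec
import android.media.MediaFormat
import android.view.Surface

// Create and start an H.264 decoder that renders into the given Surface.
// The width/height here are placeholders; a real app reads them from the stream.
fun createAvcDecoder(outputSurface: Surface, width: Int = 1280, height: Int = 720): MediaCodec {
    val format = MediaFormat.createVideoFormat(MediaFormat.MIMETYPE_VIDEO_AVC, width, height)
    return MediaCodec.createDecoderByType(MediaFormat.MIMETYPE_VIDEO_AVC).apply {
        configure(format, outputSurface, null, 0)   // no encoder flag: this is a decoder
        start()
    }
}
```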

For soft decoding there are many third-party frameworks, the best known being ffmpeg, which I will explore further in later posts.

Postscript: It is strongly recommended to read the official documentation: Google Audio and Video Introduction

Original article: blog.csdn.net/gs12software/article/details/104754429