Best Practices in Karaoke and Short Video Technology: Exploring the Audio and Video Technology of "Sing Bar"

Original content; please credit the source when reprinting.

1. Introduction

  With the development of mobile apps, almost every application now has some audio or video functionality. Broadly, this covers audio and video recording, playback, effects processing, and transmission.
  The popular short video applications are Douyin and Kuaishou; the karaoke applications are Sing Bar and National Karaoke (WeSing), plus NetEase's newer karaoke app Music Street (Yinjie); and for listening to music there are NetEase Cloud Music and QQ Music. Although each application operates differently, the technical implementations are similar. Short video software does richer video processing, with eye-catching filters and stickers, while karaoke software handles more of the details of the music, such as volume, sound effects, and pitch.
  The author is a long-time user of Sing Bar. Over its years of development, Sing Bar has come to cover a fairly comprehensive range of audio and video technology. Zhan Xiaokai, formerly the director of Sing Bar's audio and video technology, gave concrete implementations of mobile audio and video best practices in his book "Advanced Audio and Video Guide", so this article takes Sing Bar as a reference to explore mobile audio and video technology. (The code discussion is based on Android, but for low-level audio and video work the implementation ideas are roughly the same on every platform. The author has never worked at Sing Bar; if there is any infringement or other issue, please leave a message. If you would like to discuss mobile audio and video technology with the author, please leave a message or reach out on WeChat: smzh_james.)


Technology directory:
1. Audio recording (recording, audio decoding, audio resampling, audio mixing)
2. Audio playback and effects processing (volume, pitch, sound effects)
3. Video recording (camera operations, video encoding)
4. Video effects processing (beauty, filters, stickers)
5. Video playback (video decoding, video rendering, audio and video synchronization)
6. Audio and video muxing
(content links will be updated successively)

2. Comparison of Effects and Analysis of Technical Implementation

Referring to the interface of the Sing Bar app, I built a simple karaoke application in a black-and-white style that nevertheless covers most of the functionality; it is referred to below as SuperKtv. The effect comparison is shown in the figures below.

(1) Karaoke song list page

  Picture 1 shows SuperKtv; picture 2 shows Sing Bar.

  The implementation of SuperKtv here is straightforward: it reads the audio files already on the phone and shows them in a list, providing an entry point for the subsequent functions. There is nothing difficult about it, and any developer will understand it, so it is not covered further.

(2) Audio recording page

  The first two pictures are the audio recording pages of SuperKtv, and the third picture is the audio recording page of Sing Bar.

  • Lyrics
       Lyrics are very important to karaoke software. Professional karaoke software such as Sing Bar has a huge music library, and the lyrics are usually time-stamped so that they can scroll in sync with the singing, which also enables the scoring function. From a technical point of view, the lyrics and scoring data differ for every song; this does not involve audio and video technology as such, and the author does not have the energy to customize lyrics and scoring for each song.
       To bring SuperKtv closer to Sing Bar, I still found a way to do this work. The quickest solution for SuperKtv's lyrics is the network: use the powerful search function of the lyrics site https://www.90lrc.cn, query its search interface for the song, parse the returned HTML to get the result list, pick the most appropriate item (usually the first), take its lyrics link, fetch the lyrics page via that link, and filter the HTML to obtain the lyrics text. The display effect is shown in figure 1 (a hedged sketch of this HTML-scraping approach follows this item). One way to implement scoring is to match the pitch recognized dynamically while singing against the contents of a predefined file.
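
As an illustration of the scraping approach above, here is a minimal sketch using the Jsoup HTML parser. The search URL pattern and the CSS selectors are hypothetical placeholders (the real ones depend on the lyrics site's current markup), and URL-encoding and error handling are omitted.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LyricsFetcher {
    // Hypothetical search URL pattern; the real interface depends on the site.
    private static final String SEARCH_URL = "https://www.90lrc.cn/so/%s";

    public static String fetchLyrics(String songName) throws Exception {
        // 1. Search for the song and parse the returned HTML (URL-encoding omitted).
        Document searchPage = Jsoup.connect(String.format(SEARCH_URL, songName)).get();
        Element firstResult = searchPage.selectFirst("a.song-item");   // hypothetical selector
        if (firstResult == null) return null;

        // 2. Follow the first result's link to the lyrics page.
        Document lyricsPage = Jsoup.connect(firstResult.absUrl("href")).get();

        // 3. Extract the lyrics text from the lyrics container.
        Element lyricsNode = lyricsPage.selectFirst("div.lyric-content"); // hypothetical selector
        return lyricsNode == null ? null : lyricsNode.text();
    }
}
```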

  • Recording
       Recording is one of the core functions of karaoke software and demands high-performance, low-latency audio. On this point iOS does much better than Android and its functionality is more complete: AudioUnit can solve most problems. On Android, below 8.0 the recommendation is OpenSL ES, a streamlined implementation of OpenSL for embedded devices; on Android 8.0 and above the recommendation is AAudio, a lightweight recording interface born for low-latency audio processing. According to the official documentation its low-latency performance is good, and test data is given for several Pixel models. However, because there are so many Android models with hardware implementations of varying quality, neither is as compatible as AudioRecord. The Google Android team has wrapped OpenSL ES and AAudio into a library named Oboe, apparently with the ambition of catching up with AudioUnit; it provides a unified calling interface and is extremely convenient to use. Oboe can be found on GitHub at https://github.com/google/oboe, and the demos it provides are of great reference value, although according to feedback on GitHub there are still quite a few problems in practice, mainly noise and latency. I look forward to follow-up updates that finally solve this long-criticized weakness of Android. (A minimal AudioRecord-based sketch follows this item.)
       Another approach to low-latency recording is to measure the recording delay. If the delay can be obtained, the recorded sound can be shifted forward by that amount when mixing on the editing page, and the result is almost as good as having no delay at all. But this would not be Android if it were that easy: measuring the delay is not simple. A rough formula is recording delay = hardware delay + buffering delay, and if the recording callback does significant computation, that processing delay has to be added as well. So far only AAudio provides a method for obtaining the delay, and the embarrassing part is that the measured delay is 0. Even though devices on 8.0 and above already dominate the market, supporting only 8.0+ is clearly not enough for commercial software. Accurate hardware delay figures may only be obtainable from the manufacturers themselves, so cooperating with manufacturers is one possible solution. For Huawei phones, Huawei's AudioKit SDK provides the corresponding APIs to obtain the recording delay directly, and it also offers ear monitoring and seven sound effects, which is the best news I have heard so far.
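
For compatibility, a plain AudioRecord capture loop is often the baseline. The sketch below records 16-bit mono PCM at 44.1 kHz and also shows the rough buffer-delay estimate mentioned above (buffered frames / sample rate); hardware and driver delay is not observable here. This is a minimal sketch under those assumptions, not Sing Bar's implementation; OpenSL ES, AAudio, or Oboe would replace it where lower latency is required.

```java
import android.media.AudioFormat;
import android.media.AudioRecord;
import android.media.MediaRecorder;

public class PcmRecorder {
    private static final int SAMPLE_RATE = 44100;
    private static final int CHANNEL = AudioFormat.CHANNEL_IN_MONO;
    private static final int ENCODING = AudioFormat.ENCODING_PCM_16BIT;

    private volatile boolean recording;

    public void start(final java.io.OutputStream out) {
        final int minBuf = AudioRecord.getMinBufferSize(SAMPLE_RATE, CHANNEL, ENCODING);
        final AudioRecord recorder = new AudioRecord(
                MediaRecorder.AudioSource.MIC, SAMPLE_RATE, CHANNEL, ENCODING, minBuf * 2);

        // Rough buffer-delay estimate (API 23+): frames buffered / sample rate.
        double bufferDelayMs = 1000.0 * recorder.getBufferSizeInFrames() / SAMPLE_RATE;
        android.util.Log.d("PcmRecorder", "estimated buffer delay ~= " + bufferDelayMs + " ms");

        recording = true;
        recorder.startRecording();
        new Thread(() -> {
            byte[] buf = new byte[minBuf];
            try {
                while (recording) {
                    int n = recorder.read(buf, 0, buf.length);
                    if (n > 0) out.write(buf, 0, n);   // raw PCM, written as-is
                }
            } catch (java.io.IOException ignored) {
            } finally {
                recorder.stop();
                recorder.release();
            }
        }).start();
    }

    public void stop() { recording = false; }
}
```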

  • Audio decoding and resampling
       Karaoke software generally plays the accompaniment or the original vocal during recording as a reference for the singer. The low-level APIs generally do not provide decoding, so before "filling" data into the device buffer the accompaniment or original song has to be decoded into raw data, the well-known PCM. Song files are generally in MP3 or AAC format. LAME is recommended for MP3 encoding and decoding and fdk-aac for AAC; these two are the fastest so far and the most suitable where real-time performance matters. You can use them directly, compile them into FFmpeg and call FFmpeg's API, or, taking a step back, simply use FFmpeg's built-in codecs, with some difference in performance. The common practice is to compile FFmpeg with LAME or fdk-aac and call FFmpeg's API for encoding and decoding, because the uniformity of its API is a big advantage.
       On the Android platform, MediaCodec can also be used for decoding. It is a system API, easy to use, well documented, and uses hardware decoding, which is much faster than software decoding. Unfortunately the C++ (NDK) interface for MediaCodec was not provided until after Android 6.0, so considering consistency with iOS, the FFmpeg approach is recommended. (A MediaCodec-based decoding sketch is given at the end of this item for comparison.)
       The key parameters for resampling are the sampling rate, the number of channels, and the sampling depth (also called quantization precision) of the music file. For an arbitrary music file these can be considered unknown, while the sampling rates, channel layouts, and sampling depths supported by the playback hardware are limited and fixed, so the other task before "filling" data into the device buffer is resampling. FFmpeg can be used directly here; its resampling functions are quite powerful and can change the sampling rate, channel count, and sampling depth, and the documentation and plenty of online material cover the details. Note that most FFmpeg resampling tutorials on the internet ignore one problem: when resampling from mono to stereo, the perceived loudness (in decibels) changes slightly even when the nominal volume is kept unchanged, which also deserves special attention. (A mono-to-stereo sketch also follows this item.)
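
A minimal MediaExtractor + MediaCodec sketch of decoding an accompaniment file (MP3 or AAC) to PCM, as discussed above. Error handling and the output-format-changed case are trimmed; this is the system hardware-decoding path, not the FFmpeg path.

```java
import android.media.MediaCodec;
import android.media.MediaExtractor;
import android.media.MediaFormat;
import java.io.FileOutputStream;
import java.nio.ByteBuffer;

public class AudioDecoder {
    /** Decode the first audio track of srcPath into raw PCM written to pcmPath. */
    public static void decodeToPcm(String srcPath, String pcmPath) throws Exception {
        MediaExtractor extractor = new MediaExtractor();
        extractor.setDataSource(srcPath);
        MediaFormat format = null;
        for (int i = 0; i < extractor.getTrackCount(); i++) {
            MediaFormat f = extractor.getTrackFormat(i);
            if (f.getString(MediaFormat.KEY_MIME).startsWith("audio/")) {
                extractor.selectTrack(i);
                format = f;
                break;
            }
        }
        MediaCodec codec = MediaCodec.createDecoderByType(format.getString(MediaFormat.KEY_MIME));
        codec.configure(format, null, null, 0);
        codec.start();

        FileOutputStream out = new FileOutputStream(pcmPath);
        MediaCodec.BufferInfo info = new MediaCodec.BufferInfo();
        boolean inputDone = false, outputDone = false;
        while (!outputDone) {
            if (!inputDone) {
                int inIndex = codec.dequeueInputBuffer(10_000);
                if (inIndex >= 0) {
                    ByteBuffer inBuf = codec.getInputBuffer(inIndex);
                    int size = extractor.readSampleData(inBuf, 0);
                    if (size < 0) {   // no more compressed data: signal end of stream
                        codec.queueInputBuffer(inIndex, 0, 0, 0, MediaCodec.BUFFER_FLAG_END_OF_STREAM);
                        inputDone = true;
                    } else {
                        codec.queueInputBuffer(inIndex, 0, size, extractor.getSampleTime(), 0);
                        extractor.advance();
                    }
                }
            }
            int outIndex = codec.dequeueOutputBuffer(info, 10_000);
            if (outIndex >= 0) {
                ByteBuffer outBuf = codec.getOutputBuffer(outIndex);
                byte[] pcm = new byte[info.size];
                outBuf.get(pcm);
                out.write(pcm);                       // raw PCM frames
                codec.releaseOutputBuffer(outIndex, false);
                if ((info.flags & MediaCodec.BUFFER_FLAG_END_OF_STREAM) != 0) outputDone = true;
            }
        }
        out.close();
        codec.stop();
        codec.release();
        extractor.release();
    }
}
```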
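
For the simplest resampling case, converting mono PCM to stereo, here is an illustrative sketch that duplicates each sample into both channels. It is only an illustration: in practice FFmpeg's swresample handles sample-rate, channel-layout, and sample-format conversion together, and how gain is applied during upmixing is exactly the loudness consideration mentioned above.

```java
public class ChannelUtils {
    /**
     * Illustrative only: upmix 16-bit mono PCM to interleaved stereo by copying
     * each sample into the left and right channels. Whether and how gain is
     * applied during upmixing affects the perceived loudness.
     */
    public static short[] monoToStereo(short[] mono) {
        short[] stereo = new short[mono.length * 2];
        for (int i = 0; i < mono.length; i++) {
            stereo[2 * i] = mono[i];       // left channel
            stereo[2 * i + 1] = mono[i];   // right channel
        }
        return stereo;
    }
}
```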

  • Audio content storage
       Since the audio recorded by karaoke software will be edited afterwards, it is recommended to store the raw data directly when disk space is not a concern, which makes the subsequent operations easier; a song generally takes less than 100 MB. If disk space matters, the recorded data can be encoded on save and decoded again when editing (the audio encoding part is analyzed in detail in the section on saving). Note that saving the data involves frequent file writes; for performance these can be done asynchronously, and the thread synchronization for the asynchronous writes can be handled with the lock-free queue provided by the Boost library. (A simple queue-based sketch follows this item.)
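
A simple sketch of the asynchronous-write idea: the recording callback only enqueues PCM buffers, and a background thread drains the queue to disk. A java.util.concurrent blocking queue is used here instead of the Boost lock-free queue mentioned above, since this sketch stays on the Java side.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class AsyncPcmWriter {
    private final LinkedBlockingQueue<byte[]> queue = new LinkedBlockingQueue<>();
    private final Thread writerThread;
    private volatile boolean running = true;

    public AsyncPcmWriter(String path) throws IOException {
        final FileOutputStream out = new FileOutputStream(path);
        writerThread = new Thread(() -> {
            try {
                while (running || !queue.isEmpty()) {
                    byte[] chunk = queue.poll(100, TimeUnit.MILLISECONDS);
                    if (chunk != null) out.write(chunk);   // disk I/O stays off the audio thread
                }
                out.close();
            } catch (IOException | InterruptedException ignored) {
            }
        });
        writerThread.start();
    }

    /** Called from the recording callback: copy and enqueue, never block on I/O. */
    public void write(byte[] pcm, int length) {
        byte[] copy = new byte[length];
        System.arraycopy(pcm, 0, copy, 0, length);
        queue.offer(copy);
    }

    public void close() throws InterruptedException {
        running = false;
        writerThread.join();
    }
}
```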

(3) Audio editing page


  The first two pictures are SuperKtv, and the last two pictures are Sing Bar. While this article was being written, the editing page of Sing Bar went through a redesign: the page layout changed dramatically and the style became more similar to Music Street. Old users should recognize that the interface the author imitated, the older Sing Bar layout, looks very similar to SuperKtv; it is a pity the latest Sing Bar app could not be reproduced with higher fidelity.

  • Audio playback
       The editing page of karaoke software needs audio playback, again with low latency, because the previously recorded vocal and the accompaniment are played at the same time and should sound like a finished album track. On Android, OpenSL ES and AAudio are the better choices: to achieve good synchronization a low-latency API is needed, and the playback timing of the two tracks must also be controlled precisely. The simpler way is to open two players, one for the vocal and one for the accompaniment, and play them together. The more useful way is to use a single player and mix the PCM data of the vocal and the accompaniment before playing; the latter synchronizes better and is more efficient. SuperKtv uses the second scheme and the result is quite good (a small mixing sketch follows this item). Before playback, pay attention to the channel count, sampling rate, and sampling depth of the PCM data and resample if necessary.
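
A minimal sketch of the second scheme: mixing vocal and accompaniment PCM sample by sample before handing the result to a single player. Summation with clamping is the simplest mixing strategy; production mixers usually add smarter limiting on top of the per-track gains shown here.

```java
public class PcmMixer {
    /**
     * Mix two 16-bit PCM buffers (same sample rate and channel layout) into one,
     * applying per-track gain and clamping to the 16-bit range to avoid overflow.
     */
    public static short[] mix(short[] vocal, short[] accompaniment,
                              float vocalGain, float accompanimentGain) {
        int len = Math.min(vocal.length, accompaniment.length);
        short[] mixed = new short[len];
        for (int i = 0; i < len; i++) {
            int sample = (int) (vocal[i] * vocalGain + accompaniment[i] * accompanimentGain);
            // Clamp: exceeding [-32768, 32767] would wrap around and sound like noise.
            if (sample > Short.MAX_VALUE) sample = Short.MAX_VALUE;
            if (sample < Short.MIN_VALUE) sample = Short.MIN_VALUE;
            mixed[i] = (short) sample;
        }
        return mixed;
    }
}
```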

  • Volume control
       In audio processing it is hard to work on encoded audio, so the raw data is processed instead; the volume, sound effect, and pitch processing below all operate on PCM.
       As everyone knows, sound is a wave, and volume can be understood as the amplitude of that wave, also called loudness. Changing the volume means changing the amplitude. Mathematically, to increase the amplitude of a sine wave you simply multiply it by a gain, and the same applies to PCM data: multiply it by a gain (or just let FFmpeg do it). Note that after increasing the amplitude, each sample must not exceed the maximum and minimum values of the format. Taking 16-bit integer precision as an example, if the result exceeds 32767 or falls below -32768, the sound quality is damaged and it sounds like noise. (A gain sketch with clamping follows this item.)
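
As a concrete illustration of "multiplying PCM by a gain", the sketch below converts a decibel adjustment into a linear factor with the textbook formula gain = 10^(dB/20) and applies it with clamping. This is an illustrative sketch, not Sing Bar's code.

```java
public class VolumeUtils {
    /** Apply a volume change expressed in decibels to 16-bit PCM samples in place. */
    public static void applyGainDb(short[] pcm, double gainDb) {
        // Amplitude gain from a decibel value: dB = 20 * log10(gain)  =>  gain = 10^(dB/20).
        double gain = Math.pow(10.0, gainDb / 20.0);
        for (int i = 0; i < pcm.length; i++) {
            int v = (int) Math.round(pcm[i] * gain);
            // Clamp to the 16-bit range; overflowing 32767 / -32768 produces audible distortion.
            if (v > Short.MAX_VALUE) v = Short.MAX_VALUE;
            if (v < Short.MIN_VALUE) v = Short.MIN_VALUE;
            pcm[i] = (short) v;
        }
    }
}
```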

  • Pitch control
       Pitch is determined mainly by the frequency of the sound, but is also related to its intensity. For a pure tone of fixed intensity, the pitch rises and falls with the frequency; for a pure tone of fixed frequency, a low-frequency tone sounds lower in pitch as the intensity increases, while a high-frequency tone sounds higher in pitch as the intensity increases. For pitch adjustment, SoundTouch (https://sourceforge.net/projects/soundtouch/) is recommended; it is relatively simple to use and also offers features such as tempo change.
       Since pitch is determined mainly by the frequency of the sound wave, it connects to the wave-related material from mathematics and physics and can also be implemented by hand. It can be understood loosely as resampling the audio data to another frequency without reducing the amount of data. By the idea behind the Fourier transform, any periodic wave can be decomposed into a superposition of sine waves. The common practice is therefore to transform the PCM data into the frequency domain with a Fourier transform and change the frequency of the fundamental; because changing the frequency may shrink or stretch the data, interpolation is needed, and finally an inverse Fourier transform brings the data back to the time domain, yielding PCM with a changed frequency, that is, a changed pitch. (A deliberately naive resampling-based illustration follows this item.)
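
SoundTouch does the real work here while preserving duration. Purely to illustrate the "pitch follows frequency" relationship, the much cruder sketch below shifts pitch by linear-interpolation resampling of mono 16-bit PCM; unlike SoundTouch, it also changes the duration, so treat it as an illustration only.

```java
public class PitchUtils {
    /**
     * Crude pitch shift by resampling with linear interpolation (mono PCM).
     * ratio = 2.0 raises the pitch one octave but also halves the duration;
     * SoundTouch avoids that side effect with time-stretching techniques.
     */
    public static short[] naivePitchShift(short[] pcm, double ratio) {
        int outLen = (int) (pcm.length / ratio);
        short[] out = new short[outLen];
        for (int i = 0; i < outLen; i++) {
            double srcPos = i * ratio;
            int idx = (int) srcPos;
            double frac = srcPos - idx;
            short a = pcm[Math.min(idx, pcm.length - 1)];
            short b = pcm[Math.min(idx + 1, pcm.length - 1)];
            out[i] = (short) Math.round(a * (1 - frac) + b * frac);  // linear interpolation
        }
        return out;
    }
}
```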

  • Sound effect control
       Because sound is a wave, sound effects are also produced by exploiting properties of waves; the echo effect, for example, is built on reflection. Zhan Xiaokai gives concrete implementations of audio effects with Sox in "Advanced Audio and Video Guide"; FFmpeg can also do audio effect processing, and SuperKtv uses both. Sox is slightly more complete for audio effects, so Sox (http://sox.sourceforge.net/) is recommended; it is known as the Swiss army knife of audio processing. Unfortunately the library stopped being updated some time ago, and its role has gradually been taken over by the desktop application Audacity.
       Audio effect processing here is divided into reverberation, equalization, and compression. Reverberation can be understood as the echo of singing inside a room; the influencing factors are generally the size of the room, the reflectivity of the walls (decoration materials and so on), and the position of the singer, so by changing the room size and other parameters one can imitate the effect of a professional concert hall. The reverberation parameters of the Vienna concert hall, for example, are known, and using them one can simulate singing there. An equalizer adjusts the amplification of different frequency components separately; the equalizers on mixing consoles usually only adjust the high, middle, and low frequency bands, and equalizers also provide high-pass and low-pass filtering. The phonograph effect in Sing Bar, for instance, uses an equalizer: it filters out certain frequencies on the one hand and changes the gain of certain frequency components on the other to obtain the phonograph sound. Baidu Baike defines a compressor as follows: "A compressor is an amplifier whose gain decreases as the input signal level increases; what it essentially changes is the ratio of input to output signal. Compressors are one of the two most common devices used to process the dynamic range of audio signals." That definition should be clear enough. The Sox implementation mainly comes down to adjusting parameters, and by stacking these three effects you can tune styles such as KTV, phonograph, pop, and rap. The custom sound effect in the screenshot above exposes some important parameters of the reverberation, equalization, and compression effects for adjustment; the other sound effects can be understood as fixed parameter combinations based on experience. For the principles of the three algorithms, refer to reverb.c, biquads.c, and compand.c in the Sox source code. (A toy echo sketch follows this item.)
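
To make the "sound effects are built from wave properties" point concrete, here is a minimal feedback-delay echo on 16-bit mono PCM. Real reverbs (such as Sox's reverb.c) layer many delay lines plus filtering, so this is only a toy illustration of the reflection idea.

```java
public class EchoEffect {
    /**
     * Toy echo: each output sample adds a decayed copy of the output from
     * `delaySamples` earlier (a single feedback delay line), then clamps.
     */
    public static short[] addEcho(short[] pcm, int delaySamples, float decay) {
        short[] out = new short[pcm.length];
        for (int i = 0; i < pcm.length; i++) {
            int v = pcm[i];
            if (i >= delaySamples) {
                v += (int) (out[i - delaySamples] * decay);  // reflected, attenuated copy
            }
            if (v > Short.MAX_VALUE) v = Short.MAX_VALUE;
            if (v < Short.MIN_VALUE) v = Short.MIN_VALUE;
            out[i] = (short) v;
        }
        return out;
    }
}
```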

(4) Video recording page

  The first three pictures are the video recording interface of SuperKtv, and the last three pictures are the video recording interface of Sing Bar.
  Comparing the functions, the SuperKtv beauty panel lacks sharpening, face slimming, and eye enlarging, and the prop (sticker) function is not implemented; this is analyzed later.
   The audio part is the same as before, so the focus here is the video recording. Video recording needs to save the video with the beauty effects applied; alternatively, the original frames can be saved and processed on the editing page. Sing Bar uses the former, so SuperKtv uses the first solution as well.

  • Camera operation, beauty, and filters
       These require some OpenGL ES background and can be understood as processing every pixel of every frame and then rendering it to the screen. For the detailed principles, refer to the blogger's OpenGL ES series:
    Android OpenGL ES from entry to advanced (1): develop a beauty camera in five minutes
    Android OpenGL ES from entry to advanced (6): a first look at OpenGL ES portrait whitening and skin smoothing
    Android OpenGL ES from entry to advanced (8): the universal Lookup filter
    The above belongs to the basics of OpenGL ES and is not covered in this article; if you have questions, leave a message.

  • Stickers (props)
       In beauty camera products the sticker function is the highlight; Sing Bar calls them props, and Douyin's sticker feature is currently the most powerful. Implementing this kind of function relies on facial landmark detection with fairly strict real-time requirements. The domestic commercial face recognition SDKs that do this well come from SenseTime and Megvii, and the beauty products of small and medium-sized companies generally license SDKs from these two; as far as I know, Douyin and FaceU use ByteDance's own face recognition SDK. Because of the real-time requirements, deep learning is generally used, with detection running on a trained model. If real-time performance is not critical, Dlib (http://dlib.net/) can be used; it works well alongside OpenCV, is relatively simple to use, and is also based on deep learning, with the model given in the official documentation being roughly 100 MB. Being open source and free, it is far behind commercial SDKs in performance: measured on a 1080 x 720 picture, Dlib takes roughly 400-500 ms, even 1000 ms on some low-end phones, while commercial SDKs finish within 20 ms. That is why the open-source options are free while the commercial ones sell for so much, and it is also a reminder to respect intellectual property. Because of these limitations, SuperKtv does not have dynamic stickers for the time being.

   On using stickers, see another post, "Android OpenGL ES from entry to advanced (7): OpenGL ES 2D stickers and Blend blending", which covers static stickers. Dynamic stickers are a bit more involved: the position and angle of the sticker must be updated continuously based on the face, sometimes with translation and scaling, and the sticker's motion rules must be defined in advance. Taking the SenseTime SDK as an example, the sticker's motion is defined in a JSON file; the sticker module first parses the JSON and then animates over time. That is the general principle of dynamic stickers (a tiny sketch with a made-up JSON schema follows).
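
The sticker-configuration idea can be illustrated with a tiny, made-up JSON schema (the real SenseTime format is different and proprietary): each keyframe of the sticker animation carries an offset and scale relative to a tracked face point, parsed once and then applied over time.

```java
import org.json.JSONArray;
import org.json.JSONObject;
import java.util.ArrayList;
import java.util.List;

/** One keyframe of a sticker animation; the JSON schema here is hypothetical. */
public class StickerFrame {
    public final float offsetX, offsetY, scale;
    public final long durationMs;

    StickerFrame(float offsetX, float offsetY, float scale, long durationMs) {
        this.offsetX = offsetX; this.offsetY = offsetY;
        this.scale = scale; this.durationMs = durationMs;
    }

    /** Parse {"frames":[{"dx":0.1,"dy":-0.2,"scale":1.0,"duration":40}, ...]} */
    public static List<StickerFrame> parse(String json) throws Exception {
        JSONArray frames = new JSONObject(json).getJSONArray("frames");
        List<StickerFrame> result = new ArrayList<>();
        for (int i = 0; i < frames.length(); i++) {
            JSONObject f = frames.getJSONObject(i);
            result.add(new StickerFrame(
                    (float) f.getDouble("dx"), (float) f.getDouble("dy"),
                    (float) f.getDouble("scale"), f.getLong("duration")));
        }
        return result;
    }
}
```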

  • Video encoding
       Unlike audio, raw video takes up an enormous amount of disk space, so saving recorded video as raw data is unthinkable; the recording must be encoded before saving. There are many video encoding formats, with H.264 and H.265 popular on mobile; SuperKtv uses H.264, so this article uses H.264 as the example.
       If the video is saved with the beauty effects applied and no second round of frame editing is needed (as in Sing Bar), MediaCodec plus a Surface is the first choice on Android: hardware encoding is faster than software encoding, and a system API is more convenient to use. MediaCodec can encode the contents of the Surface directly, so you only need to write the encoded data to a file, which fits SuperKtv's needs very well (a minimal setup sketch follows this item). However, hardware encoding compatibility is not great; among thousands of Android models there are always a few phones with one problem or another, so software encoding is sometimes used instead. For H.264, x264 is recommended; its performance is better than the H.264 encoders bundled with FFmpeg, although the common usage is still to compile x264 into FFmpeg and call it through the unified interface. This route is more work: the camera callback data must first be processed for beauty, filters, and so on, and then converted if the data format requires it. Android has two Camera APIs, and Camera1 is typically used with NV21 while Camera2 is typically used with YUV420; when encoding with FFmpeg it is convenient to convert everything to YUV420P first and then encode. Conversion algorithms are easy to find online, and libyuv, which is very capable, is recommended here. This part requires an understanding of the common image color spaces, especially the RGBA family and the YUV family.
       If the original frames need to be re-edited on the editing page, only the raw picture can be saved, and either MediaCodec or FFmpeg will do; for this use case the difference between the two is not large. Hardware encoding is fast but slightly less compatible, software encoding is compatible but slightly slower, so choose as appropriate. FFmpeg also supports hardware encoding, but it needs extra handling at compile time, and if hardware encoding is the goal anyway, why not just use the system API directly.
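
A minimal sketch of the MediaCodec + Surface encoder setup described above: the encoder is configured for H.264 with COLOR_FormatSurface, its input Surface is handed to the OpenGL ES pipeline that draws the beautified frames, and encoded packets are drained to a caller-supplied sink. EGL setup, the format-changed event, and muxing are omitted; this is a sketch, not Sing Bar's code.

```java
import android.media.MediaCodec;
import android.media.MediaCodecInfo;
import android.media.MediaFormat;
import android.view.Surface;

public class SurfaceVideoEncoder {
    private MediaCodec encoder;
    private Surface inputSurface;

    public Surface prepare(int width, int height, int bitRate, int frameRate) throws Exception {
        MediaFormat format = MediaFormat.createVideoFormat(MediaFormat.MIMETYPE_VIDEO_AVC, width, height);
        format.setInteger(MediaFormat.KEY_COLOR_FORMAT,
                MediaCodecInfo.CodecCapabilities.COLOR_FormatSurface);  // input comes from a Surface
        format.setInteger(MediaFormat.KEY_BIT_RATE, bitRate);
        format.setInteger(MediaFormat.KEY_FRAME_RATE, frameRate);
        format.setInteger(MediaFormat.KEY_I_FRAME_INTERVAL, 1);

        encoder = MediaCodec.createEncoderByType(MediaFormat.MIMETYPE_VIDEO_AVC);
        encoder.configure(format, null, null, MediaCodec.CONFIGURE_FLAG_ENCODE);
        // The GL pipeline renders the beautified frames into this Surface.
        inputSurface = encoder.createInputSurface();
        encoder.start();
        return inputSurface;
    }

    /** Drain encoded H.264 packets; in a real app these go to MediaMuxer or a file. */
    public void drain(MediaCodec.BufferInfo info, PacketSink sink) {
        int index = encoder.dequeueOutputBuffer(info, 0);
        while (index >= 0) {
            sink.onPacket(encoder.getOutputBuffer(index), info);
            encoder.releaseOutputBuffer(index, false);
            index = encoder.dequeueOutputBuffer(info, 0);
        }
    }

    public interface PacketSink {
        void onPacket(java.nio.ByteBuffer data, MediaCodec.BufferInfo info);
    }
}
```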

(5) Audio and video editing page

  The first two pictures are the SuperKtv implementation, and the last two pictures are the Sing Bar implementation.
  Sing Bar has likewise redesigned its audio and video editing page. Sing Bar does not offer video editing in the singing flow, so SuperKtv follows that style and the page is mainly for audio editing; the functionality is the same as the audio editing described above. The more important part is video playback, which involves video decoding and audio and video synchronization. If video editing were to be implemented, there are two approaches: process and save with OpenGL ES in the background, or process the data with FFmpeg or OpenCV and then save; OpenGL ES is recommended for video editing.

  • Video decoding
       On Android, MediaCodec is recommended for decoding; it can decode directly onto a Surface for display, which saves a lot of extra work (data format conversion and rendering). (A decode-to-Surface sketch follows this item.)
    FFmpeg software decoding can also be used, copying the decoded data into the buffer provided by ANativeWindow for display, which is considerably more convenient than FFmpeg encoding. If OpenGL ES is needed to edit the video, the decoded data must be uploaded as textures into the buffers provided by OpenGL ES for display. Sing Bar can import external videos for editing, which can be achieved with the ideas above.
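
A minimal sketch of decoding a video track directly onto a Surface with MediaExtractor + MediaCodec, as recommended above. Frame pacing (the synchronization logic discussed next), audio, and error handling are omitted.

```java
import android.media.MediaCodec;
import android.media.MediaExtractor;
import android.media.MediaFormat;
import android.view.Surface;

public class VideoDecoderToSurface {
    public static void play(String path, Surface surface) throws Exception {
        MediaExtractor extractor = new MediaExtractor();
        extractor.setDataSource(path);
        MediaFormat format = null;
        for (int i = 0; i < extractor.getTrackCount(); i++) {
            MediaFormat f = extractor.getTrackFormat(i);
            if (f.getString(MediaFormat.KEY_MIME).startsWith("video/")) {
                extractor.selectTrack(i);
                format = f;
                break;
            }
        }
        MediaCodec codec = MediaCodec.createDecoderByType(format.getString(MediaFormat.KEY_MIME));
        // Passing the Surface here means decoded frames are rendered directly,
        // without copying pixel data back to the CPU or converting color formats.
        codec.configure(format, surface, null, 0);
        codec.start();

        MediaCodec.BufferInfo info = new MediaCodec.BufferInfo();
        boolean inputDone = false;
        while (true) {
            if (!inputDone) {
                int in = codec.dequeueInputBuffer(10_000);
                if (in >= 0) {
                    int size = extractor.readSampleData(codec.getInputBuffer(in), 0);
                    if (size < 0) {
                        codec.queueInputBuffer(in, 0, 0, 0, MediaCodec.BUFFER_FLAG_END_OF_STREAM);
                        inputDone = true;
                    } else {
                        codec.queueInputBuffer(in, 0, size, extractor.getSampleTime(), 0);
                        extractor.advance();
                    }
                }
            }
            int out = codec.dequeueOutputBuffer(info, 10_000);
            if (out >= 0) {
                codec.releaseOutputBuffer(out, true);  // true = render this frame to the Surface
                if ((info.flags & MediaCodec.BUFFER_FLAG_END_OF_STREAM) != 0) break;
            }
        }
        codec.stop(); codec.release(); extractor.release();
    }
}
```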

  • Audio and video synchronization
       Audio and video synchronization is a crucial part of video playback and directly affects the result of all the preceding work. Synchronization is generally based on timestamps, which means the correct timestamps must be written at encoding time. The camera callback carries the current frame's timestamp in microseconds, which matches the unit MediaCodec expects, so it is convenient to use when encoding; FFmpeg's timestamps differ from MediaCodec's in that they are computed against a time base, so at encoding time you only need to supply the index of the current frame. Synchronization is needed because the amount of audio data played per unit time is fixed (you only have to keep filling the playback API's callback), whereas video has no such mechanism: what is shown on screen, and when, is controlled from outside. If playback simply followed the decoding speed, decoding too fast would squeeze a long video into a short time and decoding too slowly would make the picture stutter. The synchronization process can be described roughly as follows: if the current frame is ahead, keep displaying it for the next cycle, which amounts to waiting; if it is behind, discard it and show the next frame (or the one after), thereby controlling the playback speed. Common strategies are synchronizing audio to video, synchronizing video to audio, and synchronizing both to an external reference clock; movies generally synchronize video to audio, because the audio data played per unit time is fixed. (A tiny decision-function sketch follows this item.)
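
A tiny sketch of the decision described above, using audio as the master clock. The thresholds are illustrative values, not taken from Sing Bar.

```java
public class AvSync {
    public enum Action { RENDER, WAIT, DROP }

    /**
     * Compare a decoded video frame's timestamp with the audio clock (the master
     * clock, since audio consumes data at a fixed rate). Units: microseconds.
     */
    public static Action decide(long videoPtsUs, long audioClockUs) {
        long diff = videoPtsUs - audioClockUs;
        if (diff > 40_000) return Action.WAIT;    // video is ahead: keep showing the current frame
        if (diff < -80_000) return Action.DROP;   // video is behind: drop frames to catch up
        return Action.RENDER;                     // within tolerance: render immediately
    }
}
```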

(6) Save and work list page

  The first picture is the save interface, and the second picture is the saved works list page. The interface is simple, so there is no comparison with Sing Bar.

  The goal of saving is to make the saved result identical to the effect adjusted on the editing page; it is essentially a repeat of the editing-page operations, but run in the background. The only thing the foreground needs to show is the saving progress, and the data processed in the background is simply encoded.

  • Audio encoding
       As mentioned earlier, the recorded sound is kept as raw data for the convenience of editing, but the final audio must be a finished (encoded) product, so it has to be encoded before saving. Mobile audio encoding generally uses the MP3 or AAC format: LAME is recommended for MP3 and fdk-aac for AAC, and MediaCodec or FFmpeg's built-in encoders can also be used. For SuperKtv's saving scenario the encoding speed is not critical, so any of them will do; once you understand the difference between MP3 and AAC, either format gives good results. One difference to note is that the AAC produced by MediaCodec is a raw stream without ADTS headers; if that is to be the final audio file, the ADTS headers must be added manually, whereas fdk-aac adds them automatically. (A sketch of the 7-byte ADTS header follows this item.)
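
Since the ADTS detail above trips many people up, here is a sketch of the 7-byte ADTS header that has to be prepended to every raw AAC frame MediaCodec emits before writing a standalone .aac file. The profile and sampling-rate index follow the ADTS specification; AAC LC, 44.1 kHz, stereo are assumed here.

```java
public class AdtsUtils {
    /**
     * Fill the first 7 bytes of `packet` with an ADTS header for one raw AAC frame.
     * Assumes AAC LC, 44.1 kHz, 2 channels; packetLen must include the 7 header bytes.
     */
    public static void addAdtsHeader(byte[] packet, int packetLen) {
        int profile = 2;      // AAC LC (ADTS stores "object type", offset by 1 below)
        int freqIdx = 4;      // 44.1 kHz per the ADTS sampling-frequency-index table
        int chanCfg = 2;      // stereo

        packet[0] = (byte) 0xFF;                                        // syncword 0xFFF...
        packet[1] = (byte) 0xF9;                                        // ...MPEG-2, no CRC
        packet[2] = (byte) (((profile - 1) << 6) + (freqIdx << 2) + (chanCfg >> 2));
        packet[3] = (byte) (((chanCfg & 3) << 6) + (packetLen >> 11));
        packet[4] = (byte) ((packetLen & 0x7FF) >> 3);
        packet[5] = (byte) (((packetLen & 7) << 5) + 0x1F);
        packet[6] = (byte) 0xFC;
    }
}
```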

  • Audio and video merging
       By this step the previously generated audio and video are still separate. For an audio-only recording the work is already done; for video, the audio and the video have to be merged (muxed) into one file. This can be understood as combining two files into one, with the two kinds of data distinguished by certain rules so that the audio and video data can be retrieved separately when used, a little like the audio and video synchronization on the editing page. Android's MediaMuxer API or FFmpeg can be used. When merging with MediaMuxer, if the audio is AAC it must be supplied as a raw AAC stream; with FFmpeg that is not necessary, and the job can even be done with a command line. After merging, the result is a video file in the usual sense. (A MediaMuxer sketch follows this item.)
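
A minimal MediaMuxer sketch of the merge step: one audio file and one video file are copied sample by sample into a single MP4. It assumes each source file has a single track whose format (including any codec-specific data) MediaExtractor can recover; in a real pipeline the encoders would usually feed the muxer directly.

```java
import android.media.MediaCodec;
import android.media.MediaExtractor;
import android.media.MediaMuxer;
import java.nio.ByteBuffer;

public class AvMuxer {
    /** Copy one already-encoded track from src into the muxer under dstTrack. */
    private static void copyTrack(String src, MediaMuxer muxer, int dstTrack) throws Exception {
        MediaExtractor extractor = new MediaExtractor();
        extractor.setDataSource(src);
        extractor.selectTrack(0);
        ByteBuffer buffer = ByteBuffer.allocate(1 << 20);
        MediaCodec.BufferInfo info = new MediaCodec.BufferInfo();
        int size;
        while ((size = extractor.readSampleData(buffer, 0)) >= 0) {
            info.offset = 0;
            info.size = size;
            info.presentationTimeUs = extractor.getSampleTime();
            info.flags = extractor.getSampleFlags();
            muxer.writeSampleData(dstTrack, buffer, info);
            extractor.advance();
        }
        extractor.release();
    }

    public static void merge(String audioPath, String videoPath, String outPath) throws Exception {
        MediaExtractor audio = new MediaExtractor();
        audio.setDataSource(audioPath);
        MediaExtractor video = new MediaExtractor();
        video.setDataSource(videoPath);

        MediaMuxer muxer = new MediaMuxer(outPath, MediaMuxer.OutputFormat.MUXER_OUTPUT_MPEG_4);
        int audioTrack = muxer.addTrack(audio.getTrackFormat(0));
        int videoTrack = muxer.addTrack(video.getTrackFormat(0));
        muxer.start();
        copyTrack(audioPath, muxer, audioTrack);
        copyTrack(videoPath, muxer, videoTrack);
        muxer.stop();
        muxer.release();
        audio.release();
        video.release();
    }
}
```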

(7) Local works play page

  The first picture is the audio playback interface, and the second picture is the video playback interface. The interface is simple, so there is no comparison with Sing Bar.

   After all the work done before, the final result is just a file: either an audio file such as .mp3 (with .m4a also common), or a video file such as .mp4 (with .flv and others also in use). Everyone is familiar with such files; the phone can play them with its built-in functionality, so there is no need to write another player just to enjoy or share personal works in SuperKtv. A ready-made player can be used, or even the system MediaPlayer API called directly. For the player, ExoPlayer is recommended on Android: it is Google's open-source player built on MediaCodec, supports audio, video, and their common formats, uses hardware decoding, ships with some basic controls, and is highly customizable. SuperKtv's local works playback is based on ExoPlayer; the effect is shown in the pictures above and the experience is very good. (A minimal setup sketch follows.)
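
A minimal ExoPlayer setup for playing the finished local work, assuming the ExoPlayer 2.12+ MediaItem API; SuperKtv's actual player adds controls and lifecycle handling on top of something like this.

```java
import android.content.Context;
import android.net.Uri;
import com.google.android.exoplayer2.MediaItem;
import com.google.android.exoplayer2.SimpleExoPlayer;
import com.google.android.exoplayer2.ui.PlayerView;

public class LocalWorkPlayer {
    private SimpleExoPlayer player;

    public void play(Context context, PlayerView playerView, Uri workUri) {
        player = new SimpleExoPlayer.Builder(context).build();
        playerView.setPlayer(player);                 // PlayerView supplies the basic controls
        player.setMediaItem(MediaItem.fromUri(workUri));
        player.prepare();
        player.play();                                // works for both audio (.m4a) and video (.mp4)
    }

    public void release() {
        if (player != null) {
            player.release();
            player = null;
        }
    }
}
```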

3. Summary

   Taking the Sing Bar app as a reference and the Android platform as the basis, this article summarizes implementation strategies for mobile audio and video technology. It is aimed at developers with some audio and video background and can serve as a reference for technical solutions. Because time and energy are limited, the implementation details will be published in follow-up articles; if you have any questions, please leave a message.

Please indicate the source.

Friendly links:
1. FFmpeg http://ffmpeg.org/ (a must-have tool for audio and video)
2. Sox http://sox.sourceforge.net/ (the Swiss army knife of audio)
3. SoundTouch https://sourceforge.net/projects/soundtouch/ (pitch and tempo change)
4. Dlib http://dlib.net/ (open-source facial landmark detection library)
5. Oboe https://github.com/google/oboe (low-latency audio interface for Android)


Source: https://blog.csdn.net/liuderong0/article/details/109172929