Principles and Algorithm Analysis of Variable-Speed Audio Playback on Android

Variable-speed playback of audio and video is an important feature of content apps. It covers both the video stream and the audio stream. Changing the speed of video is relatively simple: the decoded frames are presented at a higher or lower frame rate.

Changing the speed of audio is more complex. Sound is, in essence, the waves produced when objects vibrate, so changing the speed of audio means stretching or compressing the signal in the time domain. For a good listening experience, the speed must change while the sampling rate, fundamental frequency, and formants of the voice stay the same, so that the speed changes but the pitch does not.

For applications on the Android platform, there are usually three ways to implement variable-speed audio:

The native AudioTrack can itself process PCM audio streams, but it is rarely used for this purpose because by default it is tied to the system MediaPlayer, which has poor compatibility across platforms.

The Sonic library is used by ExoPlayer, Google's well-known open-source player. Internally it is based on a time-scale modification (TSM) algorithm: it locates the pitch period, uses it to split the input speech signal into frames, processes the frames, and finally synthesizes a new signal at the changed speed.

SoundTouch is built into ijkPlayer, Bilibili's open-source player. It is also based on a TSM algorithm internally, but unlike Sonic it uses correlation peaks to synthesize the signal.

Synthesizing a new signal inevitably distorts the original audio to some degree. The difference is that Sonic works from the pitch period, so its speed change has less impact on the human voice, while SoundTouch is better suited to mixed content.

In practice, however, questions arise:

  1. If the conclusion above holds and Sonic affects the human voice less, it should sound better when songs are played at a changed speed. In reality, Sonic shows obvious audible distortion at high speeds and is noticeably worse than SoundTouch. What causes this?

  2. How can this be solved, and which scenarios suit each of the two solutions?

To answer these questions, we need to analyze the principles and the specific algorithm implementations.

Principles of Variable-Speed Audio

1. The basic principle of TSM

Variable-speed audio can be implemented by analyzing either the time-domain signal or the frequency-domain signal. Because frequency-domain methods are considerably more complex, practical implementations usually work in the time domain.

Time-scale modification (TSM) is a typical approach in time-domain processing: it changes the playback speed while leaving the pitch unchanged.

Audio signal processing almost always begins by splitting the signal into analysis frames, typically 20 ms to 50 ms long, and applying a window to each frame. Because the window attenuates both ends of a frame, the frames cannot simply be cut end to end; they must overlap one another, and the processed frames are then overlapped again as synthesis frames. The ratio of the two frame spacings sets the speed. If the analysis frames overlap by 50% and the synthesis frames overlap by 75%, the frames are packed more tightly at the output, the signal becomes shorter, and playback is faster; with the overlaps reversed, playback is slower.

In short, each frame is processed (stretched or compressed as needed), and the frames are then overlap-added again into a synthesized signal at the new speed.
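The relation between overlap and speed can be checked with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, Hs is used here as a symbol for the synthesis frame spacing, and the 1764-sample frame assumes 40 ms at 44.1 kHz.

```python
def hop_size(frame_len, overlap):
    """Spacing between consecutive frames, in samples, for a fractional overlap."""
    return round(frame_len * (1.0 - overlap))


def speed_factor(frame_len, analysis_overlap, synthesis_overlap):
    """Playback speed = analysis hop Ha / synthesis hop Hs."""
    ha = hop_size(frame_len, analysis_overlap)   # spacing when reading the input
    hs = hop_size(frame_len, synthesis_overlap)  # spacing when writing the output
    return ha / hs


frame_len = 1764  # assumed: 40 ms at 44.1 kHz
print(speed_factor(frame_len, 0.50, 0.75))  # 2.0 -> output is half as long (fast)
print(speed_factor(frame_len, 0.75, 0.50))  # 0.5 -> output is twice as long (slow)
```

A larger overlap means a smaller hop, so more synthesis overlap than analysis overlap shortens the output.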


2. Naive OLA

OLA (Overlap-and-Add) is the simplest time-domain algorithm for changing audio speed, and it is the basis of the later time-domain algorithms (SOLA, SOLA-FS, TD-PSOLA, WSOLA).

In its crudest form, the audio signal is split into frames and the frames are simply spliced together at the new spacing. The idea is simple, but the drawback is obvious: the spliced signal is discontinuous, and the regions where adjacent frames overlap show fundamental-frequency distortion.

To reduce this waveform discontinuity, each frame is windowed. OLA usually applies a Hann window to the frame (panel b of the figure).

Windowing attenuates both ends of each frame, which also prepares the frame for a subsequent Fourier transform and reduces spectral leakage. The next frame (panel c) is then taken at a fixed interval Ha, windowed, and overlap-added with the previous frame (panel d) to soften the waveform discontinuity (pitch break). Even so, cutting frames at fixed positions cannot guarantee that each frame covers whole periods or that the phases of adjacent frames are aligned. The resulting distortion is known as phase jump artifacts, and the audio still sounds poor.
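A minimal pure-Python sketch of windowed OLA follows. It is an illustration under stated assumptions, not code from Sonic or SoundTouch: the function names are hypothetical, a periodic Hann window is assumed, and the output is divided by the summed window so that overlapping does not change the amplitude.

```python
import math


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def ola(x, frame_len, ha, hs):
    """Plain overlap-add: read a frame every `ha` samples, window it,
    and add it into the output every `hs` samples."""
    if len(x) < frame_len:
        raise ValueError("input is shorter than one frame")
    w = hann(frame_len)
    n_frames = (len(x) - frame_len) // ha + 1
    out_len = (n_frames - 1) * hs + frame_len
    y = [0.0] * out_len
    norm = [0.0] * out_len          # summed window, used for normalization
    for m in range(n_frames):
        src, dst = m * ha, m * hs   # fixed positions: no phase alignment at all
        for i in range(frame_len):
            y[dst + i] += w[i] * x[src + i]
            norm[dst + i] += w[i]
    return [v / g if g > 1e-8 else 0.0 for v, g in zip(y, norm)]


# Usage: a tone with a 100-sample period, played at roughly 2x speed.
tone = [math.sin(2.0 * math.pi * n / 100) for n in range(8000)]
fast = ola(tone, frame_len=400, ha=200, hs=100)
```

Because the frame positions are fixed, nothing in this sketch keeps adjacent frames in phase, which is exactly the weakness described above.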

 

As the figure shows, two frames of a periodic signal become "non-periodic" after OLA synthesis.

 

3. Waveform Similarity Overlap-Add (WSOLA)

To solve this problem, the WSOLA (Waveform Similarity Overlap-Add) algorithm finds the following frame that is most similar to the current frame and overlap-adds the two, so that the synthesized speech sounds natural.

The figure illustrates the core idea of the algorithm:

1. Cut a frame out of the original audio and apply a window;
2. Select a second frame within a range (blue dotted box); its phase should be aligned with the phase of the first frame;
3. Within another range (solid blue box), find a third frame, the one most similar to the second frame;
4. Finally, overlap-add the frames.
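The steps above can be sketched in pure Python as follows. This is a simplified illustration, not the SoundTouch implementation. Several things here are assumptions that do not come from the article: the function names, the Hann window, the 50% synthesis overlap, the search tolerance, and the use of raw cross-correlation as the similarity measure.

```python
import math


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def best_offset(x, target, center, frame_len, tol):
    """Offset d in [-tol, tol] for which the frame at center + d correlates
    best with the frame at `target` (the natural continuation)."""
    best, best_score = 0, -math.inf
    for d in range(-tol, tol + 1):
        s = center + d
        if s < 0 or s + frame_len > len(x):
            continue
        score = sum(x[target + i] * x[s + i] for i in range(frame_len))
        if score > best_score:
            best, best_score = d, score
    return best


def wsola(x, speed, frame_len=512, tol=128):
    hs = frame_len // 2              # synthesis hop (50% overlap, assumed)
    ha = max(1, round(hs * speed))   # analysis hop; speed > 1 plays faster
    w = hann(frame_len)
    y, norm = [], []
    prev = None                      # start of the frame used in the last step
    m = 0
    while m * ha + frame_len <= len(x):
        center = m * ha              # nominal position of the next frame
        if prev is None:
            start = center
        else:
            target = prev + hs       # what would follow naturally in the input
            if target + frame_len > len(x):
                break
            start = center + best_offset(x, target, center, frame_len, tol)
        dst = m * hs
        y.extend([0.0] * (dst + frame_len - len(y)))
        norm.extend([0.0] * (dst + frame_len - len(norm)))
        for i in range(frame_len):
            y[dst + i] += w[i] * x[start + i]
            norm[dst + i] += w[i]
        prev = start
        m += 1
    return [v / g if g > 1e-8 else 0.0 for v, g in zip(y, norm)]


# Usage: halve the duration of a tone whose period is 100 samples.
tone = [math.sin(2.0 * math.pi * n / 100) for n in range(4000)]
fast = wsola(tone, speed=2.0, frame_len=200, tol=50)
```

Compared with plain OLA, the only change is that each frame may shift by up to `tol` samples so that it lines up with the continuation of the previous frame.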

The problem thus becomes how to find the most similar frame. On Android, SoundTouch solves it by searching for correlation peaks, while Sonic uses a different method, AMDF-based pitch extraction.

The correlation-peak search works as its name suggests: as data arrives it is appended to a buffer; starting a fixed length after the first frame, the algorithm finds the position whose signal correlates most strongly with the first frame, and then overlap-adds the two frames.

Sonic uses AMDF (Average Magnitude Difference Function), which is extremely simple. Within a certain range, the AMDF between the starting frame and each shifted frame is computed; the distance between the starting frame and the frame with the smallest magnitude difference is the pitch period. Once the pitch period is found, the speed is changed in units of that period.
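A minimal sketch of AMDF-based pitch estimation is shown below. It illustrates the idea only and is not Sonic's code: the function names are hypothetical, and the 65 Hz to 400 Hz search range is an assumption chosen to cover typical speech.

```python
import math


def amdf(x, start, lag, frame_len):
    """Average magnitude difference between a frame and the frame `lag` samples later."""
    total = sum(abs(x[start + i] - x[start + i + lag]) for i in range(frame_len))
    return total / frame_len


def pitch_period(x, start, sample_rate, frame_len, min_hz=65, max_hz=400):
    """Lag (in samples) with the smallest AMDF inside the assumed pitch range."""
    min_lag = sample_rate // max_hz   # shortest period considered
    max_lag = sample_rate // min_hz   # longest period considered
    return min(range(min_lag, max_lag + 1),
               key=lambda lag: amdf(x, start, lag, frame_len))


# Usage: a 100 Hz tone with one overtone, sampled at 8 kHz (period = 80 samples).
sample_rate = 8000
voice = [math.sin(2.0 * math.pi * n / 80) + 0.5 * math.sin(4.0 * math.pi * n / 80)
         for n in range(800)]
period = pitch_period(voice, start=0, sample_rate=sample_rate, frame_len=240)
```

AMDF needs only subtractions and absolute values, which is why it is cheap. It assumes a single dominant period inside the search range.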

Interim summary

At the start of the article we noted that when ExoPlayer plays music, Sonic's variable-speed output is clearly distorted and noticeably worse than that of SoundTouch in ijkPlayer. This seems counterintuitive: since Sonic is built on locating the pitch period, it should do well on strongly periodic signals such as the human voice.

After comparing the two, we offer the following conjecture. Sonic does handle pure vocals better, but in most music the listener's sense of rhythm comes mainly from the accompaniment. The accompaniment is usually played by several instruments together, and Sonic has more difficulty with signals that are rich in harmonics and transient components.

The choice should therefore depend on the use case. For ordinary music, especially accompaniment and strongly percussive music, SoundTouch is the better choice. For voice-centred audio (such as crosstalk comedy, storytelling, or a cappella singing), Sonic is also a good choice.


Origin blog.csdn.net/m0_60259116/article/details/124354355