Precise clip splicing with audio and video in sync using FFmpeg

In audio and video development we often need to splice multiple clips (also called "shots") together. Several examples are given below to illustrate the different approaches.

All examples in this article assume the following prerequisites:

1. Each clip is internally in sync (its audio matches its video), but the result drifts out of sync after splicing;

2. The video streams of all clips share the same picture width, picture height, pixel aspect ratio, frame rate and time base;

3. The audio streams of all clips share the same sampling rate and time base;

4. Every clip contains both a video stream and an audio stream.

Direct concat

FFmpeg provides the concat filter, which can splice both video and audio. For example:

ffmpeg -i demo_1.mp4 -i demo_2.mp4 -filter_complex \
"[0:v][0:a][1:v][1:a]concat=n=2:v=1:a=1[v][a]" \
-map "[v]" -map "[a]" mix.mp4

For the audio and video to remain in sync after this kind of splicing, the audio stream of every clip must have exactly the same duration as its video stream.

You can use the ffprobe tool that comes with FFmpeg to view the duration information corresponding to each stream in the file:

ffprobe -v quiet -show_entries \
stream=index,codec_name,time_base,start_pts,start_time,duration_ts,duration \
-of json demo_1.mp4

The output is as follows:

{
   ...
    "streams": [
        {
            "index": 0,
            "codec_name": "h264",
            "time_base": "1/12800",
            "start_pts": 0,
            "start_time": "0.000000",
            "duration_ts": 128512,
            "duration": "10.040000"
        },
        {
            "index": 1,
            "codec_name": "aac",
            "time_base": "1/44100",
            "start_pts": 0,
            "start_time": "0.000000",
            "duration_ts": 442764,
            "duration": "10.040000"
        }
    ]
}

The field to focus on is duration. The audio stream and the video stream of this sample file have the same duration, 10.04 seconds, so no matter how many such clips you splice together there will be no audio/video synchronization problem. As shown in the figure below:

[Figure: clips whose audio and video streams are of equal duration stay in sync after splicing]

In practice, however, the audio stream and video stream of a clip are often not exactly the same length. The difference is usually tiny, on the order of tens of milliseconds, so when playing a single clip on its own you will hardly notice it; only the ffprobe command above reveals that the two durations differ.
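
If you only want to compare the two durations, a more compact ffprobe invocation (a sketch along the same lines as the command above) prints just the codec type and duration of every stream:

ffprobe -v error -show_entries stream=codec_type,duration -of csv=p=0 demo_1.mp4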

For example:

[Figure: clips whose audio and video streams have slightly different durations]

If you concat such clips directly, the output ends up like this:

[Figure: after a direct concat the audio drifts progressively out of sync with the video]

In addition, because multiple files are fed in as separate inputs, ffmpeg decodes all of the audio and video at the same time, which consumes a lot of CPU and memory. The author once tried to concat more than 60 video streams on a machine with 16 GB of RAM; after only a few frames it reported thread_get_buffer() failed and get_buffer() failed, and then exited abnormally. It is therefore strongly discouraged to use this filter when the number of inputs is unpredictable.

Splicing with the xfade and afade filters

The xfade filter makes the transition between two scenes softer and supports various transition effects, but it consumes part of the video's duration for the on-screen transition.

Similarly, afade fades out the previous audio stream and fades in the next one, and the overlap of the fade-out and fade-in is merged, so some audio duration is lost as well.

Essentially, the xfade and afade approach still just concatenates the video streams and audio streams directly; it does nothing to address audio/video synchronization.
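
For reference, a crossfade between two clips might look like the sketch below. The 1-second transition and the offset of 9 seconds assume the first clip is about 10 seconds long, and acrossfade is used here to realize the merged fade-out/fade-in described above:

ffmpeg -i demo_1.mp4 -i demo_2.mp4 -filter_complex \
"[0:v][1:v]xfade=transition=fade:duration=1:offset=9[v]; \
 [0:a][1:a]acrossfade=d=1[a]" \
-map "[v]" -map "[a]" xfade.mp4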

Improved audio and video synchronization

Analyzing these clips shows that the audio and video within each clip are in sync; the problem comes from the slight difference between the audio and video durations of the clips.

When we stitch these clips, can we use the video stream duration as a frame of reference?

The whole idea

The audio streams are no longer spliced directly. Instead, each clip's audio stream is first delayed by an amount equal to the sum of the video-stream durations of all preceding clips.

After this delay is applied, the audio streams of all clips are mixed into one main audio stream, so the moment each piece of audio starts playing no longer depends on the length of the audio that came before it.

As shown below:

[Figure: each clip's audio is delayed by the accumulated video duration of the preceding clips and then mixed onto one main audio track]

Prepare in advance an audio stream long enough for the final mix. Usually this is the background music (BGM). If you do not need background music, you can use an empty (silent) audio stream of unlimited length instead:

-f lavfi -i anullsrc=r=44100:cl=stereo
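
As a quick illustration (a sketch, not part of the original example), anullsrc can also generate silence on its own, here limited to 5 seconds and written to an arbitrarily named file:

ffmpeg -f lavfi -i anullsrc=r=44100:cl=stereo -t 5 silence.wav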

Implementation case

There are three clips: v_1.mp4, v_2.mp4 and v_3.mp4 (judging from the delays and the trim value in the command below, their video streams are roughly 10 s, 6 s and 5.4 s long).

The audio streams of these clips have a stereo channel layout and a 44100 Hz sampling rate (note that the audio time base 1/44100 seen earlier happens to be the reciprocal of the sampling rate, but that is merely a coincidence).

There is also a piece of background music, bgm.mp3. It too has a stereo channel layout and a sampling rate of 44100 Hz.

The FFmpeg splicing command:

ffmpeg -i v_1.mp4 -i v_2.mp4 -i v_3.mp4 -stream_loop -1 -i bgm.mp3 \
-filter_complex "[0:v][1:v][2:v]concat=n=3:v=1:a=0;\
[0:a]anull[a_delay_0]; \
[1:a]adelay=delays=10000:all=1[a_delay_1]; \
[2:a]adelay=delays=16000:all=1[a_delay_2]; \
[3:a]volume=volume=0.2,atrim=end_sample=943740[bgm]; \
[bgm][a_delay_0][a_delay_1][a_delay_2]amix=inputs=4:duration=first" \
-b:v 2M \
-b:a 128k \
-movflags faststart \
concat.mp4

Although the BGM in this example is long enough, in real applications the background music may be user-supplied, so it will not always be long enough. To be safe, the BGM input is looped indefinitely with -stream_loop -1.

Since the first clip's audio needs no delay, the anull filter simply passes it through so that it carries a stream label consistent with the others;

The audio stream [1:a] from v_2.mp4 is delayed by the duration of v_1.mp4's video stream (converted from seconds to milliseconds);

The audio stream [2:a] from v_3.mp4 is delayed by the sum of the video-stream durations of v_1.mp4 and v_2.mp4;

The audio stream [3:a] from bgm.mp3 has its volume lowered and is then trimmed to the total video duration; the trim point is calculated in samples.

Why trim by sample count? As noted above, all clips must share the same audio sampling rate, so there is a constant, 44100, meaning 44100 samples per second. Once the total video duration is known, it can be converted into the number of samples the audio should contain. The time base is deliberately not used, because the time base of the source background music may not match that of the clips' audio streams; counting samples avoids that complication.
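
As a concrete check (the total duration here is inferred from the delays and the trim value in the command above, not taken from the original files): with a total video duration of 21.4 seconds,

21.4 s * 44100 samples/s = 943740 samples

which is exactly the end_sample value passed to atrim.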

With these preparations done, all that remains is the mix: every delayed audio stream is mixed onto the BGM stream. Note that the [bgm] label must come first, because the amix filter is given duration=first, meaning the duration of the first input stream is kept.

The output is now a video whose audio and video are strictly in sync.

How to calculate the delay more accurately

To keep the logic simple, the example command above hard-codes the delays in milliseconds. But how should those numbers be computed?

First look at how the duration of the video is calculated:

duration = duration_ts * time_base

For example, the video stream information is:

"time_base": "1/12800"
"start_pts": 0
"start_time": "0.000000"
"duration_ts": 128000
"duration": "10.000000"

That is:

duration = 128000 * (1 / 12800) = 10 (seconds)

duration_ts is an integer; because division is involved, the result may not be exact, so precision is lost in the duration field.

If such an imprecise value is accumulated clip after clip, the error is compounded.

The correct approach is to take each clip's duration_ts, which is an int64. Since concat requires the video streams of all sub-clips to share the same time base, these duration_ts values can simply be added together.

Only when the delay is actually applied is the accumulated duration_ts converted into a duration, so the error always stays within a reasonable bound.
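
A minimal sketch of this calculation as a shell script (the file names are illustrative; it assumes bash, ffprobe in PATH, and that all clips share the same video time base, as the concat filter requires):

#!/bin/bash
# Sketch: accumulate duration_ts of each clip's video stream and convert the
# running total to milliseconds only when emitting each adelay value.

clips=(v_1.mp4 v_2.mp4 v_3.mp4)

# read the shared video time base from the first clip, e.g. "1/12800" -> 12800
tb=$(ffprobe -v quiet -select_streams v:0 -show_entries stream=time_base -of csv=p=0 "${clips[0]}")
tb_den=${tb#*/}

total_ts=0
for f in "${clips[@]}"; do
    # the delay for this clip's audio is the accumulated video duration so far
    echo "$f: adelay=$(( total_ts * 1000 / tb_den ))"
    # read this clip's video duration_ts and add it to the running total
    ts=$(ffprobe -v quiet -select_streams v:0 -show_entries stream=duration_ts -of csv=p=0 "$f")
    total_ts=$(( total_ts + ts ))
done

# total number of audio samples for atrim, assuming a 44100 Hz sampling rate
echo "atrim end_sample: $(( total_ts * 44100 / tb_den ))"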

Why add a delay at all?

When I discussed this solution with colleagues, one of them objected: "The audio and video are already out of sync when splicing, wouldn't adding a delay make them even more out of sync?" Once I understood his reasoning, I realized the question comes from a basic misconception. Look at this picture:

[Figure: after demuxing, every input stream starts at time 0 regardless of the order of the input sources]

No matter how many sources are fed into FFmpeg and how many streams they contain, after demuxing every stream starts at time 0; an input stream is not shifted later merely because of the order of the input sources.

Another concat implementation

ffmpeg also provides a batch merging facility (the concat demuxer). First, list the files to be spliced in the following form:

file 'media_1.mp4'
file 'media_2.mp4'
...

Each line starts with the keyword file and wraps the file name in single quotes; both absolute and relative paths are supported. Save the list as a UTF-8 (without BOM) manifest file, e.g. manifest.txt, and then run:

ffmpeg -f concat -i manifest.txt -c copy concat.mp4

This method aligns audio and video automatically. The alignment algorithm is very similar to the logic described above: it also works by offsetting, but the difference is that for audio it directly rewrites the presentation timestamps (pts) of the packets (see the implementation of concat_read_packet in libavformat/concatdec.c).

This exposes a problem, though. As in the earlier example, suppose the audio streams of the two files are shorter than their respective video streams. During playback you may not notice anything wrong (may, because it depends on the player) and viewing seems normal. But if you import such a file into editing software like Premiere Pro, you will find the two pieces of audio sitting back to back, just as with a direct concat. Likewise, if you use an FFmpeg command to extract the audio stream from such a spliced video, you get the same result: there is no silent audio in between.
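
For example, the audio can be pulled out for inspection with a command like this (a sketch; the output file name is arbitrary):

ffmpeg -i concat.mp4 -vn -c:a copy extracted.m4a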

The solution is to insert silent audio between the two audio segments so that the audio stream plays naturally from start to finish. There are two ways to do this:

ffmpeg -f concat -i manifest.txt -filter_complex "[0:a]aresample=async=1000" concat.mp4

or

ffmpeg -f concat -i manifest.txt -async 1000 concat.mp4

Both commands do the same thing: they automatically fill the gaps in the middle with silence. The async parameter is tunable; a value of 1000 is usually enough.

Note that only gaps in the middle are filled. If the audio stream of the last clip is also shorter than its video stream, the silence is not extended to the end of the video.

Some readers may wonder: in the example both files have audio streams shorter than their video streams, but what if the audio is longer than the video? With this method, the result of splicing is:

The last frame of the previous clip's video keeps being displayed until that clip's audio finishes playing (the video stream has not yet ended);

The following clip's video stops once it has finished playing; a typical player then holds a still image (even though the video stream has ended), while the audio stream continues until it is done.

Original article: FFmpeg realizes accurate clip splicing with audio and video synchronization, Jack_Chai's Blog (CSDN). Source: blog.csdn.net/yinshipin007/article/details/130645720
