Audio and video development - audio and video synchronization

1. Flow chart of FFmpeg simple player

Figure 1 Flow chart of the FFmpeg simple player

The purpose of audio and video synchronization is to make the played sound consistent with the displayed picture.

The video is played by frames, and the image display device displays one frame at a time. The video playback speed is determined by the frame rate, which indicates how many frames are displayed per second;

Audio is played by sample points, and the sound playback device plays one sample point at a time. The sound playback speed is determined by the sampling rate, which indicates how many sample points are played per second.

If the video is simply played at the frame rate and the audio at the sampling rate, there is no synchronization mechanism between the two. Even if the audio and video start out roughly in sync, they will gradually drift apart over time, and the loss of synchronization will become more and more noticeable.

This is because, first, playback timing is difficult to control precisely, and second, anomalies and errors accumulate over time. Therefore a synchronization strategy must be adopted to continuously correct the audio-video time difference, so that image display and sound playback stay consistent overall.

The basic approach to audio and video synchronization is to choose one clock (audio clock, video clock, or external clock) as the master clock; the audio or video clock that is not the master clock is a slave clock. During playback, the master clock is used as the synchronization reference: the difference between the slave clock and the master clock is checked continuously, and the slave clock is adjusted so that it catches up with the master clock (when lagging) or waits for it (when leading).

According to different types of master clocks, audio and video synchronization modes can be divided into the following three types:

  • Audio is synchronized to video, and the video clock is used as the master clock;

  • Video is synchronized to audio, and the audio clock is used as the master clock;

  • Both audio and video are synchronized to an external clock, and the external clock is used as the master clock.

The default synchronization method of ffplay is to synchronize video to audio.
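ffplay expresses these three modes as an enum and, when the preferred stream is absent, falls back to another clock. Below is a simplified sketch following ffplay's get_master_sync_type() (names as in the ffplay source):

enum {
    AV_SYNC_AUDIO_MASTER,   /* default: sync video to audio */
    AV_SYNC_VIDEO_MASTER,   /* sync audio to video */
    AV_SYNC_EXTERNAL_CLOCK, /* sync both to an external clock */
};

/* Pick the effective master clock, falling back when the preferred
 * stream does not exist in the file. */
static int get_master_sync_type(VideoState *is)
{
    if (is->av_sync_type == AV_SYNC_VIDEO_MASTER)
        return is->video_st ? AV_SYNC_VIDEO_MASTER : AV_SYNC_AUDIO_MASTER;
    if (is->av_sync_type == AV_SYNC_AUDIO_MASTER)
        return is->audio_st ? AV_SYNC_AUDIO_MASTER : AV_SYNC_EXTERNAL_CLOCK;
    return AV_SYNC_EXTERNAL_CLOCK;
}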

2. I frame/IDR frame/P frame/B frame

I frame: an I frame (Intra-coded picture, intra-frame coded frame, often called a key frame) contains complete image information. It is an intra-coded picture, contains no motion vectors, and needs no other frames as references when decoded. The channel can therefore be switched at an I frame without losing the image or producing undecodable data. I frames are used to stop errors from accumulating and spreading. In a closed GOP, the first frame of each GOP must be an I frame, and the data of the current GOP does not reference data in preceding or following GOPs.

IDR frame: an IDR frame (Instantaneous Decoding Refresh picture) is a special I frame. When the decoder decodes an IDR frame, it clears the DPB (Decoded Picture Buffer, the list of forward and backward reference frames), outputs or discards all decoded data, and then starts a new decoding sequence. Pictures after an IDR frame never reference pictures before it, so an IDR frame prevents error propagation in the video stream; it is also a safe access point for decoders and players.

P frame: a P frame (Predictive-coded picture, predictively coded frame) is an inter-frame coded frame; it is predicted from the preceding I frame or P frame.

B frame: a B frame (Bi-directionally predicted picture) is an inter-frame coded frame that uses the preceding and/or following I or P frames for bidirectional prediction. B frames cannot be used as reference frames. B frames have a higher compression ratio but require more buffering time and more CPU, so they suit local storage and video on demand rather than live-broadcast systems with strict real-time requirements.

3. GOP

GOP (Group Of Pictures) is a group of consecutive pictures, consisting of one I frame and several B/P frames; it is the basic access unit for encoding and decoding. Two parameters, M and N, are commonly used to describe a GOP structure: M specifies the distance between two anchor frames (an anchor frame is a frame that other frames may reference, i.e. an I or P frame), and N specifies the size of the GOP. For example, with M = 3 and N = 15 the GOP structure is: IBBPBBPBBPBBPBB

There are two types of GOP: closed GOP and open GOP.

  • Closed GOP: frames in a closed GOP only reference pictures within that GOP and never data from preceding or following GOPs. This mode implies that, in display order, a closed GOP always starts with an I frame and ends with a P frame;

  • Open GOP: decoding the B frames of an open GOP may require frames from the previous GOP or the following one. An open GOP can only occur when the stream contains B frames.

In an open GOP, ordinary I frames and IDR frames behave differently, and the two frame types must be clearly distinguished. In a closed GOP, an ordinary I frame behaves the same as an IDR frame, so no distinction needs to be made.

The reference dependencies of I, P and B frames in open and closed GOPs are shown in the figure below:

Figure 3 GOP modes

4. DTS and PTS

DTS (Decoding Time Stamp) indicates when a compressed frame is decoded. PTS (Presentation Time Stamp) indicates when the raw frame obtained from decoding is displayed. For audio, DTS and PTS are identical. For video, because B frames are bidirectionally predicted and depend on both preceding and following frames, decoding order differs from display order wherever B frames occur, i.e. DTS differs from PTS; for video without B frames, DTS and PTS are of course the same. The figure below uses an open GOP as an example to illustrate the decoding order and display order of a video stream:

Figure 4 Decoding and display sequence

Take the "B[1]" frame in the figure as an example. When decoding the "B[1]" frame, you need to refer to the "I[0]" frame and the "P[3]" frame. Therefore, the "P[3]" frame Must be decoded before the 'B[1]' frame. This leads to an inconsistency between the decoding order and the display order, and the frames displayed later need to be decoded first.
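This reordering can be observed directly by dumping packet timestamps with the demuxing API: in a stream with B frames, dts increases monotonically per stream while pts does not. A minimal sketch (error handling trimmed; the file name is a placeholder):

#include <stdio.h>
#include <inttypes.h>
#include <libavformat/avformat.h>

int main(void)
{
    AVFormatContext *ic = NULL;
    AVPacket *pkt = av_packet_alloc();

    if (avformat_open_input(&ic, "input.mp4", NULL, NULL) < 0)
        return -1;
    avformat_find_stream_info(ic, NULL);

    // per stream, dts comes out monotonically increasing (decode order),
    // while pts jumps back and forth wherever B frames occur
    while (av_read_frame(ic, pkt) >= 0) {
        printf("stream=%d dts=%" PRId64 " pts=%" PRId64 "\n",
               pkt->stream_index, pkt->dts, pkt->pts);
        av_packet_unref(pkt);
    }

    av_packet_free(&pkt);
    avformat_close_input(&ic);
    return 0;
}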

video_decode_frame() function:

This function does the following:

  1. Take a packet from the video packet queue.

  2. Send the obtained packet to the decoder.

  3. Receive a decoded frame from the decoder; this frame is returned through an output parameter for the calling function to process.

Note the following points:

  1. For video files containing B frames, the order in which video frames are stored is different from the order in which they are displayed.

  2. The decoder's input is a packet queue; packets are decoded in the same order they are stored, which is increasing dts order (dts is the decoding timestamp, so storage order and decoding order are both increasing-dts order). avcodec_send_packet() sends the packets of the video file to the decoder in sequence, e.g. in the order IPBBPBB.

  3. The decoder's output is a sequence of frames, emitted in increasing pts order (pts is the presentation timestamp). The reordering between pts and dts is handled inside the decoder; the user program does not need to care about it. Frames are received from the decoder in the order IBBPBBP.

  4. The decoder buffers a certain number of frames: after decoding starts, several packets must be sent before the first frame is output. This is easy to understand, because frames depend on each other during decoding. For example, after the three frames IPB are sent to the decoder, decoding the B frame depends on the I frame and the P frame, so the I and P frames must remain inside the decoder and cannot be deleted before the B frame is output. Understanding this makes it easy to understand the display and deletion mechanism of frames in the video frame queue.

  5. Frames buffered in the decoder can be retrieved by flushing (draining) it: call avcodec_send_packet(..., NULL), then call avcodec_receive_frame() repeatedly to fetch all buffered frames. Once they have all been fetched, avcodec_receive_frame() returns AVERROR_EOF.

How can an output frame of the decoder be matched to its input packet? Compare the values of frame->pkt_pos and pkt.pos: both indicate the packet's byte offset in the video file, so if the two are equal, the frame came from that packet. By tracing these two variables in a debugger, the relationship between the decoder's input packets and output frames can be observed.
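The send/receive pattern described above, including the flush at end of stream, can be condensed into a sketch like the following (dec is assumed to be an opened AVCodecContext; this is illustrative, not the article's exact video_decode_frame()):

#include <libavcodec/avcodec.h>

// Sketch: push one packet (or NULL to flush) and drain all frames the
// decoder is ready to emit. Returns 0, AVERROR_EOF, or a decoding error.
static int decode_packet(AVCodecContext *dec, const AVPacket *pkt)
{
    int ret = avcodec_send_packet(dec, pkt);     // pkt == NULL starts the flush
    if (ret < 0)
        return ret;

    AVFrame *frame = av_frame_alloc();
    for (;;) {
        ret = avcodec_receive_frame(dec, frame);
        if (ret == AVERROR(EAGAIN)) { ret = 0; break; }  // decoder wants more packets
        if (ret == AVERROR_EOF)                  break;  // all buffered frames drained
        if (ret < 0)                             break;  // real decoding error

        // frame->pkt_pos can be compared against the pkt.pos values sent
        // earlier to match this output frame to its input packet
        av_frame_unref(frame);
    }
    av_frame_free(&frame);
    return ret;
}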

5. Synchronizing video to audio

Synchronizing video to audio is the default synchronization method of ffplay, and it is implemented in the video playback thread. The video_refresh() function implements the core steps of video playback, including the synchronization control.

The related functions are as follows:

main() -->
player_running() -->
open_video() -->
open_video_playing() -->
SDL_CreateThread(video_playing_thread, ...)   // creates the video playback thread

video_playing_thread() -->
video_refresh()

The source code of the video playback thread is as follows:

static int video_playing_thread(void *arg)
{
    player_stat_t *is = (player_stat_t *)arg;
    double remaining_time = 0.0;

    while (1)
    {
        if (remaining_time > 0.0)
        {
            av_usleep((unsigned)(remaining_time * 1000000.0));
        }
        remaining_time = REFRESH_RATE;
        // display the current frame immediately, or delay remaining_time before displaying it
        video_refresh(is, &remaining_time);
    }
    return 0;
}
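For reference, the ffplay source defines REFRESH_RATE as 0.01 (seconds), so when no frame is due the loop polls video_refresh() at most once every 10 ms.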
The source code of the video_refresh() function is as follows:

/* called to display each frame */
static void video_refresh(void *opaque, double *remaining_time)
{
    player_stat_t *is = (player_stat_t *)opaque;
    double time;
    static bool first_frame = true;

retry:
    if (frame_queue_nb_remaining(&is->video_frm_queue) == 0)  // all frames have been displayed
    {
        // nothing to do, no picture to display in the queue
        return;
    }

    double last_duration, duration, delay;
    frame_t *vp, *lastvp;

    /* dequeue the picture */
    lastvp = frame_queue_peek_last(&is->video_frm_queue);     // previous frame: the frame shown last time
    vp = frame_queue_peek(&is->video_frm_queue);              // current frame: the frame waiting to be shown

    // lastvp and vp are not in the same playback sequence (a seek starts a new
    // playback sequence): reset frame_timer to the current time
    if (first_frame)
    {
        is->frame_timer = av_gettime_relative() / 1000000.0;
        first_frame = false;
    }

    // pause handling: keep displaying the previous frame
    if (is->paused)
        goto display;

    /* compute nominal last_duration */
    last_duration = vp_duration(is, lastvp, vp);        // previous frame's duration: vp->pts - lastvp->pts
    delay = compute_target_delay(last_duration, is);    // compute delay from the difference between the video clock and the sync clock

    time = av_gettime_relative()/1000000.0;
    // if the current frame's display time (is->frame_timer + delay) is later than
    // the current time (time), it is not yet time to show it
    if (time < is->frame_timer + delay) {
        // not time yet: update remaining_time to the gap between now and the display time
        *remaining_time = FFMIN(is->frame_timer + delay - time, *remaining_time);
        // not time yet: do not display, just return
        return;
    }

    // update frame_timer
    is->frame_timer += delay;
    // correct frame_timer: if it lags the current system time by too much (more than
    // the maximum sync threshold), reset it to the current system time
    if (delay > 0 && time - is->frame_timer > AV_SYNC_THRESHOLD_MAX)
    {
        is->frame_timer = time;
    }

    SDL_LockMutex(is->video_frm_queue.mutex);
    if (!isnan(vp->pts))
    {
        update_video_pts(is, vp->pts, vp->pos, vp->serial); // update the video clock: timestamp and clock time
    }
    SDL_UnlockMutex(is->video_frm_queue.mutex);

    // decide whether to drop a frame that could not be displayed in time
    if (frame_queue_nb_remaining(&is->video_frm_queue) > 1)  // more than one undisplayed frame in the queue (with only one, never drop)
    {
        frame_t *nextvp = frame_queue_peek_next(&is->video_frm_queue);  // next frame: the frame to display after vp
        duration = vp_duration(is, vp, nextvp);             // duration of current frame vp = nextvp->pts - vp->pts
        // vp missed its slot: the next frame's display time (is->frame_timer + duration)
        // is earlier than the current system time (time)
        if (time > is->frame_timer + duration)
        {
            frame_queue_next(&is->video_frm_queue);         // drop the already displayed frame lastvp, advancing the read pointer by 1 (from lastvp to vp)
            goto retry;
        }
    }

    // remove the element at the read pointer and advance it. Without frame dropping the
    // pointer moves from lastvp to vp; with dropping, from vp to nextvp
    frame_queue_next(&is->video_frm_queue);

display:
    video_display(is);                      // take the current frame vp (or nextvp if a frame was dropped) and display it
}

The basic method of synchronizing video to audio is: if the video is ahead of the audio, the frame is not shown yet and the video waits for the audio; if the video is behind the audio, the current frame is discarded and the next frame is shown directly so the video catches up. The execution flow of this function is shown in the flow chart below:

video_refresh() flow chart

The steps are as follows:

  1. Based on the playback duration of the previous frame lastvp, compute the corrected delay value: last_duration is the nominal playback duration of the previous frame, and delay is its corrected actual duration. From delay, the display time of the current frame can be calculated;

  2. If the display time of the current frame vp has not yet arrived, the previous frame lastvp continues to be shown, and the wait time remaining_time is passed back as an output parameter for the calling function to handle;

  3. If the display time of the current frame vp has arrived, the current frame is shown immediately, and the read pointer is advanced;

In the video_refresh() function, compute_target_delay() is called to adjust the delay value according to the difference between the video clock and the main clock, thereby adjusting the moment when the video frame is played:

// Correct the delay value according to the difference between the video clock and the
// sync clock (e.g. the audio clock), making the video clock catch up with or wait for it.
// The input delay is the previous frame's duration, i.e. how long to wait after the
// previous frame before showing the current one; adjusting it speeds playback up or slows it down.
// The return value is the corrected delay.
static double compute_target_delay(double delay, VideoState *is)
{
    double sync_threshold, diff = 0;

    /* update delay to follow master synchronisation source */
    if (get_master_sync_type(is) != AV_SYNC_VIDEO_MASTER) {
        // difference between the video clock and the sync clock (e.g. the audio clock);
        // a clock's value is the previous frame's pts (really: pts + time elapsed since that frame)
        diff = get_clock(&is->vidclk) - get_master_clock(is);
        // delay is the previous frame's duration: the nominal gap between the current
        // frame's display time and the previous frame's.
        // diff is the video clock minus the sync clock.
        // If delay < AV_SYNC_THRESHOLD_MIN, the sync threshold is AV_SYNC_THRESHOLD_MIN
        // If delay > AV_SYNC_THRESHOLD_MAX, the sync threshold is AV_SYNC_THRESHOLD_MAX
        // If AV_SYNC_THRESHOLD_MIN < delay < AV_SYNC_THRESHOLD_MAX, the sync threshold is delay
        sync_threshold = FFMAX(AV_SYNC_THRESHOLD_MIN, FFMIN(AV_SYNC_THRESHOLD_MAX, delay));
        if (!isnan(diff) && fabs(diff) < is->max_frame_duration) {
            if (diff <= -sync_threshold)        // video clock behind the sync clock, beyond the threshold
                delay = FFMAX(0, delay + diff); // if the current frame is already late (delay+diff<0) then delay=0 (video catches up, show immediately); otherwise delay=delay+diff
            else if (diff >= sync_threshold && delay > AV_SYNC_FRAMEDUP_THRESHOLD)  // video clock ahead beyond the threshold, but the previous frame was overly long
                delay = delay + diff;           // only correct to delay=delay+diff; this branch exists because of AV_SYNC_FRAMEDUP_THRESHOLD, no doubling compensation
            else if (diff >= sync_threshold)    // video clock ahead of the sync clock, beyond the threshold
                delay = 2 * delay;              // slow the video down: double the delay
        }
    }

    av_log(NULL, AV_LOG_TRACE, "video: delay=%0.3f A-V=%f\n", delay, -diff);
    return delay;
}

This function does the following:

  • Compute the deviation diff between the video clock and the audio clock (the master clock): essentially the pts of the last video frame played minus the pts of the last audio frame played. The "last frame" is the most recently played frame; its pts identifies the current playback position (progress) of the video/audio stream.

  • Compute the sync threshold sync_threshold. Its role is: if the difference between the video clock and the audio clock is smaller than the threshold, audio and video are considered in sync and delay is not corrected; if the difference is larger than the threshold, they are considered out of sync and delay must be corrected. The threshold is computed from delay (the previous frame's duration) as follows:

    • If delay < AV_SYNC_THRESHOLD_MIN, the sync threshold is AV_SYNC_THRESHOLD_MIN

    • If delay > AV_SYNC_THRESHOLD_MAX, the sync threshold is AV_SYNC_THRESHOLD_MAX

    • If AV_SYNC_THRESHOLD_MIN < delay < AV_SYNC_THRESHOLD_MAX, the sync threshold is delay

  • The delay correction strategy is as follows:

    • The video clock lags the sync clock beyond the threshold: if the current frame's display time is already past (delay + diff < 0), then delay = 0 (the video catches up; the frame is shown immediately); otherwise delay = delay + diff;

    • The video clock is ahead of the sync clock beyond the threshold: if the previous frame's duration is overly long (above AV_SYNC_FRAMEDUP_THRESHOLD), only correct to delay = delay + diff; otherwise delay = 2 × delay, so video playback slows down and waits for the audio;

    • The difference between the video clock and the audio clock is within the sync threshold, meaning audio and video are in sync: delay is left uncorrected, i.e. it remains the previous frame's duration;
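As a worked example, using ffplay's default thresholds (AV_SYNC_THRESHOLD_MIN = 0.04 s, AV_SYNC_THRESHOLD_MAX = 0.1 s): for 25 fps video, delay = 0.040 s, so sync_threshold = 0.04 s. If diff = -0.100 s (video 100 ms behind the audio), then delay + diff < 0 and the corrected delay is 0, so the current frame is shown immediately; if diff = +0.100 s (video ahead), the corrected delay is 2 × 0.040 = 0.080 s, holding the frame back one extra frame interval.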

To summarize the above process of synchronizing video to audio, refer to the following figure:

Schematic diagram of ffplay audio and video synchronization

In the figure, the small black circle is a frame's actual display time, the small red circle is its nominal display time, the small green square is the current system time, and the small red squares mark time points falling in different intervals. Depending on which interval the current time falls in, the video synchronization strategy is:

  • If the current time is in interval T0, the previous frame is shown again, and the current frame is shown after a delay of remaining_time;

  • If the current time is at position T1, the current frame is shown immediately;

  • If the current time is at position T2, the current frame is skipped and the next frame is shown immediately, to speed up the video's catch-up;

The above is a simplified, intuitive description for ease of understanding. The actual process computes the relevant values and controls playback according to the strategies in compute_target_delay() and video_refresh().


6. Audio playback

When the audio clock is the synchronization master clock, audio plays at its own pace, and video playback refers to the audio clock. The audio playback function is called back by the SDL audio playback thread; the callback is implemented as follows:

// Audio processing callback. Reads the audio packet queue, decodes, and plays.
// Called by SDL when it needs data. It does not run in the user's main thread,
// so shared data must be protected.
// \param[in]  opaque user parameter given when registering the callback
// \param[out] stream address of the audio data buffer; decoded audio data is copied here
// \param[out] len    size of the audio data buffer, in bytes
// After the callback returns, the audio buffer pointed to by stream becomes invalid.
// For stereo, the sample order is LRLRLR
static void sdl_audio_callback(void *opaque, Uint8 *stream, int len)
{
    player_stat_t *is = (player_stat_t *)opaque;
    int audio_size, len1;

    int64_t audio_callback_time = av_gettime_relative();

    while (len > 0) // len equals is->audio_hw_buf_size, the SDL audio buffer size obtained in audio_open()
    {
        if (is->audio_cp_index >= (int)is->audio_frm_size)
        {
           // 1. take a frame from the audio frame queue and convert it to the format the
           //    audio device supports; the return value is the size of the resampled frame
           audio_size = audio_resample(is, audio_callback_time);
           if (audio_size < 0)
           {
                /* if error, just output silence */
               is->p_audio_frm = NULL;
               is->audio_frm_size = SDL_AUDIO_MIN_BUFFER_SIZE / is->audio_param_tgt.frame_size * is->audio_param_tgt.frame_size;
           }
           else
           {
               is->audio_frm_size = audio_size;
           }
           is->audio_cp_index = 0;
        }
        // is->audio_cp_index exists because one audio frame may be larger than the SDL
        // audio buffer, so one frame may need several copies. It indexes how much of the
        // resampled frame has already been copied into the SDL buffer; len1 is the
        // amount copied this round
        len1 = is->audio_frm_size - is->audio_cp_index;
        if (len1 > len)
        {
            len1 = len;
        }
        // 2. copy the converted audio data into the audio buffer stream; playing it from
        //    there is the job of the audio device driver
        if (is->p_audio_frm != NULL)
        {
            memcpy(stream, (uint8_t *)is->p_audio_frm + is->audio_cp_index, len1);
        }
        else
        {
            memset(stream, 0, len1);
        }

        len -= len1;
        stream += len1;
        is->audio_cp_index += len1;
    }
    // is->audio_write_buf_size is the amount of this frame not yet copied into the SDL buffer
    is->audio_write_buf_size = is->audio_frm_size - is->audio_cp_index;
    /* Let's assume the audio driver that is used by SDL has two periods. */
    // 3. update the clock
    if (!isnan(is->audio_clock))
    {
        // Update the audio clock. Update moment: after each copy into the sound card buffer.
        // is->audio_clock, updated earlier in audio_decode_frame, is per audio frame, so the
        // second argument must subtract the time occupied by the data not yet played out
        set_clock_at(&is->audio_clk, 
                     is->audio_clock - (double)(2 * is->audio_hw_buf_size + is->audio_write_buf_size) / is->audio_param_tgt.bytes_per_sec, 
                     is->audio_clock_serial, 
                     audio_callback_time / 1000000.0);
    }
}

7. Update the clock

7.1 Structures and functions

Clock structure:

// clock / sync clock
typedef struct Clock {
    double pts;             // pts of the frame currently being played    /* clock base */
    double pts_drift;       // difference between the current pts and the system time when it was set;
                            // with this drift, the playback time of a later pts can be derived
    double last_updated;    // time when the clock was last updated
    double speed;           // playback speed control
    int serial;             // playback sequence number
    int paused;             // whether paused
    int *queue_serial;      // playback sequence of the packet queue (the serial in PacketQueue)
} Clock;

Take a look at several clock-related functions:

// mainly called by set_clock
static void set_clock_at(Clock *c, double pts, int serial, double time)
{
    c->pts = pts;
    c->last_updated = time;
    c->pts_drift = c->pts - time;
    c->serial = serial;
}

static void set_clock(Clock *c, double pts, int serial)
{
    double time = av_gettime_relative() / 1000000.0;
    set_clock_at(c, pts, serial, time);
}

static double get_clock(Clock *c)
{
    // if the clock's playback sequence differs from that of the packet queue, return NAN:
    // the stream is out of sync or frames need to be dropped
    if (*c->queue_serial != c->serial)
        return NAN;
    if (c->paused) {
        // when paused, return the stored pts
        return c->pts;
    } else {
        double time = av_gettime_relative() / 1000000.0;
        // ignoring playback speed control (speed) for now:
        // at 1x speed this is simply c->pts_drift + time
        return c->pts_drift + time - (time - c->last_updated) * (1.0 - c->speed);
    }
}

Each time a new frame of audio or video is played, set_clock() is called to update the audio clock or the video clock. As set_clock_at() shows, four fields of the Clock structure are updated. Among them, pts_drift is the difference between the current frame's pts and the system time; with this drift, the clock value at any future system time can easily be computed.
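For example, if a frame with pts = 10.0 s is registered at system time 1000.0 s, then pts_drift = 10.0 - 1000.0 = -990.0; when get_clock() is called 0.5 s later (time = 1000.5), it returns -990.0 + 1000.5 = 10.5 s, i.e. at speed 1.0 the playback position has advanced by exactly the elapsed wall-clock time.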

7.2 Update audio clock

Although the sound card plays sample by sample, data is sent to the sound card buffer one audio frame at a time, and the audio playback time is updated once per frame sent, i.e. the audio clock is updated once per audio frame duration.

In the audio_decode_frame() function, the audio clock audio_clock is updated:

/* update the audio clock with the pts */
if (!isnan(af->pts))
  // clock value = this frame's pts + its duration, i.e. the pts at the end of the frame
  is->audio_clock = af->pts + (double) af->frame->nb_samples / af->frame->sample_rate;
else
  is->audio_clock = NAN;
is->audio_clock_serial = af->serial;

In the sdl_audio_callback() function, the audio clock is set:

if (!isnan(is->audio_clock))
{
  // Update the audio clock. Update moment: after each copy into the sound card buffer.
  // is->audio_clock, updated in audio_decode_frame, is per audio frame, so the second
  // argument must subtract the time occupied by the data not yet played out
  set_clock_at(&is->audio_clk, 
               is->audio_clock - (double)(2 * is->audio_hw_buf_size + is->audio_write_buf_size) / is->audio_param_tgt.bytes_per_sec, 
               is->audio_clock_serial, 
               audio_callback_time / 1000000.0);
}
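As a worked example with assumed (illustrative) numbers: for 44100 Hz stereo S16 output, bytes_per_sec = 44100 × 2 channels × 2 bytes = 176400. If audio_clock (the pts at the end of the last resampled frame) is 10.000 s, audio_hw_buf_size = 8192 bytes and audio_write_buf_size = 4096 bytes, the clock is set to 10.000 - (2 × 8192 + 4096) / 176400 ≈ 9.884 s, i.e. the estimated pts of the sample actually being heard now (the factor 2 assumes SDL's audio driver double-buffers, per the comment in the code).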

7.3 Update video clock

The video clock is updated in the video_refresh() function:

SDL_LockMutex(is->video_frm_queue.mutex);
if (!isnan(vp->pts))
{
    update_video_pts(is, vp->pts, vp->pos, vp->serial); // update the video clock: timestamp and clock time
}
SDL_UnlockMutex(is->video_frm_queue.mutex);

8. Playback control

8.1 Pause/Continue

Pause/continue is toggled by the user pressing the space bar: each press of the space bar flips the pause/continue state.

The function call relationship is as follows:

main() -->
event_loop() -->
toggle_pause() -->
stream_toggle_pause()

stream_toggle_pause() implements state flipping:

/* pause or resume the video */
static void stream_toggle_pause(VideoState *is)
{
    if (is->paused) {
        // currently paused, switching to playing. Before resuming, add the time that
        // elapsed during the pause to frame_timer
        is->frame_timer += av_gettime_relative() / 1000000.0 - is->vidclk.last_updated;
        if (is->read_pause_return != AVERROR(ENOSYS)) {
            is->vidclk.paused = 0;
        }
        set_clock(&is->vidclk, get_clock(&is->vidclk), is->vidclk.serial);
    }
    set_clock(&is->extclk, get_clock(&is->extclk), is->extclk.serial);
    is->paused = is->audclk.paused = is->vidclk.paused = is->extclk.paused = !is->paused;
}

There is the following code in the video_refresh() function:

/* called to display each frame */
static void video_refresh(void *opaque, double *remaining_time)
{
    ......
    
    // video playback
    if (is->video_st) {
        ......
        // pause handling: keep displaying the previous frame
        if (is->paused)
            goto display;
        
        ......
    }
    
    ......
}

In the paused state, the player effectively keeps displaying the previous frame (the last frame shown); the image is not updated.

8.2 Play frame by frame

Frame-by-frame playback means the player advances one frame each time the user presses the s key. It is implemented as follows: each press of the s key switches the state to playing, and after one frame has been played the state switches back to paused.

The function call relationship is as follows:

main() -->
event_loop() -->
step_to_next_frame() -->
stream_toggle_pause()

The implementation code is relatively simple, as follows:

static void step_to_next_frame(VideoState *is)
{
    /* if the stream is paused unpause it, then step */
    if (is->paused)
        stream_toggle_pause(is);        // make sure we are playing, so one frame gets shown
    is->step = 1;
}
/* called to display each frame */
static void video_refresh(void *opaque, double *remaining_time)
{
    ......
    
    // video playback
    if (is->video_st) {
        ......
        if (is->step && !is->paused)
            stream_toggle_pause(is);    // in frame-step mode, pause again after one frame is shown
        ......
    }
    
    ......
}

8.3 Variable speed and pitch

  • Two schemes for implementing variable-speed playback with ffmpeg

  • Analysis and implementation of the audio speed-change principle (Tocy, cnblogs.com)

  • Implementing audio speed and pitch change with SoundTouch (wkw1125, CSDN)

  • Summary of speed-change and pitch-shift principles and methods (WELEN, cnblogs.com)

  • ffplay double-speed playback using an ffmpeg filter (CodeOfCC, CSDN)

  • ffplay double-speed playback using sonic (CodeOfCC, CSDN)

  • Open source player ijkplayer (2): solving the double-speed pitch problem (cnblogs.com)

Speed and pitch processing divides into two cases: changing the speed without changing the pitch, and changing the pitch without changing the speed.

Speed change without pitch change means the pitch and semantics stay the same while the speech rate becomes faster or slower. On a spectrogram this appears as the signal being compressed or stretched along the time axis like an accordion: the fundamental frequency is almost unchanged (corresponding to the same pitch), while the whole time course is compressed or expanded and the number of glottal cycles decreases or increases, i.e. the vocal tract moves at a different rate and the speech rate changes with it. In terms of the speech production model, the excitation and the system pass through almost the same states as in the original utterance, but for longer or shorter durations.

Strictly speaking, fundamental frequency and pitch are different concepts: fundamental frequency is the vibration frequency of the vocal cords, while pitch is the human subjective perception of that frequency. The two move together, though: the higher the fundamental frequency, the higher the pitch, and the lower the fundamental frequency, the lower the pitch; pitch is determined by fundamental frequency. Pitch change without speed change therefore means changing the speaker's fundamental frequency while keeping the speech rate and semantics unchanged, i.e. keeping the short-time spectral envelope (formant positions and bandwidths) and the time course essentially unchanged. In terms of the speech production model, pitch shifting changes the excitation source while leaving the formant parameters of the vocal tract model almost unchanged, which keeps semantics and speech rate intact.

In summary, speed change alters the rate of vocal tract movement while trying to keep the excitation source unchanged; pitch shifting alters the excitation source while trying to keep the vocal tract's formant information unchanged. The source and the tract are not independent of each other, however: changing the source inevitably has a nonlinear effect on the tract, and changing the tract likewise affects the source to some degree; the two interact.

Pitch shifting is most common in voice-changing software; speed change is common in players, e.g. double-speed playback (fast or slow playback). Compared with video, where speed change is frame based (dropping or inserting frames), audio speed change is not so simple: naively dropping or repeating sample points causes discontinuities, noise or clicks, and a poor subjective experience.

Two audio speed-change solutions are in common use: SoundTouch and Sonic. ijkplayer uses SoundTouch, and ExoPlayer uses Sonic. On Android there is also a third option: variable-speed playback based on AudioTrack.

Sonic and SoundTouch are used in a similar way: both are packaged libraries whose interface functions process the original audio's PCM data into the target form. At double speed, for example, the number of PCM sample points is roughly halved. The interface provided by SoundTouch is as follows:

Parameter-setting interfaces:

  • setChannels(int) set channels, 1 = mono, 2 = stereo

  • setSampleRate(uint) set the sampling rate

  • setRate(double) sets the playback rate; 1.0 is the original speed, larger is faster, smaller is slower

  • setTempo(double) sets the tempo; 1.0 is the original tempo, larger is faster, smaller is slower

  • setRateChange(double), setTempoChange(double) adjust relative to the original 1.0 by a percentage, range (-50 … +100 %)

  • setPitch(double) sets the pitch; 1.0 is the original value

  • setPitchOctaves(double) adjusts the pitch in octaves relative to the original, range [-1.00, +1.00]

  • setPitchSemiTones(int) adjusts the pitch in semitones relative to the original, range [-12, +12]

PCM processing interface:

  • putSamples(const SAMPLETYPE *samples, uint nSamples) feeds in input sample data

  • receiveSamples(SAMPLETYPE *output, uint maxSamples) fetches processed data; flush() should be called at the end to push the last batch of "residual" data out of the processing pipeline

As the interface above shows, the calling pattern is similar to that of a conventional decoder or demultiplexer.
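Purely as an illustration of how the calls listed above chain together (SoundTouch is a C++ library; SAMPLETYPE and buffer sizes depend on the build configuration, so treat this as a sketch rather than the exact API contract):

#include <SoundTouch.h>
using namespace soundtouch;

// Sketch: process interleaved stereo PCM at 44.1 kHz for 2x-speed
// playback with pitch preserved; roughly half the input samples come out.
void tempo_2x(const SAMPLETYPE *in, unsigned int inFrames,
              SAMPLETYPE *out, unsigned int outCapacityFrames)
{
    SoundTouch st;
    st.setSampleRate(44100);
    st.setChannels(2);
    st.setTempo(2.0);               // 2x speed, pitch unchanged

    st.putSamples(in, inFrames);    // feed input PCM (frames per channel)
    st.flush();                     // push the pipeline's residual data out

    unsigned int got = 0, n;
    while ((n = st.receiveSamples(out + got * 2, outCapacityFrames - got)) > 0)
        got += n;                   // drain processed samples
}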

8.4 SEEK operation (fast forward and fast rewind)

Related reading: ffplay analysis: seek operation handling (CSDN blog); ffplay source code analysis 7: playback control (Ye Yu, cnblogs.com).

8.4.1 Data structure and SEEK flags

The SEEK operation implements user-driven changes of the playback position, e.g. dragging the progress bar with the mouse.

The relevant data variables are defined as follows:

typedef struct VideoState {
    ......
    int seek_req;                   // marks a pending SEEK request
    int seek_flags;                 // SEEK flags, such as AVSEEK_FLAG_BYTE
    int64_t seek_pos;               // SEEK target position (current position + increment)
    int64_t seek_rel;               // position increment of this SEEK
    ......
} VideoState;

VideoState.seek_flags holds the SEEK flags, whose types are defined as follows:

#define AVSEEK_FLAG_BACKWARD 1 ///< seek backward
#define AVSEEK_FLAG_BYTE     2 ///< seeking based on position in bytes
#define AVSEEK_FLAG_ANY      4 ///< seek to any frame, even non-keyframes
#define AVSEEK_FLAG_FRAME    8 ///< seeking based on frame number

Depending on the SEEK flags, the SEEK target playback point (hereafter the SEEK point) is determined in the following cases:

  1. AVSEEK_FLAG_BYTE: The SEEK point corresponds to the position in the file (byte representation). Some demuxers may not support this case.

  2. AVSEEK_FLAG_FRAME: The SEEK point corresponds to the frame number in the stream, and the stream is specified by stream_index. Some demuxers may not support this case.

    • If neither of the above two flags is set and stream_index is valid: the SEEK point is a timestamp in the time_base units of the stream specified by stream_index; its value is the target playback time in seconds converted into that stream's time_base units.

    • If neither of the above two flags is set and stream_index is -1: the SEEK point is a timestamp in AV_TIME_BASE units; its value is "target playback time in seconds × AV_TIME_BASE".

  3. AVSEEK_FLAG_ANY: the playback point may stop at any frame, including non-key frames. Some demuxers may not support this case.

  4. AVSEEK_FLAG_BACKWARD: Ignored.

Among them, AV_TIME_BASE is the time base used internally by FFmpeg, defined as follows:

/**
 * Internal time base represented as integer
 */
#define AV_TIME_BASE            1000000

AV_TIME_BASE is 1000000, i.e. internal timestamps are expressed in microseconds.
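For example, a time-based seek to the 60 s mark with stream_index = -1 (so min_ts/ts/max_ts are in AV_TIME_BASE units) might look like this sketch, assuming ic is an opened AVFormatContext:

// Sketch: seek to t = 60 s on the default stream; no byte/frame flags,
// so timestamps are interpreted in AV_TIME_BASE (microsecond) units.
int64_t target = 60 * (int64_t)AV_TIME_BASE;
int ret = avformat_seek_file(ic, -1,
                             INT64_MIN,   // no lower bound on the result
                             target,      // preferred position
                             INT64_MAX,   // no upper bound on the result
                             0);          // flags = 0
if (ret < 0)
    av_log(NULL, AV_LOG_ERROR, "seek to %" PRId64 " failed\n", target);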

8.4.2 SEEK trigger mode

[Fast forward]/[rewind] control is implemented through the asynchronous event mechanism. The [fast forward]/[rewind] events are triggered by the four arrow keys: the left and right keys seek by 10 s, the up and down keys by 60 s. In the event-loop logic of the main function, SDL captures the message for each key and then jumps via goto to do_seek, where the specific handling logic runs.

The SDL message handling performed by the event_loop() function contains the following code fragment:

// after an event wakes the main thread, check the event type and act on it
case SDLK_LEFT: // left arrow
    incr = seek_interval ? -seek_interval : -10.0; // back 10 s
    goto do_seek;
case SDLK_RIGHT: // right arrow
    incr = seek_interval ? seek_interval : 10.0; // forward 10 s
    goto do_seek;
case SDLK_UP: // up arrow
    incr = 60.0; // forward 60 s
    goto do_seek;
case SDLK_DOWN: // down arrow
    incr = -60.0; // back 60 s
do_seek: // handle the request
        if (seek_by_bytes) { // seek by byte position
            pos = -1;
            // get the current playback position from the frame (post-decode) queues
            if (pos < 0 && cur_stream->video_stream >= 0)
                pos = frame_queue_last_pos(&cur_stream->pictq);
            if (pos < 0 && cur_stream->audio_stream >= 0)
                pos = frame_queue_last_pos(&cur_stream->sampq);
            if (pos < 0)
                pos = avio_tell(cur_stream->ic->pb);
            // from the container bit rate, work out how many bytes one second holds,
            // then how many bytes the seek increment in seconds corresponds to
            if (cur_stream->ic->bit_rate)
                incr *= cur_stream->ic->bit_rate / 8.0; // bits divided by 8 gives bytes
            else
                incr *= 180000.0;
            // the seek position is the current position plus the offset
            pos += incr;
            // only the seek parameters are set here; the actual seek call happens in
            // the data reading thread read_thread()
            stream_seek(cur_stream, pos, incr, 1);
        } else { // seek by time
            // get the current playback time (s)
            pos = get_master_clock(cur_stream);
            // if get_master_clock(cur_stream) returns a failure value, fall back to this
            if (isnan(pos))
                pos = (double)cur_stream->seek_pos / AV_TIME_BASE;
            // the seek position is the current position plus the offset
            pos += incr; 
            // cur_stream->ic->start_time is the time of the container's first frame;
            // if the seek value is smaller, reset it to cur_stream->ic->start_time
            if (cur_stream->ic->start_time != AV_NOPTS_VALUE && pos < cur_stream->ic->start_time / (double)AV_TIME_BASE)
                pos = cur_stream->ic->start_time / (double)AV_TIME_BASE;
            // only the seek parameters are set here; the actual seek call happens in
            // the data reading thread read_thread()
            // pos and incr are both converted to microseconds
            stream_seek(cur_stream, (int64_t)(pos * AV_TIME_BASE), (int64_t)(incr * AV_TIME_BASE), 0);
        }
    break;

When seek_by_bytes is in effect (corresponding to the AVSEEK_FLAG_BYTE flag), the SEEK point is a position in the file, and the code above converts the playback increment into the number of bytes corresponding to one second of data; when it is not in effect, the SEEK point is a playback time.
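For example, if the container reports bit_rate = 1,440,000 b/s, then one second corresponds to 1,440,000 / 8 = 180,000 bytes, so pressing RIGHT (incr = 10.0) advances the byte position by about 1.8 MB; the constant 180000.0 in the code is exactly this one-second byte count, used as a fallback when the bit rate is unknown.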

This code does the following:

  1. First determine the playback progress increment (SEEK increment) and the target playback point (SEEK point) of the SEEK operation. When seek_by_bytes is not in effect, the increment is the chosen value, e.g. 10.0 seconds when the user presses the RIGHT key.

  2. The SEEK point is obtained by adding the progress increment to the synchronization master clock. The relevant values are first recorded for use by the subsequent SEEK operation: stream_seek(cur_stream, (int64_t)(pos * AV_TIME_BASE), (int64_t)(incr * AV_TIME_BASE), 0); records the two parameters, target playback point and playback progress increment, accurate to microseconds.

Now look at the implementation of the stream_seek() function, which is just variable assignment:

/* seek in the stream */
// only the seek parameters are set here; the actual seek call happens in the data
// reading thread read_thread()
static void stream_seek(VideoState *is, int64_t pos, int64_t rel, int seek_by_bytes)
{
    // set the parameters only if no seek request is already pending
    if (!is->seek_req) {
        is->seek_pos = pos; // seek target position (bytes/microseconds)
        is->seek_rel = rel; // seek offset (bytes/microseconds)
        is->seek_flags &= ~AVSEEK_FLAG_BYTE;
        if (seek_by_bytes)
            is->seek_flags |= AVSEEK_FLAG_BYTE;
        is->seek_req = 1;
        // wake the data reading thread read_thread() if it is sleeping
        SDL_CondSignal(is->continue_read_thread);
    }
}

8.4.3 Implementation of the SEEK operation

The SEEK operation is handled in the demultiplexing thread main loop.

static int read_thread(void *arg)
{
    ......
    for (;;) {
        // check whether there is a seek request
        if (is->seek_req) {
            // the seek_target position is not necessarily playable (e.g. not an I frame);
            // the demuxer will shift to a suitable position
            int64_t seek_target = is->seek_pos;
            // smallest acceptable position for seek_target
            int64_t seek_min    = is->seek_rel > 0 ? seek_target - is->seek_rel + 2: INT64_MIN;
            // largest acceptable position for seek_target
            int64_t seek_max    = is->seek_rel < 0 ? seek_target - is->seek_rel - 2: INT64_MAX;
            // call avformat_seek_file() to do the real seek;
            // blocking call, returns only when the seek completes
            ret = avformat_seek_file(is->ic, -1, seek_min, seek_target, seek_max, is->seek_flags);
            if (ret < 0) {
                av_log(NULL, AV_LOG_ERROR,
                       "%s: error while seeking\n", is->ic->url);
            } else {
                // on seek, clear the old data and reset the decoders
                if (is->audio_stream >= 0) {
                    // clear the (pre-decode) packet queue
                    packet_queue_flush(&is->audioq);
                    // insert flush_pkt to start a new playback sequence (serial);
                    // when the decoder reads flush_pkt it clears its internally buffered
                    // packet data, and serial is incremented
                    packet_queue_put(&is->audioq, &flush_pkt);
                }
                if (is->subtitle_stream >= 0) {
                    packet_queue_flush(&is->subtitleq);
                    packet_queue_put(&is->subtitleq, &flush_pkt);
                }
                if (is->video_stream >= 0) {
                    packet_queue_flush(&is->videoq);
                    packet_queue_put(&is->videoq, &flush_pkt);
                }
                if (is->seek_flags & AVSEEK_FLAG_BYTE) {
                   set_clock(&is->extclk, NAN, 0);
                } else {
                   set_clock(&is->extclk, seek_target / (double)AV_TIME_BASE, 0);
                }
            }
            is->seek_req = 0;
            is->queue_attachments_req = 1;
            is->eof = 0;
            // if paused, show the next frame and then pause again
            if (is->paused)
                step_to_next_frame(is);
        }
    }
    ......
}
  1. Call avformat_seek_file() to perform the SEEK point switch inside the demultiplexer;

    // function prototype
    int avformat_seek_file(AVFormatContext *s, int stream_index, int64_t min_ts, int64_t ts, int64_t max_ts, int flags);
    // calling code
    ret = avformat_seek_file(is->ic, -1, seek_min, seek_target, seek_max, is->seek_flags);

    This function returns only after the SEEK operation completes. The actual playback point is kept as close as possible to the parameter ts and is guaranteed to lie within the interval [min_ts, max_ts]. The playback point is not necessarily exactly at ts because the position ts may not be playable. How the three SEEK-point parameters (actual arguments seek_min, seek_target, seek_max) are interpreted depends on the SEEK flags (actual argument is->seek_flags); here the value of is->seek_flags is 0.

  2. Flush each decoder's buffered frames so the frames of the current playback sequence finish playing, then start a new playback sequence (playback sequences are marked by the serial variable in each data structure). The code is as follows:

    if (is->video_stream >= 0) {
        packet_queue_flush(&is->videoq);
        packet_queue_put(&is->videoq, &flush_pkt);
    }
  3. Clear the SEEK request flag: is->seek_req = 0;

Original link: Audio and Video Synchronization (CSDN blog): blog.csdn.net/irainsa/article/details/130372979