Deep optimization of RTSP live-streaming latency

ijkPlayer is now the first choice for many players and live-streaming platforms, and most developers, whether Android or iOS engineers, have come across it. I once raised a question on the ijkPlayer GitHub project: given a 1080p, 30fps video stream, how do you optimize RTSP live-streaming latency down to about 100 ms? It turned out that many people were interested in this topic, asking follow-up questions or offering their own ideas. This article summarizes that discussion and explores RTSP latency optimization with you in more depth.

1. Modify the compilation script to support RTSP

By default, ijkPlayer does not compile in RTSP support, so we have to modify the FFmpeg module compile script, changing the relevant flags from disable to enable:

export COMMON_FF_CFG_FLAGS="$COMMON_FF_CFG_FLAGS --enable-protocol=rtp"
export COMMON_FF_CFG_FLAGS="$COMMON_FF_CFG_FLAGS --enable-protocol=tcp"
export COMMON_FF_CFG_FLAGS="$COMMON_FF_CFG_FLAGS --enable-demuxer=rtsp"
export COMMON_FF_CFG_FLAGS="$COMMON_FF_CFG_FLAGS --enable-demuxer=sdp"
export COMMON_FF_CFG_FLAGS="$COMMON_FF_CFG_FLAGS --enable-demuxer=rtp"
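With those flags enabled, FFmpeg and ijkPlayer both need to be rebuilt. As a sketch of the usual Android build sequence (the script names assume the standard ijkPlayer repository layout, and the flags above typically live in whichever config/module-*.sh file config/module.sh points at):

```shell
cd ijkplayer
./init-android.sh              # fetch the FFmpeg sources
cd android/contrib
./compile-ffmpeg.sh clean
./compile-ffmpeg.sh all        # rebuild FFmpeg with RTSP/RTP/SDP enabled
cd ..
./compile-ijk.sh all           # rebuild the ijkplayer .so libraries
```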

2. Modify the option parameters of the player

// frame-drop threshold: drop frames when video lags behind audio
mediaPlayer.setOption(IjkMediaPlayer.OPT_CATEGORY_PLAYER, "framedrop", 30);
// video frame rate
mediaPlayer.setOption(IjkMediaPlayer.OPT_CATEGORY_PLAYER, "fps", 30);
// loop filter: 48 = AVDISCARD_ALL, skip it entirely
mediaPlayer.setOption(IjkMediaPlayer.OPT_CATEGORY_CODEC, "skip_loop_filter", 48);
// disable packet buffering
mediaPlayer.setOption(IjkMediaPlayer.OPT_CATEGORY_PLAYER, "packet-buffering", 0);
mediaPlayer.setOption(IjkMediaPlayer.OPT_CATEGORY_FORMAT, "fflags", "nobuffer");
// don't limit the input buffer size (useful for live streams)
mediaPlayer.setOption(IjkMediaPlayer.OPT_CATEGORY_PLAYER, "infbuf", 1);
// maximum buffer size
mediaPlayer.setOption(IjkMediaPlayer.OPT_CATEGORY_FORMAT, "max-buffer-size", 1024);
// minimum number of frames to decode before playback starts
mediaPlayer.setOption(IjkMediaPlayer.OPT_CATEGORY_PLAYER, "min-frames", 3);
// start playing as soon as prepared
mediaPlayer.setOption(IjkMediaPlayer.OPT_CATEGORY_PLAYER, "start-on-prepared", 1);
// probe size: bytes read to detect stream info
mediaPlayer.setOption(IjkMediaPlayer.OPT_CATEGORY_FORMAT, "probesize", "4096");
// maximum duration (in microseconds) spent analyzing the stream
mediaPlayer.setOption(IjkMediaPlayer.OPT_CATEGORY_FORMAT, "analyzeduration", "2000000");

It is worth noting that ijkPlayer pulls RTSP streams over UDP by default because it is faster. If you need reliability and less packet loss, you can switch to TCP transport:

mediaPlayer.setOption(IjkMediaPlayer.OPT_CATEGORY_FORMAT, "rtsp_transport", "tcp");

In addition, you can enable hardware decoding as follows (note that "mediacodec" must be set to 1 to turn it on). If hardware decoding fails to open, ijkPlayer automatically falls back to software decoding:

mediaPlayer.setOption(IjkMediaPlayer.OPT_CATEGORY_PLAYER, "mediacodec", 1);
mediaPlayer.setOption(IjkMediaPlayer.OPT_CATEGORY_PLAYER, "mediacodec-auto-rotate", 0);
mediaPlayer.setOption(IjkMediaPlayer.OPT_CATEGORY_PLAYER, "mediacodec-handle-resolution-change", 0);

3. Packet loss due to network jitter

When pulling a stream, audio packets and video packets are stored in separate buffer queues. Network jitter makes those jitter buffers grow, which shows up as stuttering: the queues get longer, and decoding and rendering fall further behind. At that point we need to actively drop packets to catch up with the current timestamp. Since A/V sync is usually driven by the audio clock (people are more sensitive to audio glitches), we drop from the video queue first. But when dropping video packets we must discard a whole GOP: B-frames and P-frames cannot be decoded without their I-frame, and dropping only part of a GOP causes visual corruption. A developer known as Big Tooth has written an excellent article on ijkPlayer's live latency: ijkplay broadcast live stream delay control summary.
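As a rough illustration of the GOP-level drop described above, here is a minimal self-contained sketch (the Packet type and drop_one_gop are hypothetical, not ijkPlayer's actual queue code): it discards the head of the video queue up to the next keyframe, so decoding can resume cleanly at an I-frame.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical video-queue node; ijkPlayer's real queue holds AVPackets. */
typedef struct Packet {
    bool keyframe;        /* true for an I-frame */
    int64_t pts;          /* presentation timestamp, ms */
    struct Packet *next;
} Packet;

/* Drop packets from the head of the queue up to (but not including) the
 * next keyframe. Returns the number of packets dropped. */
static int drop_one_gop(Packet **head)
{
    int dropped = 0;
    if (!*head)
        return 0;
    do {
        Packet *p = *head;   /* always drop the current head (stale GOP) */
        *head = p->next;
        dropped++;
    } while (*head && !(*head)->keyframe);
    return dropped;
}
```

In practice the player would also compare the queue's buffered duration against the master (audio) clock before deciding to drop at all.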

4. Set the decoder to zero latency

You have probably heard of encoder zero latency (zerolatency), but perhaps not of decoder zero latency. By default the decoder caches several frames, about 3-5, which it uses to decode subsequent dependent frames. Repeated testing showed that these cached frames add more than 100 ms of latency; in other words, if the cached frames can be removed, more than 100 ms of latency can be saved. In the AVCodecContext structure in avcodec.h there is a field (flags) that can configure the decoder for low delay:

typedef struct AVCodecContext {
......
int flags;
......
}

To remove the decoder's cached frames, we set the CODEC_FLAG_LOW_DELAY flag (renamed AV_CODEC_FLAG_LOW_DELAY in newer FFmpeg) when initializing the decoder:

//set decoder to low delay
codec_ctx->flags |= CODEC_FLAG_LOW_DELAY;

5. Reduce FFmpeg's frame-splitting wait latency

FFmpeg splits frames by treating the start code of the next frame as the terminator of the current frame. The start code is normally 0x00 0x00 0x00 0x01 or 0x00 0x00 0x01. Waiting for the next start code adds one full frame of latency. Can it be removed? If the stream provides an end-of-frame signal, we can split on that signal instead and eliminate the one-frame delay. So the task is to find an end-of-frame marker and use it, rather than the next frame's start code, to split frames. The full call chain is: read_frame -> read_frame_internal -> parse_packet -> av_parser_parse2 -> parser_parse -> ff_combine_frame.
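To see why start-code splitting costs one frame of latency, here is a tiny illustrative sketch of scanning for an Annex-B start code (this is not FFmpeg's actual h264_find_frame_end, just the core idea): the current frame can only be terminated once this scan succeeds on bytes belonging to the *next* frame.

```c
#include <stdint.h>

/* Find the next Annex-B start code (00 00 01 or 00 00 00 01) at or after
 * offset `from`; returns its index, or -1 if no start code is present yet.
 * Until the next frame's start code arrives, the current frame cannot be
 * terminated -- that wait is the one-frame delay. */
static int find_start_code(const uint8_t *buf, int len, int from)
{
    for (int i = from; i + 2 < len; i++) {
        if (buf[i] == 0 && buf[i + 1] == 0 &&
            (buf[i + 2] == 1 ||
             (i + 3 < len && buf[i + 2] == 0 && buf[i + 3] == 1)))
            return i;
    }
    return -1;
}
```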

1. Find the current frame terminator

In the rtp_parse_packet_internal method in rtpdec.c there is exactly such an end-of-frame signal: the RTP marker bit. We store it in a global variable here:

int mark_flag; // global: marker bit of the most recent RTP packet (end of frame)

static int rtp_parse_packet_internal(RTPDemuxContext *s, AVPacket *pkt,
                                     const uint8_t *buf, int len)
{
    ......

    if (buf[1] & 0x80)
        flags |= RTP_FLAG_MARKER;
    //the marker bit signals the end of a frame
    mark_flag = flags & RTP_FLAG_MARKER;

    ......
}
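For reference, the marker lives in the second byte of the RTP fixed header (RFC 3550): bit 7 is the marker bit and the low 7 bits are the payload type. A small self-contained sketch of the field extraction (the helper names are illustrative, not FFmpeg's):

```c
#include <stdint.h>

/* RTP fixed header, byte 1: M (1 bit) | PT (7 bits). See RFC 3550. */
static int rtp_marker(const uint8_t *buf)
{
    return (buf[1] >> 7) & 1;        /* same test as buf[1] & 0x80 above */
}

static int rtp_payload_type(const uint8_t *buf)
{
    return buf[1] & 0x7f;            /* e.g. 96 for dynamic H.264 */
}

/* Bytes 2-3: big-endian sequence number, useful for loss detection. */
static uint16_t rtp_seq(const uint8_t *buf)
{
    return (uint16_t)((buf[2] << 8) | buf[3]);
}
```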

2. Remove the while loop of parse_packet

Externally we call read_frame in libavformat/utils.c to read one frame of data; read_frame calls the internal method read_frame_internal, which in turn calls parse_packet. parse_packet contains a while loop; we remove the loop body and free the allocated memory:

static int parse_packet(AVFormatContext *s, AVPacket *pkt, int stream_index)
{
    ......

//    while (size > 0 || (pkt == &flush_pkt && got_output)) {
        int len;
        int64_t next_pts = pkt->pts;
        int64_t next_dts = pkt->dts;

        av_init_packet(&out_pkt);
        len = av_parser_parse2(st->parser, st->internal->avctx,
                               &out_pkt.data, &out_pkt.size, data, size,
                               pkt->pts, pkt->dts, pkt->pos);
        pkt->pts = pkt->dts = AV_NOPTS_VALUE;
        pkt->pos = -1;
        /* increment read pointer */
        data += len;
        size -= len;

        got_output = !!out_pkt.size;

        if (!out_pkt.size){
            av_packet_unref(&out_pkt);//release current packet
            av_packet_unref(pkt);//release current packet
            return 0;
//            continue;
        }
    ......        
   
        ret = add_to_pktbuf(&s->internal->parse_queue, &out_pkt,
                            &s->internal->parse_queue_end, 1);
        av_packet_unref(&out_pkt);
        if (ret < 0)
            goto fail;
//    }

    /* end of the stream => close and free the parser */
    if (pkt == &flush_pkt) {
        av_parser_close(st->parser);
        st->parser = NULL;
    }

fail:
    av_packet_unref(pkt);
    return ret;
}

3. Modify the frame offset of av_parser_parse2

In parser.c of the libavcodec module, parse_packet calls av_parser_parse2 to parse the data packet, and this method internally tracks the frame offset. Originally it waited for the next frame's start code, but since the current frame now already ends at the marker, the offset for the next frame's start code must not be added for video:

int av_parser_parse2(AVCodecParserContext *s, AVCodecContext *avctx,
                     uint8_t **poutbuf, int *poutbuf_size,
                     const uint8_t *buf, int buf_size,
                     int64_t pts, int64_t dts, int64_t pos)
{
    ......

    /* WARNING: the returned index can be negative */
    index = s->parser->parser_parse(s, avctx, (const uint8_t **) poutbuf,
                                    poutbuf_size, buf, buf_size);
    av_assert0(index > -0x20000000); // The API does not allow returning AVERROR codes
#define FILL(name) if(s->name > 0 && avctx->name <= 0) avctx->name = s->name
    if (avctx->codec_type == AVMEDIA_TYPE_VIDEO) {
        FILL(field_order);
    }

    /* update the file pointer */
    if (*poutbuf_size) {
        /* fill the data for the current frame */
        s->frame_offset = s->next_frame_offset;

        /* offset of the next frame */
//        s->next_frame_offset = s->cur_offset + index;
        //video frame don't plus index
        if (avctx->codec_type == AVMEDIA_TYPE_VIDEO) {
            s->next_frame_offset = s->cur_offset;
        }else{
            s->next_frame_offset = s->cur_offset + index;
        }
        s->fetch_timestamp   = 1;
    }
    if (index < 0)
        index = 0;
    s->cur_offset += index;
    return index;
}

4. Remove parser_parse's search for the frame start code

av_parser_parse2 calls the parser_parse callback. We are decoding H.264 here, so in h264_parser.c of the libavcodec module the structure ff_h264_parser assigns h264_parse to parser_parse:

AVCodecParser ff_h264_parser = {
    .codec_ids      = { AV_CODEC_ID_H264 },
    .priv_data_size = sizeof(H264ParseContext),
    .parser_init    = init,
    .parser_parse   = h264_parse,
    .parser_close   = h264_close,
    .split          = h264_split,
};

Now, in the h264_parse method of h264_parser.c, we remove the search for the next frame's start code as the current frame's terminator:

static int h264_parse(AVCodecParserContext *s,
                      AVCodecContext *avctx,
                      const uint8_t **poutbuf, int *poutbuf_size,
                      const uint8_t *buf, int buf_size,
                      const uint8_t *buf, int buf_size)
{
    ......

    if (s->flags & PARSER_FLAG_COMPLETE_FRAMES) {
        next = buf_size;
    } else {
//TODO: don't use the next frame's start code, modified by xufulong
//        next = h264_find_frame_end(p, buf, buf_size, avctx);
        next = END_NOT_FOUND; // let ff_combine_frame finish the frame on the RTP marker

        if (ff_combine_frame(pc, next, &buf, &buf_size) < 0) {
            *poutbuf      = NULL;
            *poutbuf_size = 0;
            return buf_size;
        }

/*        if (next < 0 && next != END_NOT_FOUND) {
            av_assert1(pc->last_index + next >= 0);
            h264_find_frame_end(p, &pc->buffer[pc->last_index + next], -next, avctx); // update state
        }*/
    }

    ......
}

5. Modify the framing method of parser.c

h264_parse calls the ff_combine_frame framing method in parser.c, where we replace the start code with the marker as the end-of-frame condition:

extern int mark_flag; // reference the global variable set in rtpdec.c

int ff_combine_frame(ParseContext *pc, int next,const uint8_t **buf, int *buf_size)
{
    ......

    /* copy into buffer end return */
//    if (next == END_NOT_FOUND) {
        void *new_buffer = av_fast_realloc(pc->buffer, &pc->buffer_size,
                                           *buf_size + pc->index +
                                           AV_INPUT_BUFFER_PADDING_SIZE);

        if (!new_buffer) {
          
            pc->index = 0;
            return AVERROR(ENOMEM);
        }
        pc->buffer = new_buffer;
        memcpy(&pc->buffer[pc->index], *buf, *buf_size);
        pc->index += *buf_size;
//        return -1;
        if (!mark_flag)
            return -1; // no marker yet: keep buffering (same as END_NOT_FOUND)
        next = 0;      // marker seen: the buffered data is one complete frame
//    }

    ......

}
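The effect of the change can be sketched with a toy, self-contained version of the framing logic (FrameCtx and combine_on_marker are illustrative, not FFmpeg's real ParseContext): data is buffered until the marker arrives, then the whole buffer is emitted as one frame, with no need to ever see the next frame's start code.

```c
#include <stdint.h>
#include <string.h>

typedef struct {
    uint8_t buf[65536];  /* accumulated bytes of the current frame */
    int index;           /* how many bytes are buffered so far */
} FrameCtx;

/* Append a chunk to the frame buffer. Returns the complete frame length
 * once the marker says the frame is finished, -1 while still buffering,
 * or -2 if the frame would overflow the buffer. */
static int combine_on_marker(FrameCtx *c, const uint8_t *chunk, int len, int marker)
{
    if (c->index + len > (int)sizeof(c->buf))
        return -2;                    /* overflow guard */
    memcpy(c->buf + c->index, chunk, len);
    c->index += len;
    if (!marker)
        return -1;                    /* keep buffering, like END_NOT_FOUND */
    int frame_len = c->index;
    c->index = 0;                     /* reset for the next frame */
    return frame_len;
}
```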

After the above modifications, pushing a 1080p, 30fps video stream from a computer on the local network and pulling, decoding, and playing it on an Android device, the overall latency can be optimized to about 130 ms. When pushing from a mobile phone, latency can reach 86 ms.
