Detailed explanation of low-latency streaming protocols: SRT, WebRTC, LL-HLS, UDP, TCP, RTMP

Low broadcast latency has become a mandatory requirement in any tender or competitive bid for building origins and CDNs. Previously this requirement applied mainly to sports broadcasts, but now operators require broadcast equipment suppliers to provide low latency everywhere: news, concerts, shows, interviews, talk shows, debates, eSports and more.

What is low latency?

Generally speaking, latency is the time difference between when a particular video frame is captured by a device (camera, player, encoder, etc.) and when that frame is played on the end user's display.

What is low latency video streaming?

Low latency must not degrade the quality of signal transmission, which means using minimal buffering during encoding and multiplexing while maintaining a smooth, crisp picture on any device's screen. Another prerequisite is guaranteed delivery: all lost packets should be recovered, and transmission over an open network should not disrupt playback.

More and more services are moving to the cloud to save on rented space, power and hardware costs. This raises the bar for low latency at high RTT (Round-Trip Time), especially when streaming HD and Ultra HD video at high bitrates, for example when the cloud server is located in the United States and the content consumer is in Europe.

In this article, we will analyze the options currently available in the market for low-latency broadcasting.

UDP

Probably the first technology widely used in modern TV broadcasting and associated with the term "low latency" is multicast of MPEG-TS streams over UDP. This format is typically suited to closed, lightly loaded networks where packet loss is minimal: for example, from an encoder to a head-end modulator (usually within the same server rack), or IPTV broadcast over dedicated copper or fiber-optic lines with amplifiers and repeaters.
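To give a feel for how simple the receiving side is, here is an illustrative sketch (not from the article; the multicast group, port and datagram size are assumptions): joining an MPEG-TS multicast over UDP amounts to subscribing to the group and reading 188-byte TS packets.

```python
import socket
import struct

MCAST_GRP = "239.1.1.1"   # placeholder multicast group
MCAST_PORT = 1234          # placeholder port
TS_PACKET = 188            # MPEG-TS packet size in bytes

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(("", MCAST_PORT))

# Join the multicast group on all interfaces.
mreq = struct.pack("4sl", socket.inet_aton(MCAST_GRP), socket.INADDR_ANY)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

while True:
    # A UDP datagram typically carries 7 TS packets (7 * 188 = 1316 bytes).
    data, _ = sock.recvfrom(2048)
    for i in range(0, len(data), TS_PACKET):
        packet = data[i:i + TS_PACKET]
        if packet and packet[0] != 0x47:   # every TS packet starts with the sync byte 0x47
            print("lost sync, possible packet loss")
```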

This technique is widely used and shows good latency characteristics. The combined latency of encoding, transmission over Ethernet and decoding, as implemented by companies on the market, does not exceed 80ms at 25 frames per second; at higher frame rates it is even lower.

Figure 1. UDP broadcast latency measurement

The top half of Figure 1 shows the signal from the SDI capture card, and the bottom half shows the signal after passing through the encoding, multiplexing, broadcasting, receiving, and decoding stages. As shown, the second signal arrives one frame later (in this case, since the signal is 25fps, 1 frame is 40ms). A similar solution was used at the Confederations Cup 2017 and the FIFA World Cup 2018, with a modulator, a distributed DVB-C network and a TV as the end device added to the chain, resulting in a total latency of 220-240ms.

What if the signal goes through an external network? There are various issues to overcome: interference, traffic shaping, blocked channels, hardware errors, broken cables, and software-level problems. In that case, not only low latency but also retransmission of lost packets is required.

In the case of UDP, FEC (Forward Error Correction) with redundancy (additional correction traffic, i.e. overhead) does a good job. At the same time, the demands on network throughput increase, and with them latency and redundancy, depending on the expected percentage of lost packets. The share of packets that FEC can recover is always limited and can vary greatly during transmission over an open network, so transmitting large amounts of data reliably over long distances requires adding more redundant traffic.
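To make the redundancy trade-off concrete, here is a toy illustration (not a production FEC scheme): one XOR parity packet per group of data packets recovers at most one lost packet per group at the cost of 1/N extra traffic.

```python
from functools import reduce

def xor_parity(packets: list[bytes]) -> bytes:
    """Build one parity packet as the byte-wise XOR of a group of equal-size packets."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), packets)

def recover(received: dict[int, bytes], parity: bytes, group_size: int) -> bytes:
    """Recover a single missing packet of the group using the parity packet."""
    missing = [i for i in range(group_size) if i not in received]
    assert len(missing) == 1, "XOR parity can only repair one loss per group"
    return xor_parity(list(received.values()) + [parity])

# 4 data packets + 1 parity packet = 25% overhead, recovers at most 1 loss per group.
group = [bytes([i]) * 188 for i in range(4)]
parity = xor_parity(group)
received = {0: group[0], 1: group[1], 3: group[3]}   # packet 2 was lost in transit
assert recover(received, parity, 4) == group[2]
```

Recovering more than one loss per group requires more parity packets, which is exactly the latency/overhead trade-off described above.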


RTMP

RTMP is a proprietary protocol originally developed by Macromedia (now owned by Adobe) and was very popular when Flash-based applications were at their peak. It comes in several variants, supports TLS/SSL encryption, and even has a UDP-based sibling, RTMFP (Real-Time Media Flow Protocol, for peer-to-peer connections). RTMP splits the data stream into segments whose size can change dynamically. Within a channel, audio and video packets can be interleaved and multiplexed.

Figure 2. RTMP broadcast implementation use case

RTMP builds several virtual channels over which audio, video, metadata, etc. are transmitted. Most CDNs no longer support RTMP as a protocol for distributing traffic to end customers. However, Nginx has its own RTMP module that supports plain RTMP, which runs over TCP and uses port 1935 by default; Nginx can act as an RTMP server and redistribute the content it receives from an RTMP stream. RTMP also remains a popular protocol for delivering traffic to a CDN (first-mile contribution), with further distribution handled by other protocols.

Currently, Flash technology is outdated and unsupported: browsers have either reduced support for it or banned it entirely. RTMP is not supported by HTML5 and is difficult to use in browsers (playback requires the Adobe Flash plugin). To bypass firewalls, RTMPT is used (RTMP wrapped in HTTP requests over the standard ports 80/443 instead of 1935), but this noticeably affects latency and redundancy (by various estimates, RTT and overall latency increase by about 30%). Nonetheless, RTMP is still popular, for example for ingest to YouTube or social media (Facebook's RTMPS).

The main downsides of RTMP are the lack of support for the HEVC/VP9/AV1 codecs and the limit of two audio tracks. RTMP also does not carry timestamps in packet headers; it only contains tags derived from the frame rate, so the decoder does not know exactly when to decode the stream. The receiving component therefore has to generate samples for decoding at an even pace, and the buffer must grow by the size of the packet jitter.

Another problem with RTMP is the retransmission of lost TCP packets, with the head-of-line blocking that entails. To keep return traffic low, acknowledgements (ACKs) do not go to the sender immediately: ACKs or NACKs are sent to the broadcaster only after a whole chain of packets has been received.

According to various estimates, for broadcasting using RTMP, the delay through the complete encoding path (RTMP encoder → RTMP server → RTMP client) is at least two seconds.

CMAF

CMAF (Common Media Application Format) is a format proposed to MPEG by Apple and Microsoft for adaptive broadcasting over HTTP (with a bitrate that adapts to changes in the available network bandwidth). Typically, Apple's HTTP Live Streaming (HLS) uses MPEG-TS segments, while MPEG-DASH uses fragmented MP4. The CMAF standard was released in July 2017. In CMAF, fragmented MP4 segments (ISOBMFF) are transmitted over HTTP, and two different playlists reference the same content, depending on the player: HLS for iOS or MPEG-DASH for Android/Microsoft.

By default, CMAF (like HLS and MPEG-DASH) is not designed for low-latency broadcasting. But industry attention and interest in low latency is growing, so some vendors offer extensions to the standard, such as low-latency CMAF (LL CMAF). This extension assumes that both the broadcaster and the receiver support two techniques:

  1. Chunked encoding: split the segment into chunks (small fragments with their own moof+mdat MP4 boxes that together make up a complete, playable segment) and send each chunk as soon as it is ready, before the whole segment is finished.

  2. Chunked transfer encoding: send the chunks to the CDN (origin) using HTTP/1.1 chunked transfer: only one HTTP POST request is issued per segment (for example, every 4 seconds at 25 frames per second), after which 100 chunks (one frame each) can be sent within the same session. Players can also request segments that are still incomplete: the CDN serves the already-completed chunks using chunked transfer encoding and then keeps the connection open until new chunks are appended to the segment being downloaded. Once the entire segment has been formed on the CDN side, the transfer of that segment to the player completes. A minimal sketch of such a chunked upload is shown below.
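As an illustration (with a placeholder origin URL and a local file standing in for the encoder output), such an upload can be driven with Python's requests library, which switches to HTTP/1.1 chunked transfer encoding when given a generator as the request body:

```python
import requests

def cmaf_chunks(segment_path: str, chunk_size: int = 16 * 1024):
    """Yield a CMAF segment piece by piece; in a real encoder each yield would be
    one complete moof+mdat chunk, produced as soon as it is encoded."""
    with open(segment_path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk   # sent immediately as one HTTP/1.1 chunk

# Placeholder origin URL: one POST per segment, many chunks within that single POST.
resp = requests.post(
    "https://origin.example.com/live/stream/segment_0001.m4s",
    data=cmaf_chunks("segment_0001.m4s"),   # a generator body triggers chunked transfer encoding
    headers={"Content-Type": "video/mp4"},
)
print(resp.status_code)
```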

Figure 3. Standard block CMAF

If you want to switch between profiles, buffering is required (at least 2 seconds). Given this, and possible distribution issues, the standard's developers claim a latency of less than 3 seconds. At the same time, the important features are preserved: scaling to thousands of simultaneous clients via CDN, encryption (including Common Encryption support), HEVC and WebVTT (subtitle) support, guaranteed delivery, and compatibility with different players (Apple/Microsoft). On the downside, LL CMAF support is mandatory on the player side (support for chunked fragments and advanced management of internal buffers). However, in the case of incompatibility, players can still consume the content within the CMAF specification, with the standard latencies of HLS or DASH.

LL-HLS

In June 2019, Apple released the specification for low-latency HLS.

It consists of the following parts:

  1. Generation of partial segments (fragmented MP4 or TS) with a minimum duration of 200ms, which can be used even before the whole segment they belong to is complete. Outdated partial segments are periodically removed from the playlist;

  2. The server side can use HTTP/2 push to send the updated playlist together with the new segment. However, this recommendation was removed in the January 2020 revision of the specification;

  3. It is the server's responsibility to hold (block) the request until a version of the playlist containing the new segment is available. Blocking playlist reload eliminates polling (see the sketch after this list);

  4. Instead of sending the full playlist, the server sends a playlist delta (the player keeps the previously fetched playlist and then receives only the changes, not the full playlist);

  5. The server announces upcoming partial segments in advance (preload hints);

  6. Playlist information for adjacent profiles (renditions) is loaded at the same time to speed up switching.
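A rough client-side sketch of the blocking playlist reload from point 3 (the playlist URL is a placeholder; _HLS_msn and _HLS_part are the delivery directives defined by the LL-HLS specification):

```python
import requests

PLAYLIST_URL = "https://cdn.example.com/live/stream.m3u8"   # placeholder

def fetch_blocking(msn: int, part: int) -> str:
    """Ask the server to hold the request until the playlist contains
    media sequence number `msn` and partial segment `part` (LL-HLS blocking reload)."""
    resp = requests.get(
        PLAYLIST_URL,
        params={"_HLS_msn": msn, "_HLS_part": part},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.text

# Instead of polling the playlist repeatedly, each request blocks on the server
# until the next partial segment has been published.
playlist = fetch_blocking(msn=1234, part=0)
print(playlist.splitlines()[:5])
```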

Figure 4. How LL HLS works

With CDNs and players fully supporting the specification, latency is expected to be under 3 seconds. HLS is used very widely for broadcasting on open networks thanks to its excellent scalability, encryption, adaptive bitrate, cross-platform support and backward compatibility, which matters if the player does not support LL HLS.

WebRTC

WebRTC (Web Real-Time Communication) is an open-source protocol developed by Google in 2011. It is used in Google Hangouts, Slack, BigBlueButton and YouTube Live. WebRTC is a set of standards, protocols, and JavaScript programming interfaces that implements end-to-end encryption of point-to-point connections using DTLS-SRTP. The technology requires no third-party plug-ins or software and can pass through firewalls (for example, during a video conference in a browser) without loss of quality or added latency. For video playback, a UDP-based WebRTC implementation is typically used.

The protocol works as follows: a host sends a connection request to the peer it wants to reach. Until a connection between the peers is established, they communicate with each other through a third-party signaling server. Each peer then asks a STUN server "Who am I?" (i.e., how can I be reached from the outside?).

There are public Google STUN servers (e.g. stun.l.google.com:19302). The STUN server returns a list of IP addresses and ports through which the current host can be reached; ICE candidates are formed from this list. The second client does the same. The ICE candidates are exchanged through the signaling server, and at this stage a peer-to-peer connection is established.

If a direct connection cannot be established, a so-called TURN server acts as a relay/proxy and is also added to the list of ICE candidates.
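As an illustrative sketch of the offer/ICE flow described above (not from the article), the same steps can be driven from Python with the aiortc library; the signaling channel itself is omitted, since it is application-specific:

```python
import asyncio
from aiortc import RTCPeerConnection, RTCConfiguration, RTCIceServer

async def make_offer() -> str:
    # Use a public Google STUN server to learn how this host is seen from outside.
    pc = RTCPeerConnection(
        RTCConfiguration(iceServers=[RTCIceServer(urls="stun:stun.l.google.com:19302")])
    )
    pc.createDataChannel("chat")            # something to negotiate
    offer = await pc.createOffer()
    await pc.setLocalDescription(offer)     # ICE candidate gathering runs here
    # The resulting SDP (with ICE candidates) would be passed to the peer
    # over a signaling channel of your choice (WebSocket, HTTP, etc.).
    return pc.localDescription.sdp

print(asyncio.run(make_offer()))
```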

The SCTP (application data) and SRTP (audio and video data) protocols are responsible for multiplexing, transmission, congestion control and reliable delivery. DTLS is used for the handshake exchange and for encrypting the subsequent traffic.

Figure 5. WebRTC protocol stack

Opus and VP8 are used as codecs. The maximum supported resolution is 720p at 30fps, with a bitrate of up to 2 Mbps.

One downside of WebRTC in terms of security is that it can reveal the real IP address even behind a NAT and when using Tor or a proxy server. Because of its connection structure, WebRTC is poorly suited to a large number of simultaneous viewers (it is difficult to scale), and CDNs currently rarely support it. Finally, WebRTC is inferior to other protocols in terms of encoding quality and the maximum amount of data transferred.

WebRTC is not available in Safari and is only partially supported in Bowser and Edge. Google claims latency of less than a second. The protocol can be used not only for video conferencing but also for applications such as file transfer.

SRT

SRT (Secure Reliable Transport) is a protocol developed by Haivision in 2012. It is based on UDT (UDP-based Data Transfer Protocol) and ARQ packet recovery. It supports AES-128 and AES-256 encryption. In addition to listener (server) mode, it supports caller (client) and rendezvous modes (where both parties initiate the connection), which allows connections to be established through firewalls and NATs. The SRT handshake takes place within existing security policies, so external connections can be made without opening permanent external ports in the firewall.

SRT includes a timestamp in each packet, which allows playback at the stream's encoding rate without extensive buffering while smoothing out jitter (variation in packet arrival rate) and fluctuations in the incoming bitrate. Unlike TCP, where the loss of one packet can cause the entire chain of packets starting from the lost one to be resent, SRT identifies the specific packet by its number and resends only that packet. This has a positive effect on latency and redundancy.

Retransmitted packets have higher priority than the regular stream. Unlike standard UDT, SRT completely redesigned the retransmission architecture and reacts immediately when a packet is lost. This technique is a variant of selective-repeat/reject ARQ. Note that a given lost packet is resent only a limited number of times: when a packet's age exceeds 125% of the total configured latency, the sender skips it. SRT also supports FEC, and it is up to the user to decide which of the two technologies (or both) to use to balance the lowest latency against the highest transmission reliability.
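A simplified model of the sender-side rule just described (an illustration, not the libsrt implementation): lost packets are retransmitted with priority, but any packet older than 125% of the configured latency is dropped so the end-to-end delay stays bounded.

```python
import time

LATENCY_MS = 120                        # configured SRT latency (illustrative value)
DROP_THRESHOLD_MS = LATENCY_MS * 1.25   # packets older than this are skipped

class LossList:
    """Toy model of SRT's sender-side retransmission queue."""
    def __init__(self):
        self.pending = {}               # sequence number -> original send timestamp

    def on_sent(self, seq: int):
        self.pending[seq] = time.monotonic()

    def on_ack(self, seq: int):
        self.pending.pop(seq, None)

    def packets_to_retransmit(self) -> list[int]:
        now = time.monotonic()
        resend, dropped = [], []
        for seq, sent_at in self.pending.items():
            age_ms = (now - sent_at) * 1000
            if age_ms > DROP_THRESHOLD_MS:
                dropped.append(seq)     # too late to be useful: skip, keep latency bounded
            else:
                resend.append(seq)      # retransmit with priority over new data
        for seq in dropped:
            del self.pending[seq]
        return resend
```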

Figure 6. How SRT works on an open network

Data transmission in SRT can be bidirectional: both endpoints can send data at the same time, and either can be the listening or the initiating party. Rendezvous mode can be used when both parties need to initiate the connection. The protocol has an internal multiplexing mechanism that allows several streams of a session to be carried over one connection using a single UDP port. SRT is also suitable for fast file transfer, an application first introduced in UDT.

SRT has a network congestion control mechanism. Every 10ms the sender receives the latest data on RTT and its variation, the available buffer size, the packet reception rate and the approximate bandwidth of the current link. There is also a limit on the minimum interval between two consecutively sent packets; packets that do not make it in time are removed from the queue.

The developers claim that the lowest achievable latency with SRT is 120ms, with the buffers set to the minimum, over short distances on a closed network. The recommended total latency for stable broadcasting is 3-4x RTT. SRT also handles long distances (several thousand kilometers) and high bitrates (10 Mbps and above) better than its competitor RTMP.

Figure 7. SRT Broadcast Delay Test

In the example above, the lab-measured SRT broadcast latency is 3 frames at 25fps, i.e. 40ms * 3 = 120ms. From this we can conclude that latency on the order of the 0.1 seconds achievable with UDP broadcast is also achievable with SRT. The scalability of SRT is not at the level of HLS or DASH/CMAF, but SRT is well supported by CDNs and relays, and it can also broadcast directly to end clients in listener mode through media servers.

In 2017, Haivision open-sourced the SRT library and created the SRT Alliance, which now includes more than 350 members.

Summary

As a summary, the table below compares the protocols, with the following notes:

  1. Delivery by a CDN to end users is not supported; content streaming is supported only to the last mile, for example to a CDN or a streamer.

  2. Not supported in browser

  3. Not supported in Safari

Right now, everything that is open source and well documented is rapidly gaining popularity. It can be argued that protocols like WebRTC and SRT have a long-term future in their respective applications. In terms of the lowest latency, they have surpassed adaptive broadcasting over HTTP while maintaining reliable transport, low redundancy, and support for encryption (AES for SRT, DTLS/SRTP for WebRTC).

Also, the "younger brother" of SRT (in terms of when the protocol was created, not in terms of features and capabilities), the RIST protocol, has recently been gaining popularity, but that is another topic. Meanwhile, RTMP is actively being crowded out by newer competitors, and given the lack of native browser support it is unlikely to regain wide use anytime soon.

