[Audio and Video Day 14] WebRTC protocol (1)

protocol

ICE

Interactive Connectivity Establishment (ICE) is a framework that enables web browsers to connect to each other.
There are many reasons why a direct connection from peer A to peer B may not work:

  1. It needs to bypass firewalls that would prevent the connection from being opened.
  2. If your device does not have a public IP address, as is usually the case, it needs to be given a unique address.
  3. If your router does not allow direct connections to peers, data needs to be relayed through a server.

ICE uses STUN and/or TURN servers to accomplish this, as described below.

STUN

Session Traversal Utilities for NAT (STUN) is a protocol used to discover your public address and to identify any restrictions in your router that would prevent direct connections to peers.
The client sends a request to a STUN server on the Internet, and the STUN server replies with the client's public address and whether the client is reachable from behind the router's NAT.

NAT

Network Address Translation (NAT) is used to give your device a public IP address. The router has a public IP address, and each device connected to the router has a private IP address. Requests are translated from the device's private IP to the router's public IP with a unique port. This way each device does not need its own public IP, yet can still be reached from the Internet.
Some routers restrict who can connect to devices on the network. This can mean that even though we have the public IP address found by the STUN server, not just anyone can create a connection. In this case we need to use TURN.

TURN

Some routers that use NAT apply a restriction called "symmetric NAT". This means that the router only accepts connections from peers you have previously connected to. [That is, under symmetric NAT, only a peer you have already connected to can connect back to you]
Traversal Using Relays around NAT (TURN) is meant to bypass the symmetric NAT restriction by opening a connection to a TURN server and relaying all information through that server. You create a connection to the TURN server and tell all peers to send packets to the server, which then forwards them to you. This obviously comes with some overhead, so it is only used if there are no other alternatives.

SDP

Session Description Protocol (SDP) is a standard for describing the multimedia content of a connection, such as resolution, formats, codecs, and encryption, so that both peers can understand each other once data is being transferred. In essence, it is the metadata describing the content, not the media content itself.
Technically speaking, SDP is not really a protocol but a data format used to describe a connection that shares media between devices.
Writing SDP documentation is well beyond the scope of this document; however, here are a few things worth noting:

SDP structure

An SDP consists of one or more lines of UTF-8 text, each beginning with a one-character type, followed by an equals sign ("="), followed by structured text comprising a value or description whose format depends on the type. Lines that begin with a given letter are generally referred to as "letter-lines". For example, lines providing media descriptions have the type "m", so they are called "m-lines".
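
The line structure described above can be illustrated with a small parser. This is only a sketch: the function name parse_sdp_lines is made up for this example, and it checks nothing beyond the "one character, equals sign, value" shape.

```python
def parse_sdp_lines(sdp: str) -> list[tuple[str, str]]:
    """Split SDP text into (type, value) pairs, one per non-empty line."""
    pairs = []
    for raw in sdp.splitlines():
        line = raw.strip()
        if not line:
            continue
        # The type is a single character, followed by "=" and the value.
        key, sep, value = line.partition("=")
        if len(key) != 1 or sep != "=":
            raise ValueError(f"not a valid SDP line: {line!r}")
        pairs.append((key, value))
    return pairs


example = "v=0\nm=audio 4000 RTP/AVP 111\na=rtpmap:111 OPUS/48000/2\n"
for key, value in parse_sdp_lines(example):
    print(f"{key}-line: {value}")
```

Only the first equals sign is treated as the separator, so values that contain "=" themselves (common in a-lines) are kept intact.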

Signaling and Connecting

Signaling: How peers find each other in WebRTC

[Transfer SDP through out-of-band signaling]
When a WebRTC agent starts, it does not know who it is going to communicate with or what they are going to communicate about. Signaling solves this problem. Signaling is used to bootstrap the call, allowing two independent WebRTC agents to start communicating.
Signaling uses a protocol called SDP (Session Description Protocol). Each SDP message consists of key/value pairs and contains a list of "media sections". The SDP exchanged by two WebRTC agents contains the following details:

  • The IPs and ports at which the agent is reachable.
  • The number of audio and video tracks the agent wishes to send.
  • The audio and video codecs each agent supports.
  • The values (uFrag/uPwd) used while connecting.
  • The values (certificate fingerprint) used while securing.

It is important to note that signaling typically happens "out-of-band", which means that applications generally do not use WebRTC itself to exchange signaling messages. Any architecture suitable for sending messages can relay the SDP between the connecting peers, and many applications simply use their existing infrastructure (such as REST endpoints, WebSocket connections, or authentication proxies) to facilitate the exchange of SDP between the right clients.
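
Because signaling is just text, an SDP can be wrapped in whatever message format the existing infrastructure already carries. The sketch below assumes a JSON envelope with "type" and "sdp" fields. The helper names encode_signal and decode_signal are invented for this example, and the transport itself (WebSocket, REST, and so on) is left out.

```python
import json


def encode_signal(kind: str, sdp: str) -> str:
    """Wrap an SDP in a JSON envelope that any text channel can carry."""
    if kind not in ("offer", "answer"):
        raise ValueError(f"unexpected signaling message type: {kind!r}")
    return json.dumps({"type": kind, "sdp": sdp})


def decode_signal(message: str) -> tuple[str, str]:
    """Recover (type, sdp) from a message produced by encode_signal."""
    data = json.loads(message)
    return data["type"], data["sdp"]


wire = encode_signal("offer", "v=0\r\ns=-\r\nt=0 0\r\n")
print(decode_signal(wire))
```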

Connecting and NAT Traversal with STUN/TURN

[Establish connection via ICE]
Once two WebRTC agents have exchanged SDP, they have enough information to attempt to connect to each other. To make this connection happen, WebRTC uses another established technology called ICE (Interactive Connectivity Establishment).
ICE is a protocol that predates WebRTC and allows a direct connection to be established between two agents without a central server. The two agents could be on the same network or on opposite sides of the world.
ICE makes the direct connection possible, but the actual connection process involves a concept called "NAT traversal" and the use of STUN/TURN servers.
Once the two agents have successfully established an ICE connection, WebRTC moves on to the next step: establishing an encrypted transport for sharing audio, video, and data between them.

Signaling

When you create a WebRTC agent, it knows nothing about the other peer. It has no idea who it is going to connect with or what they are going to send. We use signaling to bootstrap the call. After these values are exchanged, the WebRTC agents can communicate directly with each other. Signaling messages are just text. WebRTC does not care how they are transported. They are commonly shared via WebSockets, but that is not a requirement.
WebRTC uses a protocol called SDP. Via this protocol, the two WebRTC agents share all the state required to establish a connection. The protocol itself is simple to read and understand. The complexity comes from understanding all the values that WebRTC populates it with. This protocol is not specific to WebRTC. WebRTC only uses a subset of SDP, so we only cover the parts we need. After we understand the protocol, we will move on to its use in WebRTC.

SDP protocol

SDP is defined in RFC 8866. It is a key/value protocol with a newline after each value. It feels similar to an INI file. A session description contains zero or more media descriptions. Mentally, you can model it as a session description that contains an array of media descriptions.
A media description usually maps to a single media stream. So if you wanted to describe a session with three video streams and two audio tracks, you would have five media descriptions.
Every line in a session description starts with a single character, which is your key. It is followed by an equals sign. Everything after the equals sign is the value. After the value is complete, there is a newline.
The Session Description Protocol defines all the keys that are valid. You can only use letters defined in the protocol as keys. These keys all have significant meaning, which will be explained later. Take this session description excerpt:

a=my-sdp-value
a=second-value

You have two lines, both with the key a. The value in the first line is my-sdp-value, and the value in the second line is second-value.
WebRTC does not use all keys defined by the Session Description Protocol. Only keys used in the JavaScript Session Establishment Protocol (JSEP) defined in RFC 8829 are significant. The seven keys below are all you need to know right now:

v - Version, should be equal to 0.
o - Origin, contains a unique ID that is used for renegotiation.
s - Session Name, should be equal to -.
t - Timing, should be equal to 0 0.
m - Media Description (m=<media> <port> <proto> <fmt> ...), described in detail below.
a - Attribute, a free text field. This is the most common line in WebRTC.
c - Connection Data, should be equal to IN IP4 0.0.0.0.

"Session Description" can contain unlimited "Media Description".
A media description contains a list of formats. These formats map to RTP payload types. The actual codec is then defined by an attribute with the value rtpmap in the media description. Each media description can contain an unlimited number of attributes.
Take this session description excerpt as an example:

v=0
m=audio 4000 RTP/AVP 111
a=rtpmap:111 OPUS/48000/2
m=video 4000 RTP/AVP 96
a=rtpmap:96 VP8/90000
a=my-sdp-value

You have two media descriptions: one of type audio with format 111, and one of type video with format 96. The first media description has only one attribute, which maps payload type 111 to Opus. The second media description has two attributes. The first maps payload type 96 to VP8, and the second is just my-sdp-value.
The following brings together all the concepts we have discussed. These are all the features of the Session Description Protocol that WebRTC uses.

v=0
o=- 0 0 IN IP4 127.0.0.1
s=-
c=IN IP4 127.0.0.1
t=0 0
m=audio 4000 RTP/AVP 111
a=rtpmap:111 OPUS/48000/2
m=video 4002 RTP/AVP 96
a=rtpmap:96 VP8/90000

v, o, s, c and t are defined, but they do not affect the WebRTC session.
You have two media descriptions, one of type audio and one of type video.
Each of them has one attribute, which configures the details of the RTP pipeline.
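
The "session description containing an array of media descriptions" model can be sketched as follows. The function name and the dictionary layout are assumptions made for this example. Session-level lines that appear before the first m-line are simply skipped here.

```python
def parse_media_descriptions(sdp: str) -> list[dict]:
    """Group a-lines under the m-line they follow and collect rtpmap codecs."""
    media = []
    for raw in sdp.splitlines():
        line = raw.strip()
        if line.startswith("m="):
            # m=<media> <port> <proto> <fmt> ...
            kind, port, proto, *formats = line[2:].split()
            media.append({
                "kind": kind,
                "port": int(port),
                "proto": proto,
                "formats": formats,
                "codecs": {},        # payload type -> codec, filled from rtpmap
                "attributes": [],
            })
        elif line.startswith("a=") and media:
            attribute = line[2:]
            media[-1]["attributes"].append(attribute)
            if attribute.startswith("rtpmap:"):
                payload_type, codec = attribute[len("rtpmap:"):].split(" ", 1)
                media[-1]["codecs"][payload_type] = codec
    return media


session = """v=0
o=- 0 0 IN IP4 127.0.0.1
s=-
c=IN IP4 127.0.0.1
t=0 0
m=audio 4000 RTP/AVP 111
a=rtpmap:111 OPUS/48000/2
m=video 4002 RTP/AVP 96
a=rtpmap:96 VP8/90000
"""
for description in parse_media_descriptions(session):
    print(description["kind"], description["formats"], description["codecs"])
```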

How WebRTC uses SDP

WebRTC uses an offer/answer model. This means that one WebRTC agent makes an "offer" to start a call, and the other WebRTC agent "answers" if it is willing to accept what has been offered. This gives the answerer a chance to reject unsupported codecs in the media descriptions. This is how the two peers come to understand which formats they are willing to exchange.
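
A rough way to picture the answerer rejecting unsupported codecs is as a filter over the offered rtpmap entries. This is a simplified sketch: the function name and the idea of matching on the lower-cased codec name alone are assumptions, and a real implementation also compares clock rates and fmtp parameters.

```python
def answer_codecs(offered: dict[str, str], supported: set[str]) -> dict[str, str]:
    """Keep only the offered payload types whose codec the answerer supports."""
    accepted = {}
    for payload_type, codec in offered.items():
        # "opus/48000/2" -> "opus"; codec names are compared case-insensitively.
        name = codec.split("/")[0].lower()
        if name in supported:
            accepted[payload_type] = codec
    return accepted


offer = {"111": "opus/48000/2", "96": "VP8/90000", "98": "VP9/90000"}
print(answer_codecs(offer, {"opus", "vp8"}))
```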

Transceivers are for sending and receiving

Transceivers are a WebRTC-specific concept that you will see in the API. They expose the "media description" to the JavaScript API. Each media description becomes a transceiver. Every time you create a transceiver, a new media description is added to the local session description. Every media description in WebRTC has a direction attribute. This allows a WebRTC agent to declare "I am going to send you this codec, but I am not willing to accept anything back". There are four valid values: sendonly, recvonly, sendrecv, inactive.

SDP values used by WebRTC

This is a list of common attributes that you will see in a session description from a WebRTC agent. Many of these values control the underlying subsystems.

  • group:BUNDLE - Bundling is the act of running multiple types of traffic over one connection. Some WebRTC implementations use a dedicated connection for each media stream. Bundling should be preferred.
  • fingerprint:sha-256 - This is a hash of the certificate the peer is using for DTLS. After the DTLS handshake is completed, you compare this to the actual certificate to confirm that you are communicating with whom you expect.
  • setup - This controls the DTLS agent behavior. It determines whether the agent runs as a client or as a server. Possible values are:
      • setup:active - run as a DTLS client.
      • setup:passive - run as a DTLS server.
      • setup:actpass - let the other WebRTC agent choose.
  • mid - Identifies a media stream within the session description.
  • ice-ufrag - The user fragment value for the ICE agent. Used to authenticate ICE traffic.
  • ice-pwd - The password for the ICE agent. Used to authenticate ICE traffic.
  • rtpmap - Maps a specific codec to an RTP payload type. Payload types are not static, so for every call the offerer decides the payload type for each codec.
  • fmtp - Defines additional values for one payload type. This is useful for communicating a specific video profile or encoder setting.
  • candidate - An ICE candidate that comes from the ICE agent. This is one possible address at which the WebRTC agent is available. These are explained in more detail later.
  • ssrc - A synchronization source (SSRC) defines a single media stream track.
  • label - The ID of this individual stream. mslabel is the ID of a container that can hold multiple streams.
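
The fingerprint comparison mentioned above can be sketched like this, assuming you already hold the peer certificate as DER-encoded bytes after the DTLS handshake. The helper names are invented for this example, and obtaining the certificate from a DTLS library is out of scope here.

```python
import hashlib


def sdp_fingerprint(cert_der: bytes) -> str:
    """Format the SHA-256 hash of a DER certificate the way a=fingerprint does."""
    digest = hashlib.sha256(cert_der).hexdigest().upper()
    pairs = [digest[i:i + 2] for i in range(0, len(digest), 2)]
    return "sha-256 " + ":".join(pairs)


def fingerprint_matches(attribute_value: str, cert_der: bytes) -> bool:
    """Compare the value of an a=fingerprint line with the certificate actually seen."""
    return attribute_value.strip().upper() == sdp_fingerprint(cert_der).upper()


# Placeholder bytes stand in for a real DER-encoded certificate.
print(sdp_fingerprint(b"placeholder certificate bytes"))
```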

WebRTC session example

Below is the full session description generated by the WebRTC client:

v=0
o=- 3546004397921447048 1596742744 IN IP4 0.0.0.0
s=-
t=0 0
a=fingerprint:sha-256 0F:74:31:25:CB:A2:13:EC:28:6F:6D:2C:61:FF:5D:C2:BC:B9:DB:3D:98:14:8D:1A:BB:EA:33:0C:A4:60:A8:8E
a=group:BUNDLE 0 1
m=audio 9 UDP/TLS/RTP/SAVPF 111
c=IN IP4 0.0.0.0
a=setup:active
a=mid:0
a=ice-ufrag:CsxzEWmoKpJyscFj
a=ice-pwd:mktpbhgREmjEwUFSIJyPINPUhgDqJlSd
a=rtcp-mux
a=rtcp-rsize
a=rtpmap:111 opus/48000/2
a=fmtp:111 minptime=10;useinbandfec=1
a=ssrc:350842737 cname:yvKPspsHcYcwGFTw
a=ssrc:350842737 msid:yvKPspsHcYcwGFTw DfQnKjQQuwceLFdV
a=ssrc:350842737 mslabel:yvKPspsHcYcwGFTw
a=ssrc:350842737 label:DfQnKjQQuwceLFdV
a=msid:yvKPspsHcYcwGFTw DfQnKjQQuwceLFdV
a=sendrecv
a=candidate:foundation 1 udp 2130706431 192.168.1.1 53165 typ host generation 0
a=candidate:foundation 2 udp 2130706431 192.168.1.1 53165 typ host generation 0
a=candidate:foundation 1 udp 1694498815 1.2.3.4 57336 typ srflx raddr 0.0.0.0 rport 57336 generation 0
a=candidate:foundation 2 udp 1694498815 1.2.3.4 57336 typ srflx raddr 0.0.0.0 rport 57336 generation 0
a=end-of-candidates
m=video 9 UDP/TLS/RTP/SAVPF 96
c=IN IP4 0.0.0.0
a=setup:active
a=mid:1
a=ice-ufrag:CsxzEWmoKpJyscFj
a=ice-pwd:mktpbhgREmjEwUFSIJyPINPUhgDqJlSd
a=rtcp-mux
a=rtcp-rsize
a=rtpmap:96 VP8/90000
a=ssrc:2180035812 cname:XHbOTNRFnLtesHwJ
a=ssrc:2180035812 msid:XHbOTNRFnLtesHwJ JgtwEhBWNEiOnhuW
a=ssrc:2180035812 mslabel:XHbOTNRFnLtesHwJ
a=ssrc:2180035812 label:JgtwEhBWNEiOnhuW
a=msid:XHbOTNRFnLtesHwJ JgtwEhBWNEiOnhuW
a=sendrecv

Here is what we know from this message:

  • We have two media sections, one audio and one video.
  • Both of them are sendrecv transceivers. We are getting two streams, and we can send two back.
  • We have ICE candidates and authentication details, so we can attempt to connect.
  • We have a certificate fingerprint, so we can have a secure call.
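
To show how the candidate lines in the session above break down, here is a small parser sketch. The field order (foundation, component, transport, priority, address, port, then "typ" and the candidate type) follows the candidate attribute grammar. The function name and the choice to collect trailing fields as key/value "extensions" are assumptions of this example.

```python
def parse_candidate(line: str) -> dict:
    """Split an a=candidate line into its named fields."""
    prefix = "a=candidate:"
    if not line.startswith(prefix):
        raise ValueError(f"not a candidate line: {line!r}")
    fields = line[len(prefix):].split()
    if len(fields) < 8 or fields[6] != "typ":
        raise ValueError(f"malformed candidate line: {line!r}")
    foundation, component, transport, priority, address, port = fields[:6]
    # Everything after the type comes in key/value pairs (raddr, rport, generation, ...).
    extras = fields[8:]
    return {
        "foundation": foundation,
        "component": int(component),
        "transport": transport,
        "priority": int(priority),
        "address": address,
        "port": int(port),
        "type": fields[7],
        "extensions": dict(zip(extras[0::2], extras[1::2])),
    }


line = "a=candidate:foundation 1 udp 1694498815 1.2.3.4 57336 typ srflx raddr 0.0.0.0 rport 57336 generation 0"
print(parse_candidate(line))
```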

Connecting

Why does WebRTC need a dedicated subsystem for connecting?

Most applications deployed today establish client/server connections. A client/server connection requires the server to have a stable, well-known transport address. The client contacts the server, and the server responds.

WebRTC does not use a client/server model; it establishes peer-to-peer (P2P) connections. In a P2P connection, the task of creating the connection is distributed equally between both peers. This is because a transport address (IP and port) in WebRTC cannot be assumed and may even change during a session. WebRTC gathers all the information it can and does its best to achieve bi-directional communication between two WebRTC agents.

However, establishing a peer-to-peer connection can be difficult. The agents may be on different networks with no direct connectivity. Even where direct connectivity exists, you can still run into other issues. In some cases your clients do not speak the same network protocol (UDP <-> TCP) or use different IP versions (IPv4 <-> IPv6).

Despite these difficulties when establishing a P2P connection, you gain advantages over traditional client/server technologies due to the following properties provided by WebRTC.

  1. Reduced Bandwidth Costs
    Because media communication happens directly between peers, you do not have to pay for, or host, a separate server to relay the media.
  2. Lower Latency
    Communication is faster when it is direct. When everything has to run through a server, transmission is slower.
  3. Secure E2E Communication
    Direct communication is more secure. Since users are not routing data through your server, they do not even need to trust that you will not decrypt it.

The process described above is called Interactive Connectivity Establishment (ICE), another protocol that predates WebRTC. ICE tries to find the best way to communicate between two ICE agents. Each ICE agent publishes the ways it can be reached; these are known as candidates. A candidate is essentially a transport address of the agent that it believes the other peer can reach. ICE then determines the best pairing of candidates.
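
One input to that decision is the priority carried in each candidate. The article does not spell out how it is computed, so the sketch below uses the formula and the recommended type preferences from the ICE specification (RFC 8445) as an assumption. With a local preference of 65535 it reproduces the component 1 priorities seen in the session example earlier.

```python
# Recommended type preferences from RFC 8445; higher means "try this first".
TYPE_PREFERENCE = {"host": 126, "prflx": 110, "srflx": 100, "relay": 0}


def candidate_priority(cand_type: str, component: int, local_preference: int = 65535) -> int:
    """priority = 2^24 * type preference + 2^8 * local preference + (256 - component ID)"""
    return (TYPE_PREFERENCE[cand_type] << 24) + (local_preference << 8) + (256 - component)


for cand_type in ("host", "srflx", "relay"):
    print(cand_type, candidate_priority(cand_type, component=1))
```

Host candidates rank above server reflexive ones, which in turn rank above relayed ones, which matches the idea that a relay should only be used when nothing else works.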

Networking real-world constraints

ICE is all about overcoming the constraints of real-world networks. Before we explore the solution, let's talk about the actual problems.

  1. Not on the same network
    It is very easy for hosts on the same network to connect. 192.168.0.1 -> 192.168.0.2 is easy to do. These two hosts can connect to each other without any outside help.
    However, a host behind Router B has no way to directly access anything behind Router A. How would you tell the difference between 192.168.0.1 behind Router A and the same IP behind Router B? They are private IPs. A host behind Router B could send traffic directly to Router A, but the request would end there. How does Router A know which host it should forward the message to?
  2. Protocol restrictions
    Some networks do not allow UDP traffic at all, or maybe they do not allow TCP. Some networks may have a very low MTU (Maximum Transmission Unit). There are many variables that network administrators can change that make communication difficult.
  3. Firewall/IDS Rules
    Another obstacle is "Deep Packet Inspection" and other intelligent filtering. Some network administrators run software that tries to process every packet. Many times this software does not understand WebRTC, so it blocks it because it does not know what to do, for example by treating WebRTC packets as suspicious UDP packets on an arbitrary port that is not whitelisted.

NAT Mapping

NAT (Network Address Translation) mapping is what makes WebRTC connectivity possible. This is how WebRTC allows two peers in completely different subnets to communicate, which solves the "not on the same network" problem described above. Although Agent 1 and Agent 2 are in different networks, they can communicate.
For this communication to happen, a NAT mapping has to be established. Agent 1 uses port 7000 to establish a WebRTC connection with Agent 2. This creates a binding from 192.168.0.1:7000 to 5.0.0.1:7000, which allows Agent 2 to reach Agent 1 by sending packets to 5.0.0.1:7000. Creating a NAT mapping as in this example is like an automated version of port forwarding in your router.

The downside of NAT mapping is that there is no single form of mapping (unlike, for example, static port forwarding), and the behavior is inconsistent between networks. ISPs and hardware manufacturers may do it in different ways. [I don't quite understand the disadvantages of NAT mapping] In some cases, network administrators may even disable it. The good news is that the full range of behaviors is understood and observable, so an ICE agent is able to confirm that it created a NAT mapping and to determine the properties of the mapping. The document that describes these behaviors is RFC 4787.

Creating a mapping

Creating a mapping is the easy part. When you send a packet to an address outside your network, a mapping is created. A NAT mapping is just a temporary public IP and port allocated by your NAT. The outbound message is rewritten so that its source address is the newly mapped address. If a message is sent to the mapping, it is automatically routed back to the host inside the NAT that created it. The details around mappings are where it gets complicated.

Mapping creation behavior

Mapping creation falls into three distinct categories:

  1. Endpoint independent mapping
    One mapping is created for each sender inside the NAT. If you send two packets to two different remote addresses, the NAT mapping is reused. Both remote hosts see the same source IP and port. If the remote hosts respond, the responses are sent back to the same local listener.
    This is the best-case scenario. For a call to work, at least one side must be of this type.
  2. Address dependent mapping
    A new mapping is created every time you send a packet to a new address. If you send two packets to different hosts, two mappings are created. If you send two packets to the same remote host but to different destination ports, no new mapping is created.
  3. Address and port dependent mapping
    A new mapping is created if the remote IP or port is different. If you send two packets to the same remote host but to different destination ports, a new mapping is created.
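
The three mapping behaviors differ only in what the NAT uses as the lookup key for an existing mapping. The toy model below illustrates that. The Nat class, its behavior names, and the sequential port allocation are all inventions of this sketch and do not model any real router.

```python
import itertools


class Nat:
    """Toy NAT that allocates public ports according to a mapping behavior."""

    def __init__(self, behavior: str, public_ip: str = "5.0.0.1", first_port: int = 7000):
        self.behavior = behavior
        self.public_ip = public_ip
        self._ports = itertools.count(first_port)
        self.mappings = {}

    def _key(self, src, dst):
        if self.behavior == "endpoint-independent":
            return (src,)              # one mapping per internal sender
        if self.behavior == "address-dependent":
            return (src, dst[0])       # one mapping per remote IP
        if self.behavior == "address-and-port-dependent":
            return (src, dst)          # one mapping per remote IP and port
        raise ValueError(f"unknown mapping behavior: {self.behavior!r}")

    def send(self, src, dst):
        """Return the public (ip, port) the remote side sees for this packet."""
        key = self._key(src, dst)
        if key not in self.mappings:
            self.mappings[key] = (self.public_ip, next(self._ports))
        return self.mappings[key]


agent = ("192.168.0.1", 7000)
nat = Nat("address-dependent")
print(nat.send(agent, ("1.1.1.1", 3478)))  # first mapping
print(nat.send(agent, ("1.1.1.1", 5000)))  # same remote host: mapping reused
print(nat.send(agent, ("2.2.2.2", 3478)))  # new remote host: new mapping
```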

Mapping filtering behavior

Mapping filtering is the set of rules about who is allowed to use the mapping. They fall into three similar categories:

  1. Endpoint independent filtering
    Anyone can use the mapping. You can share the mapping with multiple other peers, and they can all send traffic to it.
  2. Address dependent filtering
    Only the host the mapping was created for can use the mapping. If you send a packet to host A, you can only get a response from that same host. If host B attempts to send a packet to that mapping, it is ignored.
  3. Address and port dependent filtering
    Only the host and port the mapping was created for can use the mapping. If you send a packet to A:5000, you can only get a response from that same host and port. If A:5001 attempts to send a packet to that mapping, it is ignored.
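
The filtering rules can be expressed as a single predicate over the remote endpoints that the inside host has already sent packets to. The function name and the behavior strings are assumptions of this sketch, and the addresses are documentation placeholders.

```python
def is_allowed(filtering: str, contacted: set, sender: tuple) -> bool:
    """Decide whether an inbound packet from `sender` may use the mapping.

    `contacted` holds the (ip, port) remote endpoints the inside host has
    already sent packets to through this mapping.
    """
    if filtering == "endpoint-independent":
        return True
    if filtering == "address-dependent":
        return any(sender[0] == ip for ip, _port in contacted)
    if filtering == "address-and-port-dependent":
        return sender in contacted
    raise ValueError(f"unknown filtering behavior: {filtering!r}")


contacted = {("203.0.113.10", 5000)}              # we sent a packet to A:5000
print(is_allowed("address-dependent", contacted, ("203.0.113.10", 5001)))            # A:5001
print(is_allowed("address-and-port-dependent", contacted, ("203.0.113.10", 5001)))   # A:5001
print(is_allowed("address-dependent", contacted, ("198.51.100.7", 5000)))            # host B
```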

Mapping refresh

It is recommended that a mapping be destroyed if it has not been used for 5 minutes, but in practice this is entirely up to the ISP or hardware manufacturer.

STUN

STUN (Session Traversal Utilities for NAT) is a protocol that was created for working with NATs. It is another technology that predates WebRTC (and ICE!). It is defined by RFC 8489, which also defines the STUN packet structure. The STUN protocol is also used by ICE and TURN.

STUN is useful because it allows programmatic creation of NAT mappings. Before STUN, we were able to create NAT mappings, but we didn't know what the mapped IPs and ports were! STUN not only enables you to create a mapping, but it also allows you to get the details of the mapping, which you can share with others, and they can then pass data back to you through the mapping you just created.

Let's start with a basic description of STUN. Later, we will expand on it with the use of TURN and ICE. For now, we are only going to describe the request/response flow for creating a mapping. Then we will talk about how to get the details of that mapping to share with others. This is the process that happens when you have a stun: server in the ICE URLs of your WebRTC PeerConnection. In a nutshell, STUN helps an endpoint behind a NAT figure out which mapping was created by asking a STUN server outside the NAT to report what it observes about the request.

Protocol Structure

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|0 0|     STUN Message Type     |         Message Length        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         Magic Cookie                          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
|                     Transaction ID (96 bits)                  |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                             Data                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

STUN message types
Each STUN packet has a type. For now, we only care about the following:

  • Binding Request - 0x0001
  • Binding Response - 0x0101

To create a NAT mapping, we send a Binding Request. The server then responds with a Binding Response.

Message Length
This is the length of the Data section. That section contains arbitrary data defined by the Message Type.

Magic Cookie
The fixed value 0x2112A442, sent in network byte order. It helps distinguish STUN traffic from other protocols.

Transaction ID
A 96-bit identifier that uniquely identifies a request/response pair. It helps you pair up your requests and responses.

Data
Data contains a list of STUN attributes. A STUN attribute has the following structure:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         Type                  |            Length             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         Value (variable)                ....
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The STUN Binding Request uses no attributes. This means a STUN Binding Request contains only the header.

The STUN Binding Response uses an XOR-MAPPED-ADDRESS (0x0020) attribute. This attribute contains an IP and a port. This is the IP and port of the NAT mapping that was created!
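
The article does not describe how the address is encoded inside the attribute. Per RFC 8489, the port is XORed with the top 16 bits of the magic cookie and an IPv4 address is XORed with the whole cookie. The sketch below decodes the attribute value under that assumption. The function name is invented, and only IPv4 is handled.

```python
import ipaddress
import struct

MAGIC_COOKIE = 0x2112A442


def decode_xor_mapped_address(value: bytes) -> tuple[str, int]:
    """Decode the value of an IPv4 XOR-MAPPED-ADDRESS attribute."""
    # Layout: 1 reserved byte, 1 family byte, 2-byte X-Port, 4-byte X-Address.
    _reserved, family, xport, xaddr = struct.unpack("!BBHI", value[:8])
    if family != 0x01:
        raise ValueError("only IPv4 (family 0x01) is handled in this sketch")
    port = xport ^ (MAGIC_COOKIE >> 16)
    address = xaddr ^ MAGIC_COOKIE
    return str(ipaddress.IPv4Address(address)), port


# Attribute value taken from the IPv4 sample response in RFC 5769.
print(decode_xor_mapped_address(bytes.fromhex("0001a147e112a643")))
```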

Create NAT mappings

Creating a NAT mapping using STUN requires only one request. You send a STUN Binding Request to the STUN server. Then, the STUN server responds with a STUN Binding Response. The STUN Binding Response will contain the mapped address. The mapped address is how the STUN server sees you, and is also your NAT mapping. If you want someone to send you packets, then you should share that mapped address.

People also refer to the mapped address as the public IP or Server Reflexive Candidate.

Determine the NAT type

Unfortunately, the mapped address might not be usable in all cases. If the mapping is address dependent, only the STUN server can send traffic back to you. If you share it, messages that another peer tries to send to that address are dropped. This makes it useless for communicating with other peers. If the STUN server can also forward packets to the peer for you, you may find that the address-dependent case is in fact solvable! This leads to the solution using TURN, which is discussed later.

RFC 5780 defines a way to run a test to determine your NAT type. This is useful because you may know ahead of time whether a direct connection is possible.

Origin blog.csdn.net/Magic_o/article/details/130193516