Overview of UDT Source Code Analysis (1)

Introduction

As the bandwidth-delay product (BDP) of networks increases, the commonly used TCP protocol becomes inefficient. This is because its AIMD (additive increase, multiplicative decrease) algorithm drastically reduces the congestion window on packet loss but cannot quickly recover to the available bandwidth. Theoretical flow-level analysis shows that TCP becomes more vulnerable to packet loss as the BDP increases.

In addition, the RTT bias inherent in TCP congestion control has become a serious problem for distributed data-intensive applications. Concurrent TCP flows with different RTTs share bandwidth unfairly. Although common TCP implementations share bandwidth relatively equally in small-BDP networks, TCP-based applications suffer from severe unfairness in high-BDP networks. This RTT bias severely limits the efficiency of distributed computing over wide area networks, such as grid computing over the Internet.

Given this background, there is a need for a transport protocol that supports high-performance data transfer in high-BDP networks. We introduce an application-level transport protocol called UDT (UDP-based Data Transfer protocol), which has its own congestion control algorithm.

Design goals

UDT is mainly used where a small number of bulk sources share abundant bandwidth. The most typical example is grid computing built on optical wide area networks. Some research institutes run their distributed data-intensive applications on such networks, for example remote access to instruments, distributed data mining, and high-resolution multimedia streaming.

The main goals of UDT are efficiency, fairness, and stability. A single UDT flow, or a small number of them, should utilize all the available bandwidth provided by a high-speed link, even if the bandwidth varies drastically. At the same time, all concurrent flows must share bandwidth fairly, independent of their bottleneck bandwidths, start times, and RTTs. Stability requires that the packet sending rate always converge quickly to the available bandwidth and that congestion collapse be avoided.

UDT is not intended to replace TCP where the bandwidth is relatively small and there are a large number of multiplexed short-lived flows.

UDT is designed to be TCP-friendly and to coexist with TCP. The bandwidth allocated to UDT should not exceed its fair share under the max-min rule. (Note that the max-min rule allows UDT to take the available bandwidth that TCP cannot use on high-BDP links.)

Protocol description

1. Overview

UDT is duplex, and each UDT entity has two parts: the sender and the receiver. The sender sends (and retransmits) application data according to flow control and rate control. The receiver receives data packets and control packets, and sends control packets according to the packets it has received. The sender and the receiver share the same UDP port for sending and receiving.

The receiver is also responsible for triggering and handling all control events, including congestion control, reliability control, and their related mechanisms such as RTT estimation, bandwidth estimation, acknowledgment, and retransmission.

UDT always tries to pack application data into fixed-size packets, unless there is not enough data to send. As in TCP, this fixed packet size is called the MSS (Maximum Segment Size). Since UDT is expected to transfer bulk data streams, we assume that only a small fraction of the packets in a UDT session are irregularly sized. The MSS can be set by the application; the MTU is its optimal value (including all packet headers).

The UDT congestion control algorithm combines rate control and window (flow) control: the former tunes the packet sending period, and the latter limits the maximum number of unacknowledged packets. The parameters used in rate control are updated by bandwidth estimation techniques, which are derived from the receiver-based packet pair method. The rate control period is a constant that does not depend on the estimated RTT, whereas the flow control parameter depends on the data arrival speed at the receiver and on the available receiver buffer size.

2. Packet structure

  • the whole frame:

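  • Data packet:

bit 0:
    0: data packet
    1: control packet

bit ff:
    11: standalone data packet
    10: first data packet of a data stream
    01: last data packet of a data stream

bit o:
    0: not delivered to the user immediately
    1: delivered to the user immediately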
  • Control packet:
bit 1~15:
    0: Handshake
        Additional Info: Undefined
        Control Info: Struct CHandleShake
    1: Keep-Alive
        Additional Info: Undefined
        Control Info: Undefined
    2: ACK
        Additional Info: The ACK sequence number
        Control Info:  RTT
                                RTT variance
                                Available receive buffer size (in bytes)
                                Flow window size advertised to the sender (in packets)
                                Estimated bandwidth (in packets per second)
    3: NAK (sent periodically to handle packet loss)
        Additional Info: Undefined
        Control Info: Loss List
    4: Congestion/Delay Warning
        Additional Info: Undefined
        Control Info: None
    5: Shutdown
        Additional Info: Undefined
        Control Info: None
    6: ACK2 (acknowledgment of an ACK)
        Additional Info: The ACK sequence number
        Control Info: None
    7: Message Drop Request
        Additional Info: Message ID
        Control Info:  first sequence number of the message
                                last sequence number of the message
    8: Error Signal from the Peer Side
        Additional Info: Error Code
        Control Info: None
    0x7FFF: Explained by bits 16-31
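To make the flag and type fields concrete, here is a minimal Python sketch of how the first header word could be classified. It assumes that the header starts with a 32-bit word in network byte order and that "bit 0" means the most significant bit of that word; neither is stated explicitly above. The function name `parse_first_word` and the `CONTROL_TYPES` table are made up for this illustration.

```python
import struct

# Control type codes as listed above (illustrative lookup table).
CONTROL_TYPES = {
    0: "handshake", 1: "keep-alive", 2: "ack", 3: "nak",
    4: "congestion warning", 5: "shutdown", 6: "ack2",
    7: "message drop request", 8: "error signal",
}


def parse_first_word(header: bytes):
    """Classify a packet from its first 32-bit word.

    Assumption: big-endian word, bit 0 is the most significant bit.
    """
    (word,) = struct.unpack("!I", header[:4])
    if word >> 31 == 0:
        return ("data", None)
    ctrl_type = (word >> 16) & 0x7FFF   # bits 1~15
    return ("control", CONTROL_TYPES.get(ctrl_type, ctrl_type))
```

Unknown type codes are returned as plain integers so that the caller can decide how to treat them.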

3. Timer

UDT uses 4 timers at the receiving end to trigger different periodic events, including rate control, acknowledgment, loss report (negative acknowledgment) and retransmission/connection maintenance.

Timers in UDT use the system time as their source. The UDT receiver actively queries the system time to check whether a timer has expired. For a timer T with period TP, a variable t records the time when T was most recently set or reset. If T is reset at system time t0 (t = t0), then T expires at any later time t1 for which t1 - t >= TP.

The four timers are the RC timer, the ACK timer, the NAK timer, and the EXP timer. Their periods are RCTP, ATP, NTP, and ETP, respectively.

The RC timer is used to trigger periodic rate control. The ACK timer is used to trigger periodic selective acknowledgments (ACK packets). RCTP and ATP are constants: RCTP = ATP = 0.01 seconds.

The NAK timer is used to trigger negative acknowledgments (NAK packets). The EXP timer is used to trigger the retransmission of packets and to maintain the connection state. Their periods depend on the RTT estimate. ETP also depends on the number of consecutive EXP timeouts. The recommended initial value of RTT is 0.1 seconds, and the initial values of NTP and ETP are NTP = 3 * RTT and ETP = 3 * RTT + ATP.

The system time is queried after each time-bounded UDP receive operation (plus any necessary data processing time if a UDP packet was received) to check whether any of the four timers has expired. The recommended timer granularity is microseconds. The UDP receive timeout is an implementation choice that depends on the trade-off between the overhead of polling and the accuracy of the event periods.

The rate control event updates the packet sending period (STP), and the UDT sender uses STP to schedule the sending of data packets. If a packet is sent at time t0, the next packet is sent at (t0 + STP). In other words, if sending the previous packet took time t', the sender waits (STP - t') before sending the next packet (if STP - t' < 0, there is no need to wait). This wait requires a high-precision implementation; CPU clock cycle granularity is recommended.
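The polled timer rule and the inter-packet wait described in this section can be sketched in a few lines of Python. This is only an illustration under simple assumptions: times are plain floating-point numbers supplied by the caller, and the names `Timer` and `send_wait` are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Timer:
    """Polled timer: expires once `period` has elapsed since the last reset."""
    period: float       # TP
    last_reset: float   # t, the time the timer was last set or reset

    def expired(self, now: float) -> bool:
        # T expires at any time t1 with t1 - t >= TP.
        return now - self.last_reset >= self.period

    def reset(self, now: float) -> None:
        self.last_reset = now


def send_wait(stp: float, elapsed: float) -> float:
    """Time to wait before the next send: STP - t', or 0 if that is negative."""
    return max(stp - elapsed, 0.0)
```

A receive loop would call `expired()` for each of the four timers after every bounded UDP receive and call `reset()` on the ones it has handled.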

4. Sender algorithm

  • Data structure
    A: SND PKT history window: a circular array that records the sending time of each data packet.
    B: Sender loss list: a linked list that stores the sequence numbers of the lost packets reported in the receiver's NAK packets. The numbers are stored in increasing order.
  • Sending Algorithm
    A: If the sender's loss list is not empty, retransmit the first packet in the list, remove it from the list, and go to step E.
    B: Wait until there is application data to be sent.
    C: If the number of unacknowledged packets exceeds the flow window size, go to step A. Otherwise, pack a new data packet and send it.
    D: If the sequence number of the current packet is 16n, where n is an integer, go to step B.
    E: Record the sending time of the packet in the SND PKT history window.
    F: Wait for an additional SYN time if this is the first packet since the last time the sending rate was decreased.
    G: Wait for (STP - t), where t is the total time spent between steps A and D, then go to step A.
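The choice the sender makes on each pass through the loop can be sketched as follows. This is a simplified illustration rather than UDT's actual code: the loss list is modeled as a `deque`, waiting and timing are left to the caller, and the function name `next_packet` and its parameters are assumptions made for the example.

```python
from collections import deque


def next_packet(loss_list, next_seq, unacked, window, has_data):
    """Pick the next packet to send.

    Returns (seq, is_retransmission, send_next_immediately), or None when
    the sender has to wait for application data or for acknowledgments.
    """
    if loss_list:
        # Step A: retransmissions always go first.
        return loss_list.popleft(), True, False
    if not has_data or unacked > window:
        # Steps B and C: nothing to send, or the flow window is exhausted.
        return None
    # Step D: a packet whose sequence number is 16n is followed by the
    # next packet without the usual STP wait.
    return next_seq, False, next_seq % 16 == 0
```

Sending the packet after number 16n back to back is what lets the receiver measure a packet pair interval for bandwidth estimation.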

5. Receiver algorithm

  • Data Structure
    A: Receiver loss list: a linked list of tuples. Each element holds the sequence number of a lost data packet, the time the loss was most recently fed back, and the number of times it has been fed back. Elements are stored in increasing order of packet sequence number.
    B: ACK history window: a circular array that records each ACK sent and the time it was sent. Because the array is circular, a new value overwrites the oldest one when there is no more free space.
    C: RCV PKT history window: a circular array used to record the arrival time of each data packet.
    D: PKT pair window: a circular array used to record the time interval between each probing packet pair.
    E: LRSN: a variable used to record the largest received data packet sequence number. LRSN is initialized to the initial sequence number minus one.
  • Reception Algorithm
    A: Query the system time to check whether the RC, ACK, NAK, or EXP timer has expired. If any timer has expired, handle the event (described later in this article) and reset the expired timer.
    B: Start a time-bounded UDP receive. If no packet arrives, go to step A.
    C: Set exp-count to 1, and update ETP to: ETP = RTT + 4 * RTTVar + ATP.
    D: If all sent packets have been acknowledged, reset the EXP timer.
    E: Check the flag bit of the packet header. If it is a control packet, handle it according to its type, then go to step A.
    F: If the sequence number of the current packet is 16n + 1, where n is an integer, record the time interval between the current packet and the previous packet in the PKT pair window.
    G: Record the arrival time of the packet in the RCV PKT history window.
    H: If the sequence number of the current data packet is greater than LRSN + 1, put all sequence numbers between (but not including) these two values into the receiver loss list, and send these sequence numbers to the sender in a NAK packet. If the sequence number is less than LRSN, remove it from the receiver loss list.
    I: Update LRSN, then go to step A.
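The loss detection in steps H and I can be illustrated with a short Python sketch. It is deliberately simplified: the loss list holds bare sequence numbers instead of full tuples, sequence number wrap-around is ignored, and the function name `on_data_packet` is invented for the example.

```python
def on_data_packet(seq, lrsn, loss_list):
    """Process the sequence number of an arriving data packet.

    Returns (new_lrsn, newly_lost); `newly_lost` is what a NAK would report.
    `loss_list` is a sorted Python list that is updated in place.
    """
    newly_lost = []
    if seq > lrsn + 1:
        # Everything strictly between LRSN and seq was skipped.
        newly_lost = list(range(lrsn + 1, seq))
        loss_list.extend(newly_lost)   # stays sorted: all new values > LRSN
    elif seq < lrsn and seq in loss_list:
        # A retransmitted packet filled a hole.
        loss_list.remove(seq)
    return max(lrsn, seq), newly_lost
```

For example, with LRSN = 4, the arrival of packet 8 reports 5, 6, and 7 as lost; a later arrival of packet 6 removes it from the list again.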

6. Handling timer and packet arrival events

  • Handle ACK timer expiration
    A: Find the sequence number up to which the receiver has received all packets (the ACK number) according to the following rule: if the receiver loss list is empty, the ACK number is LRSN + 1; otherwise it is the smallest sequence number in the receiver loss list.
    B: If the ACK number is not greater than the largest ACK number ever acknowledged by an ACK2, or if it is equal to the ACK number of the last ACK and the time interval between the two ACKs is less than RTT + 4 * RTTVar, stop (do not send the ACK).
    C: Assign this ACK a unique, increasing ACK sequence number. It is recommended that the ACK sequence number increase in steps of 1 and wrap around after reaching the maximum value.
    D: Calculate the packet arrival speed according to the following algorithm: compute the median (AI) of the last 16 packet arrival intervals using the values in the RCV PKT history window. Among these 16 values, remove those that are greater than AI * 8 or less than AI / 8. If more than 8 values are left, calculate their average (AI'); the packet arrival speed is then 1/AI' (packets per second). Otherwise it is 0.
    E: Calculate the flow window size (W) according to the flow control algorithm in Section 7. The effective flow window size is then calculated as: max(min(W, available receiver buffer size), 2).
    F: Estimate the link capacity according to the following algorithm: if the flow control fast start phase (see Section 7) is still in progress, return 0; otherwise compute the median (PI) of the last 16 packet pair intervals stored in the PKT pair window. The link capacity is then 1/PI (packets per second).
    G: Pack the ACK sequence number, the ACK number, RTT, RTT variance, the effective flow window size, and the estimated link capacity into an ACK packet and send it out.
    H: Record the ACK sequence number, the ACK number, and the departure time of this ACK in the ACK history window.
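The median filtering used for the arrival speed, together with the effective window calculation, might look like this in Python. The sketch assumes intervals measured in seconds, keeps values within a factor of 8 of the median, requires more than 8 survivors, and reads the effective window as max(min(W, buffer), 2); these are interpretations of the steps above. The function names are hypothetical.

```python
from statistics import median


def arrival_speed(intervals):
    """Packets per second from the last 16 arrival intervals (in seconds)."""
    last = intervals[-16:]
    if not last:
        return 0.0
    ai = median(last)
    # Drop outliers: values above AI * 8 or below AI / 8.
    kept = [x for x in last if ai / 8 <= x <= ai * 8]
    if len(kept) <= 8:
        return 0.0
    average = sum(kept) / len(kept)
    return 1.0 / average if average > 0 else 0.0


def effective_window(w, free_buffer):
    """Effective flow window: at least 2, at most the free receive buffer."""
    return max(min(w, free_buffer), 2)
```

The filter keeps a single long pause (for example, the application stalling for one second) from dragging down the speed estimate.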
  • Handle NAK timer expiration
    A: Search the receiver loss list for all packets whose last feedback time is more than k * (RTT + 4 * RTTVar) ago, where k is the number of times the packet has been fed back so far, and increase k by 1 for each of these packets. If no loss needs to be fed back, stop.
    B: Compress the sequence numbers obtained in step A (see the description of loss information in Section 7) and send them to the sender in a NAK packet.
    C: If the flow control fast start phase has not yet ended, stop it.
  • Handle EXP timer expiration
    A: If the sender's loss list is not empty, stop.
    B: Put all unacknowledged packets into the sender's loss list.
    C: If (exp-count > 16) and more than 3 seconds have passed since a packet was last received from the peer, or if this time has exceeded 3 minutes, the connection is considered broken and the UDT connection is closed.
    D: If there is no unacknowledged data, send a keep-alive packet to the peer; otherwise, put the sequence numbers of all unacknowledged packets into the sender's loss list.
    E: Update exp-count to: exp-count = exp-count + 1.
    F: Update ETP to: ETP = exp-count * (RTT + 4 * RTTVar) + ATP.
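As a rough illustration of the EXP timer handling, the broken-connection test and the ETP update can be written as two small Python functions. The sketch assumes that "this time" in the broken-connection rule refers to the time since a packet was last received from the peer; the function names are invented for the example.

```python
ATP = 0.01  # ACK timer period in seconds, as given in the timer section


def connection_broken(exp_count, idle_seconds):
    """True when the peer should be considered gone.

    `idle_seconds` is the time since a packet was last received from the peer.
    """
    return (exp_count > 16 and idle_seconds > 3.0) or idle_seconds > 180.0


def next_etp(exp_count, rtt, rtt_var):
    """ETP = exp-count * (RTT + 4 * RTTVar) + ATP, in seconds."""
    return exp_count * (rtt + 4 * rtt_var) + ATP
```

Because exp-count grows with every consecutive timeout, the EXP period backs off linearly while the peer stays silent.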
  • On receiving an ACK packet
    A: Update the largest acknowledged sequence number.
    B: Update RTT and RTTVar as: RTT = rtt, RTTVar = rv, where rtt and rv are the RTT and RTTVar values carried in the ACK packet.
    C: Update NTP and ETP as: NTP = RTT + 4 * RTTVar; ETP = exp-count * (RTT + 4 * RTTVar) + ATP.
    D: Update the link capacity estimate: B = (B * 7 + b) / 8, where b is the value carried in the ACK packet.
    E: Update the flow window size to the value carried in the ACK.
    F: Send an ACK2 packet to the peer that carries the same ACK sequence number as the ACK.
    G: Reset the EXP timer.
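The state updates triggered by an ACK can be sketched as below. The example only covers the arithmetic (steps B to E); sending the ACK2 and resetting the EXP timer are left out. The `SenderState` class, its field names, and its default values are assumptions made for the illustration.

```python
from dataclasses import dataclass

ATP = 0.01  # ACK timer period in seconds, as given in the timer section


@dataclass
class SenderState:
    rtt: float = 0.1        # seconds
    rtt_var: float = 0.05   # seconds
    capacity: float = 0.0   # B, link capacity estimate in packets per second
    window: int = 16        # flow window size in packets
    exp_count: int = 1
    ntp: float = 0.3
    etp: float = 0.31


def on_ack(state, rtt, rtt_var, bandwidth, window):
    """Apply the values carried in an ACK packet to the sender state."""
    state.rtt, state.rtt_var = rtt, rtt_var
    spread = state.rtt + 4 * state.rtt_var
    state.ntp = spread
    state.etp = state.exp_count * spread + ATP
    # Smooth the capacity estimate: B = (B * 7 + b) / 8.
    state.capacity = (state.capacity * 7 + bandwidth) / 8
    state.window = window
    return state
```

The 7/8 weighting means a single outlier estimate moves B by only one eighth of the difference.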
  • On receiving a NAK packet
    A: Put all sequence numbers carried in the NAK packet into the sender's loss list.
    B: Update STP through rate control.
    C: Reset the EXP timer.
  • On receiving an ACK2 packet
    A: In the ACK history window, look up the corresponding ACK according to the ACK sequence number carried in the ACK2.
    B: Update the largest ACK number that has ever been acknowledged.
    C: Calculate the new rtt value from the arrival time of the ACK2 and the departure time of the ACK, and update RTT and RTTVar:
    RTTVar = (RTTVar * 3 + abs(rtt - RTT)) / 4
    RTT = (RTT * 7 + rtt) / 8
    The initial values of RTT and RTTVar are 0.1 seconds and 0.05 seconds.
    D: Update NTP and ETP as:
    NTP = RTT;
    ETP = (exp-count + 1) * RTT + ATP
  • On receiving a keep-alive packet
    Do nothing.
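The RTT measurement from an ACK/ACK2 pair can be sketched as follows. For simplicity the ACK history window is modeled as a dictionary from ACK sequence number to departure time instead of a circular array, and the function name `on_ack2` is hypothetical.

```python
def on_ack2(ack_history, ack_seq, now, rtt, rtt_var):
    """Return the updated (RTT, RTTVar) after an ACK2 arrives at time `now`."""
    sent_at = ack_history.get(ack_seq)
    if sent_at is None:
        # The matching ACK is no longer in the history; keep the old values.
        return rtt, rtt_var
    sample = now - sent_at
    # RTTVar is updated first, using the RTT from before this sample.
    new_var = (rtt_var * 3 + abs(sample - rtt)) / 4
    new_rtt = (rtt * 7 + sample) / 8
    return new_rtt, new_var
```

With the initial values RTT = 0.1 s and RTTVar = 0.05 s, a first sample of 0.2 s gives RTT = 0.1125 s and RTTVar = 0.0625 s.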

7. Rate control algorithm

  • Fast start
    STP is initialized to the smallest time precision (1 CPU cycle or 1 millisecond). This is the fast start phase. It ends as soon as an ACK packet is received whose estimated bandwidth is greater than 0. The packet sending period is then set to 1/W, where W is the flow window size carried in the ACK.
    The fast start phase occurs only at the beginning of a UDT connection and never again afterwards. After the fast start phase, the following algorithm takes over.
  • When the RC timer expires:
    A: If no ACK was received within the last RCTP time, stop.
    B: Calculate the loss rate within the last RCTP time from the total number of sent packets and the total number of lost packets reported in NAKs. If the loss rate is greater than 0.1%, stop.
    C: The number of additional packets to be sent in the next RCTP time (inc) is calculated as follows:
    If (B <= C) inc = 1/MSS
    Else inc = max(10^(ceil(log10((B - C) * MSS * 8))) * Beta / MSS, 1/MSS)
    B is the link capacity estimate and C is the current sending speed; both are expressed in packets per second. MSS is expressed in bytes. Beta is a constant with the value 0.0000015.

    D: Update STP: STP = (STP * RCTP) / (STP * inc + RCTP)
    E: Calculate the real data sending period (rsp) from the SND PKT history window. If (STP < 0.5 * rsp), set STP to (0.5 * rsp).
    F: If (STP < 1.0), set STP to 1.0.
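The rate increase can be worked through in Python. The sketch assumes that STP, RCTP, and rsp are all expressed in the same time unit (microseconds here, which matches the lower bound of 1.0 in step F) and that the garbled term in the formula is (B - C) * MSS * 8, i.e. the spare capacity in bits per second. All function names are made up for the example.

```python
import math

BETA = 0.0000015
RCTP = 10000.0  # rate control period: 0.01 s expressed in microseconds


def increase(b, c, mss):
    """Packets to add in the next RCTP; b and c are in packets per second."""
    if b <= c:
        return 1.0 / mss
    spare_bits = (b - c) * mss * 8
    return max(10 ** math.ceil(math.log10(spare_bits)) * BETA / mss, 1.0 / mss)


def update_stp(stp, inc, rctp=RCTP):
    """STP = (STP * RCTP) / (STP * inc + RCTP)."""
    return (stp * rctp) / (stp * inc + rctp)


def clamp_stp(stp, rsp):
    """Apply steps E and F: no less than half the real period, no less than 1.0."""
    return max(stp, 0.5 * rsp, 1.0)
```

For B = 10000, C = 2000, and MSS = 1500 the spare capacity is 96 Mbit/s, which rounds up to 10^8, so inc = 10^8 * 0.0000015 / 1500 = 0.1 packets per RCTP. With STP = 100 µs (100 packets per RCTP) the new STP is 1000000 / 10010, about 99.9 µs.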
  • When receiving a NAK packet:
    • Data structure:
      A: LSD: the largest sequence number that had been sent at the time of the last rate decrease.
      B: NumNAK: the number of NAKs since LSD was last updated.
      C: AvgNAK: the average number of NAKs between two events in which the largest lost sequence number is greater than LSD.
      D: DR: a random number between 1 and AvgNAK.
    • Algorithm:
      A: If the largest lost sequence number in the NAK is greater than LSD:
      Increase STP as: STP = STP * (1 + 1/8)
      Update AvgNAK as: AvgNAK = (AvgNAK * 7 + NumNAK) / 8
      Update DR
      Reset NumNAK = 0
      Record LSD
      B: Otherwise, increase NumNAK by 1; if NumNAK % DR = 0, increase STP as: STP = STP * (1 + 1/8) and record LSD.
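A possible shape for the NAK-driven rate decrease is sketched below. Several details are assumptions: "record LSD" is taken to mean storing the largest sequence number sent so far, DR is drawn as a random integer between 1 and the integer part of AvgNAK (at least 1), and the class and function names are invented for the illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class DecreaseState:
    stp: float             # current packet sending period
    lsd: int = 0           # largest sequence number sent at the last decrease
    num_nak: int = 0       # NAKs since LSD was last updated
    avg_nak: float = 1.0   # smoothed number of NAKs per congestion event
    dr: int = 1            # random number between 1 and AvgNAK


def on_nak(state, largest_lost, largest_sent, rng=random):
    """Update STP and the bookkeeping variables for one NAK packet."""
    if largest_lost > state.lsd:
        # A loss beyond LSD marks a new congestion event.
        state.stp *= 1 + 1 / 8
        state.avg_nak = (state.avg_nak * 7 + state.num_nak) / 8
        state.dr = rng.randint(1, max(1, int(state.avg_nak)))
        state.num_nak = 0
        state.lsd = largest_sent
    else:
        state.num_nak += 1
        if state.num_nak % state.dr == 0:
            state.stp *= 1 + 1 / 8
            state.lsd = largest_sent
    return state
```

Decreasing only on every DR-th NAK within one congestion event keeps a burst of NAKs for the same event from cutting the rate many times over.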
  • Flow control algorithm:
    The initial value of the flow window size (W) is 16.
  • When the ACK timer expires:
    A: Flow control fast start: if no NAK has been generated, or W has not yet reached or exceeded 15 packets, and AS > 0, the flow window size is updated to the total number of acknowledged packets.
    B: Otherwise, if (AS > 0), W is updated to: W = ceil(W * 0.875 + AS * (RTT + ATP) * 0.125), where AS is the packet arrival speed.
    C: Limit W to the maximum flow window size of the peer.
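The flow window update can be sketched as a single function. The fast start condition is read here as "(no NAK yet or W < 15) and AS > 0", which is one interpretation of step A; the function name and parameter names are assumptions.

```python
import math

ATP = 0.01  # ACK timer period in seconds


def update_window(w, arrival_speed, rtt, acked, nak_seen, peer_max_window):
    """Return the new flow window size W in packets.

    `arrival_speed` is AS in packets per second, `acked` is the total number
    of acknowledged packets, and `rtt` is in seconds.
    """
    if arrival_speed > 0:
        if not nak_seen or w < 15:
            # Fast start: follow the number of acknowledged packets.
            w = acked
        else:
            w = math.ceil(w * 0.875 + arrival_speed * (rtt + ATP) * 0.125)
    return min(w, peer_max_window)
```

With W = 16, AS = 1000 packets per second, and RTT = 0.09 s, the steady-state rule gives ceil(14 + 12.5) = 27 packets.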
  • Connection establishment and closure:
    One UDT entity starts first as the server; a client that wants to connect sends it a handshake packet. The client should keep sending the handshake packet at regular intervals until it receives a handshake response from the server or times out (the interval is a trade-off between response time and system overhead).
    The handshake packet carries the following information:
    A: UDT version: this value is for compatibility purposes. The current version is 2.
    B: Initial sequence number: the starting sequence number that this UDT entity will use to send data packets. It must be a random value between 1 and (2^31 - 1). In addition, it is recommended that this value not repeat within a reasonable window of time.
    C: MSS: the packet size (measured as IP payload).
    D: Maximum flow window size: the maximum flow window size allowed for the UDT entity that receives the handshake message. The window size is usually limited by the size of the data structures on the receiving side.
    After the server receives a handshake packet, it compares the MSS value with its own and sets its own value to the smaller of the two. The resulting value is also sent to the client in the handshake response, along with the server's version, initial sequence number, and maximum flow window size.
    The version field is used to check compatibility between the two ends. The initial sequence number and the maximum flow window size are used to initialize the parameters of the UDT entity that receives the handshake packet.
    The server is ready to send or receive data once this first step is complete. However, whenever it receives another handshake packet from the same client, it should send a response packet.
    Once the client receives the handshake response from the server, it enters the data sending and receiving state. It sets its own MSS to the value in the handshake response and initializes the corresponding parameters (sequence number, maximum flow window) to the values in the packet. Any further handshake message it receives is discarded.
    If one of the UDT entities wants to close the connection, it sends a shutdown message to the peer; the peer closes itself after receiving this message. The shutdown message is transmitted over UDP and is sent only once, so it is not guaranteed to arrive. If the message is not received, the peer closes the connection according to the timeout mechanism.
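The server side of the handshake, in particular the MSS negotiation, can be illustrated with a short sketch. The `Handshake` class and the `server_response` function are hypothetical and only model the four fields listed above; no packet encoding or networking is shown.

```python
import random
from dataclasses import dataclass


@dataclass
class Handshake:
    version: int
    initial_seq: int   # random value between 1 and 2^31 - 1
    mss: int           # packet size
    max_window: int    # maximum flow window size, in packets


def server_response(request, server_version, server_mss, server_max_window,
                    rng=random):
    """Build the server's handshake response for a client request."""
    # Both sides end up using the smaller of the two MSS values.
    mss = min(request.mss, server_mss)
    initial_seq = rng.randint(1, 2**31 - 1)
    return Handshake(server_version, initial_seq, mss, server_max_window)
```

A client that asked for an MSS of 1500 and receives 1400 in the response simply adopts 1400 for the rest of the connection.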
  • Encoding of loss information:
    The loss information carried in a NAK packet is an array of 32-bit integers. If an element of the array is a normal sequence number (its first bit is 0), the packet with that sequence number is lost. If the first bit is 1, all packets from this number (inclusive) up to the value of the next array element (inclusive, and whose first bit must be 0) are lost.
    For example, consider the following information carried in a NAK:
    0x00000002, 0x80000006, 0x0000000B, 0x0000000E
    It indicates that the packets with sequence numbers 2, 6, 7, 8, 9, 10, 11, and 14 are lost.
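The range encoding described above is easy to check with a small decoder and its inverse. This is a sketch that assumes well-formed input (every range start is followed by a range end) and a sorted list of sequence numbers for encoding; the function names are made up.

```python
RANGE_FLAG = 0x80000000


def decode_loss_list(words):
    """Expand the 32-bit words of a NAK into individual sequence numbers."""
    lost = []
    i = 0
    while i < len(words):
        word = words[i]
        if word & RANGE_FLAG:
            # Range: from this number to the next element, both inclusive.
            start, end = word & 0x7FFFFFFF, words[i + 1]
            lost.extend(range(start, end + 1))
            i += 2
        else:
            lost.append(word)
            i += 1
    return lost


def encode_loss_list(seqs):
    """Compress a sorted list of sequence numbers into NAK words."""
    words = []
    i = 0
    while i < len(seqs):
        j = i
        while j + 1 < len(seqs) and seqs[j + 1] == seqs[j] + 1:
            j += 1
        if j > i:
            words += [seqs[i] | RANGE_FLAG, seqs[j]]
        else:
            words.append(seqs[i])
        i = j + 1
    return words
```

Decoding the example array yields 2, 6 through 11, and 14, and encoding that list reproduces the four words.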

8. Efficiency and fairness

    UDT is able to make full use of the available bandwidth of current wired networks, independent of link capacity, RTT, coexisting background flows, and the link's bit error rate. Without packet loss, UDT takes a constant time of 7.5 seconds to go from 0 bits/s to 90% of the bandwidth. UDT is not suitable for wireless networks.
    UDT satisfies max-min fairness in single-bottleneck topologies. With multiple bottlenecks, it guarantees that flows passing through the smaller bottlenecks obtain at least half of their max-min fair share. RTT has little effect on fairness.
    When coexisting with bulk TCP flows, TCP can take up more bandwidth than UDT, except in three cases:
    1. The network BDP is so large that TCP cannot make use of its fair share of the bandwidth. In this case, UDT occupies the bandwidth that TCP cannot use.
    2. The link capacity is so small that UDT's bandwidth estimation technique does not work well; simulations show that the threshold link capacity is about 100 kb/s.
    3. In networks that use FIFO queuing, if the queue size is larger than the BDP, TCP's share of the bandwidth decreases as the queue size increases. However, the queue size at which TCP's share drops to that of UDT usually exceeds what real routers and switches provide.
    When short-lived, web-like TCP flows coexist with a small number of concurrent UDT flows, UDT has very little effect on the TCP flows.
