Linux Kernel Network - Congestion Control Series (1)

When it comes to network congestion control, you are probably familiar with the textbook concepts of "additive increase", "multiplicative decrease", "slow start", "congestion avoidance", "fast retransmit" and "fast recovery". These make up the basic theory of classic congestion control, but actual implementations differ greatly from one algorithm to another. This article studies the implementation framework of network congestion control from the Linux kernel source code. Looking at the current state of development, congestion control algorithms fall into roughly four categories:

  • Loss-based congestion control algorithms, which treat packet loss as the signal of network congestion. They probe the network cautiously, gradually increasing the congestion window and shrinking it when loss occurs. Representative algorithms include Tahoe, Reno, NewReno, BIC and Cubic.
  • Delay-based congestion control algorithms, which treat an increase in round-trip delay as the signal of congestion: the congestion window is reduced when delay rises and enlarged when delay falls. Representative algorithms include Vegas and Westwood.
  • Congestion control algorithms based on link capacity, represented by BBR. Instead of inferring congestion from loss or delay signals, BBR models the network path directly (estimating bottleneck bandwidth and round-trip time) in order to avoid and handle real congestion.
  • Learning-based congestion control algorithms, which rely on no single congestion signal. They are generally built from training data and an evaluation (reward) function, producing a congestion control policy through machine learning. Representative algorithms include Remy, PCC, Aurora, DRL-CC and Orca.

Since the core ideas of these categories differ greatly, the implementation and principle of each algorithm will be covered in later articles in this series. This article first analyzes the implementation details and general framework of network congestion control in the Linux kernel. Before the formal analysis, let us briefly review some common concepts:

  • What network congestion is: network congestion occurs when the amount of data injected into the network exceeds the processing capacity of its links or nodes, leading to higher latency, a higher packet loss rate and lower bandwidth utilization.
  • Window: the Window field in the TCP header occupies 16 bits and is used by the receiver to tell the sender how much buffer space is available for receiving data.

  • Sliding window, send window: in the usual diagram, a box drawn over the byte stream represents the send window. "Sliding" is simply a visual name: the send window keeps moving forward so that new data can be sent. When an ACK from the receiver arrives, the send window slides to the right. In the classic illustration, the bytes already sent and acknowledged sit to the left of the window; when 5 bytes have just been acknowledged, the window slides 5 positions to the right, so that bytes 52~56 (data that is now allowed to be sent) become sendable. Once the bytes in the range 37~51 (sent but not yet acknowledged) are acknowledged, the window can slide right again. Data beyond the right edge of the window (not yet allowed to be sent) must wait until the window reaches it. The TCP sliding window is dynamic. Think of an elementary-school math problem: a pool of volume V with inflow rate V1 and outflow rate V2; when the pool is full, no more water may be poured in, and if a control system can resize the pool, it can also control how fast and how much water is poured in. Such a pool behaves like a TCP window: according to changes in its own processing capacity, an application constrains the peer's send traffic by controlling the size of the local TCP receive window.

  • Congestion window: the send window was introduced above. TCP also maintains a variable that reflects the carrying capacity of the network, called the congestion window, written cwnd. The sender's actual send window is the smaller of the receiver-advertised window rwnd and the congestion window cwnd:
W = min(cwnd, rwnd)
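As a minimal sketch (not the kernel's actual code, which tracks windows in segments inside the transmit path), this relationship can be written as:

#include <linux/types.h>

/* Minimal sketch: the data a sender may keep in flight is bounded both by
 * the receiver-advertised window (rwnd) and by the congestion window (cwnd). */
static inline u32 effective_send_window(u32 cwnd, u32 rwnd)
{
	return cwnd < rwnd ? cwnd : rwnd;    /* W = min(cwnd, rwnd) */
}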

From these concepts we can see that the congestion window indirectly reflects the state of the network and in turn limits the size of the send window; it is one of the core variables in network congestion control. The kernel implementation revolves around four core structures: struct sock, struct inet_sock, struct inet_connection_sock and struct tcp_sock. Their relationship has an object-oriented flavor: each structure embeds the previous one, achieving reuse through layer-by-layer "inheritance". Many network-related functions in the kernel take a struct sock as a parameter and, depending on the business logic, convert it internally into the more specific structure they need.

struct tcp_sock is "inherited" from struct inet_connection_sock: it embeds that structure and adds TCP-specific fields such as sliding window state and the congestion control algorithm. Because of this embedding relationship, the structures can be converted into one another. Two conversion helpers are shown below as examples: the first converts a struct sock to a struct tcp_sock, and the second converts a struct sock to a struct inet_connection_sock. Further below, struct tcp_sock is expanded to show the fields related to network congestion control.

static inline struct tcp_sock *tcp_sk(const struct sock *sk)
{
 return (struct tcp_sock *)sk;
}
static inline struct inet_connection_sock *inet_csk(const struct sock *sk)
{
 return (struct inet_connection_sock *)sk;
}
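
As a usage sketch (a hypothetical helper, not taken from the kernel source), a function that receives a generic struct sock pointer converts it to the more specific views before touching TCP-specific fields:

#include <net/tcp.h>

/* Hypothetical sketch: convert the generic socket to its TCP-specific and
 * connection-oriented views before reading protocol state. */
static u32 example_read_cwnd(struct sock *sk)
{
	struct tcp_sock *tp = tcp_sk(sk);                        /* TCP-specific view  */
	const struct inet_connection_sock *icsk = inet_csk(sk);  /* connection view    */

	/* icsk->icsk_ca_ops points at the congestion control algorithm in use */
	pr_debug("congestion control in use: %s\n", icsk->icsk_ca_ops->name);

	return tp->snd_cwnd;    /* current congestion window, in segments */
}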

The fields related to network congestion control defined in struct tcp_sock are as follows:

struct tcp_sock { /* extends inet_connection_sock with TCP-specific state such as the sliding window and the congestion control algorithm */
    __be32    pred_flags;    /* Header prediction flags: set when a SYN is received, the window is updated, etc.;
                  * together with the timestamp and sequence numbers it decides between the fast and slow path */

    u64    bytes_received;    /* RFC4898 tcpEStatsAppHCThruOctetsReceived
                 * sum(delta(rcv_nxt)), or how many bytes
                 * were acked.
                 */
    u32    segs_in;    /* RFC4898 tcpEStatsPerfSegsIn
                 * total number of segments in.
                 */
     u32    rcv_nxt;    /* What we want to receive next: the next sequence number expected from the peer */
    u32    copied_seq;    /* Head of yet unread data        */

    u32    rcv_wup;    /* rcv_nxt on last window update sent: the left edge of the receive window (oldest data received
                 * but not yet acknowledged); refreshed from rcv_nxt when an ACK is sent, so it lags slightly behind rcv_nxt */

    u32    snd_nxt;    /* Next sequence we send: the next sequence number to be sent */
    u32    segs_out;    /* RFC4898 tcpEStatsPerfSegsOut
                 * The total number of segments sent.
                 */
    u64    bytes_acked;    /* RFC4898 tcpEStatsAppHCThruOctetsAcked
                 * sum(delta(snd_una)), or how many bytes
                 * were acked.
                 */
    struct u64_stats_sync syncp; /* protects 64bit vars (cf tcp_get_info()) */

     u32    snd_una;    /* First byte we want an ack for: the oldest unacknowledged sequence number */
     u32    snd_sml;    /* Last byte of the most recently transmitted small packet: updated when a segment smaller than
                 * the MSS is sent successfully; mainly used to decide whether the Nagle algorithm applies */
    u32    rcv_tstamp;    /* timestamp of last received ACK (for keepalives) */
    u32    lsndtime;    /* timestamp of last sent data packet (for restart window) */
    u32    last_oow_ack_time;  /* timestamp of last out-of-window ACK */

    u32    tsoffset;    /* timestamp offset */

    struct list_head tsq_node; /* anchor in tsq_tasklet.head list */
    unsigned long    tsq_flags;

    /* Data for direct copy to user space: the user buffer and its length, the prequeue queue and the memory it consumes */
    struct {
        struct sk_buff_head    prequeue;    /* TCP segments are queued here and only processed when the process reads them */
        struct task_struct    *task;
        struct msghdr        *msg;
        int            memory;        /* memory currently consumed by the prequeue */
        int            len;        /* space currently available in the user buffer */
    } ucopy;

    u32    snd_wl1;    /* Sequence for window update: sequence number of the ACK segment that last updated the send
                 * window; a later ACK carrying a higher sequence number means the window should be updated again */
    u32    snd_wnd;    /* The window we expect to receive: the window advertised by the receiver, i.e. the send window size */
    u32    max_window;    /* Maximal window ever seen from peer */
    u32    mss_cache;    /* Cached effective mss, not including SACKS: the sender's currently effective MSS */

    u32    window_clamp;    /* Maximal window to advertise */
    u32    rcv_ssthresh;    /* Current window clamp: the current receive window threshold */
    ......
     u32    snd_ssthresh;    /* Slow start size threshold: slow start threshold for congestion control */
     u32    snd_cwnd;    /* Sending congestion window: the current congestion window size */
    u32    snd_cwnd_cnt;    /* Linear increase counter: number of ACK segments received since the congestion window
                 * was last adjusted; 0 means no ACK has arrived since that adjustment */
    u32    snd_cwnd_clamp; /* Do not allow snd_cwnd to grow above this */
    u32    snd_cwnd_used;    /* number of segments sent from the queue but not yet acknowledged */
    u32    snd_cwnd_stamp;    /* time of the last cwnd validation; during congestion cwnd is checked whenever it is
                 * adjusted, and outside congestion it is validated after sending so that an idle
                 * application does not leave a stale cwnd in place */
    u32    prior_cwnd;    /* Congestion window at start of Recovery. */
    u32    prr_delivered;    /* Number of packets newly delivered to the receiver in Recovery. */
    u32    prr_out;    /* Total number of pkts sent during Recovery. */

     u32    rcv_wnd;    /* Current receiver window */
    u32    write_seq;    /* Tail(+1) of data held in tcp send buffer: sequence number after the last byte queued for sending */
    u32    notsent_lowat;    /* TCP_NOTSENT_LOWAT */
    u32    pushed_seq;    /* Last pushed seq, required to talk to windows */
    u32    lost_out;    /* Lost packets */
    u32    sacked_out;    /* SACK'd packets: with SACK enabled, the number of segments reported received via SACK options;
                 * without SACK, the number of duplicate ACKs received, cleared when new data is acknowledged */
    u32    fackets_out;    /* FACK'd packets: number of segments between SND.UNA and the highest sequence number the
                 * receiver has reported via SACK; FACK uses SACK information to estimate segments lost in the
                 * network: lost_out = fackets_out - sacked_out, left_out = fackets_out */

    /* from STCP, retrans queue hinting */
    struct sk_buff* lost_skb_hint;    /* hint: next segment in the retransmit queue to be marked */
    struct sk_buff *retransmit_skb_hint;    /* first segment to be retransmitted */

    /* OOO segments go in this list. Note that socket lock must be held,
     * as we do not use sk_buff_head lock.
     */
    struct sk_buff_head    out_of_order_queue;

    /* SACKs data, these 2 need to be together (see tcp_options_write) */
    struct tcp_sack_block duplicate_sack[1]; /* D-SACK block */
    struct tcp_sack_block selective_acks[4]; /* The SACKS themselves*/

    struct tcp_sack_block recv_sack_cache[4];

    struct sk_buff *highest_sack;   /* skb just after the highest
                     * skb with SACKed bit set
                     * (validity guaranteed only if
                     * sacked_out > 0)
                     */

    int     lost_cnt_hint;    /* how many segments have been marked so far */
    u32     retransmit_high;    /* L-bits may be on up to this seqno */

    u32    prior_ssthresh; /* ssthresh saved at recovery start: the previous value of snd_ssthresh */
    u32    high_seq;    /* snd_nxt at onset of congestion: the next sequence number to send when congestion began */

    u32    retrans_stamp;    /* Timestamp of the last retransmit,
                 * also used in SYN-SENT to remember stamp of
                 * the first SYN. */
    u32    undo_marker;    /* snd_una upon a new recovery episode. Recorded when F-RTO handles a retransmission timeout,
                 * when Recovery starts retransmitting, or when Loss starts slow start; it marks the retransmission
                 * starting point and is one of the conditions checked for undoing congestion control. Cleared once
                 * the undo completes or congestion control enters the Loss state. */
    int    undo_retrans;    /* number of undoable retransmissions: retransmitted segments that may still be undone before
                 * congestion control is rolled back; cleared on entering F-RTO or the Loss state, incremented on
                 * each retransmission, and one of the conditions checked for undoing congestion control. */
    u32    total_retrans;    /* Total retransmits for entire connection */

    u32    urg_seq;    /* Seq of received urgent pointer: sequence number of the urgent data, obtained by adding the
                 * urgent pointer to the segment's sequence number */
    unsigned int        keepalive_time;      /* time before keep alive takes place */
    unsigned int        keepalive_intvl;  /* time interval between keep alive probes */

    int            linger2;

/* Receiver side RTT estimation */
    struct {
        u32    rtt;
        u32    seq;
        u32    time;
    } rcv_rtt_est;

/* Receiver queue space */
    struct {
        int    space;
        u32    seq;
        u32    time;
    } rcvq_space;

/* TCP-specific MTU probe information. */
    struct {
        u32          probe_seq_start;
        u32          probe_seq_end;
    } mtu_probe;
    u32    mtu_info; /* We received an ICMP_FRAG_NEEDED / ICMPV6_PKT_TOOBIG
               * while socket was owned by user.
               */

#ifdef CONFIG_TCP_MD5SIG
    const struct tcp_sock_af_ops    *af_specific;
    struct tcp_md5sig_info    __rcu *md5sig_info;
#endif

    struct tcp_fastopen_request *fastopen_req;

    struct request_sock *fastopen_rsk;
    u32    *saved_syn;
};


Now let's look at a particularly important framework, which could be called the congestion control engine. The structure tcp_congestion_ops, shown below, describes the set of operations a congestion control algorithm needs to support. The framework defines a number of hook functions; each congestion control algorithm in the Linux kernel implements the hooks that fit its design and then registers the structure, completing the design of the congestion control algorithm.

struct tcp_congestion_ops {
 struct list_head list;
 u32 key;
 u32 flags;

 /* initialize private data (optional) */
 void (*init)(struct sock *sk);
 /* cleanup private data  (optional) */
 void (*release)(struct sock *sk);

 /* return slow start threshold (required) */
 u32 (*ssthresh)(struct sock *sk);
 /* do new cwnd calculation (required) */
 void (*cong_avoid)(struct sock *sk, u32 ack, u32 acked);
 /* call before changing ca_state (optional) */
 void (*set_state)(struct sock *sk, u8 new_state);
 /* call when cwnd event occurs (optional) */
 void (*cwnd_event)(struct sock *sk, enum tcp_ca_event ev);
 /* call when ack arrives (optional) */
 void (*in_ack_event)(struct sock *sk, u32 flags);
 /* new value of cwnd after loss (required) */
 u32  (*undo_cwnd)(struct sock *sk);
 /* hook for packet ack accounting (optional) */
 void (*pkts_acked)(struct sock *sk, const struct ack_sample *sample);
 /* suggest number of segments for each skb to transmit (optional) */
 u32 (*tso_segs_goal)(struct sock *sk);
 /* returns the multiplier used in tcp_sndbuf_expand (optional) */
 u32 (*sndbuf_expand)(struct sock *sk);
 /* call when packets are delivered to update cwnd and pacing rate,
  * after all the ca_state processing. (optional)
  */
 void (*cong_control)(struct sock *sk, const struct rate_sample *rs);
 /* get info for inet_diag (optional) */
 size_t (*get_info)(struct sock *sk, u32 ext, int *attr,
      union tcp_cc_info *info);

 char   name[TCP_CA_NAME_MAX];
 struct module  *owner;
};

A custom congestion control algorithm can be implemented by filling in these hooks and registering the resulting structure. The following excerpts show the interface implementation and registration code of the cubic congestion control algorithm. Note that cubic implements only some of the hooks of the congestion control engine tcp_congestion_ops: a few hooks are mandatory, while the rest are optional and chosen according to the algorithm's needs.

static struct tcp_congestion_ops cubictcp __read_mostly = {
 .init  = bictcp_init,
 .ssthresh = bictcp_recalc_ssthresh,
 .cong_avoid = bictcp_cong_avoid,
 .set_state = bictcp_state,
 .undo_cwnd = tcp_reno_undo_cwnd,
 .cwnd_event = bictcp_cwnd_event,
 .pkts_acked     = bictcp_acked,
 .owner  = THIS_MODULE,
 .name  = "cubic",
};

static int __init cubictcp_register(void)
{
 BUILD_BUG_ON(sizeof(struct bictcp) > ICSK_CA_PRIV_SIZE);
 beta_scale = 8*(BICTCP_BETA_SCALE+beta) / 3
  / (BICTCP_BETA_SCALE - beta);

 cube_rtt_scale = (bic_scale * 10); /* 1024*c/rtt */

 cube_factor = 1ull << (10+3*BICTCP_HZ); /* 2^40 */

 /* divide by bic_scale and by constant Srtt (100ms) */
 do_div(cube_factor, bic_scale * 10);

 return tcp_register_congestion_control(&cubictcp);
}

static void __exit cubictcp_unregister(void)
{
 tcp_unregister_congestion_control(&cubictcp);
}

module_init(cubictcp_register);
module_exit(cubictcp_unregister);
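
To make the hook-based framework more concrete, here is a minimal, hypothetical Reno-style module (the names my_cc_* are invented for illustration) that implements only the required hooks and reuses the kernel's exported Reno helpers. It is a simplified sketch, not an algorithm shipped with the kernel:

#include <linux/module.h>
#include <net/tcp.h>

/* On loss, halve the congestion window, as classic Reno does. */
static u32 my_cc_ssthresh(struct sock *sk)
{
	const struct tcp_sock *tp = tcp_sk(sk);

	return max(tp->snd_cwnd >> 1U, 2U);
}

/* Reuse the kernel's Reno slow start / congestion avoidance logic. */
static void my_cc_cong_avoid(struct sock *sk, u32 ack, u32 acked)
{
	tcp_reno_cong_avoid(sk, ack, acked);
}

static struct tcp_congestion_ops my_cc __read_mostly = {
	.ssthresh	= my_cc_ssthresh,
	.cong_avoid	= my_cc_cong_avoid,
	.undo_cwnd	= tcp_reno_undo_cwnd,
	.owner		= THIS_MODULE,
	.name		= "my_cc",
};

static int __init my_cc_register(void)
{
	return tcp_register_congestion_control(&my_cc);
}

static void __exit my_cc_unregister(void)
{
	tcp_unregister_congestion_control(&my_cc);
}

module_init(my_cc_register);
module_exit(my_cc_unregister);
MODULE_LICENSE("GPL");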

In Linux, the congestion control algorithm can be inspected and configured through kernel parameters. The following table shows the two relevant parameters and their meanings.

Parameter                                      Meaning
net.ipv4.tcp_congestion_control                the congestion control algorithm currently in use
net.ipv4.tcp_available_congestion_control     the congestion control algorithms currently available

Specifically, these two parameters show the congestion control algorithms the system currently supports and the one currently in use. The supported algorithms here include bbr; BBR has been available since kernel version 4.9.
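For example, both values can be read with sysctl (output omitted here; the exact list depends on the kernel build and the modules loaded):

sysctl net.ipv4.tcp_available_congestion_control
sysctl net.ipv4.tcp_congestion_control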


If you paid attention, many of the traditional congestion control algorithms mentioned at the beginning of this article do not appear in the output above. In fact, many congestion control algorithms are built as kernel modules and are not loaded by default. The following command lists the congestion control algorithm modules shipped with the running kernel:
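A typical way to list them (the module directory may vary slightly between distributions) is:

ls /lib/modules/$(uname -r)/kernel/net/ipv4/ | grep tcp_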

If you want to use a specific congestion control algorithm, load it with the modprobe command. The example below loads the Vegas congestion control algorithm; checking the usable algorithms afterwards shows that vegas has been added.
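For example (the kernel module for Vegas is named tcp_vegas):

modprobe tcp_vegas
sysctl net.ipv4.tcp_available_congestion_control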

In addition to viewing the available congestion control algorithms and the one currently in use, you can also switch algorithms at run time. The example below switches the default cubic congestion control algorithm to bbr.
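For example, via sysctl (writing to /proc/sys/net/ipv4/tcp_congestion_control has the same effect):

sysctl -w net.ipv4.tcp_congestion_control=bbr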

After switching, verification shows that the running congestion control algorithm has changed from cubic to bbr.
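Reading the parameter back should now report bbr:

sysctl net.ipv4.tcp_congestion_control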

