TSO/GSO in the Linux kernel protocol stack: TCP-layer data transmission

Table of Contents

1 Basic concepts

2 TCP delay segmentation judgment

2.1 Client initialization

2.2 Server-side initialization

2.3 sk_setup_caps()

3 Overall structure

4 TCP send path TSO processing

4.1 tcp_sendmsg()

4.1.1 tcp_current_mss

4.2 tcp_write_xmit()

4.2.1 tcp_init_tso_segs()

4.2.2 tso_fragment()


TSO-related code permeates the entire TCP sending process, so understanding its mechanism is crucial to understanding how TCP transmits data.

1 Basic concepts

We know that the maximum amount of data a network device can transmit at once is the MTU; every packet that IP hands to the device must not exceed MTU bytes. The fragmentation and reassembly function of the IP layer exists precisely to adapt to the device MTU. In theory, TCP need not care about the MTU at all: it could hand packets of any size to IP and let IP fragment them transparently. But TCP is a reliable stream protocol, and if segmentation were left to the IP layer, only the first fragment would carry the TCP header; if a later fragment were lost in transit, the communicating ends could not perceive the loss directly. For this reason TCP always sizes its own segments based on the MTU and tries to avoid IP-layer fragmentation; in other words, TCP ensures that when a TCP segment is encapsulated by IP and handed to the network device, the resulting packet does not exceed the device's MTU.

This behavior means TCP must segment the data passed in from user space. The work is entirely mechanical, yet it consumes CPU time, so on high-speed networks it is worth optimizing. The optimization idea is for TCP to hand large blocks of data (far exceeding the MTU) to the network device and let the device split them according to the MTU, freeing CPU resources. This is the design idea behind TSO (TCP Segmentation Offload).

Obviously, TSO requires hardware support in the network device. Going one step further, TSO is really a delayed-segmentation technique, and delaying segmentation reduces data-copy operations on the send path. So even if the device does not support TSO, there is still a benefit as long as segmentation can be deferred, and not only for TCP: other L4 protocols can benefit too. This leads to GSO (Generic Segmentation Offload), which means delaying segmentation as long as possible, ideally all the way into the device driver. Modifying every network device driver is unrealistic, however, so the kernel performs the segmentation in software slightly earlier, at the point where data is handed to the network device (see dev_queue_xmit()). That is exactly how Linux implements it.

Note: similar concepts such as LSO and UFO can be understood by analogy and are not described here.

2 TCP delay segmentation judgment

For TCP, the handling is identical whether delayed segmentation is ultimately performed by TSO (the network device) or in software (GSO). Let's look at how TCP decides whether it can delay segmentation.

static inline int sk_can_gso(const struct sock *sk)
{
	//effectively checks whether sk->sk_route_caps has the
	//capability bit corresponding to sk->sk_gso_type set
	return net_gso_ok(sk->sk_route_caps, sk->sk_gso_type);
}

static inline int net_gso_ok(int features, int gso_type)
{
	int feature = gso_type << NETIF_F_GSO_SHIFT;
	return (features & feature) == feature;
}

The sk_route_caps field represents the capabilities of the route; sk_gso_type is the GSO type that the L4 protocol expects the lower layers to support. Both fields are set during the three-way handshake; client and server initialization are shown below.

2.1 Client initialization

The client side does this in tcp_v4_connect(); the relevant code is as follows:

int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
{
...
	//set the GSO type to TCPV4; this value is carried in every skb, and
	//the lower layers use it when segmenting to tell which L4 protocol
	//the skb belongs to, so each protocol can be handled differently
	sk->sk_gso_type = SKB_GSO_TCPV4;
	//see below
	sk_setup_caps(sk, &rt->u.dst);
...
}

2.2 Server-side initialization

The server side does this in the last step of the three-way handshake: after receiving the client's ACK, a new sock is created and initialized. The relevant code is as follows:

struct sock *tcp_v4_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
				  struct request_sock *req,
				  struct dst_entry *dst)
{
...
	//same as on the client side
	newsk->sk_gso_type = SKB_GSO_TCPV4;
	sk_setup_caps(newsk, dst);
...
}

2.3 sk_setup_caps()

Devices are reached through routes, and the L4 protocol looks up the route first, so the device's capabilities ultimately end up reflected in the route cache. sk_setup_caps() initializes the sk_route_caps field from the device capabilities recorded in the route cache.

enum {
	SKB_GSO_TCPV4 = 1 << 0,
	SKB_GSO_UDP = 1 << 1,
	/* This indicates the skb is from an untrusted source. */
	SKB_GSO_DODGY = 1 << 2,
	/* This indicates the tcp segment has CWR set. */
	SKB_GSO_TCP_ECN = 1 << 3,
	SKB_GSO_TCPV6 = 1 << 4,
};

#define NETIF_F_GSO_SHIFT	16
#define NETIF_F_GSO_MASK	0xffff0000
#define NETIF_F_TSO		(SKB_GSO_TCPV4 << NETIF_F_GSO_SHIFT)
#define NETIF_F_UFO		(SKB_GSO_UDP << NETIF_F_GSO_SHIFT)
#define NETIF_F_TSO_ECN		(SKB_GSO_TCP_ECN << NETIF_F_GSO_SHIFT)
#define NETIF_F_TSO6		(SKB_GSO_TCPV6 << NETIF_F_GSO_SHIFT)

#define NETIF_F_GSO_SOFTWARE	(NETIF_F_TSO | NETIF_F_TSO_ECN | NETIF_F_TSO6)

void sk_setup_caps(struct sock *sk, struct dst_entry *dst)
{
	__sk_dst_set(sk, dst);
	//the initial value comes from the features field of the network device
	sk->sk_route_caps = dst->dev->features;
	//if GSO is supported, the TSO flags are set in the route capabilities
	//as well, because the L4 protocol does not care whether delayed
	//segmentation is implemented in software or in hardware
	if (sk->sk_route_caps & NETIF_F_GSO)
		sk->sk_route_caps |= NETIF_F_GSO_SOFTWARE;
	//with GSO supported, sk_can_gso() returns non-zero; a few special
	//cases still need to be checked to decide whether GSO is really usable
	if (sk_can_gso(sk)) {
		//dst->header_len is non-zero only when IPSec is in use, and in
		//that case the TSO feature cannot be used
		if (dst->header_len)
			sk->sk_route_caps &= ~NETIF_F_GSO_MASK;
		else
			//GSO requires scatter/gather I/O and checksum offload, because
			//each segment needs its own checksum, which L4 cannot compute
			//in advance. Moreover, without SG I/O delayed segmentation is
			//pointless: L4 would have to keep all skb data in the linear
			//area, forcing extra data copies on the send path
			sk->sk_route_caps |= NETIF_F_SG | NETIF_F_HW_CSUM;
	}
}

The capabilities involved in the code above are listed below:

Capability        Value        Description
NETIF_F_GSO       0x0000 0800  Set if software GSO is enabled; in later kernels this flag is forcibly enabled in register_netdevice()
NETIF_F_TSO       0x0001 0000  Set if the network device supports TSO over IPv4
NETIF_F_TSO_ECN   0x0008 0000  Set if the network device supports TSO of segments with CWR set
NETIF_F_TSO6      0x0010 0000  Set if the network device supports TSO over IPv6

3 Overall structure

TSO processing affects the entire packet transmission path, not just the TCP layer. Let's first look at an overall structural diagram and then analyze TSO handling on the TCP-layer send path; the handling at the other protocol layers will be supplemented later.

Note: The picture comes from: https://www.cnblogs.com/lvyilong316/p/6818231.html
[Figure: TSO processing points on the TCP send path]
As shown in the figure above, TSO processing appears at the following points on the TCP send path:

  1. tcp_sendmsg() calls tcp_current_mss() to determine the maximum amount of data an skb can hold, i.e. to set tp->xmit_size_goal;
  2. tcp_write_xmit() calls tcp_init_tso_segs() to set the GSO fields in the skb; the software GSO path or the NIC will segment the skb based on this information;
  3. tso_fragment() splits the packet; see "The new data sent by the TCP layer of the linux kernel protocol stack".

4 TCP send path TSO processing

4.1 tcp_sendmsg()

First comes tcp_sendmsg(). This function packs user-space data into skbs, so it needs to know how much data each skb should hold. That amount is set by tcp_current_mss(); the code is as follows:

int tcp_sendmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
		size_t size)
{
...
	//tcp_current_mss() sets tp->xmit_size_goal
	mss_now = tcp_current_mss(sk, !(flags&MSG_OOB));
	//size_goal is how much data each skb may hold for this send; it is an
	//integer multiple of mss_now, and tcp_sendmsg() later fills each skb
	//up to size_goal when building skbs
	size_goal = tp->xmit_size_goal;
...
}

4.1.1 tcp_current_mss

//the part of this function that determines the sending MSS was analyzed
//in the "TCP option MSS" note; here we focus on tp->xmit_size_goal
unsigned int tcp_current_mss(struct sock *sk, int large_allowed)
{
	struct tcp_sock *tp = tcp_sk(sk);
	struct dst_entry *dst = __sk_dst_get(sk);
	u32 mss_now;
	u16 xmit_size_goal;
	int doing_tso = 0;

	mss_now = tp->mss_cache;

	//ignore the MSG_OOB case; from the discussion above we know
	//GSO is supported
	if (large_allowed && sk_can_gso(sk) && !tp->urg_mode)
		doing_tso = 1;

	//the three branches below are MSS-related
	if (dst) {
		u32 mtu = dst_mtu(dst);
		if (mtu != inet_csk(sk)->icsk_pmtu_cookie)
			mss_now = tcp_sync_mss(sk, mtu);
	}
	if (tp->rx_opt.eff_sacks)
		mss_now -= (TCPOLEN_SACK_BASE_ALIGNED +
			    (tp->rx_opt.eff_sacks * TCPOLEN_SACK_PERBLOCK));
#ifdef CONFIG_TCP_MD5SIG
	if (tp->af_specific->md5_lookup(sk, sk))
		mss_now -= TCPOLEN_MD5SIG_ALIGNED;
#endif

	//xmit_size_goal is initialized to the MSS
	xmit_size_goal = mss_now;
	//with TSO, xmit_size_goal can be larger
	if (doing_tso) {
		//65535 minus the protocol headers, options included
		xmit_size_goal = (65535 -
				  inet_csk(sk)->icsk_af_ops->net_header_len -
				  inet_csk(sk)->icsk_ext_hdr_len -
				  tp->tcp_header_len);
		//bound xmit_size_goal to at most half of the peer's receive window
		xmit_size_goal = tcp_bound_to_half_wnd(tp, xmit_size_goal);
		//round xmit_size_goal down to an integer multiple of the MSS
		xmit_size_goal -= (xmit_size_goal % mss_now);
	}
	//record the resulting xmit_size_goal in the TCB
	tp->xmit_size_goal = xmit_size_goal;

	return mss_now;
}

/* Bound MSS / TSO packet size with the half of the window */
static int tcp_bound_to_half_wnd(struct tcp_sock *tp, int pktsize)
{
	//max_window is the largest receive window the peer is known to have
	//advertised; if pktsize exceeds half of that window, clamp it to
	//half of the window
	if (tp->max_window && pktsize > (tp->max_window >> 1))
		return max(tp->max_window >> 1, 68U - tp->tcp_header_len);
	else
		//otherwise leave it unchanged
		return pktsize;
}

4.2 tcp_write_xmit()

static int tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle)
{
...
	unsigned int tso_segs;

	while ((skb = tcp_send_head(sk))) {
...
		//initialize the skb's gso fields from the MSS; returns how many
		//TSO segments this skb will be split into for transmission
		tso_segs = tcp_init_tso_segs(sk, skb, mss_now);
		BUG_ON(!tso_segs);
...
		if (tso_segs == 1) {
			//Nagle check: if small segments are still unacknowledged,
			//this send attempt fails
			if (unlikely(!tcp_nagle_test(tp, skb, mss_now,
				(tcp_skb_is_last(sk, skb) ? nonagle : TCP_NAGLE_PUSH)))) {
				break;
			}
		} else {
			if (tcp_tso_should_defer(sk, skb))
				break;
		}
		//limit is the number of bytes that may be sent this round; if the
		//skb is larger than limit, it has to be split
		limit = mss_now;
		if (tso_segs > 1)
			limit = tcp_mss_split_point(sk, skb, mss_now, cwnd_quota);
		if (skb->len > limit && unlikely(tso_fragment(sk, skb, limit, mss_now)))
			break;
...
	}
...
}

4.2.1 tcp_init_tso_segs()

This function sets the GSO-related fields in the skb and returns the number of TSO segments the skb will be split into.

/* This must be invoked the first time we consider transmitting
 * SKB onto the wire.
 */
static int tcp_init_tso_segs(struct sock *sk, struct sk_buff *skb, unsigned int mss_now)
{
	int tso_segs = tcp_skb_pcount(skb);
	//cond1: tso_segs == 0 means the skb's GSO info has not been initialized yet
	//cond2: the MSS has changed, so the GSO info must be recomputed
	if (!tso_segs || (tso_segs > 1 && tcp_skb_mss(skb) != mss_now)) {
		tcp_set_skb_tso_segs(sk, skb, mss_now);
		tso_segs = tcp_skb_pcount(skb);
	}
	//return the number of segments to split into
	return tso_segs;
}

/* Due to TSO, an SKB can be composed of multiple actual
 * packets.  To keep these tracked properly, we use this.
 */
static inline int tcp_skb_pcount(const struct sk_buff *skb)
{
	//gso_segs records how many packets the NIC should split this skb
	//into when transmitting it
	return skb_shinfo(skb)->gso_segs;
}

/* This is valid iff tcp_skb_pcount() > 1. */
static inline int tcp_skb_mss(const struct sk_buff *skb)
{
	//gso_size records the segment size this skb should be split by,
	//i.e. the MSS used last time
	return skb_shinfo(skb)->gso_size;
}

//set the GSO information in the skb, i.e. the gso_segs, gso_size and
//gso_type fields of skb_shared_info
static void tcp_set_skb_tso_segs(struct sock *sk, struct sk_buff *skb, unsigned int mss_now)
{
	//if the skb holds no more than one MSS of data, or GSO is not
	//supported at all, it is a single segment
	if (skb->len <= mss_now || !sk_can_gso(sk)) {
		/* Avoid the costly divide in the normal non-TSO case.*/
		//only gso_segs needs to be set to 1; the other two fields
		//are meaningless in this case
		skb_shinfo(skb)->gso_segs = 1;
		skb_shinfo(skb)->gso_size = 0;
		skb_shinfo(skb)->gso_type = 0;
	} else {
		//number of segments: skb->len divided by the MSS, rounded up
		skb_shinfo(skb)->gso_segs = DIV_ROUND_UP(skb->len, mss_now);
		skb_shinfo(skb)->gso_size = mss_now;
		//gso_type comes from the TCB; see above for its initialization
		skb_shinfo(skb)->gso_type = sk->sk_gso_type;
	}
}

4.2.2 tso_fragment()

tso_fragment() splits the packet; see "The new data sent by the TCP layer of the linux kernel protocol stack".


Origin blog.csdn.net/wangquan1992/article/details/109018488