The send window in Linux kernel protocol stack TCP data transmission

Table of Contents

1 Overview of the send window

2 Update of snd_una and snd_wnd

2.1 Send window initialization

2.1.1 Client initialization

2.1.2 Server-side initialization

2.2 Local receiving window rcv_wnd notification

2.2.1 Client send

2.2.2 Server send

2.3 Update the sending window during transmission

2.3.1 Update conditions of sending window

3 The influence of the sending window on the sending process


TCP transmission is controlled by a sliding window whose size is limited by both the send window and the congestion window. The congestion window is the product of the congestion control algorithm, and the send window is the product of flow control. This note covers the send window: how it is initialized and updated, and how it affects the data sending process.

1 Overview of the send window

The TCP send window can be represented by the following figure:
[Figure: the TCP send window]

As shown in the figure, several members of the TCB (struct tcp_sock) are closely related to the send window.

struct tcp_sock {
...
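	//Next sequence number to send; the data at sequence number snd_nxt has not been sent yet
	u32	snd_nxt;	/* Next sequence we send		*/
	//Smallest sequence number that has been sent but not yet acknowledged. Note that the data
	//at snd_una has already been sent, so the ACK number we most want to receive is greater
	//than snd_una. There is one special case: once all sent data has been acknowledged,
	//snd_una equals the next sequence number to send, so the data at snd_una has not been
	//sent yet. See how tcp_ack() updates snd_una below
	u32	snd_una;	/* First byte we want an ack for	*/
	//Send window size in bytes, taken from the window field of incoming segment headers,
	//i.e. the free space left in the peer's receive buffer
	u32	snd_wnd;	/* The window we expect to receive	*/
	//Largest window the peer has advertised so far; can be taken as the maximum size of the
	//peer's receive buffer
	u32	max_window;	/* Maximal window ever seen from peer	*/
	//Once a write system call returns successfully, the data has been accepted by TCP and
	//every byte must be assigned a sequence number. write_seq is the next sequence number
	//to assign; its initial value is generated by secure_tcp_sequence_number().
	//Note that the sequence number equal to write_seq has not been assigned yet
	u32	write_seq;	/* Tail(+1) of data held in tcp send buffer */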
...
};

Note: the size of the send window reflects the free space in the peer's receive buffer, that is, the size of the peer's receive window.

2 Update of snd_una and snd_wnd

snd_una is the left edge of the send window. When this field advances, the whole send window moves forward even if the window size snd_wnd does not change, so from the flow control point of view more data can be sent (whether it can actually be sent also depends on other factors such as the congestion window).
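The sliding behaviour described above can be sketched in user space. This is a minimal model, not kernel code: the struct send_window type and the helpers usable_window() and ack_advance() are hypothetical names introduced for illustration, and the congestion window is deliberately ignored.

```c
#include <stdint.h>

/* Simplified user-space model of the send-window fields (not the kernel struct). */
struct send_window {
	uint32_t snd_una;	/* oldest unacknowledged sequence number */
	uint32_t snd_nxt;	/* next sequence number to send */
	uint32_t snd_wnd;	/* window advertised by the peer, in bytes */
};

/* Bytes of new data that flow control alone still allows.
 * Unsigned 32-bit subtraction keeps the result correct across
 * sequence number wraparound. */
uint32_t usable_window(const struct send_window *w)
{
	uint32_t in_flight = w->snd_nxt - w->snd_una;

	return in_flight >= w->snd_wnd ? 0 : w->snd_wnd - in_flight;
}

/* Model of an ACK that only moves the left edge; snd_wnd is unchanged. */
void ack_advance(struct send_window *w, uint32_t ack)
{
	w->snd_una = ack;
}
```

With snd_una = 1000, snd_nxt = 3000 and snd_wnd = 4000, usable_window() gives 2000 bytes. After ack_advance(&w, 2000) the window size is still 4000, but 3000 bytes can now be sent, which is the "window moves forward" effect.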

2.1 Send window initialization

As one would expect, snd_una is initialized when the first segment is sent, and snd_wnd when the first incoming segment is processed, so the client and the server need to be looked at separately.

2.1.1 Client initialization

The client initializes snd_una while sending the SYN segment. The relevant code is as follows:

int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
{
...
	//Choose the initial send sequence number
	if (!tp->write_seq)
		tp->write_seq = secure_tcp_sequence_number(inet->saddr,
							   inet->daddr,
							   inet->sport,
							   usin->sin_port);
...
}
static void tcp_connect_init(struct sock *sk)
{
...
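	//The send window size comes from the window field of incoming segment headers;
	//no segment has been received yet, so initialize it to 0
	tp->snd_wnd = 0;
	//Initialize snd_una to the first sequence number; after this function returns,
	//write_seq is assigned to the SYN segment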
	tp->snd_una = tp->write_seq;
...
}

The initialization of snd_wnd occurs when the SYN+ACK segment is received, and the relevant code is as follows:

static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
					 struct tcphdr *th, unsigned len)
{
...
	if (th->ack) {
...
		tp->snd_wnd = ntohs(th->window);
...
	}
}

2.1.2 Server-side initialization

Intuitively, the server side should initialize snd_una when it sends the SYN+ACK segment, but in fact this happens when the ACK segment of the third handshake step is received. As described in the note on how the TCP server handles the received ACK packet, the child socket is created once the three-way handshake completes, and tcp_child_process() then calls tcp_rcv_state_process() to continue processing the ACK segment. The code is as follows:

int tcp_child_process(struct sock *parent, struct sock *child,
		      struct sk_buff *skb)
{
	int ret = 0;
	int state = child->sk_state;

	//If the user process has not locked child, let child process this ACK segment again,
	//which moves the child socket from TCP_SYN_RECV to TCP_ESTABLISHED
	if (!sock_owned_by_user(child)) {
		//See below
		ret = tcp_rcv_state_process(child, skb, tcp_hdr(skb),
					    skb->len);
		/* Wakeup parent, send SIGIO */
		//The child socket changed state; wake up the process on the listening socket, which may be blocked in accept()
		if (state == TCP_SYN_RECV && child->sk_state != state)
			parent->sk_data_ready(parent, 0);
	} else {
		/* Alas, it is possible again, because we do lookup
		 * in main socket hash table and lock on listening
		 * socket does not protect us more.
		 */
		 //Queue the skb on the backlog for later processing
		sk_add_backlog(child, skb);
	}

	bh_unlock_sock(child);
	sock_put(child);
	return ret;
}

int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
			  struct tcphdr *th, unsigned len)
{
...
	/* step 5: check the ACK field */
	if (th->ack) {
		int acceptable = tcp_ack(sk, skb, FLAG_SLOWPATH);

		switch (sk->sk_state) {
		case TCP_SYN_RECV:
			if (acceptable) {
...
				tcp_set_state(sk, TCP_ESTABLISHED);
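				//Initialize the local snd_una from the ACK number in the ACK segment
				tp->snd_una = TCP_SKB_CB(skb)->ack_seq;
				//Initialize the send window size from the window field of the incoming segment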
				tp->snd_wnd = ntohs(th->window) <<
					      tp->rx_opt.snd_wscale;
...
			}
			break;
...
		}//end of switch()
	} else
		goto discard;
...
	return 0;
}
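The two initialization paths differ in one detail: the client takes the window from a SYN+ACK segment without scaling, while the server takes it from the final ACK and shifts it by snd_wscale. The sketch below models that decoding step in user space. The function name decode_peer_window() and its parameters are hypothetical; only ntohs() and the shift mirror the listings above.

```c
#include <stdint.h>
#include <arpa/inet.h>	/* ntohs(), htons() */

/* Convert the 16-bit window field of a received TCP header (network byte
 * order) into a byte count.
 *
 * syn        - non-zero if the segment has the SYN flag set (SYN or SYN+ACK)
 * snd_wscale - scale factor announced by the peer (0 if scaling is not in use)
 */
uint32_t decode_peer_window(uint16_t window_field, int syn, uint8_t snd_wscale)
{
	uint32_t wnd = ntohs(window_field);

	/* The window in SYN and SYN+ACK segments is never scaled. */
	if (!syn)
		wnd <<= snd_wscale;
	return wnd;
}
```

For a window field of 5840 and a scale factor of 7, a SYN+ACK yields 5840 bytes, while the same field in a later ACK yields 5840 << 7 = 747520 bytes.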

2.2 Local receiving window rcv_wnd notification

The initialization process above shows that both the client and the server set their local send window snd_wnd from the window field of the TCP header, ntohs(th->window), when a segment arrives from the peer. So how does each end advertise its own receive window in the first place?

2.2.1 Client send

tcp_connect
	--tcp_connect_init
		--tcp_select_initial_window //compute the local receive window from the local receive buffer and the MTU
	--tcp_transmit_skb //send the window
	

static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
			    gfp_t gfp_mask)
{
	...
	/* Build TCP header and checksum it. */
	th = tcp_hdr(skb);
	th->source		= inet->sport;
	th->dest		= inet->dport;
	th->seq			= htonl(tcb->seq);
	th->ack_seq		= htonl(tp->rcv_nxt);
	*(((__be16 *)th) + 6)	= htons(((tcp_header_size >> 2) << 12) |
					tcb->flags);

	if (unlikely(tcb->flags & TCPCB_FLAG_SYN)) {
		/* RFC1323: The window in SYN & SYN/ACK segments
		 * is never scaled.
		 */
		th->window	= htons(min(tp->rcv_wnd, 65535U));
	} else {
		th->window	= htons(tcp_select_window(sk));
	}
	...
}
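The header-building step above can be modelled in user space as follows. This is a simplified sketch: encode_window_field() is a hypothetical helper, and the non-SYN branch only performs the right shift by rcv_wscale, whereas the kernel's tcp_select_window() also applies further rules before producing the value.

```c
#include <stdint.h>
#include <arpa/inet.h>	/* htons(), ntohs() */

/* Build the 16-bit window field (network byte order) for an outgoing segment.
 *
 * rcv_wnd    - local receive window in bytes
 * syn        - non-zero for SYN and SYN+ACK segments
 * rcv_wscale - local window scale factor (assumed already negotiated)
 */
uint16_t encode_window_field(uint32_t rcv_wnd, int syn, uint8_t rcv_wscale)
{
	uint32_t wnd;

	if (syn)
		wnd = rcv_wnd;			/* never scaled in SYN / SYN+ACK */
	else
		wnd = rcv_wnd >> rcv_wscale;	/* peer shifts it back on receipt */

	/* The header field is only 16 bits wide. */
	if (wnd > 65535U)
		wnd = 65535U;
	return htons((uint16_t)wnd);
}
```

A 100000-byte receive window is therefore advertised as 65535 in a SYN, and as 781 (100000 >> 7) in a later segment when the scale factor is 7. The scaled form loses the low-order bits, so the peer reconstructs 781 << 7 = 99968 bytes.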

2.2.2 Server send

tcp_v4_send_synack
	--__tcp_v4_send_synack
		--tcp_make_synack
			--tcp_select_initial_window
		--ip_build_and_send_pkt	

struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
				struct request_sock *req)
{
	struct inet_request_sock *ireq = inet_rsk(req);
	struct tcp_sock *tp = tcp_sk(sk);
	struct tcphdr *th;
	int tcp_header_size;
	struct tcp_out_options opts;
	struct sk_buff *skb;
	struct tcp_md5sig_key *md5;
	__u8 *md5_hash_location;
	int mss;

	skb = sock_wmalloc(sk, MAX_TCP_HEADER + 15, 1, GFP_ATOMIC);
	...

	if (req->rcv_wnd == 0) { /* ignored for retransmitted syns */
		__u8 rcv_wscale;
		/* Set this up on the first call only */
		req->window_clamp = tp->window_clamp ? : dst_metric(dst, RTAX_WINDOW);
		/* tcp_full_space because it is guaranteed to be the first packet */
		tcp_select_initial_window(tcp_full_space(sk),
			mss - (ireq->tstamp_ok ? TCPOLEN_TSTAMP_ALIGNED : 0),
			&req->rcv_wnd,
			&req->window_clamp,
			ireq->wscale_ok,
			&rcv_wscale);
		ireq->rcv_wscale = rcv_wscale;
	}
	...
	th = tcp_hdr(skb);
	memset(th, 0, sizeof(struct tcphdr));
	th->syn = 1;
	th->ack = 1;
	...

	/* RFC1323: The window in SYN & SYN/ACK segments is never scaled. */
	th->window = htons(min(req->rcv_wnd, 65535U));
	...
	return skb;
}

2.3 Update the sending window during transmission

During data transmission, snd_una and snd_wnd are updated when an ACK is received. If an incoming segment carries an ACK, tcp_ack() is eventually called to process the acknowledgment.

In the fast path, the segment has already passed header prediction, so the window field of the incoming segment cannot have changed. There is no need to update snd_wnd; only snd_una is updated.

In the slow path the situation is more complicated and more checks are needed, so tcp_ack_update_window() is called to update the send window.

static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
{
...
	u32 prior_snd_una = tp->snd_una;
	u32 ack = TCP_SKB_CB(skb)->ack_seq;
...
	if (!(flag & FLAG_SLOWPATH) && after(ack, prior_snd_una)) {
...
		//Fast path: update snd_una with ack. Because this is the fast path, the advertised
		//window size cannot have changed, so snd_wnd does not need to be updated
		tp->snd_una = ack;
		flag |= FLAG_WIN_UPDATE;
...
	} else {
...
		//Slow path: call the function that updates the window
		flag |= tcp_ack_update_window(sk, skb, ack, ack_seq);
...
	}
...
}

/* Update our send window.
 *
 * Window update algorithm, described in RFC793/RFC1122 (used in linux-2.2
 * and in FreeBSD. NetBSD's one is even worse.) is wrong.
 */
static int tcp_ack_update_window(struct sock *sk, struct sk_buff *skb, u32 ack,
				 u32 ack_seq)
{
	struct tcp_sock *tp = tcp_sk(sk);
	int flag = 0;
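	//Window advertised in the ACK segment
	u32 nwin = ntohs(tcp_hdr(skb)->window);

	//Per RFC 1323 the window field in SYN and SYN+ACK segments is never scaled, so the
	//window scale factor is applied to the advertised window only when the SYN flag is not set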
	if (likely(!tcp_hdr(skb)->syn))
		nwin <<= tp->rx_opt.snd_wscale;

	if (tcp_may_update_window(tp, ack, ack_seq, nwin)) {
		//The window needs to be updated
		flag |= FLAG_WIN_UPDATE;
		//Update snd_wl1
		tcp_update_wl(tp, ack, ack_seq);

		if (tp->snd_wnd != nwin) {
			//Update the send window
			tp->snd_wnd = nwin;

			/* Note, it is the only place, where
			 * fast path is recovered for sending TCP.
			 */
			//The send window size changed, so re-evaluate whether to set the header prediction flags
			tp->pred_flags = 0;
			tcp_fast_path_check(sk);
			//Update the largest advertised window seen so far
			if (nwin > tp->max_window) {
				tp->max_window = nwin;
				//The MSS depends on max_window, so it must be recomputed when max_window changes
				tcp_sync_mss(sk, inet_csk(sk)->icsk_pmtu_cookie);
			}
		}
	}
	//Update the left edge of the send window
	tp->snd_una = ack;
	return flag;
}

2.3.1 Update conditions of sending window

The core of the slow path is to determine when the sending window should be updated, which is implemented by tcp_may_update_window().

/* Check that window update is acceptable.
 * The function assumes that snd_una<=ack<=snd_next.
 */
static inline int tcp_may_update_window(const struct tcp_sock *tp,
					const u32 ack, const u32 ack_seq,
					const u32 nwin)
{
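	//cond1: the ACK number is greater than snd_una, so new data has been acknowledged and the left edge of the send window can move;
	//cond2: the sequence number of the ACK segment is greater than snd_wl1, so the peer has sent new data and snd_wl1 needs updating;
	//cond3: the sequence number equals snd_wl1 and the advertised window has grown.
	//If any one of these conditions holds, the send window may be updated (I have not really understood condition 2...).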
	return (after(ack, tp->snd_una) ||
		after(ack_seq, tp->snd_wl1) ||
		(ack_seq == tp->snd_wl1 && nwin > tp->snd_wnd));
}
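The three conditions can be exercised in a small user-space model. This is a sketch, not the kernel function: struct wnd_state, seq_after() and may_update_window() are hypothetical names. On condition 2: snd_wl1 (SND.WL1 in RFC 793) holds the sequence number of the segment that last updated the window, so a segment with a later sequence number carries fresher window information. A reordered older segment fails this test and cannot overwrite a newer window.

```c
#include <stdint.h>

/* Wraparound-safe "seq1 is later than seq2", the same idea as the kernel's after(). */
int seq_after(uint32_t seq1, uint32_t seq2)
{
	return (int32_t)(seq2 - seq1) < 0;
}

/* Minimal model of the fields consulted by the window-update check. */
struct wnd_state {
	uint32_t snd_una;	/* left edge of the send window */
	uint32_t snd_wl1;	/* sequence number of the segment used for the last window update */
	uint32_t snd_wnd;	/* current send window in bytes */
};

/* Returns 1 if a segment with the given ACK number (ack), sequence number
 * (ack_seq) and advertised window (nwin) may update the send window. */
int may_update_window(const struct wnd_state *s, uint32_t ack,
		      uint32_t ack_seq, uint32_t nwin)
{
	return seq_after(ack, s->snd_una) ||		/* cond1: new data acknowledged */
	       seq_after(ack_seq, s->snd_wl1) ||	/* cond2: segment is newer */
	       (ack_seq == s->snd_wl1 && nwin > s->snd_wnd);	/* cond3: window grew */
}
```

With snd_una = 1000, snd_wl1 = 500 and snd_wnd = 4000, a segment with sequence number 600 may shrink the window to 2000 (condition 2), whereas a segment with sequence number 400 is ignored even if it advertises 8000 bytes.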

3 The influence of the sending window on the sending process

The send window is the key to flow control. It mainly constrains the sending of new data rather than retransmission, because retransmitted data was already within the peer's receive capacity when it was first sent.

As described in "Linux Kernel Protocol Stack TCP Layer Data Transmission New Data", new data is sent by two key functions, tcp_write_xmit() and tcp_push_one(), which are very similar. See tcp_snd_wnd_test() and tcp_mss_split_point(), analyzed in the earlier notes, to understand how the send window affects the sending process.
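As a rough illustration of the kind of check tcp_snd_wnd_test() performs, the sketch below decides whether the next segment fits inside the send window. It is a user-space approximation written from the description in this note: snd_wnd_allows() and seq_is_after() are hypothetical names, and the rule that only the first cur_mss bytes of a larger buffer are tested is an assumption about how oversized buffers are handled.

```c
#include <stdint.h>

/* Wraparound-safe "seq1 is later than seq2". */
int seq_is_after(uint32_t seq1, uint32_t seq2)
{
	return (int32_t)(seq2 - seq1) < 0;
}

/* Returns 1 if a segment starting at seq fits inside the send window.
 *
 * snd_una, snd_wnd - left edge and size of the send window
 * seq, len         - start sequence number and length of the queued data
 * cur_mss          - current MSS; at most this many bytes go out in one segment
 */
int snd_wnd_allows(uint32_t snd_una, uint32_t snd_wnd,
		   uint32_t seq, uint32_t len, uint32_t cur_mss)
{
	uint32_t wnd_end = snd_una + snd_wnd;	/* right edge of the send window */
	uint32_t end_seq = seq + (len > cur_mss ? cur_mss : len);

	return !seq_is_after(end_seq, wnd_end);
}
```

With snd_una = 1000 and snd_wnd = 4000 the right edge is 5000, so a 1460-byte segment starting at 3000 may be sent, while one starting at 4000 must wait for the window to open or slide.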

Origin blog.csdn.net/wangquan1992/article/details/109030547