Linux kernel protocol stack: TCP data receive entry

Table of Contents

1 The three receive queues

2 Receive entry: tcp_v4_rcv()

2.1 Adding to the prequeue: tcp_prequeue()

2.2 Adding to the backlog queue: sk_add_backlog()

2.3 Receive queue processing: tcp_v4_do_rcv()


1 The three receive queues

After the IP layer has reassembled a packet, if the protocol field in the IP header indicates that the upper-layer protocol is TCP, tcp_v4_rcv() is called to pass the data up to the transport layer for further processing. The transport layer's overall processing is quite involved, so this note starts by looking at how its entry point works.

TCP's overall processing flow for input packets can be summarized by the following diagram:
[Figure: TCP input processing flow through the prequeue, backlog, and receive queues]

As the figure shows, the TCP receive path involves three queues: the prequeue, the receive queue, and the backlog queue. We first introduce the role of each queue and then trace the source code.
From the perspective of data reception, the TCP transmission control block (TCB) can be in one of three states (a sketch of the lock state the stack inspects follows this list):

  1. A user process is reading or writing data and currently holds the TCB lock.
  2. A user process wants to read or write data, but has gone to sleep because no data is available; while it waits, the TCB is not locked by the user process.
  3. No user process is reading or writing data at all, so the TCB is likewise not locked by a user process.
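
To make case 1 versus cases 2 and 3 concrete, here is a hedged sketch of the per-socket lock the stack inspects. It is condensed from socket_lock_t in include/net/sock.h of kernels from roughly this era (~2.6.24); field names vary slightly between versions:

// Condensed sketch of the per-socket lock (~2.6.24 era; simplified).
typedef struct {
	spinlock_t		slock;	// BH-level spinlock taken by bh_lock_sock()
	int			owned;	// non-zero while a process owns the socket
					// via lock_sock(); this is what
					// sock_owned_by_user() tests
	wait_queue_head_t	wq;	// processes sleeping in lock_sock()
} socket_lock_t;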

One more consideration: the protocol stack processes input packets in softirq context, and for performance reasons we always want softirqs to finish quickly.

With that in mind, the figure can be read as follows:

  • If the TCB is locked by a user process (case 1), mutual exclusion leaves no choice: to end softirq processing quickly, the packet is put on the backlog queue. It is actually processed when the user process releases the TCB;
  • If the TCB is not locked by a process, the stack first tries to put the packet on the prequeue, again so the softirq can finish as soon as possible. The packet is then processed in process context while the user process reads data;
  • If the TCB is not locked but the prequeue does not accept the packet (for performance reasons, e.g. because the prequeue cannot be allowed to grow without bound), there is no better option: the packet must be fully processed in the softirq, after which it is added to the receive queue.

In summary (a sketch of where these queues live in the socket structures follows):

  • Packets on the receive queue have already been fully processed by TCP (checksum verification, ACKing and so on are done), so their data can be read directly by user-space programs. By contrast, packets on the backlog queue and the prequeue still need TCP processing; at the appropriate time they, too, are handled by tcp_v4_do_rcv();
  • Each of the three queues was designed with a specific purpose, and it is important to understand the intent behind that design.
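
For reference, a hedged sketch of where the three queues live in the socket structures of kernels from this era, condensed from include/net/sock.h and include/linux/tcp.h (~2.6.24; most fields omitted):

// Condensed extract, not the full definitions (~2.6.24 era).
struct sock {
	/* ... */
	struct sk_buff_head	sk_receive_queue; // receive queue: fully processed
						  // segments, readable by user space
	struct {
		struct sk_buff *head;	// backlog queue: raw segments queued
		struct sk_buff *tail;	// while a process owns the socket
	} sk_backlog;
	/* ... */
};

// Inside struct tcp_sock: the "ucopy" state used when copying to user space.
struct {
	struct sk_buff_head	prequeue; // prequeue: raw segments deferred to
					  // the reader in tcp_recvmsg()
	struct task_struct	*task;	  // reader blocked waiting for data
	int			memory;	  // bytes currently on the prequeue
	/* ... */
} ucopy;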

2 Receive entry: tcp_v4_rcv()

tcp_v4_rcv() is the entry function for TCP's receive path.

int tcp_v4_rcv(struct sk_buff *skb)
{
	const struct iphdr *iph;
	struct tcphdr *th;
	struct sock *sk;
	int ret;
	// Drop packets not destined for this host
	if (skb->pkt_type != PACKET_HOST)
		goto discard_it;

	/* Count it even if it's bad */
	TCP_INC_STATS_BH(TCP_MIB_INSEGS);

	// The code below mainly validates the length of the TCP segment. Note that
	// pskb_may_pull() does more than validate: if a TCP segment was fragmented
	// at the IP layer, the receiving side's IP layer reassembles it, which can
	// leave the skb handed to TCP carrying its data in multiple fragments; in
	// that case the function pulls the requested bytes into the linear area.

	// Make sure the skb's linear area holds at least 20 bytes (the basic TCP header)
	if (!pskb_may_pull(skb, sizeof(struct tcphdr)))
		goto discard_it;

	th = tcp_hdr(skb);

	if (th->doff < sizeof(struct tcphdr) / 4)
		goto bad_packet;
	// Make sure the skb's linear area covers the full TCP header, options included
	if (!pskb_may_pull(skb, th->doff * 4))
		goto discard_it;

	// Checksum validation; on failure the segment is silently dropped without
	// generating any error report
	/* An explanation is required here, I think.
	 * Packet length and doff are validated by header prediction,
	 * provided case of th->doff==0 is eliminated.
	 * So, we defer the checks. */
	if (!skb_csum_unnecessary(skb) && tcp_v4_checksum_init(skb))
		goto bad_packet;

	// Initialize the control block in the skb
	th = tcp_hdr(skb);
	iph = ip_hdr(skb);
	TCP_SKB_CB(skb)->seq = ntohl(th->seq);
	TCP_SKB_CB(skb)->end_seq = (TCP_SKB_CB(skb)->seq + th->syn + th->fin +
				    skb->len - th->doff * 4);
	TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq);
	TCP_SKB_CB(skb)->when	 = 0;
	TCP_SKB_CB(skb)->flags	 = iph->tos;
	TCP_SKB_CB(skb)->sacked	 = 0;

	// Look up the corresponding TCB in ehash or bhash from the incoming segment's
	// source and destination; this step decides which socket handles the input
	// packet. The lookup also takes a reference on the TCB
	sk = __inet_lookup(skb->dev->nd_net, &tcp_hashinfo, iph->saddr,
			th->source, iph->daddr, th->dest, inet_iif(skb));
	if (!sk)
		goto no_tcp_socket;

process:
	// TCP_TIME_WAIT needs special handling; not covered here
	if (sk->sk_state == TCP_TIME_WAIT)
		goto do_time_wait;
	// IPsec policy check
	if (!xfrm4_policy_check(sk, XFRM_POLICY_IN, skb))
		goto discard_and_relse;
	nf_reset(skb);
	// TCP socket filter; if the packet is filtered out, stop processing
	if (sk_filter(sk, skb))
		goto discard_and_relse;
	// This field is meaningless once we reach the transport layer; clear it
	skb->dev = NULL;

	// Take the lock first so that neither process context nor other softirqs
	// can manipulate this TCB
	bh_lock_sock_nested(sk);
	ret = 0;
	// If the TCB is not currently locked by process context, first try to put
	// the packet on the prequeue; if the prequeue does not take it, process it
	// now and place the result on the receive queue. If the TCB is locked by
	// process context, put the packet straight on the backlog queue
	if (!sock_owned_by_user(sk)) {
	// DMA offload path; ignored here
#ifdef CONFIG_NET_DMA
		struct tcp_sock *tp = tcp_sk(sk);
		if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
			tp->ucopy.dma_chan = get_softnet_dma();
		if (tp->ucopy.dma_chan)
			ret = tcp_v4_do_rcv(sk, skb);
		else
#endif
		{
			// tcp_prequeue() returns 0 when it does not accept the packet;
			// in that case hand it to tcp_v4_do_rcv()
			if (!tcp_prequeue(sk, skb))
				ret = tcp_v4_do_rcv(sk, skb);
		}
	} else {
		// The TCB is locked by a user process; put the packet straight on the backlog queue
		sk_add_backlog(sk, skb);
    }
	// Release the lock
	bh_unlock_sock(sk);
	// Drop the reference on the TCB
	sock_put(sk);
	// Return the processing result
	return ret;

no_tcp_socket:
	if (!xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb))
		goto discard_it;

	if (skb->len < (th->doff << 2) || tcp_checksum_complete(skb)) {
bad_packet:
		TCP_INC_STATS_BH(TCP_MIB_INERRS);
	} else {
		tcp_v4_send_reset(NULL, skb);
	}

discard_it:
	/* Discard frame. */
	kfree_skb(skb);
	return 0;

discard_and_relse:
	sock_put(sk);
	goto discard_it;

do_time_wait:
...
}

2.1 Adding to the prequeue: tcp_prequeue()

/* Packet is added to VJ-style prequeue for processing in process
 * context, if a reader task is waiting. Apparently, this exciting
 * idea (VJ's mail "Re: query about TCP header on tcp-ip" of 07 Sep 93)
 * failed somewhere. Latency? Burstiness? Well, at least now we will
 * see, why it failed. 8)8)				  --ANK
 *
 * NOTE: is this not too big to inline?
 */
static inline int tcp_prequeue(struct sock *sk, struct sk_buff *skb)
{
	struct tcp_sock *tp = tcp_sk(sk);
	// The sysctl_tcp_low_latency parameter (/proc/sys/net/ipv4/tcp_low_latency)
	// means "enable TCP low-latency mode": 1 if enabled, 0 (the default) if not.

	// A non-NULL tp->ucopy.task means a process is blocked on this socket waiting
	// for data. So the two conditions below say: if low-latency mode is off and a
	// process is currently waiting for data, put the packet on the prequeue.

	// Why queueing on the prequeue increases TCP latency is easy to see: packets
	// on the prequeue are in effect processed later, so the ACK back to the peer
	// is also delayed, which adds latency.
	if (!sysctl_tcp_low_latency && tp->ucopy.task) {
		__skb_queue_tail(&tp->ucopy.prequeue, skb);
		tp->ucopy.memory += skb->truesize;
		// To keep the prequeue from growing without bound, a threshold is set
		// here; once it is exceeded, the queued packets are processed right away
		if (tp->ucopy.memory > sk->sk_rcvbuf) {
			struct sk_buff *skb1;
			BUG_ON(sock_owned_by_user(sk));
			while ((skb1 = __skb_dequeue(&tp->ucopy.prequeue)) != NULL) {
				sk->sk_backlog_rcv(sk, skb1);
				NET_INC_STATS_BH(LINUX_MIB_TCPPREQUEUEDROPPED);
			}
			tp->ucopy.memory = 0;
		} else if (skb_queue_len(&tp->ucopy.prequeue) == 1) {
			// The other case of interest: when the prequeue goes from empty to
			// non-empty, wake the waiting process so it gets a chance to drain
			// the prequeue quickly
			wake_up_interruptible(sk->sk_sleep);
			// Delayed-ACK related
			if (!inet_csk_ack_scheduled(sk))
				inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK,
						          (3 * TCP_RTO_MIN) / 4,
							  TCP_RTO_MAX);
		}
		return 1;
	}
	return 0;
}
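
When a reader is blocked in tcp_recvmsg(), the prequeue is drained in process context. Below is a hedged sketch of that drain, condensed from the tcp_prequeue_process() helper found in kernels of this era (statistics bookkeeping trimmed): each queued skb is fed through sk->sk_backlog_rcv(), which for TCP over IPv4 points at tcp_v4_do_rcv().

// Condensed sketch of tcp_prequeue_process() (net/ipv4/tcp.c, ~2.6.24 era).
// Called from tcp_recvmsg() in process context to drain the prequeue.
static void tcp_prequeue_process(struct sock *sk)
{
	struct sk_buff *skb;
	struct tcp_sock *tp = tcp_sk(sk);

	// Receive processing runs with BHs disabled.
	local_bh_disable();
	while ((skb = __skb_dequeue(&tp->ucopy.prequeue)) != NULL)
		sk->sk_backlog_rcv(sk, skb);	// ends up in tcp_v4_do_rcv()
	local_bh_enable();

	// Everything queued has been consumed; reset the memory counter.
	tp->ucopy.memory = 0;
}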

2.2 Adding to the backlog queue: sk_add_backlog()

This simply appends the packet to the transmission control block's backlog queue: a plain tail insertion into a singly linked list tracked by head and tail pointers.

/* The per-socket spinlock must be held here. */
static inline void sk_add_backlog(struct sock *sk, struct sk_buff *skb)
{
	if (!sk->sk_backlog.tail) {
		sk->sk_backlog.head = sk->sk_backlog.tail = skb;
	} else {
		sk->sk_backlog.tail->next = skb;
		sk->sk_backlog.tail = skb;
	}
	skb->next = NULL;
}
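
The backlog, in turn, is drained when the owning process releases the socket: release_sock() invokes a helper that walks the list and pushes every skb through sk->sk_backlog_rcv(). A hedged sketch of that loop, condensed from __release_sock() in net/core/sock.c of kernels from this era (rescheduling details trimmed):

// Condensed sketch of __release_sock() (net/core/sock.c, ~2.6.24 era): drain
// the backlog that accumulated while the process owned the socket.
static void __release_sock(struct sock *sk)
{
	struct sk_buff *skb = sk->sk_backlog.head;

	do {
		// Detach the current batch so softirqs can start queueing a new one.
		sk->sk_backlog.head = sk->sk_backlog.tail = NULL;
		bh_unlock_sock(sk);

		do {
			struct sk_buff *next = skb->next;

			skb->next = NULL;
			sk->sk_backlog_rcv(sk, skb);	// tcp_v4_do_rcv() for TCP/IPv4
			skb = next;
		} while (skb != NULL);

		bh_lock_sock(sk);
	} while ((skb = sk->sk_backlog.head) != NULL);
}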

2.3 Receive queue processing: tcp_v4_do_rcv()

This function performs TCP's actual receive processing for a packet and, if the packet carries data, places it on the receive queue. As noted above, skbs on the prequeue and on the backlog queue are eventually handed to this same function, as can be seen clearly in the handling inside tcp_recvmsg().

The function itself merely dispatches on the state of the TCB; the individual state handlers are covered in separate notes.

/* The socket must have it's spinlock held when we get
 * here.
 *
 * We have a potential double-lock case here, so even when
 * doing backlog processing we use the BH locking scheme.
 * This is because we cannot sleep with the original spinlock
 * held.
 */
int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
{
	struct sock *rsk;
#ifdef CONFIG_TCP_MD5SIG
	/*
	 * We really want to reject the packet as early as possible
	 * if:
	 *  o We're expecting an MD5'd packet and this is no MD5 tcp option
	 *  o There is an MD5 option and we're not expecting one
	 */
	if (tcp_v4_inbound_md5_hash(sk, skb))
		goto discard;
#endif
	// Packets for an established connection are handled by tcp_rcv_established()
	if (sk->sk_state == TCP_ESTABLISHED) { /* Fast path */
		TCP_CHECK_TIMER(sk);
		if (tcp_rcv_established(sk, skb, tcp_hdr(skb), skb->len)) {
			rsk = sk;
			goto reset;
		}
		TCP_CHECK_TIMER(sk);
		return 0;
	}

	// Check the header length again and complete the checksum verification
	if (skb->len < tcp_hdrlen(skb) || tcp_checksum_complete(skb))
		goto csum_err;

	// Handling of packets arriving in the LISTEN state; see the analysis of
	// connection establishment
	if (sk->sk_state == TCP_LISTEN) {
		struct sock *nsk = tcp_v4_hnd_req(sk, skb);
		if (!nsk)
			goto discard;

		if (nsk != sk) {
			if (tcp_child_process(sk, nsk, skb)) {
				rsk = nsk;
				goto reset;
			}
			return 0;
		}
	}

	// Packets arriving in any other TCP state are handled by tcp_rcv_state_process()
	TCP_CHECK_TIMER(sk);
	if (tcp_rcv_state_process(sk, skb, tcp_hdr(skb), skb->len)) {
		rsk = sk;
		goto reset;
	}
	TCP_CHECK_TIMER(sk);
	return 0;

reset:
	tcp_v4_send_reset(rsk, skb);
discard:
	kfree_skb(skb);
	/* Be careful here. If this function gets more complicated and
	 * gcc suffers from register pressure on the x86, sk (in %ebx)
	 * might be destroyed here. This current version compiles correctly,
	 * but you have been warned.
	 */
	return 0;

csum_err:
	TCP_INC_STATS_BH(TCP_MIB_INERRS);
	goto discard;
}
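
How tcp_v4_do_rcv() ends up behind sk->sk_backlog_rcv in the first place is plain initialization wiring. A hedged sketch of the relevant pieces, condensed from tcp_ipv4.c and the generic socket setup of kernels from this era (everything else omitted):

// Condensed wiring sketch (~2.6.24 era): tcp_prot names tcp_v4_do_rcv as its
// backlog handler, and the generic socket setup copies that pointer into
// sk->sk_backlog_rcv, which the prequeue and backlog drain loops invoke.
struct proto tcp_prot = {
	.name		= "TCP",
	.backlog_rcv	= tcp_v4_do_rcv,
	/* ... many other operations omitted ... */
};

// In sock_init_data() (net/core/sock.c):
//	sk->sk_backlog_rcv = sk->sk_prot->backlog_rcv;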

Source: blog.csdn.net/wangquan1992/article/details/109058708