The queues underlying the TCP/IP implementation

Disclaimer: This is an original article by the blogger and may not be reproduced without permission. https://blog.csdn.net/wufaliang003/article/details/91354801

Since studying TCP/IP congestion control algorithms last time, I have wanted to understand more of the underlying principles of TCP/IP. I searched a lot of material online and read Tao Hui's column on high-performance network programming, which taught me a great deal. Today I will summarize it and add some thoughts of my own.

 I know the Java language best, and my network programming experience is mostly with the Netty framework. Norman Maurer, a core Netty contributor, has a well-known piece of advice for Netty development: "Never block the event loop, reduce context-switching." That is, never block the IO threads, and minimize thread switching.

 Why must the IO thread that reads from the network never be blocked? To answer that, we have to start from the classic C10K problem: how does a server support 10,000 concurrent requests? The root cause of C10K is the network IO model. Traditionally, Linux handled network IO in a synchronous, blocking way: each request was assigned its own process or thread. To support 10,000 concurrent connections, would we then need 10,000 threads? Scheduling and context-switching those 10,000 threads, and the memory they consume, would become the bottleneck. The common solution to C10K is I/O multiplexing, and that is exactly what Netty uses.
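To make the I/O multiplexing idea concrete, here is a minimal sketch in Python (not Netty itself, just the same pattern): a single thread uses the `selectors` module to watch a listening socket and every accepted connection at once, so no thread ever blocks on one client. All names here are my own illustration.

```python
import selectors
import socket

# One thread multiplexes many connections instead of one thread per request.
sel = selectors.DefaultSelector()

def accept(server):
    # New connection ready: accept it and watch it for readability too.
    conn, _ = server.accept()
    conn.setblocking(False)
    sel.register(conn, selectors.EVENT_READ, echo)

def echo(conn):
    # Connection readable: echo the data back, or clean up on EOF.
    data = conn.recv(4096)
    if data:
        conn.sendall(data)
    else:
        sel.unregister(conn)
        conn.close()

server = socket.socket()
server.bind(("127.0.0.1", 0))        # ephemeral loopback port
server.listen(128)
server.setblocking(False)
sel.register(server, selectors.EVENT_READ, accept)

# Drive the event loop briefly with one client to show the mechanism.
client = socket.create_connection(server.getsockname())
client.sendall(b"ping")

reply = b""
for _ in range(20):
    for key, _ in sel.select(timeout=0.5):
        key.data(key.fileobj)        # dispatch to accept() or echo()
    client.settimeout(0.1)
    try:
        reply = client.recv(4096)
        break
    except socket.timeout:
        pass
```

In a real reactor framework like Netty, the dispatch loop and handlers are split across the mainReactor/subReactor thread groups described below, but the underlying system call (epoll, kqueue, etc., chosen here by `DefaultSelector`) is the same.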

 Netty has one thread group (mainReactor) responsible for listening and creating connections, another IO thread group (subReactor) responsible for reads and writes on established connections, and a dedicated worker thread group (ThreadPool) for processing business logic.

    Keeping the three independent of one another brings several benefits. First, a dedicated thread group for listening and handling new connections prevents the TCP/IP half-connection (syn) queue and full-connection (accept) queue from filling up. Second, separating the IO thread group from the worker thread group lets network I/O and logic run in parallel, so IO threads are never blocked and the TCP/IP receive queues cannot fill up. Of course, if the business logic is light, i.e. the workload is IO-intensive with little computation, the logic can run directly on the IO threads to avoid a thread switch; that is the second half of Norman Maurer's advice.

 So what are all these TCP/IP queues? Today we will take a closer look at several of them: the half-connection (syn) queue and full-connection (accept) queue used during connection establishment, and the receive, out_of_order, prequeue, and backlog queues used when receiving packets.

Connection establishment queues

 As the figure above shows, there are two queues: the syns queue (half-connection queue) and the accept queue (full-connection queue). During the three-way handshake, when the server receives the client's SYN packet, it puts the relevant information into the half-connection queue and replies to the client with SYN+ACK. When the server receives the client's ACK in the third step, if the full-connection queue is not full, it moves the entry from the half-connection queue into the full-connection queue; otherwise it acts according to the value of tcp_abort_on_overflow, either dropping the ACK so the handshake is retried after a while, or aborting the connection outright.
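From user space, the accept queue is visible as the backlog argument to listen(): completed connections wait there until the application calls accept(). A minimal sketch (the sysctl path is Linux-specific; on other systems the read is simply skipped):

```python
import pathlib
import socket

# The second argument to listen() caps the accept (fully-established) queue.
# The kernel further caps it at net.core.somaxconn; the half-connection
# (SYN) queue is sized separately by net.ipv4.tcp_max_syn_backlog.
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(8)   # at most ~8 completed connections may wait for accept()

somaxconn = pathlib.Path("/proc/sys/net/core/somaxconn")
if somaxconn.exists():                      # Linux only
    print("somaxconn =", somaxconn.read_text().strip())

# A client that connects but is never accept()ed sits in the accept queue;
# accept() pops one entry off that queue.
client = socket.create_connection(server.getsockname())
conn, addr = server.accept()
```

If clients connect faster than the server accepts and the queue of 8 fills, further handshakes are handled according to tcp_abort_on_overflow as described above.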

Queues used when receiving packets

 Compared with connection establishment, TCP's processing logic when receiving packets is more complex and involves more queues and configuration parameters.

 The application receiving a TCP packet and the operating system receiving an incoming TCP packet from the network are two separate processes. Both manipulate the same socket, but lock contention decides which one controls it at any given moment, which produces many different scenarios. For example, if the application is in the middle of receiving a packet while the operating system receives another packet through the network card, how is that handled? If the application never calls recv or read to fetch messages, what does the operating system do with the packets it receives?

 Below, three figures describe three scenarios of TCP packet reception and introduce the four receive-related queues.

Receive scenario one

The figure above illustrates the first TCP receive scenario: the operating system receives the packets first and stores them in the socket's receive queue, and the user process then calls recv to read them.

1) When the network card receives a packet and identifies it as TCP, the call chain eventually reaches the kernel's tcp_v4_rcv function. Because the next packet TCP expects is exactly S1, tcp_v4_rcv adds it directly to the receive queue. The receive queue holds TCP packets that have already been received, stripped of their TCP headers and placed in order, so the user process can read them directly in sequence. Since the socket is not in any user process's context (no user process is reading it), and the packet we need, S1, is exactly the one that arrived, it goes into the receive queue.

2) Packet S3 arrives. Since the next sequence number TCP expects is S2, S3 is added to the out_of_order queue, where all out-of-order packets are kept.

3) Next, the expected packet S2 arrives and goes straight into the receive queue. Because the out_of_order queue is not empty at this point, it needs to be checked.

4) Every time a packet is inserted into the receive queue, the out_of_order queue is checked. After S2 is received, the expected sequence number becomes S3, so the S3 packet in the out_of_order queue is moved to the receive queue.

5) The user process starts to read the socket: it first allocates a block of memory in the process, then calls read or recv. The socket has a series of configuration attributes with default values; for example, it is blocking by default and its SO_RCVLOWAT attribute defaults to 1. Methods like recv also take a flags parameter, which can be set to MSG_WAITALL, MSG_PEEK, MSG_TRUNC, and so on; here we assume the most common value, 0. The process calls recv.

6) The recv call reaches the kernel's tcp_recvmsg function.

7) tcp_recvmsg first locks the socket. A socket can be used by multiple threads, and the operating system uses it too, so concurrency must be handled: to operate on the socket, the lock must be acquired first.

8) At this point the receive queue already holds three packets. The first packet is copied into user-space memory; since the recv call in step 5 did not pass MSG_PEEK, the packet is removed from the queue and freed in the kernel. Conversely, the MSG_PEEK flag would leave the packet in the receive queue, which is why MSG_PEEK is mainly used when multiple processes read the same socket.

9) The second packet is copied. Before each copy, the kernel checks whether the remaining user-space memory is large enough to hold the packet; if not, it returns immediately with the number of bytes already copied.

10) The third packet is copied.

11) The receive queue is now empty, so the SO_RCVLOWAT minimum threshold is checked. If the number of bytes copied so far is below it, the process sleeps and waits for more packets. The default SO_RCVLOWAT is 1, meaning recv can return as soon as any data has been read.

12) The backlog queue is checked. The backlog queue holds packets that arrived from the network card while the user process was copying data. If it contains data at this point, those packets are processed along the way. Here the backlog queue is empty, so the lock is released and the kernel prepares to return to user space.

13) The user process resumes execution; recv returns the number of bytes copied from the kernel.
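The MSG_PEEK behavior from step 8 is easy to observe from user space. A minimal sketch over a local socket pair (a Unix socket here rather than TCP, but the receive-queue semantics of MSG_PEEK are the same):

```python
import socket

# MSG_PEEK copies data out of the kernel receive queue without removing it,
# so a subsequent normal recv() sees the same bytes again.
a, b = socket.socketpair()
a.sendall(b"hello")

peeked = b.recv(5, socket.MSG_PEEK)   # data stays queued in the kernel
read = b.recv(5)                      # the normal read consumes it
```

After the peek, the receive queue still holds all five bytes, so both calls return b"hello"; without MSG_PEEK the first recv would have freed the data in the kernel.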

Receive scenario two

 The second figure shows the second scenario, which involves the prequeue. When the user process calls recv, the socket's queues contain no packets, and since the socket is blocking, the process goes to sleep. The operating system then receives a packet, and this is where the prequeue comes into play. In this scenario tcp_low_latency is the default 0, the socket's SO_RCVLOWAT is the default 1, and the socket is still blocking, as shown below.

 Steps 1, 2, and 3 are handled the same way as before, so we start from step 4.

4) At this point the receive, prequeue, and backlog queues are all empty, so not a single byte has been copied to user memory. The socket's configuration requires at least SO_RCVLOWAT, i.e. 1 byte, to be copied, so the process enters the blocking-socket wait path, waiting at most the time specified by SO_RCVTIMEO. Before sleeping, it releases the socket lock, which means that in step 5 a newly arriving packet is no longer forced into the backlog queue.

5) Packet S1 arrives and is added to the prequeue.

6) After the insertion into the prequeue, the process sleeping on the socket is woken up.

7) Once awake, the user process re-acquires the socket lock; from now on, newly arriving packets can only go into the backlog queue.

8) The process first checks the receive queue, which is of course still empty; it then checks the prequeue and finds packet S1, exactly the sequence number being waited for, so S1 is copied straight from the prequeue into user memory and the kernel copy is freed.

9) One byte of data has now been copied to user memory. The kernel checks whether this length exceeds the minimum threshold, i.e. the smaller of len and SO_RCVLOWAT.

10) Since SO_RCVLOWAT has its default value 1, the copied byte count exceeds the threshold, and the kernel prepares to return to user space, checking the backlog queue on the way out; it is empty, so the socket lock is released.

11) recv returns the number of bytes copied to the user.
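The user-visible half of this scenario, a process asleep in recv being woken when data arrives, can be sketched with a socket pair and a delayed sender (the 0.2-second delay is an arbitrary choice for illustration):

```python
import socket
import threading
import time

# In scenario two the reader blocks inside recv() first, and the packet
# arrives afterwards; the kernel wakes the sleeping process once data
# has been queued for it.
a, b = socket.socketpair()

def late_sender():
    time.sleep(0.2)          # the reader is already asleep in recv() by now
    a.sendall(b"S1")

t = threading.Thread(target=late_sender)
t.start()

data = b.recv(16)            # blocks, then is woken by the arriving data
t.join()
```

Which kernel queue the packet passes through on the way (prequeue vs. receive) depends on tcp_low_latency and on who holds the socket lock, exactly as the steps above describe; the application only sees recv return.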

Receive scenario three

 In the third scenario, the system parameter tcp_low_latency is 1 and the socket has an SO_RCVLOWAT value set. The server first receives packet S1, but its length is smaller than SO_RCVLOWAT. The user process calls recv to read; although it gets part of the data, the minimum threshold has not been reached, so the process sleeps. Meanwhile, the out-of-order packet S3, received before the sleep, goes straight into the backlog queue. Then packet S2 arrives; because the prequeue is not used (tcp_low_latency is set) and S2's starting sequence number is exactly the next byte to be copied, it is copied directly into user memory, and the total copied bytes now satisfy the SO_RCVLOWAT requirement. Finally, before returning to the user, the S3 packet in the backlog queue is copied to the user as well.

1) Packet S1 arrives with exactly the expected sequence number, so it is added directly to the ordered receive queue.

2) The system parameter tcp_low_latency is set to 1, indicating that the server wants the program to receive TCP data promptly. The user calls recv on the blocking socket; the socket's SO_RCVLOWAT value is larger than the first packet, and the user has allocated a sufficiently large buffer of length len.

3) tcp_recvmsg is called to do the receiving, and it first locks the socket.

4) The kernel prepares to process the packets in its various receive queues.

5) The receive queue holds a packet that can be copied directly; its size is below len, so it is copied straight into user memory.

6) While step 5 runs, the kernel receives packet S3. The socket is locked, so S3 goes straight into the backlog queue; note that this packet is out of order.

7) In step 5, packet S1 was copied to user memory, but its size is below SO_RCVLOWAT. Because the socket is blocking, the user process goes to sleep; before it does, it processes the backlog queue once. Since S3 is out of order, it is moved to the out_of_order queue. (The user process always drains the backlog queue before sleeping.)

8) The process sleeps until it times out or the receive queue becomes non-empty.

9) The kernel receives packet S2. Note that because tcp_low_latency is set, the packet does not go into the prequeue to wait for the process.

10) S2 is exactly the packet being waited for, and a user process is sleeping in wait for it, so S2 is copied directly into user memory.

11) After each in-order packet is handled, whether placed in the receive queue or copied straight into user memory, the out_of_order queue is checked for packets that can now be processed. S3 is copied into user memory, and then the user process is woken.

12) The user process wakes up.

13) The kernel checks that the number of bytes copied exceeds SO_RCVLOWAT and that the backlog queue is empty. Both hold, so recv returns.
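The SO_RCVLOWAT threshold that drives this scenario can be set from user space. A minimal sketch, again over a local socket pair; setting SO_RCVLOWAT is not supported on every platform, so the sketch falls back to the default of 1 if the call fails:

```python
import socket

# SO_RCVLOWAT raises the minimum byte count recv() waits for before
# returning (default 1). Here 20 bytes are already queued, which exceeds
# the threshold of 10, so recv() returns immediately with all of them.
a, b = socket.socketpair()
try:
    b.setsockopt(socket.SOL_SOCKET, socket.SO_RCVLOWAT, 10)
except (AttributeError, OSError):
    pass   # unsupported on this platform; recv() then uses the default of 1

a.sendall(b"x" * 20)     # queue 20 bytes, above the threshold
data = b.recv(64)
```

Had only, say, 5 bytes been queued, a blocking recv with a low-water mark of 10 would have slept exactly as the process does in step 7 above, until enough in-order data accumulated.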

 To summarize the roles of the four queues:

  • The receive queue is the real receive queue: TCP packets received by the operating system are saved here after being checked and processed.

  • The backlog queue is the "standby queue". When the socket is in a user process's context (i.e. the user is making a system call on it, such as recv), the operating system saves newly arriving packets to the backlog queue and returns immediately.

  • The prequeue is the "pre-staging queue". When the socket is not being actively used by any user process, that is, the user process has called read or recv but has gone to sleep, the operating system saves arriving packets in the prequeue and returns.

  • The out_of_order queue holds out-of-order packets: when a received packet is not the next sequence number TCP expects, it is placed in the out_of_order queue to await later processing.
