Blocking and non-blocking sockets, as seen from the Linux source code

The author has always felt that it is exciting to understand every line of code, from the application through the framework down to the operating system.
Most high-performance networking frameworks use non-blocking mode. In this article, the author explains the difference between blocking and non-blocking sockets from the perspective of the Linux source code. The kernel source quoted in this article is from Linux 2.6.24.

A simple example of a TCP non-blocking client

If we want to create a non-blocking socket, the C code looks roughly like this:

// create the socket
int sock_fd = socket(AF_INET, SOCK_STREAM, 0);
...
// switch the socket to non-blocking mode
// (fdflags is the current flag set, obtained via fcntl(sock_fd, F_GETFL, 0))
fcntl(sock_fd, F_SETFL, fdflags | O_NONBLOCK);
// connect
....
while (1) {
    int recvlen = recv(sock_fd, recvbuf, RECV_BUF_SIZE, 0);
    ......
}
...
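For reference, a minimal compilable sketch of such a client might look like this (the server address 127.0.0.1:8080, the buffer size, and the omission of error handling are all assumptions made here for illustration):

#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

#define RECV_BUF_SIZE 4096

int main(void) {
	char recvbuf[RECV_BUF_SIZE];
	// create the socket
	int sock_fd = socket(AF_INET, SOCK_STREAM, 0);

	// fetch the current flags, then add O_NONBLOCK
	int fdflags = fcntl(sock_fd, F_GETFL, 0);
	fcntl(sock_fd, F_SETFL, fdflags | O_NONBLOCK);

	struct sockaddr_in addr;
	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_port = htons(8080);
	inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

	// on a non-blocking socket, connect usually returns -1/EINPROGRESS;
	// a real program would wait for writability (select/poll) here
	connect(sock_fd, (struct sockaddr *)&addr, sizeof(addr));

	while (1) {
		ssize_t recvlen = recv(sock_fd, recvbuf, RECV_BUF_SIZE, 0);
		if (recvlen < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
			continue; // no data yet; a real program would poll/epoll instead of spinning
		if (recvlen <= 0)
			break;    // error, or the peer closed the connection
		// ... process recvlen bytes ...
	}
	close(sock_fd);
	return 0;
}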

Because the network protocol stack is very complex, the kernel uses many object-oriented techniques (implemented with function pointers), so we start from socket creation and trace step by step down to the point where the final code is called.

socket creation

Obviously, the kernel's first step is to use AF_INET, SOCK_STREAM, and the last parameter 0 to locate the TCP socket that needs to be created, as shown by the green line in the following figure:
[Figure: inet_family]
We trace the source-code call chain:

socket(AF_INET, SOCK_STREAM, 0)
	|->sys_socket (enter the system call)
		|->sock_create
			|->__sock_create

Let's analyze the key logic inside __sock_create:

const struct net_proto_family *pf;
// RCU (Read-Copy-Update) is a Linux kernel synchronization mechanism, not covered here
// family == AF_INET
pf = rcu_dereference(net_families[family]);
err = pf->create(net, sock, protocol);

Since family is AF_INET (note that PF_INET is defined to be equal to AF_INET in the kernel), the kernel "overloads" pf (a struct net_proto_family) through function pointers, as shown below:
[Figure: net_proto_family]
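For reference, the PF_INET registration in af_inet.c looks like this in 2.6.x kernels (abridged); this is exactly the function-pointer "overload" mentioned above:

static struct net_proto_family inet_family_ops = {
	.family = PF_INET,
	.create = inet_create,
	.owner  = THIS_MODULE,
};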

From the source code we can see that, because the family is AF_INET (PF_INET), net_families[PF_INET].create == inet_create (we will use PF_INET from now on), i.e.
pf->create = inet_create. Tracing the call further:

static int inet_create(struct net *net, struct socket *sock, int protocol)
{
	struct sock *sk;
	......
	// here we look up the protocol handler that matches the request
lookup_protocol:
	// iterate until we find the entry with protocol == answer->protocol
	list_for_each_rcu(p, &inetsw[sock->type]) {
		answer = list_entry(p, struct inet_protosw, list);

		/* Check the non-wild match. */
		if (protocol == answer->protocol) {
			if (protocol != IPPROTO_IP)
				break;
		}
		......
	}
	......
	// here answer refers to the SOCK_STREAM entry
	sock->ops = answer->ops;
	answer_no_check = answer->no_check;
	// here sk->sk_prot is answer_prot, i.e. tcp_prot
	sk = sk_alloc(net, PF_INET, GFP_KERNEL, answer_prot);
	sock_init_data(sock, sk);
	......
}

The above code is the process of locating the SOCK_STREAM entry within INET. Let's look at the specific configuration of inetsw[SOCK_STREAM]:

static struct inet_protosw inetsw_array[] =
{
	{
		.type =       SOCK_STREAM,
		.protocol =   IPPROTO_TCP,
		.prot =       &tcp_prot,
		.ops =        &inet_stream_ops,
		.capability = -1,
		.no_check =   0,
		.flags =      INET_PROTOSW_PERMANENT |
			      INET_PROTOSW_ICSK,
	},
	......
}

Overloading is also used here. AF_INET has three types: TCP, UDP and Raw:
[Figure: sock_ops_proto]

From the above code we can clearly see that sock->ops = &inet_stream_ops, where:

const struct proto_ops inet_stream_ops = {
	.family		   = PF_INET,
	.owner		   = THIS_MODULE,
	......
	.sendmsg	   = tcp_sendmsg,
	.recvmsg	   = sock_common_recvmsg,
	......
}	

That is, sock->ops->recvmsg = sock_common_recvmsg,
and at the same time sock->sk->sk_prot = tcp_prot.

Let's look at the definition of each function overload in tcp_prot:

struct proto tcp_prot = {
	.name			= "TCP",
	.close			= tcp_close,
	.connect		= tcp_v4_connect,
	.disconnect		= tcp_disconnect,
	.accept			= inet_csk_accept,
	......
	// we focus here on the TCP read path
	.recvmsg		= tcp_recvmsg,
	......
}

fcntl controls the blocking/non-blocking state of a socket

We use fcntl to modify the blocking/non-blocking state of the socket. In fact, all fcntl does is store the O_NONBLOCK flag in the f_flags field of the struct file (filp) corresponding to sock_fd, as shown in the following figure.

[Figure: fcntl]

fcntl(sock_fd, F_SETFL, fdflags | O_NONBLOCK);
	|->setfl

Tracing the setfl code:

static int setfl(int fd, struct file * filp, unsigned long arg) {
	......
	filp->f_flags = (arg & SETFL_MASK) | (filp->f_flags & ~SETFL_MASK);
	......
}

As the figure above shows, sock_fd is used to look up the corresponding socket's struct file in task_struct (the process structure) -> files_struct -> fd_array, and then file->f_flags is modified.
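From user space, the idiomatic way to flip this flag is a read-modify-write with F_GETFL/F_SETFL; a minimal sketch:

#include <fcntl.h>

// set or clear O_NONBLOCK on fd; returns -1 on failure
int set_nonblock(int fd, int on) {
	int flags = fcntl(fd, F_GETFL, 0);
	if (flags < 0)
		return -1;
	flags = on ? (flags | O_NONBLOCK) : (flags & ~O_NONBLOCK);
	return fcntl(fd, F_SETFL, flags);
}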

When calling socket.recv

We trace the source-code calls:

socket.recv
	|->sys_recv
		|->sys_recvfrom
			|->sock_recvmsg
				|->__sock_recvmsg
					|->sock->ops->recvmsg

It can be seen from the above: sock->ops->recvmsg = sock_common_recvmsg;

[Figure: sock]

It is worth noting that sock_recvmsg contains handling for the O_NONBLOCK flag:

	if (sock->file->f_flags & O_NONBLOCK)
		flags |= MSG_DONTWAIT;

The code above obtains f_flags from the file associated with the sock; if O_NONBLOCK is set there, MSG_DONTWAIT (do not wait) is added to the message flags.
In other words, fcntl and the socket are connected through the struct file they both operate on.
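A side note that follows from this code: since the check merely ORs MSG_DONTWAIT into the flags, a single recv call can be made non-blocking without fcntl by passing MSG_DONTWAIT directly (a sketch, not from the original article):

#include <sys/socket.h>

// one non-blocking read on an otherwise-blocking socket: if no data is
// queued, recv fails immediately with EAGAIN/EWOULDBLOCK instead of sleeping
ssize_t recv_nowait(int fd, void *buf, size_t len) {
	return recv(fd, buf, len, MSG_DONTWAIT);
}

This takes exactly the same kernel path: MSG_DONTWAIT makes timeo zero in tcp_recvmsg for that one call, as traced below.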

Continue tracing the call:

sock_common_recvmsg

int sock_common_recvmsg(struct kiocb *iocb, struct socket *sock,
			struct msghdr *msg, size_t size, int flags) {
	......
	// if MSG_DONTWAIT is set in flags, the 5th argument passed to recvmsg is non-zero; otherwise it is 0
	err = sk->sk_prot->recvmsg(iocb, sk, msg, size, flags & MSG_DONTWAIT,
				   flags & ~MSG_DONTWAIT, &addr_len);
	.....				   
}

From the above we can see that sk->sk_prot->recvmsg, with sk_prot = tcp_prot, means the final call is tcp_prot->recvmsg, i.e. tcp_recvmsg.
The chain in the code is: fcntl(O_NONBLOCK) => MSG_DONTWAIT is set => (flags & MSG_DONTWAIT) > 0. Combined with the function signature of tcp_recvmsg, this means that if O_NONBLOCK is set, the nonblock parameter passed to tcp_recvmsg is > 0. The relationship is shown in the following figure:
[Figure: fcntl_recvmsg]

The final call logic: tcp_recvmsg

First, let's look at the function signature of tcp_recvmsg:

int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
		size_t len, int nonblock, int flags, int *addr_len)

Obviously, we focus on the int nonblock parameter:

int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
		size_t len, int nonblock, int flags, int *addr_len)
{
	......	
	// copied is how many bytes have been copied to user space, i.e. how much was read
	int copied;
	// target is how many bytes we expect
	int target;
	// equivalent to timeo = nonblock ? 0 : sk->sk_rcvtimeo;
	timeo = sock_rcvtimeo(sk, nonblock);
	......	
	// if MSG_WAITALL is set, target = the full length to read;
	// otherwise it is the receive low-water mark
	target = sock_rcvlowat(sk, flags & MSG_WAITALL, len);
	......

	do {
		// we have already read some data
		if (copied) {
			// note: !timeo alone is enough to break out of the loop,
			// i.e. we leave as soon as nonblock is set
			if (sk->sk_err ||
			    sk->sk_state == TCP_CLOSE ||
			    (sk->sk_shutdown & RCV_SHUTDOWN) ||
			    !timeo ||
			    signal_pending(current) ||
			    (flags & MSG_PEEK))
				break;
		} else {
			// reaching here means no data has been read yet;
			// if nonblock is set, timeo == 0, so we return -EAGAIN, as expected
			if (!timeo) {
				copied = -EAGAIN;
				break;
			}
			......
		}
		// if we have already read the expected amount of data, continue;
		// otherwise the current process blocks in sk_wait_data
		if (copied >= target) {
			/* Do not sleep, just process backlog. */
			release_sock(sk);
			lock_sock(sk);
		} else
			sk_wait_data(sk, &timeo);
	} while (len > 0);		
	......
	return copied;
}

The above logic boils down to:
(1) When nonblock is set: if copied > 0, return how many bytes were read; if copied == 0, return -EAGAIN, prompting the application to call again.
(2) When nonblock is not set: if the amount read >= the expected target, return how many bytes were read; otherwise block the current process via sk_wait_data.
This is shown in the flow chart below:

[Figure: tcp_recv]
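From user space, recv's return values map directly onto this logic; a hedged sketch of a read loop honoring them (the processing step is left as a comment):

#include <errno.h>
#include <sys/socket.h>

// read everything currently available from a non-blocking fd;
// returns 0 when the kernel says "try again later", -1 on error/close
static int drain_socket(int fd) {
	char buf[4096];
	for (;;) {
		ssize_t n = recv(fd, buf, sizeof(buf), 0);
		if (n > 0) {
			// copied > 0 in tcp_recvmsg: n bytes were read; process buf here
			continue;
		}
		if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
			return 0;  // copied == -EAGAIN: no data yet, come back later
		return -1;         // 0 = peer closed the connection, else a real error
	}
}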

Blocking function sk_wait_data

The core of sk_wait_data is:

	// set the process state to TASK_INTERRUPTIBLE (interruptible sleep)
	prepare_to_wait(sk->sk_sleep, &wait, TASK_INTERRUPTIBLE);
	set_bit(SOCK_ASYNC_WAITDATA, &sk->sk_socket->flags);
	// yield the CPU via schedule_timeout, then sleep
	rc = sk_wait_event(sk, timeo, !skb_queue_empty(&sk->sk_receive_queue));
	// by the time we get here, a network event or a timeout has woken this process up
	clear_bit(SOCK_ASYNC_WAITDATA, &sk->sk_socket->flags);
	finish_wait(sk->sk_sleep, &wait);
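The sleep itself happens inside the sk_wait_event macro; in 2.6.x kernels it is defined roughly as follows (paraphrased):

#define sk_wait_event(__sk, __timeo, __condition)		\
({	int rc;							\
	release_sock(__sk);					\
	rc = __condition;					\
	if (!rc)						\
		*(__timeo) = schedule_timeout(*(__timeo));	\
	lock_sock(__sk);					\
	rc = __condition;					\
	rc;							\
})

That is, if the receive queue is still empty, the process gives up the CPU in schedule_timeout until either data arrives or the timeout expires.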

This function calls schedule_timeout to go to sleep, which in turn calls the schedule function: the current process is first removed from the run queue and added to the wait queue, and finally the architecture-specific switch_to macro completes the switch to another process.
As shown below:
[Figure: task_schedule]

When does the process resume after blocking?

Case 1: The corresponding network data arrives

First, let's look at the kernel path a network packet takes on arrival. When the NIC raises an interrupt, the driver calls netif_rx to queue the packet on the CPU's input queue and raises a softirq; the Linux softirq mechanism then invokes net_rx_action, as shown in the following figure:

[Figure: low_recv]
Note: the figure above is from PLKA (<<Professional Linux Kernel Architecture>>).
We then track net_rx_action:

net_rx_action
	|-process_backlog
		......
			|->packet_type->func (here we consider ip_rcv)
					|->ipprot->handler (here ipprot is overloaded as tcp_protocol,
						and the handler is tcp_v4_rcv)

Next, tracing into tcp_v4_rcv:

tcp_input.c
tcp_v4_rcv
	|-tcp_v4_do_rcv
		|-tcp_rcv_state_process
			|-tcp_data_queue
				|-sk->sk_data_ready=sock_def_readable
					|-wake_up_interruptible
						|-__wake_up
							|-__wake_up_common

Here __wake_up_common wakes up the process parked on this wait_queue_head_t, i.e. its state is changed to TASK_RUNNING, and it waits for the CFS scheduler to run it next, as shown in the following figure.

[Figure: wake_up]

Case 2: The configured timeout expires

Recall that schedule_timeout was called above inside sk_wait_event:

fastcall signed long __sched schedule_timeout(signed long timeout) {
	......
	// set the timeout callback to process_timeout
	setup_timer(&timer, process_timeout, (unsigned long)current);
	__mod_timer(&timer, expire);
	// yield the CPU here
	schedule();
	del_singleshot_timer_sync(&timer);
	timeout = expire - jiffies;
 out:
 	// return the remaining time (0 if the timeout fully elapsed)
	return timeout < 0 ? 0 : timeout;	
}
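As an aside, the timeout used here, sk->sk_rcvtimeo, can be set from user space with SO_RCVTIMEO; a minimal sketch (not from the original article):

#include <sys/socket.h>
#include <sys/time.h>

// make a blocking recv give up after 'sec' seconds; when the timer fires,
// recv fails with EAGAIN/EWOULDBLOCK, much like the non-blocking path above
static int set_recv_timeout(int fd, int sec) {
	struct timeval tv = { .tv_sec = sec, .tv_usec = 0 };
	return setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
}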

The process_timeout function simply wakes the process up again:

static void process_timeout(unsigned long __data)
{
	wake_up_process((struct task_struct *)__data);
}

Summary

The Linux kernel source code is vast and profound, and reading it is hard work. I hope this article can help those who read the Linux network protocol stack code.

Original address

https://my.oschina.net/alchemystar/blog/1791017

Additional note

My blog will be migrated and synchronized to the Tencent Cloud+ Community; everyone is invited to join: https://cloud.tencent.com/developer/support-plan
