In-depth understanding of Linux network notes (2): Blocking methods for cooperation between kernel and user processes

This article contains study notes for "In-depth Understanding of Linux Networks". The Linux source code version used is 3.10, and the network card driver referenced by default is Intel's igb driver.

Read Linux source code online: https://elixir.bootlin.com/linux/v3.10/source

2. How the kernel cooperates with user processes (1)

1) Direct creation of socket

From a developer's perspective, a socket is created by calling the socket function.
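As a quick illustration of the user-space side, the sketch below (a minimal example, with error handling reduced to a single check) creates a TCP socket under the AF_INET protocol family:

#include <stdio.h>
#include <sys/socket.h>

int main(void)
{
	/* Ask the kernel for a TCP (SOCK_STREAM) socket in the AF_INET family.
	 * All the user process ever sees is an integer file descriptor; the
	 * kernel objects described below are created behind this one call. */
	int fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0) {
		perror("socket");
		return 1;
	}
	printf("socket fd = %d\n", fd);
	return 0;
}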

After the socket function call returns, all the user level sees is an integer handle, but in fact the kernel has internally created a whole series of socket-related kernel objects. Their relationship to each other is shown in the figure below:

// net/socket.c
SYSCALL_DEFINE3(socket, int, family, int, type, int, protocol)
{
	...
	retval = sock_create(family, type, protocol, &sock);
	...
}

sock_create is the main entry point for creating sockets; inside it, __sock_create is called.

// net/socket.c
int __sock_create(struct net *net, int family, int type, int protocol,
			 struct socket **res, int kern)
{
	int err;
	struct socket *sock;
	const struct net_proto_family *pf;
	...
	// Allocate the socket object
	sock = sock_alloc();
	...
	// Get the operation table for this protocol family
	pf = rcu_dereference(net_families[family]);
	...
	// Call the creation function of the specified protocol family;
	// for AF_INET this is inet_create
	err = pf->create(net, sock, protocol, kern);
	...
}

Here, __sock_create first calls sock_alloc to allocate a struct socket kernel object, then obtains the operation function table of the protocol family and calls its create method. For the AF_INET protocol family, the inet_create method is executed.
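How does pf->create end up pointing at inet_create? For reference, the AF_INET family registers its net_proto_family descriptor at boot time via sock_register; the sketch below paraphrases the relevant registration in net/ipv4/af_inet.c (an illustrative excerpt, not a verbatim copy):

// net/ipv4/af_inet.c (paraphrased sketch)
static const struct net_proto_family inet_family_ops = {
	.family = PF_INET,
	.create = inet_create,
	.owner	= THIS_MODULE,
};

static int __init inet_init(void)
{
	...
	// Register AF_INET into net_families[], so that
	// __sock_create's pf->create resolves to inet_create
	(void)sock_register(&inet_family_ops);
	...
}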

// net/ipv4/af_inet.c
static struct inet_protosw inetsw_array[] =
{
	{
		.type =       SOCK_STREAM,
		.protocol =   IPPROTO_TCP,
		.prot =       &tcp_prot,
		.ops =        &inet_stream_ops,
		.no_check =   0,
		.flags =      INET_PROTOSW_PERMANENT |
			      INET_PROTOSW_ICSK,
	},
	...
};

static int inet_create(struct net *net, struct socket *sock, int protocol,
		       int kern)
{
	struct sock *sk;
	struct inet_protosw *answer;
	struct inet_sock *inet;
	struct proto *answer_prot;
	...
	list_for_each_entry_rcu(answer, &inetsw[sock->type], list) {

		err = 0;
		if (protocol == answer->protocol) {
			if (protocol != IPPROTO_IP)
				break;
		} else {
			if (IPPROTO_IP == protocol) {
				protocol = answer->protocol;
				break;
			}
			if (IPPROTO_IP == answer->protocol)
				break;
		}
		err = -EPROTONOSUPPORT;
	}
	...
	// Assign inet_stream_ops to socket->ops
	sock->ops = answer->ops;
	// Get tcp_prot
	answer_prot = answer->prot;
	...
	// Allocate the sock object and assign tcp_prot to sock->sk_prot
	sk = sk_alloc(net, PF_INET, GFP_KERNEL, answer_prot);
	...
	// Initialize the sock object
	sock_init_data(sock, sk);
	...
}

In inet_create, the operation collections defined for TCP, inet_stream_ops and tcp_prot, are located according to the type SOCK_STREAM and assigned to socket->ops and sock->sk_prot respectively, as shown in the following figure:

Further down we see sock_init_data. In this method, the sk_data_ready function pointer on the sock is initialized and set to the default sock_def_readable, as shown in the following figure:

// net/core/sock.c
void sock_init_data(struct socket *sock, struct sock *sk)
{
	...
	sk->sk_data_ready	=	sock_def_readable;
	sk->sk_write_space	=	sock_def_write_space;
	sk->sk_error_report	=	sock_def_error_report;
	...
}

When a data packet is received in the soft interrupt, the process waiting on the socket is awakened by calling the sk_data_ready function pointer (which was set to sock_def_readable()). This will be discussed later in the "soft interrupt module" section.

At this point, a TCP object, or more precisely, a SOCK_STREAM object under the AF_INET protocol family, has been created. This costs the overhead of one socket system call.

2) Blocking method of cooperation between kernel and user process

Blocking IO model:

When the user thread issues an IO request, the kernel checks whether the data is ready. If not, the kernel waits for the data to become ready while the user thread stays blocked and gives up the CPU. When the data is ready, the kernel copies the data to the user thread and returns the result, and the user thread then leaves the blocked state.
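Seen from user space, the model looks like the sketch below: a minimal blocking read loop on a connected TCP socket (assuming fd is a connected socket created as in the earlier example). Each recv call simply does not return until the kernel has data for it or the peer closes the connection:

#include <stdio.h>
#include <sys/socket.h>

/* Blocking read loop: each recv() call puts the calling process to sleep
 * until data arrives in the socket's receive queue (or the peer closes). */
static void read_loop(int fd)
{
	char buf[4096];
	ssize_t n;

	while ((n = recv(fd, buf, sizeof(buf), 0)) > 0) {
		/* process n bytes in buf ... */
		printf("received %zd bytes\n", n);
	}
	if (n == 0)
		printf("peer closed the connection\n");
	else
		perror("recv");
}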

In the synchronous blocking IO model, the user process first issues the call to create a socket, then switches to kernel mode to complete the initialization of the kernel objects. Next, Linux uses hard interrupts and the ksoftirqd threads to process data packets. After the ksoftirqd thread has finished processing, the relevant user process is notified.

From when the user process creates a socket to when a network packet arrives at the network card and is received by the user process, the overall process of synchronous blocking IO is as shown in the figure below:

1) Waiting to receive messages

The recv function of the C library executes the recvfrom system call. After entering the system call, the user process switches to kernel mode, executes a series of kernel protocol-layer functions, and then checks whether there is data in the receive queue of the socket object. If not, it adds itself to the waiting queue of the socket. Finally it gives up the CPU, and the operating system selects the next runnable process to execute. The whole process is shown in the figure below:
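In other words (a note on the libc side, not kernel code), a plain recv on a connected socket behaves like a recvfrom with the peer-address arguments left empty, so both end up in the same system call path; a tiny illustrative wrapper:

#include <sys/socket.h>

/* For a connected TCP socket, recv() is effectively recvfrom() with the
 * peer-address out-parameters left as NULL; both enter the same
 * recvfrom system call path in the kernel. */
ssize_t blocking_read(int fd, void *buf, size_t len)
{
	return recvfrom(fd, buf, len, 0, NULL, NULL);
}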

Next, let's look at the details in the source code. The key point to pay attention to is how recvfrom eventually blocks its own process.

// net/socket.c
SYSCALL_DEFINE6(recvfrom, int, fd, void __user *, ubuf, size_t, size,
		unsigned int, flags, struct sockaddr __user *, addr,
		int __user *, addr_len)
{
	struct socket *sock;
	...
	// Find the socket object from the fd passed in by the user
	sock = sockfd_lookup_light(fd, &err, &fput_needed);
	...
	err = sock_recvmsg(sock, &msg, size, flags);
	...
}

The next calling sequence is: sock_recvmsg => __sock_recvmsg => __sock_recvmsg_nosec

// net/socket.c
static inline int __sock_recvmsg_nosec(struct kiocb *iocb, struct socket *sock,
				       struct msghdr *msg, size_t size, int flags)
{
	...
	return sock->ops->recvmsg(iocb, sock, msg, size, flags);
}

This calls recvmsg on the socket object's ops; here recvmsg points to the inet_recvmsg method.

// net/ipv4/af_inet.c
int inet_recvmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
		 size_t size, int flags)
{
	...
	err = sk->sk_prot->recvmsg(iocb, sk, msg, size, flags & MSG_DONTWAIT,
				   flags & ~MSG_DONTWAIT, &addr_len);
	...
}

Here we encounter another function pointer: this time the recvmsg method under sk_prot in the sock object is called, and this recvmsg corresponds to the tcp_recvmsg method.

// net/ipv4/tcp.c
int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
		size_t len, int nonblock, int flags, int *addr_len)
{
	...
	int copied = 0;
	...
	do {
		...
		// Walk the receive queue to read data
		skb_queue_walk(&sk->sk_receive_queue, skb) {
			...
		}
		...
		if (copied >= target) {
			release_sock(sk);
			lock_sock(sk);
		} else // Not enough data received; use sk_wait_data to block the current process
			sk_wait_data(sk, &timeo);
		...
	} while (len > 0);
	...
}

Finally we see what we were looking for: skb_queue_walk walks the receive queue under the sock object, as shown in the following figure:

If no data is received, or not enough is received, call sk_wait_data to block the current process.
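Before diving into sk_wait_data, a side note (a hedged, user-space sketch rather than part of the book's flow): the nonblock parameter visible in tcp_recvmsg's signature comes from the MSG_DONTWAIT flag. When it is set, the receive timeout becomes zero, so instead of reaching sk_wait_data, recv returns -1 with errno set to EAGAIN when the queue is empty:

#include <errno.h>
#include <stdio.h>
#include <sys/socket.h>

/* Non-blocking probe of a socket's receive queue: with MSG_DONTWAIT the
 * kernel never blocks us in sk_wait_data; if the queue is empty, recv()
 * fails immediately with EAGAIN/EWOULDBLOCK. */
static ssize_t try_read(int fd, void *buf, size_t len)
{
	ssize_t n = recv(fd, buf, len, MSG_DONTWAIT);

	if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
		printf("no data ready yet\n");
	return n;
}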

// net/core/sock.c
int sk_wait_data(struct sock *sk, long *timeo)
{
	int rc;
	// Associate the current process (current) with the wait queue entry defined here
	DEFINE_WAIT(wait);

	// Call sk_sleep to get the wait queue head under the sock object,
	// then prepare to be suspended, setting the current process to TASK_INTERRUPTIBLE
	prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
	set_bit(SOCK_ASYNC_WAITDATA, &sk->sk_socket->flags);
	// Give up the CPU via schedule_timeout and go to sleep
	rc = sk_wait_event(sk, timeo, !skb_queue_empty(&sk->sk_receive_queue));
	...
}

Let’s take a closer look at how sk_wait_data blocks the current process, as shown in the figure below:

First, the DEFINE_WAIT macro defines a wait queue entry named wait. On this new wait queue entry, the callback function autoremove_wake_function is registered, and the current process descriptor current is attached to its .private member.

// include/linux/wait.h
#define DEFINE_WAIT_FUNC(name, function)				\
	wait_queue_t name = {						\
		.private	= current,				\
		.func		= function,				\
		.task_list	= LIST_HEAD_INIT((name).task_list),	\
	}

#define DEFINE_WAIT(name) DEFINE_WAIT_FUNC(name, autoremove_wake_function)

Then sk_wait_data calls sk_sleep to obtain the wait queue head (wait_queue_head_t) under the sock object. The source code of sk_sleep is as follows:

// include/net/sock.h
static inline wait_queue_head_t *sk_sleep(struct sock *sk)
{
	BUILD_BUG_ON(offsetof(struct socket_wq, wait) != 0);
	return &rcu_dereference_raw(sk->sk_wq)->wait;
}

Then call prepare_to_wait to insert the newly defined waiting queue item wait into the waiting queue of the sock object.

// kernel/wait.c
void
prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
{
	unsigned long flags;

	wait->flags &= ~WQ_FLAG_EXCLUSIVE;
	spin_lock_irqsave(&q->lock, flags);
	if (list_empty(&wait->task_list))
		__add_wait_queue(q, wait);
	set_current_state(state);
	spin_unlock_irqrestore(&q->lock, flags);
}

In this way, when the kernel has received the data and generates a ready event, it can look up the wait entries on the socket's wait queue and thereby find the callback function and the process that is waiting for the socket ready event.

Finally, sk_wait_event is called to give up the CPU, and the process goes to sleep, which incurs the overhead of one process context switch. This overhead is expensive, costing on the order of a few microseconds of CPU time.
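sk_wait_event itself is a macro defined in include/net/sock.h. A simplified paraphrase of what it does (a sketch, not the verbatim 3.10 text) is shown below: it drops the socket lock, sleeps via schedule_timeout if the condition ("receive queue not empty") does not yet hold, then re-takes the lock and re-evaluates the condition.

/* Simplified sketch of sk_wait_event(sk, timeo, condition) -- not the
 * verbatim kernel definition. prepare_to_wait() has already queued the
 * current process on the socket's wait queue and marked it
 * TASK_INTERRUPTIBLE, so schedule_timeout() takes it off the CPU until
 * it is woken up (via sk_data_ready) or the timeout expires. */
#define sk_wait_event_sketch(__sk, __timeo, __condition)		\
	({								\
		int __rc;						\
		release_sock(__sk);					\
		__rc = (__condition);					\
		if (!__rc)						\
			*(__timeo) = schedule_timeout(*(__timeo));	\
		lock_sock(__sk);					\
		__rc = (__condition);					\
		__rc;							\
	})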

2) Soft interrupt module

The previous article described how a network packet is received by the network card and eventually handed over to the soft interrupt for processing. Here we start directly from the receiving function of the TCP protocol, tcp_v4_rcv; the overall receiving process is shown in the figure below:

After the soft interrupt (that is, Linux's ksoftirqd thread) receives the data and finds that it is a TCP packet, it executes the tcp_v4_rcv function. Next, if the packet belongs to a connection in the ESTABLISHED state, the data is eventually unpacked and placed in the receive queue of the corresponding socket, and then sk_data_ready is called to wake up the user process.

// net/ipv4/tcp_ipv4.c
int tcp_v4_rcv(struct sk_buff *skb)
{
	...
	// Get the TCP header
	th = tcp_hdr(skb);
	// Get the IP header
	iph = ip_hdr(skb);
	...
	// Find the corresponding socket based on the IP and port information in the packet headers
	sk = __inet_lookup_skb(&tcp_hashinfo, skb, th->source, th->dest);
	...
	// The socket is not locked by the user
	if (!sock_owned_by_user(sk)) {
		...
		{
			if (!tcp_prequeue(sk, skb))
				ret = tcp_v4_do_rcv(sk, skb);
		}
	}
	...
}

In tcp_v4_rcv, the corresponding socket on the local machine is first looked up based on the source and dest information in the header of the received network packet. Once it is found, the tcp_v4_do_rcv function is called.

// net/ipv4/tcp_ipv4.c
int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
{
	...
	if (sk->sk_state == TCP_ESTABLISHED) {
		...
		// Process data for an established connection
		if (tcp_rcv_established(sk, skb, tcp_hdr(skb), skb->len)) {
			rsk = sk;
			goto reset;
		}
		return 0;
	}
	// Handling of packets in other, non-ESTABLISHED states
	...
}

Assume that we are processing a packet in the ESTABLISHED state, so we enter the tcp_rcv_established function.

// net/ipv4/tcp_input.c
int tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
			const struct tcphdr *th, unsigned int len)
{
	...
	// Put the received data into the receive queue
	eaten = tcp_queue_rcv(sk, skb, tcp_header_len,
			      &fragstolen);
	...
	// Data is ready; wake up the process blocked on the socket
	sk->sk_data_ready(sk, 0);
	...
}

In tcp_rcv_established, by calling the tcp_queue_rcv function, the received data is placed on the socket's receiving queue, as shown in the following figure:

The source code of function tcp_queue_rcv is as follows:

// net/ipv4/tcp_input.c
static int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb, int hdrlen,
		  bool *fragstolen)
{
	...
	// Append the received data to the tail of the socket's receive queue
	if (!eaten) {
		__skb_queue_tail(&sk->sk_receive_queue, skb);
		skb_set_owner_r(skb, sk);
	}
	return eaten;
}

After tcp_queue_rcv has finished receiving, sk_data_ready is called to wake up the user process waiting on the socket. This is again a function pointer. As mentioned in the earlier section "Direct creation of socket", the sock_init_data function executed when the socket was created set the sk_data_ready pointer to sock_def_readable, which is the default data-ready handler function.

// net/core/sock.c
static void sock_def_readable(struct sock *sk, int len)
{
	struct socket_wq *wq;

	rcu_read_lock();
	wq = rcu_dereference(sk->sk_wq);
	// There are processes on this socket's wait queue
	if (wq_has_sleeper(wq))
		// Wake up a process on the wait queue
		wake_up_interruptible_sync_poll(&wq->wait, POLLIN | POLLPRI |
						POLLRDNORM | POLLRDBAND);
	sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
	rcu_read_unlock();
}

In sock_def_readable, the wait under sock->sk_wq is accessed again. Recall that when recvfrom was called in the earlier "Waiting to receive messages" part, at the end of its execution the wait queue entry defined by DEFINE_WAIT(wait) and associated with the current process was added to this wait under sock->sk_wq.

The next step is to call wake_up_interruptible_sync_poll to wake up the process that is blocked waiting for data on the socket, as shown in the following figure:

// include/linux/wait.h
#define wake_up_interruptible_sync_poll(x, m)				\
	__wake_up_sync_key((x), TASK_INTERRUPTIBLE, 1, (void *) (m))

// kernel/sched/core.c
void __wake_up_sync_key(wait_queue_head_t *q, unsigned int mode,
			int nr_exclusive, void *key)
{
	unsigned long flags;
	int wake_flags = WF_SYNC;

	if (unlikely(!q))
		return;

	if (unlikely(!nr_exclusive))
		wake_flags = 0;

	spin_lock_irqsave(&q->lock, flags);
	__wake_up_common(q, mode, nr_exclusive, wake_flags, key);
	spin_unlock_irqrestore(&q->lock, flags);
}

__wake_up_common performs the actual wakeup. The nr_exclusive parameter passed into this call is 1, which means that even if multiple processes are blocked on the same socket, only one process is woken up. Its purpose is to avoid the thundering herd problem, rather than waking up all processes.

// kernel/sched/core.c
static void __wake_up_common(wait_queue_head_t *q, unsigned int mode,
			int nr_exclusive, int wake_flags, void *key)
{
	wait_queue_t *curr, *next;

	list_for_each_entry_safe(curr, next, &q->task_list, task_list) {
		unsigned flags = curr->flags;

		if (curr->func(curr, mode, wake_flags, key) &&
				(flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
			break;
	}
}

__wake_up_common finds a wait queue entry curr in the queue and then calls its curr->func. Recall that when the recv path ran through the earlier "Waiting to receive messages" part, DEFINE_WAIT() set curr->func to autoremove_wake_function when defining the wait queue entry.

// kernel/wait.c
int autoremove_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key)
{
	int ret = default_wake_function(wait, mode, sync, key);

	if (ret)
		list_del_init(&wait->task_list);
	return ret;
}

In autoremove_wake_function, default_wake_function is called

// kernel/sched/core.c
int default_wake_function(wait_queue_t *curr, unsigned mode, int wake_flags,
			  void *key)
{
	return try_to_wake_up(curr->private, mode, wake_flags);
}

The task_struct passed into try_to_wake_up is curr->private, which is the process that was blocked waiting. When this function finishes, the process that was blocked waiting on the socket is pushed onto the runnable queue, which incurs the overhead of another process context switch.

3) Synchronous blocking summary

The entire process of receiving network packets in synchronous blocking mode is divided into two parts:

  • The first part is the process in which our own code runs. The socket() function we call enters kernel mode and creates the necessary kernel objects. The recv() function, after entering kernel mode, is responsible for checking the receive queue and, when there is no data to process, blocking the current process so that it gives up the CPU.
  • The second part is the hard interrupt and soft interrupt context (the system thread ksoftirqd). In these components, after the packet has been processed it is placed in the socket's receive queue. Then, based on the socket kernel object, the process blocked waiting in its wait queue is found and woken up.

The overall process of synchronous blocking is shown in the figure below:

Each time a process has to wait for data on a socket, it is taken off the CPU and another process is scheduled in its place, as shown in the figure below. When the data is ready, the sleeping process is woken up again, for a total of two process context switches of overhead.

Recommended reading:

Five I/O models of Linux: Take you to a thorough understanding of the five I/O models of Linux


Origin blog.csdn.net/qq_40378034/article/details/133455198