Linux: Illustrating the process of sending network packets

Please start by thinking about a few small questions.

  • Question 1: When we look at the CPU consumed by the kernel while sending data, should we look at sy or si?
  • Question 2: Why is NET_RX much larger than NET_TX in /proc/softirqs on your server?
  • Question 3: What memory copy operations are involved in sending network data?

Although these questions come up often in practice, we seldom dig into them. If we can truly understand them, our ability to reason about performance becomes much stronger.

With these three questions in mind, let's start today's in-depth analysis of the Linux kernel's network sending process. As usual, we begin with a simple piece of code. The following is a minimal version of a typical server program:

int main(){
 fd = socket(AF_INET, SOCK_STREAM, 0);
 bind(fd, ...);
 listen(fd, ...);

 cfd = accept(fd, ...);

 // receive the user's request
 read(cfd, ...);

 // process the request
 dosomething();

 // return the result to the user
 send(cfd, buf, sizeof(buf), 0);
}

Today we will discuss how the kernel sends the data packet after send is called in the code above. This article is based on Linux 3.10, using Intel's igb network card driver as the example.

Warning: this article runs to more than 10,000 words and 25 pictures, so brace yourself for a long read!

1. Overview of Linux network sending process

When reading Linux source code, I think the most important thing is to get an overall grasp first, rather than getting bogged down in details from the very beginning.

I have prepared an overall flow chart that briefly explains how the data passed to send is delivered to the network card step by step.

In this picture, we can see that the user data is copied into the kernel, processed by the protocol stack, and then placed into the RingBuffer. The network card driver then actually sends the data out. When transmission completes, the CPU is notified through a hard interrupt, and the RingBuffer is cleaned up.

Because we will walk through the source code later in the article, here is the same flow again from the source-code perspective.

Although the data has been sent at this point, one important piece of work remains: releasing memory such as the buffers on the send queue.

How does the kernel know when it can release this memory? Only after the network card has finished sending. When the card finishes, it raises a hard interrupt to notify the CPU. See the diagram for the more complete process:

Note that although our topic today is sending data, the soft interrupt triggered by this hard interrupt is NET_RX_SOFTIRQ, not NET_TX_SOFTIRQ!!! (TX is short for transmit, RX for receive.)

Surprised?

So this is part of the answer to Question 2 from the opening (note: only part of it).

Question 2: Looking at /proc/softirqs on the server, why is NET_RX much larger than NET_TX?

Transmit completion ultimately triggers NET_RX, not NET_TX. So naturally you will see a larger NET_RX count when observing /proc/softirqs.
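If you want to check this on your own machine, here is a minimal user-space sketch (my own illustration, not from the kernel source) that sums the per-CPU NET_RX and NET_TX counters from /proc/softirqs so the two totals can be compared directly:

#include <stdio.h>
#include <string.h>

int main(void)
{
 FILE *fp = fopen("/proc/softirqs", "r");
 char line[4096];

 if (!fp) {
  perror("fopen /proc/softirqs");
  return 1;
 }

 while (fgets(line, sizeof(line), fp)) {
  //only the NET_RX and NET_TX rows are interesting here
  char *lbl = strstr(line, "NET_RX:");
  if (!lbl)
   lbl = strstr(line, "NET_TX:");
  if (!lbl)
   continue;

  //every remaining column is one CPU's counter; add them up
  unsigned long long total = 0, v;
  char *p = lbl + 7;
  int n;
  while (sscanf(p, "%llu%n", &v, &n) == 1) {
   total += v;
   p += n;
  }
  printf("%.6s total = %llu\n", lbl, total);
 }

 fclose(fp);
 return 0;
}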

OK, now you have an overall picture of how the kernel sends network packets. Don't stop here though; the details are where the real value is, so let's continue!

2. Network card startup preparation

The network cards in today's servers generally support multiple queues. Each queue is represented by a RingBuffer, so a network card with multiple queues enabled has multiple RingBuffers.

One of the most important jobs when a network card starts up is allocating and initializing its RingBuffers. Understanding the RingBuffer will be very helpful for understanding sending later on. Since today's topic is sending, let's take the transmit queue as an example and look at how the RingBuffer is actually allocated when the network card starts.

When the network card starts, the __igb_open function is called, and this is where the RingBuffer is allocated.

//file: drivers/net/ethernet/intel/igb/igb_main.c
static int __igb_open(struct net_device *netdev, bool resuming)
{
 struct igb_adapter *adapter = netdev_priv(netdev);

 //allocate the transmit descriptor arrays
 err = igb_setup_all_tx_resources(adapter);

 //allocate the receive descriptor arrays
 err = igb_setup_all_rx_resources(adapter);

 //start all queues
 netif_tx_start_all_queues(netdev);
}

In the __igb_open function above, igb_setup_all_tx_resources is called to allocate all the transmit RingBuffers, and igb_setup_all_rx_resources to create all the receive RingBuffers.

//file: drivers/net/ethernet/intel/igb/igb_main.c
static int igb_setup_all_tx_resources(struct igb_adapter *adapter)
{
 //construct one RingBuffer per queue
 for (i = 0; i < adapter->num_tx_queues; i++) {
  igb_setup_tx_resources(adapter->tx_ring[i]);
 }
}

The real construction process of RingBuffer is completed in igb_setup_tx_resources.

//file: drivers/net/ethernet/intel/igb/igb_main.c
int igb_setup_tx_resources(struct igb_ring *tx_ring)
{
 //1. allocate the igb_tx_buffer array
 size = sizeof(struct igb_tx_buffer) * tx_ring->count;
 tx_ring->tx_buffer_info = vzalloc(size);

 //2. allocate the e1000_adv_tx_desc DMA array
 tx_ring->size = tx_ring->count * sizeof(union e1000_adv_tx_desc);
 tx_ring->size = ALIGN(tx_ring->size, 4096);
 tx_ring->desc = dma_alloc_coherent(dev, tx_ring->size,
        &tx_ring->dma, GFP_KERNEL);

 //3. initialize the queue members
 tx_ring->next_to_use = 0;
 tx_ring->next_to_clean = 0;
}

As the source code above shows, a RingBuffer actually consists of not one ring array but two.

1) The igb_tx_buffer array: used by the kernel, allocated with vzalloc.
2) The e1000_adv_tx_desc array: used by the network card hardware, which can access this memory directly via DMA; it is allocated with dma_alloc_coherent.

At this point there is no connection between the two arrays yet. Later, when sending, the entries at the same position in both ring arrays will point to the same skb. That way, the kernel and the hardware can jointly access the same data: the kernel writes data into the skb, and the network card hardware is responsible for sending it.
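To make that relationship concrete, here is a simplified sketch (my own illustration; the struct and field names are stand-ins for the real igb types, not the kernel's definitions). The point is only that the same ring index i is used in both arrays, so the kernel-side bookkeeping entry and the hardware-side DMA descriptor describe the same packet:

struct sk_buff;                          /* kernel packet object, declared elsewhere */

struct fake_tx_buffer {                  /* kernel-only bookkeeping (like igb_tx_buffer) */
 struct sk_buff *skb;                    /* the packet occupying this slot */
};

struct fake_tx_desc {                    /* what the NIC reads via DMA (like e1000_adv_tx_desc) */
 unsigned long long buffer_addr;         /* DMA address of the packet data */
 unsigned int cmd_type_len;              /* length and command bits */
};

/* fill slot i of both arrays for the same packet */
static void ring_fill_slot(struct fake_tx_buffer *buf_info,
      struct fake_tx_desc *desc,
      unsigned int i, struct sk_buff *skb,
      unsigned long long dma_addr, unsigned int len)
{
 buf_info[i].skb = skb;                  /* the kernel remembers the skb for later cleanup */
 desc[i].buffer_addr = dma_addr;         /* the hardware only needs the DMA address and length */
 desc[i].cmd_type_len = len;
}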

Finally, netif_tx_start_all_queues is called to start the queues. In addition, the hard interrupt handler igb_msix_ring is also registered in __igb_open.

3. accept creates a new socket

Before sending data, we usually need a socket that has already established a connection.

Let's take the accept call in the minimal server code above as an example. After accept returns, the process creates a new socket and puts it into the current process's list of open files; this socket is dedicated to communicating with the corresponding client.

Assuming the server process has established two connections with clients via accept, let's briefly look at the relationship between these two connections and the process.

A more detailed structural diagram of the socket kernel object representing one connection is shown below.

To avoid making this article overwhelming, the detailed source code path of accept is not covered here. If you are interested, please refer to the first part of "Illustration | In-depth Demystification of How epoll Realizes IO Multiplexing!".

Today we still focus on the data sending process.

4. Sending data really starts

4.1 send system call implementation

The source code of the send system call is in net/socket.c. Internally, it actually uses the sendto system call. Although the whole call chain is not short, it really only does two simple things:

  • First, it looks up the real socket object in the kernel; the function addresses of the various protocol-stack handlers are recorded in this object.
  • Second, it constructs a struct msghdr object and puts everything passed in by the user, such as the buffer address and data length, into it.

The rest is handed to the next layer, the inet_sendmsg function in the protocol stack, whose address is found through the ops member of the socket kernel object. The general process is shown in the figure.

With the above understanding, it will be much easier for us to look at the source code. The source code is as follows:

//file: net/socket.c
SYSCALL_DEFINE4(send, int, fd, void __user *, buff, size_t, len,
  unsigned int, flags)
{
 return sys_sendto(fd, buff, len, flags, NULL, 0);
}

SYSCALL_DEFINE6(......)
{
 //1. look up the socket from the fd
 sock = sockfd_lookup_light(fd, &err, &fput_needed);

 //2. construct the msghdr
 struct msghdr msg;
 struct iovec iov;

 iov.iov_base = buff;
 iov.iov_len = len;
 msg.msg_iovlen = 1;

 msg.msg_iov = &iov;
 msg.msg_flags = flags;
 ......

 //3. send the data
 sock_sendmsg(sock, &msg, len);
}

As the source code shows, the send function we use in user mode is actually implemented by the sendto system call; send is just a more convenient wrapper.
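A quick user-space illustration of that point (a hedged sketch; cfd is assumed to be an already connected TCP socket): on a connected socket the two calls below have the same effect, since send() is just sendto() with no destination address.

#include <sys/types.h>
#include <sys/socket.h>

ssize_t reply(int cfd, const void *buf, size_t len)
{
 /* equivalent on a connected socket: send(cfd, buf, len, 0); */
 return sendto(cfd, buf, len, 0, NULL, 0);
}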

In the sendto system call, the real socket kernel object is first looked up from the socket handle (fd) passed in by the user. Then the user's buff, len, flags, and other parameters are packed into a struct msghdr object.

sock_sendmsg => __sock_sendmsg => __sock_sendmsg_nosec is then called. In __sock_sendmsg_nosec, the call passes from the system call layer into the protocol stack; let's look at its source code.

//file: net/socket.c
static inline int __sock_sendmsg_nosec(...)
{
 ......
 return sock->ops->sendmsg(iocb, sock, msg, size);
}

From the socket kernel object structure diagram in section 3, we can see that sock->ops->sendmsg here actually points to inet_sendmsg. This function is the generic send function provided by the AF_INET protocol family.

4.2 Transport layer processing

1) Transport layer copy

After entering the protocol stack via inet_sendmsg, the kernel then finds the specific protocol's send function on the socket. For the TCP protocol, that is tcp_sendmsg (again found through the socket kernel object).

In this function, the kernel allocates a kernel-mode skb and copies the data the user wants to send into it. Note that sending may not actually start at this point; if the send conditions are not met, the call is likely to return right here. The approximate process is shown in the figure:

Let's look at the source code of the inet_sendmsg function.

//file: net/ipv4/af_inet.c
int inet_sendmsg(......)
{
 ......
 return sk->sk_prot->sendmsg(iocb, sk, msg, size);
}

In this function, the specific protocol's send function is called. Referring again to the socket kernel object structure diagram in section 3, we can see that for a socket under the TCP protocol, sk->sk_prot->sendmsg points to tcp_sendmsg (for UDP it is udp_sendmsg).

The tcp_sendmsg function is relatively long, so let's look at it in several passes. First this part:

//file: net/ipv4/tcp.c
int tcp_sendmsg(...)
{
 while(...){
  while(...){
   //get the send queue
   skb = tcp_write_queue_tail(sk);

   //allocate an skb and copy into it
   ......
  }
 }
}

//file: include/net/tcp.h
static inline struct sk_buff *tcp_write_queue_tail(const struct sock *sk)
{
 return skb_peek_tail(&sk->sk_write_queue);
}

Understanding the call to tcp_write_queue_tail on the socket is a prerequisite for understanding sending. As shown above, this function gets the last skb in the socket's send queue. skb is short for the struct sk_buff object, and the socket's send queue is a linked list of these objects.

Let's look at other parts of tcp_sendmsg.

//file: net/ipv4/tcp.c
int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
  size_t size)
{
 //get the data and flags passed in by the user
 iov = msg->msg_iov; //address of the user data
 iovlen = msg->msg_iovlen; //number of data blocks, 1 here
 flags = msg->msg_flags; //various flags

 //iterate over the user-space data blocks
 while (--iovlen >= 0) {

  //address of the data block to be sent
  unsigned char __user *from = iov->iov_base;

  while (seglen > 0) {

   //a new skb needs to be allocated
   if (copy <= 0) {

    //allocate an skb and add it to the tail of the send queue
    skb = sk_stream_alloc_skb(sk,
         select_size(sk, sg),
         sk->sk_allocation);

    //hang the skb on the socket's send queue
    skb_entail(sk, skb);
   }

   // the skb has enough free space
   if (skb_availroom(skb) > 0) {
    //copy user-space data into kernel space, computing the checksum at the same time
    //from is the address of the user-space data
    skb_add_data_nocache(sk, skb, from, copy);
   } 
   ......

This function is relatively long, but the logic is not complicated. msg->msg_iov holds the buffer of data to be sent in user-mode memory. Next, the kernel allocates kernel memory (the skb) and copies the data from user memory into it. This involves the overhead of one or more memory copies.

As for when the kernel actually sends the skb out: some checks are made in tcp_sendmsg.

//file: net/ipv4/tcp.c
int tcp_sendmsg(...)
{
 while(...){
  while(...){
   //allocate kernel memory and copy the data

   //decide whether to send
   if (forced_push(tp)) {
    tcp_mark_push(tp, skb);
    __tcp_push_pending_frames(sk, mss_now, TCP_NAGLE_PUSH);
   } else if (skb == tcp_send_head(sk))
    tcp_push_one(sk, mss_now);

   continue;
  }
 }
}

Only when forced_push(tp) or skb == tcp_send_head(sk) is satisfied will the kernel actually start sending packets. forced_push(tp) checks whether the data that has not yet been pushed exceeds half of the maximum window.
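For reference, the forced_push check itself is tiny; in the 3.10 source it looks roughly like this (write_seq is the next sequence number to be written, pushed_seq the last sequence number that was pushed):

//file: net/ipv4/tcp.c
static inline bool forced_push(const struct tcp_sock *tp)
{
 //true when the data not yet pushed exceeds half of the peer's maximum advertised window
 return after(tp->write_seq, tp->pushed_seq + (tp->max_window >> 1));
}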

If neither condition is met, then all this call does is copy the data the user wants to send into the kernel, and the job is done!

2) Transport layer sending

Assuming the kernel send conditions are now satisfied, let's trace the actual sending process. In the function from the previous section, once the real send conditions are met, whether __tcp_push_pending_frames or tcp_push_one is called, both eventually end up executing tcp_write_xmit.

So let's look directly at tcp_write_xmit. This function handles the transport layer's congestion control and sliding-window work. When the window allows, it sets the TCP header and passes the skb down to the network layer for processing.

Let's look at the source code of tcp_write_xmit.

//file: net/ipv4/tcp_output.c
static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
      int push_one, gfp_t gfp)
{
 //loop, fetching the skbs to be sent
 while ((skb = tcp_send_head(sk))) 
 {
  //sliding-window related checks
  cwnd_quota = tcp_cwnd_test(tp, skb);
  tcp_snd_wnd_test(tp, skb, mss_now);
  tcp_mss_split_point(...);
  tso_fragment(sk, skb, ...);
  ......

  //actually start sending
  tcp_transmit_skb(sk, skb, 1, gfp);
 }
}

You can see that the sliding window and congestion control we learned about in networking courses are handled in this function; we won't expand on that part here. Interested readers can dig into this source code themselves. Today we only follow the main sending path, which takes us to tcp_transmit_skb.

//file: net/ipv4/tcp_output.c
static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
    gfp_t gfp_mask)
{
 //1. clone a new skb
 if (likely(clone_it)) {
  skb = skb_clone(skb, gfp_mask);
  ......
 }

 //2. fill in the TCP header
 th = tcp_hdr(skb);
 th->source  = inet->inet_sport;
 th->dest  = inet->inet_dport;
 th->window  = ...;
 th->urg   = ...;
 ......

 //3. call the network layer's send interface
 err = icsk->icsk_af_ops->queue_xmit(skb, &inet->cork.fl);
}

The first thing it does is clone a new skb. Why does a copy of the skb need to be made here?

Because the skb will later be passed down to the network layer and, once the card finishes sending, that skb will be freed. But TCP supports retransmission on loss, so this skb cannot be deleted before the peer's ACK is received. The kernel's approach is therefore: every time it calls down towards the network card, what is actually passed along is a clone of the skb. Only when the ACK arrives is the original really deleted.

The second thing is to modify the TCP header in the skb, setting it according to the actual situation. A small trick worth mentioning: the skb actually reserves room for all of the protocol headers. Setting the TCP header is just a matter of pointing to the right position inside the skb; setting the IP header later is just another pointer move. This avoids frequent memory allocation and copying and is very efficient.
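A minimal sketch of that pointer trick (my own simplified illustration; fake_skb and fake_skb_push are stand-ins, not the kernel's sk_buff helpers): enough headroom is reserved when the skb is allocated, and each layer "pushes" its header by simply moving the data pointer backwards, without allocating or copying the payload again.

struct fake_skb {
 unsigned char *head;     /* start of the allocated buffer */
 unsigned char *data;     /* current start of valid data (payload at first) */
 unsigned int len;        /* number of valid bytes */
};

/* "push" a header of hdr_len bytes into the reserved headroom and
 * return a pointer the caller can fill in (TCP, then IP, then MAC) */
static unsigned char *fake_skb_push(struct fake_skb *skb, unsigned int hdr_len)
{
 skb->data -= hdr_len;    /* assumes enough headroom was reserved at allocation time */
 skb->len += hdr_len;
 return skb->data;
}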

tcp_transmit_skb is the last step of sending at the transport layer; from here the skb can move on to the network layer. The send interface provided by the network layer, icsk->icsk_af_ops->queue_xmit(), is called.

From the source code below, we can see that queue_xmit actually points to the ip_queue_xmit function.

//file: net/ipv4/tcp_ipv4.c
const struct inet_connection_sock_af_ops ipv4_specific = {
 .queue_xmit    = ip_queue_xmit,
 .send_check    = tcp_v4_send_check,
 ...
};

At this point, the transport layer's work is complete. The data leaves the transport layer and next enters the kernel's network layer implementation.

4.3 Network layer sending processing

The Linux kernel's network-layer send implementation lives in net/ipv4/ip_output.c, and ip_queue_xmit, called from the transport layer, is there too. (You can also tell from the file name that we have entered the IP layer: the source file name has changed from tcp_xxx to ip_xxx.)

The network layer mainly handles routing lookup, setting the IP header, netfilter filtering, and skb fragmentation (if the skb is larger than the MTU). Once these are done, the packet is handed to the neighbor subsystem below.

Let's look at the source code of the network layer entry function ip_queue_xmit:

//file: net/ipv4/ip_output.c
int ip_queue_xmit(struct sk_buff *skb, struct flowi *fl)
{
 //check whether the socket has a cached route
 rt = (struct rtable *)__sk_dst_check(sk, 0);
 if (rt == NULL) {
  //no cache, so look up a routing entry
  //and cache it in the socket
  rt = ip_route_output_ports(...);
  sk_setup_caps(sk, &rt->dst);
 }

 //set the route for the skb
 skb_dst_set_noref(skb, &rt->dst);

 //set the IP header
 iph = ip_hdr(skb);
 iph->protocol = sk->sk_protocol;
 iph->ttl      = ip_select_ttl(inet, &rt->dst);
 iph->frag_off = ...;

 //send
 ip_local_out(skb);
}

ip_queue_xmit is already in the network layer. In this function we can see the routing lookup logic; if a route is found it is set on the skb (and if there is no route, an error is returned directly).

On Linux, you can see the local machine's routing configuration with the route command.

From the routing table you can find out which Iface (network card) and which Gateway a packet to a given destination network should go through. After the lookup, the result is cached on the socket, so the next send does not need to look it up again.

Then the routing entry's address is also stored in the skb.

//file: include/linux/skbuff.h
struct sk_buff {
 //stores some routing-related information
 unsigned long  _skb_refdst;
};

The next step is to locate the position of the IP header in the skb, and then start to set the IP header according to the protocol specification.

Then go to the next step through ip_local_out.

//file: net/ipv4/ip_output.c  
int ip_local_out(struct sk_buff *skb)
{
 //run netfilter filtering
 err = __ip_local_out(skb);

 //start sending the data
 if (likely(err == 1))
  err = dst_output(skb);
 ......

The path ip_local_out => __ip_local_out => nf_hook performs netfilter filtering. If you have configured rules with iptables, this is where they are checked against. If you set up very complicated netfilter rules, this function can greatly increase your process's CPU overhead.

Let's not dwell on that and continue with the send-related path, dst_output.

//file: include/net/dst.h
static inline int dst_output(struct sk_buff *skb)
{
 return skb_dst(skb)->output(skb);
}

This function finds the routing entry (dst entry) attached to the skb and calls its output method. This is another function pointer, which points to the ip_output method.

//file: net/ipv4/ip_output.c
int ip_output(struct sk_buff *skb)
{
 //statistics
 .....

 //hand over to netfilter again; call back into ip_finish_output when done
 return NF_HOOK_COND(NFPROTO_IPV4, NF_INET_POST_ROUTING, skb, NULL, dev,
    ip_finish_output,
    !(IPCB(skb)->flags & IPSKB_REROUTED));
}

ip_output does some simple statistics work and runs netfilter filtering again. After the filtering, the callback ip_finish_output is invoked.

//file: net/ipv4/ip_output.c
static int ip_finish_output(struct sk_buff *skb)
{
 //fragment if larger than the MTU
 if (skb->len > ip_skb_dst_mtu(skb) && !skb_is_gso(skb))
  return ip_fragment(skb, ip_finish_output2);
 else
  return ip_finish_output2(skb);
}

In ip_finish_output we can see that if the data is larger than the MTU, fragmentation is performed.

The actual MTU size is determined by MTU discovery; on Ethernet it is usually 1500 bytes. In the early days, the QQ team tried to keep their packets smaller than the MTU as a way of optimizing network performance, because fragmentation brings two problems: 1. extra fragmentation processing, which has additional performance overhead; 2. if any one fragment is lost, the whole packet has to be retransmitted. So avoiding fragmentation not only eliminates the fragmentation overhead but also greatly reduces the retransmission rate.
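As a rough back-of-the-envelope helper (my own illustration; it assumes a 20-byte IP header with no options and uses the multiple-of-8 payload rule of IP fragmentation), you can estimate how many fragments a given payload would need:

#include <stdio.h>

/* estimate the number of IP fragments for a payload of the given size */
static unsigned int ip_fragment_count(unsigned int payload, unsigned int mtu)
{
 unsigned int per_frag = ((mtu - 20) / 8) * 8;  /* usable payload bytes per fragment */
 return (payload + per_frag - 1) / per_frag;    /* ceiling division */
}

int main(void)
{
 /* e.g. a 3000-byte payload over a 1500-byte MTU link needs 3 fragments */
 printf("%u\n", ip_fragment_count(3000, 1500));
 return 0;
}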

In ip_finish_output2, the final sending process will enter the next layer, the neighbor subsystem.

//file: net/ipv4/ip_output.c
static inline int ip_finish_output2(struct sk_buff *skb)
{
 //look up the neighbor entry by the next-hop IP address; create one if not found
 nexthop = (__force u32) rt_nexthop(rt, ip_hdr(skb)->daddr);  
 neigh = __ipv4_neigh_lookup_noref(dev, nexthop);
 if (unlikely(!neigh))
  neigh = __neigh_create(&arp_tbl, &nexthop, dev, false);

 //continue passing it down
 int res = dst_neigh_output(dst, neigh, skb);
}

4.4 Neighbor subsystem

The neighbor subsystem sits between the network layer and the data link layer. It provides an encapsulation for the network layer so that the network layer does not need to care about lower-layer address information, leaving it to the lower layer to decide which MAC address to send to.

This neighbor subsystem does not live under the protocol stack's net/ipv4/ directory but in net/core/neighbour.c, because it is needed by both IPv4 and IPv6.

In the neighbor subsystem, the main work is to find or create a neighbor entry. When creating a neighbor entry, an actual ARP request may be sent. Then the MAC header is filled in and the send process is passed down to the network device subsystem. The general process is shown in the figure.

With the big picture in mind, let's go back to the source code. ip_finish_output2 in the section above calls __ipv4_neigh_lookup_noref, which searches the ARP cache; its second parameter is the route's next-hop IP.

//file: include/net/arp.h
extern struct neigh_table arp_tbl;
static inline struct neighbour *__ipv4_neigh_lookup_noref(
 struct net_device *dev, u32 key)
{
 struct neigh_hash_table *nht = rcu_dereference_bh(arp_tbl.nht);

 //compute the hash value to speed up the lookup
 hash_val = arp_hashfn(......);
 for (n = rcu_dereference_bh(nht->hash_buckets[hash_val]);
   n != NULL;
   n = rcu_dereference_bh(n->next)) {
  if (n->dev == dev && *(u32 *)n->primary_key == key)
   return n;
 }
}

If not found, call __neigh_create to create a neighbor.

//file: net/core/neighbour.c
struct neighbour *__neigh_create(......)
{
 //allocate a neighbor entry
 struct neighbour *n1, *rc, *n = neigh_alloc(tbl, dev);

 //fill it in
 memcpy(n->primary_key, pkey, key_len);
 n->dev = dev;
 n->parms->neigh_setup(n);

 //finally, add it to the neighbor hashtable
 rcu_assign_pointer(nht->hash_buckets[hash_val], n);
 ......

Having the neighbor entry is still not enough to send the IP packet, because the destination MAC address has not been obtained yet. dst_neigh_output is called to keep passing the skb down.

//file: include/net/dst.h
static inline int dst_neigh_output(struct dst_entry *dst, 
     struct neighbour *n, struct sk_buff *skb)
{
 ......
 return n->output(n, skb);
}

The output call here actually points to neigh_resolve_output. Inside this function, an ARP request may be sent over the network.

//file: net/core/neighbour.c
int neigh_resolve_output(){

 //note: this may trigger an ARP request
 if (!neigh_event_send(neigh, skb)) {

  //neigh->ha is the MAC address
  dev_hard_header(skb, dev, ntohs(skb->protocol),
           neigh->ha, NULL, skb->len);
  //send
  dev_queue_xmit(skb);
 }
}

Once the hardware MAC address is obtained, the skb's MAC header can be filled in. Finally, dev_queue_xmit is called to pass the skb to the Linux network device subsystem.

4.5 Network device subsystem

The neighbor subsystem enters the network device subsystem through dev_queue_xmit.

//file: net/core/dev.c 
int dev_queue_xmit(struct sk_buff *skb)
{
 //select the transmit queue
 txq = netdev_pick_tx(dev, skb);

 //get the qdisc (queuing discipline) associated with this queue
 q = rcu_dereference_bh(txq->qdisc);

 //if there is a queue, call __dev_xmit_skb to continue processing
 if (q->enqueue) {
  rc = __dev_xmit_skb(skb, q, dev, txq);
  goto out;
 }

 //devices without a queue are loopback and tunnel devices
 ......
}

In section 2 on network card startup, we mentioned that network cards (especially modern ones) have multiple transmit queues. The call to netdev_pick_tx above selects one of these queues to send on.

netdev_pick_tx's queue selection is affected by configuration such as XPS, and there is also a cache; it is a small but fiddly piece of logic. Here we only look at two points: first the user's XPS configuration is consulted, and otherwise the queue is computed automatically. See netdev_pick_tx => __netdev_pick_tx.

//file: net/core/flow_dissector.c
u16 __netdev_pick_tx(struct net_device *dev, struct sk_buff *skb)
{
 //get the XPS configuration
 int new_index = get_xps_queue(dev, skb);

 //otherwise compute the queue automatically
 if (new_index < 0)
  new_index = skb_tx_hash(dev, skb);
}

Next, the qdisc associated with this queue is obtained. On Linux you can see the qdisc type with the tc command; for example, on one of my multi-queue network card machines it is the mq qdisc.

#tc qdisc
qdisc mq 0: dev eth0 root

Most devices have queues (except loopback and tunnel devices), so now we go to __dev_xmit_skb.

//file: net/core/dev.c
static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
     struct net_device *dev,
     struct netdev_queue *txq)
{
 //1. if the queuing system can be bypassed
 if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
     qdisc_run_begin(q)) {
  ......
 }

 //2. normal queuing
 else {

  //enqueue
  q->enqueue(skb, q)

  //start sending
  __qdisc_run(q);
 }
}

There are two situations in the above code, one is that the queuing system can be bypassed, and the other is normal queuing. We only look at the second case.

First call q->enqueue to add skb to the queue. Then call __qdisc_run to start sending.

//file: net/sched/sch_generic.c
void __qdisc_run(struct Qdisc *q)
{
 int quota = weight_p;

 //loop, dequeuing one skb at a time and sending it
 while (qdisc_restart(q)) {
  
  // defer the work if either of the following happens:
  // 1. the quota is used up
  // 2. another process needs the CPU
  if (--quota <= 0 || need_resched()) {
   //this raises a NET_TX_SOFTIRQ softirq
   __netif_schedule(q);
   break;
  }
 }
}

In the code above, we see that the while loop keeps fetching skbs from the queue and sending them. Note that this time is actually charged to the user process's system time (sy). Only when the quota is used up, or another process needs the CPU, is a soft interrupt triggered to take over the sending.

This is the second reason why, on a typical server, NET_RX in /proc/softirqs is generally much larger than NET_TX: receiving always goes through the NET_RX soft interrupt, while sending only falls back to the soft interrupt when the system-mode quota is exhausted.

Let's focus on qdisc_restart and continue to see the sending process.

static inline int qdisc_restart(struct Qdisc *q)
{
 //take the skb to be sent out of the qdisc
 skb = dequeue_skb(q);
 ...

 return sch_direct_xmit(skb, q, dev, txq, root_lock);
}

qdisc_restart takes a skb from the queue and calls sch_direct_xmit to continue sending.

//file: net/sched/sch_generic.c
int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
   struct net_device *dev, struct netdev_queue *txq,
   spinlock_t *root_lock)
{
 //call the driver to send the data
 ret = dev_hard_start_xmit(skb, dev, txq);
}

4.6 Soft interrupt scheduling

In section 4.5, we saw that when there is not enough system-mode CPU time to finish sending the network packets, __netif_schedule is called to trigger a soft interrupt. That call enters __netif_reschedule, which actually raises a NET_TX_SOFTIRQ soft interrupt.

The soft interrupt is run by a kernel thread and enters the net_tx_action function, where the send queue is retrieved and, eventually, the driver's entry function dev_hard_start_xmit is called.

//file: net/core/dev.c
static inline void __netif_reschedule(struct Qdisc *q)
{
 sd = &__get_cpu_var(softnet_data);
 q->next_sched = NULL;
 *sd->output_queue_tailp = q;
 sd->output_queue_tailp = &q->next_sched;

 ......
 raise_softirq_irqoff(NET_TX_SOFTIRQ);
}

In this function, the queue of data to be sent is stored in softnet_data, which the soft interrupt can access, and added to output_queue. Then a NET_TX_SOFTIRQ soft interrupt is raised. (TX stands for transmit.)

We won't go into the softirq entry code in detail here; interested readers can refer to section 3.2 of "Illustrated Linux Network Packet Receiving Process" on how the ksoftirqd kernel thread handles soft interrupts.

Let's start directly from net_tx_action, the callback function registered for NET_TX_SOFTIRQ. After the user-mode process triggers the soft interrupt, a softirq kernel thread executes net_tx_action.

Keep in mind that from this point on, the CPU consumed by sending data shows up as si, and the user process's system time (sy) is no longer consumed.

//file: net/core/dev.c
static void net_tx_action(struct softirq_action *h)
{
 //get the send queue via softnet_data
 struct softnet_data *sd = &__get_cpu_var(softnet_data);

 // if there are qdiscs on the output queue
 if (sd->output_queue) {

  // point head at the first qdisc
  head = sd->output_queue;

  //walk the list of qdiscs
  while (head) {
   struct Qdisc *q = head;
   head = head->next_sched;

   //send the data
   qdisc_run(q);
  }
 }
}

The soft interrupt gets hold of softnet_data here. As we saw earlier, the process's kernel-mode code wrote the send queue into softnet_data's output_queue when it called __netif_reschedule. The soft interrupt loops over sd->output_queue and sends the data frames.

Let's look at qdisc_run; just like the process in kernel mode, it also calls __qdisc_run.

//file: include/net/pkt_sched.h
static inline void qdisc_run(struct Qdisc *q)
{
 if (qdisc_run_begin(q))
  __qdisc_run(q);
}

From there it is the same path again: qdisc_restart => sch_direct_xmit, all the way to the driver function dev_hard_start_xmit.

4.7 igb network card driver sending

We saw earlier that both the user process's kernel-mode path and the soft interrupt context call dev_hard_start_xmit in the network device subsystem. In that function, the driver's send function igb_xmit_frame is called.

In the driver function, the skb is attached to the RingBuffer, and after the driver call the packet is actually sent out from the network card.

Let's take a look at the actual source code:

//file: net/core/dev.c
int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
   struct netdev_queue *txq)
{
 //get the device's callback function set, ops
 const struct net_device_ops *ops = dev->netdev_ops;

 //get the list of features the device supports
 features = netif_skb_features(skb);

 //call ndo_start_xmit, the send callback in the driver's ops, to hand the packet to the network card
 skb_len = skb->len;
 rc = ops->ndo_start_xmit(skb, dev);
}

Among them, ndo_start_xmit is a function to be implemented by the network card driver, which is defined in net_device_ops.

//file: include/linux/netdevice.h
struct net_device_ops {
 netdev_tx_t  (*ndo_start_xmit) (struct sk_buff *skb,
         struct net_device *dev);

};

In the igb network card driver source code, we found it.

//file: drivers/net/ethernet/intel/igb/igb_main.c
static const struct net_device_ops igb_netdev_ops = {
 .ndo_open  = igb_open,
 .ndo_stop  = igb_close,
 .ndo_start_xmit  = igb_xmit_frame, 
 ...
};

In other words, for the ndo_start_xmit defined by the network device layer, igb's implementation is igb_xmit_frame. This function pointer is assigned when the network card driver initializes; for the specific initialization process, see section 2.4, network card driver initialization, in "Illustrated Linux Network Packet Receiving Process".

So when calling ops->ndo_start_xmit at the network device layer above, it will actually enter the function igb_xmit_frame. Let's step into this function to see how the driver works.

//file: drivers/net/ethernet/intel/igb/igb_main.c
static netdev_tx_t igb_xmit_frame(struct sk_buff *skb,
      struct net_device *netdev)
{
 ......
 return igb_xmit_frame_ring(skb, igb_tx_queue_mapping(adapter, skb));
}

netdev_tx_t igb_xmit_frame_ring(struct sk_buff *skb,
    struct igb_ring *tx_ring)
{
 //get the next available buffer info in the TX queue
 first = &tx_ring->tx_buffer_info[tx_ring->next_to_use];
 first->skb = skb;
 first->bytecount = skb->len;
 first->gso_segs = 1;

 //igb_tx_map prepares the data to be handed to the device
 igb_tx_map(tx_ring, first, hdr_len);
}

Here, an element is taken from the RingBuffer of the sending queue of the network card, and the skb is attached to the element.

The igb_tx_map function handles mapping the skb data into a memory DMA area accessible by the network card.

//file: drivers/net/ethernet/intel/igb/igb_main.c
static void igb_tx_map(struct igb_ring *tx_ring,
      struct igb_tx_buffer *first,
      const u8 hdr_len)
{
 //get a pointer to the next available descriptor
 tx_desc = IGB_TX_DESC(tx_ring, i);

 //build a memory mapping for skb->data so the device can read the data from RAM via DMA
 dma = dma_map_single(tx_ring->dev, skb->data, size, DMA_TO_DEVICE);

 //iterate over all fragments of the packet, generating a valid mapping for each one
 for (frag = &skb_shinfo(skb)->frags[0];; frag++) {

  tx_desc->read.buffer_addr = cpu_to_le64(dma);
  tx_desc->read.cmd_type_len = ...;
  tx_desc->read.olinfo_status = 0;
 }

 //set the last descriptor
 cmd_type |= size | IGB_TXD_DCMD;
 tx_desc->read.cmd_type_len = cpu_to_le32(cmd_type);

 /* Force memory writes to complete before letting h/w know there
  * are new descriptors to fetch
  */
 wmb();
}

When all required descriptors have been built and all data in the skb has been mapped to DMA addresses, the driver goes to its final step, triggering the actual send.
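That final trigger is just a write to the ring's tail register, which tells the hardware that new descriptors are ready to fetch. The end of igb_tx_map looks roughly like this:

//file: drivers/net/ethernet/intel/igb/igb_main.c
 //record where the next transmission should start from
 tx_ring->next_to_use = i;

 //writing the tail register makes the NIC start fetching the new
 //descriptors via DMA and transmitting the frame
 writel(i, tx_ring->tail);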

4.8 Transmit completion hard interrupt

Once the data has been sent, the work is still not over, because the memory has not been reclaimed yet. When transmission completes, the network card raises a hard interrupt, and that is what eventually leads to the memory being freed.

In Sections 3.1 and 3.2 of the article "Illustrated Linux Network Packet Receiving Process" , we describe the processing process of hard interrupts and soft interrupts in detail.

In the transmit-completion hard interrupt path, the RingBuffer memory is cleaned up, as shown in the figure.

Look back at the source code of the soft interrupt triggered by the hard interrupt.

//file: net/core/dev.c
static inline void ____napi_schedule(...){
 list_add_tail(&napi->poll_list, &sd->poll_list);
 __raise_softirq_irqoff(NET_RX_SOFTIRQ);
}

There is a very interesting detail here: whether the hard interrupt was raised because there is data to receive or because a transmission completed, the soft interrupt triggered from it is NET_RX_SOFTIRQ. We mentioned this in section 1; it is one of the reasons RX is higher than TX in the softirq statistics.

OK, now let's move on to igb_poll, the soft interrupt's callback function. In this function we notice a call to igb_clean_tx_irq; see the source:

//file: drivers/net/ethernet/intel/igb/igb_main.c
static int igb_poll(struct napi_struct *napi, int budget)
{
 //performs the transmit completion operations
 if (q_vector->tx.ring)
  clean_complete = igb_clean_tx_irq(q_vector);
 ...
}

Let's take a look at what igb_clean_tx_irq does when the transmission is complete.

//file: drivers/net/ethernet/intel/igb/igb_main.c
static bool igb_clean_tx_irq(struct igb_q_vector *q_vector)
{
 //free the skb
 dev_kfree_skb_any(tx_buffer->skb);

 //clear tx_buffer data
 tx_buffer->skb = NULL;
 dma_unmap_len_set(tx_buffer, len, 0);

 // clear last DMA location and unmap remaining buffers
 while (tx_desc != eop_desc) {
 }
}

It does little more than clean up the skb, release the DMA mapping, and so on. At this point, the transmission is basically complete.

Why do I say basically complete rather than fully complete? Because the transport layer has to guarantee reliability, the original skb has not actually been deleted yet. It is only deleted once the peer's ACK arrives; only then is the send truly finished.

Finally

Let's summarize the entire sending process with one picture.

After understanding the entire sending process, let's go back and review the questions mentioned at the beginning.

1. When we monitor the CPU consumed by the kernel sending data, should we look at sy or si?

In the process of sending network packets, the user process (in kernel mode) does most of the work, even including the call into the driver. A soft interrupt is raised only when the process's kernel-mode time is about to be given up (quota used up or another process needs the CPU). During sending, most (around 90%) of the overhead is consumed in the user process's kernel mode.

Only in a few cases is a NET_TX soft interrupt triggered, with the sending then done by the ksoftirqd kernel thread.

Therefore, when monitoring the CPU overhead that network I/O puts on a server, you should not look only at si; both si and sy need to be taken into account.

2. Check /proc/softirqs on the server, why is NET_RX much larger than NET_TX?

I used to assume that, since NET_RX is receive and NET_TX is transmit, on a server that both receives requests and returns responses the two counters should be roughly the same, or at least not differ by orders of magnitude. But in fact, one of Fei Ge's servers looks like this:

After today's source code analysis, we find there are two reasons for this.

The first reason is that when transmission completes, the driver is notified through a hard interrupt. However, whether the hard interrupt signals received data or transmit completion, the soft interrupt it triggers is NET_RX_SOFTIRQ, not NET_TX_SOFTIRQ.

The second reason is that receiving always has to go through the NET_RX soft interrupt, handled by the ksoftirqd kernel thread. For sending, most of the work is handled in the user process's kernel mode; only when the system-mode quota runs out is a NET_TX soft interrupt raised to let the soft interrupt take over.

For these two reasons, it is not hard to understand why NET_RX is much larger than NET_TX on the machine.

3. What memory copy operations are involved in sending network data?

By memory copies here, we mean only copies of the data being sent.

The first copy happens after the kernel has allocated the skb: the data in the buffer passed in by the user is copied into the skb. If the amount of data to send is large, this copy is not cheap.

The second copy happens when going from the transport layer to the network layer: each skb is cloned into a new copy. The network layer and the components below it (drivers, soft interrupts, and so on) delete this clone once transmission completes. The transport layer keeps the original skb so it can retransmit if no ACK comes back from the peer, which is how TCP's required reliability is implemented.

The third copy is not always needed; it only happens when the IP layer finds that the skb is larger than the MTU. Extra skbs are then allocated and the original skb is copied into multiple smaller ones.

A small aside: the "zero copy" that comes up so often in network performance optimization is, I think, a bit overstated. To guarantee TCP's reliability, the second copy cannot be avoided at all, and if the packet is larger than the MTU, the copying during fragmentation is also unavoidable.
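For context, what "zero copy" usually refers to in practice is eliminating that first copy, the one from the user buffer into the kernel. A hedged user-space sketch using sendfile() (file_fd and sock_fd are assumed to be already-open descriptors): the file data goes to the socket without passing through a user-space buffer, but the clone for retransmission and any fragmentation copy described above still happen inside the kernel.

#include <sys/sendfile.h>
#include <sys/types.h>

/* send count bytes of a file to a connected socket without a user-space buffer */
ssize_t send_file_zero_copy(int sock_fd, int file_fd, size_t count)
{
 off_t offset = 0;
 return sendfile(sock_fd, file_fd, &offset, count);
}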

Having read this far, I believe the kernel's sending of data packets is no longer a complete black box to you.

Origin blog.csdn.net/m0_64560763/article/details/131570295