Illustrated: the Linux network packet receiving process

Because first-tier Internet companies have to provide network services to millions, tens of millions, or even more than a hundred million users, one of the key requirements when interviewing or promoting back-end developers is the ability to support high concurrency and to understand where the performance overhead goes so that it can be optimized away. And in many cases, without a deep understanding of what Linux is doing underneath, you will feel you have nowhere to start when you hit performance bottlenecks in production.

Today we will use diagrams to dig into how Linux receives network packets. As usual, let's start from the simplest possible piece of code. For simplicity, we use UDP as the example:

int main(){
    int serverSocketFd = socket(AF_INET, SOCK_DGRAM, 0);
    bind(serverSocketFd, ...);

    char buff[BUFFSIZE];
    int readCount = recvfrom(serverSocketFd, buff, BUFFSIZE, 0, ...);
    buff[readCount] = '\0';
    printf("Receive from client:%s\n", buff);
}

The above code is the receive logic of a UDP server. From a development point of view, as long as the client sends the corresponding data, the server can receive it once recvfrom returns and print it out. What we want to know now is: from the moment a network packet reaches the network card until our recvfrom call returns the data, what exactly happens in between?
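For readers who want to run this end to end, here is a slightly more complete, compilable version of the snippet above. It is only a minimal sketch; the port number 8888 and the buffer size are arbitrary choices for illustration, not something from the original example.

#include <stdio.h>
#include <string.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

#define BUFFSIZE 1024

int main(void){
    int serverSocketFd = socket(AF_INET, SOCK_DGRAM, 0);

    /* bind to 0.0.0.0:8888 (arbitrary example port) */
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8888);
    bind(serverSocketFd, (struct sockaddr *)&addr, sizeof(addr));

    char buff[BUFFSIZE];
    struct sockaddr_in client;
    socklen_t clientLen = sizeof(client);
    int readCount = recvfrom(serverSocketFd, buff, BUFFSIZE - 1, 0,
                             (struct sockaddr *)&client, &clientLen);
    if (readCount < 0)
        return 1;
    buff[readCount] = '\0';
    printf("Receive from client:%s\n", buff);
    return 0;
}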

Through this article, you will gain a deep understanding of how the Linux network stack is implemented internally and how its parts interact, which I believe will be of great help in your work. This article is based on Linux 3.10 (see https://mirrors.edge.kernel.org/pub/linux/kernel/v3.x/ for the source code), and uses Intel's igb network card driver as the example.

A friendly reminder: this article is a little long, so feel free to bookmark it and come back to it later!

1. Overview of Linux network packet receiving

In the TCP/IP layered network model, the protocol stack is divided into the physical layer, link layer, network layer, transport layer and application layer. The physical layer corresponds to network cards and cables; the application layer corresponds to familiar applications such as Nginx and FTP. Linux implements three of these layers: the link layer, the network layer and the transport layer.

In the implementation of the Linux kernel, the link layer protocol is implemented by the network card driver, and the kernel protocol stack implements the network layer and the transport layer. The kernel provides a socket interface to the upper application layer for user processes to access. The TCP/IP network layering model we see from the perspective of Linux should look like this.

Figure 1 Network protocol stack from the perspective of Linux

In the Linux source code, the network device drivers live under drivers/net/ethernet, and the Intel series network card drivers are in the drivers/net/ethernet/intel directory. The protocol stack module code lives in the kernel and net directories.

The kernel and the network device driver communicate by means of interrupts. When data arrives at the device, a voltage change is triggered on the relevant CPU pin to notify the CPU to process the data. For the network module, the processing is complex and time-consuming; if all of it were done in the interrupt handler, the handler (running at too high a priority) would hog the CPU, and the CPU would be unable to respond to other devices, such as mouse and keyboard events. Therefore, Linux splits interrupt handling into a top half and a bottom half. The top half does only the simplest work, finishes quickly and releases the CPU, so that the CPU can accept other interrupts. Most of the remaining work is deferred to the bottom half, where it can be handled at a calmer pace. The bottom-half mechanism adopted since kernel 2.4 is the softirq, which is handled entirely by the ksoftirqd kernel threads. Unlike hard interrupts, which work by applying voltage changes to physical CPU pins, a softirq notifies its handler simply by setting a bit in a variable in memory.

Now that we have a general idea of network card drivers, hard interrupts, softirqs and the ksoftirqd threads, let's sketch the path a packet takes through the kernel based on these concepts:

Figure 2 Overview of Linux kernel network receiving packets

When data arrives at the network card, the first module to go to work in Linux is the network driver. The driver writes the frame received by the network card into memory via DMA, then raises an interrupt to notify the CPU that data has arrived. When the CPU receives the interrupt request, it calls the interrupt handler registered by the network driver. The network card's interrupt handler does very little: it raises a softirq request and releases the CPU as soon as possible. When ksoftirqd detects that a softirq request has arrived, it calls poll to start polling for packets, and hands what it receives to the protocol stack layers for processing. UDP packets end up in the receive queue of the user's socket.

From the picture above we have grasped the overall flow of how Linux processes a packet. But to understand the inner workings of the network module in more detail, we have to keep digging.

2. Linux startup preparation

The Linux driver, kernel protocol stack and other modules need to do a lot of preparatory work before they can receive packets from the network card. For example, the ksoftirqd kernel threads must be created in advance, the handler functions for each protocol must be registered, the network device subsystem must be initialized ahead of time, and the network card must be started. Only after all of this is ready can packet reception actually begin. So let's take a look at how these preparations are done.

2.1 Create ksoftirqd kernel thread

Linux softirqs all run in dedicated kernel threads (ksoftirqd), so it is well worth seeing how these threads are initialized, so that we can follow the packet receiving process more accurately later. There is not just one such thread but N of them, where N equals the number of cores on your machine.
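You can confirm this on your own machine by counting the ksoftirqd threads, one per CPU. The sketch below is a small illustrative user-space program (not part of the kernel code discussed in this article) that scans /proc/<pid>/comm:

#include <stdio.h>
#include <string.h>
#include <ctype.h>
#include <dirent.h>

/* Count kernel threads whose comm starts with "ksoftirqd/". */
int main(void){
    DIR *proc = opendir("/proc");
    struct dirent *de;
    int count = 0;

    while (proc && (de = readdir(proc)) != NULL) {
        if (!isdigit((unsigned char)de->d_name[0]))
            continue;                          /* only numeric pid directories */

        char path[64], comm[64] = {0};
        snprintf(path, sizeof(path), "/proc/%s/comm", de->d_name);
        FILE *f = fopen(path, "r");
        if (!f)
            continue;
        if (fgets(comm, sizeof(comm), f) && strncmp(comm, "ksoftirqd/", 10) == 0)
            count++;
        fclose(f);
    }
    if (proc)
        closedir(proc);

    printf("ksoftirqd threads: %d (should equal the number of CPUs)\n", count);
    return 0;
}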

During system initialization, smpboot_register_percpu_thread is called in kernel/smpboot.c, which in turn executes spawn_ksoftirqd (located in kernel/softirq.c) to create the ksoftirqd threads.

Figure 3 Create ksoftirqd kernel thread

The relevant code is as follows:

//file: kernel/softirq.c
static struct smp_hotplug_thread softirq_threads = {
    .store          = &ksoftirqd,
    .thread_should_run  = ksoftirqd_should_run,
    .thread_fn      = run_ksoftirqd,
    .thread_comm        = "ksoftirqd/%u",
};

static __init int spawn_ksoftirqd(void){
    register_cpu_notifier(&cpu_nfb);

    BUG_ON(smpboot_register_percpu_thread(&softirq_threads));
    return 0;
}
early_initcall(spawn_ksoftirqd);

After ksoftirqd is created, it enters its thread loop built from the two functions ksoftirqd_should_run and run_ksoftirqd, continuously checking whether there is a softirq that needs handling. Note that softirqs are not only used for networking; there are other types as well:

//file: include/linux/interrupt.h

enum{
    HI_SOFTIRQ=0,
    TIMER_SOFTIRQ,
    NET_TX_SOFTIRQ,
    NET_RX_SOFTIRQ,
    BLOCK_SOFTIRQ,
    BLOCK_IOPOLL_SOFTIRQ,
    TASKLET_SOFTIRQ,
    SCHED_SOFTIRQ,
    HRTIMER_SOFTIRQ,
    RCU_SOFTIRQ,  
};

2.2 Network Subsystem Initialization

Figure 4 Network subsystem initialization

The Linux kernel initializes its various subsystems through calls to subsys_initcall; you can grep the source tree and find many invocations of this function. What we are concerned with here is the initialization of the network subsystem, which executes the net_dev_init function.

//file: net/core/dev.c
static int __init net_dev_init(void){
    ......

    for_each_possible_cpu(i) {
        struct softnet_data *sd = &per_cpu(softnet_data, i);

        memset(sd, 0, sizeof(*sd));
        skb_queue_head_init(&sd->input_pkt_queue);
        skb_queue_head_init(&sd->process_queue);
        sd->completion_queue = NULL;
        INIT_LIST_HEAD(&sd->poll_list);
        ......
    }
    ......
    open_softirq(NET_TX_SOFTIRQ, net_tx_action);
    open_softirq(NET_RX_SOFTIRQ, net_rx_action);
}
subsys_initcall(net_dev_init);

In this function, a softnet_data structure is allocated for each CPU. Its poll_list field is where the driver will later register its poll function; we will see this when the network card driver is initialized.

In addition, open_softirq registers a handler function for each softirq: net_tx_action for NET_TX_SOFTIRQ and net_rx_action for NET_RX_SOFTIRQ. Tracing into open_softirq, we find that the registrations are recorded in the softirq_vec array. When a ksoftirqd thread receives a softirq later, it also uses this array to find the handler corresponding to each softirq.

//file: kernel/softirq.c
void open_softirq(int nr, void (*action)(struct softirq_action *)){
    softirq_vec[nr].action = action;
}

2.3 Protocol stack registration

The kernel implements the IP protocol at the network layer, and the TCP and UDP protocols at the transport layer. The implementation functions of these protocols are ip_rcv(), tcp_v4_rcv() and udp_rcv() respectively. Unlike the way we usually write code, the kernel hooks them in through registration. fs_initcall, like subsys_initcall, is an entry point for module initialization; fs_initcall calls inet_init, which starts the registration of the network protocol stack. Through inet_init, these functions are registered into the inet_protos and ptype_base data structures. As shown below:

Figure 5 AF_INET protocol stack registration

The relevant code is as follows

//file: net/ipv4/af_inet.c
static struct packet_type ip_packet_type __read_mostly = {
    .type = cpu_to_be16(ETH_P_IP),
    .func = ip_rcv,
};

static const struct net_protocol udp_protocol = {
    .handler =  udp_rcv,
    .err_handler =  udp_err,
    .no_policy =    1,
    .netns_ok = 1,
};

static const struct net_protocol tcp_protocol = {
    .early_demux    =   tcp_v4_early_demux,
    .handler    =   tcp_v4_rcv,
    .err_handler    =   tcp_v4_err,
    .no_policy  =   1,
    .netns_ok   =   1,
};
static int __init inet_init(void){
    ......
    if (inet_add_protocol(&icmp_protocol, IPPROTO_ICMP) < 0)
        pr_crit("%s: Cannot add ICMP protocol\n", __func__);
    if (inet_add_protocol(&udp_protocol, IPPROTO_UDP) < 0)
        pr_crit("%s: Cannot add UDP protocol\n", __func__);
    if (inet_add_protocol(&tcp_protocol, IPPROTO_TCP) < 0)
        pr_crit("%s: Cannot add TCP protocol\n", __func__);
    ......
    dev_add_pack(&ip_packet_type);
}

In the code above we can see that the handler in the udp_protocol structure is udp_rcv and the handler in the tcp_protocol structure is tcp_v4_rcv; they are registered through inet_add_protocol.

int inet_add_protocol(const struct net_protocol *prot, unsigned char protocol){
    if (!prot->netns_ok) {
        pr_err("Protocol %u is not namespace aware, cannot register.\n",
            protocol);
        return -EINVAL;
    }

    return !cmpxchg((const struct net_protocol **)&inet_protos[protocol],
            NULL, prot) ? 0 : -1;
}

The inet_add_protocol function registers the handlers for TCP and UDP into the inet_protos array. Now look at the dev_add_pack(&ip_packet_type); line: the type field in the ip_packet_type structure is the protocol name and func is the ip_rcv function, and dev_add_pack registers it into the ptype_base hash table.

//file: net/core/dev.c
void dev_add_pack(struct packet_type *pt){
    struct list_head *head = ptype_head(pt);
    ......
}
static inline struct list_head *ptype_head(const struct packet_type *pt){
    if (pt->type == htons(ETH_P_ALL))
        return &ptype_all;
    else
        return &ptype_base[ntohs(pt->type) & PTYPE_HASH_MASK];
}

What we need to remember here is that inet_protos records the addresses of the UDP and TCP handler functions, and ptype_base stores the address of the ip_rcv() handler. Later we will see that the softirq finds the ip_rcv function address through ptype_base and hands the IP packet to ip_rcv() for processing; inside ip_rcv, the TCP or UDP handler is found through inet_protos, and the packet is then forwarded to udp_rcv() or tcp_v4_rcv().

To expand a little: if you read the code of functions such as ip_rcv and udp_rcv, you can see how much other protocol work happens there. For example, ip_rcv handles netfilter and iptables filtering; if you have many or very complex netfilter or iptables rules, they are executed in softirq context and will add to network latency. As another example, udp_rcv checks whether the socket receive queue is full; the related kernel parameters are net.core.rmem_max and net.core.rmem_default. If you are interested, I recommend reading the code of inet_init carefully.
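To make the registration idea concrete before we move on, here is a toy user-space analogy (illustrative only, not kernel code) of how inet_protos-style dispatch works: handlers are recorded in an array indexed by protocol number at init time, and the receive path simply looks up the entry and calls it.

#include <stdio.h>
#include <netinet/in.h>   /* IPPROTO_TCP, IPPROTO_UDP */

/* Toy analogy of inet_protos: an array of handler function pointers
 * indexed by IP protocol number. */
typedef int (*proto_handler_t)(const char *pkt);

static proto_handler_t proto_table[256];

static int toy_udp_rcv(const char *pkt) { printf("udp handler: %s\n", pkt); return 0; }
static int toy_tcp_rcv(const char *pkt) { printf("tcp handler: %s\n", pkt); return 0; }

/* "Registration" phase, analogous to inet_add_protocol() */
static void toy_inet_init(void){
    proto_table[IPPROTO_UDP] = toy_udp_rcv;
    proto_table[IPPROTO_TCP] = toy_tcp_rcv;
}

/* "Receive" phase, analogous to the lookup in ip_local_deliver_finish() */
static void toy_deliver(int protocol, const char *pkt){
    if (proto_table[protocol])
        proto_table[protocol](pkt);
}

int main(void){
    toy_inet_init();
    toy_deliver(IPPROTO_UDP, "hello");   /* dispatched to toy_udp_rcv */
    return 0;
}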

2.4 Network card driver initialization

Every driver (not only network card drivers) uses module_init to register an initialization function with the kernel, and the kernel calls this function when the driver is loaded. For example, the code of the igb network card driver is located in drivers/net/ethernet/intel/igb/igb_main.c:

//file: drivers/net/ethernet/intel/igb/igb_main.c
static struct pci_driver igb_driver = {
    .name     = igb_driver_name,
    .id_table = igb_pci_tbl,
    .probe    = igb_probe,
    .remove   = igb_remove,
    ......
};
static int __init igb_init_module(void){
    ......
    ret = pci_register_driver(&igb_driver);
    return ret;
}

After the driver's pci_register_driver call completes, the Linux kernel knows the driver's details, such as igb_driver_name and the address of the igb_probe function. When a matching network card device is detected, the kernel calls the driver's probe method (for igb_driver, that is igb_probe). The purpose of the probe method is to get the device ready. For the igb network card, igb_probe is located in drivers/net/ethernet/intel/igb/igb_main.c. The main operations it performs are as follows:

Figure 6 Network card driver initialization

In step 5 we can see that the network card driver implements the interfaces required by ethtool and registers the corresponding function addresses here. When ethtool issues a system call, the kernel finds the callback function for the requested operation. For the igb network card, these implementation functions are all in drivers/net/ethernet/intel/igb/igb_ethtool.c. Now you can fully understand how ethtool works: the reason the command can display a card's send/receive statistics, change its adaptive settings, or adjust the number and size of its RX queues is that ethtool ultimately calls the corresponding methods of the network card driver, not that ethtool itself has any superpowers.
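As an illustration of that last point, the sketch below issues the same kind of SIOCETHTOOL ioctl that the ethtool command uses to read the ring sizes (the equivalent of ethtool -g); the kernel routes the request to the driver's registered ethtool callback. The interface name eth0 is an assumption you would replace with your own.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

/* Query RX/TX ring sizes the way `ethtool -g eth0` does. */
int main(void){
    struct ethtool_ringparam ring = { .cmd = ETHTOOL_GRINGPARAM };
    struct ifreq ifr;

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);   /* assumed interface name */
    ifr.ifr_data = (char *)&ring;

    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (ioctl(fd, SIOCETHTOOL, &ifr) == 0)
        printf("RX ring: %u (max %u), TX ring: %u (max %u)\n",
               ring.rx_pending, ring.rx_max_pending,
               ring.tx_pending, ring.tx_max_pending);
    else
        perror("SIOCETHTOOL");
    close(fd);
    return 0;
}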

The igb_netdev_ops registered in step 6 contains functions such as igb_open, which will be called when the network card is started.

//file: drivers/net/ethernet/intel/igb/igb_main.c
static const struct net_device_ops igb_netdev_ops = {
  .ndo_open               = igb_open,
  .ndo_stop               = igb_close,
  .ndo_start_xmit         = igb_xmit_frame,
  .ndo_get_stats64        = igb_get_stats64,
  .ndo_set_rx_mode        = igb_set_rx_mode,
  .ndo_set_mac_address    = igb_set_mac,
  .ndo_change_mtu         = igb_change_mtu,
  .ndo_do_ioctl           = igb_ioctl,
  ......
};

In step 7 of the igb_probe initialization process, igb_alloc_q_vector is also called. It registers the poll function required by the NAPI mechanism; for the igb driver, this function is igb_poll, as shown in the following code.

static int igb_alloc_q_vector(struct igb_adapter *adapter,
                  int v_count, int v_idx,
                  int txr_count, int txr_idx,
                  int rxr_count, int rxr_idx){
    ......
    /* initialize NAPI */
    netif_napi_add(adapter->netdev, &q_vector->napi,
                   igb_poll, 64);
}

2.5 Start the network card

When the above initialization is complete, the network card can be brought up. Recall from the driver initialization section that the driver registered the net_device_ops structure with the kernel, which contains callbacks (function pointers) for enabling the card, transmitting packets, setting the MAC address, and so on. When a network card is enabled (for example via ifconfig eth0 up), the igb_open method in net_device_ops is called. It usually does the following:

Figure 7 Start the network card

//file: drivers/net/ethernet/intel/igb/igb_main.c
static int __igb_open(struct net_device *netdev, bool resuming){
    /* allocate transmit descriptors */
    err = igb_setup_all_tx_resources(adapter);

    /* allocate receive descriptors */
    err = igb_setup_all_rx_resources(adapter);

    /* register the interrupt handler */
    err = igb_request_irq(adapter);
    if (err)
        goto err_req_irq;

    /* enable NAPI */
    for (i = 0; i < adapter->num_q_vectors; i++)
        napi_enable(&(adapter->q_vector[i]->napi));
    ......
}

The __igb_open function above calls igb_setup_all_tx_resources and igb_setup_all_rx_resources. In the igb_setup_all_rx_resources step, the RingBuffer is allocated and the mapping between memory and the RX queues is established. (The number and size of the RX/TX queues can be configured through ethtool.) Now let's look at the interrupt registration, igb_request_irq:

static int igb_request_irq(struct igb_adapter *adapter){
    if (adapter->msix_entries) {
        err = igb_request_msix(adapter);
        if (!err)
            goto request_done;
        ......
    }
}
static int igb_request_msix(struct igb_adapter *adapter){
    ......
    for (i = 0; i < adapter->num_q_vectors; i++) {
        ...
        err = request_irq(adapter->msix_entries[vector].vector,
                  igb_msix_ring, 0, q_vector->name,
                  q_vector);
    }
    ......
}

Tracing the call chain in the code above, __igb_open => igb_request_irq => igb_request_msix, we can see in igb_request_msix that for a multi-queue network card an interrupt is registered for each queue, with igb_msix_ring as the handler (this function is also in drivers/net/ethernet/intel/igb/igb_main.c). We can also see that in MSI-X mode each RX queue has its own independent MSI-X interrupt, so at the hardware interrupt level the card can be configured so that packets received on different queues are processed by different CPUs. (You can change the CPU binding through irqbalance, or by modifying /proc/irq/IRQ_NUMBER/smp_affinity.)
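For reference, changing that binding simply amounts to writing a CPU bitmask into the corresponding procfs file; the shell equivalent is echo 2 > /proc/irq/43/smp_affinity. The sketch below is purely illustrative: the IRQ number 43 and the mask value are assumptions you would replace with the numbers you find in /proc/interrupts for your own RX queues.

#include <stdio.h>

/* Pin an (assumed) RX queue interrupt to CPU 1 by writing a hex cpumask.
 * Requires root; replace 43 with the IRQ number of your own queue. */
int main(void){
    FILE *f = fopen("/proc/irq/43/smp_affinity", "w");
    if (!f) {
        perror("open smp_affinity");
        return 1;
    }
    fputs("2\n", f);   /* bitmask 0x2 == CPU 1 */
    fclose(f);
    return 0;
}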

When the above preparations are done, you can open the door to welcome guests (data packets)!

3. Welcoming the arrival of data

3.1 Hard interrupt processing

First, when a data frame arrives at the network card from the cable, its first stop is the card's receive queue. The network card finds an available buffer in the RingBuffer allocated to it, and the DMA engine copies the data into the memory previously associated with the card; the CPU is completely unaware of this. When the DMA operation completes, the network card raises a hard interrupt to the CPU to notify it that data has arrived.

Figure 8 NIC data hard interrupt processing process

Note: when the RingBuffer is full, new packets are dropped. When you inspect a network card with ifconfig, it may show an overruns counter, indicating that packets were discarded because the ring was full. If you see this kind of packet loss, you may need to use the ethtool command to enlarge the ring.
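If you prefer to check these counters programmatically, per-interface statistics are also exported under /sys/class/net/<iface>/statistics/. A minimal sketch follows; eth0 is an assumed interface name, and rx_fifo_errors is the counter that ifconfig typically reports as overruns.

#include <stdio.h>

/* Read one counter from /sys/class/net/eth0/statistics/ (interface name assumed). */
static unsigned long read_counter(const char *name){
    char path[128];
    unsigned long val = 0;
    snprintf(path, sizeof(path), "/sys/class/net/eth0/statistics/%s", name);
    FILE *f = fopen(path, "r");
    if (f) {
        if (fscanf(f, "%lu", &val) != 1)
            val = 0;
        fclose(f);
    }
    return val;
}

int main(void){
    printf("rx_dropped:     %lu\n", read_counter("rx_dropped"));
    printf("rx_fifo_errors: %lu\n", read_counter("rx_fifo_errors"));
    return 0;
}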

In the section on starting the network card, we mentioned that the processing function of the hard interrupt registration of the network card is igb_msix_ring.

//file: drivers/net/ethernet/intel/igb/igb_main.c
static irqreturn_t igb_msix_ring(int irq, void *data){
    struct igb_q_vector *q_vector = data;

    /* Write the ITR value calculated from the previous interrupt. */
    igb_write_itr(q_vector);

    napi_schedule(&q_vector->napi);
    return IRQ_HANDLED;
}

igb_write_itr merely records the hardware interrupt rate (reportedly with the aim of reducing the interrupt frequency on the CPU). Follow napi_schedule all the way down: __napi_schedule => ____napi_schedule.

/* Called with irq disabled */
static inline void ____napi_schedule(struct softnet_data *sd,
                     struct napi_struct *napi){
    list_add_tail(&napi->poll_list, &sd->poll_list);
    __raise_softirq_irqoff(NET_RX_SOFTIRQ);
}

Here we see that list_add_tail modifies the poll_list in the per-CPU variable softnet_data, adding the napi_struct passed in by the driver. poll_list in softnet_data is a doubly linked list of devices that have input frames waiting to be processed. Then __raise_softirq_irqoff triggers the NET_RX_SOFTIRQ softirq. This so-called triggering is nothing more than an OR operation on a variable.

void __raise_softirq_irqoff(unsigned int nr){
    trace_softirq_raise(nr);
    or_softirq_pending(1UL << nr);
}
//file: include/linux/irq_cpustat.h
#define or_softirq_pending(x)  (local_softirq_pending() |= (x))

As we said, Linux does only the simple, necessary work in the hard interrupt and hands most of the remaining processing to the softirq. As you can see from the code above, hard interrupt handling really is very short: it records a register, modifies the CPU's poll_list, and raises a softirq. That's it; the hard interrupt's job is done.

3.2 ksoftirqd kernel thread handles soft interrupts

Figure 9 ksoftirqd kernel thread

When the kernel threads were initialized, we introduced the two thread functions of ksoftirqd: ksoftirqd_should_run and run_ksoftirqd. The code of ksoftirqd_should_run is as follows:

static int ksoftirqd_should_run(unsigned int cpu){
    return local_softirq_pending();
}

#define local_softirq_pending() \
    __IRQ_STAT(smp_processor_id(), __softirq_pending)

Here we see the same local_softirq_pending function that was used in the hard interrupt, but used differently: in the hard interrupt path it was used to write the pending flag, here it is only read. If NET_RX_SOFTIRQ was set during the hard interrupt, it will naturally be read here. Next, the code actually enters the thread function run_ksoftirqd for processing:

static void run_ksoftirqd(unsigned int cpu){
    local_irq_disable();
    if (local_softirq_pending()) {
        __do_softirq();
        rcu_note_context_switch(cpu);
        local_irq_enable();
        cond_resched();
        return;
    }
    local_irq_enable();
}

In __do_softirq, the kernel checks which softirq types are pending on the current CPU and calls the registered action method for each.

asmlinkage void __do_softirq(void){
    do {
        if (pending & 1) {
            unsigned int vec_nr = h - softirq_vec;
            int prev_count = preempt_count();
            ...
            trace_softirq_entry(vec_nr);
            h->action(h);
            trace_softirq_exit(vec_nr);
            ...
        }
        h++;
        pending >>= 1;
    } while (pending);
}

In the network subsystem initialization section we saw that net_rx_action was registered as the handler for NET_RX_SOFTIRQ, so the net_rx_action function is what gets executed.

One detail worth noticing here: the softirq pending flag is set in the hard interrupt, and ksoftirqd checks for pending softirqs based on smp_processor_id(). This means that whichever CPU handles the hard interrupt also processes the corresponding softirq. So if you find that your softirq CPU consumption is concentrated on one core, the fix is to adjust the CPU affinity of the hard interrupts and spread them across different CPU cores.
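You can see this per-CPU skew directly in /proc/softirqs. The sketch below is illustrative user-space code (not from the kernel source discussed here) that simply prints the header row and the NET_RX line:

#include <stdio.h>
#include <string.h>

/* Print the per-CPU NET_RX softirq counters from /proc/softirqs. */
int main(void){
    char line[1024];
    FILE *f = fopen("/proc/softirqs", "r");
    if (!f) {
        perror("/proc/softirqs");
        return 1;
    }
    while (fgets(line, sizeof(line), f)) {
        /* the header row contains "CPU0 CPU1 ..."; the counters we want are on the NET_RX row */
        if (strstr(line, "CPU") || strstr(line, "NET_RX"))
            fputs(line, stdout);
    }
    fclose(f);
    return 0;
}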

Let's now focus on the core function, net_rx_action.

static void net_rx_action(struct softirq_action *h){
    struct softnet_data *sd = &__get_cpu_var(softnet_data);
    unsigned long time_limit = jiffies + 2;
    int budget = netdev_budget;
    void *have;

    local_irq_disable();
    while (!list_empty(&sd->poll_list)) {
        ......
        n = list_first_entry(&sd->poll_list, struct napi_struct, poll_list);

        work = 0;
        if (test_bit(NAPI_STATE_SCHED, &n->state)) {
            work = n->poll(n, weight);
            trace_napi_poll(n);
        }
        budget -= work;
    }
}

The time_limit and budget at the start of the function are used to make net_rx_action exit voluntarily, so that receiving packets does not monopolize the CPU; the remaining packets are handled the next time the network card raises a hard interrupt. The budget can be adjusted through a kernel parameter. The rest of the core logic in this function is to fetch the current CPU's softnet_data, traverse its poll_list, and execute the poll function registered by the network card driver. For the igb network card, this is the igb driver's igb_poll function.

static int igb_poll(struct napi_struct *napi, int budget){
    ...
    if (q_vector->tx.ring)
        clean_complete = igb_clean_tx_irq(q_vector);

    if (q_vector->rx.ring)
        clean_complete &= igb_clean_rx_irq(q_vector, budget);
    ...
}

On the receive path, the key work in igb_poll is the call to igb_clean_rx_irq.

static bool igb_clean_rx_irq(struct igb_q_vector *q_vector, const int budget){
    ...
    do {
        /* retrieve a buffer from the ring */
        skb = igb_fetch_rx_buffer(rx_ring, rx_desc, skb);

        /* fetch next buffer in frame if non-eop */
        if (igb_is_non_eop(rx_ring, rx_desc))
            continue;

        /* verify the packet layout is correct */
        if (igb_cleanup_headers(rx_ring, rx_desc, skb)) {
            skb = NULL;
            continue;
        }

        /* populate checksum, timestamp, VLAN, and protocol */
        igb_process_skb_fields(rx_ring, rx_desc, skb);

        napi_gro_receive(&q_vector->napi, skb);
    } while (likely(total_packets < budget));
}

The job of igb_fetch_rx_buffer and igb_is_non_eop is to take a data frame off the RingBuffer. Why two functions? Because a frame may occupy multiple RingBuffer entries, they are fetched in a loop until the end of the frame is reached. Each retrieved frame is represented by an sk_buff. After the data is received, some checks are performed on it, and then fields of the skb such as the timestamp, VLAN id and protocol are filled in. Next it enters napi_gro_receive:

//file: net/core/dev.c
gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb){
    skb_gro_reset_offset(skb);
    return napi_skb_finish(dev_gro_receive(napi, skb), skb);
}

dev_gro_receive implements the network card's GRO feature, which can be roughly understood as merging related small packets into one large packet, with the goal of reducing the number of packets handed to the network stack and thereby reducing CPU usage. Let's ignore it for now and look directly at napi_skb_finish, which mainly calls netif_receive_skb.

//file: net/core/dev.c
static gro_result_t napi_skb_finish(gro_result_t ret, struct sk_buff *skb){
    switch (ret) {
    case GRO_NORMAL:
        if (netif_receive_skb(skb))
            ret = GRO_DROP;
        break;
    ......
}

In netif_receive_skb, the packet is handed to the protocol stack. Note that sections 3.3, 3.4 and 3.5 below are still part of the softirq processing; they are split into separate subsections only because of their length.

3.3 Network protocol stack processing

netif_receive_skb dispatches the packet according to its protocol; for a UDP packet, it is passed on to the ip_rcv() and then udp_rcv() handler functions for processing.

Figure 10 Network protocol stack processing

//file: net/core/dev.c
int netif_receive_skb(struct sk_buff *skb){
    //RPS handling logic, ignore it for now
    ......
    return __netif_receive_skb(skb);
}

static int __netif_receive_skb(struct sk_buff *skb){
    ......
    ret = __netif_receive_skb_core(skb, false);
}

static int __netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc){
    ......

    //pcap logic: the frame is delivered to the capture points registered in ptype_all here.
    //This is the entry point where tcpdump gets its packets.
    list_for_each_entry_rcu(ptype, &ptype_all, list) {
        if (!ptype->dev || ptype->dev == skb->dev) {
            if (pt_prev)
                ret = deliver_skb(skb, pt_prev, orig_dev);
            pt_prev = ptype;
        }
    }
    ......
    list_for_each_entry_rcu(ptype,
            &ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
        if (ptype->type == type &&
            (ptype->dev == null_or_dev || ptype->dev == skb->dev ||
             ptype->dev == orig_dev)) {
            if (pt_prev)
                ret = deliver_skb(skb, pt_prev, orig_dev);
            pt_prev = ptype;
        }
    }
}

In __netif_receive_skb_core I spotted the packet capture point used by tcpdump, which I use so often; quite exciting, it seems the time spent reading the source code is not wasted. Then __netif_receive_skb_core extracts the protocol information from the packet and traverses the list of callback functions registered for that protocol. ptype_base is the hash table we mentioned in the protocol registration section; the address of the ip_rcv function is stored in it.

//file: net/core/dev.c
static inline int deliver_skb(struct sk_buff *skb,
                  struct packet_type *pt_prev,
                  struct net_device *orig_dev){
    ......
    return pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
}

The pt_prev->func line calls the handler registered by the protocol layer. For an IP packet it enters ip_rcv (for an ARP packet, arp_rcv).
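Incidentally, the ptype_all capture point mentioned above is exactly where packet sockets plug in: creating a socket with AF_PACKET and ETH_P_ALL registers a packet_type in ptype_all, so every frame passing through __netif_receive_skb_core is also delivered to it, which is how tcpdump (via libpcap) sees traffic. A minimal sketch, requiring root:

#include <stdio.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <linux/if_ether.h>   /* ETH_P_ALL */
#include <linux/if_packet.h>

/* Receive one raw frame the same way tcpdump/libpcap does (root required). */
int main(void){
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0) {
        perror("socket(AF_PACKET)");
        return 1;
    }
    unsigned char frame[2048];
    ssize_t n = recvfrom(fd, frame, sizeof(frame), 0, NULL, NULL);
    printf("captured a frame of %zd bytes\n", n);
    close(fd);
    return 0;
}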

3.4 IP protocol layer processing

Let's take a general look at what Linux does at the ip protocol layer, and how the packet is further sent to the udp or tcp protocol processing function.

//file: net/ipv4/ip_input.c
int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt, struct net_device *orig_dev){
    ......
    return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING, skb, dev, NULL,
               ip_rcv_finish);
}

NF_HOOK here is a hook function; after the registered hooks have run, it executes the function pointed to by the last argument, ip_rcv_finish.

static int ip_rcv_finish(struct sk_buff *skb){
    ......
    if (!skb_dst(skb)) {
        int err = ip_route_input_noref(skb, iph->daddr, iph->saddr,
                           iph->tos, skb->dev);
        ...
    }
    ......
    return dst_input(skb);
}

Tracing into ip_route_input_noref, we see that it in turn calls ip_route_input_mc, where the function ip_local_deliver is assigned to dst.input, as follows:

//file: net/ipv4/route.c
static int ip_route_input_mc(struct sk_buff *skb, __be32 daddr, __be32 saddr,u8 tos, struct net_device *dev, int our){
    if (our) {
        rth->dst.input= ip_local_deliver;
        rth->dst.input= ip_local_deliver;
        rth->rt_flags |= RTCF_LOCAL;
    }
}

So back in ip_rcv_finish, look at return dst_input(skb);.

/* Input packet from network to transport.  */
static inline int dst_input(struct sk_buff *skb){
    return skb_dst(skb)->input(skb);
}

The input method called by skb_dst(skb)->input is the ip_local_deliver that the routing subsystem assigned.

//file: net/ipv4/ip_input.c
int ip_local_deliver(struct sk_buff *skb){
    /*
     *  Reassemble IP fragments.
     */
    if (ip_is_fragment(ip_hdr(skb))) {
        if (ip_defrag(skb, IP_DEFRAG_LOCAL_DELIVER))
            return 0;
    }

    return NF_HOOK(NFPROTO_IPV4, NF_INET_LOCAL_IN, skb, skb->dev, NULL,
               ip_local_deliver_finish);
}
static int ip_local_deliver_finish(struct sk_buff *skb){
    ......
    int protocol = ip_hdr(skb)->protocol;
    const struct net_protocol *ipprot;

    ipprot = rcu_dereference(inet_protos[protocol]);
    if (ipprot != NULL) {
        ret = ipprot->handler(skb);
    }
}

As seen in the protocol registration section, the function addresses of tcp_v4_rcv() and udp_rcv() are stored in inet_protos. Here the handler is selected according to the protocol type in the packet, and the skb is dispatched further up to the transport layer protocol, UDP or TCP.

3.5 UDP protocol layer processing

We said in the protocol registration section that the processing function of the udp protocol is udp_rcv.

//file: net/ipv4/udp.c
int udp_rcv(struct sk_buff *skb){
    return __udp4_lib_rcv(skb, &udp_table, IPPROTO_UDP);
}
int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
           int proto){
    sk = __udp4_lib_lookup_skb(skb, uh->source, uh->dest, udptable);

    if (sk != NULL) {
        int ret = udp_queue_rcv_skb(sk, skb);
    }
    icmp_send(skb, ICMP_DEST_UNREACH, ICMP_PORT_UNREACH, 0);
}

__udp4_lib_lookup_skb finds the corresponding socket from the skb; when one is found, the packet is placed into that socket's buffer queue. If none is found, an ICMP destination-unreachable packet is sent back.

//file: net/ipv4/udp.c
int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb){  
    ......
    if (sk_rcvqueues_full(sk, skb, sk->sk_rcvbuf))
        goto drop;

    rc = 0;

    ipv4_pktinfo_prepare(skb);
    bh_lock_sock(sk);
    if (!sock_owned_by_user(sk))
        rc = __udp_queue_rcv_skb(sk, skb);
    else if (sk_add_backlog(sk, skb, sk->sk_rcvbuf)) {
        bh_unlock_sock(sk);
        goto drop;
    }
    bh_unlock_sock(sk);
    return rc;
}

sock_owned_by_user checks whether a user is currently making a system call on this socket (i.e. the socket is locked). If not, the packet can be placed directly onto the socket's receive queue; if so, the packet is added to the backlog queue via sk_add_backlog. When the user releases the socket, the kernel checks the backlog queue and, if there is data, moves it to the receive queue.

sk_rcvqueues_full checks whether the receive queue is full; if it is, the packet is dropped outright. The receive queue size is governed by the kernel parameters net.core.rmem_max and net.core.rmem_default.
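For completeness, a program can also ask for a larger receive buffer for its own socket with SO_RCVBUF; the kernel caps the request at net.core.rmem_max (a privileged process can exceed that cap with SO_RCVBUFFORCE). A minimal sketch; the 4 MB figure is just an arbitrary example value:

#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(void){
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    int want = 4 * 1024 * 1024;     /* example value: 4 MB */

    /* Request a larger receive queue; the kernel caps it at net.core.rmem_max
     * (and doubles the stored value internally for bookkeeping overhead). */
    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &want, sizeof(want));

    int got = 0;
    socklen_t len = sizeof(got);
    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &got, &len);
    printf("effective SO_RCVBUF: %d bytes\n", got);
    return 0;
}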

4. The recvfrom system call

As the old storytellers say, two flowers bloom and each branch is told in turn. Above we covered how the Linux kernel receives and processes a packet and finally places it on the socket's receive queue. Now let's go back and look at what happens after the user process calls recvfrom. The recvfrom we call in our code is a glibc library function; when executed, it traps into kernel mode and enters the system call sys_recvfrom implemented by Linux. Before looking at how Linux handles sys_recvfrom, let's take a brief look at the core data structure socket. This structure is large, so we only draw the parts relevant to today's topic, as follows:

Figure 11 socket kernel data organization

In the socket data structure, const struct proto_ops corresponds to the protocol's set of methods. Each protocol implements its own set; for the IPv4 Internet protocol family, each protocol has a corresponding set of handlers, as shown below. For UDP this is inet_dgram_ops, in which inet_recvmsg is registered as the recvmsg method.

//file: net/ipv4/af_inet.c
const struct proto_ops inet_stream_ops = {
    ......
    .recvmsg       = inet_recvmsg,
    .mmap          = sock_no_mmap,
    ......
}
const struct proto_ops inet_dgram_ops = {
    ......
    .sendmsg       = inet_sendmsg,
    .recvmsg       = inet_recvmsg,
    ......
}

Another field in the socket data structure, struct sock *sk, is a very large and very important substructure. Its sk_prot field in turn defines the second-level handler functions; for the UDP protocol it is set to the method set implemented by UDP, udp_prot.

//file: net/ipv4/udp.c
struct proto udp_prot = {
    .name          = "UDP",
    .owner         = THIS_MODULE,
    .close         = udp_lib_close,
    .connect       = ip4_datagram_connect,
    ......
    .sendmsg       = udp_sendmsg,
    .recvmsg       = udp_recvmsg,
    .sendpage      = udp_sendpage,
    ......
}

Having looked at the socket data structure, let's look at the implementation of sys_recvfrom.

Figure 12 The internal implementation process of the recvfrom function

inet_recvmsg calls sk->sk_prot->recvmsg.

//file: net/ipv4/af_inet.c
int inet_recvmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,size_t size, int flags){  
    ......
    err = sk->sk_prot->recvmsg(iocb, sk, msg, size, flags & MSG_DONTWAIT,
                   flags & ~MSG_DONTWAIT, &addr_len);
    if (err >= 0)
        msg->msg_namelen = addr_len;
    return err;
}

As we said above, for a UDP socket this sk_prot is struct proto udp_prot in net/ipv4/udp.c, which leads us to the udp_recvmsg method.

//file:net/core/datagram.c:EXPORT_SYMBOL(__skb_recv_datagram);
struct sk_buff *__skb_recv_datagram(struct sock *sk, unsigned int flags,int *peeked, int *off, int *err){
    ......
    do {
        struct sk_buff_head *queue = &sk->sk_receive_queue;
        skb_queue_walk(queue, skb) {
            ......
        }

        /* User doesn't want to wait */
        error = -EAGAIN;
        if (!timeo)
            goto no_packet;
    } while (!wait_for_more_packets(sk, err, &timeo, last));
}

Finally we have found what we were looking for: the so-called read operation is simply accessing sk->sk_receive_queue. If there is no data and the user is allowed to wait, wait_for_more_packets() is called to wait, which puts the user process to sleep.
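The -EAGAIN branch in the code above is what you hit when the socket is non-blocking: with an empty receive queue, recvfrom returns immediately with EAGAIN instead of sleeping. A small sketch demonstrating that behavior; port 8888 is an arbitrary choice for illustration:

#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void){
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8888);          /* arbitrary example port */
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));

    char buff[128];
    /* MSG_DONTWAIT makes this call non-blocking: with an empty
     * sk_receive_queue, __skb_recv_datagram bails out with -EAGAIN. */
    ssize_t n = recvfrom(fd, buff, sizeof(buff), MSG_DONTWAIT, NULL, NULL);
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        printf("no data queued yet: recvfrom returned EAGAIN\n");
    return 0;
}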

5. Summary

The network module is the most complex module in the Linux kernel. A seemingly simple packet receive involves the interplay of many kernel components: the network card driver, the protocol stack, the kernel's ksoftirqd threads, and so on. This article has tried to explain the kernel's packet receiving process clearly, in an easy-to-understand way, through illustrations. Now let's string the entire receive process together.

After the user executes the recvfrom call, the user process enters kernel mode through the system call. If there is no data in the receive queue, the process goes to sleep and is suspended by the operating system. That part is relatively simple; most of the rest of the story is played out by the other modules of the Linux kernel.

First, before starting to receive packets, Linux needs to do a lot of preparatory work:

  • 1. Create the ksoftirqd kernel threads and set their thread functions, so that they can handle softirqs later
  • 2. Protocol stack registration: Linux implements many protocols such as arp, icmp, ip, udp and tcp, and each registers its own handler function so that the right one can be found quickly when a packet arrives
  • 3. Network card driver initialization: each driver has an initialization function that the kernel calls; during this initialization the driver prepares its DMA and tells the kernel the address of its NAPI poll function
  • 4. Start the network card, allocate the RX and TX queues, and register the interrupt handler

The above is the important work before the kernel is ready to receive the packet. When the above is ready, you can open the hard interrupt and wait for the arrival of the data packet.

When data arrives, the first component to greet it is the network card (well, that's stating the obvious):

  • 1. The network card DMAs the data frame into the RingBuffer of the memory, and then initiates an interrupt notification to the CPU
  • 2. The CPU responds to the interrupt request and calls the interrupt processing function registered when the network card starts
  • 3. The interrupt processing function does almost nothing, and initiates a soft interrupt request
  • 4. The ksoftirqd kernel thread notices the pending softirq request and first disables hard interrupts
  • 5. The ksoftirqd thread starts to call the driver's poll function to receive packets
  • 6. The poll function sends the received packet to the ip_rcv function registered in the protocol stack
  • 7. The ip_rcv function sends the packet to the udp_rcv function (for tcp packets, it is sent to tcp_rcv)

Now we can come back to the question at the beginning: behind the single recvfrom line we saw in user space, the Linux kernel does all this work for us so that the data can be received smoothly. And this is just plain UDP; for TCP the kernel has even more to do. I can't help marveling at how much painstaking effort the kernel developers have put in.

Having understood the entire receive process, we can clearly identify the CPU overhead of receiving a packet in Linux. The first part is the cost of the user process trapping into kernel mode through the system call. The second part is the CPU cost of servicing the packet's hard interrupt. The third part is the time spent in softirq context by the ksoftirqd kernel thread. Later we will publish a dedicated article to actually measure these costs.

In addition, there are many details of the send and receive paths that we have not expanded on, such as non-NAPI mode, GRO, RPS, and so on. Going into too much detail would get in the way of grasping the overall flow, so I have tried to keep only the main skeleton. Less is more!

Origin: blog.csdn.net/m0_64560763/article/details/131570606