Linux Kernel Packet Receive and Transmit Flow

Receive path:

The traditional (interrupt-driven) receive path and the NAPI receive path differ, as shown in the figure.

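In the traditional path every packet is received by interrupt. Once the driver has handled the packet it calls netif_rx to hand it to the kernel, and the kernel appends the skb to the input_pkt_queue of the current CPU's softnet_data structure. To handle traditional and NAPI devices uniformly, the kernel provides a virtual device, called the backlog device, for all drivers that do not use NAPI. There is one backlog device per CPU: softnet_data->backlog_dev in older kernels, and the napi_struct softnet_data->backlog in the 3.2.x kernel quoted later. input_pkt_queue is that device's backlog queue. It stores skbs in a doubly linked list, organized as shown below. The interrupt top half only enqueues the packet and puts the backlog instance on poll_list; the softirq bottom half then polls poll_list (net_rx_action->process_backlog) to process the packet further.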

 input_pkt_queue structure
     +------------------------------------------------------------+
     |                                                            |
     |  sk_buff_head         sk_buff              sk_buff         |
     |    _______       _______________       _______________     |
     +-->|  next |---->|           next|---->|           next|----+
     +---|  prev |<----|           prev|<----|           prev|<---+
     |   |_qlen=2|     |_______________|     |_______________|    |
     |                                                            |
     +------------------------------------------------------------+
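
The queue above can be sketched in user space. This is a minimal model, not kernel code: struct node, struct pkt, struct pkt_queue and all function names are hypothetical stand-ins for sk_buff, sk_buff_head and the kernel's skb queue helpers, and locking is left out. The interrupt top half would call queue_tail; the bottom half drains from the head up to a quota, in the spirit of process_backlog.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins; not the kernel's definitions. */
struct node {
    struct node *next, *prev;
};

struct pkt {                 /* plays the role of sk_buff */
    struct node link;        /* must stay the first member */
    int id;
};

struct pkt_queue {           /* plays the role of sk_buff_head */
    struct node head;        /* sentinel: an empty queue points at itself */
    unsigned int qlen;
};

void queue_init(struct pkt_queue *q)
{
    q->head.next = q->head.prev = &q->head;
    q->qlen = 0;
}

/* Top-half side: append one packet at the tail. */
void queue_tail(struct pkt_queue *q, struct pkt *p)
{
    struct node *n = &p->link;

    n->next = &q->head;
    n->prev = q->head.prev;
    q->head.prev->next = n;
    q->head.prev = n;
    q->qlen++;
}

/* Remove and return the oldest packet, or NULL if the queue is empty. */
struct pkt *dequeue_head(struct pkt_queue *q)
{
    struct node *n = q->head.next;

    if (n == &q->head)
        return NULL;
    assert(q->qlen > 0);
    q->head.next = n->next;
    n->next->prev = &q->head;
    n->next = n->prev = NULL;
    q->qlen--;
    return (struct pkt *)n;  /* valid because link is the first member */
}

/* Bottom-half side: handle at most quota packets, return how many. */
int drain_backlog(struct pkt_queue *q, int quota, int *last_id)
{
    int work = 0;
    struct pkt *p;

    while (work < quota && (p = dequeue_head(q)) != NULL) {
        *last_id = p->id;    /* "processing" just records the id */
        work++;
    }
    return work;
}
```

With three packets queued and a quota of 2, drain_backlog handles the two oldest packets and leaves one on the queue for the next poll.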

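The traditional path raises an interrupt for every packet. When packets arrive quickly, interrupts become so frequent that the CPU spends its time servicing them and other tasks are starved of CPU time. NAPI (New API) was introduced to address this: it combines interrupts with polling to raise throughput.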

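NAPI reception needs support from the NIC driver. Taking the Intel e1000 family (the e1000e driver) as an example, the receive interrupt handler e1000_intr_msix_rx adds the NIC's napi instance to the poll_list of softnet_data and raises the NET_RX_SOFTIRQ softirq flag; __do_softirq later sees the flag and calls net_rx_action. When does the softirq run? There are two occasions:

1. When the interrupt top half exits, do_IRQ-->irq_exit-->do_softirq-->call_softirq-->__do_softirq invokes the softirq handler net_rx_action, which walks the NICs on poll_list. Its source (kernel version 3.2.x) is listed below.

2. __do_softirq reruns the pending softirq handlers, net_rx_action included, at most MAX_SOFTIRQ_RESTART = 10 times. If softirqs are still pending after that, wakeup_softirqd wakes the ksoftirqd kernel thread, which receives packets via run_ksoftirqd-->__do_softirq-->net_rx_action.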
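
Before the kernel listing, the hand-off to ksoftirqd can be pictured with a small user-space model. It is only a sketch under simplifying assumptions: struct softirq_model, run_handler_model and do_softirq_model are hypothetical names, and "pending_rounds" is an invented counter standing in for "the softirq was raised again during the pass".

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_RESTART 10       /* mirrors MAX_SOFTIRQ_RESTART */

/* Hypothetical model state; not a kernel structure. */
struct softirq_model {
    int pending_rounds;      /* passes needed before nothing is pending */
    int handler_runs;        /* how often the handler ran in softirq context */
    bool ksoftirqd_woken;    /* true if the rest was deferred to ksoftirqd */
};

/* One pass of the handler; returns true if work is still pending afterwards. */
bool run_handler_model(struct softirq_model *m)
{
    m->handler_runs++;
    if (m->pending_rounds > 0)
        m->pending_rounds--;
    return m->pending_rounds > 0;
}

/* Rerun the handler while work is pending, at most MAX_RESTART passes. */
void do_softirq_model(struct softirq_model *m)
{
    int max_restart = MAX_RESTART;
    bool pending;

    assert(m->pending_rounds >= 0);
    do {
        pending = run_handler_model(m);
    } while (pending && --max_restart);

    if (pending)
        m->ksoftirqd_woken = true;   /* stands in for wakeup_softirqd() */
}
```

Starting with pending_rounds = 3 the model finishes after three passes without waking the thread; starting with 25 it stops after ten passes and defers the remaining work to the ksoftirqd stand-in.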

static void net_rx_action(struct softirq_action *h)
{
    struct softnet_data *sd = &__get_cpu_var(softnet_data);
    unsigned long time_limit = jiffies + 2;
    int budget = netdev_budget; // max packets handled in one net_rx_action run; default 300, sysctl net.core.netdev_budget = 300
    void *have;

    local_irq_disable(); // disable interrupts while accessing softnet_data

    while (!list_empty(&sd->poll_list)) {
        struct napi_struct *n;
        int work, weight;

        /* If softirq window is exhuasted then punt.
         * Allow this to run for 2 jiffies since which will allow
         * an average latency of 1.5/HZ.
         */
        if (unlikely(budget <= 0 || time_after_eq(jiffies, time_limit))) // stop once the budget (300 packets) is used up or 2 jiffies have elapsed
            goto softnet_break;

        local_irq_enable();

        /* Even though interrupts have been re-enabled, this
         * access is safe because interrupts can only add new
         * entries to the tail of this list, and only ->poll()
         * calls can remove this head entry from the list.
         */
        n = list_first_entry(&sd->poll_list, struct napi_struct, poll_list); // take the head of poll_list, i.e. some NIC's napi instance

        have = netpoll_poll_lock(n);

        weight = n->weight; // max packets this NIC may process per poll, typically 64

        /* This NAPI_STATE_SCHED test is for avoiding a race
         * with netpoll's poll_napi().  Only the entity which
         * obtains the lock and sees NAPI_STATE_SCHED set will
         * actually make the ->poll() call.  Therefore we avoid
         * accidentally calling ->poll() when NAPI is not scheduled.
         */
        work = 0;
        if (test_bit(NAPI_STATE_SCHED, &n->state)) {
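            work = n->poll(n, weight); // call the device-specific poll function; if it drains all packets (work < weight) it removes the device from poll_list via napi_complete(); for non-NAPI drivers this is process_backlog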
            trace_napi_poll(n);
        }

        WARN_ON_ONCE(work > weight);

        budget -= work;

        local_irq_disable();

        /* Drivers must not modify the NAPI state if they
         * consume the entire weight.  In such cases this code
         * still "owns" the NAPI instance and therefore can
         * move the instance around on the list at-will.
         */
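        // If the whole weight was consumed, the device probably has more packets, so this napi is moved
        // to the tail of poll_list; packets still held by GRO are flushed into the stack instead of waiting.
        if (unlikely(work == weight)) {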
            if (unlikely(napi_disable_pending(n))) {
                local_irq_enable();
                napi_complete(n);
                local_irq_disable();
            } else {
                if (n->gro_list) {
                    /* flush too old packets
                     * If HZ < 1000, flush all packets.
                     */
                    local_irq_enable();
                    napi_gro_flush(n, HZ >= 1000);
                    local_irq_disable();
                }
                list_move_tail(&n->poll_list, &sd->poll_list);
            }
        }

        netpoll_poll_unlock(have);
    }
out:
    net_rps_action_and_irq_enable(sd);

#ifdef CONFIG_NET_DMA
    /*
     * There may not be any more sk_buffs coming right now, so push
     * any pending DMA copies to hardware
     */
    dma_issue_pending_all();
#endif

    return;

softnet_break:
    sd->time_squeeze++;
    __raise_softirq_irqoff(NET_RX_SOFTIRQ); // this round did not finish: raise the softirq flag again so net_rx_action runs on the next softirq pass

    goto out;
}
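
The interplay of budget and weight in the listing above can be tried out with a small user-space model. This is a sketch under stated assumptions, not the kernel implementation: struct napi_model, poll_model and rx_action_model are hypothetical names, poll_list is modeled as a plain array, and the 2-jiffies time limit, GRO and locking are omitted.

```c
#include <assert.h>

/* Hypothetical stand-in for a device's napi instance. */
struct napi_model {
    int pending;   /* packets waiting in the device's receive ring */
    int weight;    /* max packets per poll call */
    int polls;     /* number of poll calls made on this device */
};

/* Device poll: process up to weight packets and return how many. */
int poll_model(struct napi_model *n, int weight)
{
    int work = n->pending < weight ? n->pending : weight;

    n->pending -= work;
    n->polls++;
    return work;
}

/*
 * Simplified net_rx_action: list[0..count-1] plays the role of poll_list.
 * Returns the total number of packets processed; *time_squeeze is
 * incremented when the budget runs out with devices still listed.
 */
int rx_action_model(struct napi_model **list, int count, int budget,
                    int *time_squeeze)
{
    int total = 0;

    while (count > 0) {
        struct napi_model *n;
        int work, i;

        if (budget <= 0) {           /* the softnet_break case */
            (*time_squeeze)++;
            break;
        }

        n = list[0];                 /* head of the poll list */
        work = poll_model(n, n->weight);
        assert(work <= n->weight);   /* mirrors WARN_ON_ONCE(work > weight) */
        budget -= work;
        total += work;

        for (i = 1; i < count; i++)  /* take the head off the list */
            list[i - 1] = list[i];

        if (work == n->weight)
            list[count - 1] = n;     /* like list_move_tail: poll it again later */
        else
            count--;                 /* device drained: it leaves the list */
    }
    return total;
}
```

With one device holding 500 packets, another holding 10, weight 64 and budget 300, the model processes 330 packets before it stops: the budget is only checked before each poll, so the last poll can overshoot it, and the busy device is left with 180 packets for the next softirq pass.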

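After the softirq, packets enter the kernel protocol stack for processing. Netfilter, xfrm (IPsec) and other subsystems are also involved along the way; they will be analyzed in detail later.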

Hardware interrupt:

do_IRQ-->handle_irq-->e1000_intr_msix_rx-->__napi_schedule(&adapter->napi)-->

____napi_schedule-->__raise_softirq_irqoff(NET_RX_SOFTIRQ)

Softirq and IP receive:

do_IRQ-->irq_exit-->do_softirq-->call_softirq-->__do_softirq-->

net_rx_action-->e1000e_poll-->e1000_receive_skb-->napi_gro_receive-->

netif_receive_skb-->__netif_receive_skb-->__netif_receive_skb_core-->

deliver_skb-->ip_rcv-->NF_HOOK(NF_INET_PRE_ROUTING)-->

ip_rcv_finish-->ip_route_input_noref-->ip_route_input_slow-->dst_input

Local delivery (dst_input calls ip_local_deliver):

ip_local_deliver-->NF_HOOK(NF_INET_LOCAL_IN)-->ip_local_deliver_finish-->ipprot->handler()

Forwarding and transmit (dst_input calls ip_forward):

ip_forward-->NF_HOOK(NF_INET_FORWARD)-->ip_forward_finish-->

dst_output-->dst->output-->ip_output-->NF_HOOK_COND(NF_INET_POST_ROUTING)-->

ip_finish_output-->ip_finish_output2-->__ipv4_neigh_lookup_noref-->

dst_neigh_output-->neigh_hh_output-->dev_queue_xmit-->dev_hard_start_xmit-->ndo_start_xmit
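
Two steps in this chain are indirect calls: dst_input goes through a function pointer chosen by the route lookup (local delivery or forwarding), and ipprot->handler() dispatches on the IP protocol number. The sketch below models both in user space. All names in it (struct pkt_model, handlers, register_handler and so on) are hypothetical, the table is only loosely modeled on the kernel's per-protocol handler table, and the netfilter hooks are left out.

```c
#include <assert.h>
#include <stddef.h>

enum { PROTO_TCP = 6, PROTO_UDP = 17, PROTO_MAX = 256 }; /* IANA protocol numbers */
enum { RES_TCP = 1, RES_UDP, RES_FORWARD, RES_NO_HANDLER };

/* Hypothetical packet: only the fields this sketch needs. */
struct pkt_model {
    unsigned char protocol;  /* IP protocol field */
    int is_local;            /* result of the route lookup */
};

typedef int (*proto_handler)(struct pkt_model *);

/* One handler slot per protocol number. */
proto_handler handlers[PROTO_MAX];

void register_handler(unsigned char proto, proto_handler h)
{
    assert(h != NULL);
    handlers[proto] = h;
}

/* Stand-ins for the transport-layer receive functions. */
int tcp_rcv_model(struct pkt_model *p) { (void)p; return RES_TCP; }
int udp_rcv_model(struct pkt_model *p) { (void)p; return RES_UDP; }

/* Local delivery: look up the handler by protocol number and call it. */
int local_deliver_model(struct pkt_model *p)
{
    proto_handler h = handlers[p->protocol];

    return h ? h(p) : RES_NO_HANDLER;
}

/* Forwarding: the packet would go on to the output path. */
int forward_model(struct pkt_model *p) { (void)p; return RES_FORWARD; }

/* The route decision picks the input function, then calls it indirectly. */
int rcv_finish_model(struct pkt_model *p)
{
    int (*input)(struct pkt_model *) =
        p->is_local ? local_deliver_model : forward_model;

    return input(p);
}
```

After registering tcp_rcv_model for PROTO_TCP, a local TCP packet reaches the TCP stand-in, a local UDP packet finds no handler, and any non-local packet takes the forwarding branch regardless of its protocol.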

I found an excellent diagram of the protocol stack's receive and transmit flow online; thanks to the original author.

References: http://zgykill.lofter.com/post/19b38e_a26bb1

http://blog.csdn.net/hui6075/article/details/51196056

http://www.cnblogs.com/super-king/p/3296201.html

https://yq.aliyun.com/articles/5002
