Linux: Kernel: [Translation] NAPI


This article is a translation.

References

https://wiki.linuxfoundation.org/networking/napi
LWN article on network driver porting http://lwn.net/Articles/30107/
Usenix paper http://www.cyberus.ca/~hadi/usenix-paper.tgz
Development files ftp://robur.slu.se/pub/Linux/net-development/NAPI/

NAPI

NAPI ("New API") is an extension to the device driver packet-processing framework, intended to improve performance for high-speed networking. It does so through two mechanisms:

  • Interrupt mitigation: high-speed networking can generate thousands or tens of thousands of interrupts per second, each of which only tells the system that there are many new packets to process. Under that kind of load, NAPI lets the driver turn off most of those interrupts (if 100 packets arrive, there is no need to interrupt 100 times) without hurting throughput, which relieves the system's interrupt-handling pressure.
  • Packet throttling: when the system is on the edge of being overwhelmed, some packets have to be dropped to keep it from actually falling over. Ideally those packets never enter the system at all, but are stopped already at the driver, so the system never sees them and wastes no time on packets that are about to be dropped anyway. NAPI-compliant drivers can therefore cause packets to be dropped right at the network adapter, before the kernel ever sees them.

Note: new drivers should use NAPI if the hardware can support it. However, the NAPI additions to the kernel do not break existing drivers that do not support NAPI; those drivers keep working in the fully interrupt-driven mode (100 incoming packets still interrupt the system 100 times).

NAPI driver design

The following steps are required to create a NAPI-compliant network driver.
The driver must allocate one struct napi_struct instance per interrupt vector. (Translator's note: this shows that NAPI's granularity is the interrupt vector; each vector needs its own context to implement NAPI.) No special function is needed to allocate a struct napi_struct; the structure is typically embedded in the driver's private structure. Each napi_struct must be initialized and registered with netif_napi_add() before the network device itself is registered, and, correspondingly, unregistered and torn down with netif_napi_del() only after the network device has been unregistered.
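
Below is a minimal sketch of this setup. It is only an illustration, not code from the original article: the my_priv structure, my_setup()/my_teardown() and my_poll() are hypothetical names, and the exact netif_napi_add() signature varies between kernel versions (newer kernels dropped the explicit weight argument).

#include <linux/netdevice.h>

/* Hypothetical driver-private structure with the napi_struct embedded in it. */
struct my_priv {
	struct napi_struct napi;	/* one instance per interrupt vector */
	struct net_device *netdev;
	/* ... rings, register mappings, locks ... */
};

static int my_poll(struct napi_struct *napi, int budget);	/* shown later */

static int my_setup(struct net_device *dev)
{
	struct my_priv *priv = netdev_priv(dev);

	/* Register the poll routine *before* registering the net_device.
	 * On older kernels the weight (budget) is the fourth argument. */
	netif_napi_add(dev, &priv->napi, my_poll, 64);

	return register_netdev(dev);
}

static void my_teardown(struct net_device *dev)
{
	struct my_priv *priv = netdev_priv(dev);

	unregister_netdev(dev);
	netif_napi_del(&priv->napi);	/* only after the device is unregistered */
}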

Next, the driver's interrupt handler needs to be changed. When the driver gets an interrupt because new packets have arrived, it should not process those packets itself; instead it should disable further receive interrupts and tell the networking subsystem to poll the driver shortly to pick up all available packets. (Translator's question: where does the networking subsystem pull the packets from, the driver or the hardware DMA ring?) Disabling interrupts is, of course, a matter between the driver and the network adapter. Polling by the networking subsystem is scheduled with:
void napi_schedule(struct napi_struct *napi);
A frequently seen equivalent form is:
if (napi_schedule_prep(napi))
	__napi_schedule(napi);

Example from virtio:

static void virtqueue_napi_schedule(struct napi_struct *napi,
				    struct virtqueue *vq)
{
	if (napi_schedule_prep(napi)) {
		virtqueue_disable_cb(vq);
		__napi_schedule(napi);
	}
}

Both forms have the same effect. (If napi_schedule_prep() returns 0, it means a poll has already been scheduled, and no further interrupt should have arrived anyway, since interrupts were already disabled.)
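
Continuing the hypothetical driver sketched above, the interrupt-handler change might look roughly as follows; my_rx_irq_pending() and my_disable_rx_irq() are made-up helpers standing in for whatever device-specific register accesses the hardware requires.

#include <linux/interrupt.h>

static irqreturn_t my_interrupt(int irq, void *dev_id)
{
	struct net_device *dev = dev_id;
	struct my_priv *priv = netdev_priv(dev);

	/* Device-specific check: did this interrupt signal received packets? */
	if (!my_rx_irq_pending(priv))		/* hypothetical helper */
		return IRQ_NONE;

	if (napi_schedule_prep(&priv->napi)) {
		/* Mask further RX interrupts on the adapter (device-specific),
		 * then leave the real work to the poll routine. */
		my_disable_rx_irq(priv);	/* hypothetical helper */
		__napi_schedule(&priv->napi);
	}

	return IRQ_HANDLED;
}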

The poll() function

The next step is to create a poll() method for the driver; its job is to obtain packets from the network interface and feed them into the kernel. The poll() prototype is:
int (*poll)(struct napi_struct *napi, int budget);

poll() should process all available packets, much as the interrupt handler did in the pre-NAPI days. There are some differences, however:

  • Packets should not be passed to netif_rx(); instead, use:
    int netif_receive_skb(struct sk_buff *skb);

The budget parameter places a limit on the amount of work the driver may do, where each received packet counts as one unit of work. The poll() function may also process TX completions if budget is left over.
poll() returns the amount of work it actually did.
Only when that value is less than budget should the driver re-enable interrupts and turn off polling. Polling is stopped with:
void napi_complete(struct napi_struct *napi);

The networking subsystem promises that poll() will not be invoked simultaneously on multiple processors for the same napi_struct (see the locking section below for why this holds).
The final step is to tell the networking subsystem about the poll() function and where to find it. This is done in the initialization code when registering the napi_struct:
netif_napi_add(dev, &napi, my_poll, 16);
The final parameter, weight, expresses the relative importance of this NAPI instance; it is the same number that is later passed to poll() as budget. Gigabit and faster adaptor drivers tend to set weight to 64; smaller values can be used for slower media.
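
Putting the pieces together, here is a hedged sketch of what the my_poll() routine from the earlier sketches might look like; my_rx_ring_has_packet(), my_fetch_skb() and my_enable_rx_irq() are hypothetical device-specific helpers, not real kernel APIs.

static int my_poll(struct napi_struct *napi, int budget)
{
	struct my_priv *priv = container_of(napi, struct my_priv, napi);
	int work_done = 0;

	/* Pull at most 'budget' packets off the RX ring and feed them
	 * to the stack with netif_receive_skb(). */
	while (work_done < budget && my_rx_ring_has_packet(priv)) {
		struct sk_buff *skb = my_fetch_skb(priv);	/* hypothetical */

		netif_receive_skb(skb);
		work_done++;
	}

	/* Only when less than the full budget was used may the driver leave
	 * polling mode and re-enable the RX interrupt. */
	if (work_done < budget) {
		napi_complete(napi);
		my_enable_rx_irq(priv);		/* hypothetical helper */
	}

	return work_done;
}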

Hardware architecture

NAPI requires the following (hardware) capabilities:
  • A DMA ring, or enough RAM to store packets (for software devices).
  • The ability to turn off interrupts, or at least the other events that send packets up to the kernel stack.

NAPI handles packet events through what is called the napi→poll() method. Typically, only packet-receive events go through napi→poll(); other events (such as link state changes) still use the regular interrupt path, so that they are handled promptly (and they occur far less frequently anyway).
Note: it is not a strict requirement that napi→poll() handles only receive events. Tests with the tulip driver showed that even when all events go through napi→poll(), latency remains acceptably low. Also MII/PHY handling gets a little trickier.
The example used in this document routes all events through napi→poll() (by patching the tulip driver); other drivers converted in this way include tg3, e1000 and sky2. There are caveats that might force you to move everything to napi→poll().
Because NICs differ in how their status/event registers are acknowledged, different NICs necessarily have to be handled differently.
Broadly speaking, there are two acknowledgement mechanisms:

  • Clear-on-read (COR): reading the status/event register clears it. The natsemi and sunbmac NICs fall into this category; when converting them, the only option is to route all events through napi→poll().
  • Clear-on-write (COW): a status bit is cleared by writing a 1 to it. This is how most NICs work, and it cooperates well with NAPI: only receive events go through napi→poll(), while the other events keep the old interrupt path. (In a COW variant, whatever you write in the status register clears everything.) See the sketch right below.
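
For a COW NIC, the "acknowledge only the receive events" step could be sketched as below; the register name MY_CSR_STATUS, the bit masks and the ioaddr field of the private structure are purely illustrative assumptions.

/* Clear-on-write acknowledgement: writing 1 to a status bit clears it.
 * Only the receive-related bits are written back, so all other events
 * keep raising ordinary interrupts. */
static void my_ack_rx_events(struct my_priv *priv)
{
	u32 rx_bits = MY_RX_DONE | MY_RX_NO_BUF;	/* hypothetical bit masks */

	iowrite32(rx_bits, priv->ioaddr + MY_CSR_STATUS);	/* hypothetical register */
}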

There is one scenario where Linux cannot help: how do we correctly detect that new work has arrived? NAPI's model is to disable interrupts while there is work to do and to re-enable them when there is none. In the tiny window while interrupts are being re-enabled, a new packet may slip in; it then has to wait until the next packet arrives and raises an interrupt before it can be processed. The important point is that this situation really does occur; such a packet is called a "rotting packet".
Appendix 2 covers this very important issue in detail.

Locking and synchronization

At any given time, only one CPU can be inside napi→poll(), because for each napi_struct only one CPU handles the original interrupt and calls napi_schedule(napi).
The core layer invokes devices to send packets in a round-robin format. This implies that receive is totally lockless, because of the guarantee that only one CPU is executing it.
The only place contention can occur is when several CPUs access the RX ring, and that happens only in close() and suspend() (both of which clear the RX ring). Driver authors do not need to worry about it, since synchronization is taken care of by the upper networking layer.
Local interrupts are enabled (if you don’t move all to napi→poll()). For example link/MII and txcomplete continue functioning just the same old way. This improves the latency of processing these events. It is also assumed that the receive interrupt is the largest cause of noise. Note this might not always be true. For these broken drivers, move all to napi→poll().

For the remainder of this document we assume that napi→poll() only handles receive events.

NAPI API

netif_napi_add(dev, napi, poll, weight): initializes and registers the napi structure for polling on the given device.
netif_napi_del(napi): unregisters the structure; it must be called only after the associated device has been unregistered. free_netdev(dev) calls netif_napi_del() for every napi structure managed by the device, so drivers normally do not need to call it directly.
napi_schedule(napi): called from the IRQ handler to schedule a poll for this napi instance.
napi_schedule_prep(napi): marks the napi instance as ready to be added to the poll list of a CPU that is up and running. It can be seen as the first half of napi_schedule(napi).
__napi_schedule(napi): adds the napi instance to the CPU's poll list, assuming napi_schedule_prep(napi) has already been called and returned 1.
napi_reschedule(napi): called to reschedule polling for napi, specifically for some deficient hardware.
napi_complete(napi): removes the napi instance from the CPU's poll list; it must currently be on that list, and it is called once napi→poll() has completed its work. If the structure is not on the poll list at this call, it is clearly a BUG().
__napi_complete(napi): same as napi_complete(), but called when local interrupts are already disabled.
napi_disable(napi): temporarily prevents the napi structure from being polled; may sleep if it is currently being polled.
napi_enable(napi): re-enables polling of a napi structure after it was disabled with napi_disable().
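
As an illustration of the last two calls, here is a hedged sketch of how a driver's open/stop callbacks might bracket polling; the ndo callbacks and the my_priv layout are assumptions carried over from the earlier sketches, not code from the original.

static int my_open(struct net_device *dev)
{
	struct my_priv *priv = netdev_priv(dev);

	/* ... allocate rings, request the IRQ, program the hardware ... */
	napi_enable(&priv->napi);	/* allow the poll routine to run */
	netif_start_queue(dev);
	return 0;
}

static int my_stop(struct net_device *dev)
{
	struct my_priv *priv = netdev_priv(dev);

	netif_stop_queue(dev);
	napi_disable(&priv->napi);	/* may sleep until a running poll finishes */
	/* ... free the IRQ, tear down rings ... */
	return 0;
}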

Advantages

Performance at high packet rates

NAPI provides an “inherent mitigation” which is bound by system capacity as can be seen from the following data collected by Robert Olsson’s tests on Gigabit ethernet (e1000):
[Table of e1000 test results not reproduced here; see the legend below.]
Legend:
Ipps ; input packets per second
Tput ; packets out of total 1M that made it out
Txint ; transmit completion interrupts seen
Done ; The number of times that the poll() managed to pull all packets out of the rx ring. Note from this that the lower the load the more we could clean up the rxring
Ndone ; is the converse of “Done”. Note again, that the higher the load the more times we couldn’t clean up the rxring.

Observation: while the NIC was receiving 890 Kpackets/sec, only 17 RX interrupts were generated. The mitigation is less pronounced when a single interrupt carries only one packet: at lower input rates the number of RX interrupts grows and the interrupt/packet ratio increases (as observable from the table). So there is the possibility that under low enough input you get one poll call for each input packet, caused by a single interrupt each time. And if the system can't handle an interrupt-per-packet ratio of 1, then it will just have to chug along.

Hardware flow control

Most chips with flow control simply send a pause packet when their RX buffers run out. Since packets are pulled off the DMA ring by a softirq in NAPI, if the system is slow in grabbing them and we have a high input rate (faster than the system's capacity to remove packets), then theoretically there will only be one rx interrupt for all packets during a given packetstorm. Under low load, we might have a single interrupt per packet. Flow control should be programmed to apply only when the system can't pull packets out fast enough, i.e. send a pause only when you run out of rx buffers.

There are some tradeoffs with hardware flow control. If the driver makes receive buffers available to the hardware one by one, then under load up to 50% of the packets can end up being flow control packets. Flow control works better if the hardware is notified about buffers in larger bursts.
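
One way to follow the "larger bursts" advice is to refill the RX ring in batches and notify the hardware only once per batch. The sketch below is an assumption-laden illustration: my_post_rx_buffer(), the rx_tail field and the MY_RX_TAIL register are all hypothetical.

/* Refill the RX ring in bursts: post up to 'batch' buffers, then tell the
 * hardware once by writing the new tail pointer, rather than once per buffer. */
static void my_refill_rx_ring(struct my_priv *priv, int batch)
{
	int posted = 0;

	while (posted < batch && my_post_rx_buffer(priv))	/* hypothetical */
		posted++;

	if (posted)
		iowrite32(priv->rx_tail, priv->ioaddr + MY_RX_TAIL);	/* hypothetical */
}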

Disadvantages

Latency: in some cases NAPI adds latency, because packet processing has to wait for the softirq to run.
IRQ masking: on some devices changing the IRQ mask is a slow operation, or requires additional locking; this can eat up much of NAPI's gain.

Known issues

IRQ races and the rotting packet

Drivers need to be careful to handle these two race-prone scenarios, which can leave the receiver unserviced because of the interaction between the hardware and the driver logic.

IRQ mask and level-triggered

If a status bit for receive or rxnobuff is set and the corresponding interrupt-enable bit is not on, then no interrupts will be generated. However, as soon as the “interrupt-enable” bit is unmasked, an immediate interrupt is generated (assuming the status bit was not turned off). Generally the concept of level triggered IRQs in association with a status and interrupt-enable CSR register set is used to avoid the race.

If we take the example of the tulip: “pending work” is indicated by the status bit (CSR5 in tulip). The corresponding interrupt bit (CSR7 in tulip) might be turned off (but the CSR5 will continue to be turned on with new packet arrivals even if we clear it the first time). Very important is the fact that if we turn on the interrupt bit when status is set, then an immediate irq is triggered.

If we cleared the rx ring and proclaimed there was “no more work to be done” and then went on to do a few other things; then when we enable interrupts, there is a possibility that a new packet might sneak in during this phase. It helps to look at the pseudo code for the tulip poll routine:

     do {
             ACK;
             while (ring_is_not_empty()) {
                     work-work-work
                     if quota is exceeded: exit, no touching irq status/mask
             }
             /* No packets, but new can arrive while we are doing this*/
             CSR5 := read
             if (CSR5 is not set) {
                     /* If something arrives in this narrow window here,
                      *  where the comments are ;-> irq will be generated */
                     unmask irqs;
                     exit poll;
             }
     } while (rx_status_is_set);

The only CSR5 bit of interest here is the rx status bit.

If you look at the last if statement: you just finished grabbing all the packets from the rx ring … you check if status bit says there are more packets just in … it says none; you then enable rx interrupts again; if a new packet just came in during this check, we are counting that CSR5 will be set in that small window of opportunity and that by re-enabling interrupts, we would actually trigger an interrupt to register the new packet for processing.

non-level sensitive IRQs

Some systems have hardware that does not do level triggered IRQs properly. Normally, IRQs may be lost while being masked and the only way to leave poll is to do a double check for new input after netif_rx_complete() is invoked and re-enable polling (after seeing this new input).

restart_poll:
        while (ring_is_not_empty()) {
                work-work-work
                if budget is exceeded: exit, not touching irq status/mask
        }
        .
        .
        .
        enable_rx_interrupts()
        napi_complete(napi);
        if (ring_has_new_packet() && napi_reschedule(napi)) {
                disable_rx_and_rxnobufs()
                goto restart_poll
        } while (rx_status_is_set);

Basically, napi_complete() removes us from the poll list; but because a new packet might sneak in during the race window and otherwise never be noticed, we attempt to re-add ourselves to the poll list.

Scheduling issues

As seen, NAPI moves processing to softirq level. Linux uses ksoftirqd as the general solution to schedule softirqs to run before the next interrupt, and by putting them under scheduler control it also prevents consecutive softirqs from monopolizing the CPU. A consequence is that the priority of ksoftirqd needs to be considered when running very CPU-intensive applications alongside networking, in order to get the proper softirq/user balance. Increasing ksoftirqd priority to 0 (eventually more) is reported to cure problems with low network performance at high CPU load.

The main processes in use on a GigE router:

USER  PID  %CPU %MEM  SIZE   RSS TTY STAT START     TIME COMMAND
 root    3  0.2  0.0     0     0  ?   RWN  Aug 15  602:00 (ksoftirqd_CPU0)
 root  232  0.0  7.9 41400 40884  ?   S    Aug 15   74:12 gated


Reposted from blog.csdn.net/qq_36428903/article/details/125942880