ODP/DPDK code-level performance optimization tips

The following notes are based on a 64-bit ARM CPU, for reference only.

ODP is an open-source framework under the Linaro Foundation, similar to DPDK. I recently used an ODP program to demo our company's SoC performance, and the results were not ideal. Digging in, I found a lot of slack in the driver, and in the ODP framework itself. Midway I ignored the architect's suggestion and decided to use DPDK+ODP to demo our drivers on both frameworks; after settling on that approach, I found during development that the DPDK driver's performance was not ideal either.

Premise:
64B minimum-size packets are used here, simply reflected back as they arrive from the traffic generator, in order to establish the best-case driver performance. Reading further into the packet contents would load extra cachelines and degrade performance. At 10G, small-packet line rate is 14.88 Mpps, which leaves about 134 clock cycles per packet on a 2 GHz CPU, so the average instruction count per packet is the key.

Perf performance profiling tool:
If you can't install it directly on Ubuntu, e.g. with a kernel you compiled yourself, go to tools/perf in the kernel source and run make; the resulting perf is just that, copy it to /usr/bin.
perf list shows the performance events your current CPU supports.
perf stat -p `pidof your_app` -e task-clock,... measures the chosen events for a running program.
Note that the -p option attaches to an existing process, so you avoid measuring the program's startup phase.
Always include task-clock in -e, so that the derived % and M/s statistics are shown.
If your cache miss rate exceeds 5%, the output is color-coded.
Perf is especially useful for tracking cache load-miss and write-miss rates, where the cost of uncached data accesses shows up clearly.

High-precision clock:
Do not use system calls to measure performance; their overhead is too high. rte_rdtsc() is a good implementation, a single assembly instruction, and its cost was negligible in my environment. Use global variables to separately accumulate the rx/rd/tx/free time spent in each stage of the batch-processing loop, divide each code segment's accumulated time by the total, and display it as a percentage; that way the saving of even one or two instructions shows up as a visible change in the %. Below are a few macros I use; wrap them around any code you suspect:
PERF_VAR xxx; // define a global performance counter
PERF_START(); // place at the start of the code segment to measure
PERF_COUNT(xxx); // accumulate into counter xxx
PERF_DUMP(xxx, sum); // print the percentage, from a CLI command or at exit
Note: the RDTSC instruction seems quite expensive on x86, so keep the measured code segments well separated.
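
For reference, a minimal sketch of what these macros might look like on top of DPDK's rte_rdtsc(); the names and details are my own illustration, not any library's API:

#include <stdio.h>
#include <stdint.h>
#include <rte_cycles.h> /* rte_rdtsc() */

#define PERF_VAR uint64_t /* PERF_VAR xxx; defines a global counter */
#define PERF_START() uint64_t _t0 = rte_rdtsc()
#define PERF_COUNT(var) ((var) += rte_rdtsc() - _t0)
#define PERF_DUMP(var, sum) \
    printf(#var ": %5.1f%%\n", 100.0 * (var) / (double)(sum))

PERF_VAR rx_cycles; /* global accumulator for the RX code segment */

/* In the polling loop:
 *     PERF_START();
 *     ... rx burst ...
 *     PERF_COUNT(rx_cycles);
 * and on exit or CLI request: PERF_DUMP(rx_cycles, total_cycles);
 */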

Eclipse:
Build on save, even ssh to the device to kill & run automatically. Performance tuning requires constant modify-and-compare cycles, and automating the process saves a lot of time.

Gdb:
User-mode frameworks like ODP/DPDK are convenient to debug, much better than bare-metal or kernel-style programs. When there's a problem, compile with "-g -O0" and debug immediately.
If you have time, single-step through the program with GDB; there may be a surprise waiting.

GIT:
Once, over a weekend at home, I changed code for two days and checked in only once. Because I had changed too many critical places, it took a long time to debug the resulting error. Later I changed a little and committed a little, making it easy to trace back which step went wrong. Make good use of Git's local branches and commits.

Assert:
When I interviewed for a Java job, I wrote a simple API on the whiteboard; the two interviewers nearly burst into tears: "we finally met someone who uses assert." A low-level API should carry as few if/else checks as possible; if every precondition has to be checked and turned into a return value, the whole code gets ugly. Low-level calls are for your own use, and their constraints can be verified during debugging: enable the checks via a macro in debug builds, and compile them out for production. Worse, those extra instructions seriously drag down performance, especially in functions called per packet.
When optimizing the ODP ring/pool operations, many low-level functions had no asserts, so program errors could only be chased by single-stepping, which is time-consuming and laborious.
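
A sketch of the idea with illustrative names (not ODP's or DPDK's actual macros): full checks in debug builds, zero instructions in release builds.

#include <assert.h>
#include <stdint.h>

#ifdef MY_DEBUG
#define MY_ASSERT(cond) assert(cond)
#else
#define MY_ASSERT(cond) ((void)0) /* compiles away on the fast path */
#endif

struct ring { void **objs; uint32_t head, count, size, mask; };

static inline void ring_put_one(struct ring *r, void *obj)
{
    MY_ASSERT(r != NULL);           /* callers guarantee these in release */
    MY_ASSERT(r->count < r->size);
    r->objs[r->head++ & r->mask] = obj;
    r->count++;
}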

Compile:
In addition to -O3, turn on -mtune for your CPU. When perf shows very high iTLB misses, -Os significantly reduces the code size; give it a try. Finally, build with -g and disassemble to check the result.

Batch:
Batch processing is part of the essence of the new networking frameworks, and the key to performance optimization. Batching should be used everywhere. For example, when filling a ring, allocating one packet and writing one entry at a time is worse than allocating a whole batch and then writing them in batches, as in the sketch below. Even though the pool uses hugepages and a per-core cache, every single operation still has to check and update pointers, at least a few instructions; batched, this averages out to only one or two per packet.
The CPU caches both data and instructions, D-cache and I-cache. D-cache can generally be prefetched by hand, but I-cache cannot, so make good use of the I-cache while it is hot: finish the whole batch in one stage of code before entering the next.
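
A sketch of the ring-refill example above, assuming DPDK's mbuf API; BURST and the names are illustrative:

#include <rte_mbuf.h>

#define BURST 32

/* Slow: one pool check-and-update per packet. */
static void refill_one_by_one(struct rte_mempool *pool, struct rte_mbuf **ring)
{
    for (int i = 0; i < BURST; i++)
        ring[i] = rte_pktmbuf_alloc(pool);
}

/* Fast: one bulk call amortizes the bookkeeping over the whole batch. */
static void refill_batched(struct rte_mempool *pool, struct rte_mbuf **ring)
{
    struct rte_mbuf *mbufs[BURST];

    if (rte_pktmbuf_alloc_bulk(pool, mbufs, BURST) == 0)
        for (int i = 0; i < BURST; i++)
            ring[i] = mbufs[i];
}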

Pool:
The pool itself can also be optimized. The default works like people queuing for a meal: each person hands over a container (an array) and the server fills it for them. The pig-feeding algorithm is more efficient: each pig is assigned its own patch and eats from it directly. That is, return the location of the memory segment itself, with no need to copy entries one by one into a receiving array. The pool cache is an array, refilled from the big pool when nearly empty; returning pointers into it directly is more efficient and needs less memory copying.
DPDK's default pool is a ring implementation, with cons and prod moving in opposite directions; under frequent operation this amounts to continuously traversing the entire pool, and cache utilization is poor. A stack-based pool implementation makes good use of the cache: what was just freed is immediately reallocated. The stack design lacks the ring's four sp/sc-mp/mc combinations and has only one locked path.
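
Later DPDK releases let you pick the pool handler per pool; a sketch (this API postdates the 2016 DPDK I used, so treat it as an assumption):

#include <rte_mbuf.h>
#include <rte_lcore.h>

/* Create an mbuf pool backed by the "stack" handler instead of the
 * default ring; the sizes are illustrative. */
static struct rte_mempool *create_stack_pool(void)
{
    return rte_pktmbuf_pool_create_by_ops("pkt_pool", 8191,
            256 /* per-core cache */, 0, RTE_MBUF_DEFAULT_BUF_SIZE,
            rte_socket_id(), "stack");
}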


For loop:
For a tiny loop body, the common for loop is inefficient: instructions are wasted on i++ and the loop test. For space-for-time alternatives, see DEQUEUE_PTRS() and rte_memcpy() in DPDK; the ODP pool even once used a 32-case switch to optimize memory copies.
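
In the spirit of DEQUEUE_PTRS(), a 4x-unrolled pointer copy with a switch for the tail; an illustrative sketch, not DPDK's actual code:

#include <stdint.h>

static void copy_ptrs(void **dst, void * const *src, unsigned int n)
{
    unsigned int i = 0;

    for (; i + 4 <= n; i += 4) {   /* unrolled body: one test per 4 copies */
        dst[i]     = src[i];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
    }
    switch (n - i) {               /* remainder, Duff-style fallthrough */
    case 3: dst[i + 2] = src[i + 2]; /* fall through */
    case 2: dst[i + 1] = src[i + 1]; /* fall through */
    case 1: dst[i]     = src[i];
    }
}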

Inline and function pointers:
-O3 will basically inline for you automatically, but it is best to objdump and check the assembly for surprises. Function pointers hurt performance, e.g. the callbacks in DPDK; if you don't need them, turn them off in .config.

Likely/Unlikely:
Although CPU branch prediction is already quite good, the branch you predict yourself is more accurate.
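
DPDK wraps GCC's __builtin_expect as likely()/unlikely(); a sketch of typical use (the function and names are illustrative):

#include <rte_branch_prediction.h>
#include <rte_ethdev.h>

static void poll_once(uint16_t port, struct rte_mbuf **pkts, uint16_t burst)
{
    uint16_t nb_rx = rte_eth_rx_burst(port, 0, pkts, burst);

    if (unlikely(nb_rx == 0))
        return;                    /* cold path: nothing arrived */
    /* hot path continues as straight-line code here */
}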

Prefetch:
I once wrote a sequential large-memory access test. With a stride smaller than the cacheline, performance was essentially flat; with a stride larger than the cacheline, it deteriorated badly. Three interesting factors here: accesses within the same cacheline cost almost nothing, because the data is already in L1; after a certain number of consecutive memory accesses, the D-cache predicts the pattern and loads subsequent memory ahead of time, and performance is very good; but beyond a certain stride the predictor gives up. So to improve performance, be sure to reduce cache misses! Keep watching the cache-miss percentages perf records. Prefetch used badly not only fails to improve performance but squeezes useful data out of the D-cache; in that case prefetch_non_temporal is the better choice.
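
A common pattern: prefetch the next packet's data while processing the current one. A sketch; process() is hypothetical:

#include <rte_prefetch.h>
#include <rte_mbuf.h>

void process(void *data); /* hypothetical per-packet work */

static void handle_burst(struct rte_mbuf **pkts, uint16_t nb_rx)
{
    for (uint16_t i = 0; i < nb_rx; i++) {
        if (i + 1 < nb_rx)         /* pull the next payload into L1 early */
            rte_prefetch0(rte_pktmbuf_mtod(pkts[i + 1], void *));
        process(rte_pktmbuf_mtod(pkts[i], void *));
    }
}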


Memory alignment:
If a struct contains u64, u8, u8, will array operations on this struct be fast? No; changing it to u64, u32, u32 is faster.
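
Both layouts occupy 16 bytes per array element on a typical LP64 ABI, but the second leaves no padding bytes and stores with naturally aligned full words, which is one plausible reason it measures faster; a sketch:

#include <stdint.h>

struct before { uint64_t a; uint8_t  b; uint8_t  c; }; /* 10B data + 6B padding */
struct after  { uint64_t a; uint32_t b; uint32_t c; }; /* 16B, no padding */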

Struct write:
Still the struct array above: write only the first two fields. Can this be optimized too? Yes: also write 0 to the unused field. What, you added an instruction, are you sick? Do you have medicine? Because memory is write-back, a store first reads the cacheline (64B), merges, and writes it back; if the line is fully covered by writes, the read is unnecessary. PS: next time my girlfriend is sick, I can confidently ask: who's sick this time? Right, and I have the medicine...
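
A sketch of the idea: four consecutive 16B entries cover one 64B line, so writing every field, including the unused one, fully overwrites the line; on hardware that detects full-line stores, the read-merge step can then be skipped. Whether the read is actually elided is microarchitecture-dependent, and the struct here is illustrative:

#include <stdint.h>

struct entry {          /* 16B: four entries span one 64B cacheline */
    uint64_t a;
    uint32_t b;
    uint32_t unused;
};

static void fill(struct entry *e, uint64_t a, uint32_t b)
{
    e->a = a;
    e->b = b;
    e->unused = 0;      /* one extra store may buy back a whole line read */
}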

Register variables:
Some frequently read and written data can be kept in register variables, which are faster than L1.
I recall ARM has 16B registers; I never tried whether using them for assignment and clearing saves instructions.

If branch:
As few as possible, especially on the main path.

Order of mbuf fields:
The mbuf metadata is 128B, two cachelines. Concentrate the fields commonly used for packet RX/TX at the front so that only one cacheline is touched. Perf showed a significant drop in cache read/write miss rates.

Pointer:
An 8B memory pointer on a 64-bit system is really a waste: with hugepages, many addresses are contiguous, especially inside a pool, so an mbuf can be identified by a short index instead. The data the NIC hands back per descriptor is generally used to find the corresponding mbuf, and a short index can save one memory lookup.
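
A sketch of index<->pointer conversion over a contiguous, hugepage-backed pool; pool_base and elt_size are assumed known, and this is not a real DPDK API:

#include <stdint.h>

struct rte_mbuf; /* opaque here */

static inline struct rte_mbuf *idx_to_mbuf(uintptr_t pool_base,
                                           uint32_t elt_size, uint32_t idx)
{
    return (struct rte_mbuf *)(pool_base + (uintptr_t)idx * elt_size);
}

static inline uint32_t mbuf_to_idx(uintptr_t pool_base, uint32_t elt_size,
                                   const struct rte_mbuf *m)
{
    return (uint32_t)(((uintptr_t)m - pool_base) / elt_size);
}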

Pool cache size:
This is a cache dedicated to each core: access is fast, and only when it runs out do you fall back to the big pool. So size it to accommodate the common accesses and try not to touch the big pool.
Don't be greedy with rx/tx queue sizes either; enough is enough, otherwise they exceed the cache capacity and reduce performance.
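
The per-core cache size is the third argument of DPDK's pool creation call; sizing it to cover the rx/tx burst working set keeps allocations off the shared pool. Numbers here are illustrative:

#include <rte_mbuf.h>
#include <rte_lcore.h>

static struct rte_mempool *create_pool(void)
{
    return rte_pktmbuf_pool_create("pkt_pool",
            8191,                  /* pool size */
            512,                   /* per-core cache: cover rx+tx bursts */
            0, RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
}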

Use less memory:
Especially per-packet memory, which caches poorly. Some data can be merged into the mbuf, or carried in the NIC's per-descriptor data. As mentioned above, short indexes can replace 64-bit pointers, saving quite a few bytes when used in mbufs.
Here is a good article about memory access time: CPU and memory

CPU utilization statistics:
A simple approach: when no packet is received, enter an idle loop whose instruction count is similar to ordinary packet processing; otherwise RX polling takes up a disproportionate share of the performance statistics. Also count the proportion of under-filled batches; together these give a rough view of the load.
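
A sketch of the counting (handle_burst() is hypothetical): empty polls approximate idle time, and under-filled bursts indicate light load:

#include <rte_ethdev.h>

#define BURST 32

void handle_burst(struct rte_mbuf **pkts, uint16_t n); /* hypothetical */

static void poll_loop(uint16_t port)
{
    struct rte_mbuf *pkts[BURST];
    uint64_t polls = 0, empty = 0, partial = 0;

    for (;;) {
        uint16_t n = rte_eth_rx_burst(port, 0, pkts, BURST);

        polls++;
        if (n == 0) { empty++; continue; }  /* empty/polls ~ idle ratio */
        if (n < BURST) partial++;           /* under-filled batch */
        handle_burst(pkts, n);
    }
}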

DPDK configuration:
I found these by single-stepping with gdb. If your application is not that complicated, you can use sp/sc, or call the APIs manually:
CONFIG_RTE_MBUF_DEFAULT_MEMPOOL_OPS="ring_sp_sc"
CONFIG_RTE_MBUF_REFCNT_ATOMIC=n
CONFIG_RTE_PKTMBUF_HEADROOM=0?
CONFIG_RTE_ETHDEV_RXTX_CALLBACKS=n


Reference links (just found these; I should have searched earlier):
http://dpdk.org/doc/guides/prog_guide/writing_efficient_code.html
http://events.linuxfoundation.org/sites/events/files/slides/DPDK-Performance.pdf
https://software.intel.com/en-us/articles/dpdk-performance-optimization-guidelines-white-paper

Results:
The optimized ODP/DPDK reaches line rate on the ARM SoC. Even if you read only one byte of each packet, a cacheline is loaded and performance still drops, but it is much better than before optimization. x86 is beefy on its own: with the company's NIC it runs at line rate without any optimization, no difference visible. Later I will find a 25G dual-port NIC for testing.

Power consumption:
A PMD-style framework is essentially a busy loop. I wonder whether, when no data arrives, it could enter WFI/WFE and wait for the hardware to wake it up.

Outlook :
Intel has done a good job with open source, and the code is refined. More and more companies are using DPDK to do gradually more complicated things, most still at the network-application level. Personally, I think this programming model and performance-squeezing mindset should be raised to the application level, such as databases, storage, and the Web, then matured into FPGA, and finally solidified into ASIC, with costs dropping by orders of magnitude.
Note: mtcp+dpdk applications should not be limited to emulating the socket interface; the batch-processing ideas in DPDK can be used to batch-transform upper-layer applications, web servers included. I just saw fd.io's TLDK, the open-source UDP/TCP framework that Intel participates in.

DPDK performance optimization is not so mysterious. I'm sharing my notes; feel free to exchange experiences: [email protected] WeChat: steeven_li



2016 Christmas Eve

12/28 update: the performance results on x86 are in too, about a 50% improvement! It also shows that the x86 CPU is indeed at least twice as fast as the ARM.
