VPP tips

1. Where the performance comes from

Original links:

http://www.360doc.com/content/18/0428/20/53742993_749517107.shtml

https://steeven.iteye.com/blog/2347150  DPDK code level performance optimization summary

https://www.jianshu.com/p/346bf99b2fb1

https://www.jianshu.com/p/ed914b24f6da

https://blog.csdn.net/Dgh19940/article/details/79603843 

 

Architecture angle: huge pages, NUMA awareness, and D-cache optimization in DPDK; I-cache optimization in VPP;

Algorithm angle: Bihash, lockless lookup tables;

Code angle: vector processing, constructor macros, cache-line-aligned structures, thread-to-core pinning, prefetch instructions, and instruction-level optimization.

 

2. Route Lookup

https://blog.csdn.net/dog250/article/details/6596046

Overview of Internet routing table lookup algorithms: hash / LC-Trie / 256-way mtrie

3. The __attribute__((constructor)) modifier

A simple example introduces the role of gcc's __attribute__((constructor)) attribute. gcc allows functions to carry two attributes, __attribute__ ((constructor)) and __attribute__ ((destructor)); as the names suggest, they mark the decorated function as a constructor (run before main()) or a destructor (run after main() returns). A programmer can attach these attributes to functions as follows:

void funcBeforeMain() __attribute__ ((constructor));

void funcAfterMain() __attribute__ ((destructor));
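
A minimal runnable sketch built from the declarations above (the printf messages are illustrative):

#include <stdio.h>

void funcBeforeMain() __attribute__ ((constructor));
void funcAfterMain() __attribute__ ((destructor));

/* Runs automatically before main() is entered. */
void funcBeforeMain() { printf("before main\n"); }

/* Runs automatically after main() returns or exit() is called. */
void funcAfterMain() { printf("after main\n"); }

int main() { printf("in main\n"); return 0; }

The output order is "before main", "in main", "after main". VPP relies on this mechanism in registration macros such as VLIB_REGISTER_NODE, which is why graph nodes are already registered by the time main() starts.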

4. unlikely

The CPU executes instructions in a pipelined fashion. Pipelining can be understood simply as fetching the next instruction while the current one is still executing. For programs full of conditional constructs such as if/else, while, for, and ?:, the CPU must fetch the correct next instruction for the pipeline to flow smoothly. Once it fetches instructions from the wrong branch, the program's result is unaffected, but the whole pipeline is flushed and the correct branch has to be re-fetched, which hurts efficiency considerably.

CPUs generally do hardware branch prediction, but we can also use likely()/unlikely() and similar hints to specify the expected branch explicitly. It also helps, at design time, to give branch decisions some regularity, for example by feeding in ordered input data.

To minimize the impact of branch misprediction on performance, some common branch decisions can be converted into branchless form, as in the sketch below.
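
For example, a branchless min() is a classic illustration (a generic sketch, not taken from the linked posts):

/* Branchless min: (a < b) is 0 or 1, so -(a < b) is a mask of all
   zeros or all ones; the expression selects a or b with no jump. */
static inline int branchless_min(int a, int b)
{
    return b ^ ((a ^ b) & -(a < b));
}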

Links: https://www.jianshu.com/p/ed914b24f6da

 

long __builtin_expect (long exp, long c): !!(exp) is used to obtain a boolean value, and the builtin hints the compiler that exp is most likely equal to c, which enables better branch prediction. With likely(), the if-branch that follows has the greater chance of executing; with unlikely(), the else-branch does. The principle: "It optimizes things by ordering the generated assembly code correctly, to optimize the usage of the processor pipeline. To do so, they arrange the code so that the likeliest branch is executed without performing any jmp instruction (which has the bad effect of flushing the processor pipeline)."

The point is that a mispredicted branch brings a large performance drop. Why? Quoting from "Computer Systems: A Programmer's Perspective" (深入理解计算机系统), p. 141: on the other hand, a mispredicted jump requires the processor to discard all the work it has already done for instructions after the jump, then refill the pipeline with instructions starting from the correct location, wasting roughly 20-40 clock cycles. For an assembly-level illustration, see the first link referenced above.
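
The likely()/unlikely() macros used by the Linux kernel and DPDK are thin wrappers over __builtin_expect; their standard definition is:

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Typical usage: mark the rare error path as unlikely so the hot
   path falls straight through without a taken jump. */
if (unlikely(buf == NULL))
    return -1;

The !!(x) normalizes any truthy value to exactly 1 so it can be compared against the expected constant.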

 

Links: https://www.jianshu.com/p/346bf99b2fb1

 

5. Huge pages (hugepage)

https://www.cnblogs.com/small-office/p/9766536.html   core pinning and huge pages

https://www.cnblogs.com/031602523liu/p/10537694.html  UIO, huge pages, CPU affinity, NUMA overview

https://blog.csdn.net/qq_33611327/article/details/81738195  mmap analysis

Whether the system can support huge pages, and which huge page sizes are supported, is determined by the processor in use.

After huge pages are reserved, the next question is how to use them. DPDK uses hugetlbfs: first a hugetlbfs filesystem is mounted at a path such as /mnt/huge or /dev/hugepages/; then, when DPDK runs, it uses the mmap() system call to map the huge pages into user-space virtual addresses, after which they can be used normally.

Since DPDK runs in user space while the huge pages are managed by the kernel, mmap is what gives user space fast access to that kernel-managed memory.
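
A minimal sketch of the same idea outside DPDK, assuming a 2 MB huge page size and hugetlbfs mounted at /mnt/huge (the file name is hypothetical):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define HUGE_PAGE_SIZE (2 * 1024 * 1024)

int main(void)
{
    /* A file created on a hugetlbfs mount is backed by huge pages. */
    int fd = open("/mnt/huge/example", O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("open"); return 1; }

    void *addr = mmap(NULL, HUGE_PAGE_SIZE, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    /* One 2 MB page covers what would otherwise take 512 4 KB TLB entries. */
    ((char *) addr)[0] = 1;

    munmap(addr, HUGE_PAGE_SIZE);
    close(fd);
    return 0;
}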

6. CPU affinity

https://www.cnblogs.com/031602523liu/p/10537694.html

pthread_setaffinity_np: make a thread run on a specific CPU core

CPU affinity means that a process stays, as far as possible, on its designated CPU without being migrated to other processors. By associating a virtual processor with a physical processor, a program can be bound to a particular physical CPU.

On a multi-core machine each CPU has its own cache, which holds the data used by the process running there; if the OS schedules the process onto another CPU, that cached information is useless on the new core, so the cache hit rate drops. When a process or thread is bound to a CPU, it always runs on the designated core and is not moved elsewhere by the scheduler, which reduces cache misses and improves performance.
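
A minimal sketch of pinning the calling thread to one core with pthread_setaffinity_np (core 2 is an arbitrary choice):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(2, &set);  /* allow this thread to run only on core 2 */

    int rc = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    if (rc != 0) {
        fprintf(stderr, "pthread_setaffinity_np failed: %d\n", rc);
        return 1;
    }
    printf("pinned to core 2\n");
    return 0;
}

Compile with -pthread. DPDK does the equivalent internally when it launches one worker thread per lcore.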

7. D-cache & I-cache

7.1 I-Cache

VPP processes packets as a vector (a batch at a time), so the same node code runs over many packets before moving on; this solves the I-cache thrashing problem.

7.2  D-Cache

Original link: https://blog.csdn.net/Dgh19940/article/details/79603843  Cache optimization in DPDK

1) Write the receive descriptors to memory, filled with pointers to data buffers; after receiving a packet, the NIC fills the buffer at that address with the packet contents.

2) Read the receive descriptor from memory (the NIC updates it when a packet arrives) (memory read), to confirm whether a packet has been received.

3) Once the descriptor confirms a packet, read the pointer to the control structure from memory (memory read), then read the control structure itself (memory read), and fill it with the information taken from the receive descriptor.

4) Update the receive queue register to indicate that software has consumed the new packet.

5) Read the packet header from memory (memory read) to decide the forwarding port.

6) Fill the packet information from the control structure into a transmit descriptor of the transmit queue, and update the transmit queue register.

7) Read the transmit descriptor from memory (memory read) to check whether the hardware has sent the packet out.

8) If so, read the corresponding control structure from memory (memory read) and release the data buffer.

As can be seen, processing one packet requires 6 memory reads (the "(memory read)" markers above).

In other words, for DPDK to guarantee processing a packet within 80 clock cycles, the data it reads must hit in the Cache; if it misses, performance degrades severely.

7.3  Cache Consistency

Once a data structure is defined or a data buffer is allocated, it has a corresponding memory address, and reads and writes go through it. On a read, memory is first loaded into the Cache and then delivered to the processor's internal registers; on a write, data goes from a register into the Cache and is finally written back over the bus to memory.

This raises two questions:

1) Is the data structure / data buffer aligned to a Cache Line? If not, even a data area smaller than one Cache Line will occupy two Cache Lines; moreover, if part of a Cache Line belongs to another data structure that is being handled by another processor core, how is that data kept consistent?

Answer: align the structure with __rte_cache_aligned;

Doing this can often make copy operations more efficient, because the compiler can use whatever instructions copy the biggest chunks of memory when performing copies to or from the variables or fields that you have aligned this way.
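
A sketch of what that alignment looks like; in DPDK, __rte_cache_aligned expands to an aligned attribute like the one written out below (a 64-byte cache line is assumed):

#include <stdint.h>

#define CACHE_LINE_SIZE 64

/* Starts on a cache-line boundary and is padded to a multiple of
   64 bytes, so the structure never straddles two cache lines. */
struct per_port_stats {
    uint64_t rx_packets;
    uint64_t rx_bytes;
} __attribute__((aligned(CACHE_LINE_SIZE)));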

2) Assuming the start address of the data structure / buffer is Cache Line aligned, but multiple cores read and write that memory at the same time, how are the conflicts resolved?

The answer: DPDK's solution is simple: avoid having multiple cores access the same memory address or data structure in the first place. Each core avoids sharing data with other cores, which removes the Cache-consistency overhead that data sharing would otherwise incur.
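
The usual pattern is one private copy per core, indexed by core id (a sketch; RTE_MAX_LCORE is DPDK's lcore-count constant, the structure and field names are illustrative):

/* One aligned slot per core: no two cores write the same cache
   line, so there is no coherence traffic between them. */
struct lcore_counters {
    uint64_t pkts;
    uint64_t bytes;
} __attribute__((aligned(64)));

static struct lcore_counters counters[RTE_MAX_LCORE];

/* Each worker updates only its own slot. */
static inline void count_pkt(unsigned lcore_id, uint64_t len)
{
    counters[lcore_id].pkts++;
    counters[lcore_id].bytes += len;
}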

8. Bihash (VPP's bounded-index extensible hash)

9. Data prefetching

__builtin_prefetch()

https://www.cnblogs.com/dongzhiquan/p/3694858.html

VPP prefetches ahead while processing large batches of packets in its data-plane nodes, for example:

p4 = vlib_get_buffer (vm, from[4]);

vlib_prefetch_buffer_header (p4, LOAD);

CLIB_PREFETCH (p4->data, CLIB_CACHE_LINE_BYTES, STORE);
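
These macros are built on gcc's __builtin_prefetch. The same idea in plain C (the array and prefetch distance are arbitrary):

#include <stdint.h>

/* Prefetch 4 elements ahead so the data is already in cache when
   the loop body reaches it; args: address, rw (0 = read),
   locality (3 = keep in all cache levels). */
uint64_t sum_with_prefetch(const uint64_t *a, int n)
{
    uint64_t sum = 0;
    for (int i = 0; i < n; i++) {
        if (i + 4 < n)
            __builtin_prefetch(&a[i + 4], 0, 3);
        sum += a[i];
    }
    return sum;
}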

10. Instruction optimization


Origin www.cnblogs.com/sunnypoem/p/11368500.html