An example analysis of a Linux kernel memory leak

Problem statement

On a Ceph storage node, available memory steadily decreases the longer the node runs. Even after all applications exit and all caches are dropped, available memory does not recover. Rebooting the node returns memory usage to normal, and the same problem recurs after roughly a week of uptime. The issue could not be reproduced on the test server we built, so we could only diagnose it non-invasively by analyzing data from the production environment.

(Additional note: since the original on-site debugging data was not saved, the command output shown below is for demonstration only!)

Problem confirmation

$ grep Pss /proc/[1-9]*/smaps | awk '{total+=$2}; END {print total}'
1721106
$ free -h
              total        used        free      shared  buff/cache   available
Mem:           125G         95G        4.2G        4.0G         26G         25G
Swap:          9.4G        444M        8.9G

$ cat /proc/meminfo
MemTotal:       131748024 kB
MemFree:         4229544 kB
MemAvailable:   26634796 kB
Buffers:          141416 kB
Cached:         24657800 kB
SwapCached:       198316 kB
Active:          7972388 kB
Inactive:       19558436 kB
Active(anon):    4249920 kB
Inactive(anon):  2666784 kB
Active(file):    3722468 kB
Inactive(file): 16891652 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:       9830396 kB
SwapFree:        9375476 kB
Dirty:                80 kB
Writeback:             0 kB
AnonPages:       2601440 kB
Mapped:            71828 kB
Shmem:           4185096 kB
Slab:            2607824 kB
SReclaimable:    2129004 kB
SUnreclaim:       478820 kB
KernelStack:       29616 kB
PageTables:        45636 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    75704408 kB
Committed_AS:   14023220 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      529824 kB
VmallocChunk:   34292084736 kB
HardwareCorrupted:     0 kB
AnonHugePages:    260096 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:      125688 kB
DirectMap2M:     3973120 kB
DirectMap1G:    132120576 kB
  • Under normal circumstances, the following relation should roughly hold (K stands for kernel memory; Swap is ignored here):

T = U + K + S + C + F

where T is total memory, U is user-space memory (the Pss sum above), K is kernel memory, S is shared memory, C is buff/cache, and F is free memory.

Solving this for K gives the memory used by the kernel, and it turns out to be abnormally large: the faulty node has 128 GB in total, of which the kernel occupies more than 100 GB. We can therefore preliminarily infer a kernel memory leak.
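The arithmetic can be sketched with the demonstration figures above (a rough estimate only: here free(1)'s buff/cache is approximated as Buffers + Cached + SReclaimable, and Shmem stands in for the shared term):

```python
# Rough estimate of kernel-held memory K = T - U - S - C - F, using the
# demonstration figures above (all values in kB; the real fault data was larger).
total    = 131748024                    # MemTotal
free_kb  = 4229544                      # MemFree
user_pss = 1721106                      # sum of Pss over /proc/[1-9]*/smaps
shared   = 4185096                      # Shmem
cache    = 141416 + 24657800 + 2129004  # Buffers + Cached + SReclaimable

kernel = total - user_pss - shared - cache - free_kb
print(f"kernel ~ {kernel / 1024 / 1024:.1f} GiB")
```

With the demonstration numbers this yields roughly 90 GiB; on the real faulty node the figure exceeded 100 GB.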

Principle analysis

In our experience, code that leaks memory until it is exhausted is almost always a path that frequently allocates and frees memory.

Kernel code paths that frequently allocate and free memory include:

  • Core kernel management structures such as task_struct and inode. This code is heavily tested and unlikely to be at fault.
  • The kernel I/O subsystems and drivers, such as block-device BIOs, network-stack SKBs, and the storage and network device drivers themselves.

The most likely culprit is therefore a storage or network device driver. Asking the relevant R&D staff about recent kernel and driver changes, we learned that the i40e driver for the X710 NIC had recently been updated. We preliminarily inferred that the problem lay in this NIC driver.

On-site analysis

The Linux kernel manages memory in layers, each solving a different problem. From bottom to top, the key parts are:

  1. Physical memory management, which describes the layout and attributes of memory. Three main structures, Node, Zone, and Page, manage memory in units of pages;
  2. The Buddy allocator, which mainly addresses external fragmentation, allocating and freeing blocks of 2^N pages via functions such as get_free_pages;
  3. The Slab allocator, which mainly addresses internal fragmentation, serving batches of objects of a user-specified size (an object cache must be created first);
  4. Kernel cache objects, which use Slab to pre-allocate caches of fixed sizes, serving byte-granularity allocations via kmalloc, vmalloc, and similar functions.

Next, we determine at which layer the leak occurs. (Additional note: related mechanisms such as huge pages, the page cache, and the block cache all obtain their memory from these layers, so they are not central here and are ignored.)

$ cat /proc/buddyinfo
Node 0, zone      DMA      0      1      1      0      2      1      1      0      0      1      3 
Node 0, zone    DMA32   3222   6030   3094   3627    379      0      0      0      0      0      0 
Node 0, zone   Normal  13628      0      0      0      0      0      0      0      0      0      0 
Node 1, zone   Normal  73167 165265 104556  17921   2120    144      1      0      0      0      0

$ cat /proc/buddyinfo | awk '{sum=0;for(i=5;i<=NF;i++) sum+=$i*(2^(i-5))};{total+=sum/256};{print $1 " " $2 " " $3 " " $4 "\t : " sum/256 "M"} END {print "total\t\t\t : " total "M"}'
Node 0, zone DMA      : 14.5234M
Node 0, zone DMA32    : 245.07M
Node 0, zone Normal   : 53.2344M
Node 1, zone Normal   : 3921.41M
total                 : 4234.24M

From this we can see how much free memory the Buddy system still holds in each zone: /proc/buddyinfo lists, per zone, the count of free blocks of each order, and the awk script sums them.
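The same summation as the awk one-liner can be sketched in a few lines (column i after the zone name holds the count of free blocks of 2^i pages, with 4 KiB pages):

```python
# Sum the free memory reported by one /proc/buddyinfo line:
# column i (0-based, after "Node N, zone NAME") counts free blocks of 2^i pages.
def buddy_free_mib(line):
    counts = [int(c) for c in line.split()[4:]]
    pages = sum(n * 2**order for order, n in enumerate(counts))
    return pages * 4 / 1024  # 4 KiB pages -> MiB

demo = "Node 1, zone   Normal  73167 165265 104556  17921   2120    144      1      0      0      0      0"
print(f"{buddy_free_mib(demo):.2f} MiB")  # matches the 3921.41M shown above
```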

  • Check Slab memory usage:
$ slabtop -o
 Active / Total Objects (% used)    : 3522231 / 6345435 (55.5%)
 Active / Total Slabs (% used)      : 148128 / 148128 (100.0%)
 Active / Total Caches (% used)     : 74 / 107 (69.2%)
 Active / Total Size (% used)       : 1297934.98K / 2593929.78K (50.0%)
 Minimum / Average / Maximum Object : 0.01K / 0.41K / 15.88K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME                   
1449510 666502  45%    1.06K  48317       30   1546144K xfs_inode              
1229592 967866  78%    0.10K  31528       39    126112K buffer_head            
1018560 375285  36%    0.06K  15915       64     63660K kmalloc-64             
643216 322167  50%    0.57K  11486       56    367552K radix_tree_node        
350826 147688  42%    0.38K   8353       42    133648K blkdev_requests        
310421 131953  42%    0.15K   5857       53     46856K xfs_ili                
273420  95765  35%    0.19K   6510       42     52080K dentry                 
174592  36069  20%    0.25K   2728       64     43648K kmalloc-256            
155680 155680 100%    0.07K   2780       56     11120K Acpi-ParseExt          
 88704  34318  38%    0.50K   1386       64     44352K kmalloc-512            
 85176  52022  61%    0.19K   2028       42     16224K kmalloc-192            
 59580  59580 100%    0.11K   1655       36      6620K sysfs_dir_cache        
 43031  42594  98%    0.21K   1163       37      9304K vm_area_struct         
 35392  30850  87%    0.12K    553       64      4424K kmalloc-128            
 35070  20418  58%    0.09K    835       42      3340K kmalloc-96             
 34304  34304 100%    0.03K    268      128      1072K kmalloc-32

$ cat /proc/slabinfo
slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
kvm_async_pf           0      0    136   60    2 : tunables    0    0    0 : slabdata      0      0      0
kvm_vcpu               0      0  16256    2    8 : tunables    0    0    0 : slabdata      0      0      0
kvm_mmu_page_header      0      0    168   48    2 : tunables    0    0    0 : slabdata      0      0      0
xfs_dqtrx              0      0    528   62    8 : tunables    0    0    0 : slabdata      0      0      0
xfs_dquot              0      0    472   69    8 : tunables    0    0    0 : slabdata      0      0      0
xfs_icr                0      0    144   56    2 : tunables    0    0    0 : slabdata      0      0      0
xfs_ili           131960 310421    152   53    2 : tunables    0    0    0 : slabdata   5857   5857      0
xfs_inode         666461 1449510   1088   30    8 : tunables    0    0    0 : slabdata  48317  48317      0
xfs_efd_item        8120   8280    400   40    4 : tunables    0    0    0 : slabdata    207    207      0
xfs_da_state        2176   2176    480   68    8 : tunables    0    0    0 : slabdata     32     32      0
xfs_btree_cur       1248   1248    208   39    2 : tunables    0    0    0 : slabdata     32     32      0
xfs_log_ticket     12981  13200    184   44    2 : tunables    0    0    0 : slabdata    300    300      0
nfsd4_openowners       0      0    440   37    4 : tunables    0    0    0 : slabdata      0      0      0
rpc_inode_cache       51     51    640   51    8 : tunables    0    0    0 : slabdata      1      1      0
ext4_groupinfo_4k   4440   4440    136   60    2 : tunables    0    0    0 : slabdata     74     74      0
ext4_inode_cache    4074   5921   1048   31    8 : tunables    0    0    0 : slabdata    191    191      0
ext4_xattr           276    276     88   46    1 : tunables    0    0    0 : slabdata      6      6      0
ext4_free_data      3264   3264     64   64    1 : tunables    0    0    0 : slabdata     51     51      0
ext4_allocation_context   2048   2048    128   64    2 : tunables    0    0    0 : slabdata     32     32      0
ext4_io_end         1785   1785     80   51    1 : tunables    0    0    0 : slabdata     35     35      0
ext4_extent_status  20706  20706     40  102    1 : tunables    0    0    0 : slabdata    203    203      0
jbd2_journal_handle   2720   2720     48   85    1 : tunables    0    0    0 : slabdata     32     32      0
jbd2_journal_head   4680   4680    112   36    1 : tunables    0    0    0 : slabdata    130    130      0
jbd2_revoke_table_s    256    256     16  256    1 : tunables    0    0    0 : slabdata      1      1      0
jbd2_revoke_record_s   4096   4096     32  128    1 : tunables    0    0    0 : slabdata     32     32      0
scsi_cmd_cache      7056   7272    448   36    4 : tunables    0    0    0 : slabdata    202    202      0
UDPLITEv6              0      0   1152   28    8 : tunables    0    0    0 : slabdata      0      0      0
UDPv6                728    728   1152   28    8 : tunables    0    0    0 : slabdata     26     26      0
tw_sock_TCPv6          0      0    256   64    4 : tunables    0    0    0 : slabdata      0      0      0
TCPv6                405    405   2112   15    8 : tunables    0    0    0 : slabdata     27     27      0
uhci_urb_priv          0      0     56   73    1 : tunables    0    0    0 : slabdata      0      0      0
cfq_queue          27790  27930    232   70    4 : tunables    0    0    0 : slabdata    399    399      0
bsg_cmd                0      0    312   52    4 : tunables    0    0    0 : slabdata      0      0      0
mqueue_inode_cache     36     36    896   36    8 : tunables    0    0    0 : slabdata      1      1      0
hugetlbfs_inode_cache    106    106    608   53    8 : tunables    0    0    0 : slabdata      2      2      0
configfs_dir_cache      0      0     88   46    1 : tunables    0    0    0 : slabdata      0      0      0
dquot               2048   2048    256   64    4 : tunables    0    0    0 : slabdata     32     32      0
kioctx                 0      0    576   56    8 : tunables    0    0    0 : slabdata      0      0      0
userfaultfd_ctx_cache      0      0    128   64    2 : tunables    0    0    0 : slabdata      0      0      0
pid_namespace          0      0   2176   15    8 : tunables    0    0    0 : slabdata      0      0      0
user_namespace         0      0    280   58    4 : tunables    0    0    0 : slabdata      0      0      0
posix_timers_cache      0      0    248   66    4 : tunables    0    0    0 : slabdata      0      0      0
UDP-Lite               0      0   1024   32    8 : tunables    0    0    0 : slabdata      0      0      0
RAW                 1530   1530    960   34    8 : tunables    0    0    0 : slabdata     45     45      0
UDP                 1024   1024   1024   32    8 : tunables    0    0    0 : slabdata     32     32      0
tw_sock_TCP        10944  11328    256   64    4 : tunables    0    0    0 : slabdata    177    177      0
TCP                 2886   3842   1920   17    8 : tunables    0    0    0 : slabdata    226    226      0
blkdev_queue         118    225   2088   15    8 : tunables    0    0    0 : slabdata     15     15      0
blkdev_requests   147485 350826    384   42    4 : tunables    0    0    0 : slabdata   8353   8353      0
blkdev_ioc          2262   2262    104   39    1 : tunables    0    0    0 : slabdata     58     58      0
fsnotify_event_holder   5440   5440     24  170    1 : tunables    0    0    0 : slabdata     32     32      0
fsnotify_event     15912  16252    120   68    2 : tunables    0    0    0 : slabdata    239    239      0
sock_inode_cache   12478  13260    640   51    8 : tunables    0    0    0 : slabdata    260    260      0
net_namespace          0      0   4608    7    8 : tunables    0    0    0 : slabdata      0      0      0
shmem_inode_cache   3264   3264    680   48    8 : tunables    0    0    0 : slabdata     68     68      0
Acpi-ParseExt     155680 155680     72   56    1 : tunables    0    0    0 : slabdata   2780   2780      0
Acpi-Namespace     16422  16422     40  102    1 : tunables    0    0    0 : slabdata    161    161      0
taskstats           1568   1568    328   49    4 : tunables    0    0    0 : slabdata     32     32      0
proc_inode_cache   12352  12544    656   49    8 : tunables    0    0    0 : slabdata    256    256      0
sigqueue            1632   1632    160   51    2 : tunables    0    0    0 : slabdata     32     32      0
bdev_cache           858    858    832   39    8 : tunables    0    0    0 : slabdata     22     22      0
sysfs_dir_cache    59580  59580    112   36    1 : tunables    0    0    0 : slabdata   1655   1655      0
inode_cache        15002  17050    592   55    8 : tunables    0    0    0 : slabdata    310    310      0
dentry             96235 273420    192   42    2 : tunables    0    0    0 : slabdata   6510   6510      0
iint_cache             0      0     80   51    1 : tunables    0    0    0 : slabdata      0      0      0
selinux_inode_security  22681  23205     80   51    1 : tunables    0    0    0 : slabdata    455    455      0
buffer_head       968560 1229592    104   39    1 : tunables    0    0    0 : slabdata  31528  31528      0
vm_area_struct     43185  43216    216   37    2 : tunables    0    0    0 : slabdata   1168   1168      0
mm_struct            860    860   1600   20    8 : tunables    0    0    0 : slabdata     43     43      0
files_cache         1887   1887    640   51    8 : tunables    0    0    0 : slabdata     37     37      0
signal_cache        3595   3724   1152   28    8 : tunables    0    0    0 : slabdata    133    133      0
sighand_cache       2373   2445   2112   15    8 : tunables    0    0    0 : slabdata    163    163      0
task_xstate         4920   5226    832   39    8 : tunables    0    0    0 : slabdata    134    134      0
task_struct         2303   2420   2944   11    8 : tunables    0    0    0 : slabdata    220    220      0
anon_vma           27367  27392     64   64    1 : tunables    0    0    0 : slabdata    428    428      0
shared_policy_node   5525   5525     48   85    1 : tunables    0    0    0 : slabdata     65     65      0
numa_policy          248    248    264   62    4 : tunables    0    0    0 : slabdata      4      4      0
radix_tree_node   321897 643216    584   56    8 : tunables    0    0    0 : slabdata  11486  11486      0
idr_layer_cache      953    975   2112   15    8 : tunables    0    0    0 : slabdata     65     65      0
dma-kmalloc-8192       0      0   8192    4    8 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-4096       0      0   4096    8    8 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-2048       0      0   2048   16    8 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-1024       0      0   1024   32    8 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-512        0      0    512   64    8 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-256        0      0    256   64    4 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-128        0      0    128   64    2 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-64         0      0     64   64    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-32         0      0     32  128    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-16         0      0     16  256    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-8          0      0      8  512    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-192        0      0    192   42    2 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-96         0      0     96   42    1 : tunables    0    0    0 : slabdata      0      0      0
kmalloc-8192         314    340   8192    4    8 : tunables    0    0    0 : slabdata     85     85      0
kmalloc-4096         983   1024   4096    8    8 : tunables    0    0    0 : slabdata    128    128      0
kmalloc-2048        4865   4928   2048   16    8 : tunables    0    0    0 : slabdata    308    308      0
kmalloc-1024       10084  10464   1024   32    8 : tunables    0    0    0 : slabdata    327    327      0
kmalloc-512        34318  88704    512   64    8 : tunables    0    0    0 : slabdata   1386   1386      0
kmalloc-256        35482 174592    256   64    4 : tunables    0    0    0 : slabdata   2728   2728      0
kmalloc-192        52022  85176    192   42    2 : tunables    0    0    0 : slabdata   2028   2028      0
kmalloc-128        30732  35392    128   64    2 : tunables    0    0    0 : slabdata    553    553      0
kmalloc-96         20418  35070     96   42    1 : tunables    0    0    0 : slabdata    835    835      0
kmalloc-64        375761 1018560     64   64    1 : tunables    0    0    0 : slabdata  15915  15915      0
kmalloc-32         34304  34304     32  128    1 : tunables    0    0    0 : slabdata    268    268      0
kmalloc-16         18432  18432     16  256    1 : tunables    0    0    0 : slabdata     72     72      0
kmalloc-8          25088  25088      8  512    1 : tunables    0    0    0 : slabdata     49     49      0
kmem_cache_node      683    704     64   64    1 : tunables    0    0    0 : slabdata     11     11      0
kmem_cache           576    576    256   64    4 : tunables    0    0    0 : slabdata      9      9      0

With the above commands, we can determine which Slab caches occupy the most memory.
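As a sanity check, slabtop's CACHE SIZE column can be re-derived from the /proc/slabinfo fields shown above (num_slabs × pagesperslab × page size):

```python
# Cross-check slabtop's CACHE SIZE from a /proc/slabinfo row (4 KiB pages):
# cache size (KiB) = num_slabs * pagesperslab * 4
row = "xfs_inode         666461 1449510   1088   30    8 : tunables    0    0    0 : slabdata  48317  48317      0"
fields = row.split()
pages_per_slab = int(fields[5])       # <pagesperslab>
num_slabs = int(fields[-2])           # <num_slabs> in the slabdata section
size_kib = num_slabs * pages_per_slab * 4
print(f"{fields[0]}: {size_kib} KiB")  # matches slabtop's 1546144K for xfs_inode
```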

Analysis of the field data showed that Buddy had handed out more than 100 GB of memory, while Slab accounted for only a few GB. The leak is therefore not in Slab objects obtained via kmalloc, but in pages taken directly from the Buddy allocator.

Memory allocated from Buddy may be consumed by Slab, huge pages, the page cache, the block cache, drivers, application page-fault mappings, mmap, or any other kernel code that requests memory in page units. Of these, a driver remains the most likely culprit.

Source code analysis

To achieve high performance, high-speed NIC drivers usually allocate memory directly from Buddy in power-of-two page blocks (huge pages also come from Buddy; they merely use large page-table entries when mapped to user space and are not subdivided further), then map it to the NIC for I/O, organizing multiple queues with ring-buffer arrays or DMA descriptor chains. The X710 is a fairly high-end NIC, and its implementation cannot avoid these basic techniques. The main Buddy allocation entry point is __get_free_pages (different kernel versions add macros and variants, but they all ultimately allocate 2^N pages, with N passed as the order parameter).

Quick analysis:

  • From the output below we can quickly infer that the receive and transmit paths do allocate pages directly from Buddy, albeit through the "variant" alloc_pages_node (which must in turn call into the Buddy allocator, as its order parameter indicates; the details are omitted here).
$ grep -rHn pages
src/Makefile:107:	@echo "Copying manpages..."
src/kcompat.h:5180:#ifndef dev_alloc_pages
src/kcompat.h:5181:#define dev_alloc_pages(_order) alloc_pages_node(NUMA_NO_NODE, (GFP_ATOMIC | __GFP_COLD | __GFP_COMP | __GFP_MEMALLOC), (_order))
src/kcompat.h:5184:#define dev_alloc_page() dev_alloc_pages(0)
src/kcompat.h:5620:	__free_pages(page, compound_order(page));
src/i40e_txrx.c:1469:	page = dev_alloc_pages(i40e_rx_pg_order(rx_ring));
src/i40e_txrx.c:1485:		__free_pages(page, i40e_rx_pg_order(rx_ring));
src/i40e_txrx.c:1858: * Also address the case where we are pulling data in on pages only
src/i40e_txrx.c:1942: * For small pages, @truesize will be a constant value, half the size
src/i40e_txrx.c:1951: * For larger pages, @truesize will be the actual space used by the
src/i40e_txrx.c:1955: * space for a buffer.  Each region of larger pages will be used at
src/i40e_lan_hmc.c:295: * This will allocate memory for PDs and backing pages and populate
src/i40e_lan_hmc.c:394:				/* remove the backing pages from pd_idx1 to i */
  • From the output below we can infer that the driver also uses kmalloc and vmalloc. These serve kernel cache objects backed by Slab, and the earlier analysis showed Slab is healthy, so in theory these call sites are not the problem. (To track them down rigorously, one could compute the size of each such allocation, e.g. a 100-byte request lands in kmalloc-128, and then check whether that Slab cache looks normal above.)
$ grep -rHn malloc
src/kcompat.c:672:	buf = kmalloc(len, gfp);
src/kcompat.c:683:	void *ret = kmalloc(size, flags);
src/kcompat.c:746:		adapter->config_space = kmalloc(size, GFP_KERNEL);
src/kcompat.h:52:#include <linux/vmalloc.h>
src/kcompat.h:1990:#ifndef vmalloc_node
src/kcompat.h:1991:#define vmalloc_node(a,b) vmalloc(a)
src/kcompat.h:1992:#endif /* vmalloc_node*/
src/kcompat.h:3587:	void *addr = vmalloc_node(size, node);
src/kcompat.h:3596:	void *addr = vmalloc(size);
src/kcompat.h:4011:	p = kmalloc(len + 1, GFP_KERNEL);
src/kcompat.h:5342:static inline bool page_is_pfmemalloc(struct page __maybe_unused *page)
src/kcompat.h:5345:	return page->pfmemalloc;
src/i40e_txrx.c:1930:		!page_is_pfmemalloc(page);
src/i40e_main.c:11967:	buf = kmalloc(INFO_STRING_LEN, GFP_KERNEL);
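The remark above about a 100-byte request landing in kmalloc-128 can be sketched as follows (a simplification: the list of generic cache sizes is taken from the slabinfo output above, while the kernel's real size-to-cache mapping in kmalloc_index() handles more cases):

```python
# Which generic kmalloc-* cache would serve an allocation of a given size?
# Minimal sketch: pick the smallest generic cache >= the requested size.
import bisect

KMALLOC_SIZES = [8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096, 8192]

def kmalloc_cache(size):
    return f"kmalloc-{KMALLOC_SIZES[bisect.bisect_left(KMALLOC_SIZES, size)]}"

print(kmalloc_cache(100))  # kmalloc-128
```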
  • Since a leak must occur where memory is frequently allocated and freed, after quickly ruling out the suspects above, the RX/TX queue allocation paths remain the most likely source. The code below shows that the driver does not use the kernel network stack's SKBs for its receive buffers; instead it takes page memory directly from Buddy and maps it for DMA (in practice one of the most error-prone parts of writing a high-speed NIC driver).
// src/i40e_txrx.c

/**
 * i40e_alloc_mapped_page - recycle or make a new page
 * @rx_ring: ring to use
 * @bi: rx_buffer struct to modify
 *
 * Returns true if the page was successfully allocated or
 * reused.
 **/
static bool i40e_alloc_mapped_page(struct i40e_ring *rx_ring,
				   struct i40e_rx_buffer *bi)
{
	struct page *page = bi->page;
	dma_addr_t dma;

	/* since we are recycling buffers we should seldom need to alloc */
	if (likely(page)) {
		rx_ring->rx_stats.page_reuse_count++;
		return true;
	}

	/* alloc new page for storage */
	page = dev_alloc_pages(i40e_rx_pg_order(rx_ring));
	if (unlikely(!page)) {
		rx_ring->rx_stats.alloc_page_failed++;
		return false;
	}

	/* map page for use */
	dma = dma_map_page_attrs(rx_ring->dev, page, 0,
				 i40e_rx_pg_size(rx_ring),
				 DMA_FROM_DEVICE,
				 I40E_RX_DMA_ATTR);

	/* if mapping failed free memory back to system since
	 * there isn't much point in holding memory we can't use
	 */
	if (dma_mapping_error(rx_ring->dev, dma)) {
		__free_pages(page, i40e_rx_pg_order(rx_ring));
		rx_ring->rx_stats.alloc_page_failed++;
		return false;
	}

	bi->dma = dma;
	bi->page = page;
	bi->page_offset = i40e_rx_offset(rx_ring);

	/* initialize pagecnt_bias to 1 representing we fully own page */
	bi->pagecnt_bias = 1;

	return true;
}

Past experience says the next step would be to read this NIC's Data Sheet and Programming Guide, but Intel evidently has no plan to publish these materials in the short term. A careful analysis along this path would not be easy, likely taking more than a month, so we paused this direction for now. In any case, nothing found so far contradicts the conclusions of the earlier memory analysis.

Reference search

Next, check the kernel mailing list and the driver's source repositories:

https://sourceforge.net/projects/e1000/files/i40e%20stable/2.3.6/

Changelog for i40e-linux-2.3.6
===========================================================================

- Fix mac filter removal timing issue
- Sync i40e_ethtool.c with upstream
- Fixes for TX hangs
- Some fixes for reset of VFs
- Fix build error with packet split disabled
- Fix memory leak related to filter programming status
- Add and modify branding strings
- Fix kdump failure
- Implement an ethtool private flag to stop LLDP in FW
- Add delay after EMP reset for firmware to recover
- Fix incorrect default ITR values on driver load
- Fixes for programming cloud filters
- Some performance improvements
- Enable XPS with QoS on newer kernels
- Enable support for VF VLAN tag stripping control
- Build fixes to force perl to load specific ./SpecSetup.pm file
- Fix the updating of pci.ids
- Use 16 byte descriptors by default
- Fixes for DCB
- Don't close client in debug mode
- Add change MTU log in VF driver
- Fix for adding multiple ethtool filters on the same location
- Add new branding strings for OCP XXV710 devices
- Remove X722 Support for Destination IP Cloud Filter
- Allow turning off offloads when the VF has VLAN set
  • Checking the kernel source repository, we found the patch that claims to fix the memory leak, though in practice it does not fully fix it:
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/drivers/net/ethernet/intel/i40e/i40e_txrx.c?id=2b9478ffc550f17c6cd8c69057234e91150f5972

author	Alexander Duyck <[email protected]>	2017-10-04 08:44:43 -0700
committer	Jeff Kirsher <[email protected]>	2017-10-10 08:04:36 -0700
commit	2b9478ffc550f17c6cd8c69057234e91150f5972 (patch)
tree	3c3478f6c489db75c980a618a44dbd0dc80fc3ef /drivers/net/ethernet/intel/i40e/i40e_txrx.c
parent	e836e3211229d7307660239cc957f2ab60e6aa00 (diff)
download	net-2b9478ffc550f17c6cd8c69057234e91150f5972.tar.gz

i40e: Fix memory leak related filter programming status
It looks like we weren't correctly placing the pages from buffers that had
been used to return a filter programming status back on the ring. As a
result they were being overwritten and tracking of the pages was lost.

This change works to correct that by incorporating part of
i40e_put_rx_buffer into the programming status handler code. As a result we
should now be correctly placing the pages for those buffers on the
re-allocation list instead of letting them stay in place.

Fixes: 0e626ff7ccbf ("i40e: Fix support for flow director programming status")
Reported-by: Anders K. Pedersen <[email protected]>
Signed-off-by: Alexander Duyck <[email protected]>
Tested-by: Anders K Pedersen <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>


diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 1519dfb..2756131 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -1038,6 +1038,32 @@ reset_latency:
 }
 
 /**
+ * i40e_reuse_rx_page - page flip buffer and store it back on the ring
+ * @rx_ring: rx descriptor ring to store buffers on
+ * @old_buff: donor buffer to have page reused
+ *
+ * Synchronizes page for reuse by the adapter
+ **/
+static void i40e_reuse_rx_page(struct i40e_ring *rx_ring,
+			       struct i40e_rx_buffer *old_buff)
+{
+	struct i40e_rx_buffer *new_buff;
+	u16 nta = rx_ring->next_to_alloc;
+
+	new_buff = &rx_ring->rx_bi[nta];
+
+	/* update, and store next to alloc */
+	nta++;
+	rx_ring->next_to_alloc = (nta < rx_ring->count) ? nta : 0;
+
+	/* transfer page from old buffer to new buffer */
+	new_buff->dma		= old_buff->dma;
+	new_buff->page		= old_buff->page;
+	new_buff->page_offset	= old_buff->page_offset;
+	new_buff->pagecnt_bias	= old_buff->pagecnt_bias;
+}
+
+/**
  * i40e_rx_is_programming_status - check for programming status descriptor
  * @qw: qword representing status_error_len in CPU ordering
  *
@@ -1071,15 +1097,24 @@ static void i40e_clean_programming_status(struct i40e_ring *rx_ring,
 					  union i40e_rx_desc *rx_desc,
 					  u64 qw)
 {
-	u32 ntc = rx_ring->next_to_clean + 1;
+	struct i40e_rx_buffer *rx_buffer;
+	u32 ntc = rx_ring->next_to_clean;
 	u8 id;
 
 	/* fetch, update, and store next to clean */
+	rx_buffer = &rx_ring->rx_bi[ntc++];
 	ntc = (ntc < rx_ring->count) ? ntc : 0;
 	rx_ring->next_to_clean = ntc;
 
 	prefetch(I40E_RX_DESC(rx_ring, ntc));
 
+	/* place unused page back on the ring */
+	i40e_reuse_rx_page(rx_ring, rx_buffer);
+	rx_ring->rx_stats.page_reuse_count++;
+
+	/* clear contents of buffer_info */
+	rx_buffer->page = NULL;
+
 	id = (qw & I40E_RX_PROG_STATUS_DESC_QW1_PROGID_MASK) >>
 		  I40E_RX_PROG_STATUS_DESC_QW1_PROGID_SHIFT;
 
@@ -1639,32 +1674,6 @@ static bool i40e_cleanup_headers(struct i40e_ring *rx_ring, struct sk_buff *skb,
 }
 
 /**
- * i40e_reuse_rx_page - page flip buffer and store it back on the ring
- * @rx_ring: rx descriptor ring to store buffers on
- * @old_buff: donor buffer to have page reused
- *
- * Synchronizes page for reuse by the adapter
- **/
-static void i40e_reuse_rx_page(struct i40e_ring *rx_ring,
-			       struct i40e_rx_buffer *old_buff)
-{
-	struct i40e_rx_buffer *new_buff;
-	u16 nta = rx_ring->next_to_alloc;
-
-	new_buff = &rx_ring->rx_bi[nta];
-
-	/* update, and store next to alloc */
-	nta++;
-	rx_ring->next_to_alloc = (nta < rx_ring->count) ? nta : 0;
-
-	/* transfer page from old buffer to new buffer */
-	new_buff->dma		= old_buff->dma;
-	new_buff->page		= old_buff->page;
-	new_buff->page_offset	= old_buff->page_offset;
-	new_buff->pagecnt_bias	= old_buff->pagecnt_bias;
-}
-
-/**
  * i40e_page_is_reusable - check if any reuse is possible
  * @page: page struct to check
  *
  • Feedback about this driver on the kernel mailing list shows that many people hit the same problem; we are not alone:
https://www.mail-archive.com/[email protected]&q=subject:%22Re%5C%3A+Linux+4.12%5C%2B+memory+leak+on+router+with+i40e+NICs%22&o=newest

...

Upgraded and looks like problem is not solved with that patch
Currently running system with

https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/
kernel

Still about 0.5GB of memory is leaking somewhere

Also can confirm that the latest kernel where memory is not
leaking (with
use i40e driver intel 710 cards) is 4.11.12
With kernel 4.11.12 - after hour no change in memory usage.

also checked that with ixgbe instead of i40e with same
net.git kernel there
is no memleak - after hour same memory usage - so for 100%
this is i40e
driver problem.

....

So far, all the evidence points to the allocation and release of memory in this NIC's RX/TX queues.

Problem verification

We could not reproduce the problem in the development and test environment, and hosts that had not upgraded the NIC driver were unaffected. We therefore downgraded the driver on an affected host to version 2.2.4, leaving everything else unchanged. After two weeks of normal operation, the problem was considered crudely solved. Going forward, we may need to keep tracking driver updates on the vendor's site and upgrade again only after a release has stabilized and been verified.

Summary and recommendations

  1. The entire production environment should have more complete logs and monitoring to facilitate timely discovery of problems and fault diagnosis.
  2. Changes to hardware, drivers, kernels, and systems should be strictly controlled and verified, and preferably discussed and evaluated with kernel-related developers.
  3. When troubleshooting, combine principle analysis, source-code analysis, on-site analysis, and comparative testing; do not chase a single approach down a dead end.
