Problem statement
On a Ceph storage node, available memory keeps shrinking the longer the node runs. Even after all applications exit and all caches are released, available memory does not recover. After a reboot, memory usage returns to normal, and the same thing happens again after a period of running (about a week). In addition, the problem could not be reproduced on our self-built test server, so we could only diagnose it non-invasively by analyzing data from the production environment.
(Additional note: since the on-site debugging data was not saved, the output of the following commands is for demonstration only!)
Problem confirmation
- Count the memory used by all applications, `U` (in kB; reference: How to count the total memory occupied by all processes?):
$ grep Pss /proc/[1-9]*/smaps | awk '{total+=$2}; END {print total}'
1721106
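To see which processes contribute most to `U`, a minimal sketch along the same lines (assuming a Linux `/proc`; `smaps` files of other users' processes may be unreadable and are skipped):

```shell
# Sum Pss (proportional set size) per process and print the top consumers.
# Pss divides shared pages among their users, so the per-process sums
# add up to the true userspace total without double counting.
for d in /proc/[1-9]*; do
    pid=${d#/proc/}
    # smaps may be unreadable or the process may have exited; skip quietly
    pss=$(awk '/^Pss:/ {t += $2} END {print t + 0}' "$d/smaps" 2>/dev/null)
    [ -n "$pss" ] && [ "$pss" -gt 0 ] && \
        echo "$pss kB  $pid  $(tr '\0' ' ' < "$d/cmdline" | cut -c1-40)"
done | sort -rn | head -10
```

The per-process sums should roughly reproduce the grand total from the `grep Pss` one-liner above.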
- View the total system memory `T`, free memory `F`, shared memory `S`, and cache `C` (reference: Linux memory viewing: analysis of the meminfo/maps/smaps/status files):
$ free -h
total used free shared buff/cache available
Mem: 125G 95G 4.2G 4.0G 26G 25G
Swap: 9.4G 444M 8.9G
$ cat /proc/meminfo
MemTotal: 131748024 kB
MemFree: 4229544 kB
MemAvailable: 26634796 kB
Buffers: 141416 kB
Cached: 24657800 kB
SwapCached: 198316 kB
Active: 7972388 kB
Inactive: 19558436 kB
Active(anon): 4249920 kB
Inactive(anon): 2666784 kB
Active(file): 3722468 kB
Inactive(file): 16891652 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 9830396 kB
SwapFree: 9375476 kB
Dirty: 80 kB
Writeback: 0 kB
AnonPages: 2601440 kB
Mapped: 71828 kB
Shmem: 4185096 kB
Slab: 2607824 kB
SReclaimable: 2129004 kB
SUnreclaim: 478820 kB
KernelStack: 29616 kB
PageTables: 45636 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 75704408 kB
Committed_AS: 14023220 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 529824 kB
VmallocChunk: 34292084736 kB
HardwareCorrupted: 0 kB
AnonHugePages: 260096 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 125688 kB
DirectMap2M: 3973120 kB
DirectMap1G: 132120576 kB
- Under normal circumstances, the following should hold (`K` is the memory used by the kernel; Swap is ignored here):
T = U + K + S + C + F
From this the kernel's memory use can be calculated as K = T - U - S - C - F. It turned out the kernel occupied an abnormally large amount: the faulty node had 128 GB of memory, of which the kernel held more than 100 GB, so we could preliminarily infer a memory leak in the kernel.
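As a sketch, that subtraction can be scripted against `/proc/meminfo` (an approximation: `C` is taken as Buffers + Cached, `S` as Shmem, and `U` as the summed Pss; all values in kB):

```shell
#!/bin/sh
# Rough estimate of kernel-held memory: K = T - U - S - C - F.
mi() { awk -v k="$1:" '$1 == k {print $2}' /proc/meminfo; }

T=$(mi MemTotal)
F=$(mi MemFree)
S=$(mi Shmem)
C=$(( $(mi Buffers) + $(mi Cached) ))
# Summed Pss across all processes approximates userspace memory U
U=$(grep Pss /proc/[1-9]*/smaps 2>/dev/null | awk '{t += $2} END {print t + 0}')

K=$(( T - U - S - C - F ))
echo "kernel (approx): ${K} kB"
```

On a healthy node this lands within a few GB of the `Slab` plus other kernel allocations; on the faulty node it exceeded 100 GB.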
Principle analysis
Experience says that code which leaks memory until it is exhausted is almost always code that frequently allocates and frees memory. In the kernel, the likely candidates are:
- Kernel management data structures such as `task_struct`, `inode`, etc. This code is heavily exercised and tested, and unlikely to go wrong.
- Kernel IO subsystems and drivers, such as block-device `BIO`, the network stack's `SKB`, and storage and network device drivers.
The most likely suspect here is a storage or network device driver. We asked the relevant R&D staff about recent kernel and driver changes, and learned that the `i40e` driver for the X710 network card had recently been updated. Preliminary inference: the problem is in the NIC driver.
On-site analysis
The Linux kernel manages memory hierarchically, with each layer solving a different problem. From bottom to top, the key layers are:
- Physical memory management, which describes the layout and attributes of memory. Three main structures, `Node`, `Zone`, and `Page`, manage memory in units of pages.
- `Buddy` memory management, which mainly solves external fragmentation; functions such as `__get_free_pages` allocate and free memory in blocks of 2^N pages.
- `Slab` memory management, which mainly solves internal fragmentation; it hands out objects of a caller-specified size in batches (an object cache must be created first).
- Kernel cache objects, which use Slab to pre-allocate caches of fixed sizes; `kmalloc`, `vmalloc`, and related functions allocate and free memory in units of bytes.
Next, we determine at which level the memory is leaking. (Additional note: related techniques such as huge pages, the page cache, and the block cache all draw memory from these levels; they are not the key here and are ignored.)
- View `Buddy` memory usage (reference: understanding Linux /proc/buddyinfo):
$ cat /proc/buddyinfo
Node 0, zone DMA 0 1 1 0 2 1 1 0 0 1 3
Node 0, zone DMA32 3222 6030 3094 3627 379 0 0 0 0 0 0
Node 0, zone Normal 13628 0 0 0 0 0 0 0 0 0 0
Node 1, zone Normal 73167 165265 104556 17921 2120 144 1 0 0 0 0
$ cat /proc/buddyinfo | awk '{sum=0;for(i=5;i<=NF;i++) sum+=$i*(2^(i-5))};{total+=sum/256};{print $1 " " $2 " " $3 " " $4 "\t : " sum/256 "M"} END {print "total\t\t\t : " total "M"}'
Node 0, zone DMA : 14.5234M
Node 0, zone DMA32 : 245.07M
Node 0, zone Normal : 53.2344M
Node 1, zone Normal : 3921.41M
total : 4234.24M
From this we can see how much free memory remains in the `Buddy` system, and hence how much it has handed out in total.
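A sketch of that calculation (note that `/proc/buddyinfo` lists *free* blocks, so the amount handed out is roughly MemTotal minus this free total; assumes 4 kB pages, as on x86_64):

```shell
#!/bin/sh
# Columns 5.. of /proc/buddyinfo are counts of free blocks of order 0..10;
# an order-n block is 2^n pages of 4 kB each.
free_kb=$(awk '{for (i = 5; i <= NF; i++) s += $i * 2^(i-5) * 4}
               END {printf "%d\n", s}' /proc/buddyinfo)
total_kb=$(awk '$1 == "MemTotal:" {print $2}' /proc/meminfo)
echo "buddy free      : $((free_kb / 1024)) MB"
echo "buddy handed out: $(( (total_kb - free_kb) / 1024 )) MB (all users combined)"
```

The "handed out" figure covers every consumer (Slab, page cache, drivers, user mappings), so it is an upper bound to compare against the other accounting.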
- Check `Slab` memory usage:
$ slabtop -o
Active / Total Objects (% used) : 3522231 / 6345435 (55.5%)
Active / Total Slabs (% used) : 148128 / 148128 (100.0%)
Active / Total Caches (% used) : 74 / 107 (69.2%)
Active / Total Size (% used) : 1297934.98K / 2593929.78K (50.0%)
Minimum / Average / Maximum Object : 0.01K / 0.41K / 15.88K
OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
1449510 666502 45% 1.06K 48317 30 1546144K xfs_inode
1229592 967866 78% 0.10K 31528 39 126112K buffer_head
1018560 375285 36% 0.06K 15915 64 63660K kmalloc-64
643216 322167 50% 0.57K 11486 56 367552K radix_tree_node
350826 147688 42% 0.38K 8353 42 133648K blkdev_requests
310421 131953 42% 0.15K 5857 53 46856K xfs_ili
273420 95765 35% 0.19K 6510 42 52080K dentry
174592 36069 20% 0.25K 2728 64 43648K kmalloc-256
155680 155680 100% 0.07K 2780 56 11120K Acpi-ParseExt
88704 34318 38% 0.50K 1386 64 44352K kmalloc-512
85176 52022 61% 0.19K 2028 42 16224K kmalloc-192
59580 59580 100% 0.11K 1655 36 6620K sysfs_dir_cache
43031 42594 98% 0.21K 1163 37 9304K vm_area_struct
35392 30850 87% 0.12K 553 64 4424K kmalloc-128
35070 20418 58% 0.09K 835 42 3340K kmalloc-96
34304 34304 100% 0.03K 268 128 1072K kmalloc-32
$ cat /proc/slabinfo
slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
kvm_async_pf 0 0 136 60 2 : tunables 0 0 0 : slabdata 0 0 0
kvm_vcpu 0 0 16256 2 8 : tunables 0 0 0 : slabdata 0 0 0
kvm_mmu_page_header 0 0 168 48 2 : tunables 0 0 0 : slabdata 0 0 0
xfs_dqtrx 0 0 528 62 8 : tunables 0 0 0 : slabdata 0 0 0
xfs_dquot 0 0 472 69 8 : tunables 0 0 0 : slabdata 0 0 0
xfs_icr 0 0 144 56 2 : tunables 0 0 0 : slabdata 0 0 0
xfs_ili 131960 310421 152 53 2 : tunables 0 0 0 : slabdata 5857 5857 0
xfs_inode 666461 1449510 1088 30 8 : tunables 0 0 0 : slabdata 48317 48317 0
xfs_efd_item 8120 8280 400 40 4 : tunables 0 0 0 : slabdata 207 207 0
xfs_da_state 2176 2176 480 68 8 : tunables 0 0 0 : slabdata 32 32 0
xfs_btree_cur 1248 1248 208 39 2 : tunables 0 0 0 : slabdata 32 32 0
xfs_log_ticket 12981 13200 184 44 2 : tunables 0 0 0 : slabdata 300 300 0
nfsd4_openowners 0 0 440 37 4 : tunables 0 0 0 : slabdata 0 0 0
rpc_inode_cache 51 51 640 51 8 : tunables 0 0 0 : slabdata 1 1 0
ext4_groupinfo_4k 4440 4440 136 60 2 : tunables 0 0 0 : slabdata 74 74 0
ext4_inode_cache 4074 5921 1048 31 8 : tunables 0 0 0 : slabdata 191 191 0
ext4_xattr 276 276 88 46 1 : tunables 0 0 0 : slabdata 6 6 0
ext4_free_data 3264 3264 64 64 1 : tunables 0 0 0 : slabdata 51 51 0
ext4_allocation_context 2048 2048 128 64 2 : tunables 0 0 0 : slabdata 32 32 0
ext4_io_end 1785 1785 80 51 1 : tunables 0 0 0 : slabdata 35 35 0
ext4_extent_status 20706 20706 40 102 1 : tunables 0 0 0 : slabdata 203 203 0
jbd2_journal_handle 2720 2720 48 85 1 : tunables 0 0 0 : slabdata 32 32 0
jbd2_journal_head 4680 4680 112 36 1 : tunables 0 0 0 : slabdata 130 130 0
jbd2_revoke_table_s 256 256 16 256 1 : tunables 0 0 0 : slabdata 1 1 0
jbd2_revoke_record_s 4096 4096 32 128 1 : tunables 0 0 0 : slabdata 32 32 0
scsi_cmd_cache 7056 7272 448 36 4 : tunables 0 0 0 : slabdata 202 202 0
UDPLITEv6 0 0 1152 28 8 : tunables 0 0 0 : slabdata 0 0 0
UDPv6 728 728 1152 28 8 : tunables 0 0 0 : slabdata 26 26 0
tw_sock_TCPv6 0 0 256 64 4 : tunables 0 0 0 : slabdata 0 0 0
TCPv6 405 405 2112 15 8 : tunables 0 0 0 : slabdata 27 27 0
uhci_urb_priv 0 0 56 73 1 : tunables 0 0 0 : slabdata 0 0 0
cfq_queue 27790 27930 232 70 4 : tunables 0 0 0 : slabdata 399 399 0
bsg_cmd 0 0 312 52 4 : tunables 0 0 0 : slabdata 0 0 0
mqueue_inode_cache 36 36 896 36 8 : tunables 0 0 0 : slabdata 1 1 0
hugetlbfs_inode_cache 106 106 608 53 8 : tunables 0 0 0 : slabdata 2 2 0
configfs_dir_cache 0 0 88 46 1 : tunables 0 0 0 : slabdata 0 0 0
dquot 2048 2048 256 64 4 : tunables 0 0 0 : slabdata 32 32 0
kioctx 0 0 576 56 8 : tunables 0 0 0 : slabdata 0 0 0
userfaultfd_ctx_cache 0 0 128 64 2 : tunables 0 0 0 : slabdata 0 0 0
pid_namespace 0 0 2176 15 8 : tunables 0 0 0 : slabdata 0 0 0
user_namespace 0 0 280 58 4 : tunables 0 0 0 : slabdata 0 0 0
posix_timers_cache 0 0 248 66 4 : tunables 0 0 0 : slabdata 0 0 0
UDP-Lite 0 0 1024 32 8 : tunables 0 0 0 : slabdata 0 0 0
RAW 1530 1530 960 34 8 : tunables 0 0 0 : slabdata 45 45 0
UDP 1024 1024 1024 32 8 : tunables 0 0 0 : slabdata 32 32 0
tw_sock_TCP 10944 11328 256 64 4 : tunables 0 0 0 : slabdata 177 177 0
TCP 2886 3842 1920 17 8 : tunables 0 0 0 : slabdata 226 226 0
blkdev_queue 118 225 2088 15 8 : tunables 0 0 0 : slabdata 15 15 0
blkdev_requests 147485 350826 384 42 4 : tunables 0 0 0 : slabdata 8353 8353 0
blkdev_ioc 2262 2262 104 39 1 : tunables 0 0 0 : slabdata 58 58 0
fsnotify_event_holder 5440 5440 24 170 1 : tunables 0 0 0 : slabdata 32 32 0
fsnotify_event 15912 16252 120 68 2 : tunables 0 0 0 : slabdata 239 239 0
sock_inode_cache 12478 13260 640 51 8 : tunables 0 0 0 : slabdata 260 260 0
net_namespace 0 0 4608 7 8 : tunables 0 0 0 : slabdata 0 0 0
shmem_inode_cache 3264 3264 680 48 8 : tunables 0 0 0 : slabdata 68 68 0
Acpi-ParseExt 155680 155680 72 56 1 : tunables 0 0 0 : slabdata 2780 2780 0
Acpi-Namespace 16422 16422 40 102 1 : tunables 0 0 0 : slabdata 161 161 0
taskstats 1568 1568 328 49 4 : tunables 0 0 0 : slabdata 32 32 0
proc_inode_cache 12352 12544 656 49 8 : tunables 0 0 0 : slabdata 256 256 0
sigqueue 1632 1632 160 51 2 : tunables 0 0 0 : slabdata 32 32 0
bdev_cache 858 858 832 39 8 : tunables 0 0 0 : slabdata 22 22 0
sysfs_dir_cache 59580 59580 112 36 1 : tunables 0 0 0 : slabdata 1655 1655 0
inode_cache 15002 17050 592 55 8 : tunables 0 0 0 : slabdata 310 310 0
dentry 96235 273420 192 42 2 : tunables 0 0 0 : slabdata 6510 6510 0
iint_cache 0 0 80 51 1 : tunables 0 0 0 : slabdata 0 0 0
selinux_inode_security 22681 23205 80 51 1 : tunables 0 0 0 : slabdata 455 455 0
buffer_head 968560 1229592 104 39 1 : tunables 0 0 0 : slabdata 31528 31528 0
vm_area_struct 43185 43216 216 37 2 : tunables 0 0 0 : slabdata 1168 1168 0
mm_struct 860 860 1600 20 8 : tunables 0 0 0 : slabdata 43 43 0
files_cache 1887 1887 640 51 8 : tunables 0 0 0 : slabdata 37 37 0
signal_cache 3595 3724 1152 28 8 : tunables 0 0 0 : slabdata 133 133 0
sighand_cache 2373 2445 2112 15 8 : tunables 0 0 0 : slabdata 163 163 0
task_xstate 4920 5226 832 39 8 : tunables 0 0 0 : slabdata 134 134 0
task_struct 2303 2420 2944 11 8 : tunables 0 0 0 : slabdata 220 220 0
anon_vma 27367 27392 64 64 1 : tunables 0 0 0 : slabdata 428 428 0
shared_policy_node 5525 5525 48 85 1 : tunables 0 0 0 : slabdata 65 65 0
numa_policy 248 248 264 62 4 : tunables 0 0 0 : slabdata 4 4 0
radix_tree_node 321897 643216 584 56 8 : tunables 0 0 0 : slabdata 11486 11486 0
idr_layer_cache 953 975 2112 15 8 : tunables 0 0 0 : slabdata 65 65 0
dma-kmalloc-8192 0 0 8192 4 8 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-4096 0 0 4096 8 8 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-2048 0 0 2048 16 8 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-1024 0 0 1024 32 8 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-512 0 0 512 64 8 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-256 0 0 256 64 4 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-128 0 0 128 64 2 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-64 0 0 64 64 1 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-32 0 0 32 128 1 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-16 0 0 16 256 1 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-8 0 0 8 512 1 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-192 0 0 192 42 2 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-96 0 0 96 42 1 : tunables 0 0 0 : slabdata 0 0 0
kmalloc-8192 314 340 8192 4 8 : tunables 0 0 0 : slabdata 85 85 0
kmalloc-4096 983 1024 4096 8 8 : tunables 0 0 0 : slabdata 128 128 0
kmalloc-2048 4865 4928 2048 16 8 : tunables 0 0 0 : slabdata 308 308 0
kmalloc-1024 10084 10464 1024 32 8 : tunables 0 0 0 : slabdata 327 327 0
kmalloc-512 34318 88704 512 64 8 : tunables 0 0 0 : slabdata 1386 1386 0
kmalloc-256 35482 174592 256 64 4 : tunables 0 0 0 : slabdata 2728 2728 0
kmalloc-192 52022 85176 192 42 2 : tunables 0 0 0 : slabdata 2028 2028 0
kmalloc-128 30732 35392 128 64 2 : tunables 0 0 0 : slabdata 553 553 0
kmalloc-96 20418 35070 96 42 1 : tunables 0 0 0 : slabdata 835 835 0
kmalloc-64 375761 1018560 64 64 1 : tunables 0 0 0 : slabdata 15915 15915 0
kmalloc-32 34304 34304 32 128 1 : tunables 0 0 0 : slabdata 268 268 0
kmalloc-16 18432 18432 16 256 1 : tunables 0 0 0 : slabdata 72 72 0
kmalloc-8 25088 25088 8 512 1 : tunables 0 0 0 : slabdata 49 49 0
kmem_cache_node 683 704 64 64 1 : tunables 0 0 0 : slabdata 11 11 0
kmem_cache 576 576 256 64 4 : tunables 0 0 0 : slabdata 9 9 0
With the commands above we can determine which `Slab` caches occupy the most memory.
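The slab totals can also be cross-checked against `/proc/meminfo`, which reports the same grand total split into reclaimable (mostly filesystem caches) and unreclaimable parts; a quick sketch:

```shell
#!/bin/sh
# Slab grand totals from /proc/meminfo, in MB. If these add up to only a
# few GB while far more memory is missing, the leak is not inside Slab.
awk '$1 == "Slab:" || $1 == "SReclaimable:" || $1 == "SUnreclaim:" {
    printf "%-14s %8.1f MB\n", $1, $2 / 1024
}' /proc/meminfo
```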
Analysis of the field data showed that `Buddy` had handed out more than 100 GB of memory while `Slab` was using only a few GB. This means the leaked memory was not leaked through `Slab`/`kmalloc`, but directly from `Buddy`.
Memory allocated from `Buddy` may be consumed by `Slab`, huge pages, the page cache, the block cache, drivers, application page-fault mappings, `mmap`, and any other kernel code that allocates in units of pages. Among these, the most likely culprit is still a driver.
Source code analysis
To achieve high performance, high-speed NIC drivers usually allocate memory directly from `Buddy` in power-of-two page blocks (huge pages also come from `Buddy`; the difference is only the large page-table entries used when mapping to user space, with no further subdivision), then IO-map it to the NIC and organize it into multiple queues using `RingBuffer` arrays or `DMA` chains. The X710 is a fairly high-end NIC; it should work the same way, and its implementation cannot avoid these basic methods. The main `Buddy` allocation function is `__get_free_pages` (different kernel versions add macros and variants, but they are all alike: memory is allocated in blocks of 2^N pages, with N passed as the `order` parameter).
Quick analysis:
- From the output below we can quickly infer that the receive path does allocate pages directly from `Buddy`, albeit through the "variant" `alloc_pages_node` (which must ultimately call into the `Buddy` allocator, as its `order` parameter shows; we will not go into further detail).
$ grep -rHn pages
src/Makefile:107: @echo "Copying manpages..."
src/kcompat.h:5180:#ifndef dev_alloc_pages
src/kcompat.h:5181:#define dev_alloc_pages(_order) alloc_pages_node(NUMA_NO_NODE, (GFP_ATOMIC | __GFP_COLD | __GFP_COMP | __GFP_MEMALLOC), (_order))
src/kcompat.h:5184:#define dev_alloc_page() dev_alloc_pages(0)
src/kcompat.h:5620: __free_pages(page, compound_order(page));
src/i40e_txrx.c:1469: page = dev_alloc_pages(i40e_rx_pg_order(rx_ring));
src/i40e_txrx.c:1485: __free_pages(page, i40e_rx_pg_order(rx_ring));
src/i40e_txrx.c:1858: * Also address the case where we are pulling data in on pages only
src/i40e_txrx.c:1942: * For small pages, @truesize will be a constant value, half the size
src/i40e_txrx.c:1951: * For larger pages, @truesize will be the actual space used by the
src/i40e_txrx.c:1955: * space for a buffer. Each region of larger pages will be used at
src/i40e_lan_hmc.c:295: * This will allocate memory for PDs and backing pages and populate
src/i40e_lan_hmc.c:394: /* remove the backing pages from pd_idx1 to i */
- From the output below we can infer that the driver also uses `kmalloc` and `vmalloc`, which allocate kernel cache objects from `Slab`. Since the earlier analysis showed `Slab` is healthy, these parts should in theory be fine. (To really track them down, you could work out the size of each allocation, say 100 bytes, and then check whether the corresponding `kmalloc-128` `Slab` cache looks normal.)
$ grep -rHn malloc
src/kcompat.c:672: buf = kmalloc(len, gfp);
src/kcompat.c:683: void *ret = kmalloc(size, flags);
src/kcompat.c:746: adapter->config_space = kmalloc(size, GFP_KERNEL);
src/kcompat.h:52:#include <linux/vmalloc.h>
src/kcompat.h:1990:#ifndef vmalloc_node
src/kcompat.h:1991:#define vmalloc_node(a,b) vmalloc(a)
src/kcompat.h:1992:#endif /* vmalloc_node*/
src/kcompat.h:3587: void *addr = vmalloc_node(size, node);
src/kcompat.h:3596: void *addr = vmalloc(size);
src/kcompat.h:4011: p = kmalloc(len + 1, GFP_KERNEL);
src/kcompat.h:5342:static inline bool page_is_pfmemalloc(struct page __maybe_unused *page)
src/kcompat.h:5345: return page->pfmemalloc;
src/i40e_txrx.c:1930: !page_is_pfmemalloc(page);
src/i40e_main.c:11967: buf = kmalloc(INFO_STRING_LEN, GFP_KERNEL);
- Since memory leaks happen in code that frequently allocates and frees, after quickly ruling out the suspects above, the page allocation and release in the send/receive queues is the most likely place for a problem. The code below shows that the driver does not receive data through the kernel network stack's `SKB` buffers; instead it uses page memory taken directly from `Buddy` and maps it for `DMA` (in practice one of the most error-prone parts of writing a high-speed NIC driver).
// src/i40e_txrx.c
/**
* i40e_alloc_mapped_page - recycle or make a new page
* @rx_ring: ring to use
* @bi: rx_buffer struct to modify
*
* Returns true if the page was successfully allocated or
* reused.
**/
static bool i40e_alloc_mapped_page(struct i40e_ring *rx_ring,
struct i40e_rx_buffer *bi)
{
struct page *page = bi->page;
dma_addr_t dma;
/* since we are recycling buffers we should seldom need to alloc */
if (likely(page)) {
rx_ring->rx_stats.page_reuse_count++;
return true;
}
/* alloc new page for storage */
page = dev_alloc_pages(i40e_rx_pg_order(rx_ring));
if (unlikely(!page)) {
rx_ring->rx_stats.alloc_page_failed++;
return false;
}
/* map page for use */
dma = dma_map_page_attrs(rx_ring->dev, page, 0,
i40e_rx_pg_size(rx_ring),
DMA_FROM_DEVICE,
I40E_RX_DMA_ATTR);
/* if mapping failed free memory back to system since
* there isn't much point in holding memory we can't use
*/
if (dma_mapping_error(rx_ring->dev, dma)) {
__free_pages(page, i40e_rx_pg_order(rx_ring));
rx_ring->rx_stats.alloc_page_failed++;
return false;
}
bi->dma = dma;
bi->page = page;
bi->page_offset = i40e_rx_offset(rx_ring);
/* initialize pagecnt_bias to 1 representing we fully own page */
bi->pagecnt_bias = 1;
return true;
}
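One non-invasive way to watch for pages that are allocated but never returned is to sample the cumulative `Buddy` counters in `/proc/vmstat` over an interval; a sketch (the `pgalloc_*` and `pgfree` names are standard vmstat fields; the gap naturally fluctuates with free memory, but steady unbounded growth hints that pages are being retained somewhere):

```shell
#!/bin/sh
# Compare cumulative page allocations vs frees over an interval.
sample() {
    alloc=$(awk '/^pgalloc_/ {a += $2} END {printf "%d\n", a}' /proc/vmstat)
    freed=$(awk '$1 == "pgfree" {print $2}' /proc/vmstat)
    echo $((alloc - freed))
}

gap1=$(sample)
sleep "${1:-10}"
gap2=$(sample)
echo "alloc-free gap grew by $((gap2 - gap1)) pages in ${1:-10}s"
```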
Past experience says that to go further we would need the NIC's Data Sheet and Programming Guide, and Intel evidently has no plan to publish them in the short term. A thorough analysis would be quite difficult, probably taking more than a month, so we paused this direction for now. Still, nothing found so far contradicts the conclusions of the earlier memory analysis.
Data search
Next, we searched the kernel mailing list and the source repositories:
- On the driver's release site (Intel Ethernet Drivers and Utilities), the changelog says a `Memory Leak` bug was fixed (the driver we had problems with was version `2.3.6`):
https://sourceforge.net/projects/e1000/files/i40e%20stable/2.3.6/
Changelog for i40e-linux-2.3.6
===========================================================================
- Fix mac filter removal timing issue
- Sync i40e_ethtool.c with upstream
- Fixes for TX hangs
- Some fixes for reset of VFs
- Fix build error with packet split disabled
- Fix memory leak related to filter programming status
- Add and modify branding strings
- Fix kdump failure
- Implement an ethtool private flag to stop LLDP in FW
- Add delay after EMP reset for firmware to recover
- Fix incorrect default ITR values on driver load
- Fixes for programming cloud filters
- Some performance improvements
- Enable XPS with QoS on newer kernels
- Enable support for VF VLAN tag stripping control
- Build fixes to force perl to load specific ./SpecSetup.pm file
- Fix the updating of pci.ids
- Use 16 byte descriptors by default
- Fixes for DCB
- Don't close client in debug mode
- Add change MTU log in VF driver
- Fix for adding multiple ethtool filters on the same location
- Add new branding strings for OCP XXV710 devices
- Remove X722 Support for Destination IP Cloud Filter
- Allow turning off offloads when the VF has VLAN set
- Checking the kernel source repository, this is the patch that claims to fix the memory leak, but it does not actually fix it:
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/drivers/net/ethernet/intel/i40e/i40e_txrx.c?id=2b9478ffc550f17c6cd8c69057234e91150f5972
author Alexander Duyck <[email protected]> 2017-10-04 08:44:43 -0700
committer Jeff Kirsher <[email protected]> 2017-10-10 08:04:36 -0700
commit 2b9478ffc550f17c6cd8c69057234e91150f5972 (patch)
tree 3c3478f6c489db75c980a618a44dbd0dc80fc3ef /drivers/net/ethernet/intel/i40e/i40e_txrx.c
parent e836e3211229d7307660239cc957f2ab60e6aa00 (diff)
download net-2b9478ffc550f17c6cd8c69057234e91150f5972.tar.gz
i40e: Fix memory leak related filter programming status
It looks like we weren't correctly placing the pages from buffers that had
been used to return a filter programming status back on the ring. As a
result they were being overwritten and tracking of the pages was lost.
This change works to correct that by incorporating part of
i40e_put_rx_buffer into the programming status handler code. As a result we
should now be correctly placing the pages for those buffers on the
re-allocation list instead of letting them stay in place.
Fixes: 0e626ff7ccbf ("i40e: Fix support for flow director programming status")
Reported-by: Anders K. Pedersen <[email protected]>
Signed-off-by: Alexander Duyck <[email protected]>
Tested-by: Anders K Pedersen <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 1519dfb..2756131 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -1038,6 +1038,32 @@ reset_latency:
}
/**
+ * i40e_reuse_rx_page - page flip buffer and store it back on the ring
+ * @rx_ring: rx descriptor ring to store buffers on
+ * @old_buff: donor buffer to have page reused
+ *
+ * Synchronizes page for reuse by the adapter
+ **/
+static void i40e_reuse_rx_page(struct i40e_ring *rx_ring,
+ struct i40e_rx_buffer *old_buff)
+{
+ struct i40e_rx_buffer *new_buff;
+ u16 nta = rx_ring->next_to_alloc;
+
+ new_buff = &rx_ring->rx_bi[nta];
+
+ /* update, and store next to alloc */
+ nta++;
+ rx_ring->next_to_alloc = (nta < rx_ring->count) ? nta : 0;
+
+ /* transfer page from old buffer to new buffer */
+ new_buff->dma = old_buff->dma;
+ new_buff->page = old_buff->page;
+ new_buff->page_offset = old_buff->page_offset;
+ new_buff->pagecnt_bias = old_buff->pagecnt_bias;
+}
+
+/**
* i40e_rx_is_programming_status - check for programming status descriptor
* @qw: qword representing status_error_len in CPU ordering
*
@@ -1071,15 +1097,24 @@ static void i40e_clean_programming_status(struct i40e_ring *rx_ring,
union i40e_rx_desc *rx_desc,
u64 qw)
{
- u32 ntc = rx_ring->next_to_clean + 1;
+ struct i40e_rx_buffer *rx_buffer;
+ u32 ntc = rx_ring->next_to_clean;
u8 id;
/* fetch, update, and store next to clean */
+ rx_buffer = &rx_ring->rx_bi[ntc++];
ntc = (ntc < rx_ring->count) ? ntc : 0;
rx_ring->next_to_clean = ntc;
prefetch(I40E_RX_DESC(rx_ring, ntc));
+ /* place unused page back on the ring */
+ i40e_reuse_rx_page(rx_ring, rx_buffer);
+ rx_ring->rx_stats.page_reuse_count++;
+
+ /* clear contents of buffer_info */
+ rx_buffer->page = NULL;
+
id = (qw & I40E_RX_PROG_STATUS_DESC_QW1_PROGID_MASK) >>
I40E_RX_PROG_STATUS_DESC_QW1_PROGID_SHIFT;
@@ -1639,32 +1674,6 @@ static bool i40e_cleanup_headers(struct i40e_ring *rx_ring, struct sk_buff *skb,
}
/**
- * i40e_reuse_rx_page - page flip buffer and store it back on the ring
- * @rx_ring: rx descriptor ring to store buffers on
- * @old_buff: donor buffer to have page reused
- *
- * Synchronizes page for reuse by the adapter
- **/
-static void i40e_reuse_rx_page(struct i40e_ring *rx_ring,
- struct i40e_rx_buffer *old_buff)
-{
- struct i40e_rx_buffer *new_buff;
- u16 nta = rx_ring->next_to_alloc;
-
- new_buff = &rx_ring->rx_bi[nta];
-
- /* update, and store next to alloc */
- nta++;
- rx_ring->next_to_alloc = (nta < rx_ring->count) ? nta : 0;
-
- /* transfer page from old buffer to new buffer */
- new_buff->dma = old_buff->dma;
- new_buff->page = old_buff->page;
- new_buff->page_offset = old_buff->page_offset;
- new_buff->pagecnt_bias = old_buff->pagecnt_bias;
-}
-
-/**
* i40e_page_is_reusable - check if any reuse is possible
* @page: page struct to check
*
- Then look at feedback on this driver on the kernel mailing list. Many people have run into this problem; we are not alone:
https://www.mail-archive.com/[email protected]&q=subject:%22Re%5C%3A+Linux+4.12%5C%2B+memory+leak+on+router+with+i40e+NICs%22&o=newest
...
Upgraded and looks like problem is not solved with that patch
Currently running system with
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/
kernel
Still about 0.5GB of memory is leaking somewhere
Also can confirm that the latest kernel where memory is not
leaking (with
use i40e driver intel 710 cards) is 4.11.12
With kernel 4.11.12 - after hour no change in memory usage.
also checked that with ixgbe instead of i40e with same
net.git kernel there
is no memleak - after hour same memory usage - so for 100%
this is i40e
driver problem.
....
So far, all suspicion points to the memory allocation and release in this NIC's send/receive queues.
Problem verification
Since we could not reproduce the problem in the development and test environments, and hosts that had not upgraded the NIC driver were normal, we downgraded the driver on an affected host to version `2.2.4` and left everything else unchanged. After two weeks of running, everything remained normal, so the problem was crudely worked around. Going forward, we may need to keep following official driver updates and upgrade only after a release has stabilized and been verified.
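To verify a fix like this over weeks of runtime, a simple periodic sampler of non-cache memory use is enough; a sketch (interval and sample count are illustrative):

```shell
#!/bin/sh
# Log "used minus reclaimable caches" once per interval. On a leaking
# host this value climbs week over week; on a healthy host it stays flat.
INTERVAL=${1:-3600}   # seconds between samples
COUNT=${2:-24}        # number of samples

mi() { awk -v k="$1:" '$1 == k {print $2}' /proc/meminfo; }

i=0
while [ "$i" -lt "$COUNT" ]; do
    used_kb=$(( $(mi MemTotal) - $(mi MemFree) - $(mi Buffers) \
                - $(mi Cached) - $(mi SReclaimable) ))
    echo "$(date '+%F %T')  non-cache used: $((used_kb / 1024)) MB"
    i=$((i + 1))
    if [ "$i" -lt "$COUNT" ]; then sleep "$INTERVAL"; fi
done
```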
Summary and recommendations
- The production environment should have more complete logging and monitoring, so that problems are discovered and diagnosed promptly.
- Changes to hardware, drivers, kernels, and systems should be strictly controlled and verified, preferably after discussion and review with kernel developers.
- When solving problems, combine principle analysis, source-code analysis, on-site analysis, and comparative testing; do not follow a single path into a dead end.