Analysis of NUMA-induced SWAP problems on MySQL servers

【Author】

Wang Dong: database expert at the Ctrip Technical Support Center, with a strong interest in investigating difficult database problems and in developing intelligent, automated tools for database operations and maintenance.

【Problem Description】

We know that when the mysqld process uses SWAP, MySQL performance suffers badly. SWAP problems are fairly complex, so this article starts from how SWAP works and then shares the cases we ran into and how we analyzed them.

【How SWAP works】

Swap is disk space (a partition or a file) that is used as an extension of memory. It involves two operations: swap-out and swap-in. Swap-out writes a process's temporarily inactive memory data to disk and frees the memory that data occupied. Swap-in reads the data back from disk into memory when the process accesses it again.

Swap extends the available memory and exists to support memory reclamation. There are two reclamation mechanisms. When a memory allocation finds that there is not enough free space, the system must reclaim some memory on the spot; this is called direct reclaim. In addition, a dedicated kswapd0 process reclaims memory periodically. To gauge memory usage, the kernel defines three watermarks: the minimum watermark (min), the low watermark (low), and the high watermark (high).

Run the following command to see the value of each watermark:

cat /proc/zoneinfo |grep -E "Node|pages free|nr_inactive_anon|nr_inactive_file|min|low|high"|grep -v "high:"

Main memory reclamation behaviors
1. When free memory falls below low, kswapd starts reclaiming memory and continues until free memory reaches high.
2. When free memory reaches min, direct reclaim is triggered.
3. When global reclaim is triggered and file + free <= high, anonymous pages are swapped out.
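
As a rough sketch of the check above, the following shell function condenses /proc/zoneinfo into one line per zone with its free page count and the three watermarks. The function name zone_watermarks is ours, not a standard tool, and the layout of /proc/zoneinfo differs slightly between kernel versions, so treat the field matching as an assumption to verify on your own system.

```shell
# zone_watermarks [FILE]: hypothetical helper, prints one line per memory zone.
# FILE defaults to /proc/zoneinfo; all values are page counts.
zone_watermarks() {
  awk '
    # Zone header, e.g. "Node 0, zone   Normal"
    $1 == "Node" { node = $2; sub(/,/, "", node); zone = $4 }
    # "pages free N" opens the per-zone counters
    $1 == "pages" && $2 == "free" { free = $3 }
    $1 == "min" { min = $2 }
    $1 == "low" { low = $2 }
    # "high" without a colon is the watermark; per-CPU "high:" lines are skipped
    $1 == "high" {
      printf "node=%s zone=%s free=%s min=%s low=%s high=%s\n", node, zone, free, min, low, $2
    }
  ' "${1:-/proc/zoneinfo}"
}

# Usage: zone_watermarks
```

Comparing free against low and min for each zone shows which of the reclamation behaviors listed above is likely to be active on that zone.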

【SWAP caused by NUMA】

In some cases we found that swap was in use even though the system still had plenty of free memory. This is caused by the NUMA architecture. Under NUMA, every node has its own local memory. When memory usage is unbalanced across nodes and one node runs short of memory, swap can occur.
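
One way to spot such an imbalance is to compare free memory per node. The sketch below reads the per-node meminfo files that Linux exposes under /sys/devices/system/node. The helper name node_free_mb is our own.

```shell
# node_free_mb [FILE...]: hypothetical helper, prints free memory per NUMA node in MB.
# Expects lines such as "Node 0 MemFree:  579584 kB"; reads stdin if no file is given.
node_free_mb() {
  awk '$1 == "Node" && $3 == "MemFree:" { printf "node%s free=%d MB\n", $2, $4 / 1024 }' "$@"
}

# Usage on a NUMA machine:
#   node_free_mb /sys/devices/system/node/node*/meminfo
```

A node whose free value is far below that of the other nodes is the one most likely to start swapping first.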

【swappiness】

We now have a rough understanding of memory reclamation, which covers both file pages and anonymous pages. File pages are reclaimed by dropping the cache directly, or by writing dirty pages back to disk first and then dropping them. Anonymous pages are reclaimed through swap: the data is written to disk and then the memory is released.
The value of /proc/sys/vm/swappiness controls how aggressively swap is used. It ranges from 0 to 100; the smaller the value, the more the kernel prefers to reclaim file pages and the less it uses swap. We initially set this value to 1, but found that it did not prevent swap. In fact, even with the value set to 0, swap still occurs once file + free <= high is satisfied.
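
The condition can be written down as a small check. This is a sketch of the rule as stated above, not kernel code: the function name anon_reclaim_forced is ours, and we assume that free, file, and high are page counts for the same zone, with file taken to mean that zone's file-backed (page cache) pages.

```shell
# anon_reclaim_forced FREE FILE HIGH
# Succeeds (exit status 0) when file + free <= high, the case in which
# anonymous pages are swapped out regardless of the swappiness value.
anon_reclaim_forced() {
  [ $(( $1 + $2 )) -le "$3" ]
}

# Example with made-up page counts: 1000 free, 300 file, high watermark 1500.
if anon_reclaim_forced 1000 300 1500; then
  echo "file + free <= high: swap expected"
else
  echo "file + free > high: rule not triggered"
fi
```

This makes it easier to see why lowering swappiness alone did not help: the rule depends only on the zone's page counts, not on the swappiness setting.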

【Options for disabling NUMA】

With NUMA enabled, unbalanced memory usage across NUMA nodes can lead to swap. There are several ways to address this:

1. Start mysqld with "numactl --interleave=all" in the mysqld_safe script
2. Add numa=off to the Linux kernel boot parameters; this requires a server reboot
3. Disable NUMA at the BIOS level
4. Use the innodb_numa_interleave option, introduced in MySQL 5.6.27/5.7.9

Options 2, 3, and 4 are fairly simple and are not described in detail. The following section focuses on option 1.

【Steps to enable NUMA interleave】

1. yum install numactl -y
2. Edit the /usr/bin/mysqld_safe file
    Below the line cmd="`mysqld_ld_preload_text`$NOHUP_NICENESS", add the following line:
    cmd="/usr/bin/numactl --interleave all $cmd"
3. service mysql stop
4. Flush to disk to prevent data loss
    sync;sync;sync
5. Wait 10 seconds
    sleep 10
6. Drop the pagecache, dentries, and inodes
    sysctl -q -w vm.drop_caches=3
7. service mysql start
8. Verify that numactl --interleave all took effect with the command below. interleave_hit is the number of allocations made from the node under the interleave policy; on a server that has not enabled the interleave policy, this value is very low.
    numastat -mn -p `pidof mysqld`

With the approach above we have so far resolved the swap problems caused by uneven memory distribution across NUMA nodes on our MySQL 5.6 servers. On servers running MySQL 5.7.23 we used the innodb_numa_interleave option instead, but it did not fully resolve the problem.

【Problems with the new innodb_numa_interleave option in MySQL 5.7】

On servers with innodb_numa_interleave enabled, memory allocation across NUMA nodes was still unbalanced, which can lead to swap. We analyzed this further:
1. The MySQL version is 5.7.23, with innodb_numa_interleave enabled.
2. Check the memory usage of the mysqld process with the command numastat -mn `pidof mysqld`
Node 0 uses about 122.5 GB of memory and Node 1 about 68.2 GB, while only 566 MB is left free on Node 0. If a later request needs more memory from Node 0 than it has available, swap may occur.

Per-node process memory usage (in MBs) for PID 1801 (mysqld)
                           Node 0          Node 1           Total
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                         0.00            0.00            0.00
Stack                        0.01            0.07            0.09
Private                 125479.61        69856.82       195336.43
----------------  --------------- --------------- ---------------
Total                   125479.62        69856.90       195336.52

3. To check whether innodb_numa_interleave actually took effect, analyze the memory allocation of the mysqld process through the /proc/1801/numa_maps file. Take one of its records as an example:

7f9067850000 interleave:0-1 anon=5734148 dirty=5734148 active=5728403 N0=3607212 N1=2126936 kernelpagesize_kB=4

7f9067850000: virtual address of the memory range
interleave:0-1: NUMA policy used for the range; here it is the interleave policy
anon=5734148: number of anonymous pages
dirty=5734148: number of dirty pages
active=5728403: number of pages on the active list
N0=3607212 N1=2126936: number of pages allocated on node 0 and node 1
kernelpagesize_kB=4: the page size is 4 KB

4. Summing the page counts for Node 0 and Node 1 across the records in this file shows that about 114.4 GB was allocated on Node 0 through the interleave policy and about 64.7 GB on Node 1.
This shows that the innodb_numa_interleave switch did take effect, yet even though MySQL allocates with the interleave policy, the memory is still unbalanced.
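
The statistics in step 4 can be reproduced with a short awk script. The sketch below sums the N<node>= page counts per policy and converts them to MB using each record's kernelpagesize_kB. The function name numa_policy_summary is ours, and records without a kernelpagesize_kB field are assumed to use 4 KB pages.

```shell
# numa_policy_summary [FILE...]: hypothetical helper, reads numa_maps records
# from the given files (or stdin) and prints the MB allocated per policy and node.
numa_policy_summary() {
  awk '
    {
      policy = $2; sub(/:.*/, "", policy)    # "interleave:0-1" -> "interleave"
      kb = 4                                 # assumed page size if the field is absent
      for (i = 3; i <= NF; i++)
        if ($i ~ /^kernelpagesize_kB=/) { split($i, p, "="); kb = p[2] }
      for (i = 3; i <= NF; i++)
        if ($i ~ /^N[0-9]+=/) { split($i, p, "="); mb[policy " " p[1]] += p[2] * kb / 1024 }
    }
    END { for (k in mb) printf "%s %.1f MB\n", k, mb[k] }
  ' "$@" | sort
}

# Usage: numa_policy_summary /proc/$(pidof mysqld)/numa_maps
```

Grouping by policy keeps memory allocated under interleave separate from memory allocated under the default policy, which is the distinction the rest of the analysis relies on.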

5. The source code for innodb_numa_interleave shows that when the switch is on, MySQL calls the Linux set_mempolicy function with the MPOL_INTERLEAVE policy so that memory is allocated across nodes: set_mempolicy(MPOL_INTERLEAVE, numa_all_nodes_ptr->maskp, numa_all_nodes_ptr->size)
Afterwards, in the destructor, it calls set_mempolicy(MPOL_DEFAULT, NULL, 0) to return to the default local allocation policy.

my_bool srv_numa_interleave = FALSE;

#ifdef HAVE_LIBNUMA
#include <numa.h>
#include <numaif.h>

/* The constructor switches the memory policy to MPOL_INTERLEAVE and the
destructor switches it back to MPOL_DEFAULT, so interleave applies only
while an object of this type is in scope. */
struct set_numa_interleave_t
{
    set_numa_interleave_t()
    {
        if (srv_numa_interleave) {
            ib::info() << "Setting NUMA memory policy to"
                " MPOL_INTERLEAVE";
            if (set_mempolicy(MPOL_INTERLEAVE,
                              numa_all_nodes_ptr->maskp,
                              numa_all_nodes_ptr->size) != 0) {
                ib::warn() << "Failed to set NUMA memory"
                    " policy to MPOL_INTERLEAVE: "
                    << strerror(errno);
            }
        }
    }

    ~set_numa_interleave_t()
    {
        if (srv_numa_interleave) {
            ib::info() << "Setting NUMA memory policy to"
                " MPOL_DEFAULT";
            if (set_mempolicy(MPOL_DEFAULT, NULL, 0) != 0) {
                ib::warn() << "Failed to set NUMA memory"
                    " policy to MPOL_DEFAULT: "
                    << strerror(errno);
            }
        }
    }
};
#endif /* HAVE_LIBNUMA */

【Comparison test: memory distribution across NUMA nodes with innodb_numa_interleave enabled versus starting mysqld with numactl --interleave=all】

Scenario 1: start the mysqld process with numactl --interleave=all

1. Modify the systemd configuration file, remove the innodb_numa_interleave = on setting from my.cnf, and restart the MySQL service

/usr/bin/numactl --interleave=all /usr/sbin/mysqld --daemonize --pid-file=/var/run/mysqld/mysqld.pid $MYSQLD_OPTS

2. Run the statement select count(*) from test.sbtest1. The table has 200 million rows; the statement runs for 14 minutes and reads the table's data into the buffer pool

3. After the run, the numa_maps file shows that the mysqld process allocates memory with the interleave policy across nodes, and the amounts allocated on the two nodes are essentially equal

7f9a3c5b3000 interleave:0-1 anon=1688811 dirty=1688811 N0=842613 N1=846198 kernelpagesize_kB=4
7f9a3c5b3000 interleave:0-1 anon=2497435 dirty=2497435 N0=1247949 N1=1249486 kernelpagesize_kB=4

4. The total memory allocated to the mysqld process is balanced across the nodes as well

Scenario 2: enable innodb_numa_interleave

1. Add the innodb_numa_interleave = on setting to my.cnf, restart the MySQL service, and run the same SQL statement as in Scenario 1

2. After the run, the numa_maps file shows that the memory the mysqld process allocated with the interleave policy is essentially balanced across the nodes

7f71d8d98000 interleave:0-1 anon=222792 dirty=222792 N0=111652 N1=111140 kernelpagesize_kB=4
7f74a2e14000 interleave:0-1 anon=214208 dirty=214208 N0=107104 N1=107104 kernelpagesize_kB=4
7f776ce90000 interleave:0-1 anon=218128 dirty=218128 N0=108808 N1=109320 kernelpagesize_kB=4

3. However, some memory is still allocated with the default (local) policy, and all of that memory was allocated on Node 0

7f31daead000 default anon=169472 dirty=169472 N0=169472 kernelpagesize_kB=4

4. In the end, the mysqld process has about 1 GB more memory allocated on Node 0 than on Node 1

【How to enable numactl --interleave=all on MySQL 5.7.23】

MySQL 5.7 no longer uses the mysqld_safe file, so numactl --interleave=all is enabled differently than on MySQL 5.6. The steps are summarized below:

1. Edit /etc/my.cnf (vim /etc/my.cnf) and remove the innodb_numa_interleave setting
2. Edit the local systemd configuration file (vim /usr/lib/systemd/system/mysqld.service) and add /usr/bin/numactl --interleave=all to the start command
    # Start main service
    ExecStart=/usr/bin/numactl --interleave=all /usr/sbin/mysqld --daemonize --pid-file=/var/run/mysqld/mysqld.pid $MYSQLD_OPTS
3. Stop the MySQL service
    systemctl stop mysqld.service
4. Reload the configuration file
    systemctl daemon-reload
5. Flush to disk to prevent data loss
    sync;sync;sync
6. Wait 10 seconds
    sleep 10
7. Drop the pagecache, dentries, and inodes
    sysctl -q -w vm.drop_caches=3
8. Start the MySQL service
    systemctl start mysqld.service
9. Verify that the change took effect
    First confirm that show global variables like 'innodb_numa_interleave'; reports the switch as OFF
    Normally the mysqld process allocates all of its memory with the interleave policy across nodes. If the command below returns records with any other policy, interleave did not take effect properly
    less /proc/`pidof mysqld`/numa_maps|grep -v 'interleave'
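
To make the check in step 9 easier to read, the output can be narrowed to non-interleaved mappings that actually hold anonymous pages. A minimal sketch, with a helper name (non_interleaved_anon) of our own choosing:

```shell
# non_interleaved_anon [FILE...]: hypothetical helper, prints
# "address policy anon_pages" for every numa_maps record whose policy
# is not interleave and that has an anon= field.
non_interleaved_anon() {
  awk '$2 !~ /^interleave/ {
    for (i = 3; i <= NF; i++)
      if ($i ~ /^anon=/) { split($i, p, "="); print $1, $2, p[2]; break }
  }' "$@"
}

# Usage: non_interleaved_anon /proc/$(pidof mysqld)/numa_maps
# No output means every mapping with anonymous pages uses interleave.
```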

【Conclusion】

Starting the mysqld process with numactl --interleave=all gives a more balanced memory distribution across NUMA nodes.
The difference comes from how the innodb_numa_interleave parameter is implemented: when it is on, global memory is allocated with the interleave policy, but per-thread memory is still allocated with the default local policy.
When the mysqld process is started with numactl --interleave=all, all memory is allocated with the interleave policy.

Origin www.cnblogs.com/CtripDBA/p/11541680.html