MySQL Core File Dump Analysis in Practice

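A core file gives us a first-hand snapshot of mysqld at the moment it crashed, which is very valuable when analyzing the cause of the crash. In production, however, a database often uses tens or even hundreds of gigabytes of memory, and because a core file contains all of mysqld's memory, the core file can be very large as well. Enabling core file dumps in production therefore means accounting for both disk space and restart time: after mysqld crashes, mysqld_safe restarts it, but the restart has to wait until the core dump is complete, and writing hundreds of gigabytes of memory to disk can take a long time. (In my test, dumping a 700 MB core file took a few seconds.) Still, a core file offers a way to analyze problems, so it is worth exploring.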
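Before enabling core dumps, it can help to compare mysqld's resident memory with the free space on the target disk. The following is a rough pre-check sketch, not a guarantee: it assumes that resident memory (VmRSS in /proc/PID/status) is a reasonable estimate of the disk space a dump will occupy, and the function names are illustrative.

```shell
#!/bin/sh
# rss_kb STATUS_FILE: print the VmRSS value (in kB) found in a
# /proc/<pid>/status style file.
rss_kb() {
    awk '/^VmRSS:/ { print $2 }' "$1"
}

# free_kb DIR: print the space available (in kB) on the filesystem holding DIR.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# has_room_for_core STATUS_FILE DIR: succeed if DIR has more free space than
# the process's resident memory. Heuristic only (assumption: dump size on
# disk is close to the resident set size).
has_room_for_core() {
    [ "$(free_kb "$2")" -gt "$(rss_kb "$1")" ]
}
```

A possible invocation would be `has_room_for_core "/proc/$(pidof mysqld)/status" /dbfiles/corefiles && echo "enough space"`.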

Environment

Modify the MySQL configuration

In the my.cnf configuration file, enable core-file:

[mysqld]
core-file

Note that writing just core-file is enough here; core-file=ON or core_file=ON prevents mysqld from starting.

Then log in to mysql and check the core_file variable to confirm that the setting has taken effect:

mysql> show global variables like 'core_file';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| core_file     | ON    |
+---------------+-------+
1 row in set (0.01 sec)

Check the core file size limit that the system applies to the mysqld process:

[root@stg-p2pbusiness-mysql-01 ~]# cat /proc/`pidof mysqld`/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            10485760             unlimited            bytes
Max core file size        unlimited            unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             31405                31405                processes
Max open files            10000                10000                files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       31405                31405                signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us

With core-file enabled, Max core file size is unlimited. Moreover, setting core-file-size under [mysqld_safe] did not limit the size of the core file:

[mysqld_safe]
core-file-size=1024 # 1024 * 512 bytes
open-files-limit=10000
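
To verify which limit is actually in force after a configuration change, the check against /proc can be scripted. This is a minimal sketch that assumes the usual /proc/PID/limits layout (the soft limit is the fifth whitespace-separated field on the "Max core file size" line); the function name is illustrative.

```shell
#!/bin/sh
# core_soft_limit LIMITS_FILE: print the soft "Max core file size" limit
# found in a /proc/<pid>/limits style file.
core_soft_limit() {
    awk '/^Max core file size/ { print $5 }' "$1"
}
```

For a running server this could be called as `core_soft_limit "/proc/$(pidof mysqld)/limits"`; a value of 0 would mean no core file is written at all.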

Modify system parameters

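We normally start mysqld through mysqld_safe, and mysqld changes its user/group at startup. In that case suid_dumpable must be set to 1 so that the system generates a core dump for the mysqld process (this takes effect after mysqld is restarted).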

echo 1 > /proc/sys/fs/suid_dumpable

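On a disk with plenty of free space, create a directory to hold core files, set its permissions to 777 so that writes do not fail, and point the system's core_pattern parameter at the new directory. core_uses_pid defaults to 1 here, so the core file is named core.PID.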

mkdir -p /dbfiles/corefiles
chmod 777 /dbfiles/corefiles
echo "/dbfiles/corefiles/core" > /proc/sys/kernel/core_pattern
echo "1" > /proc/sys/kernel/core_uses_pid
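
How core_pattern and core_uses_pid combine into the final file name can be sketched as follows. This helper is illustrative only: it models just the %p specifier and the rule that the kernel appends .PID when core_uses_pid is nonzero and the pattern contains no %p; other % specifiers are passed through untouched.

```shell
#!/bin/sh
# expected_core_path PATTERN USES_PID PID: print the core file path implied
# by a core_pattern value, the core_uses_pid setting, and a process id.
expected_core_path() {
    case $1 in
        *%p*)
            # The pattern names the PID explicitly: substitute it.
            printf '%s\n' "$1" | sed "s/%p/$3/g"
            ;;
        *)
            if [ "$2" != "0" ]; then
                # No %p in the pattern and core_uses_pid is on: ".PID" is appended.
                printf '%s.%s\n' "$1" "$3"
            else
                printf '%s\n' "$1"
            fi
            ;;
    esac
}
```

With the settings above, `expected_core_path /dbfiles/corefiles/core 1 14609` gives /dbfiles/corefiles/core.14609.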

Use kill -11 `pidof mysqld` or kill -SIGSEGV `pidof mysqld` to simulate a segmentation fault and crash mysqld.

[root@stg-p2pbusiness-mysql-01 corefiles]# kill -11 `pidof mysqld`
[root@stg-p2pbusiness-mysql-01 corefiles]# /usr/bin/mysqld_safe: line 198: 14609 Segmentation fault (core dumped) nohup /usr/sbin/mysqld --basedir=/usr --datadir=/dbfiles/mysql_home/data --plugin-dir=/usr/lib64/mysql/plugin --user=mysql --log-error=stg-p2pbusiness-mysql-01.localhost.localdomain.err --open-files-limit=10000 --pid-file=/var/run/mysqld/mysqld.pid --socket=/dbfiles/mysql_home/data/mysql.sock --port=3306 < /dev/null > /dev/null 2>&1

We can see Segmentation fault (core dumped), after which mysqld_safe restarted mysqld. Looking at the core file, it occupies a little over 700 MB.

[root@stg-p2pbusiness-mysql-01 corefiles]# ls -lh
total 704M
-rw------- 1 mysql mysql 7.0G Jul 22 17:29 core.14609
[root@stg-p2pbusiness-mysql-01 corefiles]# du -sh *
704M core.14609

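This is a sparse file with many holes in it, which is why ls and du report different sizes: du shows the disk space actually used, while ls shows the logical file size. top shows that mysqld's resident memory (RES) is also about 700 MB, close to what du reports, while its virtual memory (VIRT, about 7 GB) is close to the file size shown by ls.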

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
14691 mysql     20   0 7797m 714m  10m S  0.0  8.9   0:01.36 mysqld

Let's look at this core file a little more closely:

[root@stg-p2pbusiness-mysql-01 corefiles]# file core.14609
core.14609: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/sbin/mysqld --basedir=/usr --datadir=/dbfiles/mysql_home/data --plugin-dir', real uid: 0, effective uid: 0, real gid: 0, effective gid: 0, execfn: '/usr/sbin/mysqld', platform: 'x86_64'

[root@stg-p2pbusiness-mysql-01 corefiles]# stat core.14609
File: `core.14609'
Size: 7504252928 Blocks: 1440384 IO Block: 4096 regular file
Device: fd02h/64770d Inode: 3670018 Links: 1
Access: (0600/-rw-------) Uid: ( 498/ mysql) Gid: ( 498/ mysql)
Access: 2018-07-22 17:34:15.689005465 +0800
Modify: 2018-07-22 17:29:56.449005465 +0800
Change: 2018-07-22 17:29:56.449005465 +0800
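
The Size and Blocks fields in the stat output explain the gap between ls and du: assuming 512-byte blocks, 1440384 blocks is 737476608 bytes (about 703 MiB) actually allocated, against a logical size of 7504252928 bytes. A small sketch of the same comparison using GNU stat format specifiers (%s, %b, %B); the function names are illustrative.

```shell
#!/bin/sh
# apparent_bytes FILE: logical file size in bytes (what ls -l reports).
apparent_bytes() {
    stat -c %s "$1"
}

# allocated_bytes FILE: bytes actually allocated on disk (what du reports),
# computed as allocated block count times the block size stat uses for %b.
allocated_bytes() {
    echo $(( $(stat -c %b "$1") * $(stat -c %B "$1") ))
}

# is_sparse FILE: succeed if the file occupies less disk space than its
# logical size, which is the signature of a sparse file.
is_sparse() {
    [ "$(allocated_bytes "$1")" -lt "$(apparent_bytes "$1")" ]
}
```

Running `is_sparse core.14609 && echo sparse` on the dump above would be expected to print sparse, given the figures shown.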

Load the core file in GDB and inspect the stacks. This requires the debuginfo package matching the MySQL version: debuginfo-install mysql-community-server-5.7.22-1.el6.x86_64

gdb mysqld core.14609

Use info thread (short for info threads) to list mysqld's internal threads at the time of the dump. Note that the number in (LWP 14622) is the thread_os_id.

(gdb) info thread
40 Thread 0x7fb234894700 (LWP 14622) 0x00007fb3db787614 in ?? () from /lib64/libaio.so.1
39 Thread 0x7fb23a8d9700 (LWP 14652) 0x00007fb3db997585 in sigwait () from /lib64/libpthread.so.0
38 Thread 0x7fb2121fc700 (LWP 14653) 0x00007fb3db99368c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
37 Thread 0x7fb233e93700 (LWP 14623) 0x00007fb3db787614 in ?? () from /lib64/libaio.so.1
36 Thread 0x7fb2401b2700 (LWP 14641) 0x00007fb3db993a5e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
35 Thread 0x7fb22d088700 (LWP 14634) 0x00007fb3db787614 in ?? () from /lib64/libaio.so.1
34 Thread 0x7fb23028d700 (LWP 14629) 0x00007fb3db787614 in ?? () from /lib64/libaio.so.1
33 Thread 0x7fb232a91700 (LWP 14625) 0x00007fb3db787614 in ?? () from /lib64/libaio.so.1
32 Thread 0x7fb22840b700 (LWP 14656) 0x00007fb3db993a5e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
31 Thread 0x7fb232090700 (LWP 14626) 0x00007fb3db787614 in ?? () from /lib64/libaio.so.1
30 Thread 0x7fb23bbab700 (LWP 14648) 0x00007fb3db99368c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
29 Thread 0x7fb22f88c700 (LWP 14630) 0x00007fb3db787614 in ?? () from /lib64/libaio.so.1
28 Thread 0x7fb236697700 (LWP 14619) 0x00007fb3db787614 in ?? () from /lib64/libaio.so.1
27 Thread 0x7fb23c5ac700 (LWP 14647) 0x00007fb3db99368c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
26 Thread 0x7fb23d9ae700 (LWP 14645) 0x00007fb3db99368c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
25 Thread 0x7fb230c8e700 (LWP 14628) 0x00007fb3db787614 in ?? () from /lib64/libaio.so.1
24 Thread 0x7fb212bfd700 (LWP 14651) 0x00007fb3db99368c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
23 Thread 0x7fb2135fe700 (LWP 14650) 0x00007fb3db993a5e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
22 Thread 0x7fb213fff700 (LWP 14649) 0x00007fb3db993a5e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
21 Thread 0x7fb203fff700 (LWP 14663) 0x00007fb3db99368c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
20 Thread 0x7fb23edb0700 (LWP 14643) 0x00007fb3db99700d in nanosleep () from /lib64/libpthread.so.0
19 Thread 0x7fb240bb3700 (LWP 14640) 0x00007fb3db993a5e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
18 Thread 0x7fb22a884700 (LWP 14638) 0x00007fb3db99368c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
17 Thread 0x7fb22da89700 (LWP 14633) 0x00007fb3db787614 in ?? () from /lib64/libaio.so.1
16 Thread 0x7fb23e3af700 (LWP 14644) 0x00007fb3db99368c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
15 Thread 0x7fb23cfad700 (LWP 14646) 0x00007fb3db99368c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
14 Thread 0x7fb22e48a700 (LWP 14632) 0x00007fb3db787614 in ?? () from /lib64/libaio.so.1
13 Thread 0x7fb22c687700 (LWP 14635) 0x00007fb3db993a5e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
12 Thread 0x7fb22ee8b700 (LWP 14631) 0x00007fb3db787614 in ?? () from /lib64/libaio.so.1
11 Thread 0x7fb23168f700 (LWP 14627) 0x00007fb3db787614 in ?? () from /lib64/libaio.so.1
10 Thread 0x7fb235295700 (LWP 14621) 0x00007fb3db787614 in ?? () from /lib64/libaio.so.1
9 Thread 0x7fb22b285700 (LWP 14637) 0x00007fb3db99368c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
8 Thread 0x7fb233492700 (LWP 14624) 0x00007fb3db787614 in ?? () from /lib64/libaio.so.1
7 Thread 0x7fb23f7b1700 (LWP 14642) 0x00007fb3db993a5e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
6 Thread 0x7fb235c96700 (LWP 14620) 0x00007fb3db787614 in ?? () from /lib64/libaio.so.1
5 Thread 0x7fb237098700 (LWP 14618) 0x00007fb3db787614 in ?? () from /lib64/libaio.so.1
4 Thread 0x7fb237a99700 (LWP 14617) 0x00007fb3db787614 in ?? () from /lib64/libaio.so.1
3 Thread 0x7fb22bc86700 (LWP 14636) 0x00007fb3db99368c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
---Type <return> to continue, or q <return> to quit---
* 2 Thread 0x7fb3d25f1700 (LWP 14610) 0x00007fb3da4386c7 in sigwaitinfo () from /lib64/libc.so.6
1 Thread 0x7fb3dbdb7720 (LWP 14609) 0x00007fb3db99497c in pthread_kill () from /lib64/libpthread.so.0

We can switch to a particular thread with thread n and then use the bt command to view that thread's stack:

(gdb) thread 1
[Switching to thread 1 (Thread 0x7fb3dbdb7720 (LWP 14609))]#0 0x00007fb3db99497c in pthread_kill () from /lib64/libpthread.so.0
(gdb) bt
#0 0x00007fb3db99497c in pthread_kill () from /lib64/libpthread.so.0
#1 0x00000000007d26d4 in handle_fatal_signal (sig=11) at /export/home/pb2/build/sb_0-27500212-1520171533.24/rpm/BUILD/mysql-5.7.22/mysql-5.7.22/sql/signal_handler.cc:220
#2 <signal handler called>
#3 0x00007fb3da4e4383 in poll () from /lib64/libc.so.6
#4 0x0000000000dedda8 in Mysqld_socket_listener::listen_for_connection_event (this=0x3614ad0)
at /export/home/pb2/build/sb_0-27500212-1520171533.24/rpm/BUILD/mysql-5.7.22/mysql-5.7.22/sql/conn_handler/socket_connection.cc:852
#5 0x00000000007cd0b9 in connection_event_loop (argc=53, argv=0x34e7338)
at /export/home/pb2/build/sb_0-27500212-1520171533.24/rpm/BUILD/mysql-5.7.22/mysql-5.7.22/sql/conn_handler/connection_acceptor.h:66
#6 mysqld_main (argc=53, argv=0x34e7338) at /export/home/pb2/build/sb_0-27500212-1520171533.24/rpm/BUILD/mysql-5.7.22/mysql-5.7.22/sql/mysqld.cc:5132
#7 0x00007fb3da423d1d in __libc_start_main () from /lib64/libc.so.6
#8 0x00000000007c2699 in _start ()
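
Switching threads one at a time is tedious with 40 of them. GDB can also run non-interactively and print every thread's backtrace in one pass. A minimal sketch using GDB's -batch and -ex options with the thread apply all bt command; the wrapper function and the output file name in the usage line are illustrative.

```shell
#!/bin/sh
# dump_all_backtraces BINARY COREFILE OUTFILE: write the backtrace of every
# thread in COREFILE to OUTFILE without starting an interactive GDB session.
dump_all_backtraces() {
    gdb -batch -ex "thread apply all bt" "$1" "$2" > "$3" 2>&1
}
```

For example: `dump_all_backtraces /usr/sbin/mysqld core.14609 all_threads.txt`. The resulting text file is much smaller than the core file and easier to keep or share.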

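Some of these stacks match the one mysqld prints to the error log when it crashes, only in more detail. The error log prints just the stack of the thread that caused the crash, which is more targeted, but the function information it resolves is rather vague. The mysqld error log: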

key_buffer_size=8388608
read_buffer_size=131072
max_used_connections=2
max_threads=300
thread_count=1
connection_count=1
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 2508204 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0 thread_stack 0x40000
/usr/sbin/mysqld(my_print_stacktrace+0x35)[0xf4fe15]
/usr/sbin/mysqld(handle_fatal_signal+0x4a4)[0x7d2774]
/lib64/libpthread.so.0(+0xf7e0)[0x7fb3db9977e0]
/lib64/libc.so.6(__poll+0x53)[0x7fb3da4e4383]
/usr/sbin/mysqld(_ZN22Mysqld_socket_listener27listen_for_connection_eventEv+0x38)[0xdedda8]
/usr/sbin/mysqld(_Z11mysqld_mainiPPc+0x1819)[0x7cd0b9]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x7fb3da423d1d]
/usr/sbin/mysqld[0x7c2699]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
Writing a core file
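
The function names in this log are C++ mangled symbols (for example _ZN22Mysqld_socket_listener27listen_for_connection_eventEv). One way to make them readable is to pass the log through c++filt from binutils, which rewrites mangled names in place and leaves other text alone. A minimal sketch, assuming c++filt is installed; the function name is illustrative.

```shell
#!/bin/sh
# demangle_log LOGFILE: print LOGFILE with C++ mangled symbols replaced by
# their demangled form. Requires c++filt (binutils).
demangle_log() {
    c++filt < "$1"
}
```

Applied to the trace above, the mangled listener frame would read Mysqld_socket_listener::listen_for_connection_event(), matching frame #4 in the GDB backtrace.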

In some cases mysqld cannot resolve function names in the stack and prints only hexadecimal addresses:

mysqld got signal 11;
Attempting backtrace. You can use the following information
to find out where mysqld died. If you see no messages after
this, something went terribly wrong...
stack_bottom = 0x41fd0110 thread_stack 0x40000
[0x9da402]
[0x6648e9]
[0x7f1a5af000f0]
[0x7f1a5a10f0f2]
[0x7412cb]
[0x688354]
[0x688494]
[0x67a170]
[0x67f0ad]
[0x67fdf8]
[0x6811b6]
[0x66e05e]

In this situation the resolve_stack_dump tool can help analyze the call stack. The official MySQL documentation describes it in detail (MySQL 5.7 manual, section 28.5.1.5).
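
The workflow can be sketched roughly as follows: build a symbol file from the mysqld binary with nm -n, put the address lines of the trace into a file, and pass both to resolve_stack_dump with its -s (symbols file) and -n (numeric dump file) options. The helper names are illustrative, and whether your version of resolve_stack_dump expects the brackets around each address to be kept is an assumption to check against the manual.

```shell
#!/bin/sh
# extract_stack_lines LOGFILE: print only the lines of LOGFILE that consist
# of a single bracketed hex address, e.g. [0x9da402].
extract_stack_lines() {
    grep -E '^\[0x[0-9a-fA-F]+\]$' "$1"
}

# resolve_stack BINARY STACKFILE: build a symbol table for BINARY and use it
# to resolve the addresses listed in STACKFILE.
resolve_stack() {
    symfile=$(mktemp)
    nm -n "$1" > "$symfile"
    resolve_stack_dump -s "$symfile" -n "$2"
    rm -f "$symfile"
}
```

A possible run, with mysqld.err and mysqld.stack as placeholder file names: `extract_stack_lines mysqld.err > mysqld.stack`, then `resolve_stack /usr/sbin/mysqld mysqld.stack`. The symbol file must come from the same mysqld binary that crashed, otherwise the addresses resolve to the wrong functions.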

Original link: 大专栏 https://www.dazhuanlan.com/2019/08/17/5d576c3a91393/

Reposted from www.cnblogs.com/chinatrump/p/11417278.html