I. Determining the nature and scope of the bug
1. Analyze the core dump with symbols
$ gdb IMActivityServer.symbol core.32530
(gdb) bt
#0 0x0000000000a6951a in ?? ()
#1 0x00000000018c6db8 in ?? ()
#2 0x8f127f1911ab2800 in ?? ()
#3 0x00000000018c6d00 in ?? ()
#4 0x0000000400000004 in ?? ()
#5 0x00000000018c6d88 in ?? ()
#6 0x00000000006ba9bf in ?? ()
#7 0x00000000018a4400 in ?? ()
#8 0x8f127f1911ab2800 in ?? ()
#9 0x00000000018c6d00 in ?? ()
#10 0x00007f518b789010 in ?? ()
#11 0x00000000018c6d00 in ?? ()
#12 0x0000000000693166 in ?? ()
#13 0x0000000000000000 in ?? ()
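The backtrace yields no usable information and the logs show nothing either, so stack corruption is the prime suspect.
2. Add stack protection with the compiler flag -fstack-protector-all
This inserts protection code into every function. Rebuild with it, wait for the next crash, and inspect the new dump with symbols: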
$ gdb IMActivityServer.symbol core.32530
(gdb) bt
#0 0x00007f08a95d7118 in ?? () from /lib64/libgcc_s.so.1
#1 0x00007f08a95d8019 in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
#2 0x00007f08a93110c6 in backtrace () from /lib64/libc.so.6
#3 0x00007f08a927c334 in __libc_message () from /lib64/libc.so.6
#4 0x00007f08a9314a77 in __fortify_fail () from /lib64/libc.so.6
#5 0x00007f08a9314a40 in __stack_chk_fail () from /lib64/libc.so.6
#6 0x00000000006909a9 in ActivityService::cmdMsgParse (this=<optimized out>, ptNullCmd=<optimized out>,
nCmdLen=<optimized out>) at ActivityServer.cpp:930
#7 0x00007f08a75ae1e8 in ?? ()
#8 0x00007f0818000978 in ?? ()
#9 0x00007f0818000a08 in ?? ()
#10 0x00007f0818000970 in ?? ()
#11 0x0000000000000000 in ?? ()
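The backtrace now shows a function name: cmdMsgParse. In the source, ActivityServer.cpp:930 is the point where that function returns, so the stack is being corrupted by one of the messages it handles. This is the common message-handling function and its message volume is large, so the backtrace alone does not tell us which message is responsible.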
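The __stack_chk_fail frame in the backtrace is the stack protector at work: the compiler places a guard value (a canary) between a function's locals and its return address and checks it before the function returns. The block below only simulates that idea in plain C++; it is not what the compiler actually emits. FakeFrame, kCanary and copy_and_check are hypothetical names made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for a stack frame: a local buffer followed by a guard
// value, roughly where a compiler-inserted canary would sit.
struct FakeFrame {
    char buffer[16];
    std::uint64_t canary;
};

// Arbitrary guard value chosen for this sketch.
constexpr std::uint64_t kCanary = 0x00c0ffee12345600ULL;

// Copies n bytes to the start of the frame without checking n against the
// buffer size, then reports whether the guard value survived. A real build
// performs the equivalent check in the function epilogue.
bool copy_and_check(const char* src, std::size_t n) {
    FakeFrame frame;
    std::memset(frame.buffer, 0, sizeof frame.buffer);
    frame.canary = kCanary;
    assert(n <= sizeof frame);  // keep the simulation inside the object
    // Write through a char pointer to the whole object so that the simulated
    // overflow stays well-defined.
    std::memcpy(reinterpret_cast<char*>(&frame), src, n);
    return frame.canary == kCanary;
}
```

Copying 16 bytes leaves the guard intact, while copying 20 bytes overwrites part of it. In a protected build, that second case is what ends in __stack_chk_fail.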
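3. Add logging of the message number being handled and release another build
On several servers the last line of the log was the same: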
180531-12:43:51 ActivityServer[17701] ERROR: [ActivityInfoManager.cpp:1293] cmd(51) param(0) len(70) server(0) id(0)
It is therefore fairly safe to conclude that message 51 triggers the crash.
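A minimal sketch of the kind of logging added in this step, assuming the handler has access to the same five fields that appear in the log line. format_cmd_log is a hypothetical helper, not the server's real logging call. What matters is that the line is written before the message is dispatched, so the last line in the log names the message that was being processed when the process died.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: formats the fields to log before a message is handled.
std::string format_cmd_log(unsigned cmd, unsigned param, unsigned len,
                           unsigned server, unsigned id) {
    char line[96];
    int n = std::snprintf(line, sizeof line,
                          "cmd(%u) param(%u) len(%u) server(%u) id(%u)",
                          cmd, param, len, server, id);
    // The buffer is sized so that five 32-bit values always fit.
    assert(n > 0 && static_cast<std::size_t>(n) < sizeof line);
    return std::string(line);
}
```

Calling format_cmd_log(51, 0, 70, 0, 0) yields the same field layout as the log line above.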
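II. Analyzing the bug in detail
Message 51 is a generic forwarding wrapper, so the inner message it carries has to be examined. The plan is to set a breakpoint in the common handler function
ActivityInfoManager::msgParseTask
1. Find the function definition and check that it is the right one (skip this if you have the source)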
(gdb) info func ActivityInfoManager::msgParseTask
All functions matching regular expression "ActivityInfoManager::msgParseTask":
File ActivityInfoManager.cpp:
bool ActivityInfoManager::msgParseTask(Cmd::t_NullCmd const*, unsigned int, void const*);
bool ActivityInfoManager::msgParseTask(unsigned int, Cmd::t_NullCmd const*, unsigned int, void const*);
bool ActivityInfoManager::msgParseTaskByWebTransfer(Cmd::t_NullCmd const*, unsigned int, void const*);
bool ActivityInfoManager::msgParseTaskGM(Cmd::t_NullCmd const*, unsigned int, unsigned int);
2. Break there and inspect the arguments
(gdb) break ActivityInfoManager.cpp:1293
(gdb) c
Continuing.
[Switching to Thread 0x7f83877fe700 (LWP 414)]
Breakpoint 1, ActivityInfoManager::msgParseTask (this=0x7f840c16e010, ptNullCmd=0x7f83877edc30, cmdLen=84,
server_param=0x7f83877edc20) at ActivityInfoManager.cpp:1293
1293 ActivityInfoManager.cpp: No such file or directory.
(gdb) info locals
__temp_format__ = <error reading variable __temp_format__ (can't compute CFA for this frame)>
s = 0x7f83877edc20
buffercmd = <error reading variable buffercmd (can't compute CFA for this frame)>
cmd = <optimized out>
pBase = <optimized out>
// Look up the struct definition behind ptNullCmd (skip this if you have the source)
(gdb) info types Cmd::t_NullCmd
All types matching regular expression "Cmd::t_NullCmd":
File ../../common/zNullCmd.h:
Cmd::t_NullCmd;
Cmd::t_NullCmd;
Cmd::t_NullCmd;
Cmd::t_NullCmd;
Cmd::t_NullCmd;
// Show the layout of the struct (skip this if you have the source)
(gdb) ptype Cmd::t_NullCmd
type = class Cmd::t_NullCmd {
public:
union {
<no data fields>
};
void t_NullCmd(BYTE, BYTE);
}
// Print the arguments in detail
(gdb) p *(const Cmd::t_NullCmd *) 0x7f83877edc30
$11 = {{{byCmd = 51 '3', byParam = 0 '\000'}, {cmd = 51 '3', para = 0 '\000'}}}
(gdb) p *(Cmd::Activity::stServerParam*)0x7f83877fdc10
$6 = {asServer = 0 '\000', type = 0, serverid = 0}
// Check the message number
(gdb) p ptNullCmd->cmd
$2 = 33 '!'
3. Set a conditional breakpoint on the argument value
// Clear the old breakpoint
(gdb) clear
Deleted breakpoint 1
// Set the conditional breakpoint
(gdb) break ActivityInfoManager.cpp:1293 if ptNullCmd->cmd == 51
Breakpoint 2 at 0x6bdd50: file ActivityInfoManager.cpp, line 1293.
(gdb) c
Continuing.
4. Analyze message 51
// Inspect the message that hit the breakpoint
(gdb) p *(const Cmd::t_NullCmd *) 0x7f83877fdc30
$11 = {{{byCmd = 51 '3', byParam = 0 '\000'}, {cmd = 51 '3', para = 0 '\000'}}}
// The layout of message 51 is equivalent to
struct stActivityInCmd{
BYTE cmd;
BYTE para;
DWORD ActId;
WORD size;
char data[0];
};
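// The inner message lives in data. data starts 8 bytes past the address of the stActivityInCmd struct (1 + 1 + 4 + 2 bytes, assuming the struct is byte-packed), and the inner message also begins with a Cmd::t_NullCmd header
(gdb) p *(const Cmd::t_NullCmd *) (0x7f83877fdc30+8)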
$11 = {{{byCmd = 51 '3', byParam = 12 '\f'}, {cmd = 51 '3', para = 12 '\f'}}}
// The inner message is number 51 with sub-message number 12; the source shows that this is 'Cmd::Activity::stOpMount'
// Print the full contents of the structure
(gdb) p *(Cmd::Activity::stOpMount*)0x7f83877fdc38
$10 = {<Cmd::t_NullCmd> = {{{byCmd = 51 '3', byParam = 12 '\f'}, {cmd = 51 '3', para = 12 '\f'}}}, dwReqFunction = 10,
szName = "领工资\000sigin_pay_map_.size:[%d]\000", byOpType = 1 '\001', dwIndex = 0, dwUserId = 22684283, wAddType = 32,
dwTimeStart = 1527740441, dwTimeEnd = 1528345241, bNeedDelMount = false, wDelType = 0, bExtension = true}
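The pointer arithmetic done by hand in gdb above can also be written down as code. This is only a sketch under stated assumptions: BYTE, WORD and DWORD are taken to be 8-, 16- and 32-bit unsigned integers, the structs are assumed to be byte-packed (which is what the 8-byte offset suggests), and NullCmd, ActivityInHeader and read_inner_header are hypothetical names, not the server's own types.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

#pragma pack(push, 1)
// Stand-in for the two-byte Cmd::t_NullCmd header (assumed layout).
struct NullCmd {
    std::uint8_t cmd;
    std::uint8_t para;
};
// stActivityInCmd without its trailing data[0] member (assumed layout).
struct ActivityInHeader {
    std::uint8_t  cmd;
    std::uint8_t  para;
    std::uint32_t ActId;
    std::uint16_t size;
};
#pragma pack(pop)

static_assert(sizeof(ActivityInHeader) == 8, "payload is expected 8 bytes in");

// Reads the header of the inner message that follows the wrapper header.
// Returns false when the buffer is too short for what the wrapper claims.
bool read_inner_header(const unsigned char* msg, std::size_t len, NullCmd* out) {
    ActivityInHeader outer;
    if (len < sizeof outer) return false;
    std::memcpy(&outer, msg, sizeof outer);
    if (outer.size < sizeof *out) return false;        // no room for a header
    if (len - sizeof outer < outer.size) return false; // payload is truncated
    std::memcpy(out, msg + sizeof outer, sizeof *out);
    return true;
}
```

The length checks are the part worth noticing: the wrapper's size field comes from the sender and should be validated against the bytes actually received before anything is read or copied.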
Tracing how the Cmd::Activity::stOpMount message is handled turns up a stack overwrite:
Cmd::Activity::stActivityInCmd cmd;
...
bcopy(rev->data, cmd.data, rev->size);
...
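The message struct is used without being initialized, and data is copied straight into cmd.data. As the struct definition above shows, data is a zero-length array, so it refers to the memory immediately past the end of the struct on the stack. The copy therefore lands directly in the stack frame and overwrites whatever was there.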
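III. Fixing the bug
The fix is simple: initialize the struct properly before using it, so that data has real storage behind it before anything is copied in.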
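One way such a fix can look is sketched below. The original code of the fix is not shown, so this is an assumption about its shape rather than the author's actual change: the header is followed by a fixed-capacity payload area, and the copy is bounded by that capacity. ActivityInBuffer, kMaxPayload and fill_message are hypothetical names, and the integer widths and byte packing are assumed as before.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

#pragma pack(push, 1)
// stActivityInCmd without its trailing data[0] member (assumed layout).
struct ActivityInHeader {
    std::uint8_t  cmd;
    std::uint8_t  para;
    std::uint32_t ActId;
    std::uint16_t size;
};
#pragma pack(pop)

// Arbitrary capacity chosen for this sketch.
constexpr std::size_t kMaxPayload = 1024;

// Header plus real storage for the payload, instead of a bare header whose
// data member points past the end of the object.
struct ActivityInBuffer {
    ActivityInHeader head;
    char data[kMaxPayload];
};

static_assert(offsetof(ActivityInBuffer, data) == 8, "payload follows header");

// Initializes the message and copies the payload in. Oversized payloads are
// rejected instead of being written past the end of the buffer.
bool fill_message(ActivityInBuffer* out, std::uint32_t actId,
                  const char* payload, std::size_t n) {
    if (n > sizeof out->data) return false;
    std::memset(out, 0, sizeof *out);
    out->head.cmd = 51;
    out->head.ActId = actId;
    out->head.size = static_cast<std::uint16_t>(n);
    std::memcpy(out->data, payload, n);
    return true;
}
```

Whatever form the real fix takes, the two properties to preserve are the same: the destination must own enough memory for the payload, and the length taken from the incoming message must be checked against it.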
IV. Printing logs from gdb
$ gdb -p 28644
// Load the symbols
(gdb) symbol-file IMActivityServer.symbol
Reading symbols from /home/ztgame/IMTESTVERSION/release/IMActivityServer.symbol...done.
// Turn on logging
(gdb) set logging on
Future logs will be written to gdb.txt.
Copying output to gdb.txt.
// Set the breakpoint
(gdb) break ActivityInfoManager.cpp:1290
Breakpoint 2 at 0x6b70f0: file ActivityInfoManager.cpp, line 1290.
// Import the Python library
(gdb) python import datetime
// Attach a command script to the breakpoint
(gdb) commands 2 // start a command list; the argument is the breakpoint number
Type commands for breakpoint(s) 2, one per line.
End with a line saying just "end".
>silent // do not print the usual breakpoint message when it triggers
>python gdb.execute("set $now=\"" + datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S') + "\"")
>printf "%s cmd(%u) param(%u) len(%u) srvtype(%u) srvid(%u)\n",$now,ptNullCmd->cmd, ptNullCmd->para,cmdLen,((Cmd::Activity::stServerParam*)server_param)->type,((Cmd::Activity::stServerParam*)server_param)->serverid
>continue
>end // the command list must be terminated with end
(gdb) c
Open gdb.txt:
2018-05-31 19:45:45 cmd(17) param(1) len(38) srvtype(43) srvid(4301)
2018-05-31 19:45:45 cmd(17) param(1) len(49) srvtype(43) srvid(4301)
2018-05-31 19:45:45 cmd(17) param(1) len(49) srvtype(43) srvid(4301)
2018-05-31 19:45:45 cmd(17) param(1) len(38) srvtype(43) srvid(4301)
2018-05-31 19:45:45 cmd(17) param(1) len(38) srvtype(43) srvid(4301)
2018-05-31 19:45:46 cmd(22) param(14) len(54) srvtype(150) srvid(15000)