systemtap latencytop

淘宝核心系统团队博客
基础 极致 分享

    Home
    招聘信息
    淘宝核心系统团队介绍

« 开源混合存储方案(Flashcache)
IP地址库介绍 »
latencytop深度了解你的Linux系统的延迟

我们在系统调优或者定位问题的时候,经常会发现多线程程序的效率很低,但是又不知道问题出在哪里,就知道上下文切换很多,但是为什么上下文切换,是谁导致切换,我们就不知道了。上下文切换可以用dstat这样的工具查看,比如:
1
2
3
4
5
6
7
8
9
10
11
12
13
14

$dstat
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  9   2  87   2   0   1|7398k   31M|   0     0 | 9.8k   11k|  16k   64k
20   4  69   3   0   4|  26M   56M|  34M  172M|   0     0 |  61k  200k
21   5  64   6   0   3|  26M  225M|  35M  175M|   0     0 |  75k  216k
21   5  66   4   0   4|  25M  119M|  34M  173M|   0     0 |  66k  207k
19   4  68   5   0   3|  23M   56M|  33M  166M|   0     0 |  60k  197k

#或者用systemtap脚本来看
$sudo stap -e 'global cnt; probe scheduler.cpu_on {cnt<<<1;} probe timer.s(1){printf("%d\n", @count(cnt)); delete cnt;}'
217779
234141
234759

每秒高达200k左右的的上下文切换, 谁能告诉我发生了什么? 好吧,latencytop来救助了!

它的官网:http://www.latencytop.org/

    Skipping audio, slower servers, everyone knows the symptoms of latency. But to know what’s going on in the system, what’s causing the latency, how to fix it… that’s a hard question without good answers right now.

    LatencyTOP is a Linux* tool for software developers (both kernel and userspace), aimed at identifying where in the system latency is happening, and what kind of operation/action is causing the latency to happen so that the code can be changed to avoid the worst latency hiccups.

它是Intel贡献的另外一个性能查看器,还有一个是powertop,都是很不错的工具.

Latencytop通过在内核上下文切换的时候,记录被切换的进程的内核栈,然后通过匹配内核栈的函数来判断是什么原因导致上下文切换,同时他把几十种容易引起切换的场景的函数都记录起来,这样在判断系统问题的时候能容易定位到问题。

latencytop分成2个部分,内核部分和应用部分。内核部分负责调用栈的收集并且通过/proc来暴露, 应用部分负责显示.

工作界面截图如下:

latencytop在2.6.256后被内核吸收成为其中一部分,只要编译的时候打开该选项就好,如何确认呢?
1
2

$ cat /proc/latency_stats
Latency Top version : v0.1

看到这个就好了, 遗憾的是RHEL6竟然带了latencytop应用部分,而没有打开编译选项,让我们情何以堪呢?
在ubuntu下可以这么安装:
1
2
3
4

$ uname -r
2.6.38-yufeng
$ apt-get install latencytop
$ sudo latencytop #就可以使用了

但是latencytop比较傻的是默认是开图像界面的,我们很不习惯,我们要文本界面, 自己动手把!
1
2
3
4
5
6
7
8
9
10
11

$ apt-get source latencytop
$ diff -up Makefile.orig Makefile
--- Makefile.orig   2011-03-29 20:10:29.025845447 +0800
+++ Makefile    2011-03-28 14:48:11.232318002 +0800
@@ -1,5 +1,5 @@
# FIXME: Use autoconf ?
-HAS_GTK_GUI = 0
+#HAS_GTK_GUI = 0

DESTDIR =
SBINDIR = /usr/sbin

重新make下就好了, 文本界面出现了. 具体使用参看 man latencytop。

fcicq同学说:

    加个 –nogui 参数就好了. 不需要重新编译.

谢谢!

好了,那么latencytop支持多少种的延迟原因呢?让latencytop.trans告诉你,我们也可以自己修改这个文件,把新的延迟原因加上去。
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147

$ cat /usr/share/latencytop/latencytop.trans
#
1   vfs_read        Reading from file
1   vfs_write       Writing to file
1   __mark_inode_dirty  Marking inode dirty
1   vfs_readdir     Reading directory content
1   vfs_unlink      Unlinking file
1   blocking_notifier_call_chain    Blocking notifier
1   lock_super      Superblock lock contention
1   vfs_create      Creating a file
1   KAS_ScheduleTimeout Binary AMD driver delay
1   firegl_lock_device  Binary AMD driver delay
#
2   __bread         Synchronous buffer read
2   do_generic_mapping_read Reading file data
2   sock_sendmsg        Sending data over socket
2   do_sys_open     Opening file
2   do_sys_poll     Waiting for event (poll)
2   core_sys_select     Waiting for event (select)
2   proc_reg_read       Reading from /proc file
2   __pollwait      Waiting for event (poll)
2   sys_fcntl       FCNTL system call
2   scsi_error_handler  SCSI error handler
2   proc_root_readdir   Reading /proc directory
2   ksoftirqd       Waking ksoftirqd
2   worker_thread       .
2   do_unlinkat     Unlinking file
2   __wait_on_buffer    Waiting for buffer IO to complete
2   pdflush         pdflush() kernel thread
2   kjournald       kjournald() kernel thread
2   blkdev_ioctl        block device IOCTL
2   kauditd_thread      kernel audit daemon
2   tty_ioctl       TTY IOCTL
2   __filemap_fdatawrite_range fdatasync system call
2   do_sync_write       synchronous write
2   kthreadd        kthreadd kernel thread
2   usb_port_resume     Waking up USB device
2   usb_autoresume_device   Waking up USB device
2   kswapd          kswapd() kernel thread
2   md_thread       Raid resync kernel thread
2   i915_wait_request   Waiting for GPU command to complete
2   request_module      Loading a kernel module

#
3   tty_wait_until_sent Waiting for TTY to finish sending
3   pipe_read       Reading from a pipe
3   pipe_write      Writing to a pipe
3   pipe_wait       Waiting for pipe data
3   read_block_bitmap   Reading EXT3 block bitmaps
3   scsi_execute_req    Executing raw SCSI command
3   sys_wait4       Waiting for a process to die
3   sr_media_change     Checking for media change
3   sr_do_ioctl     SCSI cdrom ioctl
3   sd_ioctl        SCSI disk ioctl
3   sr_cd_check     Checking CDROM media present
3   ext3_read_inode     Reading EXT3 inode
3   htree_dirblock_to_tree  Reading EXT3 directory htree
3   ext3_readdir        Reading EXT3 directory
3   ext3_bread      Synchronous EXT3 read
3   ext3_free_branches  Unlinking file on EXT3
3   ext3_get_branch     Reading EXT3 indirect blocks
3   ext3_find_entry     EXT3: Looking for file
3   __ext3_get_inode_loc    Reading EXT3 inode
3   ext3_delete_inode   EXT3 deleting inode
3   sync_page       Writing a page to disk
3   tty_poll        Waiting for TTY data
3   tty_read        Waiting for TTY input
3   tty_write       Writing data to TTY
3   update_atime        Updating inode atime
3   page_cache_sync_readahead   Pagecache sync readahead
3   do_fork         Fork() system call
3   sys_mkdirat     Creating directory
3   lookup_create       Creating file
3   inet_sendmsg        Sending TCP/IP data
3   tcp_recvmsg     Receiving TCP/IP data
3   link_path_walk      Following symlink
3   path_walk       Walking directory tree
3   sys_getdents        Reading directory content
3   unix_stream_recvmsg Waiting for data on unix socket
3   ext3_mkdir      EXT3: Creating a directory
3   journal_get_write_access    EXT3: Waiting for journal access
3   synchronize_rcu     Waiting for RCU
3   input_close_device  Closing input device
3   mousedev_close_device   Closing mouse device
3   mousedev_release    Closing mouse device
3   mousedev_open       Opening mouse device
3   kmsg_read       Reading from dmesg
3   sys_futex       Userspace lock contention
3   do_futex        Userspace lock contention
3   vt_waitactive       vt_waitactive IOCTL
3   acquire_console_sem Waiting for console access
3   filp_close      Closing a file
3   sync_inode      (f)syncing an inode to disk
3   ata_exec_internal_sg    Executing internal ATA command
3   writeback_inodes    Writing back inodes
3   ext3_orphan_add     EXT3 adding orphan
3   ext3_mark_inode_dirty   EXT3 marking inode dirty
3   ext3_unlink         EXT3 unlinking file
3   ext3_create     EXT3 Creating a file
3   log_do_checkpoint   EXT3 journal checkpoint
3   generic_delete_inode    Deleting an inode
3   proc_delete_inode   Removing /proc file
3   do_truncate     Truncating file
3   sys_execve      Executing a program
3   journal_commit_transaction  EXT3: committing transaction
3   __stop_machine_run  Freezing the kernel (for module load)
3   sys_munmap      unmapping memory
3   sys_mmap        mmaping memory
3   sync_buffer     Writing buffer to disk (synchronous)
3   inotify_inode_queue_event   Inotify event
3   proc_lookup     Looking up /proc file
3   generic_make_request    Creating block layer request
3   get_request_wait    Creating block layer request
3   alloc_page_vma      Allocating a VMA
#3  __d_lookup      Looking up a dentry
3   blkdev_direct_IO    Direct block device IO
3   sys_mprotect        mprotect() system call
3   shrink_icache_memory    reducing inode cache memory footprint
3   vfs_stat_fd     stat() operation
3   cdrom_open      opening cdrom device
3   sys_epoll_wait      Waiting for event (epoll)
3   sync_sb_inodes      Syncing inodes
3   tcp_connect     TCP/IP connect
3   ata_scsi_ioctl      ATA/SCSI disk ioctl
3   do_rmdir        Removing directory
3   vfs_rmdir       Removing directory
3   sys_flock       flock() on a file
3   usbdev_open     opening USB device
3   lock_kernel     Big Kernel Lock contention
3   blk_execute_rq      Submitting block IO
3   scsi_cmd_ioctl      SCSI ioctl command
3   acpi_ec_transaction ACPI hardware access
3   journal_get_undo_access Waiting for EXT3 journal undo operation
3   i915_irq_wait       Waiting for GPU interrupt
3   i915_gem_throttle_ioctl Throttling GPU while waiting for commands

#
#
5   do_page_fault       Page fault
5   handle_mm_fault     Page fault
5   filemap_fault       Page fault
5   sync_filesystems    Syncing filesystem
5   sys_nanosleep       Application requested delay
5   sys_pause       Application requested delay
5   evdev_read      Reading keyboard/mouse input
5   do_fsync        fsync() on a file (type 'F' for details)
5   __log_wait_for_space    Waiting for EXT3 journal space

延迟原因非常的详细.

本来到这里,我要介绍的要介绍了,但是且慢,由于这个东西要在2.6.26后的系统上使用,我们的线上系统大部分是RHEL 5U4, 2.6.18的, 我们如何使用呢?

这时候 systemtap 一如既往的前来救助了!

systemtap 1.4版本以后带了个latencytop.stp, 也是intel的贡献. 那我们试验下穷人家的latencytop.
它在那里呢?
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24

$ uname -r
2.6.18-164.el5

$ stap -V
Systemtap translator/driver (version 1.5 /0.137 non-git sources)
Copyright (C) 2005-2011 Red Hat, Inc. and others
This is free software; see the source for copying conditions.
enabled features: AVAHI LIBRPM LIBSQLITE3 NSS BOOST_SHARED_PTR TR1_UNORDERED_MAP NLS

$ ls -al /usr/share/doc/systemtap/examples/profiling/latencytap.stp
-rwxr-xr-x 1 chuba users 16240 Feb 17 22:02/usr/share/doc/systemtap/examples/profiling/latencytap.stp

$ sudo stap -t --all-modules /usr/share/doc/systemtap/examples/profiling/latencytap.stp
ERROR: Skipped too many probes, check MAXSKIPPED or try again with stap -t for more details.
WARNING: Number of errors: 0, skipped probes: 101
WARNING: Skipped due to global 'dequeue' lock timeout: 2
WARNING: Skipped due to global 'this_sleep' lock timeout: 99
----- probe hit report:
kernel.trace("deactivate_task")!, (/usr/share/doc/systemtap/examples/profiling/latencytap.stp:47:1), hits: 254, cycles: 680min/43327avg/2248467max, from: kernel.trace("deactivate_task")
kernel.trace("activate_task")!, (/usr/share/doc/systemtap/examples/profiling/latencytap.stp:58:1), hits: 255, cycles: 890min/502549avg/2271568max, from: kernel.trace("activate_task")
kernel.function("[email protected]/sched.c:1969")?, (/usr/share/doc/systemtap/examples/profiling/latencytap.stp:78:7), hits: 509, cycles: 213min/1002207avg/5382852max, from: kernel.function("finish_task_switch") from: scheduler.cpu_on
begin, (/usr/share/doc/systemtap/examples/profiling/latencytap.stp:123:1), hits: 1, cycles: 1802min/1802avg/1802max, from: begin
begin, (/usr/share/doc/systemtap/examples/profiling/latencytap.stp:131:1), hits: 1, cycles: 227979min/227979avg/227979max, from: begin
Pass 5: run failed.  Try again with another '--vp 00001' option.

出错了! 原因是lock timeout, 原来stap的全局变量是用锁保护的,现在超时了!知道原因好办,打个patch吧!
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50

$ diff -up translate.cxx.orig  translate.cxx
--- translate.cxx.orig     2011-03-22 21:26:52.000000000 +0800
+++ /translate.cxx     2011-03-29 20:31:28.000000000 +0800
@@ -5802,10 +5802,10 @@ translate_pass (systemtap_session& s)
       s.op->newline() << "#define MAXACTION_INTERRUPTIBLE (MAXACTION * 10)";
       s.op->newline() << "#endif";
       s.op->newline() << "#ifndef TRYLOCKDELAY";
-      s.op->newline() << "#define TRYLOCKDELAY 10 /* microseconds */";
+      s.op->newline() << "#define TRYLOCKDELAY 50 /* microseconds */";
       s.op->newline() << "#endif";
       s.op->newline() << "#ifndef MAXTRYLOCK";
-      s.op->newline() << "#define MAXTRYLOCK 100 /* 1 millisecond total */";
+      s.op->newline() << "#define MAXTRYLOCK 500 /* 1 millisecond total */";
       s.op->newline() << "#endif";
       s.op->newline() << "#ifndef MAXMAPENTRIES";
       s.op->newline() << "#define MAXMAPENTRIES 2048";

#编译安装后再来一次
$ sudo stap  --all-modules /usr/share/doc/systemtap/examples/profiling/latencytap.stp
ERROR: probe overhead exceeded threshold
WARNING: Number of errors: 1, skipped probes: 0
Pass 5: run failed.  Try again with another '--vp 00001' option.

#又错了,这次原因是probe overhead exceeded threshold, 看下代码我们知道,脚本的开销太大了,超过正常的负载,通过查看代码可以用STP_NO_OVERLOAD来解除这个限制

#再来一次
$ sudo  stap -DSTP_NO_OVERLOAD --all-modules -DMAXSKIPPED=1024 /usr/share/doc/systemtap/examples/profiling/latencytap.stp

Reason                                  Count  Average(us)  Maximum(us) Percent%
Userspace lock contention                 345     16409195     83258717      45%
                                         1453       867513     60231852      10%
                                           95      7391754     33821926       5%
migration() kernel thread                1733       402701      3571412       5%
                                         7239        87993       401552       5%
Reading from a pipe                       212      2922207     52151180       4%
                                          142      2267850     17990214       2%
                                          108      2457247      7494331       2%
Waking ksoftirqd                           16     16082822     59266312       2%
Waiting for event (select)                 99      2113310     28510974       1%
kjournald() kernel thread                 148      1313447     13983084       1%
Application requested delay                94      1059898     10011409       0%
                                           41      2391993      7618788       0%
Waiting for event (select)                 38      2259444     29057362       0%
                                          719        92947       584944       0%
Waiting for event (poll)                    1     57582711     57582711       0%
Application requested delay                 3     19030709     36000553       0%
Waiting for event (select)                 39      1341880      5847683       0%
                                           34       936628      6649350       0%
                                            5      6163603     10008484       0%
...

这次看到结果了,哈哈,小高兴一把。但是在繁忙的系统上这个脚本的资源占用特别多,也是不爽的。 幸运的是这个脚本支持查看某个进程的延迟情况, 就是在 latencytap.stp 后面加个-x 参数。

这个脚本设计应该是支持进程ID, 但是结果写成了线程ID,属于bug!!!

动手改下吧:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33

$ diff -u latencytap.stp.orig  latencytap.stp
---  latencytap.stp.orig    2011-02-17 22:02:40.000000000 +0800
+++ latencytap.stp     2011-03-29 20:43:51.000000000 +0800
@@ -15,7 +15,7 @@
global this_sleep;
global debug = 0;

-function log_event:long (p:long) { return (!traced_pid || traced_pid == p) }
+function log_event:long (p:long) { return (!traced_pid || traced_pid == task_pid(p)) }

#func names from hex addresses
function func_backtrace:string (ips:string)
@@ -50,14 +50,14 @@
   # check to see if task is in appropriate state:
   # TASK_INTERRUPTIBLE      1
   # TASK_UNINTERRUPTIBLE    2
-  if (log_event($p->pid) && (s & 3)) {
+  if (log_event($p) && (s & 3)) {
     dequeue[$p] = gettimeofday_us();
   }
}

probe kernel.trace("activate_task") !,
       kernel.function("activate_task") {
-  if (!log_event($p->pid)) next
+  if (!log_event($p)) next

   a = gettimeofday_us()
   d = dequeue[$p]

#再来一次
$ sudo stap  --all-modules /usr/share/doc/systemtap/examples/profiling/latencytap.stp -x $$
...

这下终于爽了,旧内核用systemtap版本的,新内核用内核版本的,世界和谐!

通过对线上MySQL的诊断发现大部分时间花在mutex锁的竞争上来,我说过了,我会收拾你的,等着瞧!

玩得开心!
latencytop systemtap

This entry was posted by 褚霸 on 2011年03月30日 at 11:07, and is filed under Linux开发, Performance. Follow any responses to this post through RSS 2.0. You can leave a response or trackback from your own site.
4 comments

    Raymond (322 天)

    绝对的好工具,我们也碰到过线程太多性能下降的问题,未分析这么深入,谢谢分享的如此详细。
    ydzhang (316 天)

    的确是个好东西,学习了,一直关注着淘宝核心系统团队博客,希望这次能顺利通过淘宝的实习生招聘,暑假去淘宝实习。
    yegle (295 天)

    加个语法高亮,或者至少用等宽字体显示啊…
    Microsoft Office Home And Student 2010 (84 天)

    Microsoft Office Home And Student 2010…

    [...]淘宝核心系统团队博客 | latencytop深度了解你的Linux系统的延迟[...]…
    Reply ↓

发表评论

电子邮件地址不会被公开。 必填项已用 * 标注

姓名 *

电子邮件 *

站点

评论

您可以使用这些 HTML 标签和属性: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

    标签
    代理 代码管理 分布式架构 分析 夸父 对象 开发模式 开源 性能测试 指南 算法 管理员 缓存 虚拟化 beanstalkd BugFix cache Cassandra CDN connect CPU CPU affinity epoll flashcache flush gdb http InnoDB kvm L1 Linux mysql OceanBase overview qemu redis release smp ssd systemtap Tair tcp TFS traffic server unix
    分类目录
        BugFix
        Cassandra
        CDN
        Dynamo
        hypertable
        JVM
        Linux开发
        OceanBase
        Performance
        Tair
        Tengine
        TFS
        优化
        分布式架构
        底层架构
        开发模式
        数据库
        未分类
        算法
    功能
        登录
        文章 RSS
        评论 RSS
        WordPress.org
    业界同行
        搜搜web开发
    成员个人博客
        丁奇个人博客
        三百个人博客
        伯松个人博客
        千石个人博客
        华庭个人博客
        叔度个人博客
        坤谷个人博客
        撒迦个人博客
        无施个人博客
        日照个人博客
        李子个人博客
        正祥个人博客
        永攀个人博客
        茂七个人博客
        褚霸个人博客
        那岩个人博客
        郁白个人博客
        雕梁个人博客
    淘宝其他官方博客
        淘宝DBA
        淘宝JAVA中间件
        淘宝UED
        淘宝招聘
        淘宝搜索技术博客
        淘宝数据平台
        淘宝质量保障

淘宝核心系统团队博客 © 2011 | Powered by WordPress and Mystique theme by digitalnature
20 queries in 0.77 seconds (25.64M)

猜你喜欢

转载自foxox.iteye.com/blog/1408069