Introduction to blktrace
blktrace is a tracing tool for block device I/O in the Linux kernel. It was developed by the maintainer of the kernel's block device layer. With it, users can obtain detailed information about the I/O request queue, including the process name, process ID (PID), execution time, the physical block numbers being read or written, and the block size.
How blktrace works
1: When blktrace runs, it starts one thread per logical CPU on the machine and binds each thread to its CPU to collect data.
For example, the 9630 phone has cpu0, cpu1, cpu2, and cpu3, so 4 threads are started.
2: For each thread, blktrace opens a file (and so gets a file descriptor) under the debugfs mount point (default path: /sys/kernel/debug). It then calls ioctl, a system call that passes the trace parameters to the kernel. The kernel handles the request and writes trace data to these files through the debugfs file system.
3: blktrace is used together with blkparse, which parses the binary, fixed-format data that blktrace generates.
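To make the "binary data in a specific format" concrete, here is a minimal Python sketch that decodes one trace record header. The field layout, the magic value, and the major/minor split are assumptions based on struct blk_io_trace in the kernel's blktrace_api.h as I understand it (48 bytes, little-endian assumed here); check them against your kernel headers before relying on this. The names RECORD and decode_record are made up for illustration, and the sample record is synthetic, not real trace data.

```python
import struct

# Assumed layout of struct blk_io_trace (verify against blktrace_api.h):
# magic, sequence (u32); time, sector (u64); bytes, action, pid, device, cpu (u32);
# error, pdu_len (u16). Little-endian byte order is assumed.
RECORD = struct.Struct("<IIQQIIIIIHH")
BLK_IO_TRACE_MAGIC = 0x65617400  # assumed value; the low byte carries the version


def decode_record(buf):
    """Decode one record header from the start of buf into a dict."""
    (magic, seq, time_ns, sector, nbytes, action,
     pid, device, cpu, error, pdu_len) = RECORD.unpack_from(buf)
    if magic & 0xFFFFFF00 != BLK_IO_TRACE_MAGIC:
        raise ValueError("not a blktrace record (bad magic)")
    return {
        "sequence": seq,
        "time_ns": time_ns,
        "sector": sector,
        "bytes": nbytes,
        "action": action,
        "pid": pid,
        "major": device >> 20,      # assumed split: major in the high bits
        "minor": device & 0xFFFFF,  # assumed split: 20-bit minor
        "cpu": cpu,
        "error": error,
        "pdu_len": pdu_len,         # extra payload bytes that follow the header
    }


# Build a synthetic record and decode it again.
sample = RECORD.pack(BLK_IO_TRACE_MAGIC | 0x07, 35, 1157274569, 121897312,
                     4096, 0, 2544, (8 << 20) | 16, 0, 0, 0)
record = decode_record(sample)
print(record["major"], record["minor"], record["sector"], record["bytes"])
```

In practice blkparse does this decoding for you; the sketch only shows why the raw per-CPU files are not readable as text.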
Use of blktrace on mobile phones
Step 1: Enable block I/O tracing in the kernel.
Steps: 1. source build/envsetup.sh
2. lunch, then select the project
3. Run the kuconfig command to configure the kernel
4. Select Kernel hacking ---> Tracers ---> Support for tracing block IO actions
5. make systemimage -j4
6. Flash the newly generated system.img and boot.img to the phone
Step 2: Push the blktrace and blkparse executables to the phone.
Steps: 1. adb root (get root permissions)
2. adb remount (remount the system partition as writable)
3. adb push blktrace /system/bin/
4. adb push blkparse /system/bin/
5. adb shell
6. cd /system/bin
7. Make blktrace and blkparse executable: chmod 0777 blktrace blkparse
Blktrace preparation
1: The phone has 4 CPUs (cpu0, cpu1, cpu2, cpu3) and blktrace starts one thread per CPU, 4 threads in total. All 4 CPUs must therefore be awake before the blktrace monitoring command is run.
2: View method: adb root
adb shell
cd /sys/devices/system/cpu/
cat online
If it displays 0-3, all 4 CPUs are awake. If it displays 0, only cpu0 is awake.
3: To wake up the CPUs:
echo 1 >/sys/devices/system/cpu/cpufreq/sprdemand/cpu_hotplug_disable
4: The four CPUs only need to be awake when the blktrace monitoring command starts. Once it is running, cpu1/cpu2/cpu3 going on or off does not affect blktrace.
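The check in step 2 can be automated. The following is a minimal sketch, assuming the text of /sys/devices/system/cpu/online has already been fetched (for example with adb shell cat); the function names parse_cpu_list and all_online are made up for illustration.

```python
def parse_cpu_list(text):
    """Expand a kernel CPU list such as "0-3" or "0,2-3" into a sorted list."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def all_online(text, expected=4):
    """True when CPUs 0..expected-1 all appear in the online list."""
    return set(range(expected)) <= set(parse_cpu_list(text))


print(all_online("0-3"))  # all four CPUs awake
print(all_online("0"))    # only cpu0 awake
```

If all_online returns False, wake the CPUs as in step 3 before starting blktrace.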
Use Cases
1. mount -t vfat /dev/block/mmcblk0p1 /data/temp
2. blktrace /dev/block/mmcblk0p1 -o /data/trace
This command monitors the T card and writes the results to the /data directory. The generated files are named trace.blktrace.0, trace.blktrace.1, trace.blktrace.2, and trace.blktrace.3; their number is determined by the number of CPUs.
3. Open another terminal and run: dd if=/dev/zero of=/data/temp/11 bs=512 count=1024
4. blktrace now captures the writes. Press Ctrl+C to stop monitoring.
5. The trace.blktrace.x files contain binary data, so blkparse is required to parse them.
6. Parse the files: blkparse -i trace
This command prints the parsed results to the screen (run it in the directory that contains the trace.blktrace.x files).
The terminal then shows:
179,1 2 0 0.000361250 0 m N cfq110A fifo= (null)
179,1 2 0 0.000364094 0 m N cfq110A dispatch_insert
179,1 2 0 0.000378047 0 m N cfq110A dispatched a request
179,0 1 25 0.000381250 2152 A W 2993285 + 1 <- (179,1) 2991237
179,1 2 0 0.000382875 0 m N cfq110A activate rq, drv=1
179,1 1 26 0.000386313 2152 Q W 2993285 + 1 [kworker/u8:1]
179,1 2 1 0.000387250 88 D W 2444 + 2 [mmcqd/0]
179,1 1 27 0.000405024 2152 G W 2993285 + 1 [kworker/u8:1]
179,1 1 28 0.000409250 2152 P N [kworker/u8:1]
179,0 1 29 0.001225735 2152 A W 2993286 + 1024 <- (179,1) 29912
Detailed explanation of blktrace command
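1. blktrace /dev/block/mmcblk0p1 -o /data/trace
Explanation: monitors the mmcblk0p1 block device and stores the generated files in the /data directory. Four files are generated, all prefixed with trace: trace.blktrace.0, trace.blktrace.1, trace.blktrace.2, and trace.blktrace.3, corresponding to cpu0, cpu1, cpu2, and cpu3.
2. blktrace /dev/block/mmcblk0p1 -D /data/trace
Explanation: monitors the mmcblk0p1 block device and creates a directory named trace under /data. That directory holds mmcblk0p1.blktrace.0, mmcblk0p1.blktrace.1, mmcblk0p1.blktrace.2, and mmcblk0p1.blktrace.3, corresponding to cpu0, cpu1, cpu2, and cpu3.
3. blktrace /dev/block/mmcblk0p1 -o /data/trace -w 10
Explanation: the -w option sets how long to monitor before stopping, in seconds; -w 10 stops monitoring after 10 seconds.
4. blktrace /dev/block/mmcblk0p1 -o /data/trace -a WRITE
Explanation: -a WRITE monitors write operations only.
The -a action option selects the actions to monitor. The available actions are:
READ (reads)
WRITE (writes)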
BARRIER
SYNC
QUEUE
REQUEUE
ISSUE
COMPLETE
FS
PC
For details, see http://www.cse.unsw.edu.au/~aaronc/iosched/doc/blktrace.html (blktrace user guide)
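The four command forms above differ only in their options, so they can be assembled programmatically. The sketch below builds the argument list and predicts the per-CPU file names, following the naming described in items 1 and 2. It uses only the options shown in this section (-o, -D, -w, -a); the helper names blktrace_argv and expected_files are hypothetical, and nothing is executed.

```python
import posixpath


def blktrace_argv(device, base=None, outdir=None, seconds=None, action=None):
    """Build a blktrace command line from the options described above."""
    argv = ["blktrace", device]
    if base:
        argv += ["-o", base]         # base name of the output files
    if outdir:
        argv += ["-D", outdir]       # directory that receives the output files
    if seconds is not None:
        argv += ["-w", str(seconds)]  # stop after this many seconds
    if action:
        argv += ["-a", action]       # trace only this action, e.g. WRITE
    return argv


def expected_files(device, ncpus, base=None, outdir=None):
    """Predict the per-CPU file names, one file per CPU."""
    name = base if base else posixpath.basename(device)
    stem = posixpath.join(outdir, name) if outdir else name
    return ["%s.blktrace.%d" % (stem, cpu) for cpu in range(ncpus)]


print(" ".join(blktrace_argv("/dev/block/mmcblk0p1", base="/data/trace", seconds=10)))
print(expected_files("/dev/block/mmcblk0p1", 4, outdir="/data/trace"))
```

The resulting list can be passed to subprocess.run on a machine where blktrace is installed, or joined into a string for adb shell.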
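Parsing with blkparse
1. Live parsing: the trace data is parsed as it arrives and printed to the terminal, like the terminal output of blktrace shown above. Command: blktrace -d /dev/block/mmcblk0p1 -o - | blkparse -i -
2. File parsing, which can be done in two ways:
(1) Generate the parsed file on the phone
i. Method: change to the directory that contains trace.blktrace.0, trace.blktrace.1, trace.blktrace.2, and trace.blktrace.3, then run: blkparse -i trace -o /data/trace.txt
(2) Parse on a PC
i. Method: copy trace.blktrace.0, trace.blktrace.1, trace.blktrace.2, and trace.blktrace.3 from the phone to the PC, then run: ./blkparse -i trace -o trace.txt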
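The live-parsing pipe in item 1 can also be driven from a script. This is a minimal sketch, assuming blktrace and blkparse are on the PATH of the machine that runs it; live_parse_commands and run_pipe are hypothetical helper names. Note that run_pipe collects the output until both programs exit, so it suits a bounded run (for example with -w added) rather than an open-ended one.

```python
import subprocess


def live_parse_commands(device):
    """Return the two argv lists for: blktrace -d DEV -o - | blkparse -i -"""
    return (["blktrace", "-d", device, "-o", "-"], ["blkparse", "-i", "-"])


def run_pipe(producer, consumer):
    """Run `producer | consumer` and return the consumer's output as text."""
    first = subprocess.Popen(producer, stdout=subprocess.PIPE)
    second = subprocess.Popen(consumer, stdin=first.stdout,
                              stdout=subprocess.PIPE, text=True)
    # Close our copy so the producer sees a broken pipe if the consumer exits.
    first.stdout.close()
    output, _ = second.communicate()
    first.wait()
    return output


producer, consumer = live_parse_commands("/dev/block/mmcblk0p1")
print(" ".join(producer), "|", " ".join(consumer))
```

Calling run_pipe(producer, consumer) would then return the same text that blkparse prints to the terminal.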
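blkparse output format
The default output format is: "%D %2c %8s %5T.%9t %5p %2a %3d"
For example:
8,16 0 35 1.157274569 2544 G WBS 121897312 + 8 [jbd2/sdb-8]
Where:
%D major and minor device numbers: 8,16
%2c CPU ID: 0 (the file being parsed here is trace.blktrace.0, which holds the data collected on cpu0, so the CPU ID is 0)
%8s I/O sequence number, normally starting from 1: 35
%5T.%9t timestamp of the I/O event, as seconds.nanoseconds: 1.157274569
%5p process ID: 2544
%2a I/O action (explained below): G
%3d RWBS data. R means read, W means write, D means the block is discarded, B means barrier operation, and S means synchronous I/O. WBS above is a synchronous write with the barrier flag set.
121897312 is the starting sector relative to device 8,16, and + 8 means 8 consecutive sectors from there (a sector is 512 bytes by default, so 8 sectors is 4 KB). The trailing [jbd2/sdb-8] is the name of the process.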
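The field breakdown above maps directly onto a small parser. The following sketch assumes the default blkparse format and a 512-byte sector, as stated above; the regular expressions and the name parse_line are my own and cover only the line shapes that appear in this document.

```python
import re

# Default line: device, CPU, sequence, time, PID, action, RWBS, then details.
LINE = re.compile(
    r"^\s*(?P<major>\d+),(?P<minor>\d+)\s+(?P<cpu>\d+)\s+(?P<seq>\d+)\s+"
    r"(?P<sec>\d+)\.(?P<nsec>\d{9})\s+(?P<pid>\d+)\s+"
    r"(?P<action>\S+)\s+(?P<rwbs>[A-Z]+)\s*(?P<rest>.*)$"
)
EXTENT = re.compile(r"^(?P<sector>\d+) \+ (?P<count>\d+)")
SECTOR_BYTES = 512  # default sector size assumed in the text above


def parse_line(line):
    """Split one blkparse output line into named fields."""
    m = LINE.match(line)
    if m is None:
        raise ValueError("unrecognized blkparse line: %r" % line)
    event = {
        "device": (int(m.group("major")), int(m.group("minor"))),
        "cpu": int(m.group("cpu")),
        "seq": int(m.group("seq")),
        "time_ns": int(m.group("sec")) * 10**9 + int(m.group("nsec")),
        "pid": int(m.group("pid")),
        "action": m.group("action"),
        "rwbs": m.group("rwbs"),
        "detail": m.group("rest"),
        "sector": None,
        "bytes": None,
    }
    extent = EXTENT.match(m.group("rest"))
    if extent:  # lines such as "121897312 + 8 [name]"
        event["sector"] = int(extent.group("sector"))
        event["bytes"] = int(extent.group("count")) * SECTOR_BYTES
    return event


event = parse_line("8,16 0 35 1.157274569 2544 G WBS 121897312 + 8 [jbd2/sdb-8]")
print(event["action"], event["rwbs"], event["sector"], event["bytes"])
```

Lines without a sector range, such as plug (P) events, come back with sector and bytes set to None.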
List of IO actions
C -- complete A previously issued request has been completed. The output
will detail the sector and size of that request, as well as the
success or failure of it.
D -- issued A request that previously resided on the block layer queue
or in the i/o scheduler has been sent to the driver.
I -- inserted A request is being sent to the i/o scheduler for addition
to the internal queue and later service by the driver. The request
is fully formed at this time.
Q -- queued This notes intent to queue i/o at the given location. No
real requests exists yet.
B -- bounced The data pages attached to this bio are not reachable by
the hardware and must be bounced to a lower memory location. This
causes a big slowdown in i/o performance, since the data must be
copied to/from kernel buffers. Usually this can be fixed with using
better hardware -- either a better i/o controller, or a platform
with an IOMMU.
M -- back merge A previously inserted request exists that ends on the
boundary of where this i/o begins, so the i/o scheduler can merge
them together.
F -- front merge Same as the back merge, except this i/o ends where a
previously inserted request starts.
M -- front or back merge One of the above
G -- get request To send any type of request to a block device, a
struct request container must be allocated first.
S -- sleep No available request structures were available, so the
issuer has to wait for one to be freed.
P -- plug When i/o is queued to a previously empty block device queue,
Linux will plug the queue in anticipation of future ios being added
before this data is needed.
U -- unplug Some request data already queued in the device, start sending
requests to the driver. This may happen automatically if a
timeout period has passed (see next entry) or if a number of
requests have been added to the queue.
T -- unplug due to timer If nobody requests the i/o that was queued
after plugging the queue, Linux will automatically unplug it after
a defined period has passed.
X -- split On raid or device mapper setups, an incoming i/o may straddle
a device or internal zone and needs to be chopped up into
smaller pieces for service. This may indicate a performance problem
due to a bad setup of that raid/dm device, but may also just be
part of normal boundary conditions. dm is notably bad at this and
will clone lots of i/o.
A -- remap For stacked devices, incoming i/o is remapped to device
below it in the i/o stack. The remap action details what exactly is
being remapped to what.
For details, see http://www.cse.unsw.edu.au/~aaronc/iosched/doc/blktrace.html (blktrace user guide)
Example analysis
8,16 0 8 0.018543948 8191 Q W 12989792 + 24 [postgres]
8,16 0 9 0.018547191 8191 G W 12989792 + 24 [postgres]
8,16 0 10 0.018548571 8191 P N [postgres]
8,16 0 11 0.018550601 8191 I W 12989792 + 24 [postgres]
8,16 0 12 0.018551421 8191 U N [postgres] 1
8,16 0 13 0.018552618 8191 D W 12989792 + 24 [postgres]
8,16 0 14 0.018638488 8191 C W 12989792 + 24 [0]
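The lines above show the life cycle of one I/O request. Reading the actions in order gives Q G P I U D C:
Q: the intent to queue an I/O at this location is recorded; no real request exists yet
G: a struct request is allocated to carry the I/O, which is needed before any request can be sent to the device
P (plugging): the queue is plugged, that is, the system waits for more I/O requests to arrive in the queue so that it can optimize them and reduce the time spent executing I/O
I: the request is handed to the I/O scheduler; at this point it is fully formed
U (unplugging): the queue is unplugged. The device stops waiting for further I/O requests, the system must service the current one, and the request is passed on to the device driver. Between P and U the system waits for I/O and then schedules it. Some optimization happens here,
but only a little, because the wait is very short (about 3 microseconds between P and U in this trace)
D: the request is issued to the device driver
C: the request completes and a status is returned, either success or failure. The bracketed field at the end of the line, where the process name normally appears, holds the status: 0 means success, anything else means failure
This completes one I/O cycle.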
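The time spent between each pair of stages can be read off the timestamps. The sketch below does this for the seven lines of the example above, working in integer nanoseconds to avoid floating-point rounding. It handles exactly one request; matching events across many interleaved requests is what btt is for. The names action_times and phases are made up for illustration.

```python
TRACE = """\
8,16 0 8 0.018543948 8191 Q W 12989792 + 24 [postgres]
8,16 0 9 0.018547191 8191 G W 12989792 + 24 [postgres]
8,16 0 10 0.018548571 8191 P N [postgres]
8,16 0 11 0.018550601 8191 I W 12989792 + 24 [postgres]
8,16 0 12 0.018551421 8191 U N [postgres] 1
8,16 0 13 0.018552618 8191 D W 12989792 + 24 [postgres]
8,16 0 14 0.018638488 8191 C W 12989792 + 24 [0]
"""


def action_times(text):
    """Map each action letter to its first timestamp, in nanoseconds."""
    times = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue
        sec, nsec = fields[3].split(".")
        stamp = int(sec) * 10**9 + int(nsec.ljust(9, "0"))
        times.setdefault(fields[5], stamp)
    return times


def phases(times):
    """Durations between the stages of a single request, in nanoseconds."""
    return {
        "Q2G": times["G"] - times["Q"],  # queued -> request allocated
        "G2I": times["I"] - times["G"],  # allocated -> inserted in scheduler
        "I2D": times["D"] - times["I"],  # waiting in the scheduler
        "D2C": times["C"] - times["D"],  # driver and hardware
        "Q2C": times["C"] - times["Q"],  # whole request
    }


result = phases(action_times(TRACE))
for name, ns in result.items():
    print(name, ns, "ns")
```

For this request almost all of the time falls between D and C, that is, in the driver and the hardware.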
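Analyzing blktrace data with btt
blkparse only converts blktrace data into a human-readable format. The volume of data is usually large, so analyzing it by hand is not easy. btt is a tool that analyzes blktrace data automatically.
btt cannot analyze live data; it works only on data files saved by blktrace. Usage:
Merge the files that were saved separately per CPU into one file, named sdb.blktrace.bin: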
$ blkparse -i sdb -d sdb.blktrace.bin
Run btt to analyze sdb.blktrace.bin:
$ btt -i sdb.blktrace.bin
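In one btt example, 69.6173% of the time is spent in D2C, that is, in the hardware layer. This is normal: D2C is the measure of hardware performance. Here a single I/O takes 0.396594 ms on average, which is quite fast, and the slowest single I/O takes 10.70692 ms, which is not bad. Q2G and G2I are both very small and completely normal. I2D is slightly large, probably because of the cfq scheduler's scheduling policy. You can try another scheduler, such as deadline, compare the two, and choose the one that best suits your application.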
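G2I – time for the I/O request to enter the I/O scheduler, including merge time;
I2D – time the I/O request waits in the I/O scheduler;
D2C – time the I/O request spends in the driver and the hardware;
Q2C – total time of the I/O request (Q2I + I2D + D2C = Q2C, where Q2I = Q2G + G2I), comparable to await in iostat.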
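The identity Q2I + I2D + D2C = Q2C is also how a percentage breakdown like the one btt reports can be read. As a rough sketch, the function below turns four phase durations into shares of Q2C. The input values are the nanosecond durations of the single postgres request from the example analysis above, not btt output; btt itself works on averages over many requests. The name phase_shares is hypothetical.

```python
def phase_shares(q2g, g2i, i2d, d2c):
    """Return each phase as a percentage of Q2C, plus Q2C itself."""
    q2i = q2g + g2i          # Q2I = Q2G + G2I
    q2c = q2i + i2d + d2c    # Q2I + I2D + D2C = Q2C
    shares = {
        "Q2G": 100.0 * q2g / q2c,
        "G2I": 100.0 * g2i / q2c,
        "I2D": 100.0 * i2d / q2c,
        "D2C": 100.0 * d2c / q2c,
    }
    return shares, q2c


# Durations in nanoseconds, derived from the example request above.
shares, q2c = phase_shares(q2g=3243, g2i=3410, i2d=2017, d2c=85870)
for name, percent in shares.items():
    print("%s %.2f%%" % (name, percent))
print("Q2C", q2c, "ns")
```

A D2C share that dominates the total, as here, points at the driver and hardware rather than at the scheduler.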