blktrace tool source code

1.1.1  About blktrace

In iostat, await is the average time taken by a single I/O request, but it includes both the time spent in the I/O scheduler and the time spent in the hardware, so it cannot serve as an indicator of hardware performance. The svctm column of iostat is considered obsolete. This is where blktrace comes in handy: its data shows whether the I/O scheduler or the hardware is responsible for slow I/O.

blktrace is a user-space tool that collects detailed information about I/O requests as they pass through the block device layer (the block layer, hence the name blk trace), such as request submission, queuing, merging, and completion. It exposes the details of the I/O request queue, including the name and PID of the process performing the read or write, the time of each event, the physical block number, and the block size. It is a great tool for I/O analysis on Linux.

Kernel support for it has been integrated since kernel 2.6.17.

1.1.2  About blktrace source code

blktrace code download address:

git clone git://git.kernel.dk/blktrace.git

Or download from the snapshot site:

http://brick.kernel.dk/snaps/

Run make in the source directory to build it.

Note: the libaio-dev package must be installed first.

 

1.1.3  How blktrace works: the I/O stack

After an I/O request enters the block layer, it will go through the following process:

- Remap: the request may be remapped to another device by DM (Device Mapper) or MD (Multiple Device, software RAID).

- Split: the request may be split into several physical I/Os because it is not aligned to a sector boundary or is too large.

- Merge: the request may be merged with other I/O requests that are physically adjacent to it.

- Schedule: the I/O scheduler sends the request to the driver according to its scheduling policy.

- Submit: the driver submits the request to the hardware. It passes through the HBA, the cable (optical fiber, network cable, etc.), and the switch (SAN or network), and finally reaches the storage device. When the device has completed the request, it sends the result back.

The main events reported by blkparse are as follows:

 

Q - the I/O request is about to be generated

G - the I/O request is generated

I - the I/O request enters the I/O scheduler queue

D - the I/O request is issued to the driver

C - the I/O request has completed

According to the timestamp corresponding to the above steps, the time consumed by the I/O request in each stage can be calculated:

Q2G - the time spent generating the I/O request, including remap and split time;

G2I - the time the I/O request takes to enter the I/O scheduler, including merge time;

I2D - the time the I/O request waits in the I/O scheduler;

D2C - the time the I/O request spends in the driver and the hardware;

Q2C - the total time consumed by the I/O request (Q2G + G2I + I2D + D2C = Q2C), roughly equivalent to iostat's await.

D2C can be used as an indicator of hardware performance;

I2D can be used as an indicator of IO Scheduler performance.
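The stage arithmetic can be illustrated with a small Python sketch. The timestamps below are made-up values in nanoseconds for a single request, and stage_latencies is a hypothetical helper written for this illustration; it is not part of blktrace or btt.

```python
# Each stage is named after the two blkparse events that delimit it.
STAGES = [("Q2G", "Q", "G"), ("G2I", "G", "I"), ("I2D", "I", "D"), ("D2C", "D", "C")]


def stage_latencies(ts):
    """Return the per-stage latencies and the total Q2C for one request.

    ts maps an event letter (Q, G, I, D, C) to its timestamp in nanoseconds.
    """
    lat = {name: ts[end] - ts[start] for name, start, end in STAGES}
    lat["Q2C"] = ts["C"] - ts["Q"]
    return lat


# Made-up timestamps (nanoseconds) for one request.
example = {"Q": 0, "G": 2_000, "I": 5_000, "D": 105_000, "C": 1_105_000}
lat = stage_latencies(example)

for name in ("Q2G", "G2I", "I2D", "D2C", "Q2C"):
    print(f"{name}: {lat[name] / 1000:.1f} us")
```

In this made-up request D2C dominates, which would point at the driver or hardware rather than the I/O scheduler.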

The full set of actions is listed in the table below:

Act   Description
A     IO was remapped to a different device
B     IO bounced
C     IO completion
D     IO issued to driver
F     IO front merged with request on queue
G     Get request
I     IO inserted onto request queue
M     IO back merged with request on queue
P     Plug request
Q     IO handled by request queue code
S     Sleep request
T     Unplug due to timeout
U     Unplug request
X     Split
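As a rough illustration of how these action codes appear in practice, the Python sketch below parses one line of blkparse text output. It assumes the default blkparse output layout (device, CPU, sequence number, timestamp, PID, action, RWBS flags, then "sector + blocks"); the sample line is illustrative, and parse_line and Event are hypothetical names, not part of the blktrace tools.

```python
import re
from typing import NamedTuple, Optional

# Action codes from the table above.
ACTIONS = {
    "A": "IO was remapped to a different device", "B": "IO bounced",
    "C": "IO completion", "D": "IO issued to driver",
    "F": "IO front merged with request on queue", "G": "Get request",
    "I": "IO inserted onto request queue",
    "M": "IO back merged with request on queue", "P": "Plug request",
    "Q": "IO handled by request queue code", "S": "Sleep request",
    "T": "Unplug due to timeout", "U": "Unplug request", "X": "Split",
}

# Assumed default layout: maj,min cpu seq sec.nsec pid action rwbs [sector + blocks]
LINE_RE = re.compile(
    r"^\s*(\d+),(\d+)\s+(\d+)\s+(\d+)\s+(\d+\.\d+)\s+(\d+)\s+(\S+)\s+(\S+)"
    r"(?:\s+(\d+)\s+\+\s+(\d+))?"
)


class Event(NamedTuple):
    device: str
    cpu: int
    sequence: int
    time_ns: int
    pid: int
    action: str
    rwbs: str
    sector: Optional[int]
    nblocks: Optional[int]


def parse_line(line: str) -> Optional[Event]:
    """Parse one event line; return None for lines that are not events."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    major, minor, cpu, seq, ts, pid, action, rwbs, sector, nblocks = m.groups()
    sec, frac = ts.split(".")
    time_ns = int(sec) * 1_000_000_000 + int(frac.ljust(9, "0")[:9])
    return Event(
        f"{major},{minor}", int(cpu), int(seq), time_ns, int(pid), action, rwbs,
        int(sector) if sector is not None else None,
        int(nblocks) if nblocks is not None else None,
    )


sample = "  8,0    3        1     0.000000000   697  G   W 223490 + 8 [kjournald]"
ev = parse_line(sample)
print(ev.action, "=", ACTIONS.get(ev.action, "unknown"), ev.sector, ev.nblocks)
```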

 

1.1.4 btrecord and btreplay

btrecord converts the traces captured by blktrace into a recorded I/O workload. It parses each file generated by blktrace and extracts the I/O descriptions needed for replay.

#blktrace /dev/vda -w 100

This captures a 100-second trace of vda, the device under test. The trace files are written to the current directory, one per CPU, with default names of the form vda.blktrace.0.

btrecord -d ./ vda -v

Mind the spaces in the command. It reads the trace files (vda.blktrace.0 and so on) from the current directory; the default output file name is vda.replay.0.

btreplay -d ./ vda -W -v

btreplay replays I/O from the data files generated by btrecord. The command above replays the files in the current directory (vda.replay.0 by default); -W allows write operations.

 

1.1.5 blkiomon

Use blkiomon to monitor block-layer I/O dynamically from a live blktrace stream. The command is as follows:

./blktrace /dev/vda -a issue -a complete -w 10 -o - | ./blkiomon -I 1 -h -

1.1.6  btt

#blktrace -w 100 /dev/vda 

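The output of blkparse is large and hard to read, so btt can be used to analyze the blktrace data automatically. First use blkparse to merge the files, which are stored separately per CPU, into a single file named vda.bin:

./blkparse -i vda -d vda.bin

Then run btt to analyze vda.bin:

#btt -i vda.bin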

1.1.7 Using the blktrace tool

You can start monitoring directly with the following command:

blktrace -d /dev/nvme0n1 -o test1

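Note that the output files contain binary data and must be parsed with blkparse.

Alternatively, use the following command to parse and print the events in real time. The risk here is that a large number of events must be sorted in real time:

blktrace -d /dev/nvme0n1 -o - | blkparse -i -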
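To give a feel for the ordering work involved, the sketch below merges two per-CPU event lists into one time-ordered stream using Python's heapq.merge. This is only an illustration under the assumption that each per-CPU stream is already ordered by time; it is not the algorithm blkparse actually uses, and the event tuples are made up.

```python
import heapq


def merge_by_time(*per_cpu_streams):
    """Merge per-CPU event lists, each already sorted by time, into one list."""
    return list(heapq.merge(*per_cpu_streams, key=lambda ev: ev[0]))


# Made-up (timestamp_ns, action) events as they might arrive from two CPUs.
cpu0 = [(100, "Q"), (400, "D"), (900, "C")]
cpu1 = [(150, "Q"), (300, "G"), (500, "I")]

merged = merge_by_time(cpu0, cpu1)
for ts, action in merged:
    print(ts, action)
```

With live tracing the streams never end, so a real consumer has to buffer events for a while before it can be sure that no earlier event is still in flight, which is where the cost comes from.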

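In addition, blktrace can run in client/server (C/S) mode.

Run blktrace -l on the server that collects the trace data, then run the following on the client, that is, the system under test (SUT) whose device is traced:

blktrace -d /dev/sda -h <server hostname>

There are also two verification tools: verify_blkparse checks that the output of blkparse is correctly ordered in time, and blkrawverify checks that the data in the files written by blktrace is correct.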

 

1.1.8 References

The README shipped with blktrace (usage)

 

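1.2     blktrace tool source code, part 2

Part 1 introduced the blktrace tool, how to use it, and how it works.

Next, let's dissect the source code of the blktrace tool.

Start with the main function. It calls setlocale to set the locale, calls getpagesize to obtain the page size, and then obtains the number of online CPUs.

It then calls signal to install the signal handlers: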

        signal(SIGINT, handle_sigint);
        signal(SIGHUP, handle_sigint);
        signal(SIGTERM, handle_sigint);
        signal(SIGALRM, handle_sigint);
        signal(SIGPIPE, SIG_IGN);

The main job of the signal handler is to stop tracing quickly. Put plainly, it calls ioctl to change the trace state:

(void)ioctl(dpp->fd, BLKTRACESTOP);

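The arguments are then processed by handle_args, which sets the relevant static variables according to the options given. If no arguments are supplied, show_usage is called to print the usage help.

Finally, run_tracers is called to start tracing the disks.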
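The startup sequence described above can be mirrored in a short Python sketch. It is only an analogy for the C code: the handler here just records a stop request instead of issuing the BLKTRACESTOP ioctl, and state, install_handlers, and startup_info are hypothetical names introduced for this illustration. It assumes a Linux system.

```python
import os
import resource
import signal

# Stand-in for the tracer state that the real handler changes through ioctl.
state = {"stop": False}


def handle_sigint(signum, frame):
    """Record that tracing should stop; keep the handler as short as possible."""
    state["stop"] = True


def install_handlers():
    """Route the termination signals to one handler and ignore SIGPIPE."""
    for sig in (signal.SIGINT, signal.SIGHUP, signal.SIGTERM, signal.SIGALRM):
        signal.signal(sig, handle_sigint)
    signal.signal(signal.SIGPIPE, signal.SIG_IGN)


def startup_info():
    """Collect the values main gathers first: page size and online CPU count."""
    return {
        "pagesize": resource.getpagesize(),
        "ncpus": os.sysconf("SC_NPROCESSORS_ONLN"),
    }


if __name__ == "__main__":
    install_handlers()
    print(startup_info())
```

Routing SIGALRM to the same handler is convenient for a timed run: an alarm that fires after the requested duration stops tracing the same way Ctrl-C does.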

 

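1.2.1 Working modes

blktrace supports three working modes, defined in the code as follows. The -l and -h options activate the second and third modes respectively; the first is the default.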

enum {
        Net_none = 0,
        Net_server,
        Net_client,
};
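How the options map onto these modes can be sketched as follows. This is a simplified illustration, not the real option parsing, which lives in handle_args and is not reproduced here; NetMode and select_mode are hypothetical names.

```python
from enum import IntEnum


class NetMode(IntEnum):
    """Mirror of the C enum: the values follow the declaration order."""
    NET_NONE = 0
    NET_SERVER = 1
    NET_CLIENT = 2


def select_mode(argv):
    """Pick the working mode from a list of command-line arguments.

    -l selects server (listen) mode, -h <hostname> selects client mode,
    and anything else leaves the default local mode in place.
    """
    mode = NetMode.NET_NONE
    for arg in argv:
        if arg == "-l":
            mode = NetMode.NET_SERVER
        elif arg == "-h":
            mode = NetMode.NET_CLIENT
    return mode


print(select_mode(["-d", "/dev/sda"]).name)
print(select_mode(["-l"]).name)
print(select_mode(["-d", "/dev/sda", "-h", "server1"]).name)
```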

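1.2.2 blktrace architecture

Code is often just the implementation. It embodies the architectural ideas, but at the code level the architecture looks flat and lacks depth. The blktrace architecture is roughly as follows:

[Figure: blktrace architecture diagram]

The overall architecture is fairly clear: events in the I/O stack are handed to blktrace for recording, and blkparse finally translates them into readable form.

That is because the author of the blktrace tool is himself the maintainer of block I/O in the kernel.

An important data structure is blk_io_trace, defined in blktrace_api.h: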

/*
 * The trace itself
 */
struct blk_io_trace {
        __u32 magic;            /* MAGIC << 8 | version */
        __u32 sequence;         /* event number */
        __u64 time;             /* in nanoseconds */
        __u64 sector;           /* disk offset */
        __u32 bytes;            /*
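The fields visible above can be decoded from a raw record with Python's struct module. The sketch below handles only those five leading fields and assumes little-endian data; the magic constant 0x65617400 is taken from the upstream blktrace_api.h as an assumption, and decode_head is a hypothetical helper, not part of blktrace.

```python
import struct

# Assumed value of BLK_IO_TRACE_MAGIC from upstream blktrace_api.h.
BLK_IO_TRACE_MAGIC = 0x65617400

# magic (u32), sequence (u32), time (u64), sector (u64), bytes (u32), little-endian.
HEAD = struct.Struct("<IIQQI")


def decode_head(buf):
    """Decode the five leading fields of one blk_io_trace record."""
    magic, sequence, time_ns, sector, nbytes = HEAD.unpack_from(buf)
    # The low byte of magic carries the version, the upper bytes the magic itself.
    if (magic & 0xFFFFFF00) != BLK_IO_TRACE_MAGIC:
        raise ValueError(f"bad magic: {magic:#x}")
    return {
        "version": magic & 0xFF,
        "sequence": sequence,
        "time_ns": time_ns,
        "sector": sector,
        "bytes": nbytes,
    }


# Made-up record: version 7, event 1, t = 1000 ns, sector 223490, 4096 bytes.
raw = struct.pack("<IIQQI", BLK_IO_TRACE_MAGIC | 0x07, 1, 1000, 223490, 4096)
head = decode_head(raw)
print(head)
```

Checking the magic first is a cheap way to detect a misaligned read or a byte-order mismatch before trusting the remaining fields.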

