FIO Disk Performance Test

fio is an open-source stress-testing tool, mainly used to benchmark disk IO performance. It is highly configurable and can run all kinds of mixed IO workloads according to the tester's needs. It supports 13 different IO engines (libaio, sync, mmap, posixaio, network, etc.), can test block devices or files, simulates various IO patterns with multiple threads or processes, and reports statistics such as IOPS, bandwidth, and latency. We primarily use fio for storage performance testing.

Disk Performance Basics

Although SSDs are now widely used, HDDs are still deployed in large numbers in storage systems, since they are cheap. SSDs store data electronically while HDDs rely on mechanical movement, so SSDs have an inherent performance advantage. The main factors affecting SSD performance are the design of the controller chip, the design of the flash cells, and of course the manufacturing process. At present the cutting-edge technology in SSD controller design is largely in Intel's hands, and Intel SSDs currently deliver the best performance.
For an HDD, 95% of the time spent reading and writing data is consumed by mechanical movement. One data operation is one IO service, which consists of the following steps:

  1. Seek time: the time required to move the read-write head to the correct track. The shorter the seek time, the faster the I/O operation. The average seek time of current disks is generally 3-15ms.
  2. Rotational latency: the time for the platter to rotate until the sector holding the requested data passes under the head. It depends on the rotational speed and is usually taken as half the time of one revolution. For example, the average rotational latency of a 7200rpm disk is about 60*1000/7200/2 = 4.17ms.
  3. Internal transfer time: the transfer from the platter to the disk's internal buffer, on the order of 440 microseconds.
  4. External transfer time: the transfer from the disk buffer over the interface to the host, on the order of 110 microseconds.

For a given disk, the rotational latency is fixed, and the internal and external transfer times are at the microsecond level and basically negligible, so the core factor affecting HDD performance is the seek time.
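
The per-IO cost described above can be sketched numerically. The figures below are the illustrative values from this section (an assumed 9ms average seek within the 3-15ms range), not measurements of any particular drive:

```python
# Rough per-IO service time for a 7200 rpm HDD, built from the
# component figures above (illustrative values, not measurements).
seek_ms = 9.0                          # assumed average seek, within 3-15 ms
rotational_ms = 60 * 1000 / 7200 / 2   # half a revolution at 7200 rpm
internal_xfer_ms = 0.44                # platter -> disk buffer
external_xfer_ms = 0.11                # disk buffer -> host interface

total_ms = seek_ms + rotational_ms + internal_xfer_ms + external_xfer_ms
print(f"rotational latency: {rotational_ms:.2f} ms")   # ~4.17 ms
print(f"total per-IO time:  {total_ms:.2f} ms")        # seek + rotation dominate
```

As the output shows, the two mechanical components account for nearly all of the per-IO time, which is why seek time is the lever that matters.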

There are two main metrics for measuring disk performance: IOPS and IO throughput/bandwidth.

  1. IOPS is the number of IO services completed per second. The dominant cost of one IO service is the seek time; with a large number of random IOs, each seek approaches its upper limit and IOPS drops.

    • Theoretical bounds = [1000ms/(3ms+4.17ms), 1000ms/(15ms+4.17ms)] ≈ [139, 52]. In a real environment, a disk's IOPS is also affected by the size of each IO: results will differ for 512B, 4k, 1M, and so on.
  2. IO throughput is the number of bytes read and written within a given time. It is closely tied to the size of each IO, which in turn affects IOPS: larger blocks per read/write minimize the relative seek overhead and improve throughput, but reduce IOPS.
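
The theoretical IOPS bounds above follow directly from the latency figures; a quick check:

```python
# Theoretical HDD IOPS bounds from the latency figures above:
# best case  = fast seek (3 ms)  + rotational latency (~4.17 ms)
# worst case = slow seek (15 ms) + rotational latency (~4.17 ms)
rot_ms = 60 * 1000 / 7200 / 2          # ~4.17 ms at 7200 rpm
iops_upper = 1000 / (3 + rot_ms)       # ~139
iops_lower = 1000 / (15 + rot_ms)      # ~52
print(f"theoretical IOPS range: [{int(iops_lower)}, {int(iops_upper)}]")
```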

Example: writing 10,000 files of 1kb each to the hard disk takes much longer than writing 10 files of 1mb each.

  1. Although the total data volume is 10mb in both cases, the 10,000 1kb files are laid out discontinuously on the tracks, and reading them all may take thousands of IOs, which greatly increases total seek time and therefore total elapsed time.
    • This case guides the HDD IOPS performance test: small data blocks, many requests.
  2. By contrast, the 10 1mb files are laid out far more contiguously on the tracks, and reading all the data may take only a dozen or so IOs, which greatly reduces seek time and elapsed time.
    • This case guides the HDD IO throughput/bandwidth performance test: large data blocks, few requests.
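The two cases above can be sketched with a toy cost model: each file costs at least one random IO (seek + rotation), while the data inside a file transfers at the sequential rate. The per-IO overhead and transfer rate below are assumed illustrative values, not benchmark results:

```python
# Toy model: total time = (number of IOs x per-IO overhead) + transfer time.
per_io_overhead_ms = 9 + 4.17          # assumed avg seek + rotational latency
transfer_mb_per_s = 100.0              # assumed sustained transfer rate

def total_time_ms(num_files, file_kib):
    """One random IO per file, plus the time to move the bytes."""
    transfer_ms = num_files * file_kib / 1024 / transfer_mb_per_s * 1000
    return num_files * per_io_overhead_ms + transfer_ms

small = total_time_ms(10_000, 1)       # 10,000 seeks dominate
large = total_time_ms(10, 1024)        # 10 seeks, same 10 MiB of data
print(f"10,000 x 1KiB: {small/1000:.1f} s, 10 x 1MiB: {large/1000:.2f} s")
```

The same 10 MiB of data costs orders of magnitude more when it is scattered across 10,000 small files.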

Whether to optimize for IOPS or IO throughput depends on the workload. Large-file storage wants the highest possible throughput, while small-file or random read/write workloads care more about IOPS. If the workload depends on both to some degree, you need to find an appropriate blocksize for each IO so as to balance IOPS and throughput, since IOPS × block_size = IO throughput.
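
The identity IOPS × block_size = throughput makes the trade-off concrete. The device numbers below are assumed for illustration (a drive doing ~10,000 IOPS at 4k but bandwidth-limited at large blocks):

```python
# IOPS x block_size = throughput, for two assumed operating points.
def throughput_mib_s(iops, block_kib):
    return iops * block_kib / 1024

print(throughput_mib_s(10_000, 4))     # 4k blocks, high IOPS -> ~39 MiB/s
print(throughput_mib_s(1_600, 128))    # 128k blocks, low IOPS -> 200 MiB/s
```

Small blocks maximize IOPS but waste bandwidth; large blocks maximize bandwidth but cap IOPS. The right blocksize sits wherever the workload needs the balance.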

Methods to improve IO performance:

  1. IO merging: combine multiple IO requests into one.
  2. IO striping: with RAID and similar disk arrays, distribute a single IO request across multiple physical disks concurrently, improving IO throughput.
  3. IO write caching: each write goes to an IO cache rather than directly to the physical device, and is flushed to disk when the cache fills.
  4. IO read-ahead buffering: even if only 4k is requested at a time, for example 512k may be read ahead into the buffer; of course, this hurts workloads whose IOs are smaller than 4k.

NOTE: In computer programming, performance is often optimized from the perspectives of temporal locality, spatial locality, and cache hit rate; this principle of locality is universal across the computing field.

fio installation

Ubuntu environment

sudo apt-get install fio gnuplot

Embedded platform or source code compilation
fio source code download address: https://github.com/axboe/fio/tags

git clone https://github.com/axboe/fio.git
cd fio
./configure --cc=arm-linux-gnueabihf-gcc --prefix=./build
make
make install

Confirm that libaio and libaio-devel are installed on the system; without these two packages, fio cannot use the asynchronous libaio engine. If they were not installed before fio was built, recompile and reinstall fio after installing them, otherwise the libaio engine remains unavailable.

User guides

fio can be configured with different parameters for different io performance tests. There are hundreds of configurable parameters, covering almost all io models; detailed documentation is on fio's official site: https://fio.readthedocs.io/en/latest/fio_doc.html
Commonly used parameters in storage performance testing are as follows:

  • filename — device or file name, e.g. raw device /dev/sdb or file /home/test.img. Defines the test target, usually a device or a file. To test the raw device /dev/sdb, set filename=/dev/sdb; to test file-system performance, set filename=/home/test.img.
  • name — test name. Required; the label of this fio run, with no effect on performance.
  • rw — test type: read, write, rw, randread, randwrite, randrw. Defines the read/write pattern: read = sequential read, write = sequential write, rw (readwrite) = mixed sequential read/write, randread = random read, randwrite = random write, randrw = mixed random read/write. For the mixed types, the default read/write ratio is 1:1.
  • rwmixread / rwmixwrite — percentage in [0,100]. Defines the proportion of reads or writes in mixed mode; for example, rwmixread=10 means a read:write ratio of 10:90.
  • ioengine — io engine: sync, libaio, psync, vsync, mmap, etc. Defines how fio issues io requests. sync uses basic read/write calls; libaio is Linux native asynchronous io, which only supports queued operations with unbuffered (direct) io. The default is psync, a synchronous model; the asynchronous libaio engine is generally used together with direct=1.
  • direct — 0 or 1. Whether to use direct io: 0 uses buffered io, 1 bypasses the buffer cache.
  • bs — number with unit (k, K, m, M). The io block size; the default is 4k.
  • numjobs — positive integer. The number of test processes/threads; the default is 1. If the thread parameter is also given, fio runs a single process with multiple threads; otherwise it runs multiple processes.
  • iodepth — positive integer. The number of io requests each process/thread may keep in flight; the default is 1, and the parameter only matters for asynchronous io. With synchronous io the next request cannot be issued until the previous one completes, so iodepth has no effect.

Other relevant parameters:

  • size — number with unit, the data volume of the io operation. Unless runtime is specified, fio reads/writes the full amount and then stops. The value can be a number with a unit, e.g. size=2G for 2G of data, or a percentage, e.g. size=20% for 20% of the device/file capacity. Besides the data volume, size also bounds the io range: together with offset, io happens within [offset, offset+size].
  • runtime — positive integer, the test duration in seconds. The run stops when the time is up, even if the specified size has not been fully processed.
  • ramp_time — positive integer, the warm-up time in seconds. The warm-up period is excluded from the test statistics.
  • thinktime — positive integer. How long to wait after one io completes before issuing the next, in microseconds. Generally used to simulate an application waiting or processing transactions.
  • time_based — flag (no value). If set, the test keeps running, looping over the workload if necessary, until runtime expires.
  • offset — test starting position on the device; used with sequential read/write. Different offsets strongly affect HDD performance but in theory have no effect on SSDs.
  • group_reporting — flag (no value). When concurrency is greater than 1, aggregates the results of all processes/threads into one overall report instead of reporting each separately.
  • cpus_allowed — cpu core(s). The cpu cores that fio's processes/threads may run on: a single core, several cores, or a range.
  • output — result output path. By default results are printed to the command line; when set, results are written to the target file and nothing is shown on the command line.
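
For reference, several of the parameters above might be combined in a fio job file (the configuration file method is described later). This is a hypothetical sketch; the file name, sizes, and durations are placeholders, not recommendations:

```ini
; Hypothetical job file exercising the parameters above.
; All values are illustrative placeholders.
[global]
ioengine=libaio
direct=1
bs=4k
size=20%
runtime=60
ramp_time=10
time_based
group_reporting

[randread-sample]
filename=/home/test.img
rw=randread
numjobs=4
iodepth=16
```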

Command line test method

# fio -name=mytest \
-filename=/dev/sdb \
-direct=1 \
-iodepth=20 \
-thread \
-rw=randread \
-ioengine=libaio \
-bs=16k \
-size=5G \
-numjobs=2 \
-runtime=300 \
-group_reporting
  • name=mytest: the name of this test, user-defined.
  • filename=/dev/sdb: the raw device to test, usually the disk under test with no file system on it.
  • direct=1: bypass the machine's buffer cache, making the results closer to the device's real performance.
  • iodepth=20: each thread keeps up to 20 io requests in flight.
  • rw=randread: test random read I/O.
  • ioengine=libaio: use the libaio asynchronous engine.
  • bs=16k: each io uses a 16k block.
  • size=5G: each job processes 5G of data in 16k ios.
  • numjobs=2: run 2 test threads.
  • runtime=300: run for 300 seconds; if omitted, the run continues until the full 5G has been processed in 16k ios.
  • group_reporting: aggregate per-thread results into one summary when displaying results.

Sequential read:

fio -name=mytest -filename=/dev/sdb -direct=1 -iodepth=20 -thread -rw=read -ioengine=libaio -bs=16k -size=5G -numjobs=2 -runtime=300 -group_reporting

Sequential write:

fio -name=mytest -filename=/dev/sdb -direct=1 -iodepth=20 -thread -rw=write -ioengine=libaio -bs=16k -size=5G -numjobs=2 -runtime=300 -group_reporting

Random write:

fio -name=mytest -filename=/dev/sdb -direct=1 -iodepth=20 -thread -rw=randwrite -ioengine=libaio -bs=1k -size=5G -numjobs=2 -runtime=300 -group_reporting

Mixed sequential read and write:

fio -name=mytest -filename=/dev/sdb -direct=1 -iodepth=20 -thread -rw=rw -ioengine=libaio -bs=16k -size=5G -numjobs=2 -runtime=300 -group_reporting

Mixed random read and write:

fio -name=mytest -filename=/dev/sdb -direct=1 -iodepth=20 -thread -rw=randrw -ioengine=libaio -bs=16k -size=5G -numjobs=2 -runtime=300 -group_reporting

File system test method

fio -name=mytest -filename=/test/test.img -direct=1 -iodepth=20 -thread -rw=randread -ioengine=libaio -bs=16k -size=5G -numjobs=2 -runtime=300 -group_reporting
  • filename=/test/test.img: the test file, located on an existing file system. Typically the raw disk is partitioned, formatted, and mounted at this directory first.

Configuration file method

fio can also run multiple different jobs at the same time from a job file, so a single short command drives the whole test. Test command: fio <job file>.

Job file writing reference:

[global]
ioengine=libaio
direct=1
size=8g
filesize=500g
time_based
rw=randrw
rwmixread=70
bs=4k
runtime=120

[job1]
filename=/dev/sdb
numjobs=16
iodepth=1

[job2]
filename=/dev/sdc
numjobs=16
iodepth=1
  • Every job in the job file runs at the same time; the global section applies to all jobs.
  • To run jobs one after another and report them separately, work out the timings and use the startdelay parameter.
  • Testing multiple devices at once from the command line distributes the load unevenly; a job file solves this, being equivalent to running multiple commands in separate windows at the same time.
  • Add the group_reporting parameter to the job file to see aggregated performance statistics when the test completes.
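
As a sketch of the startdelay approach, job2 below starts only after job1's assumed 120-second runtime has elapsed, so the two devices are exercised in sequence rather than concurrently (values are illustrative):

```ini
; Run job2 after job1 by delaying its start.
; Assumes job1 runs for 120 s; startdelay is in seconds.
[global]
ioengine=libaio
direct=1
bs=4k
rw=randrw
rwmixread=70
runtime=120
time_based
group_reporting

[job1]
filename=/dev/sdb
numjobs=16
iodepth=1

[job2]
filename=/dev/sdc
numjobs=16
iodepth=1
; delayed until job1's runtime has passed
startdelay=120
```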

Analysis of test results

# fio -name=hdd7k -filename=/dev/sdb -direct=1 -iodepth=20 -thread -rw=rw -ioengine=libaio -bs=4k -size=10G -numjobs=10 -runtime=300 -group_reporting
hdd7k: (g=0): rw=rw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=20
...
fio-3.29-7-g01686
Starting 10 threads
Jobs: 10 (f=9): [f(10)][100.0%][r=10.9MiB/s,w=10.7MiB/s][r=2785,w=2742 IOPS][eta 00m:00s] 
hdd7k: (groupid=0, jobs=10): err= 0: pid=73595: Thu Jan 13 11:19:06 2022
  read: IOPS=3258, BW=12.7MiB/s (13.3MB/s)(3820MiB/300065msec)
    slat (usec): min=2, max=547, avg= 6.47, stdev= 4.04
    clat (usec): min=27, max=4722.3k, avg=33300.91, stdev=77842.79
     lat (usec): min=47, max=4722.3k, avg=33307.62, stdev=77842.14
    clat percentiles (usec):
     |  1.00th=[    725],  5.00th=[   1745], 10.00th=[   2737],
     | 20.00th=[   3228], 30.00th=[   3687], 40.00th=[   4621],
     | 50.00th=[   5866], 60.00th=[   7832], 70.00th=[  14615],
     | 80.00th=[  39060], 90.00th=[ 116917], 95.00th=[ 135267],
     | 99.00th=[ 333448], 99.50th=[ 463471], 99.90th=[ 843056],
     | 99.95th=[1052771], 99.99th=[1686111]
   bw (  KiB/s): min=  200, max=91945, per=100.00%, avg=13408.99, stdev=1441.98, samples=5840
   iops        : min=   50, max=22985, avg=3351.87, stdev=360.48, samples=5840
  write: IOPS=3261, BW=12.7MiB/s (13.4MB/s)(3823MiB/300065msec); 0 zone resets
    slat (usec): min=3, max=532, avg= 6.68, stdev= 4.17
    clat (usec): min=38, max=4725.1k, avg=28027.57, stdev=69203.56
     lat (usec): min=50, max=4725.2k, avg=28034.48, stdev=69202.94
    clat percentiles (usec):
     |  1.00th=[    848],  5.00th=[   1860], 10.00th=[   2802],
     | 20.00th=[   3294], 30.00th=[   3752], 40.00th=[   4621],
     | 50.00th=[   5800], 60.00th=[   7439], 70.00th=[  11994],
     | 80.00th=[  27919], 90.00th=[ 111674], 95.00th=[ 122160],
     | 99.00th=[ 265290], 99.50th=[ 392168], 99.90th=[ 784335],
     | 99.95th=[ 977273], 99.99th=[1904215]
   bw (  KiB/s): min=  208, max=91939, per=100.00%, avg=13475.63, stdev=1444.81, samples=5818
   iops        : min=   52, max=22984, avg=3368.52, stdev=361.18, samples=5818
  lat (usec)   : 50=0.01%, 100=0.01%, 250=0.06%, 500=0.32%, 750=0.53%
  lat (usec)   : 1000=0.74%
  lat (msec)   : 2=4.20%, 4=28.23%, 10=32.04%, 20=8.83%, 50=8.44%
  lat (msec)   : 100=4.75%, 250=10.40%, 500=1.10%, 750=0.23%, 1000=0.07%
  lat (msec)   : 2000=0.05%, >=2000=0.01%
  cpu          : usr=0.18%, sys=0.45%, ctx=964464, majf=0, minf=210
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=977794,978699,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=20

Run status group 0 (all jobs):
   READ: bw=12.7MiB/s (13.3MB/s), 12.7MiB/s-12.7MiB/s (13.3MB/s-13.3MB/s), io=3820MiB (4005MB), run=300065-300065msec
  WRITE: bw=12.7MiB/s (13.4MB/s), 12.7MiB/s-12.7MiB/s (13.4MB/s-13.4MB/s), io=3823MiB (4009MB), run=300065-300065msec

Disk stats (read/write):
  sdb: ios=480787/482996, merge=492621/492299, ticks=6820973/5217066, in_queue=7483700, util=100.00%

While fio is running, it displays the status of the tasks, as follows:

# running
Jobs: 10 (f=10): [M(10)][2.7%][r=2508KiB/s,w=2856KiB/s][r=627,w=714 IOPS][eta 04m:52s]

# finished
Jobs: 10 (f=9): [f(10)][100.0%][r=10.9MiB/s,w=10.7MiB/s][r=2785,w=2742 IOPS][eta 00m:00s]
  • The characters in the first set of square brackets show the current state of each thread; the first character corresponds to the first job defined in the job file, and so on. Possible values (in typical lifetime order) are:
P      Thread setup, but not started.
C      Thread created.
I      Thread initialized, waiting or generating necessary data.
p      Thread running pre-reading file(s).
/      Thread is in ramp period.
R      Running, doing sequential reads.
r      Running, doing random reads.
W      Running, doing sequential writes.
w      Running, doing random writes.
M      Running, doing mixed sequential reads/writes.
m      Running, doing mixed random reads/writes.
D      Running, doing sequential trims.
d      Running, doing random trims.
F      Running, currently waiting for fsync(2).
V      Running, doing verification of written data.
f      Thread finishing.
E      Thread exited, not reaped by main thread yet.
-      Thread reaped.
X      Thread reaped, exited with an error.
K      Thread reaped, exited due to signal.
  • The running line above means: 10 jobs executing, 10 open file descriptors, all 10 doing mixed sequential reads/writes (M), 2.7% complete, reading at 2508KiB/s and writing at 2856KiB/s, read IOPS 627, write IOPS 714, with an estimated 04m:52s remaining.
  • When fio finishes, or is cancelled with ctrl+c, it prints the full results. The key fields are explained below:
hdd7k: (groupid=0, jobs=10): err= 0: pid=73595: Thu Jan 13 11:19:06 2022
  • The job name, groupid, number of aggregated jobs, last error id (0 means no error), pid, and completion time.

  • read/write/trim: IOPS is the average IOPS per second; BW is the average bandwidth, given first in binary units and then, in parentheses, in decimal units; the last two values are the total IO size (binary) and the total running time.

  • slat: submission latency, the time spent issuing the asynchronous IO command (min minimum, max maximum, avg average, stdev standard deviation). For synchronous I/O this row is not displayed, since there the submission latency effectively is the completion latency. The unit can be nanoseconds, microseconds, or milliseconds; fio picks an appropriate one automatically. In --minimal mode, latencies are always reported in microseconds.

  • clat: completion latency, the time from submitting the IO command to the kernel until it completes, excluding submission latency. For synchronous IO, clat is usually equal to (or very close to) 0, because the time from submission to completion is just CPU time (the IO was already performed during submission).

  • lat: total latency, the time from when fio creates the IO unit until the IO completes.

  • clat percentiles: the distribution of completion latency by percentile. Note that, per the source code, this is not slat + clat; it has its own structure. For interpretation, see lat (nsec/usec/msec) below.

  • bw: sample-based bandwidth statistics. They are most meaningful when sampled within the same group of threads on the same disk, since contended access to the disk matches real conditions.

  • iops: sample-based IOPS statistics; same interpretation as bw.

  • lat (nsec/usec/msec): the distribution of I/O completion latencies, the time from when IO leaves fio to when it completes. Unlike the separate read/write/trim sections above, the data here and below apply to all IO of the reporting group. 50=0.01% means 0.01% of IOs completed in under 50us; 100=0.01% means 0.01% took 50 to 100us to complete; and so on.

  • cpu: CPU usage: user and system time, the number of context switches the thread experienced, and finally the number of major and minor page faults. The CPU utilization figures are averages over the jobs in the reporting group, while the context-switch and fault counters are summed.

  • IO depths: the distribution of IO depths over the job lifetime. The buckets are powers of 2 and each entry covers depths from its value up to just below the next entry, e.g. 16= covers depths 16 to 31. Note that a depth entry may not cover the same range as the equivalent submit/complete entry.

  • IO submit: how many IOs were submitted per submit call. Each entry covers counts from above the previous entry up to its own value; e.g. 4=100% means every submit call issued between 1 and 4 IOs. Note that the range covered by a submit entry can differ from that of the equivalent depth entry.

  • IO complete: same as submit, but for completed IOs.

  • IO issued rwts: the number of read/write/trim requests issued, and how many were short or dropped.

  • IO latency: these values are used with latency_target and related options. When those options are set, this section describes the IO depth required to meet the specified latency target.
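
As a quick sanity check on the sample report above, the lat (usec/msec) buckets partition all IOs of the group, so their percentages should add up to roughly 100% (the small excess is rounding in the report):

```python
# lat() bucket percentages copied from the sample report above.
lat_pct = [0.01, 0.01, 0.06, 0.32, 0.53, 0.74,           # usec: 50..1000
           4.20, 28.23, 32.04, 8.83, 8.44,               # msec: 2..50
           4.75, 10.40, 1.10, 0.23, 0.07, 0.05, 0.01]    # msec: 100..>=2000
print(f"total: {sum(lat_pct):.2f}%")   # ~100%, up to rounding
```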

The above is the detailed parameter description in the report, and the summary report data is explained below:

Run status group 0 (all jobs):
 READ: bw=12.7MiB/s (13.3MB/s), 12.7MiB/s-12.7MiB/s (13.3MB/s-13.3MB/s), io=3820MiB (4005MB), run=300065-300065msec
WRITE: bw=12.7MiB/s (13.4MB/s), 12.7MiB/s-12.7MiB/s (13.4MB/s-13.4MB/s), io=3823MiB (4009MB), run=300065-300065msec

Disk stats (read/write):
sdb: ios=480787/482996, merge=492621/492299, ticks=6820973/5217066, in_queue=7483700, util=100.00%
  • bw: the minimum and maximum bandwidth across the threads in this group, followed by the aggregate bandwidth of all threads in the group. Values outside the parentheses are in binary units; the values in parentheses are the equivalents in decimal units.
  • io: the total IO performed by all threads in the group; same format as bw.
  • run: the shortest and longest runtimes among the threads in this group.
  • ios: the IO count across all thread groups.
  • merge: the number of merges performed by the IO scheduler.
  • ticks: the number of ticks during which the disk was kept busy.
  • in_queue: the total time spent in the disk queue.
  • util: disk utilization. 100% means the disk was kept busy the whole time; 50% means it was idle half the time.
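
The summary numbers are internally consistent: bandwidth is simply total io divided by run time, which can be checked against the READ line above:

```python
# Cross-check of the summary above: bw = io / run.
io_mib = 3820            # READ: io=3820MiB
run_ms = 300065          # run=300065msec
bw = io_mib / (run_ms / 1000)
print(f"{bw:.1f} MiB/s")   # matches the reported 12.7 MiB/s
```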

Precautions

Test Notes

  1. fio reads and writes the system's disks and risks destroying customer data, so use it with great caution in a production environment. Raw-disk read/write tests on production disks are not recommended.
  2. If you must test, it is best to create a file and read/write only that file.
  3. For performance analysis or tuning, collect iostat data during the test, for example: iostat -xd 1 1000 > result.log.

Performance Tuning Considerations

  1. When running fio performance comparisons, keep all parameters identical; otherwise parameter differences will skew the results.
  2. When testing a file system, write a large file first and then run the IO read/write test against it, so that small-block IO does not land elsewhere.
  3. For SSDs, it is best to pin fio to specific cores during the test: with very high IOPS, if a thread runs on CPU0 or a core occupied by other applications, peak performance may not be reached. Bind cores with the taskset -c 31 ./fio… command; to decide which cores to use, run numactl -H to inspect the core layout.

Origin: blog.csdn.net/qq_38505858/article/details/127386904