Today Qushan asked me online:

"In my tests I found that if I cp a file and then read it with direct I/O, the read takes a very long time.
My guess is that direct I/O cannot use the page cache, and since the whole file sits in the cache after the cp, it has to be forced out to disk before it can be read?
The file I cp'ed is large, over 256 MB."
Data files are opened in buffered I/O mode by default, which means their contents are staged in the page cache first: writes create a large number of dirty pages, and as long as the kernel is not short on memory those pages simply stay there. Direct I/O, by contrast, bypasses the page cache and issues device I/O directly, so before a direct read can start, the kernel has to make sure the data has actually reached the medium; for a large file that flush can take quite a while. From the page cache's writeback behavior we know that background writeback does not kick in while dirty pages stay below the dirty threshold (10% of memory by default). Our test machine has 48 GB of memory, so a ~100 MB file is nowhere near that threshold: no writeback happens on its own, and we can observe the phenomenon cleanly.
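To make the two modes concrete, here is a minimal sketch (my own illustration, not part of the original experiment; the file name demo.dat is made up). It writes a file through the page cache without syncing, then reads it back with O_DIRECT; the direct read is exactly where the flush cost shows up. Note that O_DIRECT requires the user buffer, file offset, and transfer size to be aligned, typically to the logical block size.

/* Sketch: buffered write followed by an O_DIRECT read. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* buffered write: data lands in the page cache as dirty pages */
	int wfd = open("demo.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (wfd < 0) { perror("open for write"); return 1; }
	char page[4096];
	memset(page, 'x', sizeof(page));
	if (write(wfd, page, sizeof(page)) != (ssize_t)sizeof(page)) {
		perror("write"); return 1;
	}
	close(wfd); /* no fsync: the page stays dirty in memory */

	/* direct read: bypasses the page cache, so the kernel first has
	 * to write the dirty page back to the device */
	int rfd = open("demo.dat", O_RDONLY | O_DIRECT);
	if (rfd < 0) { perror("open O_DIRECT"); return 1; }
	void *buf;
	if (posix_memalign(&buf, 4096, 4096) != 0) {
		fprintf(stderr, "posix_memalign failed\n"); return 1;
	}
	ssize_t n = read(rfd, buf, 4096);
	printf("direct read returned %zd bytes\n", n);
	free(buf);
	close(rfd);
	return 0;
}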
With that analysis in hand, let's reproduce the problem. Here are my steps:
$ uname -a
Linux rds064075.sqa.cm4 2.6.32-131.21.1.tb477.el6.x86_64 #1 SMP Thu Feb 23 14:24:55 CST 2012 x86_64 x86_64 x86_64 GNU/Linux
$ sudo sysctl vm.drop_caches=3
vm.drop_caches = 3
$ free -m && cat /proc/meminfo | grep -i dirty && \
  time dd if=/dev/urandom of=test.dat count=6144 bs=16384 && \
  free -m && cat /proc/meminfo | grep -i dirty && \
  time dd if=test.dat of=/dev/null count=6144 bs=16384 && \
  free -m && cat /proc/meminfo | grep -i dirty && \
  time dd if=test.dat of=/dev/null count=6144 bs=16384 iflag=direct && \
  free -m && cat /proc/meminfo | grep -i dirty
             total       used       free     shared    buffers     cached
Mem:         48262      22800      25461          0          3         42
-/+ buffers/cache:      22755      25507
Swap:         2047       2047          0
Dirty:               344 kB
6144+0 records in
6144+0 records out
100663296 bytes (101 MB) copied, 15.2308 s, 6.6 MB/s

real    0m15.249s
user    0m0.001s
sys     0m15.228s
             total       used       free     shared    buffers     cached
Mem:         48262      22912      25350          0          3        139
-/+ buffers/cache:      22768      25493
Swap:         2047       2047          0
Dirty:             98556 kB
6144+0 records in
6144+0 records out
100663296 bytes (101 MB) copied, 0.028041 s, 3.6 GB/s

real    0m0.029s
user    0m0.000s
sys     0m0.029s
6144+0 records in
6144+0 records out
100663296 bytes (101 MB) copied, 0.466601 s, 216 MB/s

real    0m0.468s
user    0m0.002s
sys     0m0.101s
             total       used       free     shared    buffers     cached
Mem:         48262      22906      25356          0          3        140
-/+ buffers/cache:      22762      25500
Swap:         2047       2047          0
Dirty:               896 kB
From the experiment we can see that the file is about 101 MB, and the buffered write left roughly 98 MB of dirty pages behind (Dirty rose from 344 kB to 98,556 kB). The buffered read then finished in 0.03 s at 3.6 GB/s, served straight from the cache. The direct read took 0.47 s: it first had to clean the dirty pages covering the file, after which Dirty dropped back to 896 kB. The data itself, however, stayed in the page cache (cached grew from 42 MB to 140 MB and did not shrink afterwards), which matches our expectation.
Next let's analyze this behavior from the source. We know that a VFS file read, whatever the underlying filesystem, starts from generic_file_aio_read.
With help from Wenqing and Sanbai, we found the source location with no effort at all; the lazy way is:
$ stap -L 'kernel.function("generic_file_aio_read")'
kernel.function("generic_file_aio_read@mm/filemap.c:1331") $iocb:struct kiocb* $iov:struct iovec const* $nr_segs:long unsigned int $pos:loff_t $count:size_t
With emacs at the ready, let's look at the implementation of the read path:

mm/filemap.c:1331
/**
 * generic_file_aio_read - generic filesystem read routine
 * @iocb:	kernel I/O control block
 * @iov:	io vector request
 * @nr_segs:	number of segments in the iovec
 * @pos:	current file position
 *
 * This is the "read()" routine for all filesystems
 * that can use the page cache directly.
 */
ssize_t
generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
		unsigned long nr_segs, loff_t pos)
{
	/* ... local declarations and iovec checks elided ... */

	/* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
	if (filp->f_flags & O_DIRECT) {
		loff_t size;
		struct address_space *mapping;
		struct inode *inode;

		mapping = filp->f_mapping;
		inode = mapping->host;
		if (!count)
			goto out; /* skip atime */
		size = i_size_read(inode);
		if (pos < size) {
			retval = filemap_write_and_wait_range(mapping, pos,
					pos + iov_length(iov, nr_segs) - 1);
			if (!retval) {
				retval = mapping->a_ops->direct_IO(READ, iocb,
							iov, pos, nr_segs);
			}
			if (retval > 0) {
				*ppos = pos + retval;
				count -= retval;
			}

			/*
			 * Btrfs can have a short DIO read if we encounter
			 * compressed extents, so if there was an error, or if
			 * we've already read everything we wanted to, or if
			 * there was a short read because we hit EOF, go ahead
			 * and return.  Otherwise fallthrough to buffered io for
			 * the rest of the read.
			 */
			if (retval < 0 || !count || *ppos >= size) {
				file_accessed(filp);
				goto out;
			}
		}
	}
	/* ... buffered read path elided ... */
The source says it plainly: for a file opened with O_DIRECT, the data must first be written back via filemap_write_and_wait_range before the actual read I/O is issued.
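If an application really must read with direct I/O right after buffered writes, it can at least pay this flush cost at a moment of its own choosing rather than inside the first read. Here is a small sketch (my own suggestion, not from the experiment above): calling fdatasync() after the buffered writes forces the dirty pages out, so a later O_DIRECT read finds nothing left to flush. test.dat is the file from the experiment.

/* Sketch: flush buffered writes explicitly so a later O_DIRECT read
 * does not stall inside filemap_write_and_wait_range(). */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("test.dat", O_WRONLY);
	if (fd < 0) { perror("open"); return 1; }
	/* write back the dirty pages now, at a point we control */
	if (fdatasync(fd) < 0) { perror("fdatasync"); close(fd); return 1; }
	close(fd);
	return 0;
}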
As the final step, let's use stap to confirm our earlier experiment:
$ cat dwb.stp
global i;
probe kernel.function("filemap_write_and_wait_range") {
	if (execname() != "dd") next;
	print_backtrace();
	println("===");
	if (i++ > 2) exit();
}
$ sudo stap dwb.stp
 0xffffffff8110e200 : filemap_write_and_wait_range+0x0/0x90 [kernel]
 0xffffffff8110f278 : generic_file_aio_read+0x498/0x870 [kernel]
 0xffffffff8117323a : do_sync_read+0xfa/0x140 [kernel]
 0xffffffff81173c65 : vfs_read+0xb5/0x1a0 [kernel]
 0xffffffff81173da1 : sys_read+0x51/0x90 [kernel]
 0xffffffff8100b172 : system_call_fastpath+0x16/0x1b [kernel]
===
 0xffffffff8110e200 : filemap_write_and_wait_range+0x0/0x90 [kernel]
 0xffffffff811acbc8 : __blockdev_direct_IO+0x228/0xc40 [kernel]
 0xffffffffa008a24a
===
 0xffffffff8110e200 : filemap_write_and_wait_range+0x0/0x90 [kernel]
 0xffffffff8110f278 : generic_file_aio_read+0x498/0x870 [kernel]
 0xffffffff8117323a : do_sync_read+0xfa/0x140 [kernel]
 0xffffffff81173c65 : vfs_read+0xb5/0x1a0 [kernel]
 0xffffffff81173da1 : sys_read+0x51/0x90 [kernel]
 0xffffffff8100b172 : system_call_fastpath+0x16/0x1b [kernel]
===
 0xffffffff8110e200 : filemap_write_and_wait_range+0x0/0x90 [kernel]
 0xffffffff811acbc8 : __blockdev_direct_IO+0x228/0xc40 [kernel]
 0xffffffffa008a24a
===
The call stacks of filemap_write_and_wait_range expose everything: on the direct read path it is hit from generic_file_aio_read, and again inside __blockdev_direct_IO.
Takeaway: filesystems are complicated; better not to mix buffered I/O and direct I/O on the same file!
Have fun!