Today Qushan asked me online:

"In my tests I found that if I cp a file and then read it with direct I/O, the read takes a very long time.
My guess is that direct I/O cannot use the page cache, and since the whole file sits in the cache after the cp, it has to be forced out to disk before it can be read?
The file I cp'ed is large, over 256 MB."
Data files are opened in buffered I/O mode by default, which means their contents are staged in the page cache first: writes create a large number of dirty pages, and as long as the kernel is not short on memory those pages simply stay there. Direct I/O, by contrast, bypasses the page cache and issues device I/O directly, so before a direct read can start, the kernel has to make sure the data has actually reached the medium; for a large file that flush can take quite a while. From the page cache's writeback behavior we know that background writeback does not kick in while dirty pages stay below the dirty threshold (10% of memory by default). Our test machine has 48 GB of memory, so a ~100 MB file is nowhere near that threshold: no writeback happens on its own, and we can observe the phenomenon cleanly.
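To make the two modes concrete, here is a minimal sketch (my own illustration, not part of the original experiment; the file name demo.dat is made up). It writes a file through the page cache without syncing, then reads it back with O_DIRECT; the direct read is exactly where the flush cost shows up. Note that O_DIRECT requires the user buffer, file offset, and transfer size to be aligned, typically to the logical block size.

/* Sketch: buffered write followed by an O_DIRECT read. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* buffered write: data lands in the page cache as dirty pages */
	int wfd = open("demo.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (wfd < 0) { perror("open for write"); return 1; }
	char page[4096];
	memset(page, 'x', sizeof(page));
	if (write(wfd, page, sizeof(page)) != (ssize_t)sizeof(page)) {
		perror("write"); return 1;
	}
	close(wfd); /* no fsync: the page stays dirty in memory */

	/* direct read: bypasses the page cache, so the kernel first has
	 * to write the dirty page back to the device */
	int rfd = open("demo.dat", O_RDONLY | O_DIRECT);
	if (rfd < 0) { perror("open O_DIRECT"); return 1; }
	void *buf;
	if (posix_memalign(&buf, 4096, 4096) != 0) {
		fprintf(stderr, "posix_memalign failed\n"); return 1;
	}
	ssize_t n = read(rfd, buf, 4096);
	printf("direct read returned %zd bytes\n", n);
	free(buf);
	close(rfd);
	return 0;
}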
With that analysis in hand, let's reproduce the problem. Here are my steps:
$ uname -a
Linux rds064075.sqa.cm4 2.6.32-131.21.1.tb477.el6.x86_64 #1 SMP Thu Feb 23 14:24:55 CST 2012 x86_64 x86_64 x86_64 GNU/Linux
$ sudo sysctl vm.drop_caches=3
vm.drop_caches = 3
$ free -m && cat /proc/meminfo | grep -i dirty && \
  time dd if=/dev/urandom of=test.dat count=6144 bs=16384 && \
  free -m && cat /proc/meminfo | grep -i dirty && \
  time dd if=test.dat of=/dev/null count=6144 bs=16384 && \
  free -m && cat /proc/meminfo | grep -i dirty && \
  time dd if=test.dat of=/dev/null count=6144 bs=16384 iflag=direct && \
  free -m && cat /proc/meminfo | grep -i dirty
             total       used       free     shared    buffers     cached
Mem:         48262      22800      25461          0          3         42
-/+ buffers/cache:      22755      25507
Swap:         2047       2047          0
Dirty:               344 kB
6144+0 records in
6144+0 records out
100663296 bytes (101 MB) copied, 15.2308 s, 6.6 MB/s

real    0m15.249s
user    0m0.001s
sys     0m15.228s
             total       used       free     shared    buffers     cached
Mem:         48262      22912      25350          0          3        139
-/+ buffers/cache:      22768      25493
Swap:         2047       2047          0
Dirty:             98556 kB
6144+0 records in
6144+0 records out
100663296 bytes (101 MB) copied, 0.028041 s, 3.6 GB/s

real    0m0.029s
user    0m0.000s
sys     0m0.029s
6144+0 records in
6144+0 records out
100663296 bytes (101 MB) copied, 0.466601 s, 216 MB/s

real    0m0.468s
user    0m0.002s
sys     0m0.101s
             total       used       free     shared    buffers     cached
Mem:         48262      22906      25356          0          3        140
-/+ buffers/cache:      22762      25500
Swap:         2047       2047          0
Dirty:               896 kB
From the experiment we can see that the file is about 101 MB, and the buffered write left roughly 98 MB of dirty pages behind (Dirty rose from 344 kB to 98,556 kB). The buffered read then finished in 0.03 s at 3.6 GB/s, served straight from the cache. The direct read took 0.47 s: it first had to clean the dirty pages covering the file, after which Dirty dropped back to 896 kB. The data itself, however, stayed in the page cache (cached grew from 42 MB to 140 MB and did not shrink afterwards), which matches our expectation.
Next let's analyze this behavior from the source. We know that a VFS file read, whatever the underlying filesystem, starts from generic_file_aio_read.
With help from Wenqing and Sanbai, we found the source location with no effort at all; the lazy way is:
$ stap -L 'kernel.function("generic_file_aio_read")'
kernel.function("generic_file_aio_read@mm/filemap.c:1331") $iocb:struct kiocb* $iov:struct iovec const* $nr_segs:long unsigned int $pos:loff_t $count:size_t
With emacs at the ready, let's look at the implementation of the read path:

mm/filemap.c:1331
/**
 * generic_file_aio_read - generic filesystem read routine
 * @iocb:	kernel I/O control block
 * @iov:	io vector request
 * @nr_segs:	number of segments in the iovec
 * @pos:	current file position
 *
 * This is the "read()" routine for all filesystems
 * that can use the page cache directly.
 */
ssize_t
generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
		unsigned long nr_segs, loff_t pos)
{
	/* ... local declarations and iovec checks elided ... */

	/* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
	if (filp->f_flags & O_DIRECT) {
		loff_t size;
		struct address_space *mapping;
		struct inode *inode;

		mapping = filp->f_mapping;
		inode = mapping->host;
		if (!count)
			goto out; /* skip atime */
		size = i_size_read(inode);
		if (pos < size) {
			retval = filemap_write_and_wait_range(mapping, pos,
					pos + iov_length(iov, nr_segs) - 1);
			if (!retval) {
				retval = mapping->a_ops->direct_IO(READ, iocb,
							iov, pos, nr_segs);
			}
			if (retval > 0) {
				*ppos = pos + retval;
				count -= retval;
			}

			/*
			 * Btrfs can have a short DIO read if we encounter
			 * compressed extents, so if there was an error, or if
			 * we've already read everything we wanted to, or if
			 * there was a short read because we hit EOF, go ahead
			 * and return.  Otherwise fallthrough to buffered io for
			 * the rest of the read.
			 */
			if (retval < 0 || !count || *ppos >= size) {
				file_accessed(filp);
				goto out;
			}
		}
	}
	/* ... buffered read path elided ... */
The source says it plainly: for a file opened with O_DIRECT, the data must first be written back via filemap_write_and_wait_range before the actual read I/O is issued.
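If an application really must read with direct I/O right after buffered writes, it can at least pay this flush cost at a moment of its own choosing rather than inside the first read. Here is a small sketch (my own suggestion, not from the experiment above): calling fdatasync() after the buffered writes forces the dirty pages out, so a later O_DIRECT read finds nothing left to flush. test.dat is the file from the experiment.

/* Sketch: flush buffered writes explicitly so a later O_DIRECT read
 * does not stall inside filemap_write_and_wait_range(). */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("test.dat", O_WRONLY);
	if (fd < 0) { perror("open"); return 1; }
	/* write back the dirty pages now, at a point we control */
	if (fdatasync(fd) < 0) { perror("fdatasync"); close(fd); return 1; }
	close(fd);
	return 0;
}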
As the final step, let's use stap to confirm our earlier experiment:
$ cat dwb.stp
global i;
probe kernel.function("filemap_write_and_wait_range") {
	if (execname() != "dd") next;
	print_backtrace();
	println("===");
	if (i++ > 2) exit();
}
$ sudo stap dwb.stp
 0xffffffff8110e200 : filemap_write_and_wait_range+0x0/0x90 [kernel]
 0xffffffff8110f278 : generic_file_aio_read+0x498/0x870 [kernel]
 0xffffffff8117323a : do_sync_read+0xfa/0x140 [kernel]
 0xffffffff81173c65 : vfs_read+0xb5/0x1a0 [kernel]
 0xffffffff81173da1 : sys_read+0x51/0x90 [kernel]
 0xffffffff8100b172 : system_call_fastpath+0x16/0x1b [kernel]
===
 0xffffffff8110e200 : filemap_write_and_wait_range+0x0/0x90 [kernel]
 0xffffffff811acbc8 : __blockdev_direct_IO+0x228/0xc40 [kernel]
 0xffffffffa008a24a
===
 0xffffffff8110e200 : filemap_write_and_wait_range+0x0/0x90 [kernel]
 0xffffffff8110f278 : generic_file_aio_read+0x498/0x870 [kernel]
 0xffffffff8117323a : do_sync_read+0xfa/0x140 [kernel]
 0xffffffff81173c65 : vfs_read+0xb5/0x1a0 [kernel]
 0xffffffff81173da1 : sys_read+0x51/0x90 [kernel]
 0xffffffff8100b172 : system_call_fastpath+0x16/0x1b [kernel]
===
 0xffffffff8110e200 : filemap_write_and_wait_range+0x0/0x90 [kernel]
 0xffffffff811acbc8 : __blockdev_direct_IO+0x228/0xc40 [kernel]
 0xffffffffa008a24a
===
The call stacks of filemap_write_and_wait_range expose everything: on the direct read path it is hit from generic_file_aio_read, and again inside __blockdev_direct_IO.
Takeaway: filesystems are complicated; better not to mix buffered I/O and direct I/O on the same file!
Have fun!