Description of BlueStore's handling of non-chunk-aligned small writes in EC mode

Part 1: Principles Introduction

Environment:
EC mode: 2+1
dd if=/dev/zero of=file bs=6k count=10 oflag=direct
After the EC split, each 6k IO turns into a 3k write on each OSD, which is then padded up to bdev_size (4k); a short calculation of this is sketched below.
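As a quick illustration of that arithmetic, here is a minimal Python sketch (not Ceph code; the 2+1 layout and the 4k bdev_size are taken from the description above):

# Toy calculation: how a client IO is split by a k+m EC pool and how
# BlueStore pads the resulting chunk up to the block device size.
def per_osd_write(io_size_k, k=2, bdev_size_k=4):
    chunk = io_size_k // k                            # data chunk landing on each OSD
    padded = -(-chunk // bdev_size_k) * bdev_size_k   # round up to bdev_size
    return chunk, padded

print(per_osd_write(6))    # (3, 4): the 6k dd write becomes a 3k chunk, padded to 4k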

1. First write
A new blob (blob1) is created (blob1 size = min_alloc_size = 64k) and space for it is allocated from disk.
The 3k IO is zero-padded to align with bdev_size and written into blob1 as extent1.
(figure)
2. Second write (overwrites the zero-padded portion of extent1)
There are two ways to handle this, selected with the configuration parameter bluestore_clone_cow.
(figure)
By default bluestore_clone_cow = true, so the overwrite uses the COW path.
(figure)
In COW mode the second write IO is split into two writes (a toy sketch follows this list):
clone the original lextent1;
use the first 1k of the IO to fill out lextent1's block and write it as lextent2;
zero-pad the remaining 2k of the IO and write it as lextent3;
release lextent1's reference to pextent1;
finally remove the temporary onode created by the clone, which drops the temporary lextent's reference to pextent1 and thereby frees blob1.
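A toy model of this decomposition (illustrative only, not BlueStore's actual code; clone_cow_overwrite and the byte buffers are made up for the example):

# Toy model of the COW overwrite above: the second 3k chunk is turned into
# two 4k small writes plus reference bookkeeping. Not BlueStore code.
BDEV = 4 * 1024

def clone_cow_overwrite(first_chunk_3k, second_chunk_3k):
    ops = ["clone lextent1 (temporary onode also references pextent1)"]
    # merge: the first 1k of the new IO completes the first 4k block -> lextent2
    block1 = first_chunk_3k + second_chunk_3k[:1024]
    ops.append(("small_write lextent2", len(block1)))   # 4096 bytes
    # the remaining 2k is zero-padded to bdev_size -> lextent3
    block2 = second_chunk_3k[1024:] + b"\0" * (BDEV - 2 * 1024)
    ops.append(("small_write lextent3", len(block2)))   # 4096 bytes
    # dropping lextent1's reference and removing the temporary onode
    # releases the last reference to pextent1 and frees blob1
    ops.append("release references -> free blob1")
    return ops

for op in clone_cow_overwrite(b"a" * 3072, b"b" * 3072):
    print(op)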
(1) _do_clone_range
(figure)
After the clone, the on-disk extent1 is referenced by two lextents, one on blob1 and one on blob2, and blob1 becomes a shared_blob (blob->ref = 2).

(2) do_write_small writes the first, fully assembled 4k
(figure)
do_write_small—new_blob
(figure)
_do_alloc_write txc 0x7f5026c33080 1 blobs
(figure)
(3) do_write_small writes the remaining 2k of data (pad_zero fills it to bdev_size)
(figure)
(4) Garbage collection
estimate gc range(hex): [0, 1000)
(figure)
Release lextent1's reference to blob1.
(figure)
Mapping after garbage collection:
blob1's reference count drops by 1, and the first 4k of data in blob1 is now mapped to extent2.
(figure)
(5) Remove the temporary onode and reclaim blob1's disk space
(figure)
Deleting the temporary onode frees blob1.
(figure)

3. Third IO
Likewise, when the third IO arrives it is split into two write operations:
overwrite extent2 via COW;
zero-pad the remainder and write extent3.
(figure)
The IO mapping after each op is as follows:
(figure)
Analysis:
Since the third IO only overwrites lextent2, only that part goes through COW; its data is written to a new blob (blob3). The old blob (blob2) still holds the already-written extent1 and therefore cannot be freed.
Likewise, blob4 is allocated to carry out the fourth write IO.
The IO mapping after that looks as follows:
(figure)
When the next IO arrives, nothing needs to be overwritten, so no new blob is allocated and blob4 is reused.
In summary, when the per-OSD IO produced by splitting an EC small write is not aligned to bdev_size, COW overwrites waste disk space (a quick check of the ratios is sketched after this list):
for example, in 2+1 mode with sequential 6k IO writes,
3/4 of the disk space is wasted (every 4 writes allocate 3 new blobs);
in 2+1 mode with sequential 4k IO writes,
1/2 of the disk space is wasted (every 2 writes allocate 1 new blob).
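A minimal check of those two ratios, under the assumption (taken from the walkthrough above) that every sequential write whose per-OSD offset is not bdev_size-aligned triggers a COW into a newly allocated blob:

# Count, per alignment cycle, how many sequential writes of a given per-OSD
# chunk size start at an offset that is not bdev_size-aligned and therefore
# allocate a new blob (assumption from the walkthrough above).
from math import lcm

def new_blobs_per_cycle(per_osd_chunk_k, bdev_k=4):
    writes_per_cycle = lcm(per_osd_chunk_k, bdev_k) // per_osd_chunk_k
    unaligned = sum(1 for i in range(writes_per_cycle)
                    if (i * per_osd_chunk_k) % bdev_k != 0)
    return unaligned, writes_per_cycle

print(new_blobs_per_cycle(3))   # (3, 4): 3 new blobs every 4 writes -> 3/4 wasted
print(new_blobs_per_cycle(2))   # (1, 2): 1 new blob every 2 writes  -> 1/2 wasted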
Part 2: Supporting test data and application scenarios
(1) Disk-waste test (COW wastes disk space, RMW does not)
In COW mode, look at whether the IO that lands on each OSD after the split is aligned to bdev_size (4k):
if the size is 1k, one blob is allocated every 4 writes;
if the size is 2k, 2 blobs are allocated every 4 writes;
if the size is 3k, 3 blobs are allocated every 4 writes;
if the size is aligned to bdev_size, no disk space is wasted.
(figure)
(2) Test: overwrite the space wasted in COW mode using RMW mode, and verify the disk space is reclaimed
(figure)
(3) Test: after a COW overwrite, overwrite again in COW mode
(figure)

Part 3: Why RMW mode wastes no disk space
(figure)
The existing data is read, merged with the new data, and written back directly; no additional space is allocated.
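A toy sketch of that read-modify-write idea (illustrative only; not BlueStore's actual small-write path):

# Toy read-modify-write: merge new bytes into an existing 4k block and
# write the block back in place, so no new blob is needed.
BDEV = 4 * 1024

def rmw_overwrite(block, offset, new_data):
    assert len(block) == BDEV and offset + len(new_data) <= BDEV
    # read ... merge ... write back to the same pextent
    return block[:offset] + new_data + block[offset + len(new_data):]

existing = b"a" * 3072 + b"\0" * 1024             # 3k of data plus 1k of zero padding
merged = rmw_overwrite(existing, 3072, b"b" * 1024)
print(len(merged))                                # 4096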
Part 4: Why BlueStore uses COW mode by default rather than RMW
Environment:
The CephFS data pool is a 2+1 EC pool made up of osd.0, osd.2 and osd.3.
To keep metadata writes from distorting the small-write counters, the metadata pool is created on other OSDs, isolated from the data OSDs.
(I) Using osd.0 as the example: sequential small writes clone in COW mode
(figure)
(1) Before the write operations:
(figure)
bluestore_write_small = 3183
(2) Run a write with bs=6k count=36:
(figure)
(3) After it completes, osd.0's statistics are:
(figure)
bluestore_write_small = 3237
After the dd, BlueStore has performed 3237 - 3183 = 54 small writes.
Here is why osd.0, in COW mode, performs 54 small writes for 36 incoming 3k IOs (after the split):
number of clones needed for 36 3k IOs = 36 * 3 / 4 = 27 (see Part 1 for the reasoning);
small writes per clone = 2 (one for the merged data, one for the new zero-padded data);
so the total number of small writes is 27 * 2 = 54 (reproduced in the short script below).
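The same arithmetic as a tiny Python check (it simply restates the numbers above):

# COW mode on osd.0: 36 IOs of 3k each (after the EC split), 4k bdev_size.
num, size_k, bdev_k = 36, 3, 4
clones = num * size_k // bdev_k      # 27 clones (see Part 1)
small_writes = clones * 2            # 2 small writes per clone in COW mode
print(clones, small_writes)          # 27 54 -> matches 3237 - 3183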
(II) Change osd.0's clone mode to RMW
(figure)
Run the same test, issuing 36 IOs of 3k each, and count the small writes actually performed.
(1) Statistics before issuing the IO:
(figure)
bluestore_write_small = 3237
(2) Issue the IO:
(figure)
After the split, the IO landing on osd.0 is 36 writes of 3k each.

(3) After the writes complete, osd.0's statistics:
(figure)
bluestore_write_small = 3318
After the dd, BlueStore has performed 3318 - 3237 = 81 small writes.
Here is why osd.0, in RMW mode, performs 81 small writes for 36 incoming 3k IOs (after the split):
number of clones needed for 36 3k IOs = 36 * 3 / 4 = 27 (see Part 1);
small writes per clone = 3 (the two from the COW case, plus one more during the clone itself);
(figure)
so the total number of small writes is 27 * 3 = 81.
In summary, BlueStore amplifies small writes, and the amount of amplification depends on the clone mode used for overwrites:
when the clone mode is COW, each clone adds 1x of small_write amplification;
when the clone mode is RMW, each clone adds 2x of small_write amplification.

cow :
bluestore_write_small = num * size / bdev_size * 2

rmw:
bluestore_write_small = num * size / bdev_size * 3

where:
num: the number of write IOs issued
size: the size of the IO that lands on each OSD after the EC split, in KB
bdev_size: 4k by default

num * size / bdev_size is the number of clones (i.e. the number of reads) performed when non-aligned small IOs are issued; a quick check of both formulas follows.
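The two formulas as a small helper, checked against the counts measured in Part 4 (a sketch; the parameter names are chosen here for illustration):

# Expected extra bluestore_write_small count for non-aligned EC small writes,
# per the formulas above.
def expected_small_writes(num, size_k, mode, bdev_k=4):
    clones = num * size_k // bdev_k           # number of clones (= number of reads)
    per_clone = 2 if mode == "cow" else 3     # rmw does one extra small write per clone
    return clones * per_clone

print(expected_small_writes(36, 3, "cow"))    # 54 -> matches 3237 - 3183
print(expected_small_writes(36, 3, "rmw"))    # 81 -> matches 3318 - 3237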

Part 5: Summary
Although RMW mode does not waste disk space, and can even reclaim previously wasted space when overwriting, it produces more write amplification than COW mode for small overwrites, and its deferred writes bring extra read operations, so BlueStore's small-write performance hits a bigger bottleneck. COW mode wastes disk space to varying degrees (depending on the size of the split IO: the closer it is to bdev_size without being aligned, the more is wasted; see Parts 1 and 2), but its write amplification for small writes is relatively small, so the impact on performance is smaller. In a real deployment the mode can be configured flexibly according to the workload's needs.

Origin blog.csdn.net/qq_23929673/article/details/104033338