http://www.zphj1987.com/2016/09/19/%E6%9B%BF%E6%8D%A2OSD%E6%93%8D%E4%BD%9C%E7%9A%84%E4%BC%98%E5%8C%96%E4%B8%8E%E5%88%86%E6%9E%90/
Foreword
I previously wrote about the correct way to remove an OSD, which covered how to remove one while reducing data migration. This article is an extension of that: an analysis and optimization of the bad-disk replacement operation that comes up so often in Ceph operations and maintenance.
The basic environment is two hosts with 8 OSDs each, 16 OSDs in total. The replica count is set to 2 and the PG count to 800, which works out to roughly 100 PGs per OSD. This data will be used to analyze the differences between the approaches.
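That per-OSD figure follows directly from the pool settings; a quick sanity check:

```shell
# 800 PGs x 2 replicas spread across 16 OSDs gives the per-OSD PG count
# quoted above.
pgs=800; replicas=2; osds=16
echo $(( pgs * replicas / osds ))   # prints 100
```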
Before starting the test, the environment is set to noout, and an OSD failure is simulated by stopping the OSD process; the three different treatments are then applied.
Three methods are tested.

Method one: out the OSD, then remove it, then add a new OSD
- Stop the specified OSD process
- out the specified OSD
- crush remove the specified OSD
- Add a new OSD
A production environment is generally set to noout. Of course, you can also leave it unset and let the monitors mark the OSD out on their own, which by default happens five minutes after the process stops. Whether the out is triggered manually or automatically, the resulting data flow is the same; for convenience, these tests trigger it manually, with noout set beforehand as mentioned above.
Get the original distribution before starting the test:

[root@lab8106 ~]# ceph pg dump pgs|awk '{print $1,$15}'|grep -v pg > pg1.txt

This captures the current PG distribution, recording which OSDs each PG sits on, and saves it to pg1.txt so that it can be compared later to work out how much data needs to migrate.
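To make the pipeline concrete, here is a minimal self-contained sketch run on fabricated `ceph pg dump pgs` rows (column 15 is the OSD set in this output format; every field other than the PG id and OSD set is a placeholder):

```shell
# Fake 'ceph pg dump pgs' output: a header row plus two PG rows.
# Only columns 1 (PG id) and 15 (OSD set) matter for the comparison;
# 'grep -v pg' drops the header, whose first field contains "pg".
printf '%s\n' \
  'pg_stat f2 f3 f4 f5 f6 f7 f8 f9 f10 f11 f12 f13 f14 up' \
  '0.3b a b c d e f g h i j k l m [15,0]' \
  '0.3c a b c d e f g h i j k l m [1,8]' |
awk '{print $1,$15}' | grep -v pg
```

This prints `0.3b [15,0]` and `0.3c [1,8]`, one line per PG with its OSD set, which is exactly what gets snapshotted into pg1.txt.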
Stop the specified OSD process:

[root@lab8106 ~]# systemctl stop ceph-osd@15

Stopping the process does not trigger migration, but it does change PG states: if a PG had its primary on the stopped OSD, the copy on another OSD is promoted to primary.
out the OSD:

[root@lab8106 ~]# ceph osd out 15

Before the out is triggered, the PGs should be in active+undersized+degraded; after it, all PGs should gradually return to active+clean. Wait for the cluster to return to normal, then check the PG distribution again:
[root@lab8106 ~]# ceph pg dump pgs|awk '{print $1,$15}'|grep -v pg > pg2.txt

This saves the current PG distribution to pg2.txt. Now compare the PG changes before and after the out. The detailed changes are numerous, so only the changed portion is listed:
[root@lab8106 ~]# diff -y -W 100 pg1.txt pg2.txt --suppress-common-lines
Here we only care about the count, so we just tally the number of changed PGs:

[root@lab8106 ~]# diff -y -W 100 pg1.txt pg2.txt --suppress-common-lines|wc -l

The out changed 102 PGs. Keep this figure in mind; it is used in the statistics later.
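A minimal, self-contained illustration of this counting step, with made-up PG ids and OSD sets:

```shell
# Two fake snapshots: PGs 0.1 and 0.3 change their OSD set, 0.2 does not.
printf '0.1 [15,0]\n0.2 [3,7]\n0.3 [15,2]\n' > before.txt
printf '0.1 [1,0]\n0.2 [3,7]\n0.3 [4,2]\n'  > after.txt
# diff emits one line per changed PG, so wc -l counts the changes.
diff -y -W 100 before.txt after.txt --suppress-common-lines | wc -l   # prints 2
```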
Remove the OSD from the crush map:

[root@lab8106 ~]# ceph osd crush remove osd.15

The crush remove triggers migration again. Wait for the PGs to rebalance, that is, for all of them to become active+clean.
[root@lab8106 ~]# ceph pg dump pgs|awk '{print $1,$15}'|grep -v pg > pg3.txt

This records the current PG distribution.
Now compare the PG changes before and after the crush remove:

[root@lab8106 ~]# diff -y -W 100 pg2.txt pg3.txt --suppress-common-lines|wc -l
Then we add a new OSD:

[root@lab8106 ~]# ceph-deploy osd prepare lab8107:/dev/sdi

After the addition, record the current PG state again:

[root@lab8106 ~]# ceph pg dump pgs|awk '{print $1,$15}'|grep -v pg > pg4.txt
Compare before and after:

[root@lab8106 ~]# diff -y -W 100 pg3.txt pg4.txt --suppress-common-lines|wc -l
The whole replacement process is now complete. Summing the PG changes counted above:

102 + 137 + 167 = 406

That is, this method changes 406 PGs. Since this is a two-host cluster, there may be some amplification; I will not go into that here, because all three test environments are identical and the comparison is purely horizontal. The point is to use the data to analyze the differences between the methods.
Method two: crush reweight to 0 first, then out, then add an OSD

First, restore the environment to its pre-test state.

[root@lab8106 ~]# ceph pg dump pgs|awk '{print $1,$15}'|grep -v pg > 2pg1.txt

Record the original PG distribution.
crush reweight the specified OSD:

[root@lab8106 ~]# ceph osd crush reweight osd.16 0
After waiting for the cluster to rebalance, record the current PG distribution:

[root@lab8106 ~]# ceph pg dump pgs|awk '{print $1,$15}'|grep -v pg > 2pg2.txt
Compare the changes before and after:

[root@lab8106 ~]# diff -y -W 100 2pg1.txt 2pg2.txt --suppress-common-lines|wc -l
crush remove the specified OSD
[root@lab8106 ~]# ceph osd crush remove osd.16
Because the crush weight was already set to 0 above, removing the OSD here causes no PG changes. Then run ceph osd rm osd.16 directly; again there is no PG change.
Add a new OSD:

[root@lab8106 ~]# ceph-deploy osd prepare lab8107:/dev/sdi
After waiting for rebalance, get the current PG distribution:

[root@lab8106 ceph]# ceph pg dump pgs|awk '{print $1,$15}'|grep -v pg > 2pg3.txt
Compare before and after:

[root@lab8106 ~]# diff -y -W 100 2pg2.txt 2pg3.txt --suppress-common-lines|wc -l
The overall PG change is:

166 + 159 = 325
Method three: set norebalance first, then crush remove, then add

Restore the environment to its initial state, then get the current PG distribution:

[root@lab8106 ~]# ceph pg dump pgs|awk '{print $1,$15}'|grep -v pg > 3pg1.txt
Set a few flags on the cluster to prevent migration: norebalance, nobackfill and norecover. They will be unset again at a later step.

[root@lab8106 ~]# ceph osd set norebalance
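The command above shows only norebalance; the other two flags follow the same pattern. A dry-run sketch of the whole set/unset sequence (drop the echo to actually run it against a cluster):

```shell
# Print the flag commands rather than running them, as a dry run.
flags="norebalance nobackfill norecover"
for f in $flags; do echo ceph osd set "$f"; done
# ... crush reweight, crush remove, osd rm and the new-OSD prepare go here ...
for f in $flags; do echo ceph osd unset "$f"; done
```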
crush reweight the specified OSD:

[root@lab8106 ~]# ceph osd crush reweight osd.15 0
Because the flags were set above, only PG state changes appear here, with no real migration. Let's look at the statistics anyway:

[root@lab8106 ~]# ceph pg dump pgs|awk '{print $1,$15}'|grep -v pg > 3pg2.txt

Note that at this point only the mapping has been recalculated; no data has actually moved, which you can confirm by monitoring the network traffic of the two hosts. These changes therefore do not count toward the number of PGs that need to migrate.
crush remove the specified OSD:

[root@lab8106 ~]# ceph osd crush remove osd.15

This deletes the specified OSD from the crush map; once again there is no PG change. Then:

ceph osd rm osd.15

One small thing to note here: do not run ceph auth del osd.15, so that ID 15 stays reserved. That makes the before/after PG changes easy to tell apart; if the replacement OSD reused the same ID, you could not tell whether migration had taken place.
Add a new OSD:

[root@lab8106 ~]# ceph-deploy osd prepare lab8107:/dev/sdi

In my environment, the newly added OSD got ID 16.
Unset the flags

Now we lift the settings made above and watch the data move:

[root@lab8106 ceph]# ceph osd unset norebalance

Only after the flags are unset does the data actually start to move, which you can see from the NIC traffic. Let's look at the final PG changes:

[root@lab8106 ceph]# ceph pg dump pgs|awk '{print $1,$15}'|grep -v pg > 3pg3.txt
Here we only need to compare against the very first PG distribution, because none of the intermediate states actually migrated any data and they do not need to be counted. We can see that 195 PGs changed here.

The total PG migration is:

195
Summary of the data

The table below compares the migration volume of the three methods (PG migration counts in parentheses):
| | Method one | Method two | Method three |
|---|---|---|---|
| Operations performed | stop osd (0); out osd (102); crush remove osd (137); add osd (167) | crush reweight osd (166); out osd (0); crush remove osd (0); add osd (159) | set flags (0); crush reweight osd (0); crush remove osd (0); add osd (195) |
| PGs migrated | 406 | 325 | 195 |
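From the totals above, the relative saving of method three over method one can be checked directly:

```shell
# Method one migrates 406 PGs, method three migrates 195; the saving is
# roughly half, as the article states.
awk 'BEGIN { printf "%.0f%%\n", (406 - 195) / 406 * 100 }'   # prints 52%
```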
The three methods clearly trigger different amounts of migration; handled well, you can save nearly half of the migrated data, which is valuable for a production environment. I recommend trying this in a test environment first and only then operating on production. As long as no disk is formatted, every operation above is reversible, so you can proceed with reasonable confidence. Keep track of what you have done, and after each step check that the PG states are healthy.
Conclusion

Speaking from my own operational experience: I started with the first method, later switched to the second to cut part of the migration, and recently saw material suggesting that migration can be disabled while removing an OSD to prevent excessive, pointless migration. I tested it, and it does reduce the migration volume considerably, which is very useful in certain scenarios. Of course, if you are not too familiar with the process, any of the methods will do; the end result is the same.
Appendix

Someone asked why slow requests appear when following this procedure. After running one more verification, I found that the request path during migration is quite long, so slow requests are easy to trigger.
Suppose we have three OSDs, 0, 1 and 2, with PGs distributed across them. After kicking out osd.2, one possible outcome is that the mapping of some PG (0.3b) changes from [2,0] to [1,0]. At that moment the PG 0.3b directory on osd.1 is actually still empty. If a client request then targets an object that happens to live in PG 0.3b, the backend must first copy that object from the 0.3b on osd.0 into the 0.3b on osd.1 before it can answer the client, so the path naturally becomes longer. With many such requests, front-end performance suffers. The same applies when adding nodes.
When a request hits an object in such an empty PG, the PG state changes like this: from active+degraded to active+recovery_wait+degraded.
The total amount of data to migrate is fixed; the only question is whether it migrates on demand at request time before responding, or migrates ahead of time and then responds. So try to complete this intermediate step as quickly as possible, so that migration finishes and front-end requests can be served promptly.