Optimization and Analysis of OSD Replacement Operations

http://www.zphj1987.com/2016/09/19/%E6%9B%BF%E6%8D%A2OSD%E6%93%8D%E4%BD%9C%E7%9A%84%E4%BC%98%E5%8C%96%E4%B8%8E%E5%88%86%E6%9E%90/

Foreword

I previously wrote about the correct way to remove an OSD, which briefly covered how to reduce migration during removal. This article is an extension of that, focusing on optimizing the bad-disk / disk-replacement step that comes up so often in Ceph operations and maintenance.

Basic environment: two hosts with 8 OSDs each, 16 OSDs in total, replica size set to 2 and 800 PGs, which works out to roughly 100 PGs per OSD on average. This article analyzes the differences between the approaches based on this data.

Before each test starts, noout is set on the environment, and an OSD failure is simulated by stopping an OSD process; the failure is then handled in different ways.

Three methods tested

Method one: out the OSD first, then remove it, then add a new OSD

  1. Stop the specified OSD process
  2. out the specified OSD
  3. crush remove the specified OSD
  4. Add a new OSD

A production environment generally has noout set. Of course you can also leave it unset, in which case control passes to the monitor nodes, which mark the OSD out automatically five minutes (by default) after the process stops. Whether out is triggered manually or automatically, the resulting data flow is the same; for convenience this test triggers it manually, and as mentioned above noout is set in advance.
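As a reference, here is a minimal sketch of that pre-test setup. The mon name lab8106 and the 300-second value are only illustrative assumptions, not recommendations:

# set noout so a stopped OSD is not marked out automatically
ceph osd set noout
# check the automatic down-to-out interval on a monitor (mon name is assumed)
ceph daemon mon.lab8106 config get mon_osd_down_out_interval
# adjust it at runtime if you prefer automatic out with a different delay
ceph tell mon.* injectargs '--mon-osd-down-out-interval 300'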

Get the original PG distribution before starting the test

[root@lab8106 ~]# ceph pg dump pgs|awk '{print $1,$15}'|grep -v pg   > pg1.txt

 

This gets the current PG distribution and saves it to pg1.txt, recording which OSDs each PG sits on, so that later comparisons can show how much data needs to migrate.

Stop the specified OSD process

[root@lab8106 ~]# systemctl stop ceph-osd@15

Stopping the process does not trigger migration, but it does change PG states: for example, if a PG had its primary on the stopped OSD, then after the OSD stops, the replica of that PG on another OSD is promoted to primary.
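A rough way to confirm that nothing has migrated yet and only states have changed is to look at the cluster status and count the degraded PGs (a sketch, simply grepping the state column of pg dump):

# the cluster should report undersized/degraded PGs, but no recovery traffic yet
ceph -s
# count PGs whose state contains "undersized"
ceph pg dump pgs | grep undersized | wc -l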

out the OSD

[root@lab8106 ~]# ceph osd out 15

Before out is triggered, the affected PGs should be in active+undersized+degraded; after out is triggered, all PGs should gradually return to active+clean. Wait for the cluster to return to normal, then check the PG distribution again.
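If you want to script the wait for active+clean, a minimal sketch that polls the plain-text summary of ceph pg stat could look like this:

# poll until no PG is left in a non-clean state
while ceph pg stat | grep -Eq 'degraded|undersized|remapped|backfill|recover|peering'; do
    sleep 10
done
ceph pg stat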

[root@lab8106 ~]# ceph pg dump pgs|awk '{print $1,$15}'|grep -v pg   > pg2.txt

 

The current PG distribution is saved to pg2.txt.
Now compare the PG changes before and after the out. The detailed diff below lists only the changed entries.

[root@lab8106 ~]# diff -y -W 100 pg1.txt pg2.txt  --suppress-common-lines

 

Here we only care about how many PGs changed, so just count the changed lines.

[root@lab8106 ~]# diff -y -W 100 pg1.txt pg2.txt  --suppress-common-lines|wc -l
102

 

So the out step changes 102 PGs. Keep this figure in mind; it is used in the statistics below.

Remove the OSD from the crush map

[root@lab8106 ~]# ceph osd crush remove osd.15

Removing it from crush also triggers migration. Wait for the PGs to rebalance, i.e. for all of them to return to the active+clean state.

[root@lab8106 ~]# ceph pg dump pgs|awk '{print $1,$15}'|grep -v pg   > pg3.txt

 

Get the current PG distribution.
Now compare the PG changes before and after the crush remove.

[root@lab8106 ~]# diff -y -W 100 pg2.txt pg3.txt  --suppress-common-lines|wc -l
137

 

Now add a new OSD

[root@lab8106 ~]# ceph-deploy osd prepare lab8107:/dev/sdi
[root@lab8106 ~]# ceph-deploy osd activate lab8107:/dev/sdi1

 

After the new OSD is added, record the current PG distribution again

[root@lab8106 ~]# ceph pg dump pgs|awk '{print $1,$15}'|grep -v pg   > pg4.txt

 

Compare before and after

[root@lab8106 ~]# diff -y -W 100 pg3.txt pg4.txt  --suppress-common-lines|wc -l
167

 

The whole replacement process is now complete. Adding up the PG changes counted above gives

102 + 137 + 167 = 406

That is, with this method 406 PGs change. Because there are only two hosts, there may be some amplification; that is not discussed in depth here, since all three tests run in the same environment and are only compared against each other, and the data is used purely to analyze the differences between the methods.

Method two: crush reweight to 0 first, then out, then add a new OSD

First, restore the environment to its pre-test state

[root@lab8106 ~]# ceph pg dump pgs|awk '{print $1,$15}'|grep -v pg   > 2pg1.txt

 

Record the original PG distribution

crush reweight the specified OSD

[root@lab8106 ~]# ceph osd crush reweight osd.16 0
reweighted item id 16 name 'osd.16' to 0 in crush map

After waiting for the cluster to rebalance, record the current PG distribution

[root@lab8106 ~]# ceph pg dump pgs|awk '{print $1,$15}'|grep -v pg   > 2pg2.txt
dumped pgs in format plain

 

Compare the changes before and after

[root@lab8106 ~]# diff -y -W 100 2pg1.txt 2pg2.txt  --suppress-common-lines|wc -l
166

 

crush remove the specified OSD

[root@lab8106 ~]# ceph osd crush remove osd.16
removed item id 16 name 'osd.16' from crush map

Because the crush weight was already set to 0 above, removing the OSD from crush does not cause any PG changes,
and running ceph osd rm osd.16 directly afterwards also changes no PGs.
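You can verify that the weight really is 0 before the remove with a quick sketch using ceph osd tree:

# the WEIGHT column for osd.16 should show 0 after the reweight
ceph osd tree | grep 'osd.16'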

Add a new OSD

[root@lab8106 ~]# ceph-deploy osd prepare lab8107:/dev/sdi
[root@lab8106 ~]# ceph-deploy osd activate lab8107:/dev/sdi1

After waiting for the cluster to rebalance, get the current PG distribution

[root@lab8106 ceph]# ceph pg dump pgs|awk '{print $1,$15}'|grep -v pg   > 2pg3.txt

 

Compare before and after

[root@lab8106 ~]# diff -y -W 100 2pg2.txt 2pg3.txt --suppress-common-lines|wc -l
159

 

The total PG change is

166 + 159 = 325

Method three: set norebalance first, then crush remove, then add a new OSD

Restore the environment to its initial state, then get the current PG distribution

[root@lab8106 ~]# ceph pg dump pgs|awk '{print $1,$15}'|grep -v pg   > 3pg1.txt
dumped pgs in format plain

 

Set several flags on the cluster to prevent migration

Set norebalance, nobackfill, and norecover; these flags will be unset again at a later step.

[root@lab8106 ~]# ceph osd set norebalance
set norebalance
[root@lab8106 ~]# ceph osd set nobackfill
set nobackfill
[root@lab8106 ~]# ceph osd set norecover
set norecover
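To confirm the flags really are in place before going on, a quick sketch is to check the flags line of the OSD map:

# the flags line should now include noout,nobackfill,norebalance,norecover
ceph osd dump | grep flags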

 

crush reweight the specified OSD

[root@lab8106 ~]# ceph osd crush reweight osd.15 0
reweighted item id 15 name 'osd.15' to 0 in crush map

Because the flags were set above, this step only changes the PG states and mappings; no real migration happens. Let's look at the statistics.

[root@lab8106 ~]# ceph pg dump pgs|awk '{print $1,$15}'|grep -v pg   > 3pg2.txt
[root@lab8106 ~]# diff -y -W 100 3pg1.txt 3pg2.txt --suppress-common-lines|wc -l
158

 

Note that at this point the new mapping has only been calculated; no data has actually moved, which you can confirm by monitoring the network traffic on the two hosts. These changes therefore do not count toward the number of PGs that need to migrate.
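A simple way to watch for real data movement on both hosts is to monitor NIC throughput, for example with sar (a sketch; it assumes the sysstat package is installed and that the cluster network is on eth0, adjust the interface name to your environment):

# refresh the network counters every second; with the flags set, traffic should stay near idle
sar -n DEV 1 | grep eth0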

crush remove the specified OSD

[root@lab8106 ~]# ceph osd crush remove osd.15

Delete the specified OSD

There is likewise no PG change after this removal

ceph osd rm osd.15

 

One small thing to note here: do not run ceph auth del osd.15, so that ID 15 stays reserved. That makes it easy to tell the before/after PG changes apart; if the new OSD reused the same ID, you could not tell whether migration had happened.

Add a new OSD

[root@lab8106 ~]# ceph-deploy osd prepare lab8107:/dev/sdi
[root@lab8106 ~]# ceph-deploy osd activate lab8107:/dev/sdi1

In my environment the newly added OSD got the ID 16.

Unset the flags

Now we release the flags set above and see how the data moves.

[root@lab8106 ceph]# ceph osd unset norebalance
unset norebalance
[root@lab8106 ceph]# ceph osd unset nobackfill
unset nobackfill
[root@lab8106 ceph]# ceph osd unset norecover
unset norecover

 

Only after the flags are unset does the data actually start to move, which you can see from the NIC traffic. Let's look at the final PG changes.
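You can also follow the recovery live until everything is back to active+clean, for example with ceph -w (a sketch; stop it with Ctrl-C once the recovery events stop):

# stream cluster events; recovery/backfill lines should taper off as PGs return to active+clean
ceph -w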

[root@lab8106 ceph]# ceph pg dump pgs|awk '{print $1,$15}'|grep -v pg   > 3pg3.txt
dumped pgs in format plain
[root@lab8106 ~]# diff -y -W 100 3pg1.txt 3pg3.txt --suppress-common-lines|wc -l
195

 

Here we only need to compare against the very first PG distribution, because the intermediate states did not actually migrate any data and so do not need to be counted; we can see that 195 PGs moved here.
The total PG migration is therefore

195

Data summary

Now let's compare the migration volume of the three methods in a table (the numbers in parentheses are migrated PG counts).

                     Method one               Method two                 Method three
Operations           stop osd (0)             crush reweight osd (166)   set flags (0)
performed            out osd (102)            out osd (0)                crush reweight osd (0)
                     crush remove osd (137)   crush remove osd (0)       crush remove osd (0)
                     add osd (167)            add osd (159)              add osd (195)
PGs migrated         406                      325                        195

It is clear that the three methods trigger different amounts of migration. Handled well, you can save roughly half of the migrated data, which matters a lot in production. I recommend trying this on a test environment first and only then operating on production. As long as you do not format the disks, all of the operations above are reversible, so they can be done with reasonable confidence. Remember what you have done, and after each step check that the PG states are normal.

Summary

From my own operational experience, I started with the first method and later switched to the second, which already reduced part of the migration. Recently I read that migration can be disabled while kicking out an OSD to avoid needless extra migration, so I tested it, and it does indeed cut the migration volume considerably, which is very useful in some scenarios. Of course, if you are not too familiar with the procedure, any of the methods works; the end result is the same.

Appendix

Someone asked why slow requests appear when following this procedure. After running a verification, I found that the request path during migration is still quite long, so slow requests are easy to hit.

Suppose we have three OSDs, 0, 1 and 2, with PGs distributed across them. After kicking out osd.2, one possible outcome is that some PG (say 0.3b) has its mapping change from [2,0] to [1,0].
At that point the 0.3b directory on osd.1 is actually still empty. If a client request then hits an object that maps to PG 0.3b, the cluster first has to copy that object from 0.3b on osd.0 into 0.3b on osd.1 before it can answer the client, so the path naturally becomes longer. If many such requests pile up, front-end performance suffers. The same thing happens when adding a node.

When an object in such an empty PG is requested, the PG state changes like this:

from active+degraded to active+recovery_wait+degraded

The total amount of data to migrate is fixed; the only question is whether it is migrated on demand at request time and then the request is answered, or migrated ahead of time and then answered. So try to finish the intermediate steps as quickly as possible, so that migration completes and front-end requests can be served promptly.
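While the replacement is in progress you can keep an eye on exactly this situation with something like the following sketch; the grep patterns simply match the warnings and states mentioned above:

# warnings about slow/blocked requests show up in health detail
ceph health detail | grep -iE 'slow|blocked'
# PGs that will only answer requests after on-demand recovery
ceph pg dump pgs | grep recovery_wait | wc -l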
