1 What is a PG?
A PG (Placement Group) is, at the bottom, just a directory: on a FileStore OSD each PG appears as a directory under current/, and the objects mapped to that PG are stored inside it.
Below I will walk through PGs step by step.
I have prepared three nodes (admin, node1, node2) with a total of 6 OSDs and min_size=2. min_size=2 means a PG must have at least 2 available replicas; with fewer than 2, the PG stops serving client I/O.
[root@admin ceph]# ceph -s
cluster 55430962-45e4-40c3-bc14-afac24c69acb
health HEALTH_OK
monmap e1: 3 mons at {admin=172.18.1.240:6789/0,node1=172.18.1.241:6789/0,node2=172.18.1.242:6789/0}
election epoch 26, quorum 0,1,2 admin,node1,node2
osdmap e53: 6 osds: 6 up, 6 in
flags sortbitwise,require_jewel_osds
pgmap v373: 128 pgs, 2 pools, 330 bytes data, 5 objects
30932 MB used, 25839 MB / 56772 MB avail
128 active+clean
[root@admin ceph]# ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.05424 root default
-2 0.01808 host admin
0 0.00879 osd.0 up 1.00000 1.00000
3 0.00929 osd.3 up 1.00000 1.00000
-3 0.01808 host node1
1 0.00879 osd.1 up 1.00000 1.00000
4 0.00929 osd.4 up 1.00000 1.00000
-4 0.01808 host node2
2 0.00879 osd.2 up 1.00000 1.00000
5 0.00929 osd.5 up 1.00000 1.00000
[root@admin ~]# ceph osd pool get rbd min_size
min_size: 2
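Besides min_size, the pool's replica count and PG count can be checked and changed the same way. A minimal sketch of the related commands (output omitted; the values depend on your cluster, and the set line is only shown for completeness):
# Check the pool's replica count (size) and PG count
ceph osd pool get rbd size
ceph osd pool get rbd pg_num
# min_size can be changed the same way it is read
ceph osd pool set rbd min_size 2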
[root@admin ~]# ceph df
GLOBAL:
SIZE AVAIL RAW USED %RAW USED
56772M 25839M 30932M 54.48
POOLS:
NAME ID USED %USED MAX AVAIL OBJECTS
rbd 0 216 0 7382M 1
test-pool 1 114 0 7382M 4
In the ceph df output above we can see that the id of the rbd pool is 0, so the PGs in the rbd pool all have ids starting with 0. Let's find out which OSDs a PG in the rbd pool is distributed across, and what its name is.
First, let's upload a file to the rbd pool with rados:
# The following test.txt is the file I will upload
[root@admin tmp]# cat test.txt
abc
123
ABC
# The following command uploads it to the rbd pool under the object name wzl
[root@admin tmp]# rados -p rbd put wzl ./test.txt
# Find out which OSDs the object wzl is distributed across
[root@admin tmp]# ceph osd map rbd wzl
osdmap e53 pool 'rbd' (0) object 'wzl' -> pg 0.ff62cf8d (0.d) -> up ([5,4,3], p5) acting ([5,4,3], p5)
Look: the object wzl uploaded to the rbd pool maps to PG 0.d, which is distributed across osd.5, osd.4 and osd.3.
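As an aside, the PG id 0.d can be derived from the ceph osd map output itself: 0 is the pool id, and d is the object's hash 0xff62cf8d taken modulo pg_num. A minimal sketch of that arithmetic in bash (the hash value is copied from the output above; Ceph actually uses a "stable mod", which equals a plain modulo only when pg_num is a power of two, as 128 is here):
# 0xff62cf8d is the object hash shown by `ceph osd map rbd wzl`
printf '0.%x\n' $(( 0xff62cf8d % 128 ))   # prints 0.d for pg_num=128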
Let's look for this PG on those OSDs:
# Under osd.3 on the admin node
[root@admin tmp]# ll /var/lib/ceph/osd/ceph-3/current/ |grep 0.d
drwxr-xr-x 2 ceph ceph 59 Feb 9 20:06 0.d_head
drwxr-xr-x 2 ceph ceph 6 Feb 9 02:33 0.d_TEMP
# Under osd.4 on node1
[root@node1 ~]# ll /var/lib/ceph/osd/ceph-4/current/|grep 0.d
drwxr-xr-x 2 ceph ceph 59 Feb 9 20:06 0.d_head
drwxr-xr-x 2 ceph ceph 6 Feb 9 02:34 0.d_TEMP
# Under osd.5 on node2
[root@node2 ~]# ll /var/lib/ceph/osd/ceph-5/current/|grep 0.d
drwxr-xr-x 2 ceph ceph 59 Feb 9 20:06 0.d_head
drwxr-xr-x 2 ceph ceph 6 Feb 9 02:34 0.d_TEMP
We can see that the PG directories on osd.5, osd.4 and osd.3 are all named 0.d_head: the PG is stored as three replicas, and each replica has the same name.
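If you look inside one of these 0.d_head directories, you should find the object data itself. A minimal sketch (this applies to the FileStore layout used here; the exact file name carries a hash suffix, so it will differ from cluster to cluster):
# List the objects stored in PG 0.d on osd.5; a file whose name starts with wzl should appear
ls -l /var/lib/ceph/osd/ceph-5/current/0.d_head/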
2 What should I do if the PG is damaged or lost?
Let's first go over the PG states involved.
Degraded
A simple way to understand it: the PG has lost some replicas, but it can still serve client I/O. Let me test it:
As shown above, the PG holding the object wzl is distributed across osd.3, osd.4 and osd.5. What happens if osd.3 goes down?
[root@admin tmp]# systemctl stop ceph-osd@3
[root@admin tmp]# ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.05424 root default
-2 0.01808 host admin
0 0.00879 osd.0 up 1.00000 1.00000
3 0.00929 osd.3 down 1.00000 1.00000
-3 0.01808 host node1
1 0.00879 osd.1 up 1.00000 1.00000
4 0.00929 osd.4 up 1.00000 1.00000
-4 0.01808 host node2
2 0.00879 osd.2 up 1.00000 1.00000
5 0.00929 osd.5 up 1.00000 1.00000
[root@admin tmp]# ceph -s
cluster 55430962-45e4-40c3-bc14-afac24c69acb
health HEALTH_WARN
clock skew detected on mon.node1
66 pgs degraded
66 pgs stuck unclean
66 pgs undersized
recovery 2/18 objects degraded (11.111%)
1/6 in osds are down
Monitor clock skew detected
monmap e1: 3 mons at {admin=172.18.1.240:6789/0,node1=172.18.1.241:6789/0,node2=172.18.1.242:6789/0}
election epoch 26, quorum 0,1,2 admin,node1,node2
osdmap e55: 6 osds: 5 up, 6 in; 66 remapped pgs
flags sortbitwise,require_jewel_osds
pgmap v423: 128 pgs, 2 pools, 138 bytes data, 6 objects
30932 MB used, 25839 MB / 56772 MB avail
2/18 objects degraded (11.111%)
66 active+undersized+degraded
62 active+clean
The above shows that after I stopped osd.3, 66 PGs went into the active+undersized+degraded state, i.e. the degraded state. Let's see whether the file I just uploaded can still be downloaded:
[root@admin tmp]# rados -p rbd get wzl wzl.txt
[root@admin tmp]# cat wzl.txt
abc
123
ABC
This shows that although the cluster is unhealthy and 66 PGs are degraded, they can still serve client I/O.
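To see exactly which PGs are degraded, rather than just the counts in ceph -s, the following commands can be used; a minimal sketch, output omitted:
# Per-PG detail of everything that is not healthy
ceph health detail
# Dump only the PGs stuck in an unclean state
ceph pg dump_stuck unclean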
Peered (seriously injured)
We stopped osd.3 above, so only two copies of PG 0.d remain in the cluster, on osd.4 and osd.5. Let's look at the state of this PG now:
[root@admin tmp]# ceph pg dump|grep ^0.d
dumped all in format plain
0.d 1 0 1 0 0 12 1 1 active+undersized+degraded 2018-02-09 20:35:27.585529 53'1 71:71 [5,4] 5 [5,4] 5 0'0 2018-02-09 01:37:29.127711 0'0 2018-02-09 01:37:29.127711
Now 0.d is only distributed on osd.5 and osd.4, and its state is active+undersized+degraded.
Now let's stop osd.4 and look at the PG state again:
[root@admin tmp]# ceph -s
cluster 55430962-45e4-40c3-bc14-afac24c69acb
health HEALTH_WARN
clock skew detected on mon.node1
99 pgs degraded
19 pgs stuck unclean
99 pgs undersized
recovery 7/18 objects degraded (38.889%)
2/6 in osds are down
Monitor clock skew detected
monmap e1: 3 mons at {admin=172.18.1.240:6789/0,node1=172.18.1.241:6789/0,node2=172.18.1.242:6789/0}
election epoch 26, quorum 0,1,2 admin,node1,node2
osdmap e97: 6 osds: 4 up, 6 in; 99 remapped pgs
flags sortbitwise,require_jewel_osds
pgmap v585: 128 pgs, 2 pools, 138 bytes data, 6 objects
30942 MB used, 25829 MB / 56772 MB avail
7/18 objects degraded (38.889%)
62 active+undersized+degraded
37 undersized+degraded+peered
29 active+clean
[root@admin tmp]# ceph pg dump|grep ^0.d
dumped all in format plain
0.d 1 0 2 0 0 12 1 1 undersized+degraded+peered 2018-02-09 20:42:08.558726 53'1 81:105 [5] 5 [5] 5 0'0 2018-02-09 01:37:29.127711 0'0 2018-02-09 01:37:29.127711
With two OSDs stopped, the number of surviving replicas of 0.d has fallen below min_size, so PG 0.d is now in the undersized+degraded+peered state and no longer serves client I/O. The output above also shows that only one copy of 0.d survives, on osd.5.
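For a deeper look at why a single PG is stuck, the PG can also be queried directly; a minimal sketch (the output is a long JSON document describing the up/acting sets, peering history and recovery state):
# Ask the primary OSD of PG 0.d for its full peering/recovery state
ceph pg 0.d query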
Now let's lower min_size to 1:
[root@admin tmp]# ceph osd pool set rbd min_size 1
set pool 0 min_size to 1
[root@admin tmp]# ceph pg dump|grep ^0.d
dumped all in format plain
0.d 1 0 2 0 0 12 1 1 active+undersized+degraded 2018-02-09 20:59:03.684989 53'1 99:163 [5] 5 [5] 5 0'0 2018-02-09 01:37:29.127711 0'0 2018-02-09 01:37:29.127711
[root@admin tmp]# ceph -s
cluster 55430962-45e4-40c3-bc14-afac24c69acb
health HEALTH_WARN
clock skew detected on mon.node1
99 pgs degraded
19 pgs stuck unclean
99 pgs undersized
recovery 7/18 objects degraded (38.889%)
2/6 in osds are down
Monitor clock skew detected
monmap e1: 3 mons at {admin=172.18.1.240:6789/0,node1=172.18.1.241:6789/0,node2=172.18.1.242:6789/0}
election epoch 26, quorum 0,1,2 admin,node1,node2
osdmap e99: 6 osds: 4 up, 6 in; 99 remapped pgs
flags sortbitwise,require_jewel_osds
pgmap v594: 128 pgs, 2 pools, 138 bytes data, 6 objects
30942 MB used, 25829 MB / 56772 MB avail
7/18 objects degraded (38.889%)
79 active+undersized+degraded
29 active+clean
20 undersized+degraded+peered
After setting min_size=1, the PG is no longer peered but only degraded (active again), cluster health goes back to HEALTH_WARN, and the PG can serve client I/O once more.
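min_size=1 is risky, since writes are accepted with a single surviving copy, so once the failed OSDs are back it should be raised again; a minimal sketch:
# Restore the safer setting once enough OSDs are up again
ceph osd pool set rbd min_size 2
ceph osd pool get rbd min_size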
Remapped (self-healing)
Ceph has a very powerful feature: self-healing. If an OSD stays down for more than 300 seconds, the cluster assumes it is not coming back, marks it out, and starts re-replicating its data onto the remaining OSDs from the surviving copies. This is the self-healing mechanism.
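The 300-second grace period is controlled by the mon_osd_down_out_interval option. A minimal sketch for inspecting or temporarily changing it, assuming a monitor named mon.admin and an example value of 600 seconds (run the daemon command on the node hosting that monitor; injected values are not persistent, so a permanent change belongs in ceph.conf):
# Read the current value from the monitor's admin socket (run on the mon's host)
ceph daemon mon.admin config get mon_osd_down_out_interval
# Inject a new value into all monitors at runtime
ceph tell mon.* injectargs '--mon-osd-down-out-interval 600'
Here is the osd tree after osd.3 and osd.4 have been down for more than 300 seconds: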
[root@admin tmp]# ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.05424 root default
-2 0.01808 host admin
0 0.00879 osd.0 up 1.00000 1.00000
3 0.00929 osd.3 down 0 1.00000
-3 0.01808 host node1
1 0.00879 osd.1 up 1.00000 1.00000
4 0.00929 osd.4 down 0 1.00000
-4 0.01808 host node2
2 0.00879 osd.2 up 1.00000 1.00000
5 0.00929 osd.5 up 1.00000 1.00000
As shown above, within 300 seconds of being stopped, osd.3 and osd.4 appear as down in ceph osd tree but keep a REWEIGHT of 1.00000, meaning they still count as cluster members. After 300 seconds without coming back up, the cluster evicts them: they are marked out and their REWEIGHT drops to 0. At that point the cluster starts recovering from the only remaining copy of the data.
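You do not have to wait for the timeout: an OSD can also be marked out (or back in) by hand, which triggers the same remapping; a minimal sketch:
# Manually evict osd.3 from data placement (its weight goes to 0 and recovery starts)
ceph osd out 3
# Put it back once it is healthy again
ceph osd in 3
Back to the experiment: here is the cluster status while this recovery is in progress: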
[root@admin tmp]# ceph -s
cluster 55430962-45e4-40c3-bc14-afac24c69acb
health HEALTH_ERR
clock skew detected on mon.node1
22 pgs are stuck inactive for more than 300 seconds
19 pgs degraded
66 pgs peering
7 pgs stuck degraded
22 pgs stuck inactive
88 pgs stuck unclean
7 pgs stuck undersized
19 pgs undersized
recovery 1/18 objects degraded (5.556%)
Monitor clock skew detected
monmap e1: 3 mons at {admin=172.18.1.240:6789/0,node1=172.18.1.241:6789/0,node2=172.18.1.242:6789/0}
election epoch 26, quorum 0,1,2 admin,node1,node2
osdmap e106: 6 osds: 4 up, 4 in
flags sortbitwise,require_jewel_osds
pgmap v609: 128 pgs, 2 pools, 138 bytes data, 6 objects
20636 MB used, 16699 MB / 37336 MB avail
1/18 objects degraded (5.556%)
44 remapped+peering
40 active+clean
22 peering
18 active+undersized+degraded
3 activating
1 activating+undersized+degraded
Look: 44 PGs are already in the remapped+peering state (severely injured, but starting to recover).
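Recovery progress can be followed live instead of re-running ceph -s by hand; a minimal sketch:
# Stream cluster status changes and recovery throughput as they happen
ceph -w
# One-line summary of PG states
ceph pg stat
Once recovery has finished, check the status again: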
[root@admin tmp]# ceph -s
cluster 55430962-45e4-40c3-bc14-afac24c69acb
health HEALTH_WARN
clock skew detected on mon.node1
Monitor clock skew detected
monmap e1: 3 mons at {admin=172.18.1.240:6789/0,node1=172.18.1.241:6789/0,node2=172.18.1.242:6789/0}
election epoch 26, quorum 0,1,2 admin,node1,node2
osdmap e107: 6 osds: 4 up, 4 in
flags sortbitwise,require_jewel_osds
pgmap v634: 128 pgs, 2 pools, 138 bytes data, 6 objects
20637 MB used, 16698 MB / 37336 MB avail
128 active+clean
[root@admin tmp]# ceph osd map rbd wzl
osdmap e107 pool 'rbd' (0) object 'wzl' -> pg 0.ff62cf8d (0.d) -> up ([5,1,0], p5) acting ([5,1,0], p5)
See: all PGs are back to active+clean, cluster health is restored, and PG 0.d, which holds the object wzl, has been remapped to osd.5, osd.1 and osd.0.
Note: if we bring the previously stopped OSDs back up, the PG moves back to the OSDs it was originally placed on, and the copies created by the remapping are removed.
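Bringing the OSDs back is just restarting their services on the nodes that host them; a minimal sketch (in this cluster osd.3 lives on admin and osd.4 on node1):
# On the admin node
systemctl start ceph-osd@3
# On node1
systemctl start ceph-osd@4
With both OSDs running again, let's take a look: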
[root@admin tmp]# ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.05424 root default
-2 0.01808 host admin
0 0.00879 osd.0 up 1.00000 1.00000
3 0.00929 osd.3 up 1.00000 1.00000
-3 0.01808 host node1
1 0.00879 osd.1 up 1.00000 1.00000
4 0.00929 osd.4 up 1.00000 1.00000
-4 0.01808 host node2
2 0.00879 osd.2 up 1.00000 1.00000
5 0.00929 osd.5 up 1.00000 1.00000
[root@admin tmp]# ceph osd map rbd wzl
osdmap e113 pool 'rbd' (0) object 'wzl' -> pg 0.ff62cf8d (0.d) -> up ([5,4,3], p5) acting ([5,4,3], p5)
We can see that after the OSDs come back up, the PG returns to the OSDs it was originally distributed on: PG 0.d is back on osd.5, osd.4 and osd.3.
Recover
If the OSDs holding the three replicas of a PG are all up, but one of the replicas is lost or damaged, a scrub will find that the replicas of this PG are inconsistent, and a repair will copy a good replica from another OSD over the damaged one. Let's do an experiment:
Directly delete the PG 0.d directory on osd.3:
[root@admin tmp]# ll /var/lib/ceph/osd/ceph-3/current/|grep 0.d
drwxr-xr-x 2 ceph ceph 59 Feb 9 20:06 0.d_head
drwxr-xr-x 2 ceph ceph 6 Feb 9 02:33 0.d_TEMP
[root@admin tmp]# rm -rf /var/lib/ceph/osd/ceph-3/current/0.d_head/
[root@admin tmp]# ll /var/lib/ceph/osd/ceph-3/current/|grep 0.d
drwxr-xr-x 2 ceph ceph 6 Feb 9 02:33 0.d_TEMP
Now the PG 0.d directory under osd.3 has been deleted; next we tell the cluster to scrub it:
# Scrub this PG
[root@admin tmp]# ceph pg scrub 0.d
instructing pg 0.d on osd.5 to scrub
# Check the PG state again
[root@admin tmp]# ceph pg dump|grep ^0.d
dumped all in format plain
0.d 1 0 0 0 0 12 1 1 active+clean+inconsistent 2018-02-09 21:31:09.239568 53'1 113:212 [5,4,3] 5 [5,4,3] 5 53'1 2018-02-09 21:31:09.239177 0'0 2018-02-09 01:37:29.127711
You can see the state now includes inconsistent, which means the cluster has found that the three replicas of this PG do not match.
Running ceph pg repair 0.d will repair it by copying a good replica from another OSD over the damaged one.
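On Jewel and later you can, I believe, also list exactly which objects the scrub flagged as inconsistent before repairing; a minimal sketch (it relies on the scrub above having completed):
# Show the inconsistent objects found in PG 0.d by the last scrub
rados list-inconsistent-obj 0.d --format=json-pretty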
# The PG is inconsistent, so the cluster is unhealthy
[root@admin tmp]# ceph -s
cluster 55430962-45e4-40c3-bc14-afac24c69acb
health HEALTH_ERR
clock skew detected on mon.node1
1 pgs inconsistent
1 scrub errors
Monitor clock skew detected
monmap e1: 3 mons at {admin=172.18.1.240:6789/0,node1=172.18.1.241:6789/0,node2=172.18.1.242:6789/0}
election epoch 26, quorum 0,1,2 admin,node1,node2
osdmap e113: 6 osds: 6 up, 6 in
flags sortbitwise,require_jewel_osds
pgmap v667: 128 pgs, 2 pools, 138 bytes data, 6 objects
30943 MB used, 25828 MB / 56772 MB avail
127 active+clean
1 active+clean+inconsistent
# Ceph copies a good replica over to repair it
[root@admin tmp]# ceph pg repair 0.d
instructing pg 0.d on osd.5 to repair
# Cluster health is restored
[root@admin tmp]# ceph -s
cluster 55430962-45e4-40c3-bc14-afac24c69acb
health HEALTH_WARN
clock skew detected on mon.node1
Monitor clock skew detected
monmap e1: 3 mons at {admin=172.18.1.240:6789/0,node1=172.18.1.241:6789/0,node2=172.18.1.242:6789/0}
election epoch 26, quorum 0,1,2 admin,node1,node2
osdmap e113: 6 osds: 6 up, 6 in
flags sortbitwise,require_jewel_osds
pgmap v669: 128 pgs, 2 pools, 138 bytes data, 6 objects
30943 MB used, 25828 MB / 56772 MB avail
128 active+clean
recovery io 0 B/s, 0 objects/s
# The PG state is back to normal
[root@admin tmp]# ceph pg dump|grep ^0.d
dumped all in format plain
0.d 1 0 0 0 0 12 1 1 active+clean 2018-02-09 21:34:33.338065 53'1 113:220 [5,4,3] 5 [5,4,3] 5 53'1 2018-02-09 21:34:33.321122 53'1 2018-02-09 21:34:33.321122
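To double-check that the repair really recreated the replica we deleted, the PG directory on osd.3 can be listed again; a minimal sketch (the 0.d_head directory should be back, with a newer timestamp than before):
# The deleted 0.d_head directory should have been recreated by the repair
ls -l /var/lib/ceph/osd/ceph-3/current/ | grep 0.d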