PG diagnostics for Ceph

1 What is a PG?

A PG (placement group) is, on disk, simply a directory on the OSD.

Below I will explain PGs step by step.

I have prepared 3 nodes (admin, node1, node2) with a total of 6 OSDs and min_size=2. min_size=2 means at least 2 replicas of a PG must be available; with fewer than that, the pool stops serving client I/O.
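Before going further, it is worth confirming the pool's replication settings. The commands below are a quick sketch; on this cluster I would expect size=3 (three-way replication, consistent with the three PG copies we will see later) and min_size=2.

# replica count of the rbd pool (expected to be 3 here)
[root@admin ~]# ceph osd pool get rbd size
# minimum number of replicas required to keep serving I/O
[root@admin ~]# ceph osd pool get rbd min_size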

[root@admin ceph]# ceph -s
    cluster 55430962-45e4-40c3-bc14-afac24c69acb
     health HEALTH_OK
     monmap e1: 3 mons at {admin=172.18.1.240:6789/0,node1=172.18.1.241:6789/0,node2=172.18.1.242:6789/0}
            election epoch 26, quorum 0,1,2 admin,node1,node2
     osdmap e53: 6 osds: 6 up, 6 in
            flags sortbitwise,require_jewel_osds
      pgmap v373: 128 pgs, 2 pools, 330 bytes data, 5 objects
            30932 MB used, 25839 MB / 56772 MB avail
                 128 active+clean
[root@admin ceph]# ceph osd tree
ID WEIGHT  TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY 
-1 0.05424 root default                                     
-2 0.01808     host admin                                   
 0 0.00879         osd.0       up  1.00000          1.00000 
 3 0.00929         osd.3       up  1.00000          1.00000 
-3 0.01808     host node1                                   
 1 0.00879         osd.1       up  1.00000          1.00000 
 4 0.00929         osd.4       up  1.00000          1.00000 
-4 0.01808     host node2                                   
 2 0.00879         osd.2       up  1.00000          1.00000 
 5 0.00929         osd.5       up  1.00000          1.00000
[root@admin ~]# ceph osd pool get rbd min_size
min_size: 2
[root@admin ~]# ceph df
GLOBAL:
    SIZE       AVAIL      RAW USED     %RAW USED 
    56772M     25839M       30932M         54.48 
POOLS:
    NAME          ID     USED     %USED     MAX AVAIL     OBJECTS 
    rbd           0       216         0         7382M           1 
    test-pool     1       114         0         7382M           4 

In the ceph df output above, we can see that the ID of the rbd pool is 0, so the names of the PGs in the rbd pool all start with 0. Let's take a look at which OSDs a PG in the rbd pool is distributed across, and what it is named.
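To list the pool IDs and the PGs belonging to a pool directly, something like the following should work on Jewel (a sketch; output omitted):

# list pools with their IDs
[root@admin ~]# ceph osd lspools
# list every PG of the rbd pool (they all start with "0.")
[root@admin ~]# ceph pg ls-by-pool rbd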

First, let's upload a file to the rbd pool with rados:

# The test.txt below is the file I will upload
[root@admin tmp]# cat test.txt 
abc
123
ABC
# The following command uploads it to the rbd pool under the object name wzl
[root@admin tmp]# rados -p rbd put wzl ./test.txt
# Find out which OSDs the object wzl is placed on
[root@admin tmp]# ceph osd map rbd wzl
osdmap e53 pool 'rbd' (0) object 'wzl' -> pg 0.ff62cf8d (0.d) -> up ([5,4,3], p5) acting ([5,4,3], p5)

As you can see, the object named wzl that we uploaded to the rbd pool maps to a PG distributed across osd.5, osd.4 and osd.3, and the PG's name is 0.d.
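Where does 0.d come from? The object name is hashed (the value 0xff62cf8d shown in the output), and the low bits of that hash select a PG within the pool. Assuming the rbd pool has pg_num=64 (a power of two, plausible given the 128 PGs split across the two pools), the calculation can be sketched in the shell:

# 0xff62cf8d mod 64 = 13 = 0xd, so the object lands in PG 0.d
[root@admin tmp]# printf '0.%x\n' $((0xff62cf8d % 64))
0.d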

Let's look for this PG on each of those OSDs:

# on the admin node, under osd.3
[root@admin tmp]# ll /var/lib/ceph/osd/ceph-3/current/ |grep 0.d
drwxr-xr-x 2 ceph ceph   59 Feb  9 20:06 0.d_head
drwxr-xr-x 2 ceph ceph    6 Feb  9 02:33 0.d_TEMP
# on node1, under osd.4
[root@node1 ~]# ll /var/lib/ceph/osd/ceph-4/current/|grep 0.d
drwxr-xr-x 2 ceph ceph   59 Feb  9 20:06 0.d_head
drwxr-xr-x 2 ceph ceph    6 Feb  9 02:34 0.d_TEMP
# on node2, under osd.5
[root@node2 ~]# ll /var/lib/ceph/osd/ceph-5/current/|grep 0.d
drwxr-xr-x 2 ceph ceph   59 Feb  9 20:06 0.d_head
drwxr-xr-x 2 ceph ceph    6 Feb  9 02:34 0.d_TEMP

We can see that the PG directories on the three OSDs 5, 4 and 3 are all named 0.d_head: the PG has three replicas, and every replica carries the same name.
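Inside each 0.d_head directory the object itself should be visible as a plain file (a sketch for FileStore OSDs like these; the exact file name encodes the object name and its hash, so it will differ slightly from system to system):

# the wzl object stored as a file inside the PG directory on osd.3
[root@admin tmp]# ls /var/lib/ceph/osd/ceph-3/current/0.d_head/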

2 What should I do if the PG is damaged or lost?

Let's talk about PG states first.

Degraded

A simple way to understand degraded: the PG has lost one or more replicas, but it can still serve client I/O. Let me test it.

As shown above, the PG holding the wzl object is distributed across osd.3, osd.4 and osd.5. What happens if osd.3 goes down?

[root@admin tmp]# systemctl stop [email protected]
[root@admin tmp]# ceph osd tree
ID WEIGHT  TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY 
-1 0.05424 root default                                     
-2 0.01808     host admin                                   
 0 0.00879         osd.0       up  1.00000          1.00000 
 3 0.00929         osd.3     down  1.00000          1.00000 
-3 0.01808     host node1                                   
 1 0.00879         osd.1       up  1.00000          1.00000 
 4 0.00929         osd.4       up  1.00000          1.00000 
-4 0.01808     host node2                                   
 2 0.00879         osd.2       up  1.00000          1.00000 
 5 0.00929         osd.5       up  1.00000          1.00000 
[root@admin tmp]# ceph -s
    cluster 55430962-45e4-40c3-bc14-afac24c69acb
     health HEALTH_WARN
            clock skew detected on mon.node1
            66 pgs degraded
            66 pgs stuck unclean
            66 pgs undersized
            recovery 2/18 objects degraded (11.111%)
            1/6 in osds are down
            Monitor clock skew detected 
     monmap e1: 3 mons at {admin=172.18.1.240:6789/0,node1=172.18.1.241:6789/0,node2=172.18.1.242:6789/0}
            election epoch 26, quorum 0,1,2 admin,node1,node2
     osdmap e55: 6 osds: 5 up, 6 in; 66 remapped pgs
            flags sortbitwise,require_jewel_osds
      pgmap v423: 128 pgs, 2 pools, 138 bytes data, 6 objects
            30932 MB used, 25839 MB / 56772 MB avail
            2/18 objects degraded (11.111%)
                  66 active+undersized+degraded
                  62 active+clean

The output above shows that after I stopped osd.3, 66 PGs went into the active+undersized+degraded state, i.e. the degraded state. Let's see whether I can still download the object I just uploaded.

[root@admin tmp]# rados -p rbd get wzl wzl.txt
[root@admin tmp]# cat wzl.txt 
abc
123
ABC

This shows that although the cluster is unhealthy and 66 PGs are degraded, they can still serve client I/O.
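To see exactly which PGs are degraded and why, the following commands can help (a sketch; output omitted):

# detailed health information, listing the problem PGs one by one
[root@admin tmp]# ceph health detail
# dump only the PGs stuck in an unclean state
[root@admin tmp]# ceph pg dump_stuck unclean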

Peered (seriously injured)

We stopped osd.3 above, so only two replicas of PG 0.d remain in the cluster, on osd.4 and osd.5. Let's take a look at the state of PG 0.d now:

[root@admin tmp]# ceph pg dump|grep ^0.d
dumped all in format plain
0.d	1	0	1	0	0	12	1	1	active+undersized+degraded	2018-02-09 20:35:27.585529	53'1	71:71	[5,4]	5	[5,4]	5	0'0	2018-02-09 01:37:29.127711	0'0	2018-02-09 01:37:29.127711

Now 0.d is only distributed on osd.5 and osd.4, and its state is active+undersized+degraded.
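For a single PG, ceph pg <pgid> query gives far more detail than the dump line, including the up/acting sets and the peering history (a sketch):

# detailed JSON state of PG 0.d, including which OSDs it wants versus has
[root@admin tmp]# ceph pg 0.d query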

Now let's stop osd.4 as well and check the PG status again:

[root@admin tmp]# ceph -s
    cluster 55430962-45e4-40c3-bc14-afac24c69acb
     health HEALTH_WARN
            clock skew detected on mon.node1
            99 pgs degraded
            19 pgs stuck unclean
            99 pgs undersized
            recovery 7/18 objects degraded (38.889%)
            2/6 in osds are down
            Monitor clock skew detected 
     monmap e1: 3 mons at {admin=172.18.1.240:6789/0,node1=172.18.1.241:6789/0,node2=172.18.1.242:6789/0}
            election epoch 26, quorum 0,1,2 admin,node1,node2
     osdmap e97: 6 osds: 4 up, 6 in; 99 remapped pgs
            flags sortbitwise,require_jewel_osds
      pgmap v585: 128 pgs, 2 pools, 138 bytes data, 6 objects
            30942 MB used, 25829 MB / 56772 MB avail
            7/18 objects degraded (38.889%)
                  62 active+undersized+degraded
                  37 undersized+degraded+peered
                  29 active+clean


[root@admin tmp]# ceph pg dump|grep ^0.d
dumped all in format plain
0.d	1	0	2	0	0	12	1	1	undersized+degraded+peered	2018-02-09 20:42:08.558726	53'1	81:105	[5]	5	[5]	5	0'0	2018-02-09 01:37:29.127711	0'0	2018-02-09 01:37:29.127711

Because two OSDs are now down, PG 0.d has fewer surviving replicas than min_size=2, so its state has become undersized+degraded+peered; a peered PG stops serving client I/O. You can also see from the dump above that only one replica of 0.d is left, on osd.5.

Now let's lower min_size to 1:

[root@admin tmp]# ceph osd pool set rbd min_size 1
set pool 0 min_size to 1
[root@admin tmp]# ceph pg dump|grep ^0.d
dumped all in format plain
0.d	1	0	2	0	0	12	1	1	active+undersized+degraded	2018-02-09 20:59:03.684989	53'1	99:163	[5]	5	[5]	5	0'0	2018-02-09 01:37:29.127711	0'0	2018-02-09 01:37:29.127711
[root@admin tmp]# ceph -s
    cluster 55430962-45e4-40c3-bc14-afac24c69acb
     health HEALTH_WARN
            clock skew detected on mon.node1
            99 pgs degraded
            19 pgs stuck unclean
            99 pgs undersized
            recovery 7/18 objects degraded (38.889%)
            2/6 in osds are down
            Monitor clock skew detected 
     monmap e1: 3 mons at {admin=172.18.1.240:6789/0,node1=172.18.1.241:6789/0,node2=172.18.1.242:6789/0}
            election epoch 26, quorum 0,1,2 admin,node1,node2
     osdmap e99: 6 osds: 4 up, 6 in; 99 remapped pgs
            flags sortbitwise,require_jewel_osds
      pgmap v594: 128 pgs, 2 pools, 138 bytes data, 6 objects
            30942 MB used, 25829 MB / 56772 MB avail
            7/18 objects degraded (38.889%)
                  79 active+undersized+degraded
                  29 active+clean
                  20 undersized+degraded+peered

With min_size set to 1, the PG state is no longer peered but back to active+undersized+degraded, the cluster health returns to HEALTH_WARN, and the pool can serve client I/O again.
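Once the cluster is healthy again, remember to put min_size back to its original value, since min_size=1 lets writes be acknowledged with a single surviving copy (a sketch):

[root@admin tmp]# ceph osd pool set rbd min_size 2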

Remapped (self-healing)

Ceph has a very powerful feature: self-healing. If an OSD stays down for more than 300 seconds, the cluster assumes it is not coming back, marks it out, and starts re-replicating its data onto the remaining OSDs from the surviving copies. This is the self-healing feature.
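The down-to-out grace period is governed by the monitor option mon_osd_down_out_interval (the upstream default is 600 seconds, so it has presumably been lowered to around 300 on this cluster). It can be read back roughly like this on a monitor node (a sketch):

# read the down->out interval from the local monitor's admin socket
[root@admin tmp]# ceph daemon mon.admin config get mon_osd_down_out_interval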

[root@admin tmp]# ceph osd tree
ID WEIGHT  TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY 
-1 0.05424 root default                                     
-2 0.01808     host admin                                   
 0 0.00879         osd.0       up  1.00000          1.00000 
 3 0.00929         osd.3     down        0          1.00000 
-3 0.01808     host node1                                   
 1 0.00879         osd.1       up  1.00000          1.00000 
 4 0.00929         osd.4     down        0          1.00000 
-4 0.01808     host node2                                   
 2 0.00879         osd.2       up  1.00000          1.00000 
 5 0.00929         osd.5       up  1.00000          1.00000 

Within 300 seconds of being stopped, osd.3 and osd.4 are shown as down in ceph osd tree but their reweight stays at 1.00000, meaning they are still counted as members of the cluster. After 300 seconds, if an OSD still has not come back up, the cluster evicts it: it is marked out and its reweight drops to 0, as shown above. At that point the cluster starts rebuilding from the only remaining copy of the data:

[root@admin tmp]# ceph -s
    cluster 55430962-45e4-40c3-bc14-afac24c69acb
     health HEALTH_ERR
            clock skew detected on mon.node1
            22 pgs are stuck inactive for more than 300 seconds
            19 pgs degraded
            66 pgs peering
            7 pgs stuck degraded
            22 pgs stuck inactive
            88 pgs stuck unclean
            7 pgs stuck undersized
            19 pgs undersized
            recovery 1/18 objects degraded (5.556%)
            Monitor clock skew detected 
     monmap e1: 3 mons at {admin=172.18.1.240:6789/0,node1=172.18.1.241:6789/0,node2=172.18.1.242:6789/0}
            election epoch 26, quorum 0,1,2 admin,node1,node2
     osdmap e106: 6 osds: 4 up, 4 in
            flags sortbitwise,require_jewel_osds
      pgmap v609: 128 pgs, 2 pools, 138 bytes data, 6 objects
            20636 MB used, 16699 MB / 37336 MB avail
            1/18 objects degraded (5.556%)
                  44 remapped+peering
                  40 active+clean
                  22 peering
                  18 active+undersized+degraded
                   3 activating
                   1 activating+undersized+degraded

Look, 44 PGs are already in the remapped+peering state (injured, but starting to recover). Let's check again once recovery has finished:
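Recovery progress can be followed live with ceph -w (or simply by re-running ceph -s); a sketch:

# stream cluster events and watch the recovery io and PG state changes
[root@admin tmp]# ceph -w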

[root@admin tmp]# ceph -s
    cluster 55430962-45e4-40c3-bc14-afac24c69acb
     health HEALTH_WARN
            clock skew detected on mon.node1
            Monitor clock skew detected 
     monmap e1: 3 mons at {admin=172.18.1.240:6789/0,node1=172.18.1.241:6789/0,node2=172.18.1.242:6789/0}
            election epoch 26, quorum 0,1,2 admin,node1,node2
     osdmap e107: 6 osds: 4 up, 4 in
            flags sortbitwise,require_jewel_osds
      pgmap v634: 128 pgs, 2 pools, 138 bytes data, 6 objects
            20637 MB used, 16698 MB / 37336 MB avail
                 128 active+clean
[root@admin tmp]# ceph osd map rbd wzl
osdmap e107 pool 'rbd' (0) object 'wzl' -> pg 0.ff62cf8d (0.d) -> up ([5,1,0], p5) acting ([5,1,0], p5)

See, all PGs are back to active+clean, cluster health is restored, and PG 0.d holding the wzl object has been remapped to osd.5, osd.1 and osd.0.
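To confirm the remap on disk, the new replica should now have appeared under osd.0 on the admin node (a sketch, assuming the same FileStore layout as before):

[root@admin tmp]# ll /var/lib/ceph/osd/ceph-0/current/ | grep 0.d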

Note: if we bring the previously stopped OSDs back, the PG moves back to the OSDs it was originally distributed on, and the remapped copies are removed. Let's take a look:
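Bringing the OSDs back is simply the reverse of stopping them (a sketch; osd.3 lives on admin and osd.4 on node1):

[root@admin tmp]# systemctl start [email protected]
[root@node1 ~]# systemctl start [email protected]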

[root@admin tmp]# ceph osd tree
ID WEIGHT  TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY 
-1 0.05424 root default                                     
-2 0.01808     host admin                                   
 0 0.00879         osd.0       up  1.00000          1.00000 
 3 0.00929         osd.3       up  1.00000          1.00000 
-3 0.01808     host node1                                   
 1 0.00879         osd.1       up  1.00000          1.00000 
 4 0.00929         osd.4       up  1.00000          1.00000 
-4 0.01808     host node2                                   
 2 0.00879         osd.2       up  1.00000          1.00000 
 5 0.00929         osd.5       up  1.00000          1.00000 
[root@admin tmp]# ceph osd map rbd wzl
osdmap e113 pool 'rbd' (0) object 'wzl' -> pg 0.ff62cf8d (0.d) -> up ([5,4,3], p5) acting ([5,4,3], p5)

We can see that once the OSDs are back, the PG returns to the OSDs it was originally distributed on: PG 0.d is back on osd.5, osd.4 and osd.3.

Recover

If the OSDs hosting the three replicas of a PG are all up, but one of the replicas is lost or damaged, the cluster will detect that the replicas of this PG are inconsistent, and it will copy the data from a healthy replica over the bad one. Let's do an experiment:

Let's directly delete the PG 0.d directory on osd.3:

[root@admin tmp]# ll /var/lib/ceph/osd/ceph-3/current/|grep 0.d
drwxr-xr-x 2 ceph ceph    59 Feb  9 20:06 0.d_head
drwxr-xr-x 2 ceph ceph     6 Feb  9 02:33 0.d_TEMP
[root@admin tmp]# rm -rf /var/lib/ceph/osd/ceph-3/current/0.d_head/
[root@admin tmp]# ll /var/lib/ceph/osd/ceph-3/current/|grep 0.d
drwxr-xr-x 2 ceph ceph     6 Feb  9 02:33 0.d_TEMP

Now PG 0.d under osd.3 has been deleted. Next, tell the cluster to scrub it:

# scrub this PG
[root@admin tmp]# ceph pg scrub 0.d
instructing pg 0.d on osd.5 to scrub
# check the PG state again
[root@admin tmp]# ceph pg dump|grep ^0.d
dumped all in format plain
0.d	1	0	0	0	0	12	1	1	active+clean+inconsistent	2018-02-09 21:31:09.239568	53'1	113:212	[5,4,3]	5	[5,4,3]	5	53'1	2018-02-09 21:31:09.239177	0'0	2018-02-09 01:37:29.127711

You can see that the state now includes inconsistent, which means the cluster has found that the PG's 3 replicas do not match.

You can run ceph pg repair 0.d to repair it; Ceph will copy the data back from one of the healthy replicas.
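Before repairing, Jewel can also report exactly which objects the scrub flagged, roughly like this (a sketch; output omitted):

# list the inconsistent objects found by the last scrub of PG 0.d
[root@admin tmp]# rados list-inconsistent-obj 0.d --format=json-pretty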

# the PG is inconsistent, so the cluster is unhealthy
[root@admin tmp]# ceph -s
    cluster 55430962-45e4-40c3-bc14-afac24c69acb
     health HEALTH_ERR
            clock skew detected on mon.node1
            1 pgs inconsistent
            1 scrub errors
            Monitor clock skew detected 
     monmap e1: 3 mons at {admin=172.18.1.240:6789/0,node1=172.18.1.241:6789/0,node2=172.18.1.242:6789/0}
            election epoch 26, quorum 0,1,2 admin,node1,node2
     osdmap e113: 6 osds: 6 up, 6 in
            flags sortbitwise,require_jewel_osds
      pgmap v667: 128 pgs, 2 pools, 138 bytes data, 6 objects
            30943 MB used, 25828 MB / 56772 MB avail
                 127 active+clean
                   1 active+clean+inconsistent
# ceph copies a good replica over to repair it
[root@admin tmp]# ceph pg repair 0.d
instructing pg 0.d on osd.5 to repair
# cluster health is restored
[root@admin tmp]# ceph -s
    cluster 55430962-45e4-40c3-bc14-afac24c69acb
     health HEALTH_WARN
            clock skew detected on mon.node1
            Monitor clock skew detected 
     monmap e1: 3 mons at {admin=172.18.1.240:6789/0,node1=172.18.1.241:6789/0,node2=172.18.1.242:6789/0}
            election epoch 26, quorum 0,1,2 admin,node1,node2
     osdmap e113: 6 osds: 6 up, 6 in
            flags sortbitwise,require_jewel_osds
      pgmap v669: 128 pgs, 2 pools, 138 bytes data, 6 objects
            30943 MB used, 25828 MB / 56772 MB avail
                 128 active+clean
recovery io 0 B/s, 0 objects/s
# the PG state is back to normal
[root@admin tmp]# ceph pg dump|grep ^0.d
dumped all in format plain
0.d	1	0	0	0	0	12	1	1	active+clean	2018-02-09 21:34:33.338065	53'1	113:220	[5,4,3]	5	[5,4,3]	5	53'1	2018-02-09 21:34:33.321122	53'1	2018-02-09 21:34:33.321122

 
