Ceph's PG distribution

The previous section, "pg diagnostics for ceph", explained how Ceph guarantees data reliability. Ceph's placement aims to be both random and evenly spread, yet we sometimes find that among OSDs of the same size, some still have plenty of free space while others are nearly full, as in the following example:

# 3 nodes, 6 OSDs in total
[root@admin tmp]# ceph osd tree
ID WEIGHT  TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY 
-1 0.05424 root default                                     
-2 0.01808     host admin                                   
 0 0.00879         osd.0       up  1.00000          1.00000 
 3 0.00929         osd.3       up  1.00000          1.00000 
-3 0.01808     host node1                                   
 1 0.00879         osd.1       up  1.00000          1.00000 
 4 0.00929         osd.4       up  1.00000          1.00000 
-4 0.01808     host node2                                   
 2 0.00879         osd.2       up  1.00000          1.00000 
 5 0.00929         osd.5       up  1.00000          1.00000
# OSD usage on node admin
[root@admin tmp]# df -h|grep osd
/dev/mapper/cephvg-osd    9.0G  8.5G  559M  94% /srv/ceph/osd
/dev/mapper/cephvg2-osd2  9.5G  6.0G  3.6G  63% /srv/ceph/osd2
# OSD usage on node node1
[root@node1 ~]# df -h|grep osd
/dev/mapper/cephvg-osd    9.0G  6.7G  2.4G  74% /srv/ceph/osd
/dev/mapper/cephvg2-osd2  9.5G  7.9G  1.7G  83% /srv/ceph/osd2
# OSD usage on node node2
[root@node2 ~]# df -h|grep osd
/dev/mapper/cephvg-osd    9.0G  6.1G  3.0G  68% /srv/ceph/osd
/dev/mapper/cephvg2-osd2  9.5G  8.4G  1.2G  89% /srv/ceph/osd2

From the output above we can see 6 OSDs across three nodes. Each OSD sits on an ordinary disk of roughly the same size (about 10G), yet their usage ranges from 63% up to 94%. Why is the difference so large?

Let's briefly explain:

When we created the pools we specified the number of PGs, and we also set a disk capacity quota on each pool:

[root@admin tmp]# ceph osd pool ls detail|grep pool0
pool 3 'pool0' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 127 flags hashpspool max_bytes 104857600 stripe_width 0
# pool0 has 32 PGs and a maximum capacity of 100M (converted from max_bytes 104857600)
[root@admin tmp]# ceph osd pool ls detail|grep pool1
pool 4 'pool1' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 129 flags hashpspool max_bytes 5368709120 stripe_width 0
# pool1 also has 32 PGs, but a maximum capacity of 5G
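
For reference, a setup like this would typically be created with commands along the following lines (a hedged reconstruction, not taken from the original capture):

[root@admin tmp]# ceph osd pool create pool0 32 32
[root@admin tmp]# ceph osd pool create pool1 32 32
# set the per-pool capacity quotas shown in the detail output above
[root@admin tmp]# ceph osd pool set-quota pool0 max_bytes 104857600
[root@admin tmp]# ceph osd pool set-quota pool1 max_bytes 5368709120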

We can see that pool0 and pool1 have the same number of PGs, but their total capacities are far apart. Now consider: total pool capacity / number of PGs = capacity per PG.
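
Plugging in the quotas above (a rough illustration; replication is ignored since it multiplies both pools equally):

pool0: 104857600 bytes (100 MB) / 32 PGs ≈ 3 MB per PG
pool1: 5368709120 bytes (5 GB)  / 32 PGs = 160 MB per PG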

So a PG in pool0 holds far less data than a PG in pool1. Every PG is placed on OSDs, and the OSD disks are all the same size, so the OSDs that happen to receive pool1's large PGs fill up much faster than the rest. That is why the OSD usage rates differ so much.

[root@admin tmp]# ceph -s
    cluster 55430962-45e4-40c3-bc14-afac24c69acb
     health HEALTH_WARN
            clock skew detected on mon.node1
            2 near full osd(s)
            Monitor clock skew detected 
     monmap e1: 3 mons at {admin=172.18.1.240:6789/0,node1=172.18.1.241:6789/0,node2=172.18.1.242:6789/0}
            election epoch 30, quorum 0,1,2 admin,node1,node2
     osdmap e134: 6 osds: 6 up, 6 in
            flags nearfull,sortbitwise,require_jewel_osds
      pgmap v2188: 192 pgs, 4 pools, 4474 MB data, 284 objects
            44349 MB used, 12422 MB / 56772 MB avail
                 192 active+clean
# Ceph is already warning that 2 OSDs are nearly full.
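
A quick way to see utilization and PG count per OSD in a single table is ceph osd df (not shown in the original capture, but available on a Jewel cluster like this one):

[root@admin tmp]# ceph osd df
# the %USE and PGS columns show, for each OSD, how full it is and how many PGs it carries
[root@admin tmp]# ceph osd df tree
# the same data grouped by the CRUSH tree (host -> osd)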

Let's look at this in more detail. The file stored in pool0, named 90m, is only 90M in size:

[root@admin tmp]# rados -p pool0 ls
rbd_object_map.acd0238e1f29
90m
rbd_id.image0
rbd_directory
rbd_header.acd0238e1f29

And pool1 holds 4 large files:

[root@admin tmp]# rados -p pool1 ls|grep G
3G1
3G3
3G
3G2

Now let's look at which PGs (and therefore which OSDs) the files in each pool map to:

[root@admin tmp]# rados -p pool0 ls|grep 90
90m
[root@admin tmp]# ceph osd map pool0 90m
osdmap e134 pool 'pool0' (3) object '90m' -> pg 3.149ba74f (3.f) -> up ([2,0,4], p2) acting ([2,0,4], p2)
# The small file in pool0 is placed on osd.0, osd.2 and osd.4
[root@admin tmp]# rados -p pool1 ls |grep G
3G1
3G3
3G
3G2
[root@admin tmp]# ceph osd map pool1 3G
osdmap e134 pool 'pool1' (4) object '3G' -> pg 4.e7764a6c (4.c) -> up ([4,0,5], p4) acting ([4,0,5], p4)
[root@admin tmp]# ceph osd map pool1 3G1
osdmap e134 pool 'pool1' (4) object '3G1' -> pg 4.f6d15484 (4.4) -> up ([1,5,0], p1) acting ([1,5,0], p1)
[root@admin tmp]# ceph osd map pool1 3G2
osdmap e134 pool 'pool1' (4) object '3G2' -> pg 4.860667f (4.1f) -> up ([3,2,1], p3) acting ([3,2,1], p3)
[root@admin tmp]# ceph osd map pool1 3G3
osdmap e134 pool 'pool1' (4) object '3G3' -> pg 4.5f18be84 (4.4) -> up ([1,5,0], p1) acting ([1,5,0], p1)
# The large files in pool1 mostly end up on osd.0 and osd.5

Because each PG in pool1 carries a lot of data, and most of those PGs land on osd.0 and osd.5, the two near-full OSDs that ceph -s warned about should be osd.0 and osd.5. Let's verify that below:

[root@admin ~]# ceph health detail
HEALTH_WARN clock skew detected on mon.node1; 2 near full osd(s); Monitor clock skew detected 
osd.0 is near full at 93%
osd.5 is near full at 88%
mon.node1 addr 172.18.1.241:6789/0 clock skew 0.386219s > max 0.05s (latency 0.00494154s)
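
To confirm this from the PG side rather than object by object, you can list the PGs mapped to a single OSD (illustrative commands; output omitted here):

[root@admin ~]# ceph pg ls-by-osd osd.0
# lists every PG that has osd.0 in its acting set
[root@admin ~]# ceph pg ls-by-osd osd.0 | grep '^4\.'
# keep only pool1's PGs (its pool id is 4, so its PG ids start with "4.")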

So we should follow 3 rules when creating a pool:

1 Keep the number of PGs on each OSD at around 100 (see the calculation after this list)

2 Make the number of PGs a power of 2

3 Choose pg_num so that the per-PG capacity (pool capacity divided by the number of PGs) is roughly the same for every pool
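
For rules 1 and 2, the usual rule of thumb (the standard PG-count heuristic, not something spelled out in the original post) is:

total PGs ≈ (number of OSDs × 100) / replica count, rounded to a power of 2

For this cluster that gives 6 × 100 / 3 = 200, so 256 PGs (or 128 to be conservative) across all pools, split between the pools in proportion to how much data each is expected to hold.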

If you already have this kind of uneven OSD usage, there are two remedies:

1 Cut peaks and fill valleys: move the large files into other pools

2 Change the pool's PG count (pg_num can only be increased, never decreased, so raise it for the pools that hold large files); a sketch of both remedies follows below
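
A minimal sketch of both remedies (the destination pool pool1-big and the target value 64 are assumptions for illustration, and raising pg_num will trigger data movement):

# remedy 1: copy pool1's objects into a pool sized for large files (pool1-big must already exist), then switch clients over
[root@admin tmp]# rados cppool pool1 pool1-big
# remedy 2: raise pg_num first, then pgp_num so the new PGs are actually rebalanced
[root@admin tmp]# ceph osd pool set pool1 pg_num 64
[root@admin tmp]# ceph osd pool set pool1 pgp_num 64
[root@admin tmp]# ceph -s    # wait for all PGs to return to active+clean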
