The previous section, "PG diagnostics for Ceph," explained how Ceph guarantees data reliability; its placement is designed to be random and, on average, evenly balanced. Yet in practice we sometimes find that OSDs of the same size end up with very different amounts of free space, some with plenty left and some nearly full, as in the following example:
# 3 nodes, 6 OSDs
[root@admin tmp]# ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.05424 root default
-2 0.01808 host admin
0 0.00879 osd.0 up 1.00000 1.00000
3 0.00929 osd.3 up 1.00000 1.00000
-3 0.01808 host node1
1 0.00879 osd.1 up 1.00000 1.00000
4 0.00929 osd.4 up 1.00000 1.00000
-4 0.01808 host node2
2 0.00879 osd.2 up 1.00000 1.00000
5 0.00929 osd.5 up 1.00000 1.00000
# OSD usage on node admin
[root@admin tmp]# df -h|grep osd
/dev/mapper/cephvg-osd 9.0G 8.5G 559M 94% /srv/ceph/osd
/dev/mapper/cephvg2-osd2 9.5G 6.0G 3.6G 63% /srv/ceph/osd2
# OSD usage on node node1
[root@node1 ~]# df -h|grep osd
/dev/mapper/cephvg-osd 9.0G 6.7G 2.4G 74% /srv/ceph/osd
/dev/mapper/cephvg2-osd2 9.5G 7.9G 1.7G 83% /srv/ceph/osd2
# OSD usage on node node2
[root@node2 ~]# df -h|grep osd
/dev/mapper/cephvg-osd 9.0G 6.1G 3.0G 68% /srv/ceph/osd
/dev/mapper/cephvg2-osd2 9.5G 8.4G 1.2G 89% /srv/ceph/osd2
As shown above, there are 6 OSDs across three nodes, each an ordinary disk of roughly 10G total capacity. So why does OSD utilization vary so widely, from 63% up to 94%?
A brief explanation:
When we created each pool, we specified its number of PGs and also set a disk-capacity quota on the pool.
[root@admin tmp]# ceph osd pool ls detail|grep pool0
pool 3 'pool0' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 127 flags hashpspool max_bytes 104857600 stripe_width 0
# pool0: 32 PGs, capacity quota 100M (converted from max_bytes 104857600)
[root@admin tmp]# ceph osd pool ls detail|grep pool1
pool 4 'pool1' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 129 flags hashpspool max_bytes 5368709120 stripe_width 0
# pool1: also 32 PGs, capacity quota 5G
pool0 and pool1 have the same number of PGs, but their capacity quotas differ enormously. Now consider: pool capacity / number of PGs = capacity per PG.
Each PG in pool0 therefore holds far less data than each PG in pool1. All of these PGs are distributed across OSDs of identical size, which is why OSD utilization ends up so uneven.
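The arithmetic makes the gap concrete. Using the `max_bytes` and `pg_num` values from the `ceph osd pool ls detail` output above, a rough per-PG capacity quota works out as:

```shell
# Capacity quota per PG = pool max_bytes / pg_num (shown in whole MB)
echo "pool0: $(( 104857600  / 32 / 1048576 )) MB per PG"   # ~3 MB
echo "pool1: $(( 5368709120 / 32 / 1048576 )) MB per PG"   # 160 MB
```

A PG in pool1 can thus carry roughly 50 times as much data as a PG in pool0, even though both pools have 32 PGs.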
[root@admin tmp]# ceph -s
cluster 55430962-45e4-40c3-bc14-afac24c69acb
health HEALTH_WARN
clock skew detected on mon.node1
2 near full osd(s)
Monitor clock skew detected
monmap e1: 3 mons at {admin=172.18.1.240:6789/0,node1=172.18.1.241:6789/0,node2=172.18.1.242:6789/0}
election epoch 30, quorum 0,1,2 admin,node1,node2
osdmap e134: 6 osds: 6 up, 6 in
flags nearfull,sortbitwise,require_jewel_osds
pgmap v2188: 192 pgs, 4 pools, 4474 MB data, 284 objects
44349 MB used, 12422 MB / 56772 MB avail
192 active+clean
# It is already warning that 2 OSDs are near full.
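Ceph raises this warning when an OSD crosses the near-full ratio (`mon_osd_nearfull_ratio`, 0.85 by default in Jewel). A quick check against the df output for osd.0 on the admin node shows why it trips:

```shell
# osd.0 from df on node admin: 8.5G used of 9.0G total
awk 'BEGIN { ratio = 8.5 / 9.0;
             printf "osd.0 usage: %.0f%% (nearfull threshold: 85%%)\n", ratio * 100 }'
```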
Let's check in detail. The object named 90m stored in pool0 is only 90M in size:
[root@admin tmp]# rados -p pool0 ls
rbd_object_map.acd0238e1f29
90m
rbd_id.image0
rbd_directory
rbd_header.acd0238e1f29
while pool1 holds four large objects:
[root@admin tmp]# rados -p pool1 ls|grep G
3G1
3G3
3G
3G2
Now let's look at how the objects in each pool map to PGs and OSDs:
[root@admin tmp]# rados -p pool0 ls|grep 90
90m
[root@admin tmp]# ceph osd map pool0 90m
osdmap e134 pool 'pool0' (3) object '90m' -> pg 3.149ba74f (3.f) -> up ([2,0,4], p2) acting ([2,0,4], p2)
# The small object in pool0 lands on osd.2, osd.0, and osd.4
[root@admin tmp]# ceph osd map pool1 3G
osdmap e134 pool 'pool1' (4) object '3G' -> pg 4.e7764a6c (4.c) -> up ([4,0,5], p4) acting ([4,0,5], p4)
[root@admin tmp]# ceph osd map pool1 3G1
osdmap e134 pool 'pool1' (4) object '3G1' -> pg 4.f6d15484 (4.4) -> up ([1,5,0], p1) acting ([1,5,0], p1)
[root@admin tmp]# ceph osd map pool1 3G2
osdmap e134 pool 'pool1' (4) object '3G2' -> pg 4.860667f (4.1f) -> up ([3,2,1], p3) acting ([3,2,1], p3)
[root@admin tmp]# ceph osd map pool1 3G3
osdmap e134 pool 'pool1' (4) object '3G3' -> pg 4.5f18be84 (4.4) -> up ([1,5,0], p1) acting ([1,5,0], p1)
# The large objects in pool1 mostly land on osd.0 and osd.5
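As a side note on reading this output: the hex number in "pg 3.149ba74f (3.f)" is the hash of the object name (rjenkins, per the pool settings), and since `pg_num` here is a power of two, the PG id reduces to the hash masked by pg_num - 1. A quick sketch reproducing the PG ids printed above:

```shell
# PG id = object-name hash & (pg_num - 1) when pg_num is a power of two (32 -> mask 31)
printf '3.%x\n' $(( 0x149ba74f & 31 ))   # object 90m -> 3.f
printf '4.%x\n' $(( 0xe7764a6c & 31 ))   # object 3G  -> 4.c
printf '4.%x\n' $(( 0xf6d15484 & 31 ))   # object 3G1 -> 4.4
printf '4.%x\n' $(( 0x0860667f & 31 ))   # object 3G2 -> 4.1f
```

This is why an unlucky handful of large objects can all hash into PGs that CRUSH happens to place on the same few OSDs.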
You see: because each PG in pool1 carries a lot of data, and most of those PGs land on osd.0 and osd.5, the two near-full OSDs reported earlier should be osd.0 and osd.5. Let's verify whether that is the case:
[root@admin ~]# ceph health detail
HEALTH_WARN clock skew detected on mon.node1; 2 near full osd(s); Monitor clock skew detected
osd.0 is near full at 93%
osd.5 is near full at 88%
mon.node1 addr 172.18.1.241:6789/0 clock skew 0.386219s > max 0.05s (latency 0.00494154s)
So we should follow three rules when creating a pool:
1. Keep the number of PGs per OSD at around 100.
2. Make the PG count a power of 2.
3. Match each pool's PG count to its total capacity, so that the capacity per PG (pool capacity divided by PG count) comes out roughly the same across pools.
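As a sketch of rule 1, the commonly cited sizing formula is (number of OSDs x 100) / replica count, rounded up to a power of two. For the 6-OSD, 3-replica cluster in this example:

```shell
osds=6 replicas=3 target_per_osd=100
total=$(( osds * target_per_osd / replicas ))        # 200 PGs as a raw target
pg_num=1
while [ "$pg_num" -lt "$total" ]; do                 # round up to a power of two
    pg_num=$(( pg_num * 2 ))
done
echo "suggested total pg_num: $pg_num"               # 256
```

Note that this total is a budget for the whole cluster; per rule 3 it should be split among the pools in proportion to their expected capacity, not given to each pool in full.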
If OSD usage has already become this uneven, there are two remedies:
1. Level the load: migrate the large files to other pools.
2. Adjust the pool's PG count (pg_num can only be increased, never decreased, so raise it on the pool that holds the large files).
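For remedy 2, a sketch of what doubling pool1's PG count would achieve (`ceph osd pool set` is the standard way to raise pg_num; pgp_num must be raised to match before data actually rebalances):

```shell
# Raise pg_num, then pgp_num, on the pool holding the large files:
#   ceph osd pool set pool1 pg_num 64
#   ceph osd pool set pool1 pgp_num 64
# Per-PG capacity drops accordingly, halving the skew any single PG can cause:
echo "before: $(( 5368709120 / 32 / 1048576 )) MB per PG"   # 160 MB
echo "after:  $(( 5368709120 / 64 / 1048576 )) MB per PG"   # 80 MB
```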