Ceph storage: solutions for PGs (placement groups) in stuck and degraded states

https://blog.csdn.net/skdkjzz/article/details/42486793

Out of interest in Ceph we often build Ceph clusters ourselves, sometimes single-node and sometimes multi-node, and we frequently run into abnormal PG (placement group) states. Below are some of the situations we have encountered:

1. Single node: PGs are unclean or degraded

    In this case, check how many OSDs you have, what the replica count is, what the minimum replica count (min_size) is, and whether the failure domain is osd.

2. Multiple nodes: PGs are unclean or degraded

    This case is more troublesome: you need to examine the logs and the OSD dump output, check whether the pools and the MDS are healthy and whether any objects have been lost; most importantly, fix the problem based on the warning messages.

    You can use ceph -s, ceph -w, ceph health detail, ceph osd dump, and similar commands to find the specific cause.

3. When the problem is caused by the failure domain, there are two options:

    3.1) Set osd crush chooseleaf type = 0 in the configuration file (the default is 1, i.e. host)

    3.2) Recompile the CRUSH map: find the chooseleaf entries and change the type from host to osd (see the sketch below)
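
A rough sketch of both options (the file names are arbitrary and purely illustrative). Option 3.1 is a ceph.conf setting that takes effect when the cluster is first created; option 3.2 edits the CRUSH map of a running cluster with crushtool:

    # 3.1) in ceph.conf, before the cluster is created
    [global]
    # default is 1 (host); 0 means the failure domain is osd
    osd crush chooseleaf type = 0

    # 3.2) on a running cluster: export, edit and re-inject the CRUSH map
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt
    # in crushmap.txt, change "step chooseleaf firstn 0 type host"
    #                     to  "step chooseleaf firstn 0 type osd"
    crushtool -c crushmap.txt -o crushmap.new.bin
    ceph osd setcrushmap -i crushmap.new.bin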

4. The official summary of PG states is as follows:

http://ceph.com/docs/master/dev/placement-group/

Todo: diagram of states and how they can overlap

creating

      The PG is still being created.


active

      Requests to the PG will be processed.


clean

      All objects in the PG are replicated the correct number of times.


down

      A replica with necessary data is down, so the PG is offline.


replay

      The PG is waiting for clients to replay operations after an OSD crashed.


splitting

      The PG is being split into multiple PGs (not functional as of 2012-02).


scrubbing

      The PG is being checked for inconsistencies.


degraded

      Some objects in the PG are not replicated enough times yet.


inconsistent

      Replicas of the PG are not consistent (e.g. objects are the wrong size, objects are missing from one replica after recovery finished, etc.).


peering

      The PG is undergoing the peering process.


repair

      The PG is being checked and any inconsistencies found will be repaired (if possible).


recovering

      Objects are being migrated/synchronized with their replicas.


recovery_wait

      The PG is waiting for the local/remote recovery reservations.


backfill

      A special case of recovery, in which the entire contents of the PG are scanned and synchronized, instead of inferring what needs to be transferred from the PG logs of recent operations.


backfill_wait

      The PG is waiting in line to start backfill.


backfill_toofull

      A backfill reservation was rejected because the OSD is too full.


incomplete

      The PG is missing a necessary period of history from its log. If you see this state, report a bug, and try to start any failed OSDs that may contain the needed information.


stale

      The PG is in an unknown state - the monitors have not received an update for it since the PG mapping changed.


remapped

      The PG is temporarily mapped to a different set of OSDs from what CRUSH specified.

 

5. Some collected material and solutions concerning PGs (placement groups)

Placement groups
A Placement Group (PG) aggregates a series of objects into a group, and maps the group to a series of OSDs. Tracking object placement and object metadata on a per-object basis is computationally expensive; i.e., a system with millions of objects cannot realistically track placement on a per-object basis. Placement groups address this barrier to performance and scalability. Additionally, placement groups reduce the number of processes and the amount of per-object metadata Ceph must track when storing and retrieving data.
Each placement group requires some amount of system resources:
Directly: Each PG requires some amount of memory and CPU.
Indirectly: The total number of PGs increases the peering count.
Increasing the number of placement groups reduces the variance in per-OSD load across your cluster. We recommend approximately 50-100 placement groups per OSD to balance out memory and CPU requirements and per-OSD load. For a single pool of objects, you can use the following formula:
              (OSDs * 100)
Total PGs = ------------
               Replicas
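For example (the numbers are purely illustrative): with 10 OSDs and a replica count of 3, the formula gives (10 * 100) / 3 ≈ 333 placement groups; this result is then commonly rounded up to the nearest power of two, so the pool would be created with pg_num = 512.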
When using multiple data pools for storing objects, you need to ensure that you balance the number of placement groups per pool with the number of placement groups per OSD so that you arrive at a reasonable total number of placement groups that provides reasonably low variance per OSD without taxing system resources or making the peering process too slow.
3.3.8.1 SET THE NUMBER OF PLACEMENT GROUPS
To set the number of placement groups in a pool, you must specify the number of placement groups at the time you create the pool.
See Create a Pool for details.
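For example (the pool name and counts are illustrative), the PG count is given when the pool is created:
ceph osd pool create data 128 128
where the two numbers are pg_num and pgp_num respectively.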
3.3.8.2 GET THE NUMBER OF PLACEMENT GROUPS
To get the number of placement groups in a pool, execute the following:
ceph osd pool get {pool-name} pg_num
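For example, for a hypothetical pool named data, the output looks something like:
ceph osd pool get data pg_num
pg_num: 128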
3.3.8.3 GET A CLUSTER'S PG STATISTICS
To get the statistics for the placement groups in your cluster, execute the following:
ceph pg dump [--format {format}]
Valid formats are plain (default) and json.
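For example, to save the full dump as JSON for later inspection (the file name is illustrative):
ceph pg dump --format json > /tmp/pg_dump.json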
3.3.8.4 GET STATISTICS FOR STUCK PGS
To get the statistics for all placement groups stuck in a specified state, execute the following:
ceph pg dump_stuck inactive|unclean|stale [--format <format>] [-t|--threshold <seconds>]
Inactive placement groups cannot process reads or writes because they are waiting for an OSD with the most up-to-date data to come up and in.
Unclean placement groups contain objects that are not replicated the desired number of times. They should be recovering.
Stale placement groups are in an unknown state - the OSDs that host them have not reported to the monitor cluster in a while (configured by mon_osd_report_timeout).
Valid formats are plain (default) and json. The threshold defines the minimum number of seconds the placement group is stuck before including it in the returned statistics (default 300 seconds).
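For example, to list placement groups that have been stuck unclean for at least ten minutes:
ceph pg dump_stuck unclean --threshold 600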
3.3.8.5 GET A PG MAP
To get the placement group map for a particular placement group, execute the following:
ceph pg map {pg-id}
For example:
ceph pg map 1.6c
Ceph will return the placement group map, the placement group, and the OSD status:
osdmap e13 pg 1.6c (1.6c) -> up [1,0] acting [1,0]
3.3.8.6 GET A PG'S STATISTICS
To retrieve statistics for a particular placement group, execute the following:
ceph pg {pg-id} query
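For example, querying the PG used in the previous section:
ceph pg 1.6c query
The command returns a detailed JSON document describing the PG's current state, peering history and recovery progress.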
3.3.8.7 SCRUB A PLACEMENT GROUP
To scrub a placement group, execute the following:
ceph pg scrub {pg-id}
Ceph checks the primary and any replica nodes, generates a catalog of all objects in the placement group and compares them to ensure that no objects are missing or mismatched, and their contents are consistent. Assuming the replicas all match, a final semantic sweep ensures that all of the snapshot-related object metadata is consistent. Errors are reported via logs.
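For example, to scrub the PG used above:
ceph pg scrub 1.6c
A deeper check that also reads and compares object data, not just metadata, can be triggered with ceph pg deep-scrub {pg-id}.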
3.3.8.8 REVERT LOST
If the cluster has lost one or more objects, and you have decided to abandon the search for the lost data, you must mark the unfound objects as lost.
If all possible locations have been queried and objects are still lost, you may have to give up on the lost objects. This is possible given unusual combinations of failures that allow the cluster to learn about writes that were performed before the writes themselves are recovered.
Currently the only supported option is "revert", which will either roll back to a previous version of the object or (if it was a new object) forget about it entirely. To mark the "unfound" objects as "lost", execute the following:
ceph pg {pg-id} mark_unfound_lost revert
Important: Use this feature with caution, because it may confuse applications that expect the object(s) to exist.
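For example (the PG id is illustrative), you can first list the missing/unfound objects of a PG and then mark them as lost:
ceph pg 2.5 list_missing
ceph pg 2.5 mark_unfound_lost revert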
3.3.9 CRUSH MAPS
The CRUSH algorithm determines how to store and retrieve data by computing data storage locations. CRUSH empowers Ceph clients to communicate with OSDs directly rather than through a centralized server or broker. With an algorithmically determined method of storing and retrieving data, Ceph avoids a single point of failure, a performance bottleneck, and a physical limit to its scalability.
CRUSH requires a map of your cluster, and uses the CRUSH map to pseudo-randomly store and retrieve data in OSDs with a uniform distribution of data across the cluster. For a detailed discussion of CRUSH, see CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data.
CRUSH maps contain a list of OSDs, a list of 'buckets' for aggregating the devices into physical locations, and a list of rules that tell CRUSH how it should replicate data in a Ceph cluster's pools. By reflecting the underlying physical organization of the installation, CRUSH can model, and thereby address, potential sources of correlated device failures. Typical sources include physical proximity, a shared power source, and a shared network. By encoding this information into the cluster map, CRUSH placement policies can separate object replicas across different failure domains while still maintaining the desired distribution. For example, to address the possibility of concurrent failures, it may be desirable to ensure that data replicas are on devices in different shelves, racks, power supplies, controllers, and/or physical locations.
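As an illustration (the bucket and rule names are hypothetical, using the plain-text CRUSH rule syntax), the failure domain is expressed in a rule's chooseleaf step; choosing type rack instead of the default type host spreads replicas across racks:
rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type rack
        step emit
}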
When you create a configuration file and deploy Ceph with mkcephfs, Ceph generates a default CRUSH map for your configuration. The default CRUSH map is fine for your Ceph sandbox environment. However, when you deploy a large-scale data cluster, you should give significant consideration to developing a custom CRUSH map, because it will help you manage your Ceph cluster, improve performance and ensure data safety.
For example, if an OSD goes down, a CRUSH map can help you locate the physical data center, room, row and rack of the host with the failed OSD in the event you need to use onsite support or replace hardware.

Reprinted from blog.csdn.net/letterwuyu/article/details/79994082