Ceph device management

Device management

Ceph tracks which hardware storage devices (e.g., HDDs, SSDs) are used by which daemons and collects health metrics about these devices in order to provide tools to predict and/or automatically respond to hardware failures.

Device tracking

You can query which storage devices are in use with the following command:

ceph device ls

It is also possible to list devices by daemon or host:

ceph device ls-by-daemon <daemon>
ceph device ls-by-host <host>
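
For example, to list the devices attached to a particular OSD daemon or host (the daemon name osd.0 and host name node1 below are placeholders for names from your own cluster):

ceph device ls-by-daemon osd.0
ceph device ls-by-host node1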

For any individual device, its location information and usage can be queried:

ceph device info <devid>
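
Device ids generally take the form VENDOR_MODEL_SERIAL, as shown by ceph device ls. Using a hypothetical device id:

ceph device info Samsung_SSD_850_EVO_1TB_S2RBNX0J500000T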

Enable monitoring

Ceph can also monitor health metrics associated with your devices. For example, SATA hard disks implement a standard called SMART that provides a wide range of internal metrics about device usage and health, such as the number of hours powered on, the number of power cycles, and the number of unrecoverable read errors. Other device types, such as SAS and NVMe, implement a similar set of metrics (via slightly different standards). All of these can be collected by Ceph via the smartctl tool.

Health monitoring can be enabled or disabled with the following commands:

ceph device monitoring on

or

ceph device monitoring off

Scraping

If monitoring is enabled, metrics are automatically scraped at regular intervals. This interval can be configured with:

ceph config set mgr mgr/devicehealth/scrape_frequency <seconds>

The default is to scrape once every 24 hours.
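
For example, to scrape every 12 hours instead (the value is in seconds and is only illustrative):

ceph config set mgr mgr/devicehealth/scrape_frequency 43200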

A scrape of all devices can be triggered manually with:

ceph device scrape-health-metrics

Health metrics for a single device can be scraped using the following command:

ceph device scrape-health-metrics <device-id>

Or health metrics for all devices belonging to a single daemon can be scraped using the following command:

ceph device scrape-daemon-health-metrics <who>
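
Here <who> is a daemon name; for example, for a hypothetical OSD daemon osd.0:

ceph device scrape-daemon-health-metrics osd.0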

The stored health metrics for a device can be retrieved (optionally for a specific timestamp) using:

ceph device get-health-metrics <devid> [sample-timestamp]
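
For example, using the same hypothetical device id as above:

ceph device get-health-metrics Samsung_SSD_850_EVO_1TB_S2RBNX0J500000T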

Failure prediction

Ceph can predict life expectancy and device failure based on collected health metrics. There are three modes:

  • none : Disable device failure prediction.
  • local : Use a pre-trained prediction model that runs inside the ceph-mgr daemon.
  • cloud : Share device health and performance metrics with an external cloud service run by ProphetStor, using either its free service or a paid service with more accurate predictions.

Prediction mode can be configured with the following command:

ceph config set global device_failure_prediction_mode <mode>
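
For example, to enable the built-in prediction model:

ceph config set global device_failure_prediction_mode local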

Prediction normally runs periodically in the background, so it may take some time before life expectancy values are populated. The life expectancy of all devices is shown in the output of the following command:

ceph device ls

You can also query the metadata for a specific device using the following command:

ceph device info <devid>

A prediction of a device's life expectancy can be requested explicitly with:

ceph device predict-life-expectancy <devid>

If you are not using Ceph's internal device failure prediction, but have some external source of information about device failures, you can inform Ceph of the device life expectancy by:

ceph device set-life-expectancy <devid> <from> [<to>]

Life expectancy is expressed as a time interval, so that uncertainty can be conveyed by giving a wide interval. The end of the interval can be left unspecified.
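
For example, to record that a hypothetical device is expected to fail at some point between two dates (the exact timestamp format accepted can vary between Ceph releases):

ceph device set-life-expectancy Samsung_SSD_850_EVO_1TB_S2RBNX0J500000T 2024-06-01 2024-09-01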

Health alerts

The mgr/devicehealth/warn_threshold option controls how soon an expected device failure must be before a health warning is generated.

The stored life expectancy of all devices can be checked, and any appropriate health alerts generated, with:

ceph device check-health

Automatic mitigation

If the mgr/devicehealth/self_heal option is enabled (it is enabled by default), then for devices that are expected to fail soon the module will automatically migrate data away from them by marking the corresponding OSDs "out".

The mgr/devicehealth/mark_out_threshold option controls how soon an expected device failure must be before we automatically mark an OSD "out".
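
Both thresholds are options of the devicehealth mgr module and, assuming they are expressed in seconds like scrape_frequency above, can be adjusted in the same way; for example, to mark OSDs out when failure is predicted within four weeks (2419200 seconds, an illustrative value):

ceph config set mgr mgr/devicehealth/mark_out_threshold 2419200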

Origin blog.csdn.net/QTM_Gitee/article/details/130416635