Ceph is currently a very popular open-source distributed storage system. It offers high scalability, high performance, and high reliability, and provides block storage (RBD), object storage (RGW), and file system storage (CephFS). When storing data, Ceph makes full use of the computing power of the storage nodes: the location of each piece of data is calculated rather than looked up, and Ceph tries to distribute data as evenly as possible.
Ceph is also currently the mainstream backend storage for OpenStack.
② Features of Ceph
High performance:
It abandons the traditional centralized metadata addressing scheme in favor of the CRUSH algorithm, so data distribution is balanced and the degree of parallelism is high;
Considering the isolation of disaster-recovery domains, replica placement rules can be implemented for various workloads, such as cross-datacenter placement and rack awareness;
It can support thousands of storage nodes and data volumes from TB to PB scale.
High availability:
The number of copies can be flexibly controlled;
Support fault domain separation and strong data consistency;
Automatic repair and self-healing in various fault scenarios;
No single point of failure, managed automatically.
High scalability:
decentralization;
Flexible expansion;
It grows linearly with the number of nodes.
Rich features:
Supports three storage interfaces: block storage, file storage, and object storage;
Supports custom interfaces and multiple language drivers.
③ Ceph architecture
Ceph supports three interfaces:
Object: has a native API, and is also compatible with Swift and S3 APIs;
Block: supports thin provisioning, snapshots, and clones;
File: Posix interface, supports snapshots.
Notes:
RADOS: the full name is Reliable Autonomic Distributed Object Store, i.e. a reliable, automated, distributed object store. RADOS is the essence of a Ceph cluster; it handles cluster operations such as data distribution and failover on behalf of users;
Librados: the library for accessing RADOS. Because the RADOS protocol is difficult to access directly, the upper-layer RBD, RGW, and CephFS all go through librados, which currently supports PHP, Ruby, Java, Python, C, and C++.
MDS: Stores metadata for the Ceph file system.
④ Ceph core components
OSD is the process responsible for physical storage. It is generally configured in one-to-one correspondence with disks: one disk runs one OSD process. Its main functions are storing data, replicating data, rebalancing data, recovering data, performing heartbeat checks with other OSDs, and responding to client requests by returning the requested data. The OSD is the only component in a Ceph cluster that stores actual user data; an OSD daemon is usually bound to one physical disk in the cluster, so in general the total number of physical disks in a Ceph cluster equals the total number of OSD daemons storing user data.
Ceph introduces the concept of PG (placement group). PG is a virtual concept and does not correspond to any entity. Ceph first maps objects to PGs, and then maps them to OSDs from PGs.
Pool is a logical partition of storage objects. It specifies the type of data redundancy and the corresponding copy distribution strategy. It supports two types: replicated and erasure code.
The relationship between Pool, PG and OSD is as follows (a conceptual mapping sketch follows the list):
There are many PGs in a Pool;
A PG contains a bunch of objects, and an object can only belong to one PG;
A PG has a primary/replica distinction, and a single PG is distributed across different OSDs (for a three-replica pool);
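To make the Pool → PG → OSD chain concrete, here is a conceptual sketch (not Ceph's actual source code; Ceph uses the rjenkins hash and the CRUSH algorithm, md5 is used here only for illustration) of how an object name is mapped to a PG, which CRUSH then maps to a set of OSDs:
import hashlib

def object_to_pg(object_name: str, pool_id: int, pg_num: int) -> str:
    # Hash the object name and take it modulo the pool's pg_num (illustrative hash only)
    h = int(hashlib.md5(object_name.encode()).hexdigest(), 16)
    return f"{pool_id}.{h % pg_num:x}"

# The PG id is then fed to CRUSH, which deterministically selects the OSD set,
# e.g. 1.2a -> [osd.3 (primary), osd.7, osd.12] for a three-replica pool.
print(object_to_pg("rbd_data.abc123", pool_id=1, pg_num=64))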
Monitor: a Ceph cluster requires a small cluster of multiple Monitors. They synchronize data via Paxos and save OSD metadata; they are responsible for the cluster maps (such as the OSD Map, Monitor Map, PG Map, and CRUSH Map), maintain the health status of the cluster, maintain the various charts showing cluster state, and manage client authentication and authorization.
MDS: the full name is Ceph Metadata Server. It is the metadata service that CephFS depends on; it saves the file system's metadata and manages the directory structure. Object storage and block storage do not require a metadata service, so if you are not using CephFS you do not need to install it.
Mgr: Ceph officially developed ceph-mgr, whose main goal is to manage the Ceph cluster and provide a unified entry point for external tools such as cephmetrics, zabbix, calamari, and prometheus. The Ceph manager daemon (ceph-mgr) was introduced in the Kraken release; it runs alongside the monitor daemons to provide additional monitoring and interfaces to external monitoring and management systems.
The full name of RGW is RADOS gateway, which is an object storage service provided by Ceph. The interface is compatible with S3 and Swift.
CephFS: the Ceph file system provides a POSIX-compliant file system that uses the Ceph storage cluster to store user data. Like RBD (block storage) and RGW (object storage), the CephFS service is implemented natively on top of the librados interface.
① Block storage service (RBD)
A block is a sequence of bytes (usually 512 bytes), and a block-based storage interface is a mature and common way of storing data on media including hard disks, solid-state drives, CDs, floppy disks, and even tape. The ubiquity of the block device interface makes it ideal for interacting with massive data stores such as Ceph; Ceph block devices are thin-provisioned, resizable, and store data striped across multiple OSDs.
Advantages:
Data protection is provided through means such as RAID and LVM;
Multiple cheap hard drives can be combined to increase capacity;
Combining multiple disks into one logical disk improves read and write efficiency.
Disadvantages:
When using a SAN architecture for networking, the cost of fibre-channel switches is high;
Data cannot be shared between hosts.
Use cases:
Docker container and virtual machine disk storage;
Log storage;
File storage.
A Linux kernel-level block device that allows users to access Ceph like any other Linux block device.
② File system storage service (CephFS)
The Ceph File System (CephFS) is built on top of Ceph's distributed object store. CephFS provides state-of-the-art, multi-purpose, highly available, and high-performance file storage for a variety of scenarios, including shared home directories, and FTP and NFS shared storage.
Ceph already has block storage, so why does it need a file system interface? Mainly because of different application scenarios: Ceph block devices have excellent read and write performance but cannot be mounted in multiple places at the same time (they are currently mainly used as virtual disks in OpenStack), whereas CephFS has poorer read and write performance than the block device interface but offers excellent sharing.
Advantages:
Low cost; any machine will do;
Easy file sharing.
Disadvantages:
Low read and write speed;
Slow transfer rate.
Use cases:
Log storage;
File storage with a directory structure.
③ Object storage service (RGW)
The Ceph Object Gateway is built on librados, which provides a RESTful gateway between applications and Ceph storage clusters. Ceph Object Storage supports two interfaces:
S3 compatible: provides object storage functionality through an interface compatible with a large subset of the Amazon S3 RESTful API;
Swift Compatibility: Provides object storage functionality through an interface compatible with a large subset of the OpenStack Swift API.
Advantages:
High read and write speed, like block storage;
Sharing features, like file storage.
Use cases:
Image storage;
Video storage.
3. Ceph cluster deployment
① Ceph deployment tools
ceph-deploy: The official deployment tool, ceph-deploy is no longer actively maintained, it does not support RHEL8, CentOS 8 or newer operating systems;
ceph-ansible: Red Hat's deployment tool;
ceph-chef: A tool for automatically deploying Ceph using chef;
puppet-ceph: the ceph module of puppet;
cephadm: Only Octopus and newer are supported (recommended).
② Cluster deployment planning
IP              | Hostname          | Role                    | Disk  | Operating system
192.168.182.130 | local-168-182-130 | monitor,mgr,rgw,mds,osd | 2*20G | CentOS 7
192.168.182.131 | local-168-182-131 | monitor,mgr,rgw,mds,osd | 2*20G | CentOS 7
192.168.182.132 | local-168-182-132 | monitor,mgr,rgw,mds,osd | 2*20G | CentOS 7
monitor: the Ceph monitoring/management node, which undertakes important management tasks of the Ceph cluster; generally 3 or 5 nodes are required.
mgr: the Ceph cluster manager node, which provides a unified entry point to the outside world.
rgw: the Ceph Object Gateway, a service that enables clients to access the Ceph cluster using standard object storage APIs.
mds: the Ceph Metadata Server, which mainly saves the metadata of the file system service; this component is only required when file storage is used.
osd: the Object Storage Daemon, the Ceph storage node actually responsible for storing data.
The current node installs mon and mgr roles, and deploys prometheus, grafana, alertmanager, node-exporter and other services:
# Install one node first; the other nodes can be added to the cluster later with the commands shown below.
# You need to know the IP address of the first monitor daemon for the cluster.
# If there are multiple networks and interfaces, make sure to choose one that is reachable by any host that needs to access the Ceph cluster.
cephadm bootstrap --mon-ip 192.168.182.130

# What this command does:
# Creates monitor and manager daemons for the new cluster on the local host.
# Generates a new SSH key for the Ceph cluster and adds it to the root user's /root/.ssh/authorized_keys file.
# Writes the minimal configuration file needed to communicate with the new cluster to /etc/ceph/ceph.conf.
# Writes a copy of the client.admin administrative key to /etc/ceph/ceph.client.admin.keyring.
# Writes a copy of the public key to /etc/ceph/ceph.pub.

# View the deployed services
docker ps

# ======= Output =======
Ceph Dashboard is now available at:
             URL: https://local-168-182-130:8443/
            User: admin
        Password: 0ard2l57ji
You can access the Ceph CLI with:
sudo /usr/sbin/cephadm shell --fsid d1e9b986-89b8-11ed-bec2-000c29ca76a9 -c /etc/ceph/ceph.conf -k /etc/ceph/ceph.client.admin.keyring
Please consider enabling telemetry to help improve Ceph:
ceph telemetry on
For more information see:
https://docs.ceph.com/docs/master/mgr/telemetry/
According to the output, the dashboard is available at a web address of the form https://<ip>:8443/ (the screenshot shown here was taken after deployment):
View the cluster status through the ceph command:
ceph -s
⑧ Add new node
Install the cluster's public SSH key in the new host's root user authorized_keys file:
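A minimal sketch of the two steps (the hostname and IP below are this article's example nodes):
# Copy the cluster's public SSH key to the new host
ssh-copy-id -f -i /etc/ceph/ceph.pub root@local-168-182-131
# Tell Ceph that the new node is part of the cluster
ceph orch host add local-168-182-131 192.168.182.131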
Cephadm is a utility for managing Ceph clusters. The goal of cephadm is to provide a fully featured, robust, and well-maintained installation and management layer for any environment that is not running Ceph in Kubernetes.
The specific features of Cephadm are as follows:
Deploys all components in containers: using containers simplifies dependency and packaging complexity across different distributions. RPM and Deb packages are of course still built, but the more users move to cephadm (or Rook) and containers, the fewer OS-specific bugs will be seen;
Tight integration with the Orchestrator API: Ceph's Orchestrator interface was developed extensively alongside cephadm to match its implementation and to cleanly abstract away the (slightly different) functionality present in Rook; the end result looks and feels like a part of Ceph;
No dependency on external management tools: tools like Salt and Ansible are great for large-scale deployments in large environments, but making Ceph depend on such a tool means users also have to learn that software. More importantly, deployments that rely on these tools (Salt, Ansible, etc.) can end up being more complex, harder to debug, and (most notably) slower than a deployment tool designed specifically to manage Ceph;
Minimal operating system dependencies: cephadm requires Python 3, LVM, and a container runtime (Podman or Docker); any current Linux distribution will do;
Isolates clusters from each other: supporting multiple Ceph clusters co-existing on the same host has always been a relatively niche scenario, but it does exist, and isolating clusters from each other in a robust, general way makes testing and redeploying clusters a safe and natural process for both developers and users;
Automatic upgrades: once Ceph "owns" its own deployment, it can upgrade itself in a safe and automated manner;
Easy migration from "legacy" deployment tools: it needs to be easy to transition to cephadm from existing Ceph deployments made with tools such as ceph-ansible, ceph-deploy, and DeepSea.
Here is a list of some things cephadm can do:
cephadm can add Ceph containers to the cluster;
cephadm can remove Ceph containers from the cluster;
The cephadm model has a simple "bootstrap" step, started from the command line, which brings up a minimal Ceph cluster (one monitor and one manager daemon) on the local host. The rest of the cluster is then deployed using orchestrator commands to add additional hosts, consume storage devices, and deploy daemons for cluster services.
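For example, after bootstrap the rest of the cluster is typically built with orchestrator commands like the following (the hostnames and IPs are this article's example nodes):
# Add further hosts to the cluster
ceph orch host add local-168-182-131 192.168.182.131
ceph orch host add local-168-182-132 192.168.182.132
# Place three monitor daemons on specific hosts
ceph orch apply mon --placement="3 local-168-182-130 local-168-182-131 local-168-182-132"
# Turn all available, unused devices into OSDs
ceph orch apply osd --all-available-devices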
④ Enable ceph shell
The cephadm command is generally only used during deployment; afterwards it is recommended to enable the ceph command, which is more concise and powerful:
The ceph-common package can be installed, which contains all Ceph commands, including ceph, rbd, mount.ceph (for mounting the CephFS filesystem), etc.:
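A minimal sketch of both options (the release name quincy below is an example; use the release you actually deployed):
# Option 1: open a shell in a container that already has the Ceph CLI
cephadm shell
# Option 2: install ceph-common directly on the host
cephadm add-repo --release quincy
cephadm install ceph-common
# Verify
ceph -v
ceph status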
Host labels: the Orchestrator supports assigning labels to hosts. Labels are free-form and have no particular meaning by themselves, and each host can have multiple labels; they can be used to control where daemons are placed.
Special host labels: the following host labels have special meanings for cephadm:
_no_schedule: do not schedule or deploy daemons on this host. This label prevents cephadm from deploying daemons on the host; if it is added to an existing host that already contains Ceph daemons, cephadm will move those daemons elsewhere (except OSDs, which are not removed automatically);
_no_autotune_memory: do not automatically tune memory on this host. This label prevents daemon memory from being tuned even if osd_memory_target_autotune or similar options are enabled for one or more daemons on this host;
_admin: distribute client.admin and ceph.conf to this host. By default the _admin label is applied to the first host in the cluster (where bootstrap was originally run), and the client.admin key is distributed to that host via the ceph orch client-keyring ... function. Adding this label to additional hosts normally causes cephadm to deploy the configuration and keyring files to /etc/ceph; since versions 16.2.10 and 17.2.1, cephadm also stores the configuration and keyring files in the /var/lib/ceph/<fsid>/config directory.
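Labels are managed through the orchestrator, for example (the hostname is this article's example node):
# Add a free-form label to a host
ceph orch host label add local-168-182-131 mon
# Remove it again
ceph orch host label rm local-168-182-131 mon
# Give a host the special _admin label so it receives ceph.conf and the admin keyring
ceph orch host label add local-168-182-131 _admin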
④ Maintenance mode
Put the host into and out of maintenance mode (stop all Ceph daemons on the host):
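A minimal sketch (the hostname is this article's example node):
# Enter maintenance mode (stops all Ceph daemons on the host)
ceph orch host maintenance enter local-168-182-131
# Exit maintenance mode
ceph orch host maintenance exit local-168-182-131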
ceph-volume scans each host in the cluster from time to time to determine which devices are present and whether they are eligible to be used as OSDs.
To view the list, run the following command:
# ceph orch device ls [--hostname=...][--wide][--refresh]
ceph orch device ls
# The --wide option shows all details related to the devices, including any reasons why a device may not be eligible for use as an OSD.
ceph orch device ls --wide
In the example above you can see fields named "Health", "Ident" and "Fault". This information is obtained through the integration with libstoragemgmt. By default this integration is disabled, because libstoragemgmt may not be 100% compatible with your hardware.
To have cephadm include these fields, enable CEPHADM's Enhanced Device Scanning option as follows:
ceph config set mgr mgr/cephadm/device_enhanced_scan true
Create a new OSD:
Ceph uses any available and unused storage devices:
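For example (the device path and hostname below are examples; check ceph orch device ls for your own devices):
# Consume any available and unused storage device as an OSD
ceph orch apply osd --all-available-devices
# Or create an OSD from a specific device on a specific host
ceph orch daemon add osd local-168-182-130:/dev/sdb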
Removing an OSD from a cluster involves two steps: evacuating all placement groups (PGs) from the OSD, and then removing the PG-free OSD from the cluster.
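A minimal sketch (the OSD id 0 is an example):
# Evacuate PGs from the OSD and remove it once it is drained
ceph orch osd rm 0
# Check the progress of the removal
ceph orch osd rm status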
Activate existing OSDs: if the host's operating system is reinstalled, the existing OSDs need to be activated again. For this use case, cephadm provides a wrapper that activates all existing OSDs on the host:
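For example (the hostname is this article's example node):
# Re-activate all existing OSDs on a host after its operating system was reinstalled
ceph cephadm osd activate local-168-182-130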
List the usage of each disk in the cluster in detail:
ceph osd df
⑧ Pool related operations
View the pools in the Ceph cluster:
ceph osd lspools
# or
ceph osd pool ls
Create a pool in the Ceph cluster:
# The 100 here is the number of PGs:
ceph osd pool create rbdtest 100
⑨ PG related
View the mapping information of the pg group:
ceph pg dump
# or
# ceph pg ls
View a PG map:
ceph pg map 7.1a
View PG status:
ceph pg stat
Display all pg statistics in a cluster:
ceph pg dump --format plain
6. Practical operation demonstration
① Block storage usage (RBD)
Use create to create a pool:
ceph osd lspools
# Create (the two 64s below are the PG and PGP counts)
ceph osd pool create ceph-demo 64 64
# When creating a pool you need to specify the PG and PGP counts; you can also specify a replicated or erasure-coded model, the number of replicas, and so on
# osd pool create <pool> [<pg_num:int>] [<pgp_num:int>] [replicated|erasure] [<erasure_code_profile>] [<rule>] [<expected_num_objects:int>] [<size:int>] [<pg_num_min:int>] [on|off|warn] [<target_size_bytes:int>] [<target_size_ratio:float>]
Pool properties can be queried and reset. There are many parameters; they can be got and set as follows:
# 1. Get the number of PGs
ceph osd pool get ceph-demo pg_num
# 2. Get the number of PGPs
ceph osd pool get ceph-demo pgp_num
# 3. Get the number of replicas
ceph osd pool get ceph-demo size
# 4. Get the CRUSH rule in use
ceph osd pool get ceph-demo crush_rule
# 5. Set the number of replicas
ceph osd pool set ceph-demo size 2
# 6. Set the number of PGs
ceph osd pool set ceph-demo pg_num 128
# 7. Set the number of PGPs
ceph osd pool set ceph-demo pgp_num 128
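Building on the pool created above, a minimal sketch of actually creating and mapping a block device could look like this (the image name and size are examples; on older kernels rbd map may require disabling unsupported image features first):
# Create a 10 GiB image in the ceph-demo pool and inspect it
rbd create -p ceph-demo --image rbd-demo.img --size 10G
rbd info ceph-demo/rbd-demo.img
# Map it through the kernel RBD driver, then use it like any other block device
# (/dev/rbd0 assumes this is the first mapped image)
rbd map ceph-demo/rbd-demo.img
mkfs.xfs /dev/rbd0
mount /dev/rbd0 /mnt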
② File system storage usage (CephFS)
Create the data pool and the metadata pool:
ceph osd pool create cephfs_data 128
ceph osd pool create cephfs_metadata 128
PG (Placement Group): a PG is a virtual concept used to hold objects. PGP (Placement Group for Placement purpose): the PGP count determines the permutations of OSDs across which the PGs in a pool are placed.
Create the file system:
# The file system name (cephfs below) is an example
ceph fs new cephfs cephfs_metadata cephfs_data
# Now check the file system and the mds node status again
ceph fs ls
ceph mds stat
View storage pool quotas:
ceph osd pool get-quota cephfs_metadata
Mount the Ceph file system using the kernel driver:
Create a mount point:
mkdir /mnt/mycephfs
Obtain the storage key; if it is not present, copy it again from the management node:
cat /etc/ceph/ceph.client.admin.keyring
# Save the storage key to the /etc/ceph/admin.secret file:
vim /etc/ceph/admin.secret
# AQBFVrFjqst6CRAA9WaF1ml7btkn6IuoUDb9zA==
# To mount at boot, the mount command can be added to /etc/rc.d/rc.local
Mount:
# The Ceph storage cluster requires authentication by default, so when mounting you must specify the user name (name) and the key file created above (secretfile), for example:
# mount -t ceph {ip-address-of-monitor}:6789:/ /mnt/mycephfs
mount -t ceph 192.168.182.130:6789:/ /mnt/mycephfs -o name=admin,secretfile=/etc/ceph/admin.secret
Unmount:
umount /mnt/mycephfs
Common commands:
# View the number of replicas of a pool
ceph osd pool get [pool name] size
# Modify the number of replicas of a pool
ceph osd pool set [pool name] size 3
# Print the list of pools
ceph osd lspools
# Create a pool
ceph osd pool create [pool name] [pg_num value]
# Rename a pool
ceph osd pool rename [old pool name] [new pool name]
# View the pg_num of a pool
ceph osd pool get [pool name] pg_num
# View the pgp_num of a pool
ceph osd pool get [pool name] pgp_num
# Modify the pg_num of a pool
ceph osd pool set [pool name] pg_num [pg_num value]
# Modify the pgp_num of a pool
ceph osd pool set [pool name] pgp_num [pgp_num value]
③ Object storage usage (RGW)
rados is a utility for interacting with Ceph's object storage cluster (RADOS), part of the Ceph distributed storage system:
Check how many pools there are in the ceph cluster:
rados lspools
# Same output as ceph osd pool ls
Show overall system usage:
rados df
Create a pool:
ceph osd pool create test
Create an object object:
rados create test-object -p test
View object file:
rados -p test ls
Delete an object:
rados rm test-object -p test
Using Ceph storage through the API: in order to use the Ceph RGW REST interfaces, you need to initialize a Ceph object gateway user for the S3 interface, then create a sub-user for the Swift interface; finally you can access the object gateway with the created users' credentials.
Create an S3 Gateway user:
radosgw-admin user create --uid="rgwuser" --display-name="This is first rgw test user"
Output:
{
    "user_id": "rgwuser",
    "display_name": "This is first rgw test user",
    "email": "",
    "suspended": 0,
    "max_buckets": 1000,
    "subusers": [],
    "keys": [
        { "user": "rgwuser", "access_key": "48AIAPCYK7S4X9P72VOW", "secret_key": "oC5qKL0BMMzUJHAS76rQAwIoJh4s6NwTnLklnQYX" }
    ],
    "swift_keys": [],
    "caps": [],
    "op_mask": "read, write, delete",
    "default_placement": "",
    "default_storage_class": "",
    "placement_tags": [],
    "bucket_quota": { "enabled": false, "check_on_raw": false, "max_size": -1, "max_size_kb": 0, "max_objects": -1 },
    "user_quota": { "enabled": false, "check_on_raw": false, "max_size": -1, "max_size_kb": 0, "max_objects": -1 },
    "temp_url_keys": [],
    "type": "rgw",
    "mfa_ids": []
}
The python-boto package is used here to connect to S3 with the credentials above, create a bucket named my-first-s3-bucket, and finally list all created buckets, printing each name and creation time:
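A minimal sketch, assuming boto (v2) is installed (pip install boto), that the gateway listens on port 80 of the first node (an assumption for this article's setup), and reusing the access/secret keys generated above:
import boto.s3.connection

access_key = '48AIAPCYK7S4X9P72VOW'
secret_key = 'oC5qKL0BMMzUJHAS76rQAwIoJh4s6NwTnLklnQYX'

# Connect to the RADOS gateway
conn = boto.connect_s3(
    aws_access_key_id=access_key,
    aws_secret_access_key=secret_key,
    host='192.168.182.130', port=80,
    is_secure=False,
    calling_format=boto.s3.connection.OrdinaryCallingFormat(),
)

# Create the bucket, then list all buckets with their creation time
bucket = conn.create_bucket('my-first-s3-bucket')
for b in conn.get_all_buckets():
    print("{name}\t{created}".format(name=b.name, created=b.creation_date))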
Create a Swift user:
# To access the object gateway via Swift, a Swift user is required; we create a subuser as a sub-user.
radosgw-admin subuser create --uid=rgwuser --subuser=rgwuser:swift --access=full
Output:
{
    "user_id": "rgwuser",
    "display_name": "This is first rgw test user",
    "email": "",
    "suspended": 0,
    "max_buckets": 1000,
    "subusers": [
        { "id": "rgwuser:swift", "permissions": "full-control" }
    ],
    "keys": [
        { "user": "rgwuser", "access_key": "48AIAPCYK7S4X9P72VOW", "secret_key": "oC5qKL0BMMzUJHAS76rQAwIoJh4s6NwTnLklnQYX" }
    ],
    "swift_keys": [
        { "user": "rgwuser:swift", "secret_key": "AVThl3FGiVQW3VepkQl4Wsoyq9lbPlLlpKhXLhtR" }
    ],
    "caps": [],
    "op_mask": "read, write, delete",
    "default_placement": "",
    "default_storage_class": "",
    "placement_tags": [],
    "bucket_quota": { "enabled": false, "check_on_raw": false, "max_size": -1, "max_size_kb": 0, "max_objects": -1 },
    "user_quota": { "enabled": false, "check_on_raw": false, "max_size": -1, "max_size_kb": 0, "max_objects": -1 },
    "temp_url_keys": [],
    "type": "rgw",
    "mfa_ids": []
}
Test access to the Swift interface:
# Note: the following commands require a Python environment and a working pip.
yum install python-pip -y
pip install --upgrade python-swiftclient
# Test
swift -A http://192.168.182.130/auth/1.0 -U rgwuser:swift -K 'AVThl3FGiVQW3VepkQl4Wsoyq9lbPlLlpKhXLhtR' list