【Distributed Application】Ceph Distributed Storage

1. Storage basics

1.1 Stand-alone storage device

DAS (Direct Attached Storage, storage directly connected to the motherboard bus of the computer)

  • Disks with IDE, SATA, SCSI, SAS, USB interfaces
  • These interfaces expose the disk as a device driven by the local storage controller, providing block-level storage

NAS (Network Attached Storage: storage attached over the network to the file system of the current host)
NFS, CIFS, FTP

  • File-system-level storage: the device itself already provides a complete file system. It is exported into user space (for example through the NFS interface), and the client communicates with the remote host via a kernel module so the share can be used like a local file system. Because what is exported is already a file system, the client cannot format it again to create its own file system at the block level

SAN (Storage Area Network)

  • SCSI protocol (used only to transmit data access operations; the physical layer uses SCSI cables), FC SAN (the physical layer uses optical fiber), iSCSI (the physical layer uses Ethernet)
  • It is also a form of network storage, but the difference is that the interface a SAN presents to the client host is block-level storage

1.2 The problem of stand-alone storage

Insufficient storage processing capacity

  • A traditional IDE disk delivers roughly 100 IOPS, a SATA disk about 500 IOPS, and a solid-state drive 2000-4000 IOPS. Even if disk IO capacity were dozens of times larger, it still could not withstand the hundreds of thousands, millions, or more simultaneous users at the peak of website access, and it is also limited by the IO capacity of the host's network.

Insufficient storage capacity

  • No matter how large a single disk is, it cannot keep up with the data capacity that users' normal access requires.

Single point of failure

  • Data stored on a single machine is subject to a single point of failure

Commercial Storage Solutions

  • EMC, NetAPP, IBM, DELL, Huawei, Inspur

Distributed storage (software-defined storage SDS)

  • Ceph, TFS, FastDFS, MooseFS (MFS), HDFS, GlusterFS (GFS)
  • The storage mechanism will disperse and store data on multiple nodes, which has the advantages of high scalability, high performance, and high availability.

1.3 Types of distributed storage

  • Block storage (like a raw hard disk; one volume is generally mounted and used by a single server; suitable for allocating storage volumes to containers or virtual machines, and for log and file storage): a bare device providing unorganized storage space, with the underlying layer storing data in blocks

  • File storage (such as NFS; it solves the problem that block storage cannot be shared, so one share can be mounted by multiple servers at the same time; suitable for directory-structured storage and log storage): an interface for organizing and storing data, generally built on top of block storage; data is stored as files, and a file's metadata and actual data are stored separately

  • Object storage (such as OSS; one store can be accessed by multiple services at the same time; it offers the high-speed read/write of block storage together with the sharing characteristics of file storage; suitable for image and video storage): file storage provided through an API, where each file is an object and objects may have different sizes; an object's metadata and actual data are stored together

2. Introduction to Ceph

  • Ceph is developed in C++ language and is an open, self-healing and self-managing open source distributed storage system. It has the advantages of high scalability, high performance and high reliability.

  • Ceph is currently supported by many cloud computing vendors and is widely used. RedHat, OpenStack, and Kubernetes can all be integrated with Ceph to support the back-end storage of virtual machine images.

  • It is roughly estimated that 70%-80% of cloud platforms in China use Ceph as the underlying storage platform, which shows that Ceph has become a de facto standard for open-source cloud platforms. Domestic companies that have successfully built distributed storage systems with Ceph include Huawei, Alibaba, ZTE, H3C, Inspur, China Mobile, NetEase, LeTV, 360, Tianhe Storage, Shanyan Data, and others.

2.1 Advantages of Ceph

  • High scalability: decentralized, supports the use of ordinary X86 servers, supports the scale of thousands of storage nodes, and supports expansion from TB to EB level.
  • High reliability: no single point of failure, multiple data copies, automatic management, automatic repair.
  • High performance: Abandoning the traditional centralized storage metadata addressing scheme, using the CRUSH algorithm, the data distribution is balanced, and the degree of parallelism is high.
  • Powerful functions: Ceph is a unified storage system that integrates block storage interface (RBD), file storage interface (CephFS), and object storage interface (RadosGW), so it is suitable for different application scenarios.

2.2 Ceph Architecture

From bottom to top, the Ceph system can be divided into four levels:

  • RADOS basic storage system (Reliable, Autonomic, Distributed Object Store: reliable, automated, distributed object storage)
    RADOS is the lowest-level functional module of Ceph. It is an infinitely scalable object storage service that splits files into countless objects (shards) and stores them on disks, which greatly improves data reliability. It consists mainly of OSDs and Monitors, both of which can be deployed across multiple servers; this is where Ceph's distribution and high scalability come from.

  • LIBRADOS basic library
    Librados provides the way to interact with RADOS and exposes the Ceph service APIs to upper-layer applications, so RBD, RGW, and CephFS are all accessed through Librados. PHP, Ruby, Java, Python, Go, C, and C++ are currently supported, allowing client applications to be developed directly against RADOS (rather than the whole Ceph stack). A small hands-on sketch using the rados command-line tool follows this list.

  • High-level application interfaces: three parts
    1) Object storage interface RGW (RADOS Gateway)
    A gateway interface: an object storage system built on Librados that provides a RESTful API compatible with S3 and Swift.
    2) Block storage interface RBD (Reliable Block Device)
    Provides a block device interface on top of Librados, mainly used for hosts/VMs.
    3) File storage interface CephFS (Ceph File System)
    A POSIX-compliant file system that uses the Ceph storage cluster to store user data, built on the distributed file system interface provided by Librados.

  • Application layer: the various applications developed on top of the high-level interfaces or the basic library Librados, and the many clients such as hosts and VMs
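
To make the Librados/RADOS layer tangible, the rados command-line tool (itself built on Librados) can read and write RADOS objects directly. A minimal sketch, assuming the cluster is already running and a pool named mypool exists (pool creation is covered in section 5):

echo "hello ceph" > /tmp/hello.txt
rados -p mypool put obj-hello /tmp/hello.txt     #store the local file in RADOS as object obj-hello
rados -p mypool ls                               #list the objects in the pool
rados -p mypool get obj-hello /tmp/hello.out     #read the object back into a local file
rados -p mypool rm obj-hello                     #delete the object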

2.3 Ceph core components

  • Ceph is an object-based storage system. It divides each data stream to be managed (such as a file) into one or more object data items (Objects) of fixed size (4 MiB by default) and uses the object as the atomic unit for reading and writing data.

OSD (Object Storage Daemon, daemon process ceph-osd)

The daemon responsible for physical storage. It is generally configured with a one-to-one mapping to disks: one disk runs one OSD process. Its main jobs are storing data, replicating data, rebalancing data, recovering data, and exchanging heartbeats with other OSDs, and it returns the requested data in response to client requests. At least 3 OSDs are normally required for redundancy and high availability.

PG (Placement Group placement group)

PG is just a virtual concept and does not exist physically. It is similar to the index in the database in data addressing: Ceph first maps each object data to a PG through the HASH algorithm, and then maps the PG to the OSD through the CRUSH algorithm.

Pool

  • Pool is a logical partition for storing objects, which acts as a namespace. Each Pool contains a certain number (configurable) of PGs. Pool can be used as a fault isolation domain, which can be isolated according to different user scenarios.

There are two types of data storage methods in the Pool:

  • Multiple copies (replicated): similar to RAID 1; an object is stored as 3 copies by default, placed on different OSDs
  • Erasure coding (erasure): similar to RAID 5; it uses somewhat more CPU but saves disk space, since only one full copy of the object data is kept (plus parity). Because some Ceph features do not support erasure-coded pools, this type of storage pool is used less often (a creation sketch follows this list)
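
A brief sketch of creating both kinds of pools with the standard commands; the pool names and PG counts below are arbitrary examples:

ceph osd pool create repl-pool 32 32 replicated      #replicated pool; objects get 3 copies by default
ceph osd pool get repl-pool size                     #show the replica count

ceph osd pool create ec-pool 32 32 erasure           #erasure-coded pool using the default erasure-code profile
ceph osd erasure-code-profile get default            #inspect the default profile's k (data chunks) and m (coding chunks)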

The relationship between Pool, PG, and OSD:

  • A Pool contains many PGs; a PG contains a group of objects, and an object belongs to only one PG; each PG has a primary and replicas, and a PG is distributed across different OSDs (for replicated pools). See the inspection sketch below.
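
A minimal inspection sketch, assuming a pool named mypool already exists (the PG id passed to pg map is only an example):

ceph pg ls-by-pool mypool | head          #list the PGs in the pool together with the OSD set each one maps to
ceph pg map 1.1f                          #show the up/acting OSD set of a single PG
ceph osd tree                             #show all OSDs and the CRUSH hierarchy they belong to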

Monitor (daemon process ceph-mon)

Stores the metadata describing the OSDs. It is responsible for maintaining the cluster state maps (the Cluster Map: OSD Map, Monitor Map, PG Map, and CRUSH Map), maintaining the various charts that show cluster state, and managing authentication and authorization of cluster clients. A Ceph cluster usually needs at least 3 or 5 (an odd number of) Monitor nodes for redundancy and high availability; they synchronize data among themselves using the Paxos protocol.

Manager (daemon ceph-mgr)

Responsible for tracking runtime metrics and the current state of the Ceph cluster, including storage utilization, current performance metrics, and system load. Provides additional monitoring and interfaces to external monitoring and management systems, such as zabbix, prometheus, cephmetrics, etc. A Ceph cluster usually requires at least 2 mgr nodes to achieve high availability, and information synchronization between nodes is realized based on the raft protocol.
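
The mgr delivers most of its functionality through modules that can be listed and enabled at runtime; a small sketch (the prometheus module here is just one example, and the dashboard module is configured in detail in section 4.9):

ceph mgr module ls                        #list the enabled and available mgr modules
ceph mgr module enable prometheus         #expose cluster metrics for a Prometheus server to scrape
ceph mgr services                         #show the URLs served by the active mgr's modules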

MDS (Metadata Server, daemon ceph-mds)

The metadata service on which CephFS depends. It is responsible for storing the file system's metadata and managing the directory structure. Object storage and block storage do not need a metadata service; if you do not use CephFS, the MDS does not need to be installed.
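
If CephFS is needed on this cluster, an MDS and a file system have to be created explicitly. A hedged sketch using ceph-deploy from /etc/ceph on the admin node; the pool names, PG counts, and file system name are illustrative:

ceph-deploy mds create node01                          #deploy an MDS daemon on node01
ceph osd pool create cephfs_metadata 32 32             #pool for CephFS metadata
ceph osd pool create cephfs_data 64 64                 #pool for CephFS file data
ceph fs new mycephfs cephfs_metadata cephfs_data       #create the file system from the two pools
ceph fs status mycephfs                                #confirm that an MDS has become active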

2.4 OSD storage backend

  • OSDs have two ways of managing the data they store. In Luminous 12.2.z and later releases, the default (and recommended) backend is BlueStore. Before Luminous was released, FileStore was the default and only option.

Filestore

FileStore is the legacy way of storing objects in Ceph. It relies on a standard file system (XFS only) together with a key/value database (traditionally LevelDB; BlueStore uses RocksDB instead) to store and manage metadata.
FileStore is well tested and used extensively in production. However, due to its overall design and dependence on traditional file systems, it has many shortcomings in performance.

Bluestore

BlueStore is a special-purpose storage backend designed specifically for managing data on disk for OSD workloads. BlueStore's design is based on a decade of experience supporting and managing Filestores. Compared with Filestore, BlueStore has better read and write performance and security.

Key features of BlueStore include:

  • 1) BlueStore directly manages the storage device, that is, directly uses the original block device or partition to manage the data on the disk. This avoids the intervention of abstraction layers (such as local file systems such as XFS), which can limit performance or increase complexity.
  • 2) BlueStore uses RocksDB for metadata management. RocksDB's key/value database is embedded in order to manage internal metadata, including mapping object names to block locations on disk.
  • 3) All data and metadata written to BlueStore is protected by one or more checksums. No data or metadata is read from disk or returned to the user without verification.
  • 4) Support for inline compression. Data can optionally be compressed before being written to disk.
  • 5) Support for multi-device metadata layering. BlueStore allows its internal log (the WAL write-ahead log) to be written to a separate high-speed device (such as SSD, NVMe, or NVDIMM) for better performance. Internal metadata can also be stored on faster devices if plenty of faster storage is available (see the sketch after this list).
  • 6) Support efficient copy-on-write. RBD and CephFS snapshots rely on the copy-on-write cloning mechanism efficiently implemented in BlueStore. This will result in efficient I/O for regular snapshots and erasure-coded pools (which rely on clones for efficient two-phase commit).
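
To make feature 5 above concrete: when an OSD is created, its data, RocksDB metadata, and WAL can be placed on different devices. A sketch using ceph-deploy; the device paths are placeholders (on this cluster /dev/sdb is already used for data-only OSDs, so substitute whatever devices are actually free):

ceph-deploy --overwrite-conf osd create node01 --data /dev/sdb --block-db /dev/nvme0n1p1 --block-wal /dev/nvme0n1p2
#data lives on the HDD, RocksDB metadata on one NVMe partition, and the WAL on another

ceph osd metadata 0 | grep -E 'osd_objectstore|bluefs'   #confirm the OSD uses BlueStore and see where its devices are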

2.5 Ceph data storage process

  • 1) The client obtains the latest Cluster Map from mon

  • 2) In Ceph everything is an object. Data stored in Ceph is split into one or more fixed-size objects (Objects); the object size can be adjusted by the administrator and is usually 2M or 4M.
    Each object has a unique OID, composed of ino and ono:
    ● ino: the FileID of the file, which uniquely identifies each file globally
    ● ono: the number of the slice
    For example, if a file with FileID A is cut into two objects, one numbered 0 and the other numbered 1, then the OIDs of these two objects are A0 and A1.
    The advantage of the OID is that it uniquely identifies every object and records which file the object belongs to. Since all data in Ceph is virtualized into uniform objects, read and write efficiency is relatively high.

  • 3) Hashing the OID yields a hexadecimal feature code; taking that feature code modulo the total number of PGs in the Pool gives the PG serial number, the PGID.
    That is: PGID = Pool_ID + HASH(OID) % PG_NUM

  • 4) The PG is replicated according to the configured number of copies: the CRUSH algorithm is applied to the PGID to compute the IDs of the target primary and secondary OSDs for that PG, and the data is stored on those different OSD nodes (in effect, all objects in a PG are stored on its OSDs).
    That is, CRUSH(PGID) determines the group of OSDs on which the PG's data is stored.
    CRUSH is the data distribution algorithm used by Ceph, similar in spirit to consistent hashing; it places data where it is expected to be. The mapping for any object can be checked with the sketch below.
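
The whole Object -> PG -> OSD calculation can be observed with a single command on a running cluster; a sketch assuming a pool named mypool (the object does not even have to exist, since the mapping is a pure calculation):

ceph osd map mypool testobject      #prints the PG the object hashes to and the up/acting OSD set chosen by CRUSH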

2.6 Ceph Version Release Lifecycle

  • Starting from the Nautilus version (14.2.0), Ceph will have a new stable version released every year, which is expected to be released in March every year. Every year, the new version will have a new name (for example, "Mimic") and a main version number (for example, 13 for Mimic, since "M" is the 13th letter of the alphabet).

  • The version number has the format x.y.z. x indicates the release cycle (for example, 13 for Mimic, 17 for Quincy), and y indicates the type of release:
    x.0.z : y equals 0, a development version
    x.1.z : y equals 1, a release candidate (for test clusters)
    x.2.z : y equals 2, a stable/bugfix release (for users)
    The sketch below shows how to check which release a running cluster uses.
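
A quick sketch for checking the release actually in use, per daemon:

ceph --version          #version of the locally installed ceph command-line tools
ceph versions           #versions reported by every running mon, mgr, osd, and mds daemon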

2.7 Ceph cluster deployment

At present, Ceph officially provides many ways to deploy Ceph clusters. The commonly used methods are ceph-deploy, cephadm and binary:

  • ceph-deploy: A cluster automation deployment tool that has been used for a long time, is mature and stable, is integrated by many automation tools, and can be used for production deployment.

  • cephadm: available from Octopus onward; it deploys the Ceph cluster using containers and systemd to install and manage the daemons. Not recommended for production environments at the time of writing.

  • Binary: Manual deployment, deploying the Ceph cluster step by step, supports more customization and understanding of deployment details, and is more difficult to install.

3. Deploy Ceph cluster based on ceph-deploy

//Ceph production environment recommendations:
1. Use a 10G network for all storage clusters.
2. Separate the cluster network (cluster-network, used for internal cluster communication) from the public network (public-network, used for external access to the Ceph cluster).
3. Deploy mon, mds, and osd on separate hosts (in a test environment one host may run multiple components).
4. OSDs may also use SATA disks.
5. Plan the cluster according to capacity.
6. Use a Xeon E5 2620 V3 or better CPU and 64GB or more memory.
7. Distribute the cluster hosts across cabinets to avoid power or network failure of a single cabinet.

3.1 Ceph Environment Planning

Hostname   Public network     Cluster network    Role
admin      192.168.243.104    -                  admin (management node, responsible for overall cluster deployment), client
node01     192.168.243.100    192.168.100.101    mon, mgr, osd (/dev/sdb, /dev/sdc, /dev/sdd)
node02     192.168.243.102    192.168.100.102    mon, mgr, osd (/dev/sdb, /dev/sdc, /dev/sdd)
node03     192.168.243.103    192.168.100.103    mon, osd (/dev/sdb, /dev/sdc, /dev/sdd)
client     192.168.243.105    -                  client

Optional Step: Create Ceph Admin Users

useradd cephadm
passwd cephadm

visudo
cephadm ALL=(root) NOPASSWD:ALL

Disable SELinux and the firewall

systemctl disable --now firewalld
setenforce 0
sed -i 's/enforcing/disabled/' /etc/selinux/config

3.2 Set the host name according to the plan

hostnamectl set-hostname admin
hostnamectl set-hostname node01
hostnamectl set-hostname node02
hostnamectl set-hostname node03
hostnamectl set-hostname client

3.3 Configure hosts resolution

vim /etc/hosts
192.168.243.100 node01
192.168.243.102 node02
192.168.243.103 node03
192.168.243.104 admin
192.168.243.105 client

3.4. Install common software and dependent packages

yum -y install epel-release
yum -y install yum-plugin-priorities yum-utils ntpdate python-setuptools python-pip gcc gcc-c++ autoconf libjpeg libjpeg-devel libpng libpng-devel freetype freetype-devel libxml2 libxml2-devel zlib zlib-devel glibc glibc-devel glib2 glib2-devel bzip2 bzip2-devel zip unzip ncurses ncurses-devel curl curl-devel e2fsprogs e2fsprogs-devel krb5-devel libidn libidn-devel openssl openssh openssl-devel nss_ldap openldap openldap-devel openldap-clients openldap-servers libxslt-devel libevent-devel ntp libtool-ltdl bison libtool vim-enhanced python wget lsof iptraf strace lrzsz kernel-devel kernel-headers pam-devel tcl tk cmake ncurses-devel bison setuptool popt-devel net-snmp screen perl-devel pcre-devel net-snmp screen tcpdump rsync sysstat man iptables sudo libconfig git bind-utils tmux elinks numactl iftop bwm-ng net-tools expect snappy leveldb gdisk python-argparse gperftools-libs conntrack ipset jq libseccomp socat chrony sshpass

3.5. Configure ssh on the admin management node to log in to all nodes without password

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
sshpass -p 'abc1234' ssh-copy-id -o StrictHostKeyChecking=no root@admin
sshpass -p 'abc1234' ssh-copy-id -o StrictHostKeyChecking=no root@node01
sshpass -p 'abc1234' ssh-copy-id -o StrictHostKeyChecking=no root@node02
sshpass -p 'abc1234' ssh-copy-id -o StrictHostKeyChecking=no root@node03

3.6. Configure time synchronization

systemctl enable --now chronyd
timedatectl set-ntp true					#enable NTP
timedatectl set-timezone Asia/Shanghai		#set the time zone
chronyc -a makestep							#force the system clock to sync immediately
timedatectl status							#check the time synchronization status
chronyc sources -v							#view the NTP source server information
timedatectl set-local-rtc 0					#write the current UTC time to the hardware clock

#restart services that depend on the system time
systemctl restart rsyslog 
systemctl restart crond

#disable unneeded services
systemctl disable --now postfix

3.7 Configure Ceph yum source

wget https://download.ceph.com/rpm-nautilus/el7/noarch/ceph-release-1-1.el7.noarch.rpm --no-check-certificate

rpm -ivh ceph-release-1-1.el7.noarch.rpm --force
# After all the operations above are complete, reboot all hosts (optional)
sync
reboot

4. Deploy Ceph cluster

4.1. Create a Ceph working directory on all nodes; subsequent work will be carried out in this directory

mkdir -p /etc/ceph

4.2. Install the ceph-deploy deployment tool

cd /etc/ceph
yum install -y ceph-deploy

ceph-deploy --version

4.3. Install the Ceph packages on the nodes from the admin (management) node

#ceph-deploy 2.0.1 deploys the Mimic release of Ceph by default; to install another release, specify it manually with --release
cd /etc/ceph
ceph-deploy install --release nautilus node0{1..3} admin

#ceph-deploy install essentially just runs the following commands:
yum clean all
yum -y install epel-release
yum -y install yum-plugin-priorities
yum -y install ceph-release ceph ceph-radosgw

#Alternatively, install the Ceph packages manually by running the following commands on the other nodes:
sed -i 's#download.ceph.com#mirrors.tuna.tsinghua.edu.cn/ceph#' /etc/yum.repos.d/ceph.repo
yum install -y ceph-mon ceph-radosgw ceph-mds ceph-mgr ceph-osd ceph-common ceph

4.4. Generate initial configuration

#Run the following command on the admin node to tell ceph-deploy which nodes are the mon monitor nodes
cd /etc/ceph
ceph-deploy new --public-network 192.168.243.0/24 --cluster-network 192.168.100.0/24 node01 node02 node03

#After the command succeeds, configuration files are generated under /etc/ceph
ls /etc/ceph
ceph.conf					#the Ceph configuration file
ceph-deploy-ceph.log		#the monitor log
ceph.mon.keyring			#the monitor keyring file

4.5. Initialize the mon node on the management node

cd /etc/ceph
ceph-deploy mon create node01 node02 node03			#create the mon nodes; since the monitors use the Paxos algorithm, a highly available cluster needs an odd number of nodes, at least 3

ceph-deploy --overwrite-conf mon create-initial		#initialize the mon nodes and sync the configuration to all nodes
													# the --overwrite-conf option forces the configuration file to be overwritten

ceph-deploy gatherkeys node01						#optional: collect all the keys from node01

After the command succeeds, the following configuration and keyring files are generated under /etc/ceph

ls /etc/ceph
ceph.bootstrap-mds.keyring			#bootstrap key file for starting mds
ceph.bootstrap-mgr.keyring			#bootstrap key file for starting mgr
ceph.bootstrap-osd.keyring			#bootstrap key file for starting osd
ceph.bootstrap-rgw.keyring			#bootstrap key file for starting rgw
ceph.client.admin.keyring			#authentication key for communication between the Ceph client and the admin side; it carries full permissions on the Ceph cluster
ceph.conf
ceph-deploy-ceph.log
ceph.mon.keyring

View the automatically started mon process on the mon node

ps aux | grep ceph
root        1823  0.0  0.2 189264  9216 ?        Ss   19:46   0:00 /usr/bin/python2.7 /usr/bin/ceph-crash
ceph        3228  0.0  0.8 501244 33420 ?        Ssl  21:08   0:00 /usr/bin/ceph-mon -f --cluster ceph --id node03 --setuser ceph --setgroup ceph
root        3578  0.0  0.0 112824   988 pts/1    R+   21:24   0:00 grep --color=auto ceph

View Ceph cluster status on the management node

cd /etc/ceph
ceph -s
  cluster:
    id:     7e9848bb-909c-43fa-b36c-5805ffbbeb39
    health: HEALTH_WARN
            mons are allowing insecure global_id reclaim
 
  services:
    mon: 3 daemons, quorum node01,node02,node03
    mgr: no daemons active
    osd: 0 osds: 0 up, 0 in
 
  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:

View the status of the mon cluster election

ceph quorum_status --format json-pretty | grep leader
"quorum_leader_name": "node01",

Expand mon node

ceph-deploy mon add <node name>

4.6. Deploy nodes capable of managing Ceph clusters (optional)

This allows ceph commands to be executed on each node to manage the cluster

cd /etc/ceph
ceph-deploy --overwrite-conf config push node01 node02 node03		#push the configuration to all mon nodes, ensuring ceph.conf is identical on every mon node

ceph-deploy admin node01 node02 node03			#essentially copies the ceph.client.admin.keyring cluster authentication file to each node

View on mon node

ls /etc/ceph
ceph.client.admin.keyring  ceph.conf  rbdmap  tmpr8tzyc

cd /etc/ceph
ceph -s

4.7. Deploy osd storage nodes

After adding the hard disks to the hosts, do not partition them; use the raw devices directly

lsblk 
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda      8:0    0   60G  0 disk 
├─sda1   8:1    0  500M  0 part /boot
├─sda2   8:2    0    4G  0 part [SWAP]
└─sda3   8:3    0 55.5G  0 part /
sdb      8:16   0   20G  0 disk 
sdc      8:32   0   20G  0 disk 
sdd      8:48   0   20G  0 disk 

If a disk was used before, wipe it first (delete the partition table); this is optional for new disks that contain no data

cd /etc/ceph
ceph-deploy disk zap node01 /dev/sdb
ceph-deploy disk zap node02 /dev/sdb
ceph-deploy disk zap node03 /dev/sdb

Add the OSDs

ceph-deploy --overwrite-conf osd create node01 --data /dev/sdb
ceph-deploy --overwrite-conf osd create node02 --data /dev/sdb
ceph-deploy --overwrite-conf osd create node03 --data /dev/sdb

View ceph cluster status

ceph -s
  cluster:
    id:     7e9848bb-909c-43fa-b36c-5805ffbbeb39
    health: HEALTH_WARN
            no active mgr
 
  services:
    mon: 3 daemons, quorum node01,node02,node03 (age 119m)
    mgr: no daemons active
    osd: 3 osds: 3 up (since 35s), 3 in (since 35s)
 
  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   3.0 GiB used, 57 GiB / 60 GiB avail
    pgs: 


ceph osd stat
ceph osd tree
rados df
ssh root@node01 systemctl status ceph-osd@0
ssh root@node02 systemctl status ceph-osd@1
ssh root@node03 systemctl status ceph-osd@2

ceph osd status    #check the status of the OSDs; this can only be run after the mgr has been deployed

+----+--------+-------+-------+--------+---------+--------+---------+-----------+
| id |  host  |  used | avail | wr ops | wr data | rd ops | rd data |   state   |
+----+--------+-------+-------+--------+---------+--------+---------+-----------+
| 0  | node01 | 1025M | 18.9G |    0   |     0   |    0   |     0   | exists,up |
| 1  | node02 | 1025M | 18.9G |    0   |     0   |    0   |     0   | exists,up |
| 2  | node03 | 1025M | 18.9G |    0   |     0   |    0   |     0   | exists,up |
+----+--------+-------+-------+--------+---------+--------+---------+-----------+

ceph osd df    #check OSD capacity; this also requires the mgr to be deployed first

ID CLASS WEIGHT  REWEIGHT SIZE   RAW USE DATA    OMAP META  AVAIL  %USE VAR  PGS STATUS 
 0   hdd 0.01949  1.00000 20 GiB 1.0 GiB 1.8 MiB  0 B 1 GiB 19 GiB 5.01 1.00   0     up 
 1   hdd 0.01949  1.00000 20 GiB 1.0 GiB 1.8 MiB  0 B 1 GiB 19 GiB 5.01 1.00   0     up 
 2   hdd 0.01949  1.00000 20 GiB 1.0 GiB 1.8 MiB  0 B 1 GiB 19 GiB 5.01 1.00   0     up 
                    TOTAL 60 GiB 3.0 GiB 5.2 MiB  0 B 3 GiB 57 GiB 5.01                 
MIN/MAX VAR: 1.00/1.00  STDDEV: 0

Expand osd nodes

cd /etc/ceph
ceph-deploy --overwrite-conf osd create node01 --data /dev/sdc
ceph-deploy --overwrite-conf osd create node02 --data /dev/sdc
ceph-deploy --overwrite-conf osd create node03 --data /dev/sdc
ceph-deploy --overwrite-conf osd create node01 --data /dev/sdd
ceph-deploy --overwrite-conf osd create node02 --data /dev/sdd
ceph-deploy --overwrite-conf osd create node03 --data /dev/sdd

  • Adding OSDs involves PG migration. Since the cluster holds no data yet, the health status quickly returns to OK; adding nodes in a production environment, however, means migrating large amounts of data. The sketch below shows how to watch the rebalancing.
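
While the expansion runs, the rebalancing can be watched in real time; a small sketch:

ceph -s                       #overall health; recovery/backfill lines appear while PGs migrate
ceph -w                       #stream cluster events and follow the rebalancing progress (Ctrl+C to stop)
ceph osd df                   #confirm the new OSDs are receiving data as PGs move onto them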

4.8. Deploy mgr node

  • The ceph-mgr daemon runs in Active/Standby mode, which ensures that if the active node or its ceph-mgr daemon fails, one of the standby instances can take over without interrupting service. According to the official architecture principles, mgr needs at least two nodes for high availability.

cd /etc/ceph
ceph-deploy mgr create node01 node02


ceph -s
  cluster:
    id:     7e9848bb-909c-43fa-b36c-5805ffbbeb39
    health: HEALTH_WARN
            mons are allowing insecure global_id reclaim
 
  services:
    mon: 3 daemons, quorum node01,node02,node03
    mgr: node01(active, since 10s), standbys: node02
    osd: 0 osds: 0 up, 0 in

Fix the HEALTH_WARN issue: mons are allowing insecure global_id reclaim

Disable the insecure mode: ceph config set mon auth_allow_insecure_global_id_reclaim false

Expand the mgr node

ceph-deploy mgr create <node name>

4.9. Enable the monitoring module

Run the following commands on the active ceph-mgr node to enable it

ceph -s | grep mgr

yum install -y ceph-mgr-dashboard

cd /etc/ceph

ceph mgr module ls | grep dashboard
#enable the dashboard module
ceph mgr module enable dashboard --force
#disable SSL for the dashboard
ceph config set mgr mgr/dashboard/ssl false
#configure the address and port the dashboard listens on
ceph config set mgr mgr/dashboard/server_addr 0.0.0.0
ceph config set mgr mgr/dashboard/server_port 8000
# restart the dashboard
ceph mgr module disable dashboard
ceph mgr module enable dashboard --force

Confirm the URL of the dashboard

ceph mgr services

#set the dashboard account and password
echo "12345678" > dashboard_passwd.txt
ceph dashboard set-login-credentials admin -i dashboard_passwd.txt
  or
ceph dashboard ac-user-create admin administrator -i dashboard_passwd.txt

Access it in a browser at http://192.168.243.100:8000 (the public IP of node01, the active mgr); the account and password are admin/12345678

5. Resource pool (Pool) management

  • Above we have completed the deployment of the Ceph cluster, but how do we store data in Ceph? First we need to define a Pool resource pool in Ceph. Pool is an abstract concept for storing Object objects in Ceph. We can understand it as a logical partition on Ceph storage. Pool is composed of multiple PGs; PGs are mapped to different OSDs through the CRUSH algorithm; at the same time, Pool can set the replica size, and the default number of replicas is 3.

  • The Ceph client requests the status of the cluster from the monitor, and writes data to the Pool. According to the number of PGs, the data is mapped to different OSD nodes through the CRUSH algorithm to realize data storage. Here we can understand Pool as a logical unit for storing Object data; of course, the current cluster does not have a resource pool, so it needs to be defined.

  • Create a Pool resource pool named mypool and set the number of PGs to 64. When setting the number of PGs you also need to set PGP (usually the PG and PGP values are the same):

PG (Placement Group): a virtual concept used to hold objects. PGP (Placement Group for Placement purpose): effectively an arrangement of OSDs in which the PGs are placed.
cd /etc/ceph
ceph osd pool create mypool 64 64

#view the cluster's Pool information
ceph osd pool ls    or    rados lspools
ceph osd lspools

#view the number of replicas in the pool
ceph osd pool get mypool size

#view the number of PGs and PGPs
ceph osd pool get mypool pg_num
ceph osd pool get mypool pgp_num

#change pg_num and pgp_num to 128
ceph osd pool set mypool pg_num 128
ceph osd pool set mypool pgp_num 128

ceph osd pool get mypool pg_num
ceph osd pool get mypool pgp_num

#change the number of replicas of the Pool to 2
ceph osd pool set mypool size 2
ceph osd pool get mypool size

#change the default number of replicas to 2
vim ceph.conf
......
osd_pool_default_size = 2
#push to the other nodes
ceph-deploy --overwrite-conf config push node01 node02 node03
#restart the mon on each node
systemctl restart ceph-mon.target

5.1 Delete the Pool resource pool

1) Deleting a storage pool risks destroying data, so Ceph forbids the operation by default. The administrator must first enable pool deletion in the ceph.conf configuration file.

vim ceph.conf
......
[mon]
mon allow pool delete = true

2) Push the ceph.conf configuration file to all mon nodes

ceph-deploy --overwrite-conf config push node01 node02 node03

3) All mon nodes restart the ceph-mon service

systemctl restart ceph-mon.target

4) Execute the delete Pool command

ceph osd pool rm pool01 pool01 --yes-i-really-really-mean-it
