Distributed storage: Ceph introduction and deployment

1: Types of storage
1. Stand-alone storage device
●DAS (direct-attached storage, connected directly to the computer's motherboard bus)
IDE, SATA, SCSI, SAS, USB interface disks
The so-called interface here means a disk device driven by the storage controller, providing block-level storage

●NAS (Network Attached Storage, which is storage attached to the current host file system through the network)
NFS, CIFS, FTP
File-system-level storage is itself an already-formatted file system. It is exported in user space through an interface such as NFS; the client then communicates with the remote host over the network via a kernel module and uses it as if it were a local file system. With this kind of storage service, the client cannot format it again to create its own file system blocks

●SAN (storage area network)
SCSI protocol (only used to transmit data access operations, the physical layer uses SCSI cables for transmission), FCSAN (the physical layer uses optical fiber for transmission), iSCSI (the physical layer uses Ethernet for transmission)
It is also a kind of network storage, but the difference is that the interface provided by SAN to the client host is block-level storage


Problems with stand-alone storage
●Insufficient storage and processing capacity
A traditional IDE disk delivers roughly 100 IOPS, a SATA disk about 500 IOPS, and a solid-state drive 2000-4000 IOPS. Even if the IO capability of the disk were dozens of times larger, it still could not withstand hundreds of thousands, millions, or even hundreds of millions of users accessing simultaneously during the peak period of a website, and it is also limited by the IO capability of the host's network.

●Insufficient storage capacity
No matter how large the capacity of a single disk is, it cannot meet the ever-growing data capacity that users require for normal access.

●Single point of failure
There is a single point of failure for data stored on a single machine

Two: Commercial storage solutions
EMC, NetAPP, IBM, DELL, Huawei, Inspur

Three: Distributed storage (software-defined storage, SDS)
Ceph, TFS, FastDFS, MooseFS (MFS), GlusterFS (GFS)

The storage mechanism will disperse and store data on multiple nodes, which has the advantages of high scalability, high performance, and high availability.

#Types of distributed storage

●Block storage (like a hard disk; generally one storage volume is mounted by a single server; suitable for storage volume allocation for containers or virtual machines, log storage, and file storage). Block storage provides a storage volume that works like a hard drive, organized into blocks of the same size. Typically, either the operating system formats a block-based storage volume with a file system, or applications (such as databases) access it directly to store data.

●File storage (such as NFS; it solves the problem that block storage cannot be shared, so one storage can be mounted by multiple servers at the same time; suitable for directory-structure storage and log storage) allows data to be organized as a traditional file system. Data is kept in a file which has a name and some associated metadata such as modification timestamp, owner and access permissions. Provides file-based storage using a hierarchy of directories and subdirectories to organize how files are stored.

●Object storage (such as OSS; one storage can be accessed by multiple services at the same time; it has the high-speed read/write capability of block storage as well as the sharing characteristic of file storage; suitable for image and video storage). Storage is provided through an API interface: each file is an object, objects can be of different sizes, and an object's metadata is stored together with its actual data.
Object storage allows arbitrary data and metadata to be stored as a unit, tagged with a unique identifier within a flat storage pool. Data is stored and retrieved using APIs instead of being accessed as blocks or through a file system hierarchy.

2 Introduction to Ceph
Ceph is developed using C++ language and is an open source distributed storage system that is open, self-healing and self-managing. It has the advantages of high scalability, high performance and high reliability.

Ceph is currently supported by many cloud computing vendors and is widely used. RedHat, OpenStack, and Kubernetes can all be integrated with Ceph to support the back-end storage of virtual machine images.
It is roughly estimated that 70%-80% of cloud platforms in China use Ceph as the underlying storage platform, which shows that Ceph has become a standard component of open source cloud platforms. Companies in China that have successfully built distributed storage systems with Ceph include Huawei, Alibaba, ZTE, H3C, Inspur, China Mobile, NetEase, LeTV, 360, Tristar Storage, Shanyan Data, etc.

Three Ceph advantages

●High scalability: decentralized, supports the use of ordinary X86 servers, supports the scale of thousands of storage nodes, and supports expansion from TB to EB level.
●High reliability: no single point of failure, multiple data copies, automatic management, automatic repair.
●High performance: Abandoning the traditional centralized storage metadata addressing scheme, using the CRUSH algorithm, the data distribution is balanced, and the degree of parallelism is high.
● Powerful functions: Ceph is a unified storage system that integrates block storage interface (RBD), file storage interface (CephFS), and object storage interface (RadosGW), so it is suitable for different application scenarios.

Four Ceph architecture

From bottom to top, the Ceph system can be divided into four levels:


●RADOS basic storage system (Reliable, Autonomic, Distributed Object Store, i.e. reliable, automated, distributed object storage)
RADOS is the lowest functional module of Ceph. It is an infinitely scalable object storage service that decomposes files into countless objects (shards) and stores them on hard disks, greatly improving data stability. It is mainly composed of OSDs and Monitors, and both OSDs and Monitors can be deployed across multiple servers; this is where Ceph's distribution and high scalability come from.
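As a quick illustration of this object-level nature (a sketch, not part of the original build steps; it assumes a running cluster and an existing pool named mypool), the rados command shipped with ceph-common can read and write raw RADOS objects directly:

echo "hello rados" > /tmp/hello.txt
rados -p mypool put hello-object /tmp/hello.txt #store the file as an object named hello-object
rados -p mypool ls #list the objects in the pool
rados -p mypool get hello-object /tmp/hello.out #read the object back into a local file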

●LIBRADOS basic library
Librados provides a way to interact with RADOS and exposes Ceph service API interfaces to upper-layer applications; the upper-layer RBD, RGW and CephFS are all accessed through Librados. Currently PHP, Ruby, Java, Python, Go, C and C++ are supported, so client applications can be developed directly against RADOS (rather than the whole Ceph stack).

●High-level application interface: includes three parts
1) Object storage interface RGW (RADOS Gateway)
A gateway interface; an object storage system developed on top of Librados that provides RESTful API interfaces compatible with S3 and Swift.

2) Block storage interface RBD (Reliable Block Device)
provides a block device interface based on Librados, which is mainly used for Host/VM.

3) The file storage interface CephFS (Ceph File System)
The Ceph file system provides a POSIX-compliant file system that uses the Ceph storage cluster to store user data on the file system. Based on the distributed file system interface provided by Librados.

●Application layer: various APPs developed based on high-level interfaces or the basic library Librados, or many clients such as Host and VM
 

Five Ceph Core Components

Ceph is an object-based storage system. It divides each data stream to be managed (such as a file) into one or more fixed-size objects (Object, 4 MB by default) and uses the object as the atomic unit (an atom being the smallest component of an element) for reading and writing data.

●OSD (Object Storage Daemon, daemon process ceph-osd)
is a process responsible for physical storage, and is generally configured to correspond to disks one by one, and one disk starts an OSD process. The main function is to store data, copy data, balance data, restore data, and perform heartbeat checks with other OSDs, and is responsible for the process of returning specific data in response to client requests. Typically at least 3 OSDs are required for redundancy and high availability.

●PG (Placement Group)
PG is a virtual concept and does not exist physically. It is similar to the index in the database in data addressing: Ceph first maps each object data to a PG through the HASH algorithm, and then maps the PG to the OSD through the CRUSH algorithm.

●Pool
Pool is a logical partition for storing objects and functions as a namespace. Each Pool contains a certain (configurable) number of PGs. A Pool can also serve as a fault isolation domain; isolation domains are not unified but are divided according to different user scenarios.

#Two types of data storage are supported in a Pool:
●Replicated: similar to RAID 1; by default each object is saved as 3 replicas, placed on different OSDs
●Erasure Code: similar to RAID 5; it consumes slightly more CPU, but saves disk space, and only one full copy of the object data is stored. Since some Ceph features do not support erasure-coded pools, this type of pool is not used much
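As a minimal sketch (the pool names and PG counts below are only examples), the two pool types can be created like this:

ceph osd pool create repl-pool 32 32 replicated #replicated is also the default type if omitted
ceph osd pool create ec-pool 32 32 erasure #uses the default erasure-code profile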

#Pool, PG and OSD relationship:
A Pool contains many PGs; a PG contains a number of objects, and one object can belong to only one PG; a PG has a primary and replicas, which are distributed across different OSDs (for the three-replica type)
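On a running cluster this chain can be inspected for any object name with ceph osd map (the pool and object names below are illustrative, and the object does not even need to exist yet):

ceph osd map mypool hello-object
#Typical output: ... pool 'mypool' (2) object 'hello-object' -> pg 2.xxxxxxxx (2.x) -> up ([3,1,5], p3) acting ([3,1,5], p3)
#i.e. the object hashes to one PG of the pool, and CRUSH places that PG on a primary OSD and its replica OSDs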

●Monitor (the daemon process ceph-mon)
is used to save the metadata of the OSD. Responsible for maintaining the mapping views of the cluster state (Cluster Map: OSD Map, Monitor Map, PG Map and CRUSH Map), maintaining various charts showing the cluster state, and managing cluster client authentication and authorization. A Ceph cluster usually requires at least 3 or 5 (odd number) Monitor nodes to achieve redundancy and high availability, and they synchronize data between nodes through the Paxos protocol.

● The Manager (daemon ceph-mgr)
is responsible for tracking runtime metrics and the current state of the Ceph cluster, including storage utilization, current performance metrics, and system load. Provides additional monitoring and interfaces to external monitoring and management systems, such as zabbix, prometheus, cephmetrics, etc. A Ceph cluster usually requires at least 2 mgr nodes to achieve high availability, and information synchronization between nodes is realized based on the raft protocol.

●MDS (Metadata Server, daemon process ceph-mds)
is the metadata service that CephFS depends on. It is responsible for saving file system metadata and managing the directory structure. Object storage and block device storage do not require a metadata service; if you do not use CephFS, it does not need to be installed.
 

Six OSD storage backend

OSDs have two ways of managing the data they store. In Luminous 12.2.z and later releases, the default (and recommended) backend is BlueStore. Before Luminous was released, FileStore was the default and only option.
● Filestore
FileStore is a legacy method of storing objects in Ceph. It relies on a standard file system (XFS only) combined with a key/value database (traditionally LevelDB, now RocksDB) to store and manage metadata.
FileStore is well tested and used extensively in production. However, due to its overall design and dependence on traditional file systems, it has many shortcomings in performance.

● Bluestore
BlueStore is a special-purpose storage backend designed specifically for OSD workload management of data on disk. BlueStore's design is based on a decade of experience supporting and managing Filestores. Compared with Filestore, BlueStore has better read and write performance and security.

#The main features of BlueStore include:
1) BlueStore directly manages storage devices, that is, directly uses raw block devices or partitions to manage data on disks. This avoids the intervention of abstraction layers (such as local file systems such as XFS), which can limit performance or increase complexity.
2) BlueStore uses RocksDB for metadata management. RocksDB's key/value database is embedded in order to manage internal metadata, including mapping object names to block locations on disk.
3) All data and metadata written to BlueStore is protected by one or more checksums. No data or metadata is read from disk or returned to the user without verification.
4) Support for inline compression. Data can optionally be compressed before being written to disk.
5) Support multi-device metadata layering. BlueStore allows its internal log (WAL write-ahead log) to be written to a separate high-speed device (such as SSD, NVMe or NVDIMM) for improved performance. Internal metadata can be stored on faster devices if there is plenty of faster storage available.
6) Support efficient copy-on-write. RBD and CephFS snapshots rely on the copy-on-write cloning mechanism efficiently implemented in BlueStore. This will result in efficient I/O for regular snapshots and erasure-coded pools (which rely on clones for efficient two-phase commit).
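As a quick check (a sketch; OSD id 0 is just an example), the backend an existing OSD uses can be read from its metadata:

ceph osd metadata 0 | grep osd_objectstore
#    "osd_objectstore": "bluestore",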
 

Seven The process of storing data in Ceph

1) The client obtains the latest Cluster Map from mon

2) In Ceph, everything is an object. The data stored by Ceph will be divided into one or more fixed-size objects (Object). Object size can be adjusted by the administrator, usually 2M or 4M.
Each object will have a unique OID, which is composed of ino and ono:
ino: the FileID of the file, which is used to uniquely identify each file globally
ono: the number of the slice
For example: a file whose FileID is A is cut into two objects, one numbered 0 and the other numbered 1; the OIDs of these two objects are then A0 and A1.
The advantage of OID is that it can uniquely identify each different object and store the affiliation between the object and the file. Since all data in Ceph are virtualized into uniform objects, the efficiency of reading and writing will be relatively high.

3) Apply the HASH algorithm to the OID to obtain a hexadecimal feature code, then take this value modulo the total number of PGs in the Pool; the resulting number is the PG ID.
That is, Pool_ID + HASH(OID) % PG_NUM to get PGID

4) The PG will replicate according to the number of copies set, and calculate the IDs of the target primary and secondary OSDs in the PG by using the CRUSH algorithm on the PGID, and store them on different OSD nodes (in fact, all objects in the PG are stored on the OSD) .
That is, through CRUSH (PGID), the data in PG is stored in each OSD group
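A worked example with purely illustrative numbers: suppose an object's OID hashes to 0x48a2d3b7 and the object belongs to pool 2, which has 64 PGs. Then 0x48a2d3b7 % 64 = 55 (0x37), so the object maps to PG 2.37; CRUSH(2.37) might then return an OSD set such as [3,1,5], meaning osd.3 is the primary and osd.1 and osd.5 hold the replicas.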


Eight Ceph Version Release Life Cycle

Starting from the Nautilus version (14.2.0), Ceph will have a new stable version released every year, which is expected to be released in March every year. Every year, the new version will have a new name (for example, "Mimic") and a main version number (for example, 13 for Mimic, since "M" is the 13th letter of the alphabet).

The format of the version number is x.y.z, where x indicates the release cycle (for example, 13 for Mimic, 17 for Quincy) and y indicates the type of release, that is:
x.0.z: y equals 0, indicating a development version
x.1.z: y equals 1, indicating a release candidate (for test clusters)
x.2.z: y equals 2, indicating a stable/bugfix release (for users)
For example, 14.2.22 is the 22nd stable/bugfix release of the Nautilus (x = 14) series.

Nine Ceph cluster deployment


At present, Ceph officially provides a variety of methods for deploying a Ceph cluster. The commonly used ones are ceph-deploy, cephadm and manual binary installation:

●ceph-deploy: a mature, widely used cluster deployment tool, suitable for production deployment.

●cephadm: from Octopus onwards, cephadm can be used to deploy a ceph cluster; it installs and manages Ceph with containers and systemd. Not yet recommended for production environments at this time.

●Binary: manual deployment, deploy Ceph cluster step by step, support more customization and understand deployment details, and installation is more difficult.

Ten Experiment: Deploy a Ceph cluster based on ceph-deploy
//Ceph production environment recommendation:
1. All storage clusters use 10-Gigabit networks
2. The cluster network (cluster-network, used for internal cluster communication) is separated from the public network (public-network, used for external access to the Ceph cluster)
3. mon, mds and osd are deployed on separate hosts (in a test environment, multiple components can run on one host node)
4. OSDs may also use SATA disks
5. Plan the cluster according to capacity
6. Xeon E5 2620 V3 or better CPU, 64GB or more memory
7. Distribute the cluster hosts across cabinets to avoid power-supply or network failure of a single cabinet

Ceph Environment Planning

Host name   Public network    Cluster network   Role
admin       192.168.50.25     -                 admin (the management node, responsible for overall cluster deployment)
node01      192.168.50.23     192.168.50.23     mon, mgr, osd
node02      192.168.50.24     192.168.50.24     mon, mgr, osd
node03      192.168.50.25     192.168.50.25     mon, osd
client      192.168.50.20     -                 client
1. Close selinux and firewall
systemctl disable --now firewalld
setenforce 0
sed -i 's/enforcing/disabled/' /etc/selinux/config

2. Set the hostname according to the plan:
hostnamectl set-hostname admin
hostnamectl set-hostname node01
hostnamectl set-hostname node02
hostnamectl set-hostname node03
hostnamectl set-hostname client

3. Configure hosts resolution
cat >> /etc/hosts << EOF
192.168.50.25 admin
192.168.50.22 node01
192.168.50.23 node02
192.168.50.24 node03
192.168.50.20 client
EOF

4. Install common software and dependent packages
yum -y install epel-release
yum -y install yum-plugin-priorities yum-utils ntpdate python-setuptools python-pip gcc gcc-c++ autoconf libjpeg libjpeg-devel libpng libpng-devel freetype freetype-devel libxml2 libxml2-devel zlib zlib-devel glibc glibc-devel glib2 glib2-devel bzip2 bzip2-devel zip unzip ncurses ncurses-devel curl curl-devel e2fsprogs e2fsprogs-devel krb5-devel libidn libidn-devel openssl openssh openssl-devel nss_ldap openldap openldap-devel openldap-clients openldap-servers libxslt-devel libevent-devel ntp libtool-ltdl bison libtool vim-enhanced python wget lsof iptraf strace lrzsz kernel-devel kernel-headers pam-devel tcl tk cmake ncurses-devel bison setuptool popt-devel net-snmp screen perl-devel pcre-devel net-snmp screen tcpdump rsync sysstat man iptables sudo libconfig git bind-utils tmux elinks numactl iftop bwm-ng net-tools expect snappy leveldb gdisk python-argparse gperftools-libs conntrack ipset jq libseccomp socat chrony sshpass
 

5. Configure ssh on the admin management node to log in to all nodes without password
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
sshpass -p '123' ssh-copy-id -o StrictHostKeyChecking=no root@admin
sshpass -p '123' ssh-copy-id -o StrictHostKeyChecking=no root@node01
sshpass -p '123' ssh-copy-id -o StrictHostKeyChecking=no root@node02
sshpass -p '123' ssh-copy-id -o StrictHostKeyChecking=no root@node03

6. Configure time synchronization
systemctl enable --now chronyd
timedatectl set-ntp true #Open NTP
timedatectl set-timezone Asia/Shanghai #Set time zone
chronyc -a makestep #Forcibly synchronize the system clock
timedatectl status #View time synchronization status
chronyc sources -v #View ntp source server information
timedatectl set-local-rtc 0 #Keep the hardware clock (RTC) in UTC

#Restart services that depend on system time
systemctl restart rsyslog 
systemctl restart crond

#Close irrelevant services
systemctl disable --now postfix
 

7. Configure Ceph yum source
wget https://download.ceph.com/rpm-nautilus/el7/noarch/ceph-release-1-1.el7.noarch.rpm --no-check-certificate

rpm -ivh ceph-release-1-1.el7.noarch.rpm --force
 

At this point the ceph.repo file has been generated in the yum repository directory (/etc/yum.repos.d/). The download address inside points to the upstream foreign source and can be switched to a domestic mirror with sed:

sed -i 's#download.ceph.com#mirrors.aliyun.com/ceph#' /etc/yum.repos.d/ceph.repo

8. Reboot all hosts after performing all the above operations (optional)
sync #Flush data to disk
reboot #Reboot

Deploy Ceph cluster

1. Create a Ceph working directory for all nodes, and follow-up work will be performed in this directory
mkdir -p /etc/ceph

2. Install the ceph-deploy deployment tool
cd /etc/ceph
yum install -y ceph-deploy #Install on the management node

ceph-deploy --version

3. Install the Ceph software package on the management node for other nodes
#ceph-deploy 2.0.1 deploys the mimic version of Ceph by default. To install another version, use --release to specify it manually
cd /etc/ceph
ceph-deploy install --release nautilus node0{1..3} admin

#ceph-deploy install is essentially executing the following commands:
yum clean all
yum -y install epel-release
yum -y install yum-plugin-priorities
yum -y install ceph-release ceph ceph-radosgw

#You can also manually install the Ceph package, execute the following command on other nodes to deploy the Ceph installation package:
yum install -y ceph-mon ceph-radosgw ceph-mds ceph-mgr ceph-osd ceph-common ceph

Add hard disk and network card to node01 node02 node03

Rescan the SCSI buses so that the newly added disks are detected

echo "- - -" >/sys/class/scsi_host/host0/scan

echo "- - -" >/sys/class/scsi_host/host1/scan

echo "- - -" >/sys/class/scsi_host/host2/scan

or restart directly

Modify the intranet network card configuration file

cd /etc/sysconfig/network-scripts/

cp ifcfg-ens33 ifcfg-ens35

4. Generate initial configuration
#Run the following command on the management node to tell ceph-deploy which is the mon monitoring node
cd /etc/ceph
ceph-deploy new --public-network 192.168.50.0/24 --cluster-network 192.168.100.0/24 node01 node02 node03

#After the command is executed successfully, the configuration file will be generated under /etc/ceph
ls /etc/ceph
ceph.conf #ceph configuration file
ceph-deploy-ceph.log #ceph-deploy log
ceph.mon.keyring #monitor keyring file
 

5. Initialize the mon nodes (on the management node)
cd /etc/ceph

#Create the mon nodes. Since the monitor uses the Paxos algorithm, a highly available cluster requires an odd number of nodes, at least 3.
ceph-deploy mon create node01 node02 node03

 #Configure the initial mon node and synchronize the configuration to all nodes     

ceph-deploy --overwrite-conf mon create-initial      
# the --overwrite-conf parameter forces the configuration file to be overwritten

ceph-deploy gatherkeys node01 #Optional operation, collect all keys to node01 node
 

 #View the automatically opened mon process on the mon node
ps aux | grep ceph


#View Ceph cluster status on the management node
cd /etc/ceph
ceph -s

#Check mon cluster election
ceph quorum_status --format json-pretty | grep leader

#Expand mon node
ceph-deploy mon add <node name>

6. Deploy nodes capable of managing Ceph clusters (optional)
#Can implement ceph commands on each node to manage the cluster
cd /etc/ceph
ceph-deploy --overwrite-conf config push node01 node02 node03 #Synchronize the configuration to all mon nodes; the ceph.conf content on all mon nodes must be consistent

ceph-deploy admin node01 node02 node03 #This essentially copies the ceph.client.admin.keyring cluster authentication file to each node


#Check on the mon nodes: ls /etc/ceph

7. Deploy osd storage nodes

#Do not partition the newly added disks on the hosts; check them directly with lsblk
lsblk


#If it is an old hard disk, you need to wipe (delete the partition table) the disk first (optional, new hard disk without data can not be done)
cd /etc/ceph
ceph-deploy disk zap node01 /dev/sdb
ceph-deploy disk zap node02 /dev/sdb
ceph-deploy disk zap node03 /dev/sdb

#Add the osd nodes (run on the admin node)
ceph-deploy --overwrite-conf osd create node01 --data /dev/sdb
ceph-deploy --overwrite-conf osd create node02 --data /dev/sdb
ceph-deploy --overwrite-conf osd create node03 --data /dev/sdb
 

#View osd status
ceph osd stat

Repeat the above operations to add the /dev/sdc and /dev/sdd disks of node01, node02 and node03

8. Deploy the mgr node

The #ceph-mgr daemon runs in Active/Standby mode, which ensures that when the Active node or its ceph-mgr daemon fails, one of the Standby instances can take over its tasks without interruption of service. According to the official architecture principles, mgr must have at least two nodes to work.

#Deploy mgr on the manager nodes (run from the management node)
cd /etc/ceph
ceph-deploy mgr create node01 node02

#Solve the HEALTH_WARN warning "mons are allowing insecure global_id reclaim":
#Disable insecure mode:
ceph config set mon auth_allow_insecure_global_id_reclaim false

#Expand mgr node
ceph-deploy mgr create <node name>
 

9. Enable the monitoring module
#First check which node the active mgr is on

ceph -s | grep mgr

#Execute the following on the ceph-mgr active node (node01 here)
yum install -y ceph-mgr-dashboard

cd /etc/ceph

ceph mgr module ls | grep dashboard

#Open dashboard module
ceph mgr module enable dashboard --force

#Disable the ssl function of dashboard
ceph config set mgr mgr/dashboard/ssl false

#Configure the address and port monitored by dashboard
ceph config set mgr mgr/dashboard/server_addr 0.0.0.0
ceph config set mgr mgr/dashboard/server_port 8000

#Restart the dashboard
ceph mgr module disable dashboard
ceph mgr module enable dashboard --force

#Confirm access to dashboard url
ceph mgr services

#Set dashboard account and password
echo "12345678" > dashboard_passwd.txt
ceph dashboard set-login-credentials admin -i dashboard_passwd.txt 

Browser access: http://192.168.50.22:8000, account password is admin/12345678

Resource Pool Pool Management

Above we have completed the deployment of the Ceph cluster, but how do we store data in Ceph? First we need to define a Pool resource pool in Ceph. Pool is an abstract concept for storing Object objects in Ceph. We can understand it as a logical partition on Ceph storage. Pool is composed of multiple PGs; PGs are mapped to different OSDs through the CRUSH algorithm; at the same time, Pool can set the replica size, and the default number of replicas is 3.

The Ceph client requests the status of the cluster from the monitor, and writes data to the Pool. According to the number of PGs, the data is mapped to different OSD nodes through the CRUSH algorithm to realize data storage. Here we can understand Pool as a logical unit for storing Object data; of course, the current cluster does not have a resource pool, so it needs to be defined.
 

Create a Pool resource pool named mypool, with the number of PGs set to 64. When setting PGs you also need to set PGP (usually the values of PG and PGP are the same):
PG (Placement Group): pg is a virtual concept used to hold objects
PGP (Placement Group for Placement purpose): corresponds to the permutation of OSDs on which the PGs are stored


cd /etc/ceph
ceph osd pool create mypool 64 64

How do you calculate how many PGs a Pool should have?
If the number of OSDs is less than 5, set the number of pgs to 128;

The number of OSDs is 5-10, and the number of pgs is set to 512;

The number of OSDs is 10-50, and the number of pgs is set to 4096;

The number of OSDs is greater than 50, refer to the formula

      ( Target PGs per OSD ) x ( OSD # ) x ( %Data ) / ( Size )

Meaning:
      Target PGs per OSD #how many PGs are planned for each OSD
      OSD # #the number of OSDs
      %Data #the pool's percentage of the total data
      Size #the number of replicas
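A worked example with assumed numbers: for a cluster of 100 OSDs where this pool will hold essentially all of the data (%Data = 100%), a target of 100 PGs per OSD and 3 replicas gives (100 x 100 x 1.0) / 3 ≈ 3333, which is then rounded up to the next power of two, i.e. 4096.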

For details, see the official Ceph documentation.

#View cluster Pool information
ceph osd pool ls or rados lspools
ceph osd lspools

#View the number of resource pool copies
ceph osd pool get mypool size

#Check the number of PG and PGP
ceph osd pool get mypool pg_num
ceph osd pool get mypool pgp_num

#Modify the number of pg_num and pgp_num to 128
ceph osd pool set mypool pg_num 128
ceph osd pool set mypool pgp_num 128

ceph osd pool get mypool pg_num
ceph osd pool get mypool pgp_num

#Modify the number of Pool copies to 2
ceph osd pool set mypool size 2

ceph osd pool get mypool size

#Modify the default number of copies to 2
vim ceph.conf
......
osd_pool_default_size = 2

ceph-deploy --overwrite-conf config push node01 node02 node03 #Management node configuration synchronization to data nodes

After the modification is complete, the services on node01, node02 and node03 need to be restarted


#Delete Pool resource pool

1) Deleting a storage pool carries a risk of data loss, so Ceph forbids this operation by default. The administrator must first enable pool deletion in the ceph.conf configuration file:
vim ceph.conf
......
[mon]
mon allow pool delete = true

2) Push the ceph.conf configuration file to all mon nodes
ceph-deploy --overwrite-conf config push node01 node02 node03

3) All mon nodes restart ceph-mon service
systemctl restart ceph-mon.target

4) Execute the delete Pool command
ceph osd pool rm pool01 pool01 --yes-i-really-really-mean-it

#View OSD status
ceph osd status

#View OSD usage
ceph osd df

Two: Create CephFS file system MDS interface 
//server operation

1) Create the mds service on the management node
cd /etc/ceph
ceph-deploy mds create node01 node02 node03

2) View the mds service of each node
ssh root@node01 systemctl status ceph-mds@node01
ssh root@node02 systemctl status ceph-mds@node02
ssh root@node03 systemctl status ceph-mds@node03

3) Create a storage pool and enable the ceph file system
The ceph file system requires at least two rados pools, one for storing data and one for storing metadata. At this time, the data pool is similar to the shared directory of the file system.
ceph osd pool create cephfs_data 128 #Create data Pool

ceph osd pool create cephfs_metadata 128 #Create metadata Pool

#Create cephfs, command format: ceph fs new <FS_NAME> <CEPHFS_METADATA_NAME> <CEPHFS_DATA_NAME>
ceph fs new mycephfs cephfs_metadata cephfs_data #Enable ceph, the metadata pool is in the front, and the data pool is in the back

ceph fs ls #View cephfs

4) Check the mds status: one is up (active) and the other two are standby. The mds service on node01 is currently doing the work.
ceph -s
mds: mycephfs:1 {0=node01=up:active} 2 up:standby

ceph mds stat
mycephfs:1 {0=node01=up:active} 2 up:standby

5) Create user
Syntax format: ceph fs authorize <fs_name> client.<client_id> <path-in-cephfs> rw

#The account is client.zhangsan, the user name is zhangsan, and zhangsan has read and write permissions to the / root directory of the ceph file system (not the root directory of the operating system)
ceph fs authorize mycephfs client.zhangsan / rw | tee /etc/ceph/zhangsan.keyring

#The account is client.lisi, the user name is lisi; lisi has only read permission on the / root directory of the file system, and read and write permission on the /test subdirectory of the file system root
ceph fs authorize mycephfs client.lisi / r /test rw | tee /etc/ceph/lisi.keyring

//client operation

1) The client must be in the public network


2) Create a working directory on the client
mkdir /etc/ceph

3) On the ceph management node, copy the ceph configuration file ceph.conf and the account keyring files zhangsan.keyring and lisi.keyring to the client
scp ceph.conf zhangsan.keyring lisi.keyring root@client:/etc/ceph

4) Install the ceph package on the client side
cd /opt
wget https://download.ceph.com/rpm-nautilus/el7/noarch/ceph-release-1-1.el7.noarch.rpm --no-check-certificate
rpm -ivh ceph-release-1-1.el7.noarch.rpm
yum install -y ceph 

5) Create a secret key file on the client side
cd /etc/ceph
ceph-authtool -n client.zhangsan -p zhangsan.keyring > zhangsan.key #Export the zhangsan user's secret key to zhangsan.key
ceph-authtool -n client.lisi -p lisi.keyring > lisi.key

6) Client mount

●Method 1: Kernel-based
Syntax format:
mount -t ceph node01:6789,node02:6789,node03:6789:/ <local mount point directory> -o name=<user name>,secret=<secret key>
mount -t ceph node01:6789,node02:6789,node03:6789:/ <local mount point directory> -o name=<user name>,secretfile=<secret key file>

Example 1:
mkdir -p /data/zhangsan
mount -t ceph node01:6789,node02:6789,node03:6789:/ /data/zhangsan -o name=zhangsan,secretfile=/etc/ceph/zhangsan.key

Example 2:
mkdir -p /data/lisi
mount -t ceph node01:6789,node02:6789,node03:6789:/ /data/lisi -o name=lisi,secretfile=/etc/ceph/lisi.key

#Verify user permissions
cd /data/lisi
echo 123 > 2.txt
-bash: 2.txt: Permission denied

echo 123 > test/2.txt
cat test/2.txt
123

Example 3:
#Stop the mds service on node02
ssh root@node02 "systemctl stop ceph-mds@node02"

ceph -s

#Test that the client mount point is still usable; if you stop all mds daemons, the client becomes unavailable
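As a side note (not part of the original steps; the path and user follow Example 1), a kernel CephFS mount can be made persistent across reboots with an /etc/fstab entry along these lines:

node01:6789,node02:6789,node03:6789:/ /data/zhangsan ceph name=zhangsan,secretfile=/etc/ceph/zhangsan.key,_netdev,noatime 0 0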

●Method 2: Based on the fuse tool
1) On the ceph management node, copy the ceph.client.admin.keyring account keyring file to the client
scp ceph.client.admin.keyring root@client:/etc/ceph

2) Install ceph-fuse on the client side
yum install -y ceph-fuse

3) Client mount
cd /opt/test
ceph-fuse -m node01:6789,node02:6789,node03:6789 /data/aa [-o nonempty] #If the mount point is not empty, the mount will fail; specifying -o nonempty ignores this

Create Ceph block storage system RBD interface 
1. Create a storage pool named rbd-demo dedicated to RBD (admin node operation)
ceph osd pool create rbd-demo 64 64

2. Convert the storage pool to RBD mode
ceph osd pool application enable rbd-demo rbd

3. Initialize the storage pool
rbd pool init -p rbd-demo # -p is equivalent to --pool

4. Create a mirror
rbd create -p rbd-demo --image rbd-demo1.img --size 10G

Can be abbreviated as:
rbd create rbd-demo/rbd-demo2.img --size 10G

 5. Mirror management
//Check which mirrors exist under the storage pool
rbd ls -l -p rbd-demo

//View the details of the image
rbd info -p rbd-demo --image rbd-demo1.img

rbd image 'rbd-demo1.img':
    size 10 GiB in 2560 objects #The size of the image and the number of objects it is divided into
    order 22 (4 MiB objects) #The object size order; the valid range is 12 to 25, corresponding to 4KB to 32MB; 22 means 2 to the 22nd power, which is exactly 4MB
    snapshot_count: 0
    id: 5fc98fe1f304 # ID of the mirror image
    block_name_prefix: rbd_data.5fc98fe1f304 #Name prefix
    format: 2 #The image format used, the default is 2
    features: layering , exclusive-lock, object-map, fast-diff, deep-flatten        
    #Features of the current image
    op_features:                                                                 
    #Optional features
    flags: 


//Modify the image size
rbd resize -p rbd-demo --image rbd-demo1.img --size 20G

rbd info -p rbd-demo --image rbd-demo1.img

#Use resize to adjust the size of the image. It is generally recommended to only increase but not decrease. If it is to decrease, you need to add the option --allow-shrink
rbd resize -p rbd-demo --image rbd-demo1.img --size 5G --allow-shrink

//Delete mirror
#Delete mirror directly
rbd rm -p rbd-demo --image rbd-demo2.img
rbd remove rbd-demo/rbd-demo2.img

#It is recommended to use the trash command. This command moves the image to the recycle bin; if you want it back, you can restore it
rbd trash move rbd-demo/rbd-demo1.img

rbd ls -l -p rbd-demo

rbd trash list -p rbd-demo
5fc98fe1f304 rbd-demo1.img

#Restore image
rbd trash restore rbd-demo/5fc98fe1f304

rbd ls -l -p rbd-demo

6. Linux client use

There are two ways for the client to use RBD:
●Map the image to a local block device of the system through the kernel module KRBD; the device file is usually /dev/rbd*
●The other is through the librbd interface; KVM virtual machines usually use this kind of interface.

This example mainly uses the Linux client to mount the RBD image as a local disk. Before starting, you need to install the ceph-common package on the required client nodes, because the client needs to call the rbd command to map the RBD image to the local as a common hard disk. And also need to copy the ceph.conf configuration file and authorization keyring file to the corresponding node.
 

Manage Node Operations

//Create and authorize a user on the management node to access the specified RBD storage pool
#Example: specify the user ID as client.osd-mount, with all permissions on OSD and read-only permission on Mon
ceph auth get-or-create client.osd-mount osd "allow * pool=rbd-demo" mon "allow r" > /etc/ceph/ceph.client.osd-mount.keyring

//Modify RBD image features, CentOS7 only supports layering and striping features by default, you need to turn off other features
rbd feature disable rbd-demo/rbd-demo1.img object-map,fast-diff,deep-flatten

//Send the user's keyring file and ceph.conf file to the client's /etc/ceph directory
cd /etc/ceph
scp ceph.client.osd-mount.keyring ceph.conf root@client:/etc/ceph
 

//linux client operation

#Install ceph-common package
yum install -y ceph-common

#Execute client mapping
cd /etc/ceph
rbd map rbd-demo/rbd-demo1.img --keyring /etc/ceph/ceph.client.osd-mount.keyring --user osd-mount
 


#View mapping
rbd showmapped
rbd device list

#Disconnect mapping
rbd unmap rbd-demo/rbd-demo1.img

#Format and mount
mkfs.xfs /dev/rbd0

mkdir -p /data/bb
mount /dev/rbd0 /data/bb

#Online expansion
Adjust the size of the image on the management node
rbd resize rbd-demo/rbd-demo1.img --size 30G

#Refresh the device file on the client side
xfs_growfs /dev/rbd0 #Refresh an xfs file system's capacity
resize2fs /dev/rbd0 #Refresh an ext4 file system's capacity

7. Snapshot management
Taking a snapshot of the rbd image can preserve the status history of the image, and it can also use the layering technology of the snapshot to clone the snapshot into a new image.

//Write a file on the client
echo 1111 > /data/bb/11.txt
echo 2222 > /data/bb/22.txt
echo 3333 > /data/bb/33.txt

//Create a snapshot of the image on the management node
rbd snap create --pool rbd-demo --image rbd-demo1.img --snap demo1_snap1

Can be abbreviated as:
rbd snap create rbd-demo/rbd-demo1.img@demo1_snap1

//List all snapshots of the specified image
rbd snap list rbd-demo/rbd-demo1.img

# Output in json format:
rbd snap list rbd-demo/rbd-demo1.img --format json --pretty-format

//Roll back the image to a specified snapshot
Before rolling back to a snapshot, you need to unmount and unmap the image on the client, then perform the rollback.

#Operate on the client side
rm -rf /data/bb/*
umount /data/bb
rbd unmap rbd-demo/rbd-demo1.img

#Operate on the management node
rbd snap rollback rbd-demo/rbd-demo1.img@demo1_snap1

#Remap and mount on the client side
rbd map rbd-demo/rbd-demo1.img --keyring /etc/ceph/ceph.client.osd-mount.keyring --user osd-mount
mount /dev/rbd0 /data/bb
ls /data/bb # Found that the data has been restored

//Limit the number of snapshots that can be created by the mirror
rbd snap limit set rbd-demo/rbd-demo1.img --limit 3

#Remove the limit:
rbd snap limit clear rbd-demo/rbd-demo1.img

//Delete snapshot
#Delete the specified snapshot:
rbd snap rm rbd-demo/rbd-demo1.img@demo1_snap1

#Delete all snapshots:
rbd snap purge rbd-demo/rbd-demo1.img

//snapshot layering

Snapshot layering supports the use of snapshot clones to generate new images, which are almost identical to directly created images and support all operations of images. The only difference is that the clone image references a read-only upstream snapshot, and this snapshot must be protected.

#snapshot clone
1) Set the upstream snapshot to protected mode:
rbd snap create rbd-demo/rbd-demo1.img@demo1_snap666

rbd snap protect rbd-demo/rbd-demo1.img@demo1_snap666

2) Clone the snapshot as a new image
rbd clone rbd-demo/rbd-demo1.img@demo1_snap666 --dest rbd-demo/rbd-demo666.img

rbd ls -p rbd-demo

3) After the clone is complete, view the snapshot's child images
rbd children rbd-demo/rbd-demo1.img@demo1_snap666


//snapshot flatten

Usually, an image obtained by snapshot cloning keeps a reference to its parent snapshot. In that case the parent snapshot cannot be deleted, otherwise the clone will be affected.
rbd snap rm rbd-demo/rbd-demo1.img@demo1_snap666
#error snapshot 'demo1_snap666' is protected from removal.

If you want to delete a snapshot but keep its child image, you must first flatten the child image; the time required for flattening depends on the size of the image
1) Flatten the submirror
rbd flatten rbd-demo/rbd-demo666.img

2) Unprotect the snapshot
rbd snap unprotect rbd-demo/rbd-demo1.img@demo1_snap666

3) Delete the snapshot
rbd snap rm rbd-demo/rbd-demo1.img@demo1_snap666

rbd ls -l -p rbd-demo #After deleting the snapshot, check that the sub-mirror still exists

8. Mirror image export and import

//Export image
rbd export rbd-demo/rbd-demo1.img /opt/rbd-demo1.img

//Import image
#Unmount on the client and unmap the image
umount /data/bb
rbd unmap rbd-demo/rbd-demo1.img

#Clear all snapshots under the mirror and delete the mirror
rbd snap purge rbd-demo/rbd-demo1.img
rbd rm rbd-demo/rbd-demo1.img

rbd ls -l -p rbd-demo

#import mirror
rbd import /opt/rbd-demo1.img rbd-demo/rbd-demo1.img

rbd ls -l -p rbd-demo

Create the RGW interface of the Ceph object storage system 
1. Object storage concept
Object storage is a storage method for unstructured data. Each piece of data in object storage is stored as a separate object with a unique address that identifies it. Object storage is usually used in cloud computing environments.
Unlike other data storage methods, object-based storage does not use directory trees.

Although there are differences in design and implementation, most object storage systems present similar core resource types to the outside world. From the perspective of the client, it is divided into the following logical units:
●Amazon S3:
Provides:
1. User (user)
2. Bucket (storage bucket)
3. Object (object)

The relationship between the three is:
1. A User stores Objects in Buckets on the system
2. A Bucket belongs to a certain user and can hold objects; one bucket is used to store multiple objects
3. The same user can own multiple buckets, and different users are allowed to use buckets with the same name, so the user name can serve as the namespace for buckets
 

●OpenStack Swift: 
Provides user, container and object, corresponding to users, storage buckets and objects respectively. It also provides a parent component of user called account, which represents a project or user; an account can therefore contain one or more users, who share the same set of containers, and it provides the namespace for containers

●RadosGW:
Provides user, subuser, bucket and object, where user corresponds to the S3 user and subuser to the Swift user. However, neither user nor subuser supports providing a namespace for buckets, so buckets of different users cannot have the same name. Since the Jewel version, RadosGW has introduced the tenant to provide namespaces for users and buckets, but it is an optional component


It can be seen from the above that the core resource types of most object storage systems are similar, for example Amazon S3, OpenStack Swift and RadosGW. S3 and Swift are not compatible with each other; in order to be compatible with both, Ceph provides the RGW (RadosGateway) data abstraction and management layer on top of RADOS, which is natively compatible with the S3 and Swift APIs.
S3 and Swift can complete data exchange based on http or https, and the built-in Civetweb of RadosGW provides services. It can also support proxy servers including nginx, haproxy, etc. to receive user requests in the form of proxies, and then forward them to the RadosGW process.
The function of RGW depends on the implementation of the object gateway daemon, which is responsible for providing the REST API interface to the client. For redundant load balancing requirements, there is usually more than one RadosGW daemon on a Ceph cluster.
 

2. Create an RGW interface.
If you need to use an interface like S3 or Swift, you need to deploy/create a RadosGW interface. RadosGW is usually used as Object Storage, similar to Alibaba Cloud OSS.
 

//Create an RGW daemon process on the management node (this process generally requires high availability in a production environment, which will be introduced later)
cd /etc/ceph
ceph-deploy rgw create node01

ceph -s

#After successful creation, by default, a series of storage pools for RGW will be automatically created
ceph osd pool ls
rgw.root 
default.rgw.control #Controller information
default.rgw.meta #Record metadata
default.rgw.log #Log Information
default.rgw.buckets.index #is the rgw bucket information, generated after data is written
default.rgw.buckets.data #is the actual stored data information, generated after data is written
 

#By default, RGW listens to port 7480
ssh root@node01 netstat -lntp | grep 7480
 

curl node01:7480
<?xml version="1.0" encoding="UTF-8"?><ListAllMyBucketsResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Owner>
    <ID>anonymous</ID>
    <DisplayName/>
  </Owner>
  <Buckets/>
</ListAllMyBucketsResult>

//Enable http+https and change the listening port
The RadosGW daemon is implemented internally with Civetweb, and basic management of RadosGW can be done through Civetweb's configuration.

#To enable SSL on Civetweb, you first need a certificate and generate a certificate on the rgw node

Operate on the node01 node
1) Generate CA certificate private key:
openssl genrsa -out civetweb.key 2048

2) Generate CA certificate public key:
openssl req -new -x509 -key civetweb.key -out civetweb.crt -days 3650 -subj "/CN=192.168.80.11"

3) Merge the generated key and certificate into a pem file
cat civetweb.key civetweb.crt > /etc/ceph/civetweb.pem

#Change the listening port
Civetweb listens on port 7480 by default and provides the http protocol. If you need to modify the configuration, you need to edit the ceph.conf configuration file on the management node
cd /etc/ceph

vim ceph.conf
......
[client.rgw.node01]
rgw_host = node01
rgw_frontends = "civetweb port=80+443s ssl_certificate=/etc/ceph/civetweb.pem num_threads=500 request_timeout_ms=60000"

------------------------------------------------------------
●rgw_host: the corresponding RadosGW name or IP address
●rgw_frontends: configures the listening port, whether to use https, and some common options:
•port: if it is an https port, an s must be appended after the port number.
•ssl_certificate: specifies the path to the certificate.
•num_threads: the maximum number of concurrent connections; the default is 50, adjusted according to requirements; in a production cluster environment this value should usually be larger.
•request_timeout_ms: send/receive timeout, in ms; the default is 30000.
•access_log_file: access log path; empty by default.
•error_log_file: error log path; empty by default.
------------------------------------------------------------

# After modifying the ceph.conf configuration file, you need to restart the corresponding RadosGW service, and then push the configuration file
ceph-deploy --overwrite-conf config push node0{1..3}

ssh root@node01 systemctl restart ceph-radosgw.target

#View port on rgw node
netstat -lntp | grep -w 80
netstat -lntp | grep 443

#Access verification on the client side
curl http://192.168.80.11:80
curl -k https://192.168.80.11:443
 

//Create a RadosGW account
Use the radosgw-admin command on the management node to create a RadosGW account

radosgw-admin user create --uid="rgwuser" --display-name="rgw test user"
......
    "keys": [
        {
            "user": "rgwuser",
            "access_key": "ER0SCVRJWNRIKFGQD31H",
            "secret_key": "YKYjk7L4FfAu8GHeQarIlXodjtj1BXVaxpKv2Nna"
        }
    ],


#After the creation is successful, the basic information of the user will be output, and the two most important information are access_key and secret_key. After the user is successfully created, if you forget the user information, you can use the following command to view
radosgw-admin user info --uid="rgwuser"
 

//S3 interface access test
1) Install python3 and python3-pip on the client side
yum install -y python3 python3-pip

python3 -V
Python 3.6.8

pip3 -V
pip 9.0.3 from /usr/lib/python3.6/site-packages (python 3.6)

2) Install the boto module for testing the connection to S3
pip3 install boto

3) Test access to the S3 interface
echo 123123 > /opt/123.txt

vim test.py
#coding:utf-8
import ssl
import boto.s3.connection
from boto.s3.key import Key
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context
    
#test user's keys information
access_key = "ER0SCVRJWNRIKFGQD31H" #Enter the access_key of RadosGW account
secret_key = "YKYjk7L4FfAu8GHeQarIlXodjtj1BXVaxpKv2Nna" #Enter the secret_key of RadosGW account

#rgw's ip and port
host = "192.168.80.11" #Enter the public network address of the RGW interface

#If you use port 443, the following link should be set is_secure=True
port = 443
#If you use port 80, the following link should be set is_secure=False
#port = 80
conn = boto.connect_s3(
    aws_access_key_id=access_key,
    aws_secret_access_key=secret_key,
    host=host,
    port=port,
    is_secure=True,
    validate_certs=False,
    calling_format=boto.s3.connection.OrdinaryCallingFormat()
)

#1: Create a bucket
#conn.create_bucket(bucket_name='bucket01')
#conn.create_bucket(bucket_name='bucket02')

#2: Determine whether it exists, return None
exists = conn.lookup('bucket01')
print(exists)
#exists = conn.lookup('bucket02')
#print(exists)

#3: Get a bucket
#bucket1 = conn.get_bucket('bucket01')
#bucket2 = conn.get_bucket('bucket02')

#4: View files under a bucket
#print(list(bucket1.list()))
#print(list(bucket2.list()))

#5: Store data on s3, the data source can be file, stream, or string
#5.1, upload files
#bucket1 = conn.get_bucket('bucket01')
# The value of name is the key of the data
#key = Key(bucket=bucket1, name='myfile')
#key.set_contents_from_filename('/opt/123.txt')
# Read the content of the file in s3, return string which is the content of file 123.txt
#print(key.get_contents_as_string())

#5.2, upload string
# If you have already obtained the object before, you don’t need to obtain it repeatedly
bucket2 = conn.get_bucket('bucket02')
key = Key(bucket=bucket2, name='mystr')
key.set_contents_from_string('hello world')
print(key.get_contents_as_string())

#6: To delete a bucket, all keys (objects) inside it must be deleted before the bucket itself is deleted
bucket1 = conn.get_bucket('bucket01')
for key in bucket1:
    key.delete()
bucket1.delete()


4) Follow the above steps to execute the python script test
python3 test.py
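As an optional follow-up check (a sketch; the pool name is the RGW data pool created earlier), you can confirm that the uploaded objects actually landed in RADOS:

rados -p default.rgw.buckets.data ls #the listed RADOS object names contain the bucket id plus the keys (e.g. myfile, mystr) written by the script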

OSD failure simulation and recovery 
1. Simulate OSD failure
If there are thousands of osds in a ceph cluster, it is normal for 2-3 of them to fail every day; here we simulate taking one osd down

#If the osd daemon is running normally, the down osd will quickly return to normal, so you need to shut down the daemon
ssh root@node01 systemctl stop ceph-osd@0

#Mark the osd down
ceph osd down 0

ceph osd tree


2. Kick the broken osd out of the cluster
//Method 1:
#Move osd.0 out of the cluster, the cluster will start to automatically synchronize data
ceph osd out osd.0

#Remove osd.0 from crushmap
ceph osd crush remove osd.0

#Delete the account information corresponding to the daemon process
ceph auth rm osd.0

ceph auth list

#Delete osd.0
ceph osd rm osd.0

ceph osd stat
ceph -s

//Method 2:
ceph osd out osd.0

#Or use the all-in-one purge command, and remove any configuration for the broken osd from the configuration file
ceph osd purge osd.0 --yes-i-really-mean-it

3. Rejoin the cluster after repairing the original broken osd
#Create the osd on the osd node; there is no need to specify a name, it is generated automatically according to the serial number
cd /etc/ceph

ceph osd create

#Create the account
ceph-authtool --create-keyring /etc/ceph/ceph.osd.0.keyring --gen-key -n osd.0 --cap mon 'allow profile osd' --cap mgr 'allow profile osd' --cap osd 'allow *'

#Import new account key
ceph auth import -i /etc/ceph/ceph.osd.0.keyring

ceph auth list

#Update the keyring file in the corresponding osd folder
ceph auth get-or-create osd.0 -o /var/lib/ceph/osd/ceph-0/keyring

#Join crushmap
ceph osd crush add osd.0 1.000 host=node01 #1.000 represents weight

#Join the cluster
ceph osd in osd.0

ceph osd tree

#Restart osd daemon
systemctl restart ceph-osd@0

 ceph osd tree #After a while, the osd status is up    

//If the restart fails with the error:
Job for [email protected] failed because start of the service was attempted too often. See "systemctl  status [email protected]" and "journalctl -xe" for details.
To force a start use "systemctl reset-failed [email protected]" followed by "systemctl start [email protected]" again.

#Run
systemctl reset-failed [email protected] && systemctl restart [email protected]

Origin blog.csdn.net/zl965230/article/details/131043357