Ceph from beginner to proficient: a summary of Ceph distributed storage solutions

Hybrid hardware architecture proposal

Considering cost, choose a hybrid SSD and HDD hardware architecture

  • Solution 1: Place primary data on SSD OSDs and replica data on HDD OSDs by writing a new CRUSH rule
  • Solution 2: Control storage pool placement through CRUSH rules so that hot and cold data are separated (see the sketch after this list)
  • Solution 3: Tiered storage using Ceph cache tiering (cache pool) technology
  • Solution 4: Use SSD partitions for block_db and block_wal to accelerate HDD OSDs
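
As a minimal sketch of Solution 2, the device-class based CRUSH rules below (assuming a Luminous-or-later cluster and hypothetical pool names hot_pool and cold_pool) direct one pool to SSD OSDs and another to HDD OSDs:

# Create CRUSH rules that select OSDs by device class (ssd / hdd)
ceph osd crush rule create-replicated rule-ssd default host ssd
ceph osd crush rule create-replicated rule-hdd default host hdd

# Bind pools to the rules: hot data on SSDs, cold data on HDDs
ceph osd pool set hot_pool crush_rule rule-ssd
ceph osd pool set cold_pool crush_rule rule-hdd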

Reference: Ceph Distributed Storage Hybrid Hardware Architecture Solution

Ceph best practices

Node hardware configuration

OSD node configuration

For IOPS-intensive scenarios, the recommended server configuration is as follows:
OSD: four OSDs (LVM) per NVMe SSD
Controller: native PCIe bus
Network: one 10GbE port per 12 OSDs
Memory: 16GB plus 2GB per OSD
CPU: 5 cores per SSD

For high-throughput scenarios, the recommended server configuration is as follows:
OSD: 7200 RPM HDDs
Network: one 10GbE port per 12 OSDs
Memory: 16GB plus 2GB per OSD
CPU: 1 core per HDD

For high-capacity scenarios, the recommended server configuration is as follows:
OSD: 7200 RPM HDDs
Network: one 10GbE port per 12 OSDs
Memory: 16GB plus 2GB per OSD
CPU: 1 core per HDD

Note: 1C of CPU here corresponds to 1 GHz of clock speed.

Other node configurations:
MDS: 4C/2G/10Gbps
Monitor: 2C/2G/10Gbps
Manager: 2C/2G/10Gbps
BlueStore slow device, DB, and WAL capacity ratio:
slow (SATA) : DB (SSD) : WAL (NVMe SSD) = 100 : 1 : 1 (for example, a 4TB SATA slow device pairs with roughly 40GB of DB and 40GB of WAL)

Ceph cluster practice plan

hardware recommendation

Taking 1PB as an example, high-throughput type

OSD nodes: 21
CPU: 16c
Memory: 64G
Network: 10Gbps * 2
Hard disk: 7200 RPM HDD/4T * 12 (12 OSD + 1 system)
System: Ubuntu 18.04

Monitor nodes: 3
CPU: 2c
Memory: 2G
Network: 10Gbps * 2
Hard disk: 20G
System: Ubuntu 18.04

Manager nodes: 2
CPU: 2c
Memory: 2G
Network: 10Gbps * 2
Hard Disk: 20G
System: Ubuntu 18.04

MDS nodes (for CephFS): 2
CPU: 4c
Memory: 2G
Network: 10Gbps * 2
Hard Disk: 20G
System: Ubuntu 18.04

Intel Hardware Solution White Paper

PCIe/NVMe SSD to mechanical hard disk ratio of 1:12, with an Intel PCIe/NVMe P3700 as the journal disk

SATA SSD to mechanical hard disk ratio of 1:4, with an Intel SATA SSD (S3700) as the journal disk

Good configuration: CPU Intel Xeon Processor E5-2650 v3; NIC 1 x 10GbE; drives 1 x 1.6TB P3700 + 12 x 4TB SAS (1:12, P3700 as journal and cache drive); memory 64GB
Better configuration: CPU Intel Xeon Processor E5-2690; NIC 2 x 10GbE; drives 1 x 800GB P3700 + 4 x 1.6TB S510 (P3700 as journal and cache drive); memory 128GB
Best configuration: CPU Intel Xeon Processor E5-2699 v3; NIC 4 x 10GbE + 2 x 40GbE; drives 4-6 x 2TB P3700; memory >= 128GB

A PCIe/NVMe SSD can be used as the journal drive, with high-capacity, low-cost SATA SSDs as the OSD data drives. This configuration is most cost-effective for use cases that require high performance, especially high IOPS and strict SLAs, while having moderate storage capacity requirements.

Reference:  Intel Solutions for Ceph Deployments

New methods and ideas for Ceph performance optimization

The Red Hat official website gives recommended Ceph cluster server hardware configurations (CPU/memory/disk/network) for different application scenarios. They are intended only as a reference for server selection, not as prescriptive recommendations.

Scenarios include the following:

Scenario 1: IOPS-optimized (low-latency IOPS), for workloads with strong real-time requirements but small data volumes, such as order generation.

Scenario 2: Throughput-optimized (throughput first), with high throughput but relaxed IOPS latency requirements, such as live streaming.

Scenario 3: Cost/capacity-optimized (large storage capacity at a low price), such as the storage of large files.

Red Hat CEPH Deployment Hardware Configuration Guide

Enterprise Ceph version selection and suggestions for using the BlueStore engine with SSD + SATA

(1) The LTS release generally lags the current year by about two years, so the current stable version is the M release, Mimic.
(2) The WAL is RocksDB's write-ahead log, equivalent to the old journal data, and the DB holds RocksDB's metadata. The disk-selection priority is block.wal > block.db > block; of course, all of the data can also be placed on the same disk.
(3) By default the WAL and DB are 512MB and 1GB respectively. The official recommendation is to size block.db at about 4% of the main device and block.wal at about 6%, roughly 10% combined; a deployment sketch follows.
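
A minimal sketch of deploying a BlueStore OSD with separate DB and WAL devices using ceph-volume (the device paths are illustrative: an HDD for data, a SATA SSD partition for block.db, an NVMe partition for block.wal):

ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/sdc1 --block.wal /dev/nvme0n1p1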

Ceph distributed storage hardware planning and operation and maintenance considerations

Ceph distributed storage hardware planning, operation and maintenance considerations? - Q & A - twt enterprise IT communication platform

About Ceph performance optimization and hardware selection

http://vlambda.com/wz_x6mXhhxv3M.html

Interface type of SSD solid state drive

  • SATA
  • PCI-E
  • NVMe

Reference: Detailed explanation of SATA, mSATA, M.2, M.2 (NVMe), and PCIe SSD interfaces (shuai0845's blog, CSDN)

Recommended SSD model

  • Seagate Nytro 1351/1551
  • HGST SN260
  • Intel P4500
  • Intel P3700
  • Intel S3500
  • Intel S4500
  • Intel SSD 730 series
  • Intel D3-S4510 SSD
  • Micron 5100/5200 and soon 5300

Choose the Intel S37 series or S46 series; if budget allows, go for the Intel P series

configuration table 1

To make the Ceph cluster run more stably while balancing cost-effectiveness, the following hardware configuration is used:

OS Disk: 2 x 600GB SAS (SEAGATE 600GB SAS 6Gbps 15K 2.5-inch). Choose SSD, SAS, or SATA according to budget; use RAID 1 so that a system-disk failure does not cause problems for the Ceph cluster.
OSD Disk: 8 x 4TB SAS (SEAGATE 4TB SAS 7200 RPM). Choose SAS or SATA disks according to budget; each physical machine is configured with eight 4TB disks for storing data. No RAID.
Monitor Disk: 1 x 480GB SSD (Intel SSD 730 series). Disk used by the monitor process; choose an SSD for speed.
Journal Disk: 2 x 480GB SSD (Intel SSD 730 series). Each SSD is split into 4 partitions, each serving one OSD, so a node has 8 journal partitions for its 8 OSDs. No RAID.
CPU: E5-2630 v4 * 2. Within the budget, the higher the CPU spec the better.
Memory: >= 64GB. Within the budget, go to 128GB if money allows.
NIC: 40Gb * 2 optical ports + public network IP. 40Gb NICs guarantee data-synchronization speed (front-end network and back-end network).

CPU

Allocate at least one CPU core per OSD daemon process.
The calculation formula is as follows:

((cpu sockets * cpu cores per socket * cpu clock speed in GHZ)/No. of OSD) >= 1
 
Intel Xeon Processor E5-2630 v4 (2.2GHz, 10 cores) calculation:
  1 * 10 * 2.2 / 8 = 2.75  # greater than 1: in theory the node could run 20+ OSD processes, but because too many OSDs on a single node causes large data-migration volumes, we limit it to 8 OSD processes

Memory

An OSD process requires at least 1GB of memory, but considering the memory used during data migration, it is recommended to pre-allocate 2GB per OSD process.
During data recovery, roughly 1GB of memory is needed per 1TB of data, so the more memory the better.

disk

system disk

Choose SSD, SAS, or SATA disks according to the budget, and always use RAID to prevent a system-disk failure from taking the node down

OSD Disk

Balancing cost and performance, choose SAS or SATA disks of 4TB each. If there are IO-intensive workloads, higher-performance SSDs can also be configured as separate OSD disks and placed in their own region of the CRUSH hierarchy

Journal Disk

Used for journal writes; choose an SSD disk for speed

Monitor Disk

The disk used to run the monitor process; an SSD is recommended. If the Ceph monitor runs alone on a physical machine, use two SSDs in RAID 1. If the monitor and OSDs run on the same node, prepare a separate SSD as the monitor disk

NIC

An OSD node is configured with two 10Gb NICs: one for the public network used for management, and one for the cluster network used for communication between OSDs

reference:

ceph-hardware configuration

Ceph hardware selection, performance tuning

1. Selection of application scenarios

  • IOPS Low Latency - Block Storage
  • Throughput first - block storage, file system, object storage
  • Large storage capacity - object storage, file system

2. Suggestions on hardware selection and optimization

  • CPU: One OSD process allocates one CPU core [((cpu sockets * cpu cores per socket * cpu clock speed in GHZ) /No.Of OSD)>=1]

  • Memory: Each OSD process should be allocated at least 1GB; since 1TB of data consumes about 1GB of memory during recovery, allocating 2GB of RAM per OSD is best. The mon and mds processes on a node should be given 2GB or more; the more RAM, the better CephFS performs

  • NIC: A large cluster (dozens of nodes) should use 10GbE. The network matters most during data recovery and rebalancing; with a 10GbE network, cluster recovery time is shortened. Dual NICs are recommended on cluster nodes, with the client and cluster networks separated

  • Hard disk: use an SSD as the journal disk, 10-20GB per journal; one SSD is recommended for every 4 OSD data disks. SSD choice: Intel SSD DC S3500 series.

    To get good performance from SATA/SAS SSDs, keep the SSD to OSD ratio at 1:4, that is, four OSD data disks can share one SSD

    For PCIe or NVMe flash devices, depending on device performance, the SSD to OSD ratio can reach 1:12 or 1:18

  • OSD node density: the density of OSD data partitions per node is also an important factor affecting cluster performance, usable capacity, and TCO. Generally speaking, many small-capacity nodes are better than a few large-capacity nodes

  • BIOS: enable VT and HT; disable power saving; disable NUMA

3. Operating system tuning

  • read_ahead: improve disk reads by prefetching data into memory
  • Adjust the maximum number of processes
  • Turn off the swap partition
  • I/O scheduler: NOOP for SSDs, deadline for mechanical disks
  • CPU frequency
  • cgroups: use cgroups to bind Ceph processes to CPUs and limit their memory

4. Network tuning

  • MTU set to 9000
  • Set interrupt affinity manually or use irqbalance
  • Enable NIC offload (TSO): ethtool -K ens33 tso on
  • RDMA
  • DPDK

5. Ceph tuning

  • PG number adjustment [Total_PGs = (Total_numbers_of_OSD * 100) / max_replication_count]; a worked example follows this list

  • client parameter

  • OSD parameter

  • Recovery tuning parameters
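
As a worked example of the PG formula, for a hypothetical cluster of 36 OSDs with 3 replicas: 36 * 100 / 3 = 1200, which rounds up to the nearest power of two, 2048 total PGs. A sketch of applying it, assuming a single data pool named volumes:

# Create the pool with the computed pg_num / pgp_num
ceph osd pool create volumes 2048 2048
# Or adjust an existing pool
ceph osd pool set volumes pg_num 2048
ceph osd pool set volumes pgp_num 2048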

Ceph performance optimization summary v0.94

configuration table 2

Applicable scenarios: hot data applications/virtualization scenarios

CPU: Intel Xeon E5-2650 v3, x2
Memory: 64GB DDR4, x4
OS disk: 2.5-inch Intel S3500 series 240GB SSD (RAID 1), x2
SSD disk: Intel S3700 400GB, x12
10GbE NIC: Intel dual-port 10GbE (with multi-mode modules), x2

Selection considerations

Ceph software version comparison

The version number has three components, x.y.z. x identifies the release cycle (for example, 13 for Mimic), and y identifies the release type:

  • x.0.z - development version (for early testers)
  • x.1.z - release candidate (for test cluster users)
  • x.2.z - stable/bugfix release (for users)
Luminous 12: released 2017.10; version 12.2.x; dashboard v1 with no management functions; BlueStore supported; kernel 4.4.z / 4.9.z
Mimic 13: released 2018.5; version 13.2.x; dashboard v2 with basic management functions; BlueStore stable; kernel 4.9.z / 4.14.z
Nautilus 14: released 2019.3; version 14.2.x; dashboard v2 with more management functions; BlueStore stable; kernel 4.9.z / 4.14.z
Octopus 15: released 2020.3; version 15.2.x; dashboard RESTful API requires Python 3 support; BlueStore improved with performance gains; CentOS 8, kernel 4.14 or above

Note: The release cycle of each Ceph community stable (LTS) version generally lags the current year by about two years, which means the latest stable version in 2020 is version 13, Mimic. Mimic 13.2.x is recommended for production environments

Ceph O version upgrade overview

Hardware Selection Scheme 1

Take the server PowerEdge R720 as an example

CPU: Intel Xeon E5-2650 v3 / E5-2630 v4 / E5-2690 / E5-2683 v4, x2. Intel Xeon series processors; allocate 1 core per OSD daemon.
Memory: >= 64GB. Each OSD daemon is allocated 1-2GB; the daemons and data migration/recovery both consume memory, so the more the better.
OS Disk: Intel DC S3500 240GB 2.5-inch, x2. RAID 1 (system-disk mirroring, to prevent a system-disk failure from causing Ceph cluster problems).
OSD Disk: Intel DC S4500 1.9TB 2.5-inch, x12. No RAID; BlueStore deployment.
NIC: GbE NIC + dual-port 10GbE (multi-mode fiber), x3. One public IP + front-end network (Public Network) + back-end network (Cluster Network).

Note: For IOPS-intensive scenarios, everything is built on SSDs to ensure optimal performance and to fully exploit BlueStore's optimizations for SSDs; this also simplifies deployment and later cluster maintenance, effectively reducing operational difficulty. All-flash scenarios should also consider higher-performance CPUs. However, because the deployment is all-SSD, the hardware budget will be high.

Hardware Selection Scheme 2

CPU: Intel Xeon E5-2650 v3 / E5-2630 v4 / E5-2690 / E5-2683 v4, x2. Intel Xeon series processors; allocate 1 core per OSD daemon.
Memory: >= 64GB. Each OSD daemon is allocated 1-2GB; the daemons and data migration/recovery both consume memory, so the more the better.
OS Disk: SSD/HDD 240GB 2.5-inch, x2. RAID 1 (system-disk mirroring).
SSD Disk: Intel DC P3700 2TB (1:12), x2. BlueStore journal and cache drive; cache pool storage.
HDD Disk: 4TB SAS/SATA HDD 2.5-inch, x12. No RAID; OSD data storage.
NIC: GbE NIC + dual-port 10GbE (multi-mode fiber), x3. One public IP + front-end network (Public Network) + back-end network (Cluster Network).

Note: Compared with the all-SSD option for IOPS-intensive scenarios, the SSD + HDD hybrid solution lowers hardware cost, but with a hybrid disk layout the performance benefits of BlueStore may be less obvious, so the hybrid architecture needs additional optimization. In addition, with SSDs on a PCIe physical interface, later hardware replacement is also a problem, which greatly increases the difficulty of Ceph cluster operation and maintenance.

Hardware Selection Scheme 3

CPU: Intel Xeon E5-2650 v3 / E5-2630 v4 / E5-2690 / E5-2683 v4, x2. Intel Xeon series processors; allocate 1 core per OSD daemon.
Memory: >= 64GB. Each OSD daemon is allocated 1-2GB; the daemons and data migration/recovery both consume memory, so the more the better.
OS Disk: Intel DC S3500 240GB 2.5-inch, x2. RAID 1 (system-disk mirroring, to prevent a system-disk failure from causing Ceph cluster problems).
HDD Disk: 7200 RPM 4TB SAS/SATA HDD 2.5-inch, x12. No RAID; OSD data storage.
NIC: GbE NIC + dual-port 10GbE (multi-mode fiber), x3. One public IP + front-end network (Public Network) + back-end network (Cluster Network).

Note: Taking 1PB as an example, high-throughput type. All OSDs are deployed on HDDs, and the SSD system disks use RAID 1 for redundancy.

Hardware Selection Scheme 4

CPU: Intel Xeon E5-2650 v3 / E5-2630 v4 / E5-2690, x2. Intel Xeon series processors; allocate 1 core per OSD daemon.
Memory: >= 64GB. Each OSD daemon is allocated 1-2GB; the daemons and data migration/recovery both consume memory, so the more the better.
OS Disk: Intel DC S3500 240GB 2.5-inch, x2. RAID 1 (system-disk mirroring, to prevent a system-disk failure from causing Ceph cluster problems).
SSD Disk: PCIe/NVMe 4TB, x2. One is used to create the cache pool; the other is partitioned to provide WAL and DB for the OSD disks.
HDD Disk: 7200 RPM 4TB SAS/SATA HDD 2.5-inch, x12. No RAID; forms the data pool for OSD data storage.
NIC: GbE NIC + dual-port 10GbE (multi-mode fiber), x3. One public IP + front-end network (Public Network) + back-end network (Cluster Network).

Note: With the hybrid-disk solution, the SSDs form a cache pool and the HDDs form a data pool. The SSDs' performance advantage can also be used when deploying BlueStore to hold the WAL and DB, boosting the performance of the HDD OSD disks. A sketch of the cache-tier setup follows.
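
A minimal sketch of the cache-tier setup described above, assuming hypothetical pool names cache_pool (on the SSD OSDs) and data_pool (on the HDD OSDs):

# Attach the SSD pool as a writeback cache tier in front of the HDD data pool
ceph osd tier add data_pool cache_pool
ceph osd tier cache-mode cache_pool writeback
ceph osd tier set-overlay data_pool cache_pool

# Basic cache sizing; the hit-set type and target size are illustrative values
ceph osd pool set cache_pool hit_set_type bloom
ceph osd pool set cache_pool target_max_bytes 1099511627776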

Hardware Selection Scheme 5

Data reference: Kunpeng distributed storage solution

Linux block-layer SSD cache solution: bcache

bcache combined with the BlueStore engine maximizes the performance of HDD + SSD hybrid storage; the basic steps are listed below (a command sketch follows the list)

  • Install bcache-tools
  • Upgrade Linux to a 4.14.x kernel
  • Create the backing device: make-bcache -B /dev/sdx
    • Format bcache0
  • Create the caching device: make-bcache -C /dev/sdx
  • Attach the cache to the backing device
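
A hedged command sketch of the steps above; the device paths are illustrative (/dev/sdb is the HDD backing device, /dev/nvme0n1p1 the SSD cache device):

make-bcache -B /dev/sdb               # create the backing device; /dev/bcache0 appears
make-bcache -C /dev/nvme0n1p1         # create the caching device (cache set)

# Look up the cache-set UUID and attach it to the backing device
CSET_UUID=$(bcache-super-show /dev/nvme0n1p1 | awk '/cset.uuid/ {print $2}')
echo "$CSET_UUID" > /sys/block/bcache0/bcache/attach

mkfs.xfs /dev/bcache0                 # format bcache0 and use it as the OSD data disk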

Reference: bcache configuration and usage

Enable compression on OSD storage pools

ceph osd pool set compression_
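
A minimal sketch of the BlueStore pool compression settings (the pool name is hypothetical; snappy and aggressive are example choices):

ceph osd pool set data_pool compression_algorithm snappy      # snappy / zlib / lz4 / zstd
ceph osd pool set data_pool compression_mode aggressive       # none / passive / aggressive / force
ceph osd pool set data_pool compression_required_ratio 0.875  # only keep chunks that compress below this ratio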

Hardware tuning

NVMe SSD tuning

  • Purpose

    Reduce the overhead of cross-chip data transfers.

  • Method

    Install the NVMe SSD and the NIC on the same riser card.

Memory population tuning

  • Purpose

    Populating memory in 1DPC mode (one DIMM per channel) gives the best performance: fill all DIMM 0 slots so that memory bandwidth is maximized.

  • Method

    Populate the DIMM 0 slots first, i.e. slots DIMM 000, 010, 020, 030, 040, 050, 100, 110, 120, 130, 140, 150. In the three-digit slot number, the first digit is the CPU, the second is the memory channel, and the third is the DIMM; fill the slots whose third digit is 0 first, in ascending order of memory channel.

System tuning

OS configuration parameters

vm.swappiness: swap is the system's virtual memory; using it degrades performance and should be avoided. Default: 60; reboot required: no. Symptom: performance drops noticeably once swap is used. Suggestion: disable swap usage by setting the parameter to 0. How to configure: run sudo sysctl vm.swappiness=0

MTU: the maximum packet size the NIC can pass; increasing it reduces the number of network packets and improves efficiency. Default: 1500 bytes; reboot required: no. Symptom: check with the ip addr command. Suggestion: set the maximum packet size to 9000 bytes. How to configure: run vi /etc/sysconfig/network-scripts/ifcfg-${Interface} and add MTU="9000" (${Interface} is the NIC name), then restart the network service with service network restart

pid_max: the system default pid_max is 32768, which is normally enough but can be exhausted by heavy workloads, eventually causing memory-allocation failures. Default: 32768; reboot required: no. Symptom: check with cat /proc/sys/kernel/pid_max. Suggestion: set the maximum number of threads the system can create to 4194303. How to configure: run echo 4194303 > /proc/sys/kernel/pid_max

file-max: sets the total number of files all processes together may open; individual processes can additionally set their own limits via setrlimit. If you see many errors about running out of file handles, increase this value. Default: 13291808; reboot required: no. Symptom: check with cat /proc/sys/fs/file-max. Suggestion: set it to the value reported by cat /proc/meminfo | grep MemTotal | awk '{print $2}'. How to configure: run echo ${file-max} > /proc/sys/fs/file-max, where ${file-max} is the value from the command above

read_ahead: Linux readahead prefetches a region of a file into the page cache so that subsequent reads of that region do not block on page faults. Since reading from memory is much faster than reading from disk, readahead effectively reduces disk seeks and application I/O wait time and is one of the important optimizations for disk read performance. Default: 128 KB; reboot required: no. Symptom: check with /sbin/blockdev --getra /dev/sdb. Suggestion: raise it to 8192 KB. How to configure: run /sbin/blockdev --setra 8192 /dev/sdb (using /dev/sdb as an example; apply to every data disk on every server)

I/O Scheduler: the Linux I/O scheduler is a kernel component that can be tuned to optimize system performance. Default: CFQ; reboot required: no. Symptom: the scheduler should be chosen per storage device type. Suggestion: set the I/O scheduling policy to deadline for HDDs and noop for SSDs. How to configure: run echo deadline > /sys/block/sdb/queue/scheduler (using /dev/sdb as an example; apply to every data disk on every server)

nr_requests: under a large volume of read requests the default request queue may not keep up; Linux can adjust the request queue depth dynamically, and the default depth is stored in /sys/block/hda/queue/nr_requests. Default: 128; reboot required: no. Symptom: raising nr_requests appropriately can improve disk throughput. Suggestion: set the disk request queue depth to 512. How to configure: run echo 512 > /sys/block/sdb/queue/nr_requests (using /dev/sdb as an example; apply to every data disk on every server)

Network performance parameter tuning

Tune moderately, taking the specific NIC model into account.

  • irqbalance: disable the system interrupt-balancing service and bind NIC interrupts to specific cores
  • rx_buff
  • ring_buff: increase NIC throughput
    • Check the current buffer size: ethtool -g <NIC name>
    • Adjust the buffer size: ethtool -G <NIC name> rx 4096 tx 4096
  • Enable the NIC's LRO feature

Steps for binding NIC interrupts to cores

# Find which NUMA node the NIC belongs to
cat /sys/class/net/<interface>/device/numa_node

# Use lscpu to find which CPU cores belong to that NUMA node

# Look up the NIC's interrupt numbers
cat /proc/interrupts | grep <interface> | awk -F ':' '{print $1}'

# Bind the soft interrupts to the cores of that NUMA node
echo <core number> > /proc/irq/<IRQ number>/smp_affinity_list

Note: It is best to disable NUMA in the BIOS from the start, which avoids unnecessary core-binding work!

Ceph tuning

Ceph configuration tuning

ceph.conf

[global]  # global settings
fsid = xxxxxxxxxxxxxxx                                  # cluster ID
mon host = 10.0.1.1,10.0.1.2,10.0.1.3                   # monitor IP addresses
auth cluster required = cephx                           # cluster authentication
auth service required = cephx                           # service authentication
auth client required = cephx                            # client authentication
osd pool default size = 3                               # default number of replicas (default 3)
osd pool default min size = 1                           # minimum replicas a PG needs to accept I/O; a degraded PG can still serve I/O
public network = 10.0.1.0/24                            # public network (monitor IP subnet)
cluster network = 10.0.2.0/24                           # cluster network
max open files = 131072                                 # default 0; if set, Ceph sets the system max open fds
mon initial members = node1, node2, node3               # initial monitors (as set when the monitors were created)
##############################################################
[mon]
mon data = /var/lib/ceph/mon/ceph-$id
mon clock drift allowed = 1                             # default 0.05; allowed clock drift between monitors
mon osd min down reporters = 13                         # default 1; minimum number of OSDs that must report a peer down to the monitor
mon osd down out interval = 600                         # default 300; seconds Ceph waits before marking an OSD down and out
##############################################################
[osd]
osd data = /var/lib/ceph/osd/ceph-$id
osd journal size = 20000                                # default 5120; OSD journal size (MB)
osd journal = /var/lib/ceph/osd/$cluster-$id/journal    # OSD journal location
osd mkfs type = xfs                                     # filesystem type used when formatting
osd max write size = 512                                # default 90; maximum size of a single OSD write (MB)
osd client message size cap = 2147483648                # default 100MB; maximum client data held in memory (bytes)
osd deep scrub stride = 131072                          # default 524288; bytes read at a time during deep scrub
osd op threads = 16                                     # default 2; number of concurrent operation threads
osd disk threads = 4                                    # default 1; threads for disk-intensive work such as recovery and scrubbing
osd map cache size = 1024                               # default 500; OSD map cache kept in memory (MB)
osd map cache bl size = 128                             # default 50; in-memory OSD map cache of the OSD process (MB)
osd mount options xfs = "rw,noexec,nodev,noatime,nodiratime,nobarrier"  # default rw,noatime,inode64; Ceph OSD XFS mount options
osd recovery op priority = 2                            # default 10; recovery op priority, 1-63, higher values consume more resources
osd recovery max active = 10                            # default 15; number of recovery requests active at the same time
osd max backfills = 4                                   # default 10; maximum number of backfills allowed per OSD
osd min pg log entries = 30000                          # default 3000; minimum number of PG log entries kept when trimming
osd max pg log entries = 100000                         # default 10000; maximum number of PG log entries kept when trimming
osd mon heartbeat interval = 40                         # default 30; interval (seconds) at which an OSD pings a monitor
ms dispatch throttle bytes = 1048576000                 # default 104857600; maximum bytes of messages waiting to be dispatched
objecter inflight ops = 819200                          # default 1024; client flow control: max unsent I/O requests; exceeding it blocks application I/O, 0 means unlimited
osd op log threshold = 50                               # default 5; how many operations to log at a time
osd crush chooseleaf type = 0                           # default 1; bucket type used when a CRUSH rule calls chooseleaf
filestore xattr use omap = true                         # default false; use an object map for XATTRs, required on EXT4, also usable on XFS or btrfs
filestore min sync interval = 10                        # default 0.1; minimum interval between syncs from journal to data disk (seconds)
filestore max sync interval = 15                        # default 5; maximum interval between syncs from journal to data disk (seconds)
filestore queue max ops = 25000                         # default 500; maximum operations accepted by the data-disk queue
filestore queue max bytes = 1048576000                  # default 100MB; maximum bytes of a single operation on the data disk (bytes)
filestore queue committing max ops = 50000              # default 500; maximum operations the data disk can commit at once
filestore queue committing max bytes = 10485760000      # default 100MB; maximum bytes the data disk can commit at once (bytes)
filestore split multiple = 8                            # default 2; multiplier for the maximum number of files in a subdirectory before it splits
filestore merge threshold = 40                          # default 10; minimum number of files in a subdirectory before it merges back into its parent
filestore fd cache size = 1024                          # default 128; object file-handle cache size
filestore op threads = 32                               # default 2; number of concurrent filesystem operation threads
journal max write bytes = 1073714824                    # default 1048560; maximum bytes written to the journal at once (bytes)
journal max write entries = 10000                       # default 100; maximum entries written to the journal at once
journal queue max ops = 50000                           # default 50; maximum operations in the journal queue at once
journal queue max bytes = 10485760000                   # default 33554432; maximum bytes in the journal queue at once (bytes)
##############################################################
[client]
rbd cache = true                                        # default true; RBD cache
rbd cache size = 335544320                              # default 33554432; RBD cache size (bytes)
rbd cache max dirty = 134217728                         # default 25165824; maximum dirty bytes allowed in write-back mode (bytes); 0 means write-through
rbd cache max dirty age = 30                            # default 1; how long dirty data stays in the cache before being flushed to disk (seconds)
rbd cache writethrough until flush = false              # default true; kept for compatibility with virtio drivers older than linux-2.6.32, which never send flush requests and so would never trigger writeback
                                                        # with this set, librbd performs I/O in writethrough mode until the first flush request arrives, then switches to writeback
rbd cache max dirty object = 2                          # default 0, meaning the value is derived from rbd cache size; librbd logically splits an image into 4MB chunks,
                                                        # each chunk is an Object, and librbd manages its cache per Object, so raising this value can improve performance
rbd cache target dirty = 235544320                      # default 16777216; dirty-data size at which writeback starts; must not exceed rbd_cache_max_dirty
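
Many of the OSD recovery settings above can also be changed at runtime without restarting the daemons; a hedged sketch using values that mirror the configuration above:

# Inject settings into all running OSDs
ceph tell osd.* injectargs '--osd_max_backfills 4 --osd_recovery_max_active 10 --osd_recovery_op_priority 2'

# On Mimic and later, the settings can instead be persisted in the monitors' central config database
ceph config set osd osd_max_backfills 4
ceph config set osd osd_recovery_max_active 10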

PG distribution tuning

pg_num: Total PGs = (Total_number_of_OSD * 100) / max_replication_count, rounded up to the nearest power of two. Default: 8; reboot required: no. Symptom: a warning is raised if the PG count is too low. Suggestion: use the value computed from the formula.
pgp_num: set pgp_num to the same value as pg_num. Default: 8; reboot required: no. Suggestion: keep pgp_num equal to pg_num.
ceph balancer mode: enable the balancer plugin and set its mode to "upmap". Default: none; reboot required: no. Symptom: if PGs are distributed unevenly, individual OSDs carry more load and become bottlenecks. Suggestion: upmap (see the sketch below).
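
A minimal sketch of enabling the upmap balancer; the min-compat-client step is included because upmap requires Luminous-or-newer clients:

ceph osd set-require-min-compat-client luminous
ceph balancer on
ceph balancer mode upmap
ceph balancer status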

OSD core binding

If NUMA has already been disabled in the BIOS, this step can be skipped. Reference: how to disable a NUMA node on CentOS 7 (qq_34065508's blog, CSDN)

Binding methods (a sketch follows the list):

  • numactl
  • cgroup
  • taskset
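
A hedged example of the taskset and numactl approaches, assuming osd.0 is managed by systemd and should run on cores 0-7 of NUMA node 0:

# Bind an already-running OSD with taskset
OSD_PID=$(systemctl show -p MainPID ceph-osd@0 | cut -d= -f2)
taskset -cp 0-7 "$OSD_PID"

# Or launch the daemon under numactl so both CPU and memory stay on node 0
numactl --cpunodebind=0 --membind=0 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph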

Bcache enablement tuning

Bcache is a block-layer cache in the Linux kernel that uses an SSD as a cache in front of HDDs to accelerate them. The bcache kernel module requires recompiling the kernel to enable it.

zlib hardware acceleration tuning

Hardware Selection Scheme 6

Compute nodes: latest CentOS 7; storage nodes: latest CentOS 8

Recommended hardware list

Compute node: R630, dual CPU, 1 system disk

Storage node memory is sized by configured capacity: 2GB of RAM per TB of capacity plus 16GB for caching; 1 system disk

Storage node, 3.5-inch bays (SAS/SATA): R730/R720 (single CPU suits SATA; dual CPU suits multi-NVMe high performance)
Storage node, 2.5-inch bays (SATA/U.2): R630 (10 bays: 6 x SATA, 4 x U.2)

Recommended disks

HDD (enterprise): Seagate or Western Digital HDDs
SATA SSD (enterprise): Intel S3610/S3710 (MLC), Samsung PM883, Micron 5200 MAX
High-performance U.2 (usable as data disks): Intel P4600/P4500/P3700/P3600/P3500, Samsung PM963
High-performance PCIe NVMe (usable as cache disks): Optane 905P/900P, Intel P4700/P3700, Samsung 1725a/1725b

NIC: Mellanox ConnectX-3 (supports RDMA)

Compute nodes

Cloud platform nodes that provide compute resources to customer virtual machines

Number of nodes: three

Server: Dell R730
CPU: E5-2650 v3 * 2
Memory: 128GB
Disks: system disk (RAID 1) + back-end Ceph distributed storage
NIC: 40G IB NIC (storage and management network) + bonded dual onboard GbE ports (egress network)
OS: latest CentOS 7
Kernel: default kernel

Storage nodes

Connect to the cloud platform and provide distributed storage

Number of nodes: three

Server: Dell R630
CPU: E5-2609 v3 * 2
Memory: 128GB
Disks: system disk (RAID 1) + SSD (NVMe/PCIe/SATA3) + 2.5-inch SAS HDDs
NIC: 40G IB NIC * 2 (front-end management network + back-end storage network) + 1 public IP
OS: latest CentOS 7
Kernel: default kernel
Ceph: v14.2.10 (nautilus), running OSD + MON + MDS + MGR

Three nodes, three replicas

Solution 1: BlueStore BlueFS (block_db + block_wal) + HDD data disks

intel DC S3700/S3710 * 1(400G) 【¥3000】 + 5 * HDD(2T) 【¥2000】

total:10T

Solution 2: SSD as cache pool + Bluestore-Bluefs(block_db+block_wal) + HDD data disk as volume pool

intel DC S4600/S4500 * 2 (480G) 【¥2000】 + intel DC S3700/S3710(200G) 【¥3000】* 1 + 3 * HDD(2T) 【¥2000】

total:cache(1T)+volume(6T)

Solution 3: Pure HDD data disk

DeLL SAS HDD(2T) * 6 【¥2000】

total:12T

Solution 4: Pure SSD data disk

intel DC S4600/S4500 * 6 (480G) 【¥2000】

total:2.4T


Source: blog.csdn.net/wxb880114/article/details/130646069