Hybrid hardware architecture proposal
Considering cost, choose a hybrid SSD + HDD hardware architecture
- Solution 1: Place primary replicas on SSD OSDs and secondary replicas on HDD OSDs via a new CRUSH rule
- Solution 2: Use CRUSH rules to control storage-pool placement priority and separate hot and cold data
- Solution 3: Tiered storage using Ceph cache tiering
- Solution 4: Allocate block.db and block.wal partitions on SSDs to accelerate BlueStore
Reference: Ceph Distributed Storage Hybrid Hardware Architecture Solution
Ceph best practices
Node hardware configuration
OSD node configuration
For IOPS-intensive scenarios, the recommended server configuration is:
OSD: four OSDs (LVM) per NVMe SSD
Controller: native PCIe bus
Network: one 10GbE port per 12 OSDs
Memory: 16 GB baseline + 2 GB per OSD
CPU: 5 cores per NVMe SSD
For high-throughput scenarios, the recommended server configuration is:
OSD: 7200 RPM HDDs
Network: one 10GbE port per 12 OSDs
Memory: 16 GB baseline + 2 GB per OSD
CPU: 1 core per HDD
For high-capacity scenarios, the recommended server configuration is:
OSD: 7200 RPM HDDs
Network: one 10GbE port per 12 OSDs
Memory: 16 GB baseline + 2 GB per OSD
CPU: 1 core per HDD
Note: 1C of CPU here means 1 GHz of clock speed
Other node configurations:
MDS: 4C/2G/10Gbps
Monitor: 2C/2G/10Gbps
Manager: 2C/2G/10Gbps
BlueStore device ratio for slow (data), DB, and WAL:
slow (SATA) : DB (SSD) : WAL (NVMe SSD) = 100 : 1 : 1
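Under this layout, provisioning one such OSD might look like the sketch below. The device paths (/dev/sdb, /dev/sdc1, /dev/nvme0n1p1) are placeholders for this example, and the command is echoed rather than executed so it can be reviewed first:

```shell
#!/bin/sh
# Sketch: one BlueStore OSD with data on a SATA HDD, block.db on a SATA/SAS
# SSD partition, and block.wal on an NVMe SSD partition. Paths are placeholders.
set -eu
DATA_DEV=/dev/sdb        # slow device (SATA HDD)
DB_DEV=/dev/sdc1         # DB device (SSD partition)
WAL_DEV=/dev/nvme0n1p1   # WAL device (NVMe SSD partition)

run() { echo "$@"; }     # dry-run wrapper; drop the echo to apply for real

run ceph-volume lvm create --bluestore \
    --data "$DATA_DEV" --block.db "$DB_DEV" --block.wal "$WAL_DEV"
```

Repeat once per data disk; ceph-volume creates the LVM volumes and symlinks block.db/block.wal for you.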
Ceph cluster practice plan
hardware recommendation
Example: 1 PB raw capacity, high-throughput type
OSD nodes
Quantity: 21
CPU: 16c
Memory: 64G
Network: 10Gbps * 2
Disks: 7200 RPM 4T HDD * 12 (12 OSDs) + 1 system disk
System: Ubuntu 18.04
Monitor nodes
Quantity: 3
CPU: 2c
Memory: 2G
Network: 10Gbps * 2
Disk: 20G
System: Ubuntu 18.04
Manager nodes
Quantity: 2
CPU: 2c
Memory: 2G
Network: 10Gbps * 2
Disk: 20G
System: Ubuntu 18.04
MDS nodes (for CephFS)
Quantity: 2
CPU: 4c
Memory: 2G
Network: 10Gbps * 2
Disk: 20G
System: Ubuntu 18.04
Intel Hardware Solution White Paper
PCIe/NVMe SSD to HDD ratio 1:12, with an Intel PCIe/NVMe P3700 as the journal disk
SATA SSD to HDD ratio 1:4, with an Intel SATA SSD (S3700) as the journal disk
 | Good configuration | Better configuration | Best configuration |
---|---|---|---|
CPU | Intel Xeon Processor E5-2650 v3 | Intel Xeon Processor E5-2690 | Intel Xeon Processor E5-2699 v3 |
NIC | 10GbE | 10GbE * 2 | 10GbE * 4 + 40GbE * 2 |
Drives | 1 * 1.6TB P3700 + 12 * 4T SAS (1:12; P3700 as journal and cache drive) | 1 * 800GB P3700 + 4 * 1.6TB S510 (P3700 as journal and cache drive) | 4-6 * 2TB P3700 |
Memory | 64GB | 128GB | >=128GB |
A PCIe/NVMe SSD can serve as the journal drive, with high-capacity, low-cost SATA SSDs as the OSD data drives. This configuration is most cost-effective for use cases that need high performance — especially high IOPS and strict SLAs — with moderate storage-capacity requirements.
Reference: Intel Solutions for Ceph Deployments
New methods and ideas for Ceph performance optimization
The Red Hat website gives recommended Ceph cluster server hardware configurations (CPU/Memory/Disk/Network) for different application scenarios. These are intended only as a reference for server selection, not as prescriptive configurations.
Scenarios include the following:
Scenario 1: IOPS-optimized (IOPS with low latency) — workloads with strong real-time requirements but small data volumes, such as order generation.
Scenario 2: Throughput-optimized — high throughput with moderate IOPS/latency requirements, such as live streaming.
Scenario 3: Cost/capacity-optimized (large storage capacity at low price), such as storage of large files.
Red Hat CEPH Deployment Hardware Configuration Guide
Enterprise Ceph version selection, and suggestions for using the BlueStore engine with SSD + SATA
(1) The LTS version generally lags about two years behind the current year, so the current stable version is the "M" release, Mimic.
(2) The WAL is RocksDB's write-ahead log, equivalent to the old journal data; the DB holds RocksDB's metadata. The device-speed priority is block.wal > block.db > block (data); all three can also be placed on the same disk.
(3) By default the WAL and DB are 512 MB and 1 GB respectively. The recommendation is to size block.db at about 4% of the main device and block.wal at about 6%, roughly 10% combined.
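As a worked example of the 4% / 6% guidance above (the 4 TB = 4096 GB data device is illustrative, not a requirement):

```shell
#!/bin/sh
# Sketch: derive block.db / block.wal sizes from the main-device capacity
# using the percentages quoted above. Integer GB arithmetic, rounded down.
set -eu
DATA_GB=4096                      # 4 TB data device (example)
DB_GB=$(( DATA_GB * 4 / 100 ))    # block.db at ~4% of the main device
WAL_GB=$(( DATA_GB * 6 / 100 ))   # block.wal at ~6%, per the note above
echo "block.db:  ${DB_GB} GB"
echo "block.wal: ${WAL_GB} GB"
```

For a 4 TB OSD this gives roughly 163 GB of DB and 245 GB of WAL, which is why a single 400-480 GB SSD is often shared by only a few HDD OSDs.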
Ceph distributed storage hardware planning and operation and maintenance considerations
About Ceph performance optimization and hardware selection
http://vlambda.com/wz_x6mXhhxv3M.html
SSD interface types
- SATA
- PCI-E
- NVMe
Recommended SSD model
- Seagate Nytro 1351/1551
- HGST SN260
- Intel P4500
- Intel P3700
- Intel S3500
- Intel S4500
- Intel SSD 730 series
- Intel D3-S4510 SSD
- Micron 5100/5200 and soon 5300
Choose the Intel S37 or S46 series; with a larger budget, go for the Intel P series
Configuration table 1
To make the Ceph cluster run stably and cost-effectively, use the following hardware configuration:
Item | Configuration | Notes |
---|---|---|
OS Disk | 2 * 600G SAS (SEAGATE 600GB SAS 6Gbps 15K 2.5-inch) | Choose SSD, SAS, or SATA according to budget. RAID1, so a system-disk failure does not cause cluster problems |
OSD Disk | 8 * 4T SAS (SEAGATE 4T SAS 7200) | Choose SAS or SATA disks according to budget; each physical machine gets 8 * 4T disks for data. No RAID |
Monitor Disk | 1 * 480G SSD (Intel SSD 730 series) | Disk for the monitor process; choose an SSD for speed |
Journal Disk | 2 * 480G SSD (Intel SSD 730 series) | Each SSD is split into 4 partitions, one per OSD; a node has 8 journal partitions in total for its 8 OSDs. No RAID |
CPU | E5-2630 v4 * 2 | Within budget, the more capable the CPU the better |
Memory | >=64G | If budget allows, go straight to 128G |
NIC | 40GbE * 2 optical ports + public IP | 40GbE NICs guarantee data-synchronization speed (front-end and back-end networks) |
CPU
Each OSD daemon needs at least one CPU core.
The calculation formula is as follows:
((cpu sockets * cpu cores per socket * cpu clock speed in GHZ)/No. of OSD) >= 1
Example with an Intel Xeon Processor E5-2630 V4 (2.2 GHz, 10 cores):
1 * 10 * 2.2 / 8 = 2.75 # greater than 1; in theory one node could run 20+ OSD daemons, but to limit the data-migration impact of too many OSDs per node, we cap it at 8 OSD daemons
Memory
An OSD daemon needs at least 1 GB of memory, but considering memory use during data migration, pre-allocating 2 GB per OSD is recommended.
During data recovery, roughly 1 GB of memory is needed per 1 TB of data, so the more memory the better
Disk
System disk
Choose SSD, SAS, or SATA according to budget, and always use RAID so a disk failure does not cause downtime
OSD disk
For overall cost-effectiveness, choose SAS or SATA disks of 4T each. For IO-intensive workloads, higher-performance SSDs can also be configured as OSD disks and carved out into a separate pool
Journal disk
Used for journal writes; choose an SSD for speed
Monitor disk
Disk for the monitor process; an SSD is recommended. If the Ceph monitor runs on a dedicated physical machine, use two SSDs in RAID1; if the monitor shares a node with OSDs, prepare a separate SSD as the monitor disk
NIC
An OSD node gets two 10GbE NICs: one for the public (management) network, one for the cluster network used for inter-OSD communication
Reference:
Ceph hardware selection, performance tuning
1. Selection of application scenarios
- IOPS Low Latency - Block Storage
- Throughput first - block storage, file system, object storage
- Large storage capacity - object storage, file system
2. Hardware selection and optimization suggestions
- CPU: allocate one CPU core per OSD daemon [((cpu sockets * cpu cores per socket * cpu clock speed in GHZ) / No. of OSD) >= 1]
- Memory: at least 1 GB per OSD daemon; restoring 1 TB of data takes about 1 GB of memory, so 2 GB RAM per OSD is better. mon and mds daemons on the node should each get 2 GB or more; the more RAM, the better CephFS performance
- NIC: a large cluster (dozens of nodes) should use 10GbE. The network matters most during recovery and rebalancing; a 10GbE network shortens cluster recovery time. Use dual NICs per node and consider separating the client and cluster networks
- Disks: SSD as journal disk, 10-20 GB per journal; one SSD per 4 OSD data disks is recommended; SSD choice: Intel SSD DC S3500 series.
  For good performance with SATA/SAS SSDs, the SSD-to-OSD ratio should be 1:4, i.e. 4 OSD data disks share one SSD.
  For PCIe or NVMe flash devices, depending on device performance, the SSD-to-OSD ratio can reach 1:12 or 1:18
- OSD node density: the OSD density per node is also an important factor in cluster performance, usable capacity, and TCO. Generally, many small-capacity nodes are better than a few large-capacity nodes
- BIOS: enable VT and HT; disable power saving; disable NUMA
3. Operating system tuning
- read_ahead: improve disk reads by prefetching data into memory ahead of use
- Adjust the maximum number of processes
- turn off the swap partition
- I/O scheduler: noop for SSDs, deadline for HDDs
- CPU frequency
- cgroups - use cgroup to bind ceph's CPU and limit memory
4. Network tuning
- MTU set to 9000
- Set interrupt affinity manually or use irqbalance
- Enable TSO (TCP segmentation offload): ethtool -K ens33 tso on
- RDMA
- DPDK
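The common subset of the network tunings above can be sketched as follows. The interface name ens33 is a placeholder, and the commands are echoed rather than executed so the sketch is safe to review first:

```shell
#!/bin/sh
# Sketch: apply MTU, IRQ, and offload tunings to one interface (placeholder
# name ens33). RDMA/DPDK are hardware/stack-specific and not covered here.
set -eu
IF=ens33
run() { echo "$@"; }                 # dry-run wrapper; drop echo to apply

run ip link set "$IF" mtu 9000       # jumbo frames (switch must support 9000 too)
run systemctl stop irqbalance        # then pin IRQs manually, or keep irqbalance
run ethtool -K "$IF" tso on          # enable TCP segmentation offload
```

Note that MTU 9000 must be configured end to end (NICs and switches), or fragmentation will hurt more than the tuning helps.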
5. Ceph tuning
- PG number adjustment [Total_PGs = (Total_number_of_OSDs * 100) / max_replication_count]
- Client parameters
- OSD parameters
- Recovery tuning parameters
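The PG formula in the list above can be worked through for this document's 1 PB layout (21 OSD nodes * 12 OSDs = 252 OSDs, 3 replicas), rounding up to the nearest power of two as Ceph expects:

```shell
#!/bin/sh
# Worked example of Total_PGs = (OSDs * 100) / max_replication_count,
# rounded up to the nearest power of 2.
set -eu
OSDS=252
REPLICAS=3
PGS=$(( OSDS * 100 / REPLICAS ))            # raw value from the formula
P=1
while [ "$P" -lt "$PGS" ]; do P=$(( P * 2 )); done   # round up to power of 2
echo "raw: $PGS  pg_num: $P"
```

The raw value 8400 rounds up to pg_num = 16384; whether to take the power of two above or below the raw value is a capacity-vs-overhead trade-off, so treat this as a starting point.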
Ceph performance optimization summary v0.94
Configuration table 2
Applicable scenarios: hot-data applications / virtualization
Equipment | Configuration | Quantity |
---|---|---|
CPU | Intel Xeon E5-2650 v3 | 2 |
Memory | 64G DDR4 | 4 |
OS disk | 2.5-inch Intel S3500 series 240GB SSD (RAID1) | 2 |
SSD disk | Intel S3700 400GB | 12 |
10 Gigabit NIC | Intel dual-port 10 Gigabit (including multi-mode module) | 2 |
Selection considerations
Ceph software version comparison
The version number has three components, x.y.z. x identifies the release cycle (for example, 13 for Mimic). y identifies the release type:
- x.0.z - development releases (for early testers)
- x.1.z - release candidates (for test clusters)
- x.2.z - stable/bugfix releases (for users)
 | Luminous 12 | Mimic 13 | Nautilus 14 | Octopus 15 |
---|---|---|---|---|
Date | 2017.10 | 2018.5 | 2019.3 | 2020.3 |
Version | 12.2.x | 13.2.x | 14.2.x | 15.2.x |
Dashboard | v1, no management functions | v2, basic management functions | v2, more management functions | RESTful API requires Python 3 |
BlueStore | supported | stable | stable | further improvements and performance gains |
Kernel | 4.4.z, 4.9.z | 4.9.z, 4.14.z | 4.9.z, 4.14.z | CentOS 8, kernel 4.14 or above |
Note: each Ceph community stable (LTS) release generally lags about two years behind the current year, which means the newest stable version in 2020 is Mimic (13). Mimic 13.2.x is recommended for production environments
Ceph O version upgrade overview
Hardware Selection Scheme 1
Take the server PowerEdge R720 as an example
Equipment | Configuration | Quantity | Notes |
---|---|---|---|
CPU | Intel Xeon E5-2650 v3 / E5-2630 v4 / E5-2690 / E5-2683 v4 | 2 | Intel Xeon series; allocate 1 core per OSD daemon |
Memory | >=64G | | 1G-2G per OSD daemon; the daemons and data migration/recovery both consume memory, so the more the better |
OS Disk | Intel DC S3500 240GB 2.5-inch | 2 | RAID1 (system-disk mirroring, so a system-disk failure does not take down the Ceph cluster) |
OSD Disk | Intel DC S4500 1.9TB 2.5-inch | 12 | No RAID; BlueStore deployment |
NIC | Gigabit NIC + dual-port 10GbE (multimode fiber) | 3 | 1 public IP + front-end network (Public Network) + back-end network (Cluster Network) |
Note: For IOPS-intensive scenarios, everything is built on SSDs to guarantee optimal performance and exploit BlueStore's SSD optimizations; this also simplifies deployment and later cluster maintenance, effectively reducing operational difficulty. All-flash scenarios should also budget for higher-performance CPUs. Because the deployment is all-SSD, however, the hardware cost will be high
Hardware Selection Scheme 2
Equipment | Configuration | Quantity | Notes |
---|---|---|---|
CPU | Intel Xeon E5-2650 v3 / E5-2630 v4 / E5-2690 / E5-2683 v4 | 2 | Intel Xeon series; allocate 1 core per OSD daemon |
Memory | >=64G | | 1G-2G per OSD daemon; the daemons and data migration/recovery both consume memory, so the more the better |
OS Disk | SSD/HDD 240GB 2.5-inch | 2 | RAID1 (system-disk mirroring) |
SSD Disk | Intel DC P3700 2TB (1:12) | 2 | BlueStore journal and cache drive; cache-pool storage |
HDD Disk | 4T SAS/SATA HDD 2.5-inch | 12 | No RAID; OSD data storage |
NIC | Gigabit NIC + dual-port 10GbE (multimode fiber) | 3 | 1 public IP + front-end network (Public Network) + back-end network (Cluster Network) |
Note: For IOPS-intensive scenarios, the SSD + HDD hybrid lowers hardware cost, but with mixed disks BlueStore's performance optimizations may be less pronounced, and the hybrid architecture needs its own tuning. Also, with PCIe-interface SSDs, later hardware replacement is a problem, which considerably increases the operational difficulty of the Ceph cluster.
Hardware Selection Scheme 3
Equipment | Configuration | Quantity | Notes |
---|---|---|---|
CPU | Intel Xeon E5-2650 v3 / E5-2630 v4 / E5-2690 / E5-2683 v4 | 2 | Intel Xeon series; allocate 1 core per OSD daemon |
Memory | >=64G | | 1G-2G per OSD daemon; the daemons and data migration/recovery both consume memory, so the more the better |
OS Disk | Intel DC S3500 240GB 2.5-inch | 2 | RAID1 (system-disk mirroring, so a system-disk failure does not take down the Ceph cluster) |
HDD Disk | 7200 RPM 4T SAS/SATA HDD 2.5-inch | 12 | No RAID; OSD data storage |
NIC | Gigabit NIC + dual-port 10GbE (multimode fiber) | 3 | 1 public IP + front-end network (Public Network) + back-end network (Cluster Network) |
Note: Example for 1 PB, high-throughput type. All OSDs are deployed on HDDs; the SSD system disks use RAID1 for redundancy
Hardware Selection Scheme 4
Equipment | Configuration | Quantity | Notes |
---|---|---|---|
CPU | Intel Xeon E5-2650 v3 / E5-2630 v4 / E5-2690 | 2 | Intel Xeon series; allocate 1 core per OSD daemon |
Memory | >=64G | | 1G-2G per OSD daemon; the daemons and data migration/recovery both consume memory, so the more the better |
OS Disk | Intel DC S3500 240GB 2.5-inch | 2 | RAID1 (system-disk mirroring, so a system-disk failure does not take down the Ceph cluster) |
SSD Disk | PCIe/NVMe 4T | 2 | One for the cache pool; the other partitioned to provide WAL and DB for the OSD disks |
HDD Disk | 7200 RPM 4T SAS/SATA HDD 2.5-inch | 12 | No RAID; forms the data pool for OSD data storage |
NIC | Gigabit NIC + dual-port 10GbE (multimode fiber) | 3 | 1 public IP + front-end network (Public Network) + back-end network (Cluster Network) |
Note: Hybrid disk scheme: the SSDs form a cache pool and the HDDs a data pool. The SSDs' performance advantage can also be used at BlueStore deployment time to host the WAL and DB, boosting the performance of the HDD OSDs.
Hardware Selection Scheme 5
Data reference: Kunpeng distributed storage solution
Linux block-layer SSD cache: bcache
bcache plus the BlueStore engine maximizes the performance of HDD + SSD hybrid storage:
- Install bcache-tools
- Upgrade Linux to a 4.14.x kernel
- Create the backing device: make-bcache -B /dev/sdx
- Format bcache0
- Create the caching device: make-bcache -C /dev/sdx
- Attach the cache to the backing device
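The steps above can be sketched as follows. /dev/sdb (HDD backing) and /dev/nvme0n1 (SSD cache) are placeholders, and the commands are echoed rather than executed:

```shell
#!/bin/sh
# Sketch of the bcache setup steps listed above (dry-run: commands echoed).
set -eu
BACKING=/dev/sdb       # HDD to be accelerated
CACHE=/dev/nvme0n1     # SSD used as cache
run() { echo "$@"; }   # drop the echo to apply for real

run make-bcache -B "$BACKING"     # backing device -> appears as /dev/bcache0
run make-bcache -C "$CACHE"       # caching device -> a cache set with a UUID
# attach: write the cache set UUID (from bcache-super-show) into sysfs
run 'echo <cset-uuid> > /sys/block/bcache0/bcache/attach'
run mkfs.xfs /dev/bcache0          # then hand /dev/bcache0 to ceph-volume
```

The `<cset-uuid>` placeholder comes from `bcache-super-show` on the caching device; it is left unfilled here on purpose.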
Enable compression on OSD pools
ceph osd pool set compression_
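The command above is truncated in the source; the usual pool-level BlueStore compression settings look like the following sketch. The pool name "rbd-pool" is a placeholder, and snappy/aggressive are common choices rather than the only ones; commands are echoed, not executed:

```shell
#!/bin/sh
# Sketch: enable BlueStore compression on a pool (dry-run: commands echoed).
set -eu
POOL=rbd-pool          # placeholder pool name
run() { echo "$@"; }   # drop the echo to apply for real

run ceph osd pool set "$POOL" compression_algorithm snappy
run ceph osd pool set "$POOL" compression_mode aggressive
```

`compression_mode` can also be `none`, `passive`, or `force`; `passive` compresses only when the client hints that data is compressible.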
Hardware tuning
NVMe SSD tuning
- Goal: reduce cross-chip data-transfer overhead.
- Method: install the NVMe SSD on the same riser card as the NIC.
Memory population tuning
- Goal: memory performs best populated one-DIMM-per-channel (1DPC), i.e. with every DIMM0 slot filled, which maximizes memory bandwidth.
- Method: populate the DIMM0 slots first, i.e. slots 000, 010, 020, 030, 040, 050, 100, 110, 120, 130, 140, 150. In the three-digit slot number, the first digit is the CPU, the second the memory channel, and the third the DIMM; fill the slots whose third digit is 0 first, in ascending channel order.
System tuning
OS configuration parameters
Parameter | Meaning | Tuning advice | How to configure |
---|---|---|---|
vm.swappiness | Swap is disk-backed virtual memory; using it degrades performance and should be avoided. | Default: 60. Reboot required: no. Symptom: performance drops sharply once swap is used. Advice: disable swap usage by setting this to 0. | Run sudo sysctl vm.swappiness=0 |
MTU | The maximum packet size the NIC can pass; raising it reduces the number of packets and improves efficiency. | Default: 1500 bytes. Reboot required: no. Check with ip addr. Advice: set the NIC's maximum packet size to 9000 bytes. | Edit /etc/sysconfig/network-scripts/ifcfg-${Interface} and add MTU="9000" (${Interface} is the NIC name), then restart networking: service network restart |
pid_max | The system default pid_max of 32768 is normally enough, but heavy workloads can exhaust it, eventually causing memory-allocation failures. | Default: 32768. Reboot required: no. Check with cat /proc/sys/kernel/pid_max. Advice: raise the maximum number of threads the system can create to 4194303. | Run echo 4194303 > /proc/sys/kernel/pid_max |
file-max | file-max sets the total number of file handles all processes together may open; individual processes can additionally be limited via setrlimit. Increase it if you see errors about running out of file handles. | Default: 13291808. Reboot required: no. Check with cat /proc/sys/fs/file-max. Advice: set it to the value reported by cat /proc/meminfo | grep MemTotal | awk '{print $2}'. | Run echo ${file-max} > /proc/sys/fs/file-max, where ${file-max} is the value reported by cat /proc/meminfo | grep MemTotal | awk '{print $2}' |
read_ahead | Linux readahead prefetches a region of a file into the page cache so that subsequent reads of that region do not block on page faults. Since reading from memory is much faster than from disk, readahead effectively reduces disk seeks and application I/O wait time, and is a key optimization for disk read performance. | Default: 128 KB. Reboot required: no. Check with /sbin/blockdev --getra /dev/sdb. Advice: raise to 8192 KB. | Run /sbin/blockdev --setra 8192 /dev/sdb — "/dev/sdb" is an example; apply to every data disk on every server |
I/O Scheduler | The Linux I/O scheduler is a kernel component that can be tuned to optimize system performance. | Default: CFQ. Reboot required: no. The scheduler should match the storage type. Advice: set deadline for HDDs and noop for SSDs. | Run echo deadline > /sys/block/sdb/queue/scheduler — "/dev/sdb" is an example; apply to every data disk on every server |
nr_requests | With many read requests, the default request queue may not keep up; Linux can resize it dynamically via /sys/block/<dev>/queue/nr_requests. | Default: 128. Reboot required: no. Raising nr_requests appropriately improves disk throughput. Advice: set the disk request queue depth to 512. | Run echo 512 > /sys/block/sdb/queue/nr_requests — "/dev/sdb" is an example; apply to every data disk on every server |
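The table's settings can be applied per data disk with a sketch like the one below. The disk name sdb is a placeholder, and commands are echoed rather than executed so the script can be reviewed first:

```shell
#!/bin/sh
# Sketch: apply the OS tunings from the table above to one data disk
# (placeholder name sdb). Run once per data disk on every server.
set -eu
DISK=sdb
run() { echo "$@"; }   # dry-run wrapper; drop the echo to apply for real

run sysctl vm.swappiness=0
run 'echo 4194303 > /proc/sys/kernel/pid_max'
run /sbin/blockdev --setra 8192 "/dev/$DISK"
run "echo deadline > /sys/block/$DISK/queue/scheduler"   # use noop for SSDs
run "echo 512 > /sys/block/$DISK/queue/nr_requests"
```

These sysfs writes do not survive a reboot; persist them via sysctl.conf and a udev rule or rc.local as appropriate for the distribution.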
Network performance tuning
Tune to a sensible degree according to the specific NIC model
- irqbalance: disable the system interrupt-balancing service; pin NIC interrupts to cores
- rx_buff
- ring_buff: increase NIC throughput
  - Check the current buffer sizes: ethtool -g <NIC>
  - Resize the buffers: ethtool -G <NIC> rx 4096 tx 4096
- Enable the NIC's LRO feature
Steps for pinning NIC interrupts to cores
# Find which NUMA node the NIC belongs to
cat /sys/class/net/<interface>/device/numa_node
# Use lscpu to see which CPU cores belong to that NUMA node
# Find the NIC's interrupt numbers
cat /proc/interrupts | grep <interface> | awk -F ':' '{print $1}'
# Bind the soft interrupts to the cores of that NUMA node
echo <core id> > /proc/irq/<irq number>/smp_affinity_list
Note: it is best to disable NUMA in the BIOS from the start, which avoids unnecessary core-pinning work!
Ceph tuning
Ceph configuration tuning
ceph.conf
[global] # global settings
fsid = xxxxxxxxxxxxxxx # cluster ID
mon host = 10.0.1.1,10.0.1.2,10.0.1.3 # monitor IP addresses
auth cluster required = cephx # cluster authentication
auth service required = cephx # service authentication
auth client required = cephx # client authentication
osd pool default size = 3 # default number of replicas (default 3)
osd pool default min size = 1 # a degraded PG can still serve IO; min_size is the minimum number of replicas a PG needs to accept IO
public network = 10.0.1.0/24 # public network (monitor IP range)
cluster network = 10.0.2.0/24 # cluster network
max open files = 131072 # default 0; if set, Ceph sets the system's max open fds
mon initial members = node1, node2, node3 # initial monitors (as given when creating the monitors)
##############################################################
[mon]
mon data = /var/lib/ceph/mon/ceph-$id
mon clock drift allowed = 1 # default 0.05; allowed clock drift between monitors
mon osd min down reporters = 13 # default 1; minimum number of OSDs reporting a peer down before the monitor accepts it
mon osd down out interval = 600 # default 300; seconds Ceph waits before marking a down OSD as out
##############################################################
[osd]
osd data = /var/lib/ceph/osd/ceph-$id
osd journal size = 20000 #默认5120 #osd journal大小
osd journal = /var/lib/ceph/osd/$cluster-$id/journal #osd journal 位置
osd mkfs type = xfs #格式化系统类型
osd max write size = 512 #默认值90 #OSD一次可写入的最大值(MB)
osd client message size cap = 2147483648 #默认值100 #客户端允许在内存中的最大数据(bytes)
osd deep scrub stride = 131072 #默认值524288 #在Deep Scrub时候允许读取的字节数(bytes)
osd op threads = 16 #默认值2 #并发文件系统操作数
osd disk threads = 4 #默认值1 #OSD密集型操作例如恢复和Scrubbing时的线程
osd map cache size = 1024 #默认值500 #保留OSD Map的缓存(MB)
osd map cache bl size = 128 #默认值50 #OSD进程在内存中的OSD Map缓存(MB)
osd mount options xfs = "rw,noexec,nodev,noatime,nodiratime,nobarrier" #默认值rw,noatime,inode64 #Ceph OSD xfs Mount选项
osd recovery op priority = 2 #默认值10 #恢复操作优先级,取值1-63,值越高占用资源越高
osd recovery max active = 10 #默认值15 #同一时间内活跃的恢复请求数
osd max backfills = 4 #默认值10 #一个OSD允许的最大backfills数
osd min pg log entries = 30000 #默认值3000 #修建PGLog是保留的最大PGLog数
osd max pg log entries = 100000 #默认值10000 #修建PGLog是保留的最大PGLog数
osd mon heartbeat interval = 40 #默认值30 #OSD ping一个monitor的时间间隔(默认30s)
ms dispatch throttle bytes = 1048576000 #默认值 104857600 #等待派遣的最大消息数
objecter inflight ops = 819200 #默认值1024 #客户端流控,允许的最大未发送io请求数,超过阀值会堵塞应用io,为0表示不受限
osd op log threshold = 50 #默认值5 #一次显示多少操作的log
osd crush chooseleaf type = 0 #默认值为1 #CRUSH规则用到chooseleaf时的bucket的类型
filestore xattr use omap = true #默认false#为XATTRS使用object map,EXT4文件系统时使用,XFS或者btrfs也可以使用
filestore min sync interval = 10 #默认0.1#从日志到数据盘最小同步间隔(seconds)
filestore max sync interval = 15 #默认5#从日志到数据盘最大同步间隔(seconds)
filestore queue max ops = 25000 #默认500#数据盘最大接受的操作数
filestore queue max bytes = 1048576000 #默认100 #数据盘一次操作最大字节数(bytes
filestore queue committing max ops = 50000 #默认500 #数据盘能够commit的操作数
filestore queue committing max bytes = 10485760000 #默认100 #数据盘能够commit的最大字节数(bytes)
filestore split multiple = 8 #默认值2 #前一个子目录分裂成子目录中的文件的最大数量
filestore merge threshold = 40 #默认值10 #前一个子类目录中的文件合并到父类的最小数量
filestore fd cache size = 1024 #默认值128 #对象文件句柄缓存大小
filestore op threads = 32 #默认值2 #并发文件系统操作数
journal max write bytes = 1073714824 #默认值1048560 #journal一次性写入的最大字节数(bytes)
journal max write entries = 10000 #默认值100 #journal一次性写入的最大记录数
journal queue max ops = 50000 #默认值50 #journal一次性最大在队列中的操作数
journal queue max bytes = 10485760000 #默认值33554432 #journal一次性最大在队列中的字节数(bytes)
##############################################################
[client]
rbd cache = true # default true; enable the RBD cache
rbd cache size = 335544320 # default 33554432; RBD cache size (bytes)
rbd cache max dirty = 134217728 # default 25165824; maximum dirty bytes allowed in write-back mode; 0 means write-through
rbd cache max dirty age = 30 # default 1; how long dirty data may sit in the cache before being flushed to disk (seconds)
rbd cache writethrough until flush = false # default true; compatibility option for virtio drivers before linux-2.6.32, which never send flush requests, so data would otherwise never be written back
# with this set, librbd performs IO in writethrough mode until the first flush request arrives, then switches to writeback
rbd cache max dirty object = 2 # default 0; maximum number of objects; 0 means computed from rbd cache size. librbd logically splits a disk image into 4 MB chunks
# each chunk is abstracted as an Object; librbd manages the cache per Object, and raising this value can improve performance
rbd cache target dirty = 235544320 # default 16777216; dirty-data size at which write-back starts; must not exceed rbd_cache_max_dirty
PG distribution tuning
Parameter | Description | Tuning advice |
---|---|---|
pg_num | Total PGs = (Total_number_of_OSDs * 100) / max_replication_count, rounded up to the nearest power of 2. | Default: 8. Reboot required: no. Symptom: too few PGs triggers a warning. Advice: use the value from the formula |
pgp_num | Set pgp_num equal to pg_num. | Default: 8. Reboot required: no. pgp_num should match pg_num. Advice: use the value from the formula |
ceph balancer mode | Enable the balancer plugin and set its mode to "upmap". | Default: none. Reboot required: no. Symptom: with unbalanced PGs, individual OSDs get overloaded and become bottlenecks. Advice: upmap |
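Enabling the upmap balancer from the table above usually looks like the following sketch (commands echoed rather than executed; note that upmap requires all clients to be Luminous or newer):

```shell
#!/bin/sh
# Sketch: enable the mgr balancer module in upmap mode (dry-run: echoed).
set -eu
run() { echo "$@"; }   # drop the echo to apply for real

run ceph mgr module enable balancer
run ceph osd set-require-min-compat-client luminous   # upmap prerequisite
run ceph balancer mode upmap
run ceph balancer on
```

`ceph balancer status` then shows whether the balancer is active and which mode it is using.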
OSD core pinning
If NUMA is disabled in the BIOS, this step can be skipped (reference: CSDN blog on disabling a NUMA node on CentOS 7)
Pinning methods:
- numactl
- cgroup
- taskset
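The first and last methods above can be sketched as follows. The OSD PID (12345), core list (0-7), and OSD id (0) are placeholders, and commands are echoed rather than executed:

```shell
#!/bin/sh
# Sketch: pin an OSD daemon to specific cores / a NUMA node (dry-run: echoed).
set -eu
OSD_PID=12345   # placeholder: PID of a running ceph-osd
CORES=0-7       # placeholder: cores of the NUMA node local to the OSD's disk/NIC
run() { echo "$@"; }   # drop the echo to apply for real

run taskset -cp "$CORES" "$OSD_PID"                    # pin an already-running OSD
run numactl --cpunodebind=0 --membind=0 ceph-osd -i 0  # or start one bound to NUMA node 0
```

Pinning only helps when the cores chosen are on the same NUMA node as the OSD's disk controller and NIC, which is why disabling NUMA in the BIOS makes this step unnecessary.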
Bcache enablement tuning
Bcache is the Linux kernel block-layer cache; it uses SSDs as a cache for HDDs to accelerate them. Enabling the Bcache kernel module requires recompiling the kernel.
zlib hardware-acceleration tuning
Hardware Selection Scheme 6
Compute nodes: latest CentOS 7; storage nodes: latest CentOS 8
Recommended hardware list
Compute nodes: R630, dual CPU, 1 system disk
Storage-node memory scales with capacity: 2 GB RAM per TB + 16 GB cache, 1 system disk
Storage nodes, 3.5-inch (SAS/SATA): R730/R720 (single CPU for SATA; dual CPU for multi-NVMe high performance)
Storage nodes, 2.5-inch (SATA/U.2): R630 (10 bays: SATA x 6, U.2 x 4)
Recommended disks
HDD (enterprise): Seagate or WD HDD
SATA SSD (enterprise): Intel S3610/S3710 (MLC), Samsung PM883, Micron 5200 MAX
High-performance U.2 (usable as data disks): Intel P4600/P4500/P3700/P3600/P3500, Samsung PM963
High-performance PCIe NVMe (usable as cache disks): Optane 905P/900P, Intel P4700/P3700, Samsung 1725a/1725b
NIC: Mellanox ConnectX-3 (RDMA-capable)
Compute nodes
Cloud platform; provide compute resources to customer virtual machines
Node count: 3
Hardware | Configuration |
---|---|
Server | Dell R730 |
CPU | E5-2650 v3 * 2 |
Memory | 128GB |
Disks | System disks (RAID1) + backend Ceph distributed storage |
NIC | 40G IB NIC (storage/management network) + bonded dual onboard GbE (egress network) |
OS | Latest CentOS 7 |
Kernel | default kernel |
Storage nodes
Connected to the cloud platform; provide distributed storage
Node count: 3
Hardware | Configuration |
---|---|
Server | Dell R630 |
CPU | E5-2609 v3 * 2 |
Memory | 128GB |
Disks | System disks (RAID1) + SSD (NVMe/PCI-E/SATA3) + HDD (SAS), 2.5-inch |
NIC | 40G IB NIC * 2 (front-end management network + back-end storage network) + public IP * 1 |
OS | Latest CentOS 7 |
Kernel | default kernel |
Ceph | v14.2.10 (nautilus) OSD+MON+MDS+MGR |
Three nodes, three replicas
Solution 1: BlueStore BlueFS (block_db + block_wal) + HDD data disks
Intel DC S3700/S3710 * 1 (400G) [¥3000] + 5 * HDD (2T) [¥2000]
Total: 10T
Solution 2: SSD cache pool + BlueStore BlueFS (block_db + block_wal) + HDD data disks as volume pool
Intel DC S4600/S4500 * 2 (480G) [¥2000] + Intel DC S3700/S3710 (200G) [¥3000] * 1 + 3 * HDD (2T) [¥2000]
Total: cache (1T) + volume (6T)
Solution 3: Pure HDD data disks
Dell SAS HDD (2T) * 6 [¥2000]
Total: 12T
Solution 4: Pure SSD data disks
Intel DC S4600/S4500 * 6 (480G) [¥2000]
Total: 2.4T