Performance nearly doubled: a detailed look at using Boost technology to optimize Dameng database performance on the SmartX hyper-converged Xinchuang platform

Key industries are accelerating their Xinchuang transformation, gradually moving from initial pilot and edge production workloads into the "deep water zone" of carrying important workloads. Most users are not yet familiar with the performance characteristics of the Xinchuang ecosystem, so they inevitably have questions and concerns about migrating core business systems: What performance level can Xinchuang databases and Xinchuang hardware deliver? How do they perform on a virtualized or hyper-converged platform? Is there room for software and hardware optimization?

To address these questions, the SmartX solution center conducted performance testing of the Xinchuang database product Dameng DM8 on the SmartX hyper-converged Xinchuang platform (Xinchuang servers based on Kunpeng chips), and tuned the database using SmartX's proprietary Boost acceleration technology. The results show that, combined with database parameter tuning, the Dameng database running on the SmartX hyper-converged Xinchuang platform in Boost mode achieved nearly 100% performance improvement.

Key points
● This test uses BenchmarkSQL to run the TPC-C benchmark, comparing the Dameng DM8 database on bare metal servers (on SATA SSD and NVMe SSD respectively) with its performance on the SmartX hyper-converged Xinchuang platform, both before and after Boost mode optimization.
● Optimization methods in Boost mode include: BIOS parameter optimization, enabling Boost mode and RDMA network optimization, virtual machine settings optimization (including enabling the CPU exclusive function and changing the virtual disk storage policy to thick provisioning), virtual machine operating system parameter optimization (using the CPU's multi-core capability for network optimization), and database-related optimization.
● Without optimization, the Dameng database running on SmartX hyper-convergence on the Xinchuang architecture delivers 80% of the performance of a bare metal server (with SATA SSD). After tuning in Boost mode, database performance nearly doubles, reaching 1.77 times that of the bare metal server (with SATA SSD) and 88% of bare-metal NVMe performance.

1 Test environment

1.1 Dameng DM8 database

DM8 is a new-generation database independently developed by Dameng, built on the development and application experience of the DM product series and guided by the principles of open innovation, simplicity, and practicality. DM8 absorbs the strengths of advanced technical ideas and mainstream database products, integrates the advantages of distributed architecture, elastic computing, and cloud computing, and makes large-scale improvements in flexibility, ease of use, reliability, and security. Its centralized architecture can meet the needs of diverse scenarios, supporting very large-scale concurrent transaction processing as well as hybrid transaction-analysis workloads, and dynamically allocates computing resources for more fine-grained resource utilization at lower cost.

1.2 Xinchuang Hardware (Kunpeng Chip)

The Xinchuang server used in this test is KunTai R722.

The test server is equipped with Kunpeng 920 series CPU, which is currently the industry's leading ARM-based processor. The processor uses a 7nm manufacturing process, is licensed based on the ARM architecture, and is independently designed by Huawei.

A distinctive feature of the Kunpeng 920 5250 is its high core count: a single socket has 48 cores, so a 2-socket server has 96 cores in total (commonly used Intel Xeon CPUs mostly have around 20 cores). One focus of the later performance optimization is therefore how to make the most of this multi-core advantage.

The detailed configuration of the server is as follows:

 

1.3 SmartX hyper-converged Xinchuang cloud infrastructure

Zhiling Haina SmartX, with the hyper-converged software SMTX OS at its core, provides a self-developed, decoupled, production-ready hyper-converged Xinchuang cloud infrastructure portfolio, and has helped many industry users build lightweight Xinchuang cloud foundations. SMTX OS is the core software for building a hyper-converged platform: it has built-in native server virtualization (ELF) and distributed block storage (ZBS), and can be equipped with advanced features such as active-active, asynchronous replication, backup and recovery, and network and security. Combined with commodity servers from the certified compatibility list, users can quickly build a powerful and agile cloud resource pool. For an in-depth understanding, please read: An article about hyper-converged Xinchuang cloud infrastructure.

Boost mode is the high-performance mode of SMTX OS. It uses memory sharing technology to shorten the virtual machine I/O path, improving virtual machine performance and reducing I/O access latency. Boost mode is usually enabled together with an RDMA network to maximize storage performance. To learn more about how Boost mode is implemented, please read: How SPDK Vhost-user helps hyper-converged architecture improve I/O storage performance, or scan the QR code to download the e-book "SmartX Hyper-converged Technology Principles and Features Analysis Collection (including VMware comparison)".

2 Test method

This test uses BenchmarkSQL to run the TPC-C benchmark, comparing the Dameng DM8 database on bare metal servers (on SATA SSD and NVMe SSD respectively) with its performance on the SmartX hyper-converged Xinchuang platform, both before and after Boost mode optimization.

3 Test standard and reference

3.1 TPC-C test

TPC-C is an industry-recognized transaction processing performance benchmark. It is one of the standard benchmarks published by the Transaction Processing Performance Council (TPC) to test the performance of online transaction processing (OLTP) systems. The TPC-C test is based on a virtual online order processing application, which includes a series of transaction operations, such as customer orders, inventory management, delivery processing, etc. TPC-C test results are measured in transactions per minute (TPM).

BenchmarkSQL is a tool that can run benchmark tests using the TPC-C test specification. Specifically, BenchmarkSQL can use the transaction operations and data structures defined in the TPC-C test specification to simulate a TPC-C test environment and perform performance tests on the database system. Therefore, BenchmarkSQL can be considered as an implementation of TPC-C testing.

This test uses BenchmarkSQL to perform the test based on the TPC-C benchmark in order to more objectively evaluate the performance of the database on the hyper-converged Xinchuang platform.

The software versions used in this test are as follows:

 

3.2 Test reference

Users may have seen TPC-C results for various databases before, but most of that data comes from x86 server environments, so TPC-C performance on Xinchuang chips is less well understood. With this in mind, we first deployed the Dameng database software directly on a bare metal server (based on Kunpeng chips) and ran a set of TPC-C tests as a reference for comparison with the subsequent SmartX hyper-convergence results.

3.2.1 Performance under different storage media

Since databases are sensitive to disk I/O performance, we used two different types of SSD as storage media in this scenario and tested each separately. First, an I/O stress test (8k random read/write) was run on each SSD with the fio tool to establish its baseline I/O performance. The results are as follows:

Then, we ran the Dameng database TPC-C test (100 warehouses, 200 terminals) on the two SSDs; the results are as follows:

*Note: In the TPC-C test, the NewOrder value is taken as the test result; the same applies to all subsequent results.

The two sets of data were measured on the same server, and the following conclusion can be drawn: TPC-C results grow with storage I/O capability, but not proportionally (the NVMe SSD's write capability is 340% higher than the SATA SSD's, yet the TPC-C result improves by only about 102%).
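This sub-linear relationship can be made concrete with a quick calculation from the quoted percentages (a sketch using only the improvement figures stated above; absolute IOPS and TPM values are not reproduced here):

```python
# Relative gains quoted in the text, expressed as multipliers over the
# SATA SSD baseline (baseline = 1.0).
nvme_write_gain = 1.0 + 3.40   # write capability +340% -> 4.4x
nvme_tpcc_gain = 1.0 + 1.02    # TPC-C result +102%     -> ~2.02x

# Ratio of benchmark gain to raw I/O gain: well below 1.0, showing that
# TPC-C throughput does not scale linearly with storage write performance.
scaling_efficiency = nvme_tpcc_gain / nvme_write_gain

print(f"I/O gain: {nvme_write_gain:.2f}x, TPC-C gain: {nvme_tpcc_gain:.2f}x")
print(f"TPC-C gain per unit of I/O gain: {scaling_efficiency:.2f}")
```

The gap suggests that once storage is fast enough, other resources (CPU, memory, network) begin to limit TPC-C throughput, which is consistent with the tuning focus in the later sections.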

3.2.2 Impact of CPU NUMA Group on performance

The tests are divided into two groups:

  • Group A: The database program is bound to 2 NUMA groups (48 cores) of the same CPU through the numactl command.
  • Group B: The database is not bound to the CPU and utilizes all CPU cores (96 cores) on the server.

The test results are as follows:

The results were somewhat unexpected: Group A (48 cores) outperformed Group B (96 cores). In general, more CPU cores should benefit database performance, but two factors affected the results in this test:

  • The Dameng database worker-thread parameter supports a maximum of 64 (officially, the number of worker threads should equal the number of CPU cores), so all 96 CPU cores cannot be fully utilized.
  • The database spans CPU NUMA groups, which reduces memory access efficiency.

Taking into account the characteristics of the database and the impact of NUMA, the virtual machines in the subsequent hyper-converged platform tests are configured with 48 vCPUs (pinned to the same physical CPU).

4 Test process

4.1 Test conditions

4.1.1 Virtual machine resource configuration

4.1.2 TPC-C test set

  • Adjust the terminal value to verify database performance under different concurrent access pressures. A total of 8 terminal settings, ranging from 100 to 800, were tested.
  • Adjust the warehouse value to verify database performance under different data set sizes. A total of 3 warehouse settings, ranging from 100 to 300, were tested. Combined with the 8 terminal settings above, this yields 24 test combinations in total.
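The test matrix described above can be enumerated explicitly (a sketch; the step sizes of 100 for both terminals and warehouses are inferred from the ranges and set counts given in the text):

```python
from itertools import product

# Terminal counts: 8 settings from 100 to 800 (step inferred as 100).
terminals = range(100, 801, 100)
# Warehouse counts: 3 settings from 100 to 300 (step inferred as 100).
warehouses = range(100, 301, 100)

# Full test matrix: every (warehouses, terminals) combination.
matrix = [(w, t) for w, t in product(warehouses, terminals)]

print(f"{len(matrix)} test combinations")  # 3 x 8 = 24
```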

4.2 Test 1: Performance of SmartX hyper-converged running Dameng database without any optimization

The maximum TPC-C NewOrder value occurred at 100 warehouses and 300 terminals, with 90,592 new orders (NewOrder) completed per minute. Without any optimization, database performance is not ideal: 80% of the bare metal server's performance (with SATA SSD).
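From the two figures above, the implied bare-metal baseline can be back-calculated (a sketch derived purely from the quoted numbers; the actual measured baseline may differ slightly due to rounding in the stated percentages):

```python
unoptimized_tpm = 90592        # NewOrder per minute, unoptimized HCI result
ratio_to_bare_metal = 0.80     # stated: 80% of bare metal (SATA SSD)

# Implied bare-metal SATA SSD baseline.
bare_metal_sata_tpm = unoptimized_tpm / ratio_to_bare_metal

# The conclusion section reports 1.77x the SATA baseline after Boost tuning.
optimized_tpm_estimate = bare_metal_sata_tpm * 1.77

print(f"Implied SATA bare-metal baseline: {bare_metal_sata_tpm:,.0f} TPM")
print(f"Implied optimized result:         {optimized_tpm_estimate:,.0f} TPM")
```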

4.3 Test 2: Performance of SmartX hyper-converged running Dameng database after tuning in Boost mode

4.3.1 Optimization methods in SMTX OS Boost mode

The following shows how to improve the Dameng database's TPC-C performance in SMTX OS Boost mode.

1) BIOS parameter optimization

Before enabling Boost mode, switch the power policy in the server BIOS from "Energy Saving Mode" to "Performance Mode" to ensure the server runs at maximum performance.

2) Enable Boost mode and RDMA network optimization:
  • In Step 1 of deploying the SMTX OS cluster (cluster setup), select the Enable Boost mode check box.
  • In Step 5 of deploying the SMTX OS cluster (network configuration), when creating the virtual distributed switch for the storage network, click the Enable RDMA button to enable RDMA for the cluster.
3) Optimization of virtual machine settings
  • Enable CPU exclusive function

When creating the database virtual machine, check the CPU exclusive function. The platform then automatically binds the virtual machine's vCPUs to a NUMA node, giving the virtual machine better performance.

  • The virtual disk storage policy is adjusted to thick provisioning

Changing the virtual disk hosting the database from the default thin provisioning to thick provisioning slightly improves I/O performance and reduces CPU usage.

4) Optimization of virtual machine operating system parameters
  • Leverage CPU multi-core features for network optimization 

Since the TPC-C test initiates requests from a BenchmarkSQL virtual machine outside the SMTX OS cluster and stresses the database over the network, network optimization is essential to fully realize the benefits of Boost mode. Leveraging the Kunpeng CPU's many cores, network queue and interrupt work can be assigned to different CPU cores, reducing resource contention and effectively improving network transmission performance.

Method 1: Specify the CPU core for the network card queue

a. Use ls /sys/class/net/enp1s0/queues/ to check the NIC queue layout:

In this test environment, the NIC has 4 receive queues and 4 transmit queues; the actual number depends on the environment.

b. Specify CPU cores for multiple network card queues respectively. The command is as follows:

echo 1 > /sys/class/net/enp1s0/queues/rx-0/rps_cpus
echo 2 > /sys/class/net/enp1s0/queues/rx-1/rps_cpus
echo 4 > /sys/class/net/enp1s0/queues/rx-2/rps_cpus
echo 8 > /sys/class/net/enp1s0/queues/rx-3/rps_cpus
echo 10 > /sys/class/net/enp1s0/queues/tx-0/xps_cpus
echo 20 > /sys/class/net/enp1s0/queues/tx-1/xps_cpus
echo 40 > /sys/class/net/enp1s0/queues/tx-2/xps_cpus
echo 80 > /sys/class/net/enp1s0/queues/tx-3/xps_cpus

Here, echo 1 > /sys/class/net/enp1s0/queues/rx-0/rps_cpus binds queue rx-0 to CPU 0. These files take hexadecimal CPU bitmasks in which bit n selects CPU n, so CPUs 0, 1, 2, and 3 correspond to the values 1 (2^0), 2 (2^1), 4 (2^2), and 8 (2^3), and CPUs 4 through 7 to the hexadecimal values 10, 20, 40, and 80.
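Because the rps_cpus/xps_cpus files expect hexadecimal CPU bitmasks (bit n selects CPU n), a small helper makes the mapping explicit (a sketch; the interface name enp1s0 is the one from this test environment):

```python
def cpu_mask(cpu: int) -> str:
    """Hex bitmask selecting a single CPU for rps_cpus / xps_cpus.

    Bit n of the mask corresponds to CPU n, so the value is 2**cpu,
    written in hexadecimal as the kernel expects.
    """
    return format(1 << cpu, "x")

# Receive queues rx-0..rx-3 on CPUs 0..3, transmit queues tx-0..tx-3 on CPUs 4..7.
for q in range(4):
    print(f"echo {cpu_mask(q)} > /sys/class/net/enp1s0/queues/rx-{q}/rps_cpus")
for q in range(4):
    print(f"echo {cpu_mask(q + 4)} > /sys/class/net/enp1s0/queues/tx-{q}/xps_cpus")
```

Note that for CPUs 0 through 3 the decimal and hexadecimal representations coincide (1, 2, 4, 8), which is why the low-numbered masks look like plain decimal values.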

Method 2: Specify the CPU core for the network card interrupt

a. Use the following command to check the network card interruption status:

cat /proc/interrupts | grep virtio0 | cut -f 1 -d ":"

b. Modify the configuration file so that the irqbalance service no longer schedules these interrupts.

Modify the file through vim /etc/sysconfig/irqbalance and change the following parameters to:

IRQBALANCE_ARGS=--banirq=91-99

c. Manually allocate CPU cores to each network card interrupt, as follows:

echo 40 > /proc/irq/91/smp_affinity_list
echo 41 > /proc/irq/92/smp_affinity_list
echo 42 > /proc/irq/93/smp_affinity_list
echo 43 > /proc/irq/94/smp_affinity_list
echo 44 > /proc/irq/95/smp_affinity_list
echo 45 > /proc/irq/96/smp_affinity_list
echo 46 > /proc/irq/97/smp_affinity_list
echo 47 > /proc/irq/98/smp_affinity_list
echo 48 > /proc/irq/99/smp_affinity_list
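The nine assignments above follow a simple one-to-one pattern: IRQs 91 through 99 map onto CPU cores 40 through 48. A short script along these lines could generate the same commands (a sketch; the IRQ numbers come from the /proc/interrupts output of this particular environment and will differ elsewhere):

```python
first_irq, last_irq = 91, 99   # virtio0 IRQs found in /proc/interrupts
first_core = 40                # start of the core range dedicated to IRQs

# One dedicated core per interrupt, in order: IRQ 91 -> core 40, and so on.
for offset, irq in enumerate(range(first_irq, last_irq + 1)):
    core = first_core + offset
    print(f"echo {core} > /proc/irq/{irq}/smp_affinity_list")
```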

Applying the two network optimizations above significantly improves network performance in the TPC-C test: the peak transmit rate increases by up to 17.6%, and the peak receive rate by up to 27.1%.

5) Database related optimization
  • Adjust database log parameters to give full play to I/O concurrency capabilities

Dameng DM8 uses 2 database log files (logfiles) by default. Since SMTX OS gains stronger I/O concurrency once Boost mode is enabled, increasing the number of log files lets the database fully exploit the storage's concurrent performance. In the test, increasing the number of log files from 2 to 8 significantly improved performance in all scenarios, as shown below:

 

After adding log files, the improvement in the 100 warehouse scenario ranges from 21% to 35% (as shown in the figure). In the 300 warehouse scenario, the maximum improvement is 47% (test data available, chart not shown).

  • Adjust DM8 database memory cache area parameters to optimize cache hit rate

Since the virtual machine hosting the database has 96GB of memory, the memory pool parameter and memory target parameter are set to 90GB (6GB is reserved for the operating system). These parameters can be adjusted by editing the database parameter file /dm8/data/DAMENG/dm.ini.

MEMORY_POOL = 90000      #Memory Pool Size In Megabyte
MEMORY_TARGET = 90000    #Memory Share Pool Target Size In Megabyte

There are four types of data buffers in the DM8 database, namely NORMAL, KEEP, FAST and RECYCLE.

The BUFFER parameter corresponding to the NORMAL buffer is recommended to be as large as possible to ensure a high hit rate (above 90%). In this test, adjust the BUFFER buffer size to 70GB and the number of BUFFER_POOLS to 48 (keep it consistent with the number of CPU cores).

BUFFER = 70000           #Initial System Buffer Size In Megabytes
BUFFER_POOLS = 48        #Number of buffer pools

In addition, the RECYCLE cache area is used by the temporary table space, so related parameters must also be adjusted. Here, adjust the RECYCLE buffer size to 12GB, and the number of RECYCLE_POOLS to 48 (keeping it consistent with the number of CPU cores).

RECYCLE = 12000          #System RECYCLE buffer size in Megabytes
RECYCLE_POOLS = 48       #Number of recycle buffer pools

Finally, you need to adjust the working thread of the database according to the number of CPU cores. Here, adjust the working thread to 48 (keeping it consistent with the number of CPU cores).

WORKER_THREADS = 48      #Number Of Worker Threads

*Note: After modifying the dm.ini file parameters, the database must be restarted to take effect.
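The sizing rules used above (reserve about 6GB for the operating system, keep pool and thread counts equal to the vCPU count) can be expressed as a small helper that emits the corresponding dm.ini lines (a sketch; the 70GB/12GB split between BUFFER and RECYCLE reflects the values chosen in this test, not an official Dameng formula):

```python
def dm_memory_params(vm_mem_gb: int, cpu_cores: int,
                     os_reserve_gb: int = 6) -> dict:
    """Derive the dm.ini memory parameters used in this test (values in MB)."""
    usable_mb = (vm_mem_gb - os_reserve_gb) * 1000  # test treats 1 GB as 1000 MB
    return {
        "MEMORY_POOL": usable_mb,
        "MEMORY_TARGET": usable_mb,
        "BUFFER": 70000,            # NORMAL buffer, sized for a >90% hit rate
        "BUFFER_POOLS": cpu_cores,  # keep consistent with CPU core count
        "RECYCLE": 12000,           # cache for the temporary tablespace
        "RECYCLE_POOLS": cpu_cores,
        "WORKER_THREADS": cpu_cores,
    }

# Emit the dm.ini lines for this test's 96 GB / 48 vCPU virtual machine.
for key, value in dm_memory_params(96, 48).items():
    print(f"{key} = {value}")
```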

  • Database program sets NUMA binding 

The DM8 database program can improve memory access efficiency by binding NUMA to restrict the program to the same physical CPU, thereby improving database performance.

a. Log in to the SMTX OS node (the node where the database virtual machine is located) via ssh, and execute sudo virsh list to view the ID number of the virtual machine.

b. Execute sudo virsh vcpuinfo 1 based on the virtual machine ID to check the correspondence between the vCPU core and the physical CPU core.

c. Run sudo numactl --hardware to view the NUMA affinity relationship.

d. Start the database through the numactl command to bind NUMA:

numactl -C 0-16,17-40,41-47 sh DmServiceDMSERVER start

After completing all the above optimization operations, re-execute the TPC-C test and compare it with the test data before optimization.

4.3.2 Significant performance improvement after optimization in SMTX OS Boost mode

After enabling the SMTX OS Boost mode and related optimization settings, the performance of the database is significantly improved in each test scenario, almost doubled. Detailed data is as follows:

1) 100 warehouse scenarios

 

2) 200 warehouse scenarios

 

3) 300 warehouse scenarios

 

5 Test conclusion

Through Boost mode and related optimization, running Dameng Database on the SmartX hyper-converged Xinchuang platform can obtain the following benefits:

  • Performance is 1.77 times that of a bare metal server (with SATA SSD), and close to that of a bare metal server with NVMe SSD, reaching 87.6% of NVMe bare-disk performance.
  • SMTX OS provides 2-copy data redundancy protection (a bare disk performs well but offers no data redundancy protection).
  • SMTX OS occupies only 50% of a single server host's CPU and memory resources, meaning the remaining resources can run additional services and effectively improve resource utilization.

 

*Full configuration: the database uses all CPU cores and memory of a single server: 96 cores, 256GB memory.
*Half configuration: the database uses part of a single server's CPU cores and memory: 48 cores, 96GB memory.

This test not only shows readers the real performance of a Xinchuang database on the hyper-converged Xinchuang platform, but also verifies the optimization effect of SmartX hyper-converged Boost mode on the database. To learn more about SmartX hyper-convergence in database scenarios, please read: SmartX hyper-convergence financial industry database support evaluation collection and long-term implementation case review.

Postscript

Judging from the test results, the SmartX hyper-converged platform can significantly improve Dameng database TPC-C performance through its outstanding I/O capability and targeted optimization. Since the test model above simulates a production scenario, the database parameters ensure that I/O actually reaches the storage medium. You may then wonder: could TPC-C results be improved further by relying on the memory cache, without flushing to disk?

The answer is yes. On one hand, database parameters can be adjusted to reduce the database's disk I/O load, while the database virtual machine's memory is expanded so that a large cache accelerates database response. On the other hand, in the original test model the external stress virtual machine sends requests over the Gigabit network before they reach the database virtual machine, passing through multiple hops: the stress machine's virtual NIC → virtual switch → physical NIC → physical switch → physical NIC → virtual switch → the database machine's virtual NIC. This transmission path incurs some performance loss. To simulate removing the network entirely, we ran an additional reference test: the stress program was installed locally on the database virtual machine, so requests were generated and processed entirely inside the virtual machine without traversing the network.

After the above series of changes, we performed the TPC-C test again, and the results are as follows:

When warehouse = 100, the tpmC (NewOrder) values of the TPC-C test under different concurrency scenarios:

When warehouse = 200, the tpmC (NewOrder) values of the TPC-C test under different concurrency scenarios:

When warehouse = 300, the tpmC (NewOrder) values of the TPC-C test under different concurrency scenarios:

The results show a significant TPC-C improvement, reaching a maximum of 468,113 TPM in the 100 warehouse / 100 terminal scenario. However, in this configuration a large amount of data is cached in memory and I/O is not flushed promptly; a sudden power failure could leave the database inconsistent. Such a configuration is therefore rarely used in production (unless the database is read-only), and these results are for reference only.

Download "SmartX Hyper-Converged Technology Principles and Feature Analysis Collection (Including VMware Comparison)" to learn more about how SmartX improves infrastructure performance and reliability through technological innovation.

Origin blog.csdn.net/weixin_43696211/article/details/131890407