How to use OWL, the SmartX storage performance testing tool, to optimize performance management

Author: Zhang Ruisong of SmartX Financial Team who is deeply involved in the industry

In day-to-day cluster management, operation and maintenance personnel inevitably run into questions such as the following:

  • A new business is ready to go online. When several storage systems are available, which one should host it?

  • A business virtual machine runs fast when it is first launched. Why does it slow down after running for a while, even though neither the software nor the hardware shows any obvious problem?

  • The business team reports that virtual machine performance is sometimes good and sometimes bad. What is going on?

  • Newly deployed storage has a different configuration from the existing storage. How can the performance difference between the two be measured, and which kinds of business is each suited to run?

These scenarios all involve storage performance monitoring. They also test how well operation and maintenance personnel can use monitoring data to plan business placement and optimize performance.

To help users manage cluster storage performance better, SmartX developed OWL, an automated storage performance testing tool. This article introduces the features and usage of OWL and, through practical cases, shows how to use OWL test results to optimize performance management and avoid performance bottlenecks.

Introduction to OWL

OWL is an automated storage performance testing web platform developed by SmartX. It uses fio as its performance collection tool to stress-test cluster performance. Because fio can be tuned for multi-queue, multi-bandwidth, and multi-I/O-model test scenarios, it can simulate most business I/O (for example, fio is often used for performance testing and tuning of MySQL), which makes it a good choice as the engine behind OWL. In addition, OWL is not tied to SmartX hyper-converged clusters: users can also run it in other environments for performance testing.
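To make the idea of an "I/O model" concrete, here is a minimal sketch of how one model could be turned into a fio command line. It is not how OWL invokes fio internally (the article does not say); the `IoModel` class, its field names, and the `/dev/vdb` target are hypothetical. Only the fio options themselves are standard.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IoModel:
    """One I/O model to simulate (hypothetical structure, not OWL's own)."""
    name: str
    rw: str          # fio I/O pattern: read, write, randread, randwrite, randrw
    block_size: str  # for example "4k" or "256k"
    iodepth: int     # queue depth per job
    numjobs: int     # number of parallel jobs
    runtime_s: int   # test duration in seconds


def build_fio_command(model: IoModel, target: str) -> List[str]:
    """Return a fio command line for the given I/O model and target device or file."""
    return [
        "fio",
        f"--name={model.name}",
        f"--filename={target}",
        f"--rw={model.rw}",
        f"--bs={model.block_size}",
        f"--iodepth={model.iodepth}",
        f"--numjobs={model.numjobs}",
        f"--runtime={model.runtime_s}",
        "--time_based",          # keep running until the runtime elapses
        "--ioengine=libaio",     # asynchronous I/O engine on Linux
        "--direct=1",            # bypass the page cache
        "--group_reporting",     # aggregate results across jobs
        "--output-format=json",  # machine-readable results
    ]


if __name__ == "__main__":
    seq_write = IoModel("seq-write-256k", "write", "256k", 32, 1, 300)
    # /dev/vdb is a placeholder for the test VM's data disk.
    print(" ".join(build_fio_command(seq_write, "/dev/vdb")))
```

Changing `rw`, `block_size`, and `iodepth` is what lets one tool cover sequential bandwidth tests and small-block random tests alike.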

OWL can help users in the following three areas:

Adapt to different hardware configurations and provide each set of storage with its own storage performance "baseline"

To meet the needs of the Xinchuang (IT application innovation) transformation of IT infrastructure, users may purchase domestic components they have never worked with before. With many possible combinations of components, engineers need to understand how much performance storage built on these new configurations can deliver and which applications and databases it can support. The traditional verification method is to trial-run a business virtual machine directly on the new architecture. With OWL, users can instead simulate a similar I/O model to verify cluster performance and thereby establish the performance baseline of the cluster's storage.

Refer to the storage performance baseline and launch business virtual machines by category

Based on the storage performance baseline provided by OWL, users can select an appropriate storage cluster for each business virtual machine that is about to go online. For example, database services with high IOPS demands can be placed on all-flash clusters, while businesses with relatively light IOPS and little data exchange can be placed on more cost-effective hybrid-flash clusters.

Besides letting users know in advance the maximum I/O each host can carry, OWL can also build a simulation environment to help users understand the I/O scale a business may need before it goes online, and to plan virtual machine placement accordingly. This avoids placing several bandwidth-heavy virtual machines on one host, which would leave the host's bandwidth struggling to keep up with the business after it officially launches.

Combined with the alarm function, proactively warn of performance bottleneck risks

After using OWL to obtain the performance baseline, users can set read and write bandwidth thresholds that correspond to the storage performance in the alarm rules of each cluster. When a virtual machine's bandwidth reaches 70% and then 80% of the host bandwidth, the operation and maintenance engineer receives an alarm at each level and can promptly check the bandwidth usage of that virtual machine and of the other hosts. Users can then migrate the virtual machine to a relatively idle host or cluster before the new business is launched.
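The two-level check described above can be sketched as a small function. This is an illustration only, not the SmartX alarm implementation: the function name, the level names "attention" and "serious", and the sample numbers are assumptions. The 70% and 80% ratios are the ones given in the text.

```python
def classify_bandwidth(current_bps: float,
                       baseline_bps: float,
                       attention_ratio: float = 0.70,
                       serious_ratio: float = 0.80) -> str:
    """Compare a VM's current bandwidth with the baseline measured by a stress test.

    Returns "serious", "attention", or "normal" (hypothetical level names).
    """
    if baseline_bps <= 0:
        raise ValueError("baseline_bps must be positive")
    usage = current_bps / baseline_bps
    if usage >= serious_ratio:
        return "serious"
    if usage >= attention_ratio:
        return "attention"
    return "normal"


if __name__ == "__main__":
    baseline = 1000.0  # hypothetical baseline, in any consistent unit
    for current in (500.0, 750.0, 850.0):
        print(current, classify_bandwidth(current, baseline))
```

Keeping the ratios as parameters makes it easy to apply different levels to read and write bandwidth if needed.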

OWL usage and testing process

Preparation before testing

Because OWL runs as a virtual machine, the user needs to import the OVF file, configure an IP address for OWL, and make sure OWL can reach the test VM over SSH. The test VM requirements are as follows:

  • Linux, 2 vCPUs, 4 GB memory, 40 GB + 50 GB disks

  • An IP address configured, with SSH connectivity to the OWL tool

  • fio installed

Test process

  1. Log in to the OWL web interface.

  2. Create a test model.

  3. Add test objects.

  4. Create test tasks.

  5. Start the test task.

  6. Combine the OWL results with the alarm function to proactively warn of performance bottleneck risks.

For the detailed testing process, please refer to the following demo: Introduction to cluster storage performance monitoring, management and automated testing tools.

Utilize test results to optimize storage performance management

Common Test Models

The following is the I/O test model we commonly use in demonstrations.

[Figure: commonly used I/O test models]

Alarm Threshold Calculation and Setting Method

After obtaining the performance baseline through the above test, users can calculate the corresponding write bandwidth and read bandwidth thresholds and add alarm rules to the cluster. Let's use the following figure as an example to show how the thresholds are calculated.

Two replicas

[Figure: test results with two replicas]

The two sets of data above were obtained in an 8-node cluster: one with a single host running 1 virtual machine (8P1V), and one with all 8 hosts each running 1 virtual machine (8P8V).

Our main concern is bandwidth. Take write bandwidth as an example: in the 8P8V 256K sequential write scenario, the write bandwidth is 7278 MBPS. We divide 7278 by 8 to get the average bandwidth per node, then convert MBPS into BPS. 70% of this value is the alarm threshold to set for the attention level.

For the serious-level write bandwidth alarm threshold, we look at the value in the 8P1V 256K scenario, where the write bandwidth is 1656.86 MBPS. After unit conversion, 80% of this value is used directly as the serious alarm threshold. This gives two write bandwidth thresholds, as shown in the figure below.

[Figure: write bandwidth alarm threshold settings]

The read bandwidth alarm thresholds are calculated in the same way as the write bandwidth thresholds. In the above example, the read bandwidth thresholds are set as shown in the figure below.

[Figure: read bandwidth alarm threshold settings]
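The threshold arithmetic in this section can be written out as a short sketch. Two points are assumptions, since the article does not state them: that BPS means bytes per second, and that 1 MB/s equals 1024 × 1024 bytes/s. The conversion factor is a parameter so that 1,000,000 can be used instead. The function names are illustrative.

```python
MIB = 1024 * 1024  # assumption: 1 MB/s in the test report = 1024 * 1024 bytes/s


def attention_threshold_bps(cluster_mbps: float,
                            node_count: int,
                            ratio: float = 0.70,
                            bytes_per_mb: int = MIB) -> float:
    """Attention level: 70% of the average per-node bandwidth in the 8P8V test."""
    if node_count <= 0:
        raise ValueError("node_count must be positive")
    per_node_mbps = cluster_mbps / node_count
    return per_node_mbps * bytes_per_mb * ratio


def serious_threshold_bps(single_vm_mbps: float,
                          ratio: float = 0.80,
                          bytes_per_mb: int = MIB) -> float:
    """Serious level: 80% of the single-VM bandwidth in the 8P1V test."""
    return single_vm_mbps * bytes_per_mb * ratio


if __name__ == "__main__":
    # Write bandwidth figures from the 256K sequential write example.
    print("attention:", round(attention_threshold_bps(7278, 8)))
    print("serious:  ", round(serious_threshold_bps(1656.86)))
```

The same two functions apply to the read bandwidth thresholds by passing in the read bandwidth values from the corresponding test rows.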

User practice

Case 1: Operation and maintenance engineers receive performance alarms in a timely manner to avoid business impact

A user ran a bandwidth stress test with OWL and found that the bandwidth of one node in the cluster exceeded 1.7 GB/s, which was above the serious-level alarm threshold. SmartX automatically sent an alarm in the background to remind operation and maintenance engineers that storage performance was close to its limit, so that direct impact on business could be avoided.

[Figure: bandwidth alarm triggered during the stress test]

Case 2: A state-owned bank uses a customized OWL I/O model to test cluster performance

To meet regulatory requirements, a state-owned bank used OWL to test cluster performance for 12 consecutive hours with a customized I/O model (48K, randrw=1:9). The test results (shown in the figure below) indicate that the average IOPS standard deviation of the cluster reached 54338, with a latency of about 1 millisecond.

[Figure: 12-hour test results for the customized I/O model]
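For a long-running test like this one, the average IOPS and its standard deviation together show how stable the cluster is. The following is a minimal sketch of how such a summary could be computed from periodic IOPS samples. The function name and the sample values are hypothetical and are not the bank's data; how OWL itself aggregates results is not described in the article.

```python
from statistics import mean, pstdev
from typing import Dict, Sequence


def summarize_iops(samples: Sequence[float]) -> Dict[str, float]:
    """Return the mean, population standard deviation, and coefficient of variation."""
    if not samples:
        raise ValueError("need at least one IOPS sample")
    avg = mean(samples)
    std = pstdev(samples)
    # The coefficient of variation (stdev / mean) makes runs of different sizes comparable.
    cv = std / avg if avg else 0.0
    return {"mean": avg, "stdev": std, "cv": cv}


if __name__ == "__main__":
    # Hypothetical per-interval IOPS samples, for illustration only.
    samples = [54000, 54500, 54200, 54800, 53900]
    print(summarize_iops(samples))
```

A small coefficient of variation over 12 hours suggests that performance did not drift or fluctuate noticeably during the run.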

Case 3: A state-owned bank uses OWL to evaluate whether cluster performance meets its 99th percentile requirements

A state-owned bank focused on its 99th percentile requirement and used OWL to test storage performance at the corresponding block size, to see directly how the cluster performs in this scenario. The test results are shown in the figure below.

[Figure: 99th percentile test results]
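As background on what a 99th percentile requirement means, here is a minimal sketch that computes a percentile with the nearest-rank method and checks it against a limit. The function names, the latency samples, and the 2 ms limit are hypothetical. fio and OWL report percentiles themselves, so this is only meant to illustrate the definition.

```python
import math
from typing import Sequence


def percentile_nearest_rank(samples: Sequence[float], pct: float) -> float:
    """Return the smallest sample such that at least pct% of samples are <= it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def meets_p99(latencies_ms: Sequence[float], limit_ms: float) -> bool:
    """True if 99% of the requests completed within limit_ms."""
    return percentile_nearest_rank(latencies_ms, 99) <= limit_ms


if __name__ == "__main__":
    # Hypothetical latency samples in milliseconds: mostly fast, one slow outlier.
    latencies = [0.8] * 99 + [12.0]
    print(percentile_nearest_rank(latencies, 99), meets_p99(latencies, 2.0))
```

Unlike an average, the 99th percentile is not pulled up by a single outlier, yet it still exposes a slow tail once more than 1% of requests are affected.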

To learn more about SmartX hyper-converged intelligent operation and maintenance features, please read: An article to understand the SmartX hyper-converged hard disk health detection mechanism and operation and maintenance practices, or scan the QR code below to obtain the "SmartX hyper-converged technology principles and feature analysis collection (including VMware comparison details)" e-book.


Origin blog.csdn.net/weixin_43696211/article/details/132599142