PTS 3.0: Next-generation performance testing service supported by observability

Author: Xiao Changjun (Qionggu)

Hello everyone, I am Xiao Changjun (nickname Qionggu) from the Alibaba Cloud Cloud Native Application Platform team. My topic today is "Next-Generation Performance Testing Service Supported by Observability". Performance testing is familiar to everyone. It has become an important means of evaluating system capacity, identifying system weaknesses, tuning the system, and verifying system stability.

The general process of a performance test is to construct data, configure scenarios, start the stress test, and collect the stress test results. However, testers know that performance testing is not that simple. We also face the following problems:

1. Before the stress test, assessing its scope of impact and accurately controlling its blast radius.

2. Monitoring the metrics of both the stress test and the business systems, and analyzing current system performance.

3. Analyzing performance bottlenecks when the stress test does not meet expectations.

4. Giving the maximum capacity the system can support, or its current performance, based on the stress test results.

Every testing team faces these problems. How can they be solved better with today's technology?

In response to these challenges, we propose observability for performance stress testing, built around the observability of the stress test link:

  • First, before running the stress test, a dial test (a small probe request) is sent, and the request is used to build the topology of the entire stress test link. The link topology gives a global view of the scope the stress test will affect.
  • Second, performance metrics become observable. The monitoring metrics involved in the stress test link are collected, and water level charts for the stress test and for each business instance are generated automatically, so that you can observe while applying pressure.
  • Third, the metrics and link events of the stress test requests are aggregated for link profiling and intelligent analysis, which makes performance bottlenecks observable.
  • Finally, using the stress test metrics mentioned above and the resource water levels of each service instance, a gradient stress test evaluation is performed to verify the service capacity of the system.

Together, these make performance stress testing observable and cover everything from running the stress test to analyzing the data.

On top of this, we built PTS 3.0, a next-generation performance stress testing service supported by observability.

The overall architecture of the PTS 3.0 platform is divided into seven parts. Looking from the bottom up, the underlying stress testing layer supports the self-developed PTS engine and is fully compatible with the open source JMeter stress testing engine. The K6 engine will be supported in the future, allowing users to smoothly migrate their existing stress test configurations to the PTS platform. Stress test metric data is written to Prometheus and to logs and is open to users for querying, and Grafana dashboards are provided so that users can process the data flexibly.

In the stress test preparation stage, PTS is connected to the Application Real-Time Monitoring Service (ARMS) and integrates various ARMS capabilities, including the application list, interface calls, database calls, containers, infrastructure, traces, and other data. This data simplifies stress test configuration and is used to build the stress test link topology.

During the stress test execution phase, the stress testing engine passes the link flag through with each request, connects it to the ARMS call chain, and the data is aggregated in a unified way through stream processing. During the stress test, Grafana shows the stress test metrics and the metrics of each instance, and the intelligent insight and call chain analysis capabilities of ARMS are used to analyze performance bottlenecks. The load can also be adjusted while the stress test is running.
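To make the idea of passing the link flag through more concrete, here is a minimal sketch of how a service could copy a stress test flag from an incoming request to its outgoing calls. The header name `x-stress-test` and both helper functions are hypothetical and only illustrate the pattern. They are not the flag or the mechanism that PTS or ARMS actually use.

```python
# Hypothetical header name, used only for illustration.
STRESS_FLAG_HEADER = "x-stress-test"


def extract_flag(inbound_headers):
    """Return the stress test flag of an incoming request, or None."""
    for name, value in inbound_headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == STRESS_FLAG_HEADER:
            return value
    return None


def outbound_headers(inbound_headers, base=None):
    """Build headers for a downstream call, carrying the flag along if present."""
    headers = dict(base or {})
    flag = extract_flag(inbound_headers)
    if flag is not None:
        headers[STRESS_FLAG_HEADER] = flag
    return headers


downstream = outbound_headers({"X-Stress-Test": "scene-42"},
                              {"accept": "application/json"})
print(downstream)
```

Because every hop forwards the flag unchanged, the monitoring side can tell stress test traffic apart from normal traffic along the whole call chain.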

After the stress test is completed, a detailed stress test report is automatically generated, providing performance baseline comparison and panoramic snapshots.

Each stage of the stress test can be driven through the natural language interaction provided by the cloud native large model, covering scenarios such as business migration to the cloud, major promotion events, and specification selection. With these capabilities, PTS 3.0 has the following features:

With the support of ARMS, stress testing is fully observable. With the support of the large language model, stress testing becomes intelligent. Open source engines are fully embraced, and stress test scripts and tasks can be hosted on the platform. The following sections introduce these features:

Stress test observability: visualizing the stress test link

PTS is connected to the ARMS OpenTelemetry service and can be used once the ARMS OpenTelemetry probe is installed, without additional configuration. Before a stress test is started, the stress test script is tried out and the link is probed through the dial test capability. This automatically and accurately identifies the components that the request passes through and builds a link topology map from the dial test request alone, without involving links on the normal request path. As a result, we can see at a glance which links the stress test passes through and clarify the scope of its impact.
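As a rough illustration of how a topology can be derived from a single dial test request, the sketch below turns the spans of one trace into service-to-service edges. The span fields (`span_id`, `parent_id`, `service`) and the sample trace are assumptions made for this example. They are not the ARMS or OpenTelemetry data format.

```python
from collections import defaultdict


def build_topology(spans):
    """Return {caller_service: set of callee_services} for one trace."""
    by_id = {span["span_id"]: span for span in spans}
    edges = defaultdict(set)
    for span in spans:
        parent = by_id.get(span.get("parent_id"))
        # Record an edge only when the call crosses a service boundary.
        if parent is not None and parent["service"] != span["service"]:
            edges[parent["service"]].add(span["service"])
    return dict(edges)


def impact_scope(spans):
    """All services touched by the request, i.e. the stress test's scope."""
    return sorted({span["service"] for span in spans})


# Hypothetical spans produced by one dial test request.
probe_trace = [
    {"span_id": "a", "parent_id": None, "service": "gateway"},
    {"span_id": "b", "parent_id": "a", "service": "order-service"},
    {"span_id": "c", "parent_id": "b", "service": "order-service"},
    {"span_id": "d", "parent_id": "c", "service": "mysql"},
    {"span_id": "e", "parent_id": "b", "service": "inventory-service"},
]

topology = build_topology(probe_trace)
scope = impact_scope(probe_trace)
```

Here `topology` maps each caller to the services it calls, and `scope` lists every component the stress test traffic would reach.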

Stress test data dashboard, full monitoring of all link indicators

PTS integrates Grafana dashboards. During the stress test, the stress test data dashboards are generated dynamically based on the stress test link, so that the metrics of the whole link are fully monitored. For example, the following monitoring dashboards are covered:

  • Business overview: monitors core business metrics, such as scenario request volume and business conversion rate.
  • Stress test dashboard: monitors stress test service metrics, such as TPS, RT, success rate, number of abnormal requests, total number of requests, and 90/95/99 percentile RT.
  • Application monitoring dashboard: covers the application monitoring metrics involved in the request link. In the application dimension, it includes the number of instances of each application, the number of requests, the number of errors, and RT.
  • Container monitoring dashboard: covers the monitoring of core components such as API Server, Node, and Pod, with metrics such as QPS, success rate, number of Pods, and resource usage.

In addition, there are dashboards for the access layer SLB, ECS instances, database instances, and so on. These dashboards show the water level and status of each instance on the stress test link. The load can be adjusted while watching them, so that the stress test reaches its best effect.
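The stress test metrics listed above can all be derived from raw request samples. The following is a minimal sketch that assumes each sample is a `(response_time_ms, success)` pair collected over a known time window, and that uses the nearest-rank method for percentiles. The function names and sample data are made up for illustration, and PTS may compute these values differently.

```python
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending, non-empty list."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(samples, window_seconds):
    """Aggregate (response_time_ms, success) samples into dashboard metrics."""
    rts = sorted(rt for rt, _ in samples)
    total = len(samples)
    succeeded = sum(1 for _, success in samples if success)
    return {
        "total_requests": total,
        "abnormal_requests": total - succeeded,
        "tps": total / window_seconds,
        "success_rate": succeeded / total,
        "avg_rt_ms": sum(rts) / total,
        "p90_rt_ms": percentile(rts, 90),
        "p95_rt_ms": percentile(rts, 95),
        "p99_rt_ms": percentile(rts, 99),
    }


# Ten hypothetical requests in a 2-second window, one of them failed.
samples = [(rt, rt != 100) for rt in range(10, 101, 10)]
report = summarize(samples, window_seconds=2)
```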

Observable performance bottlenecks: quickly locating the root cause

A common problem in performance testing is that the stress test results fall short of expectations, which requires a step-by-step analysis of the performance bottlenecks of the current system or of the entire link. PTS integrates the intelligent insight capability of ARMS to automatically detect abnormal events during the stress test. The details of an abnormal event show the interface involved, the cause of the exception, the complete exception stack, the number of occurrences, the exception rate, the time range of the exception, the call chain, and other information. Clicking Call Chain Analysis opens the call chain details and the exception analysis report. For example, in one scenario, an abnormal event of a timeout while obtaining a database connection was detected. Call chain analysis pointed out that the maximum usage rate of the database connection pool at the time of the exception was 100% (maximum number of active connections / maximum number of available connections) and suggested increasing the connection pool configuration. This capability greatly improves the efficiency of performance analysis and supports continuous performance tuning.
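The usage rate in the example above is simple arithmetic, sketched below. The function names, the sample values, and the 100% threshold are assumptions for illustration, and ARMS's own analysis takes more signals into account than this.

```python
def pool_usage_rate(active_samples, max_pool_size):
    """Maximum active connections divided by maximum available connections."""
    return max(active_samples) / max_pool_size


def pool_saturated(active_samples, max_pool_size, threshold=1.0):
    """True when the peak usage rate reaches the threshold (100% by default)."""
    return pool_usage_rate(active_samples, max_pool_size) >= threshold


# Hypothetical active-connection counts sampled during the abnormal period.
during_incident = [12, 18, 20, 20]
rate = pool_usage_rate(during_incident, max_pool_size=20)
print(f"peak pool usage: {rate:.0%}")
```

A peak rate of 100% means that requests had to wait for a free connection, which is consistent with the connection timeout in the example.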

Observable system capacity, automated capacity planning and verification

Based on the configuration and metric data above, we also plan to launch automated capacity planning and verification. Let's first look at the three stages of a gradient stress test:

First, while the resource load is low, TPS grows linearly with resource usage.

Second, as the pressure keeps increasing and the resource load becomes saturated, TPS levels off while concurrency rises, and CPU usage begins to surge.

Third, once the resource load is full and concurrency exceeds the maximum capacity the system can carry, TPS and CPU fluctuate significantly and the service becomes unavailable.

With this capacity assessment method, the expected concurrency can be configured before the stress test. Combined with the automatically identified link components mentioned above, the expected maximum resource water level threshold for each instance can be configured at the same time. The pressure is then increased gradually until the resource threshold is reached, from which the number of resource instances required for the expected traffic can be calculated for capacity planning. The pressure is then increased further until the resource load limit is reached, which gives the maximum concurrency supported by this number of resource instances for capacity evaluation.
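The two calculations described here can be sketched as follows. This is only an illustration under stated assumptions: each gradient step records concurrency, TPS, and CPU usage, capacity scales linearly with the number of instances while usage stays under the threshold (the first stage), and the capacity limit is taken as the last step before TPS stops growing by at least 5%. The names, the 5% rule, and the sample data are made up for the example and are not the planned PTS algorithm.

```python
import math
from typing import NamedTuple


class Step(NamedTuple):
    concurrency: int
    tps: float
    cpu_percent: float


def concurrency_at_threshold(steps, cpu_threshold):
    """Highest tested concurrency whose CPU stayed at or below the threshold."""
    within = [s.concurrency for s in steps if s.cpu_percent <= cpu_threshold]
    if not within:
        raise ValueError("every step exceeded the CPU threshold")
    return max(within)


def instances_needed(steps, cpu_threshold, tested_instances, expected_concurrency):
    """Capacity planning: instances required for the expected concurrency."""
    per_instance = concurrency_at_threshold(steps, cpu_threshold) / tested_instances
    return math.ceil(expected_concurrency / per_instance)


def max_concurrency(steps, min_tps_gain=0.05):
    """Capacity evaluation: last step before TPS growth falls below min_tps_gain."""
    best = steps[0]
    for prev, cur in zip(steps, steps[1:]):
        if cur.tps < prev.tps * (1 + min_tps_gain):
            break
        best = cur
    return best.concurrency


# Hypothetical gradient stress test against 2 instances.
steps = [
    Step(100, 1000, 20),
    Step(200, 2000, 40),
    Step(300, 2950, 60),
    Step(400, 3000, 85),
    Step(500, 2600, 99),
]

needed = instances_needed(steps, cpu_threshold=50, tested_instances=2,
                          expected_concurrency=1200)
limit = max_concurrency(steps)
```

With this sample data, 2 instances stay under 50% CPU up to a concurrency of 200, so 1200 expected concurrent requests need 12 instances, and TPS stops growing after a concurrency of 300.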

Generative AI support lowers the barrier to stress testing

The capabilities above come from deep integration with ARMS. They make stress testing fully observable, provide continuous monitoring and feedback, allow deeper performance analysis and optimization of performance issues, and so maximize the value of the stress test output. PTS 3.0 is also combined with Alibaba Cloud's cloud native large language model to make stress testing intelligent through natural language interaction.

Through generative AI, PTS interprets performance test instructions, creates stress test tasks, completes script debugging, and executes the stress test tasks. From a full-link perspective, you can view dynamic charts and observe the overall performance status of the application system. You can then focus on performance bottlenecks, locate the problem, and find the root cause. You can also use the large model to analyze and interpret stress test reports in depth and to produce stress test summaries.

Next is a complete demonstration video:

Host the JMeter ecosystem to maximize the value of stress testing

In addition to the capability upgrades, openness is a core product value of PTS. PTS currently supports hosting the JMeter stress testing engine, and the observability, intelligence, and other capabilities of the platform are available to it, which maximizes the value of stress testing.

A JMeter script can be uploaded directly on the PTS console page for stress testing. After the script is uploaded, the platform parses it and automatically downloads and completes the JAR packages it depends on, which reduces configuration cost for users and improves the success rate of stress tests. The JMeter stress test configuration has also been further optimized. It provides a convenient waterfall-style, top-down immersive configuration flow and separates basic configuration from optional advanced configuration, which makes the configuration easier to understand and stress test scenarios easier to set up. The observability integration described earlier is also supported in JMeter stress testing. After the stress test is completed, a report is generated automatically using the platform's stress test reporting, and performance analysis results are provided. While reusing the JMeter stress testing engine, users get a more stable, larger-scale, and more valuable stress testing experience through the platform.

PTS remains open and provides OpenAPI. The product can integrate with other products and be integrated by them, which empowers other cloud services and helps recommend instance specifications that suit each user. For example, in Function Compute, the function performance detection provided by PTS can be used to find the performance upper limit of a single instance, make concurrency configuration easier, recommend appropriate instance specifications to users, and reduce the cost of using Function Compute. In Microservices Engine (MSE), PTS supports performance testing of services such as Dubbo to discover service performance problems, and supports performance testing of the cloud native gateway to find the upper limit of gateway performance.

Currently, PTS can initiate stress tests from 22 regions around the world and supports a maximum concurrency in the millions and a maximum TPS in the tens of millions. It meets the need to start global, large-scale stress tests in real time and serves tens of thousands of enterprises around the world.

Origin my.oschina.net/u/3874284/blog/10452088