Do you really understand the SLA in performance stress testing?

About the author: Xiang Ling (Alibaba nickname) is a technical expert at Alibaba working on PTS research and development. Xiang Ling recently led the compilation and promotion of the ideas and standards of performance stress testing in the cloud era, is a member of the national standard project team for cloud computing performance testing, and is in charge of the warm-up system within the internal stability assurance system.

This article is the sixth installment of the "Performance Test Together" (PTT) series. The series breaks down the whole process of performance stress testing across multiple dimensions, including design, implementation, execution, monitoring, problem location and analysis, and application scenarios, to help you build a complete theoretical system of performance stress testing, with practical examples to follow.

This article explains how to use SLAs correctly to set capacity-preparation targets and make stress testing more efficient. It has two parts: theory and practice.

SLA is everywhere

In the era of cloud computing, more and more enterprise services are migrating to the cloud. Major cloud vendors publish SLAs for their services; Alibaba Cloud's ECS, RDS, and Redis services, for example, each have a corresponding SLA. An SLA is a formal commitment defined between the service provider and the customer.

The same applies beyond cloud vendors, to apps and websites that provide all kinds of services. If customers cannot place an order while shopping, or cannot open a short video on the weekend, the user experience suffers badly. If the fault lasts a long time, a large number of customers and a lot of business will be lost. So how do you measure the quality of the service provided to customers, and how do you measure the stability of the system? A common language is needed, and that language is the SLA. So what exactly is an SLA?

When a new system is launched, a major promotion is coming, or the architecture is undergoing major adjustments, the architecture team and developers need to prepare the system's capacity in advance and run performance stress tests against it. How are stress testing and the SLA connected during that process?

SLA definition

A service-level agreement (SLA) is a formal commitment defined between a service provider and its customers [Wikipedia definition]. For Internet companies, the SLA is a guarantee of the availability of their website services.

An SLA includes two elements, the SLI and the SLO: the SLI defines what is measured, and the SLO defines the state of service that is expected.

SLI: An SLI (Service Level Indicator) is a carefully defined measurement index. What to measure depends on the characteristics of the system, and choosing SLIs is a complicated process. When settling on specific indicators, make sure that each one accurately describes service quality and can be measured reliably.

SLO: An SLO (Service Level Objective) specifies the expected state of the functions a service provides, and contains all the information needed to describe what the service should deliver. It is generally expressed as, for example: average QPS per minute > 100k; 99% of requests with an access delay < 500 ms; per-minute bandwidth > 200 MB/s 99% of the time.
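
To make the SLI/SLO split concrete, here is a minimal sketch of how an objective such as "99% of requests with an access delay < 500 ms" could be checked against the latencies recorded in one time window. The names `Slo`, `percentile`, `measure_slis`, and `evaluate` are hypothetical and chosen for illustration; they are not part of PTS or any other product, and the nearest-rank percentile is just one common convention.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """One objective: the SLI it refers to, a comparison, and a target value."""
    sli: str
    comparator: str  # ">" or "<"
    target: float

    def is_met(self, measured: float) -> bool:
        if self.comparator == ">":
            return measured > self.target
        if self.comparator == "<":
            return measured < self.target
        raise ValueError(f"unsupported comparator: {self.comparator}")


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def measure_slis(latencies_ms, window_seconds):
    """Derive two SLIs from the request latencies recorded in one window."""
    return {
        "avg_qps": len(latencies_ms) / window_seconds,
        "p99_latency_ms": percentile(latencies_ms, 99),
    }


def evaluate(slos, slis):
    """Map each SLI name to True (objective met) or False (objective missed)."""
    return {slo.sli: slo.is_met(slis[slo.sli]) for slo in slos}
```

With `slos = [Slo("avg_qps", ">", 50), Slo("p99_latency_ms", "<", 500)]`, calling `evaluate(slos, measure_slis(latencies, window_seconds=1))` reports which objectives the window met. The QPS target of 50 is an illustrative value, not one taken from the article.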

Several best practices when setting an SLO:

  • Specify the time window used for the calculation
  • Use a consistent time window (for example, an XX-hour rolling window or a quarterly rolling window)
  • Include an exemption clause, such as: the SLO must be met 95% of the time
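
The three practices above can be sketched together as a rolling window with an exemption ratio. The class `RollingSloTracker` below is a hypothetical illustration, not an API of PTS or any monitoring product: each recorded value says whether one measurement interval (for example, one minute) met the SLO, and the exemption clause becomes the `required_ratio`.

```python
from collections import deque


class RollingSloTracker:
    """Tracks what fraction of the most recent `window` intervals met the SLO."""

    def __init__(self, window: int, required_ratio: float):
        if window <= 0:
            raise ValueError("window must be positive")
        if not 0.0 <= required_ratio <= 1.0:
            raise ValueError("required_ratio must be between 0 and 1")
        self.required_ratio = required_ratio
        # A bounded deque drops the oldest interval automatically,
        # which keeps the window size consistent.
        self._results = deque(maxlen=window)

    def record(self, interval_met_slo: bool) -> None:
        self._results.append(interval_met_slo)

    def compliance(self) -> float:
        if not self._results:
            return 1.0  # assumption: no data yet counts as compliant
        return sum(self._results) / len(self._results)

    def within_exemption(self) -> bool:
        """True while the SLO was met in at least `required_ratio` of the window."""
        return self.compliance() >= self.required_ratio
```

For a one-hour rolling window of per-minute checks with a 95% exemption clause, this would be `RollingSloTracker(window=60, required_ratio=0.95)`: up to three bad minutes in any hour are tolerated, and a fourth is not.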

SLA indicators can be divided by audience into the following two dimensions.

First, the business dimension: customers perceive these indicators most directly, and they are tied to the quality of the user experience.

  • For example, response time and error rate. According to some statistics, if the response time is greater than 1 s, 80% of users will be lost. The error rate guarantees functional correctness: if a business error occurs at the very first step, the customer cannot complete the intended operation, and losing that customer is inevitable. These indicators directly affect the user experience.

Second, the server-side dimension: these indicators describe the server side. They are mainly for developers and testers, so that problems can be located quickly when they occur.

  • For example, system indicators for ECS, RDS, and so on, including CPU and LOAD.

SLA in stress testing

An important step in the design stage of a performance stress test is to define the "performance stress test pass standard". Without it, stress testing can go on endlessly because no one knows when it should end, which hurts the efficiency of the test and wastes people and money. The pass standard is therefore a set of quantified indicators used to decide whether the stress test results meet expectations and the test can stop. The standard may come from the expectations of the business side, the performance expectations of the R&D team, and so on; once collated, it is what we call the SLA in stress testing. This SLA is closely related to the product's external SLA, but the two are not the same. The connection is that the external SLA is an important source for the stress test SLA. The difference is that the stress test SLA may cover more indicators in more detail, while the external SLA does not go into that much detail.

What is the right way to stress test?

In a stress test, what looks like a simple business request is actually served by a complex back-end architecture, such as a unified access layer, a container layer, and a storage layer. The container layer alone involves many different applications and services. With an architecture this complex, how do you quickly determine whether the stress test results meet business needs? How do you quickly determine that the system has reached its water level (capacity limit) and no more pressure should be applied?

As someone involved in capacity preparation (a developer or a tester), picture what usually happens.
The order is given: start the stress test! Developer A watches system A, developer B watches system B, developer C watches the network layer, and tester D watches the stress test results. Everyone is scrambling. Then someone shouts in the group chat, "My system can't handle it, stop!" (there is a risk, of course, that this colleague has misjudged), and the stress test stops. This is actually the better case. In some stress tests there is only one tester. How does that person divide the work? "I" watch the stress test results, then system A, then system B, and end up very busy.

Can a stress test run this way achieve its purpose? Of course. But is it the best way to work? Of course not! This is where the SLA comes in handy.

  • First, developers, testers, and the business side should align on the SLA indicators before the stress test. This means making clear the service capability the system must sustain and the overall water level of the system, which reduces later back-and-forth. Everyone prepares capacity toward this goal.
  • Second, once the SLA is configured, the person in charge of the stress test only needs to watch for SLA alarms. If an alarm persists, the system cannot handle the load, and either that person stops the stress test or the SLA rule stops it directly. For those running the stress test this saves time and effort: no indicator is missed and no stress testing time is wasted.
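
The idea that a one-off alarm is only a warning while a persistent alarm means "stop" can be sketched as a counter of consecutive breaches. `SustainedBreachDetector` and its `limit` parameter are hypothetical names for illustration; how PTS itself decides that an alarm "continues" is not described in this article.

```python
class SustainedBreachDetector:
    """Turns a stream of breached / not-breached samples into ok, alarm, or stop."""

    def __init__(self, limit: int):
        if limit <= 0:
            raise ValueError("limit must be positive")
        self.limit = limit   # consecutive breaches that justify stopping
        self.streak = 0      # current run of consecutive breaches

    def observe(self, breached: bool) -> str:
        # A healthy sample resets the streak, so only a sustained breach stops the test.
        self.streak = self.streak + 1 if breached else 0
        if self.streak >= self.limit:
            return "stop"
        return "alarm" if breached else "ok"
```

Feeding one sample per sampling interval into `observe` means a brief spike only raises an alarm, while `limit` breaches in a row produce the stop signal that the tester (or an automated rule) can act on.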

How to use SLA correctly in PTS

Imagine that the developers are all busy, and only "I", the tester, have time to watch the stress test. While the test runs, any business-dimension or system-dimension data that fails the standard is reported directly to "me", and all "I" need to do is decide whether to stop the test and then output the system capacity water level report. Is that possible? PTS provides exactly this function: SLA settings. SLA settings are built on the collected indicators. The richer the collected indicators, the richer the SLA can be and the better it can serve different businesses.

In practice, first understand the indicators that PTS provides, then select the ones that fit your business and set thresholds for them, and finally run the stress test.

First, understand the available indicators

Monitoring indicators fall into two groups: client-related indicators, that is, business dimension indicators, and server-related indicators.

Client monitoring indicators are the most intuitive way to determine whether the service provided by the system meets business requirements. PTS provides indicators such as RPS, failed-request RPS, and response time.

Server-related indicators are defined from the R&D perspective. On the one hand, the performance of the server-side system directly affects the client-side indicators, so the two are linked. On the other hand, when a problem shows up on the client or the server, server-side indicators make it easier to locate. The PTS server-side indicators include monitoring data for related components such as SLB, ECS, and RDS.

Second, select core indicators and set thresholds

  • First, the client-side SLA indicators are RT, RPS, and success rate. They describe whether client access is normal in terms of response time, access load, and availability, and they directly reflect how the service feels to the customer and whether the core service stays continuously available. Client-side indicators usually need to be set by testers and the business side together, based on the specific business.
  • The success rate is the core indicator of system availability. The business success rate takes priority; if no business success rate is configured, the default success rate, for example one based on the response code, is used.
  • RT reflects how fast customers can access the site. Internet users are generally not very patient: research from KissMetrics showed that "a 1 second delay in web page response may reduce conversions by 7%" and "47% of consumers expect the web page to load within 2 seconds".
  • RPS here is the maximum RPS the system can carry, that is, the maximum water level of the system capacity.
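
As a rough illustration of how these client-side indicators relate to raw request records, the sketch below derives RPS, failed-request RPS, success rate, and average RT, with the business result taking priority over the response code. The record layout (`RequestRecord`, `business_ok`) and the choice to treat 2xx codes as success are assumptions made for this example, not the PTS data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RequestRecord:
    status_code: int
    rt_ms: float
    business_ok: Optional[bool] = None  # None: no business success check configured


def is_success(record: RequestRecord) -> bool:
    """The business result takes priority; otherwise fall back to the response code."""
    if record.business_ok is not None:
        return record.business_ok
    return 200 <= record.status_code < 300  # assumption: 2xx counts as success


def client_indicators(records, duration_seconds: float) -> dict:
    """Aggregate the client-side indicators for one measurement period."""
    total = len(records)
    if total == 0 or duration_seconds <= 0:
        raise ValueError("need at least one record and a positive duration")
    succeeded = sum(is_success(r) for r in records)
    return {
        "rps": total / duration_seconds,
        "failed_rps": (total - succeeded) / duration_seconds,
        "success_rate": succeeded / total,
        "avg_rt_ms": sum(r.rt_ms for r in records) / total,
    }
```

Note that a request can return a 200 response and still count as failed when its business check says so, which is why the business success rate is the stricter and more meaningful figure.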

Second, the server-side indicators cover three levels: SLB, ECS, and RDS. The indicators at each level are determined by the characteristics of the service the component provides. For example, ECS indicators include CPU utilization, memory utilization, and LOAD; SLB indicators include the number of dropped connections and the number of abnormal back-end servers; RDS indicators include CPU utilization, memory utilization, IOPS, and connection utilization. Most of these thresholds are set by developers, and there are general rules of thumb, such as CPU generally not exceeding 80% and LOAD not exceeding 1.5 times the number of cores. Specific cases still need specific analysis.
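
The two rules of thumb just mentioned (CPU not above 80%, LOAD not above 1.5 times the number of cores) are simple enough to write down directly. The function below is a hypothetical helper for illustration; its defaults come from this article and should be tuned per system.

```python
def server_side_breaches(cpu_percent, load, cores,
                         cpu_limit_percent=80.0, load_per_core_limit=1.5):
    """Return the names of the rule-of-thumb limits that one server sample exceeds."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    breaches = []
    if cpu_percent > cpu_limit_percent:
        breaches.append("cpu")
    # The LOAD limit scales with the core count, e.g. 1.5 * 4 cores = 6.0.
    if load > load_per_core_limit * cores:
        breaches.append("load")
    return breaches
```

On a 4-core machine, a sample with 85% CPU and LOAD 5.0 breaches only the CPU rule, because the LOAD limit for four cores is 6.0.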

Third, once you have selected the indicators and set thresholds for them, you can stress test with confidence. If a configured SLA rule is triggered during the test, an alarm is raised or the stress test is stopped directly, and a summary of the events is recorded.
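
The behavior described here (each rule either alarms or stops the test, and the events are summarized afterwards) can be sketched as a small loop over metric snapshots. This is an illustration under assumed names (`SlaRule`, `run_with_sla`, one snapshot per second), not the PTS implementation or its configuration format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlaRule:
    metric: str
    limit: float
    breach_when: str  # "above" or "below" the limit
    action: str       # "alarm" or "stop"

    def is_breached(self, value: float) -> bool:
        if self.breach_when == "above":
            return value > self.limit
        if self.breach_when == "below":
            return value < self.limit
        raise ValueError(f"unsupported breach_when: {self.breach_when}")


def run_with_sla(snapshots, rules):
    """Check each snapshot against the rules; halt at the first "stop" breach.

    Returns (status, events), where each event is
    (snapshot index, metric, observed value, action).
    """
    events = []
    for index, snapshot in enumerate(snapshots):
        stop_requested = False
        for rule in rules:
            value = snapshot.get(rule.metric)
            if value is not None and rule.is_breached(value):
                events.append((index, rule.metric, value, rule.action))
                stop_requested = stop_requested or rule.action == "stop"
        if stop_requested:
            return "stopped", events
    return "completed", events
```

A typical setup would mark a softer threshold such as `SlaRule("avg_rt_ms", 500, "above", "alarm")` as alarm-only and reserve `"stop"` for indicators that signal real damage, such as `SlaRule("success_rate", 0.99, "below", "stop")`. The returned event list plays the role of the event summary.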

In this way, aligning on the SLA indicators early and configuring the SLA in PTS not only aligns everyone's targets but also frees up manpower during the stress test, and makes it easy to see at a glance which indicators have reached their thresholds. Before the SLA was set, everyone scrambled to watch all kinds of indicator data for fear of missing something. With the SLA in place, you can get through the stress test over a cup of tea. Beyond improving stress testing efficiency through SLA settings, we will also combine SLA with intelligent stress testing, so stay tuned.

Summary

SLAs are everywhere. This article introduced what an SLA is, why it matters to set one during stress testing, and how to use it correctly. Set and use SLAs properly, and stress testing no longer has to be frantic. If you see things differently, corrections are welcome. Thank you!

Origin blog.51cto.com/14977574/2546514