Yunji's road to full-link stress testing

 

To be honest, our road to full-link stress testing has been a hard one. We started out agonizing over which stress testing tool to use, moved on to stress testing frameworks and order-placement interfaces, then the stress testing environment, and worked our way step by step to the online environment. Feeling our way forward and drawing on the experience and solutions of peer companies, it took us more than a year to find, in the dark, a road to full-link stress testing that suits our "chicken farm" (as we call Yunji). We grew and matured gradually through trial and error.

 

1. What exactly is a full-link stress test?

When traffic is low, developers and testers can get by with offline functional testing: as long as the feature works, that is enough. But as the user base grows and traffic increases, we gradually realize that relying on regular functional tests alone is far from enough. Once traffic picks up, we must pay attention to system performance. After all, no one wants their system to be ruthlessly knocked over by traffic. We need to know where we stand (know the system's capacity water level, so that capacity planning has something to go on). At this stage, most companies choose to stress test the framework, middleware, and storage layer offline to determine their throughput. However, there is still a big gap between such offline results and what happens online. After all, most companies' stress test environments are not 1:1 with production (I have yet to see a company rich enough for that), so results from the stress test environment are for reference only and cannot guide decisions about the online environment. The only way is to run stress tests directly in the online environment.

 

Running stress tests directly in the online environment is easy to say, but it hides great risks. Our system is being used by real users most of the time. During peak traffic in particular, stress test traffic must never cause a system failure that stops users from placing orders, and it must never pollute online data. Imagine that 10 extra iPhone Xs suddenly appear in user A's order history: do you ship them or not? Or a user's balance suddenly drops, which is absolutely unacceptable to users (unless it goes up, in which case they quietly pocket the windfall). The online environment also runs all kinds of scheduled tasks. If stress test data gets counted as revenue, the head of operations will probably be inviting the R&D folks in for "tea".

 

Despite the difficulties, this is the only way to measure the system's real capacity water level, to plan capacity before a big promotion with something to go on, and to set a reasonable rate-limiting threshold. The system therefore needs to distinguish, intelligently and accurately, which traffic comes from real users and which is stress test traffic, and then route the stress test data to an isolated environment for storage. I will talk about how to distinguish stress test traffic later.

 

So what exactly is a full-link stress test? Everyone is familiar with stress testing a single interface. Suppose that when interface A of a website is stress tested on its own, it reaches 100,000 QPS. If interface B is stress tested at the same time, interface A will no longer reach 100,000 QPS. This is because no interface in a system exists in isolation; each is constrained to some degree by shared resources, and when a shared resource becomes the bottleneck, the whole system is affected. Put simply, full-link stress testing means stress testing all the core links of the system at the same time. Only when traffic hits the entire system are its performance bottlenecks exposed and its real capacity water level measured. That is the point of full-link stress testing.

 

2. How does the system distinguish between real user traffic and stress test traffic?

Before talking about how to distinguish stress test traffic, let's address a more sensitive question: does the business system need invasive changes? The clear answer is yes. Full-link stress testing is rarely taken on in the early stage of a business; the need usually arises later, and the more complex the business has become, the harder the changes are. That said, the company's infrastructure team should make sure that most of the traffic-distinguishing work is done in the middleware and base components. Zero intrusion, however, is almost unrealistic.

 

The key points of implementing full-link stress testing are:

1. Distinguish stress test traffic from real traffic;

2. Store stress test data in an isolated environment.

  

In the earliest days, we only ever stress tested read interfaces in the online environment. We did not dare to run read/write stress tests in parallel from the start, because we simply had no confidence. That is the honest truth. But once we knew how to distinguish stress test traffic and how to isolate stress test data, we could boldly start carrying out real online full-link stress tests.

 

First, let's talk about how to distinguish stress test traffic:

1. Stress test traffic is uniformly marked in the URL;

2. After the access layer receives a request, a Filter intercepts it, identifies the stress test marker, and puts it into a ThreadLocal;

3. When the access layer calls a service, the call-chain tracing instrumentation reads the stress test marker from the ThreadLocal and writes it into the context to pass it downstream;

4. When data is persisted or stored, the Base components read the stress test marker from the service context to decide where the data goes.
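
The four steps above can be sketched roughly as follows. This is only an illustration in Python, not our actual implementation: the parameter name `stress_test`, the function names, and the data source names are all made up for the example, and a `ContextVar` stands in for the ThreadLocal.

```python
from contextvars import ContextVar
from urllib.parse import urlparse, parse_qs

# Hypothetical marker name; the article does not say what the real URL marker is.
MARKER_PARAM = "stress_test"

# Per-request storage, standing in for the ThreadLocal mentioned in step 2.
_is_stress_test = ContextVar("is_stress_test", default=False)


def access_layer_filter(url):
    """Step 2: inspect the incoming URL and record whether it is marked."""
    query = parse_qs(urlparse(url).query)
    _is_stress_test.set(query.get(MARKER_PARAM, ["0"])[0] == "1")


def build_rpc_context():
    """Step 3: copy the marker into the context sent with a downstream call."""
    return {MARKER_PARAM: "1" if _is_stress_test.get() else "0"}


def choose_datasource(rpc_context):
    """Step 4: a base component picks the data destination from the context."""
    return "shadow_db" if rpc_context.get(MARKER_PARAM) == "1" else "online_db"


def handle(url):
    """Run one request through the filter, the RPC hop, and the storage decision."""
    access_layer_filter(url)
    return choose_datasource(build_rpc_context())
```

The point of the design is that business code never touches the marker: it is set once at the edge, carried by the calling context, and consumed only by the component that decides where data goes.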

 

Some business flows need to call a third-party interface (the most typical case is a bank payment interface). In this case, the business system needs to check whether the request is marked traffic and, if so, mock the call directly, as shown in Figure 1:

Figure 1 Stress test traffic identification
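
As a rough sketch of the mock branch for third-party calls, assuming the stress test marker has already been resolved to a boolean: `PayResult`, `pay`, and `call_bank_gateway` are hypothetical names for illustration, and the real bank call is deliberately left unimplemented.

```python
from dataclasses import dataclass


@dataclass
class PayResult:
    success: bool
    source: str  # "mock" for stress test traffic, "bank" for the real gateway


def call_bank_gateway(order_id, amount_cents):
    """Placeholder for the real third-party call; out of scope for this sketch."""
    raise NotImplementedError("real bank call is not part of this example")


def pay(order_id, amount_cents, is_stress_test, gateway=call_bank_gateway):
    """Return a canned result for marked traffic; otherwise call the gateway."""
    if is_stress_test:
        # Marked traffic must never reach the third party.
        return PayResult(success=True, source="mock")
    return gateway(order_id, amount_cents)
```

Passing the gateway in as a parameter keeps the sketch testable; in a real system the check would more likely sit inside a shared client wrapper so that individual callers cannot forget it.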

 

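Once we know how to tell real user traffic from stress test traffic, the next question is: since stress test data must not pollute the online environment, where should it go? Our approach works along two dimensions, with physical isolation and logical isolation coexisting. After all, we are not rich enough to build a storage system that is 1:1 with production.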

 

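For data that truly has to be persisted (order data, for example), we write the stress test data to a shadow database, fully isolated from the online environment. This is the safest option, but the cost is relatively high because it runs on separate instances; for safety's sake it is still well worth it. This is physical isolation. Logical isolation applies to intermediate data, such as stress test data that needs to be written to MQ or Redis. When writing to Redis, for example, the keys of stress test data uniformly carry the stress test marker. When writing to MQ, we write to a different Topic (given our business characteristics and the way RocketMQ itself is implemented, we chose to have the NameServer route to a fixed set of stress test MQ machines).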
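
The two kinds of isolation can be illustrated with a minimal sketch. Everything here is an assumption made for the example: the instance names, the `pt:` key prefix, and the `_PT` Topic suffix are hypothetical, and the NameServer-level routing to dedicated MQ machines is not modeled at all.

```python
# All names below are hypothetical; the article only states the isolation strategy.
ONLINE_DB = "orders-db.online"   # stands in for the production instance
SHADOW_DB = "orders-db.shadow"   # stands in for the separate shadow instance
STRESS_KEY_PREFIX = "pt:"        # assumed form of the marker on Redis keys
STRESS_TOPIC_SUFFIX = "_PT"      # assumed naming for the stress test Topic


def database_for(is_stress_test):
    """Physical isolation: stress test writes go to a different instance."""
    return SHADOW_DB if is_stress_test else ONLINE_DB


def redis_key(key, is_stress_test):
    """Logical isolation: same Redis, but stress test keys carry the marker."""
    return STRESS_KEY_PREFIX + key if is_stress_test else key


def mq_topic(topic, is_stress_test):
    """Logical isolation: stress test messages are published to another Topic."""
    return topic + STRESS_TOPIC_SUFFIX if is_stress_test else topic
```

For instance, `redis_key("cart:42", True)` gives `pt:cart:42`, so stress test entries can never collide with a real user's `cart:42` and can be cleaned up by prefix afterwards.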

 

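3. How is stress test traffic dispatched, and how is stress test data built?

In the very beginning, developers and testers used all sorts of stress testing tools, such as JMeter and Apache AB. But conventional tools like these cannot generate very large volumes of stress test traffic in an instant (distributed stress testing is hard to achieve with them). Later we also considered nGrinder, but it is not well suited to full-link stress testing either: its Controller is too heavyweight and can manage only a very limited number of Agents. JD.com has already verified this, so there was no need for us to take the same detour, and this directly gave rise to the idea of building our own full-link stress testing system.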

 

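The chicken farm's full-link stress test drill system is called Titan. Its overall architecture is shown below:

Figure 2 Overall architecture of Yunji's full-link stress test drill system (Titan)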

 

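This article will not cover Titan in detail, because we expect to officially open-source it around the first half of next year. We can promise that the open-source branch will stay in sync with Yunji's internal version and that no features will be stripped out.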

 

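Stress test traffic is dispatched by Titan, but how is the stress test data built? In the earliest days, we hand-built scripts for the dynamic parameters. As everyone knows, this is extremely painful, especially for dynamic parameters that can be used only once. When the stress test is large, it becomes unbearable and highly error-prone. So our next step was to build a stress test data factory: the data factory produces the stress test data, Titan dispatches the traffic, and the whole path is connected end to end.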
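
To make the idea of a data factory concrete, here is a toy sketch. It does not reflect how our data factory or Titan is actually built; `build_requests`, the payload fields, and the `pt-` token format are all invented for the example. The only point is that one-time parameters are generated rather than written by hand.

```python
import itertools


def build_requests(user_ids, sku_ids, count):
    """Yield `count` request payloads, each with a token that is never reused."""
    users = itertools.cycle(user_ids)
    skus = itertools.cycle(sku_ids)
    for seq in range(count):
        yield {
            "user_id": next(users),
            "sku_id": next(skus),
            # One-time dynamic parameter: derived from the sequence number,
            # so no two payloads in a batch share a token.
            "token": f"pt-{seq:08d}",
        }


batch = list(build_requests(user_ids=[1, 2], sku_ids=[10], count=5))
```

Because the factory is a generator, a load generator can pull payloads lazily instead of holding a very large batch in memory.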

 

Writing takes effort. If you found this article helpful, please give it a like and credit the source when reposting. Thank you!

 

 

 


Origin http://43.154.161.224:23101/article/api/json?id=326310149&siteId=291194637