Best Practices of Data Platform Traffic Playback|Featured

1 Background and challenges

1.1 Data Platform Business Background

The data platform uses big-data intelligent analysis, data visualization, and related technologies to present and apply multi-source heterogeneous data that is collected, built, managed, and analyzed inside and outside the company. It enables data sharing, automatic generation of daily reports, and fast, intelligent analysis, digs deeper into the value of data, and meets the data analysis needs of departments at all levels of the enterprise. As a result, it is characterized by large data volumes, many scenarios, high requirements for data accuracy, and a need for guaranteed query performance.

1.2 Traditional testing methods

Given these characteristics, offline data testing and regression testing on the data platform are costly and difficult. We therefore wanted an effective way to lower the cost and barrier of testing and to standardize it. Until now we have done this by writing automated tests. However, traditional automated testing has many drawbacks: high cost, limited scenario coverage, and difficulty of standardization.

1.3 Disadvantages of traditional automation

1.3.1 High cost:

  • Manually writing and maintaining automated test cases is costly

  • A low tester-to-developer ratio means testing cannot keep up with the pace of iteration

1.3.2 Limited coverage scenarios:

  • It is difficult to construct test scenarios offline

  • Scenario coverage is limited

1.3.3 Standardization is difficult:

  • Relies heavily on the individual experience and ability of QA engineers

  • Developers find it hard to troubleshoot automation failures on their own, so efforts to promote developer self-testing have had little effect

Therefore, we wanted to use online traffic to build a traffic playback platform and combine it with automated testing, producing an automated testing system suited to the characteristics of the data platform.

2 Introduction to Traffic Playback Platform

Traffic playback works by recording the real traffic of user operations at the online entry points, replaying it in the pre-release environment, and comparing the sub-calls and responses of the entry interface between the production and pre-release environments to locate code problems. The interfaces that can be onboarded include read-only, read-write, and write-only interfaces. The approach offers zero intrusion into business code, automatic traffic diff, real call chains, verifiable data, precise problem location, and a higher chance of finding problems. However, if handled carelessly, sub-calls with write operations inside a replayed interface can generate dirty data and affect the business.

2.1 Research on Traffic Playback Platform

Once the direction was confirmed, we immediately surveyed and compared the company's traffic playback platform, Alibaba's Doom, and Twitter's Diffy. The differences are shown in the figure below.

1.jpeg

2.2 Data platform business characteristics

  1. Because of how data platform reports are queried, the code has few external query call chains but many internal combinations of dimensions and conditions. As a result, when the Pandora platform is used to record online traffic, the recording is incomplete and most scenarios cannot be fully covered.

  2. Complex data platforms generally rely on a large amount of attribute configuration management, scheduled synchronization tasks, and so on, so the configuration databases of the pre-release and production environments must be isolated to protect data from contamination. Traffic playback, however, requires the configuration database and the data to be the same in both environments, and the usage scenarios depend heavily on configuration data, which makes playback difficult to implement.

  3. Traffic playback on the data platform often has to verify the data itself when checking results. The replayed requests put query pressure on the production database and may affect the stability of the production environment, so the playback rate must be controllable, with monitoring and degradation protection in place.

  4. Some of the data is real-time, so the comparison of playback results needs to account for fluctuation.

Because of these characteristics, the data platform could not be connected to the company's Pandora platform as it was. We contacted the owner of the company platform as early as possible to discuss this and propose improvement requirements.

But the urgency of the problem led us to do some low-cost work first. On the one hand, this would relieve our pain points as soon as possible; on the other hand, it had to make later onboarding to the company platform easy and avoid wasting resources. To that end, in the first phase we used scripts to collect traffic and quickly prototyped a simple traffic playback system with the open-source tool Diffy, while proposing adaptation requirements to the platform. In the second phase, the traffic collected by the scripts is uploaded to the platform and replayed there.

The benefits of this approach are:

  1. The traffic is under our own control and can be expanded in a targeted way as needed, without worrying about sparse traffic, the impact of recording on the online environment, or incomplete interface coverage.

  2. Collecting traffic from logs or tracking points (instrumentation) offers a new approach to traffic collection.

  3. The open-source tool costs only the effort of deploying and learning it, and those resources can be reclaimed after we move to the platform, so nothing is wasted and no wheel is reinvented.

With this background, we implemented the traffic playback solution for the data platform as follows.

2.3 Core Principles

The overall idea is still to capture online traffic, replay it in environments running different code, and compare the results returned by the interfaces in order to check the correctness of the code under test. We filter production traffic by fields such as time, interface whitelist, and operator, deduplicate and filter it by window, and settle it into a stable traffic pool. After a task is triggered, the traffic is replayed concurrently against the pre-release and production environments at the specified rate to obtain the interface responses. After a series of noise reduction steps, the overall success rate is calculated from the field comparison results and a report is produced. The following sections introduce the solution from four aspects: traffic collection, environment strategy, execution scheduling, and comparison results.

~Traffic Playback Interaction Structure Diagram~

2.png

2.3.1 Traffic Collection

With the company's traffic recording approach it is hard to raise interface coverage, which does not suit a system with few external call chains and many condition combinations. We therefore collect traffic by filtering tracking-point data. This avoids the uneven traffic distribution seen during recording, reduces the performance impact on online services, and gives very complete interface coverage. The traffic is under our own control and can be obtained in a targeted way.

During traffic collection, we continuously pull traffic from the production system in batches, according to the configured dates and quantity. Each batch is deduplicated by input parameters and request path, then filtered using the curated interface whitelist, traffic operators, interface keywords, request types, and so on. Dirty data is then filtered out of the traffic, and special characters and redundant fields in the parameters are corrected. Finally, the cleaned data is saved to the local traffic pool to await use by tasks.
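The filtering and deduplication of one batch could be sketched roughly as follows. This is a minimal illustration, not the team's actual script: the `RecordedRequest` type, its field names, and the `build_traffic_pool` helper are hypothetical, and only the whitelist and request-type filters are shown. The sketch cleans parameters before deduplicating, so that requests differing only in redundant fields collapse into one; that ordering is an assumption.

```python
import json
from dataclasses import dataclass


@dataclass
class RecordedRequest:
    """One collected request; the field names are illustrative assumptions."""
    path: str
    method: str
    params: dict


def clean_params(params, redundant_fields):
    # Drop redundant fields and strip stray whitespace from string values.
    return {
        key: value.strip() if isinstance(value, str) else value
        for key, value in params.items()
        if key not in redundant_fields
    }


def dedup_key(req):
    # Same path plus same parameters (regardless of key order) counts as one request.
    return (req.path, json.dumps(req.params, sort_keys=True, default=str))


def build_traffic_pool(batch, whitelist, allowed_methods, redundant_fields=frozenset()):
    """Filter one batch by whitelist and request type, clean it, and deduplicate it."""
    pool, seen = [], set()
    for req in batch:
        if req.path not in whitelist or req.method not in allowed_methods:
            continue
        cleaned = RecordedRequest(
            req.path, req.method, clean_params(req.params, redundant_fields)
        )
        key = dedup_key(cleaned)
        if key in seen:
            continue
        seen.add(key)
        pool.append(cleaned)
    return pool
```

Calling `build_traffic_pool(batch, {"/report/query"}, {"POST"}, {"traceId"})` keeps only whitelisted `POST` requests and treats two requests as duplicates once a per-request field such as the hypothetical `traceId` has been removed.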

3.jpeg

At a later stage, the processed traffic is uploaded through an interface to the Pandora traffic playback platform, where traffic and executions can be managed more conveniently and efficiently with the platform tools.

After uploading, you can view the traffic on the traffic playback platform. Traffic can also be uploaded manually via Excel, but the amount of traffic per batch is limited.

4.png

2.3.2 Environment Strategy

Two environments are compared: pre-release and production. Through configuration, the data source of the pre-release environment points to the production service, and the production configuration database is regularly synchronized to the pre-release environment to close the gap in data and configuration.

5.png

6.png

2.3.3 Execution Scheduling

There are two ways to trigger a task: a configured timer, or a manual call to the interface. Once a task is triggered, it fetches traffic from the traffic pool and checks each item's keywords and execution data level again to decide whether it may be executed. Confirmed traffic is then put into the thread pool for playback. A fixed-size thread pool and a rate controller are used to achieve high concurrency and flexible configuration of the request rate.

Once a task is running, the configuration can be modified at any time, based on how the execution is going, to stop the task or adjust its sending rate and so control the impact on the online environment.
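The combination of a fixed-size thread pool and a rate controller can be sketched as below. This is an illustrative sketch under stated assumptions rather than the actual scheduler: `RateLimiter`, `replay_all`, and the `send` callback are hypothetical names, and the limiter simply spaces request start times evenly. The rate can be changed through `set_rate` and the run can be cancelled through `stop_event` while playback is in progress.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class RateLimiter:
    """Spaces out calls so that at most `rate_per_sec` requests start each second."""

    def __init__(self, rate_per_sec):
        self._lock = threading.Lock()
        self._next_slot = time.monotonic()
        self.set_rate(rate_per_sec)

    def set_rate(self, rate_per_sec):
        # May be called while a task is running to speed playback up or slow it down.
        if rate_per_sec <= 0:
            raise ValueError("rate_per_sec must be positive")
        self._interval = 1.0 / rate_per_sec

    def acquire(self):
        # Reserve the next free time slot under the lock, then sleep outside it.
        with self._lock:
            now = time.monotonic()
            slot = max(self._next_slot, now)
            self._next_slot = slot + self._interval
        delay = slot - now
        if delay > 0:
            time.sleep(delay)


def replay_all(requests, send, limiter, workers=4, stop_event=None):
    """Replay requests on a fixed-size pool; results keep the input order."""
    if stop_event is None:
        stop_event = threading.Event()

    def worker(req):
        if stop_event.is_set():
            return None  # task was stopped: skip the remaining traffic
        limiter.acquire()
        return send(req)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(worker, requests))
```

Keeping the limiter separate from the pool means concurrency (`workers`) and request rate can be tuned independently, which matches the need to protect the production database during playback.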

7.png

2.3.4 Comparison Results

After the responses from production and pre-release are obtained, the two are compared to find inconsistent fields and return values. Because of the nature of the data, there are many noise points, so the AAdiff method is introduced for automatic noise reduction. Noise is reduced in two ways:

a. AAdiff: before the comparison, call the production environment twice in a row and compare the two results; fields that differ between the two calls are excluded from the comparison. This removes unstable or fluctuating fields.

b. Ignore specified fields: manually configure certain configuration fields or meaningless fields to be ignored, to reduce noise.
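The two noise reduction steps above can be sketched as follows. This is a minimal sketch assuming JSON-like responses (nested dicts and lists); the helper names `flatten`, `diff_fields`, and `aa_diff` and the dotted field-path format are illustrative assumptions, not the platform's implementation.

```python
_MISSING = object()  # marks a field that is absent on one side


def flatten(value, prefix=""):
    """Flatten nested dicts/lists into {"a.b[0].c": leaf_value} pairs."""
    if isinstance(value, dict):
        items = [(f"{prefix}.{k}" if prefix else str(k), v) for k, v in value.items()]
    elif isinstance(value, list):
        items = [(f"{prefix}[{i}]", v) for i, v in enumerate(value)]
    else:
        return {prefix: value}
    flat = {}
    for path, child in items:
        flat.update(flatten(child, path))
    return flat


def diff_fields(left, right):
    """Return the set of field paths whose values differ between two responses."""
    a, b = flatten(left), flatten(right)
    return {p for p in a.keys() | b.keys() if a.get(p, _MISSING) != b.get(p, _MISSING)}


def aa_diff(prod_first, prod_second, pre_release, ignored=frozenset()):
    """Compare production with pre-release, dropping noisy and ignored fields.

    Fields that differ between two consecutive production calls are treated
    as noise (step a); `ignored` holds manually configured fields (step b).
    """
    noisy = diff_fields(prod_first, prod_second)
    candidates = diff_fields(prod_first, pre_release)
    real = candidates - noisy - set(ignored)
    return sorted(real), sorted(noisy)
```

For example, if a hypothetical `ts` field changes between the two production calls, it is reported as noise and never counted as a difference, even though it also differs in the pre-release response.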

After the differences are compared, they are grouped and summarized by field, and fields that failed AAdiff are grayed out. Click a field to view its difference data on the right.

8.jpeg

Clicking the difference details shows further information such as the request path, request body, and the production and pre-release return values, which helps troubleshoot and locate problems.

9.png

The result report also shows information such as traffic volume and playback success rate.
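A report of this shape could be computed roughly as below. The article does not define the success rate, so this sketch assumes a replayed request passes when no differing fields remain after noise reduction; the `summarize` helper and its input format are hypothetical.

```python
from collections import defaultdict


def summarize(results):
    """Summarize one playback run.

    `results` is a list of (request_id, differing_fields) pairs, one per
    replayed request, where differing_fields is what remains after noise
    reduction. Assumption: a request passes when that list is empty.
    """
    by_field = defaultdict(list)
    passed = 0
    for request_id, fields in results:
        if not fields:
            passed += 1
        for name in fields:
            # Group failing requests under each differing field.
            by_field[name].append(request_id)
    total = len(results)
    return {
        "total": total,
        "passed": passed,
        "success_rate": passed / total if total else 0.0,
        "by_field": dict(by_field),
    }
```

Grouping by field rather than by request mirrors the report view described above, where selecting a field lists the difference data under it.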

3 Business Practices

Here we take the intelligent operation system as an example and compare efficiency and cost before and after adopting traffic playback.

10.png

Traffic playback not only rapidly increased automated interface coverage and reduced the labor required for each iteration, but also made regression more reliable. This is also reflected in the trend of iteration quality.

Platform data:

The traffic playback tool was first used in iteration 513, but coverage and stability were poor. It was improved and officially put into use in iteration 514.

After the tool was officially put into use in iteration 514, the share of bugs it found that regression testing had missed reached 25%. Quality improved significantly in iteration 515, and no defects escaped to production for two consecutive iterations. Platform quality and stability improved significantly.

Since traffic playback for the intelligent operation system was put into use, it has continuously supported daily regression testing and daily stress testing across multiple iterations. Read-interface coverage has reached 86%, the playback pass rate has stabilized at 98%, and the share of regression-missed bugs found is 25%. System stability and online quality have improved greatly.

11.png

4 Planning and Outlook

Traffic playback for the intelligent operation system has entered the maintenance stage, and it supports testing tasks such as smoke testing, regression, stress testing, and cache verification in daily iterations. In the future, the small number of interfaces with sparse traffic will be covered through targeted interface traffic collection, and that traffic will be uploaded to the traffic playback platform. With the platform's capabilities, executing plans and troubleshooting problems will be more stable and convenient.

Because each system on the data platform consists mainly of read interfaces, they are well suited to regression by traffic playback. In the future, each system will be connected to the traffic playback platform in priority order, and interface coverage will be raised quickly through tracking-point traffic collection.

* Text/Humble Little Quarter

This article belongs to Dewu technology original, source: Dewu technology official website

It is strictly forbidden to reprint without the permission of Dewu Technology, otherwise legal responsibility will be investigated according to law!


Origin my.oschina.net/u/5783135/blog/10085118