Construction and Practice of a Code Change Risk Visualization System

This article is compiled from the 77th Meituan Technology Salon, "Quality Risk Prevention and Control and Stability Management Practices of Meituan's Billion-level Traffic System". The first part introduces software system risks and changes; the second part describes how the code change risk visualization system was built; the third part covers how the system is applied within Meituan; the last part outlines plans for the future. We hope it is helpful or inspiring.

  • 1 Software system risks and changes

  • 2 Code change risk visualization (Houyi) system construction

  • 3 Code change risk visualization (Houyi) system practice

  • 4 Future planning and prospects

  • 5 Q & A

1 Software system risks and changes

Change is what drives a software system to evolve, and it is also a breeding ground for risk. A system that is never iterated on gradually loses its vitality and value, yet every iteration can introduce new risks, and avoiding the risks caused by change is a major challenge in quality assurance. From the typical software system architecture diagram below, we can extract three major categories of change:

2fa1860bde8bda9c04f53fe57a7a5b19.png

  • Infrastructure changes: mainly basic hardware changes, carrier network changes, cloud service and container changes, development language changes, operating system changes, and data center cluster changes. These iterations greatly improve the service capabilities of the underlying system, but once such a change triggers a systemic risk, the impact is usually large.

  • Changes outside the system: such as sudden increases in user traffic, changes in user needs, and changes in third-party services and components. These push the system to keep developing new capabilities, and they also make stability risks more likely.

  • Changes inside the system: such as changes in technical staff, new feature releases, and upgrades of the overall system architecture. These are the core factor driving the evolution of the software, and they are also where change risks occur most often.

Here are some typical online incidents caused by change risks:

  • A problem caused by an external change: an optical cable was cut in one location, which had a major impact on the entire service.

  • A typical code change problem: functional failures caused by side effects when Google's Gmail system released new features.

  • Another typical code change problem: abnormal logic was triggered at Knight when a piece of very old code was upgraded.

  • A problem caused by a configuration change, which resulted in "snatching" events.

  • Deletion of an entire set of core data, caused by an operational change in which R&D staff made a mistake.

6c1387c192c98a985f10c98bb0e700cf.png

Pictures from the Internet

As these cases show, risks caused by changes can have a great impact on the business. For Meituan's Daojia business, with its billion-level traffic, the likelihood of change-induced risk is amplified further and the "butterfly effect" is more pronounced: a single problem may seriously affect the entire core Daojia business.

  • First, many businesses connect to Daojia: Meituan's internal businesses such as food delivery, flash sales, and medicine, as well as many external corporate customers.

  • Second, there are many relevant parties involved in the system, including C-end users, merchants, delivery riders and various platforms.

  • Third, the business is built on a microservice architecture, the calling relationships between services are complex, and the core links are very long. The business also relies heavily on configuration. Once a change occurs in one link, all related parties are affected.

Therefore, for R&D and testing, it becomes crucial to understand and avoid the quality risks introduced by changes.

8c7da5b5a68a9c7bc2dae2fb5907606a.png

So, with respect to change risks, where should quality work focus? Our analysis of historical online problems found that a high proportion of failures were caused by changes inside the system, and that those changes were directly or indirectly related to code changes. We therefore began to organize our quality work around code change as the core change factor.

We then considered two key questions:

  1. Can code change risks be visualized, so that testing and R&D become more aware of them?

  2. Can a quality assurance defense system be built around code change risks?

Through analysis, we found that the code feature tree in the figure below gives a clearer view of what a code change involves. Each leaf node identifies one relevant feature, and a corresponding quality defense strategy can be defined for it.

4d9f6117fcf43903104096e4e22dfe87.png

2 Code change risk visualization (Houyi) system construction

The traditional testing model has two typical problems: first, it lacks comprehensive visual analysis of the code written by R&D; second, judging the impact scope of a code change relies largely on the experience of R&D and QA personnel. To address these problems, we planned our quality assurance work in three stages:

  • The first stage is the ability to perceive code changes, focusing on how to cover as many code changes as possible across different code forms and different project models;

  • The second stage is code characterization: extracting different features from all changed code and labeling the code with them;

  • The third stage builds an ecosystem of application scenarios on top of the first two capabilities. Code change risk characterization can be embedded at different stages of the testing pipeline to provide continuous quality defense.

933c040cb3f53b201cfab9be79fc4553.png

The end result of this plan is a code visual analysis system, which we named the Houyi system within Daojia. As the name suggests, we hope the system can assess and intercept quality risks accurately and strengthen quality defense, just like "Hou Yi shooting the suns". The system architecture is divided into four layers:

  • The first layer is the basic component layer.

  • The second layer is the code analysis layer, which focuses on achieving accurate code change analysis and identification for different code forms.

  • The third layer is the characterization layer, whose core job is the structural annotation and feature annotation of the code.

  • The fourth layer is the business application layer, which embeds code change visualization into different scenarios and stages to build a complete quality risk interception capability.

In addition, we will introduce intelligent techniques to strengthen the whole system. We have also exposed the core capabilities through open APIs to other quality and efficiency tool platforms within Meituan.

671f19464f5a5795d107ff0583bd9190.png

Overall, the key process of the Houyi visualization system starts from two entrances at the application layer: the Houyi main site and the project delivery platform. An asynchronous message triggers an analysis task, which downloads the source code and derives the changed files, changed methods, and changed line numbers. Bytecode parsing is then used to resolve the affected call links, which are stored in a graph database. Next, the changed code is marked with features. Finally, a visual analysis report is generated and handed directly to the responsible QA.
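The first analysis step, deriving changed files and line numbers and mapping them to methods, can be sketched as follows. This is a minimal illustration in Python, not Houyi's implementation: it assumes the change is available as a unified diff, and the `MethodSpan` record, the function names, and the sample file path are all hypothetical. In a real system the method line ranges would come from a parser.

```python
import re
from dataclasses import dataclass

# Unified-diff hunk header, e.g. "@@ -10,3 +10,4 @@".
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


@dataclass(frozen=True)
class MethodSpan:
    """Hypothetical record: a method and the source lines it occupies."""
    name: str
    start: int
    end: int


def changed_lines(diff_text: str) -> dict[str, set[int]]:
    """Return {file path: new-side line numbers that were added or modified}."""
    result: dict[str, set[int]] = {}
    current = None
    old_left = new_left = 0   # lines still expected in the current hunk body
    new_line = 0              # next line number on the new side
    for line in diff_text.splitlines():
        if old_left > 0 or new_left > 0:      # inside a hunk body
            if line.startswith("\\"):         # "\ No newline at end of file"
                continue
            if line.startswith("+"):
                if current is not None:
                    result[current].add(new_line)
                new_line += 1
                new_left -= 1
            elif line.startswith("-"):
                old_left -= 1
            else:                             # context line
                new_line += 1
                old_left -= 1
                new_left -= 1
            continue
        if line.startswith("+++ "):
            path = line[4:].strip()
            if path == "/dev/null":           # file was deleted
                current = None
            else:
                current = path[2:] if path.startswith("b/") else path
                result.setdefault(current, set())
            continue
        match = HUNK_RE.match(line)
        if match:
            old_left = int(match.group(2)) if match.group(2) is not None else 1
            new_line = int(match.group(3))
            new_left = int(match.group(4)) if match.group(4) is not None else 1
    return result


def changed_methods(lines: set[int], spans: list[MethodSpan]) -> list[str]:
    """Names of methods whose line range contains at least one changed line."""
    return [s.name for s in spans if any(s.start <= n <= s.end for n in lines)]


SAMPLE_DIFF = "\n".join([
    "--- a/src/OrderService.java",
    "+++ b/src/OrderService.java",
    "@@ -10,3 +10,4 @@",
    " int total = 0;",
    "-total = price;",
    "+total = price * count;",
    "+total -= discount;",
    " return total;",
])
SPANS = [MethodSpan("createOrder", 5, 15), MethodSpan("cancelOrder", 20, 30)]
```

Calling `changed_lines(SAMPLE_DIFF)` gives the changed line numbers per file, and `changed_methods` narrows them to the methods that need attention.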

a39fc3e1a015df8092b47b1f438e448f.png

Of course, we also encountered some technical challenges during the entire construction process.

The first challenge is code analysis technology. Early on, code analysis and identification relied on AST analysis alone, which had trouble identifying Lambda expressions and Java generics. ASM bytecode analysis was then introduced to solve these problems, but ASM in turn cannot recognize calls made through Java reflection. We therefore hope to introduce dynamic code analysis in the future to handle reflection.
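The blind spot that static analysis has for reflection can be illustrated with a small analogue. The sketch below uses Python's standard `ast` module as a stand-in for Java AST analysis, an assumption made only to keep the example short and runnable; the sample functions `pay`, `charge`, and `dispatch` are hypothetical.

```python
import ast

SOURCE = '''
def pay(order):
    return charge(order)

def charge(order):
    return order

def dispatch(name, order):
    handler = globals()[name]   # reflective lookup, resolved only at run time
    return handler(order)
'''


def static_call_graph(source: str) -> dict[str, set[str]]:
    """Map each top-level function to the module functions it calls by name."""
    tree = ast.parse(source)
    defined = {n.name for n in tree.body if isinstance(n, ast.FunctionDef)}
    graph: dict[str, set[str]] = {}
    for node in tree.body:
        if not isinstance(node, ast.FunctionDef):
            continue
        callees = set()
        for inner in ast.walk(node):
            # Only direct calls through a plain name can be resolved statically.
            if isinstance(inner, ast.Call) and isinstance(inner.func, ast.Name):
                if inner.func.id in defined:
                    callees.add(inner.func.id)
        graph[node.name] = callees
    return graph
```

The static graph contains the edge from `pay` to `charge`, but `dispatch` has no resolved callees even though at run time it may call either function. This is the kind of gap that dynamic analysis is meant to close.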

c53a5ef691adf6daef41e0f194734870.png

The second challenge is storing massive amounts of relationship data. At first, storage used a relational database. As the system came into wide use, a large amount of call link topology data accumulated and query performance became very poor. We therefore moved to a graph database, which solved the query performance problem for this relationship data.
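Why a graph model suits call link data can be seen in a small sketch. In a relational edge table, each extra hop in a caller chain typically costs another self-join, whereas an adjacency index follows edges directly. The class below is an in-memory stand-in for a graph database, not the storage Houyi uses, and all method names are hypothetical.

```python
from collections import defaultdict, deque


class CallGraph:
    """Call edges indexed in both directions for direct edge traversal."""

    def __init__(self):
        self._callees = defaultdict(set)   # caller -> methods it calls
        self._callers = defaultdict(set)   # callee -> methods calling it

    def add_call(self, caller: str, callee: str) -> None:
        self._callees[caller].add(callee)
        self._callers[callee].add(caller)

    def upstream(self, method: str, max_depth=None) -> set[str]:
        """All direct and transitive callers of `method`, breadth-first."""
        seen: set[str] = set()
        queue = deque([(method, 0)])
        while queue:
            node, depth = queue.popleft()
            if max_depth is not None and depth >= max_depth:
                continue
            for caller in self._callers.get(node, ()):
                if caller not in seen and caller != method:
                    seen.add(caller)
                    queue.append((caller, depth + 1))
        return seen


graph = CallGraph()
graph.add_call("OrderController.create", "OrderService.create")
graph.add_call("OrderService.create", "StockDao.lock")
graph.add_call("RefundJob.run", "StockDao.lock")
```

`graph.upstream("StockDao.lock")` walks the caller index as many hops as needed, and `max_depth=1` restricts the result to direct callers.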

27c42184f13f5f415f12e525bc653a29.png

The third challenge is the diversity of code risk features, including features strongly tied to our business such as asset loss features, paging features, and multi-threading features. Previously, an identification strategy for each such feature was developed through in-depth communication between the system developers and the business QA, which was inefficient and made for long development cycles.

We therefore built a componentized development framework and opened it to the QA of each business line. They can develop their own custom components and load them into the Houyi system, so that diverse features can be identified quickly.
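One way such a component framework could look is a registry of small detector functions, each contributed independently and applied to every changed method. This is a sketch under assumptions: the `risk_feature` decorator, the registry, and the keyword patterns are hypothetical and far simpler than real identification strategies.

```python
import re
from typing import Callable

# Registry of detector components: feature label -> detector function.
DETECTORS: dict[str, Callable[[str], bool]] = {}


def risk_feature(label: str):
    """Register a detector under a feature label (hypothetical plug-in hook)."""
    def decorator(func: Callable[[str], bool]):
        DETECTORS[label] = func
        return func
    return decorator


@risk_feature("multi-threading")
def uses_threads(method_source: str) -> bool:
    pattern = r"\b(?:ExecutorService|CompletableFuture|new Thread)\b"
    return re.search(pattern, method_source) is not None


@risk_feature("paging")
def uses_paging(method_source: str) -> bool:
    pattern = r"\b(?:pageNo|pageSize|offset|limit)\b"
    return re.search(pattern, method_source, re.IGNORECASE) is not None


@risk_feature("asset-loss")
def touches_money(method_source: str) -> bool:
    # Illustrative keywords only; a business line would supply its own rules.
    pattern = r"amount|refund|coupon|price"
    return re.search(pattern, method_source, re.IGNORECASE) is not None


def label_method(method_source: str) -> list[str]:
    """Run every registered detector and return the matching labels, sorted."""
    return sorted(label for label, detect in DETECTORS.items()
                  if detect(method_source))
```

A business line adds a new feature by writing one decorated function; the core `label_method` loop does not change.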

6d664e186043f3525a61307dbd7bb363.png

3 Code change risk visualization (Houyi) system practice

Next, we focus on how the system is applied in practice. The figure below is a panorama of Houyi's quality assurance application scenarios. The Houyi system currently covers eight core application scenarios:

  1. Calibration diagnosis at technical solution level

  2. Enhancement of capabilities in the Code Review stage

  3. Change impact assessment

  4. Interface level use case recommendations

  5. Configuration change risk diagnosis capabilities

  6. Compatibility risk diagnostic capabilities

  7. Code feature risk warning

  8. Open API capabilities

f5db2096b749fdcc0f7b2727c5b9ef95.png

The first scenario is diagnosing items missing from the technical solution. It addresses two pain points in project testing:

  1. Technical solutions written by R&D are poorly standardized.

  2. Technical solutions are not updated promptly as the project progresses. Yet the technical solution is a key input for QA, so QA often misses key change information when writing test cases. For example, changes to interface definitions, configuration item definitions, scheduled tasks, asynchronous messages, and DB table fields are often not fully reflected in the technical solution.

To address this, the Houyi system identifies from the code what has actually changed, pulls the corresponding technical solution for analysis, and diffs the key change items between the two. This produces a diagnostic report of items missing from the technical solution, which is sent to the responsible QA. QA can use the report to catch the missing items and complete the test cases.
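At its core, this diagnosis is a set difference per change category. The sketch below assumes that both the code-level changes and the items mentioned in the technical solution have already been extracted into simple sets; the category names and sample items are hypothetical.

```python
def missing_from_solution(code_changes: dict[str, set[str]],
                          documented: dict[str, set[str]]) -> dict[str, list[str]]:
    """Return, per category, the items changed in code but absent from the document."""
    report: dict[str, list[str]] = {}
    for category, items in code_changes.items():
        missing = items - documented.get(category, set())
        if missing:
            report[category] = sorted(missing)
    return report


code_changes = {
    "interface": {"POST /order/create", "POST /order/cancel"},
    "config": {"order.gray.ratio"},
    "db_field": {"orders.refund_amount"},
}
documented = {
    "interface": {"POST /order/create"},
    "config": {"order.gray.ratio"},
}
```

Here `missing_from_solution(code_changes, documented)` reports the undocumented cancel interface and the new DB field, which is the information QA would otherwise miss when writing test cases.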

c5a5f1d9944f19b9ac2659bd54c1d697.png

The second application scenario is an enhanced Code Review model. Traditional Code Review has several pain points:

  1. During review, R&D focuses mostly on coding standards and the soundness of the architectural design.

  2. Conventional Code Review is based on plain text, which has two drawbacks: first, reviewers cannot quickly jump to the upstream and downstream of a changed method; second, the business logic of the call link that a changed method belongs to cannot be presented.

To address these pain points, the Houyi system created a new Code Review model driven by changed-link scenarios. Its core goal is to surface quality risks earlier, in the Code Review stage. When a change file is reviewed in the Houyi system, the changed methods and changed variables in that file are extracted quickly, along with their risk features, so that QA and RD can quickly grasp the key information about the change.

On top of this, we provide fast jumps to the upstream and downstream of a changed method, which helps reviewers understand the upstream and downstream relationships of the whole business. During navigation, the visited logic points are drawn into a call topology diagram in real time, so reviewers can see the business logic relationships between changed methods and evaluate the impact of the change from the perspective of the entire link. This resolves the pain points of the Code Review stage.

1219767e83b26449be5333ff32abf7a6.png

The third application scenario is change impact scope assessment. The Houyi system currently offers six capabilities in this area:

  1. Basic properties of the code, such as which methods have changed;

  2. Support for different project types such as HTTP, RPC, and JAR packages;

  3. Identification of common risk characteristics, such as recursive compatibility issues;

  4. Identification of customized risk characteristics, such as permission and algorithm characteristics;

  5. Single-service impact, including method call links, interface characteristics, message characteristics, and task characteristics;

  6. Cross-end service impact: combining the call links within a service with the links between change points across different microservices gives a more accurate assessment of the impact scope.

8877b114437f08334d9a50bc6f6ea5db.png

Example of impact assessment:

  1. When an interface is affected by a change, which changed methods lie downstream of it? We can see directly which added and modified methods the interface reaches.

  2. What does the call topology between those downstream changed methods look like? The link topology diagram quickly draws the call links between all changed methods under the interface.

  3. For a given changed method, the same capability can be used to quickly evaluate and characterize the call link details of all methods in the interface.

  4. What do the call relationships look like for all methods under an interface? We also provide a full method link view.
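The first question above, which changed methods sit downstream of an interface, amounts to a reachability query on the call graph. The following sketch assumes the call graph is available as a plain dictionary from caller to callees; the interface and method names are hypothetical.

```python
def reachable(graph: dict[str, list[str]], start: str) -> set[str]:
    """All methods reachable downstream of `start` (cycles are handled)."""
    seen: set[str] = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for callee in graph.get(node, ()):
            if callee not in seen:
                seen.add(callee)
                stack.append(callee)
    return seen


def affected_interfaces(graph: dict[str, list[str]],
                        entry_points: list[str],
                        changed: set[str]) -> dict[str, list[str]]:
    """Map each interface to the changed methods on its downstream call links."""
    impact: dict[str, list[str]] = {}
    for entry in entry_points:
        hit = (reachable(graph, entry) | {entry}) & changed
        if hit:
            impact[entry] = sorted(hit)
    return impact


call_graph = {
    "POST /order": ["OrderService.create"],
    "OrderService.create": ["PriceCalc.total", "StockDao.lock"],
    "GET /menu": ["MenuService.list"],
    "MenuService.list": [],
}
```

With `PriceCalc.total` as the only changed method, only `POST /order` is reported as affected, so `GET /menu` can be left out of the regression scope.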

6e589613b642b8f149f43d7345d3c64e.png

The fourth application scenario is risk diagnosis of configuration changes. Large, complex businesses often depend heavily on configuration, such as grayscale configurations, downgrade configurations, and configuration items that control internal logic. These have a large impact on the whole system, yet QA and R&D often lack control over configuration risks because they regard code as the main focus of quality assurance. As a result, online problems caused by configuration are common and their consequences are relatively serious.

To address these pain points, the Houyi system builds its configuration change risk capabilities at three levels:

  1. At the configuration change identification layer, we accurately identify configurations of various types and whether each one is new or changed.

  2. Through Houyi's impact assessment capability, we can obtain the interfaces and links affected by the configuration, as well as the asynchronous messages and scheduled tasks affected by the configuration.

  3. With the impact assessment in hand, testing can cover the change scenarios more fully, which lets us measure test coverage of configuration changes. Because test results were hard to observe during configuration testing, we built real-time configuration test coverage on top of traffic mining and traffic recording. At the pipeline level, a gate on key configuration changes helps prevent problems caused by configuration changes.
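The coverage measurement and the pipeline gate in the third point can be sketched as a comparison between the changed configuration keys and the keys actually exercised in recorded test traffic. This is a simplified assumption about how such a metric might be computed; the key names and the `release_gate` function are hypothetical.

```python
def config_coverage(changed_keys: set[str],
                    observed_keys: set[str]) -> tuple[float, list[str]]:
    """Return (coverage ratio, uncovered keys) for the changed config keys."""
    if not changed_keys:
        return 1.0, []
    covered = changed_keys & observed_keys
    uncovered = sorted(changed_keys - observed_keys)
    return len(covered) / len(changed_keys), uncovered


def release_gate(changed_keys: set[str], observed_keys: set[str],
                 threshold: float = 1.0) -> bool:
    """Pass the pipeline gate only if coverage reaches the threshold."""
    ratio, _ = config_coverage(changed_keys, observed_keys)
    return ratio >= threshold


changed = {"order.gray.ratio", "order.downgrade.switch", "order.retry.max"}
observed = {"order.gray.ratio", "order.retry.max", "menu.cache.ttl"}
```

In this sample two of the three changed keys were exercised, so the gate with the default threshold blocks the release and the report names `order.downgrade.switch` as the key still to be tested.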

The picture below is the application function page we are currently building:

5fcb1f692599660a481da0474232498e.png

The fifth application scenario is risk diagnosis of server-side compatibility. Summarizing online issues, we found that compatibility problems between old and new versions account for a high proportion, and we set out to address them with the Houyi system. Simple compatibility issues can be identified by QA directly: for example, obvious field additions or type changes in an interface's input parameters or return value are clearly flagged.

However, some compatibility issues come from less obvious changes. One type is a constraint change on input parameters, such as an optional field becoming required, which is hard to perceive at the code definition level. The other type is a parameter assembled through internal, indirect references to VO classes; it is hard for QA to perceive the compatibility impact when such an indirectly referenced VO class changes.

The Houyi system therefore built capabilities based on reflection and serialization to quickly determine which interfaces are affected by a change to an underlying VO class and to issue a compatibility warning for those interfaces. QA then analyzes the report further, evaluates compatibility, and arranges a reasonable release order.
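The idea of tracing an indirect VO class change back to the interfaces that use it can be sketched with a simplified schema model. The sketch assumes class definitions have already been extracted into dictionaries (in Java this information would come from reflection or bytecode) and covers request parameters only; the `Field` record and all class and interface names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    type: str               # a primitive type name or another VO class name
    required: bool = False


def diff_class(old: dict[str, Field], new: dict[str, Field]) -> list[str]:
    """List changes in one class that may break existing callers."""
    issues = []
    for name, field in new.items():
        before = old.get(name)
        if before is None:
            if field.required:
                issues.append(f"required field added: {name}")
        elif before.type != field.type:
            issues.append(f"type changed: {name} {before.type} -> {field.type}")
        elif field.required and not before.required:
            issues.append(f"optional became required: {name}")
    for name in old:
        if name not in new:
            issues.append(f"field removed: {name}")
    return issues


def classes_used(schema: dict[str, dict[str, Field]], root: str) -> set[str]:
    """The root class plus every VO class it references, directly or indirectly."""
    seen, stack = {root}, [root]
    while stack:
        for field in schema.get(stack.pop(), {}).values():
            if field.type in schema and field.type not in seen:
                seen.add(field.type)
                stack.append(field.type)
    return seen


def compatibility_warnings(old, new, interfaces: dict[str, str]) -> dict[str, list[str]]:
    """Map each interface to the breaking changes found in the classes it uses."""
    changed = {cls: diff_class(old.get(cls, {}), fields) for cls, fields in new.items()}
    warnings = {}
    for api, root in interfaces.items():
        hits = [f"{cls}: {issue}"
                for cls in sorted(classes_used(new, root))
                for issue in changed.get(cls, [])]
        if hits:
            warnings[api] = hits
    return warnings


OLD = {
    "OrderRequest": {"address": Field("AddressVO", True)},
    "AddressVO": {"city": Field("String", True), "zip": Field("String")},
}
NEW = {
    "OrderRequest": {"address": Field("AddressVO", True)},
    "AddressVO": {"city": Field("String", True), "zip": Field("String", True)},
}
INTERFACES = {"POST /order": "OrderRequest", "GET /menu": "MenuRequest"}
```

`OrderRequest` itself is unchanged, yet `compatibility_warnings(OLD, NEW, INTERFACES)` flags `POST /order` because the nested `AddressVO` made `zip` required.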

f71e266495c9590b6c4aa92c103f7cb7.png

The sixth application scenario is interface-level automated use case recommendation. For a complex business, how to use the many automated use cases we have accumulated, whether to run a full regression or a selected subset, is a significant pain point. Building on its ability to identify changes and analyze the affected links and interfaces, the Houyi system connects to the historical traffic coverage of the Daojia intelligent code coverage platform to quickly obtain the changed interfaces and methods. It then retrieves the corresponding automated use cases from the integrated automation platform and recommends them precisely, forming an overall solution for use case recommendation.
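Once the affected interfaces are known, the recommendation step is essentially a lookup from interface to associated use cases. The sketch below assumes such an index already exists; the function name, interface names, and case identifiers are hypothetical.

```python
def recommend_cases(changed_interfaces: set[str],
                    cases_by_interface: dict[str, list[str]]) -> tuple[list[str], list[str]]:
    """Return (recommended case ids, changed interfaces with no automated case)."""
    recommended: set[str] = set()
    uncovered: list[str] = []
    for api in sorted(changed_interfaces):
        cases = cases_by_interface.get(api, [])
        if cases:
            recommended.update(cases)
        else:
            uncovered.append(api)
    return sorted(recommended), uncovered


CASES = {
    "POST /order": ["case-101", "case-102"],
    "POST /order/cancel": ["case-102", "case-201"],
}
```

Reporting the interfaces that have no automated case is a design choice of this sketch: it tells QA where manual testing or new automation is still needed.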

b92f96b881b9da89e04ab4c9d42c74b2.png

The seventh application scenario is early warning of code quality risk features. In quality work we often meet special scenarios that need dedicated verification, such as asset loss scenarios. Besides functional verification, asset loss also requires verification of its specific risk scenarios, exception scenarios, special paging logic, and retry scenarios. When reading the code, however, these risk features are hard to spot, so the special scenarios are easily left unverified.

In response, the Houyi system built feature risk identification in two ways: first, the system maintains its own general risk feature identification strategies; second, each business party can create its own risk feature identification strategies.

The lower part of the figure below shows the feature recognition capabilities that are available now and those planned. With the core identification capabilities in place, we integrate the tooling for each feature with the upstream and downstream platforms used in verification and recommend specific testing strategies for different features. Together, these form a systematic quality assurance plan based on code quality risk features.

cc85c483253bcd555c19f9bf5d93c2c1.png

The eighth application scenario is enablement through the Open API. We want the information analyzed and identified by the Houyi system to be exposed through the Open API so that related tool platforms can use it. The capability is currently open to six core efficiency tool platforms in Daojia: the interface management platform, the intelligent code coverage platform, the holographic system, the anomaly testing platform, the independent engineering delivery system, and the integrated automation platform.

b25d64dba436f7087df62c656860c6b2.png

4 Future planning and prospects

Based on our practice and the experience summarized so far, we will continue our quality assurance work in four directions:

  1. Code analysis technology: we hope to improve overall analysis accuracy through dynamic link analysis.

  2. Risk feature identification: we hope to use large language models to analyze risk features and recommend corresponding testing strategies.

  3. Application scenario ecosystem: we will explore integrating Houyi's capabilities into more testing stages and scenarios, and apply them to the quality work of specific businesses.

  4. In the long term, we hope to share the core capabilities of the system with the wider testing field through open source.

3ccc0e309dfbf189074eea03161870c1.png

5 Q & A

Q1: How long does a single analysis report take? Are the reports of the same project code independent or related?

A: Currently, an analysis report can be generated in 1-2 minutes. The report is a service-level report generated per iteration task. For example, if a project iteration involves code changes in 5 projects, the relationships among those 5 changes are analyzed as a whole and reflected in the report.

Q2: How is the topological relationship of the link realized?

A: The link topology is derived using AST and ASM analysis, supplemented by the online Mtrace link relationships.

Q3: Can the correlation between service call links between multiple modules of microservices be identified? What if the HTTP interface subsequently calls multiple RPC service interfaces?

A: Yes, correlations between service call links across multiple microservice modules can be identified. For example, the RPC interfaces called downstream of an HTTP interface can currently be identified.

Q4: There are many call chains for the underlying methods. Is there any recommended strategy for this? Does DB change scan Mapper?

A: Currently, call links are marked with risk features, for example whether a link contains asset loss features or configuration features. These labels help users perceive the risk level of a link and provide key input for recommending subsequent testing strategies. DB changes are currently identified by scanning Mappers, and the scan is based on the MyBatis DB development framework.
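As a rough illustration of Mapper scanning, the sketch below compares two versions of a MyBatis Mapper XML file and reports the statements whose SQL changed, together with the tables they mention. It is an assumption about how such a scan might work, not Houyi's implementation: table names are extracted with a naive regular expression, and the sample mapper is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

STATEMENT_TAGS = {"select", "insert", "update", "delete"}
# Naive table-name extraction; real SQL would need a proper parser.
TABLE_RE = re.compile(r"\b(?:from|join|into|update)\s+([A-Za-z_][A-Za-z0-9_]*)",
                      re.IGNORECASE)


def statements(mapper_xml: str) -> dict[str, str]:
    """Return {namespace.id: whitespace-normalized SQL} for one Mapper file."""
    root = ET.fromstring(mapper_xml)
    namespace = root.get("namespace", "")
    result = {}
    for element in root:
        if element.tag in STATEMENT_TAGS:
            # itertext() also collects SQL nested in dynamic tags such as <if>.
            sql = " ".join("".join(element.itertext()).split())
            result[f"{namespace}.{element.get('id')}"] = sql
    return result


def changed_statements(old_xml: str, new_xml: str) -> dict[str, list[str]]:
    """Map each new or modified statement to the tables its SQL mentions."""
    old, new = statements(old_xml), statements(new_xml)
    report = {}
    for stmt_id, sql in new.items():
        if old.get(stmt_id) != sql:
            report[stmt_id] = sorted({t.lower() for t in TABLE_RE.findall(sql)})
    return report


OLD_XML = (
    '<mapper namespace="OrderMapper">'
    '<select id="findById">SELECT id, status FROM orders WHERE id = #{id}</select>'
    '<update id="cancel">UPDATE orders SET status = 0 WHERE id = #{id}</update>'
    '</mapper>'
)
NEW_XML = OLD_XML.replace("SELECT id, status FROM", "SELECT id, status, amount FROM")
```

Only `OrderMapper.findById` is reported for the sample, since the `cancel` statement is identical in both versions.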

Q5: Regarding use case recommendation, do you recommend cases for single interfaces or combine them into scenario-level cases?

A: Currently, use cases are recommended per single interface. For example, if a changed interface is associated with 10 automated use cases, those 10 use cases are recommended.

Q6: How long does it take to analyze a project?

A: It currently takes 1-2 minutes.

Q7: What are the main benefits in practical applications?

A: The benefits come from the eight application scenarios, each of which improves quality and efficiency. For example, many compatibility issues have been intercepted; in configuration change risk assessment, the system has intercepted multiple problems caused by incorrectly coded configuration default values and by inconsistencies between online and offline configurations.

Q8: Will this platform affect the availability of online services?

A: No, the platform currently does not affect the availability of online services. When analyzing the online environment, the core operation is pulling the deployed JAR package, which does not affect the availability of the running services.

Q9: Does this system have a moat? What are the benefits of this system? Does it help the business? How is it reflected in business?

A: As for a "moat", the underlying change analysis relies mainly on the common AST and ASM technologies. The core difficulty lies in supporting the many ways business code is written and in accurately recognizing business scenario features; both dimensions pose real technical challenges.

The main benefits are assessing the business impact scope of changes and effectively intercepting quality risks. So far the system has intercepted many bugs caused by inaccurate assessment of interface impact scope, bugs related to configuration changes, compatibility bugs, and so on.

In terms of helping the business: first, the eight application scenarios improve overall quality and efficiency; second, the system makes the business easier to understand. For example, visualizing the call links downstream of an interface lets you quickly understand the business link relationships.

Q10: The link topology is too large. How to solve it? Different services use different languages. Are there any recommendations for bytecode technology?

A: When the link topology is too large, link aggregation can be used to group nodes and improve the visualization. Meituan uses a Java technology stack, so the focus has been on Java recognition capabilities. Analysis of services in other languages is something we will keep watching and address in the future.
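One possible reading of link aggregation is to collapse method-level edges into class-level edges, so the diagram shows fewer nodes with a call count on each edge. The answer does not say how Houyi aggregates links, so the grouping rule and the names below are assumptions.

```python
from collections import Counter


def aggregate_by_class(method_edges: list[tuple[str, str]]) -> dict[tuple[str, str], int]:
    """Collapse 'Class.method' edges into class-level edges with call counts."""
    counts: Counter = Counter()
    for caller, callee in method_edges:
        src = caller.rsplit(".", 1)[0]
        dst = callee.rsplit(".", 1)[0]
        if src != dst:                 # drop calls inside the same class
            counts[(src, dst)] += 1
    return dict(counts)


EDGES = [
    ("OrderService.create", "StockDao.lock"),
    ("OrderService.cancel", "StockDao.unlock"),
    ("OrderService.create", "OrderService.validate"),
]
```

The three method-level edges reduce to a single `OrderService` to `StockDao` edge with a count of 2; the same function could group by package by changing how the prefix is derived.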

Q11: Can the impact on upstream and downstream services be analyzed?

A: Yes.

Q12: When analyzing code changes, how to identify valid changes and invalid changes?

A: For example, a changed piece of code may have no caller. We call this embedded code. Such code is hard to verify through business scenarios during testing, and it may introduce risks after it goes live in the future. Currently, embedded code risks can be identified effectively through link change analysis.
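Detecting changed code with no caller can be sketched as a check against the call edges: a changed method that is never a callee and is not itself an entry point is a candidate. The function and sample names are hypothetical, and, as noted for static analysis in general, calls made only through reflection would be missed by this check.

```python
def uncalled_changes(call_edges: list[tuple[str, str]],
                     changed: set[str],
                     entry_points: set[str]) -> list[str]:
    """Changed methods that nothing calls and that are not entry points."""
    called = {callee for _, callee in call_edges}
    return sorted(m for m in changed
                  if m not in called and m not in entry_points)


EDGES = [("POST /order", "OrderService.create")]
CHANGED = {"OrderService.create", "CouponService.applyV2"}
ENTRY_POINTS = {"POST /order"}
```

For the sample data, `CouponService.applyV2` is flagged because it changed but has no caller yet.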

Q13: At which stage is the system used? The entry stage or the exit stage?

A: Currently, the system can be used in both the entry and exit stages, and there are no strict restrictions.

Q14: Is there much noise interference in the identified problems?

A: We have made many strategy optimizations to improve the recognition rate. For example, the HTTP/RPC interface recognition rate now exceeds 98%, so noise is well controlled.

6 Authors of this article

Guilai comes from the Meituan Daojia R&D platform.

----------  END  ----------


Origin blog.csdn.net/MeituanTech/article/details/133153149